Durability is the D in the ACID properties of transactions in the context of RDBMS. Durability is the guarantee that data has been physically recorded to permanent storage (such as a hard disk), preventing any loss of data in the case of a sudden power outage or a hardware failure. In this sense, RDBMSare heavy IO-bound applications, so it’s necessary to apply some techniques to improve performance while making the data durable.
In this article, I would like to explain the flushing mechanisms that InnoDB uses to persist data, so that you can choose the best for your own situations, in order to improve overall system performance.
I will explain the "write/read paths” in the InnoDB engine; starting from the buffer pools of MySQL, passing through the Linux OS buffer caches to the final permanent storage.
Finally, I'll explain the options InnoDB gives to us to select a flushing mechanism for IO optimization purposes.
IO Access Mechanisms in Linux
Read/Write system calls
When an application (MySQL, or other specific application running in the user space) executes a read system call, page cache (write-back cacheimplemented in the OS kernel inside the system space) is looked at first. If the data is in page cache, then it’s copied out into the buffers (memory area) in the application address space. Otherwise, it’s loaded from persistent storage (disk) into page cache for further accesses, as well as copied out into the application space.
When an application executes a write system call, the data moves from the buffers in the application address space into page cache. and the underlying page (every piece of data lives inside a logical box called a page) gets marked as a dirty page. In order to synchronize dirty pages with the storage, a background process called write-back flushes them to the storage and evicts them from the page cache some time afterward.
Mmap system call
Using this mechanism, the data file is mapped into the process address space (MySQL process) using the mmap system call. Read/write operations are performed by directly accessing the address space. In this way, an extra step is eliminated while accessing the data. So, there is no need for intermediate buffers in the user space because every buffer cache is in the system space implemented as page cache by the kernel.
If the data is in page cache, the kernel is bypassed and read operations are performed at memory speed. If the data is not in page cache, a page-fault is issued and the kernel looks for the data for that page and loads the data in page cache to be accessible to the application.
It’s very common for database engines to use this mechanism to access data files.
Asynchronous IO (AIO) is a mechanism that prevents the calling thread from blocking. The application schedules the asynchronous operations using the io_submit system call, but it’s not blocked. So, the IO operation and the application logic can run in parallel. A separate io_getevents system call is used to wait for and get the data as part of a completed IO operation.
Sync/fsync/fdatasync/msync system calls
By default, the write system call returns after all data has been copied from the user space into page cache in the system space. There is no guarantee that data has actually reached storage.
To support this scenario, the sync system call is used to be sure the data is actually transferred from page cache into storage.
The sync system call allows a process to flush all buffers to disk while the fsync system call allows a process to flush the buffers specific to an open file.
Both system calls return to the calling process only when the data has reached permanent storage and in the case of a hardware error, then it’s reported. RDBMS use sync/fsync in order to commit changes in storage to make it durable.
The fdatasync system call is similar to fsync, except that the file metadata may not be updated unless the metadata is needed to access the data. For example, the last modified time won’t be updated, but it will make sure that all the blocks of the file can be found.
The msync system call is used to flush modified data into storage when a file is mapped into memory using the mmap system call. Without use of this call there is no guarantee that changes are written back before munmap is called.
O_DIRECT is a flag passed when a file is opened. It instructs to bypass page cache and perform any IO operations directly against storage.
So, the buffers in the application space are flushed directly to disk, without copying the data into page cache first and waiting for the kernel to schedule write-back operations. Also, the disk controller copies the data directly into user space, bypassing the kernel as well.
Note that storage devices have a physical sector size and I/O operations must be performed in multiples of this block size (typically 4K). This is called block alignment. This is some complexity in the access implementation but this also provides an advantage: The kernel will usually over-align to a 4K boundary even when a 512-byte boundary suffices, but applications can align 512-byte reads for small reads, improving the performance.
Usually database systems (such as MySQL) use this mechanism to avoid the kernel caching mechanism and to implement their own specific caching layer and custom IO scheduling in order to have fine-grained control of the access pattern.
O_SYNC is a flag passed when a file is opened. In this scenario, the write system call transfers data to page cache, but it’s blocked until the data is actually transferred from page cache to physical storage. There is no need to call the sync system call after the write system call.
O_SYNC + O_DIRECT flags
When a file is opened with both O_SYNC + O_DIRECT flags, any write operation is guaranteed to be durable.
InnoDB flushing mechanism
The parameter innodb_flush_method allows tuning the IO scheduling. We have the following options:
- Empty: This is the default value, and is equivalent to using the FSYNC option (see below).
- FSYNC: Data and log files are opened with no options, and the fsyncsystem call is used when the engine requires flushing the data and log files. This option causes double buffering:
- page cache
- InnoDB buffer pools.
- O_DSYNC: O_SYNC: This flag is used to open the log files while the data files are opened with no options. The fsync system call is executed to flush data files only. O_SYNC doesn’t disable double buffering caching at system level and all writes are synchronous.
- O_DIRECT: Data files are opened with the O_DIRECT flag. Log files are opened with no options. It usesthe fsync system call to flush the log files to storage, ensuring no double buffering on the data files, because all read and write operations go directly to disk.
- O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync system call afterwards. This can provide some performance benefits if you are using a file system that does not require the fsync to store metadata.
- ALL_O_DIRECT: Both data and log files are opened with the O_DIRECT flag. It uses the fsync system call to flush the data files. This is an extra option in Percona Server.
In this article, I've explained concepts pertaining to the read/write path in Unix-based OS and their direct effect on the performance of RDBMS, in particular InnoDB engine in MySQL.
With these concepts cleared in mind, we can tune the IO scheduling of InnoDB in order to positively impact the performance of the overall system.
Now you can start tuning your InnoDB environment to speed up your own database applications.