Nov 15, 2017 11:40:47 AM by Juan Carlos Olamendy
When we're facing a write-intensive workload where it's required to provide the best possible write throughput (writes per seconds), we need to think out-of-the-box and select a persistent data structure different than the traditional ones.
The persistent data structure in InnoDB is the B+Tree on the primary key. In a few words, it means that the data is physically stored as B+Tree. It's well-known when we face a very high volume of inserts or random updates on a B+Tree (a heavy-write workload; for example, an event/tracking recording system), then there might be a trend to system degradation; because there is an overhead on the index maintenance while inserting rows in order to keep the tree as balanced as possible.
Log-structured merge tree (or LSM-tree) is a data structure with performance characteristics very attractive on high-insertion scenarios. It's based on the principle that the most efficient operation on persistent storage (SSD, hard disks) is sequential access, so the strategy to improve the performance on write-heavy workloads is to append data similar to traditional logs, thus eliminating any type of slow random access.
LSM-tree has now matured and is being used in several database engines, such as HBase, Cassandra, LevelDB, SQLite; even MongoDB 3.0 comes with an optional LSM engine.
Facebook has also developed a database engine based on the LSM-tree called RocksDB. And MyRocks is a new storage engine in MySQL based on RocksDB in order to take advantages of LSM-tree data structures while still using the mature features of MySQL such as replication, backups, etc.
In this article, I'd like to introduce the concepts and design of log-structured merge trees and in particular how MyRocks is implemented in MySQL.
LSM-tree is basically a key/value data structure. It’s an immutable data structure that allows only appending data to files, avoiding any update-in-place operations, thus turning any costly random access into sequential access.
Let’s talk about the implementation of LSM-trees. LSM is based on Sorted String Table (SSTable), which is an immutable map/dictionary containing a set of sorted key/values where both keys and values are strings.
SSTable has two parts: index and data files.
The data files contain a set of the sorted key/values concatenated one after another, enabling fast sequential range scans.
The index files contain keys and offsets, where each the offset references the underlying record in the data file, enabling fast lookup operations. Notice that this index can be implemented using any known data structure as a B+Tree.
The drawback in SSTables is related to the need to rewrite the entire data file when an insert, update or delete operation occurs in order to keep the immutability property. This is where LSM Trees appears on the scene to improve the write path while preserving the fast read path that SSTable gives us.
In LSM-Trees, the write operations are performed first against a mutable in-memory data structure called MemTable. When the MemTable reaches a certain size, then the MemTable is flushed to disk, creating a new SSTable.
The read path requires checking the MemTable first, then searching in the SSTables and finally merging the results together in order to get the most recent data. Notice that each row in the SSTable has a timestamp to identify the most recent change.
Serving a read request can be slow due to the merging step, where in the worst case the system needs to hit several files in the disk or the data files grow a lot over time. In order to mitigate this problem there are two techniques: a bloom filter and a compaction process.
Bloom filters can eliminate unnecessary reads for queries of keys that don't exist. They reduce the number of files to be checked during a lookup query but they can't be used on pure range queries.
The compaction process runs to merge the SSTables and remove the older data.
It’s noteworthy that in order to support durability and failover, the writes are also performed against the write-ahead log (WAL).
Many database systems, such as RocksDB and Cassandra, are using LSM-trees. In particular, RocksDB is an embedded database, written in C++, and widely used within Facebook.
Facebook also created MyRocks, a storage engine for MySQL, based on RocksDB. The idea is to use the nice features of LSM-tree data structure combined with MySQL mature features such as amenability to automation, replication, transactions, optimistic/MVCC locking mechanisms, logical backup using mysqldump, etc.
The key advantages of MyRocks are:
However, the MyRocks engine has some disadvantages, such as:
There are some public online performance testing/benchmarks at:
If you want to use the MyRocks engine in MySQL, one option is to use the Percona MySQL distribution. Percona MyRocks is distributed as a separate package that can be enabled as a plugin for Percona Server 5.7 and later versions.
To install the engine in Ubuntu/Debian distribution, run the command as shown in the listing01
$ sudo apt-get install percona-server-rocksdb-5.7
In order to enable the storage engine, we need to run the ps-admin script as system root as shown in the listing 02.
$ sudo ps-admin --enable-rocksdb -u root -p
When MyRocks is successfully enabled on the server, we can verify it from the MySQL console as shown in the listing 03.
mysql> SHOW ENGINES;
Notice in the listing 03 that the RocksDB storage engine is not set as default. To make the RocksDB storage engine default, set the default-storage-engine=rocksdb variable in the [mysqld] section of the my.cnf configuration file and restart the server.
Creating a table
In order to create a table using the RocksDB storage engine, run the following SQL statement as shown in the listing 04.
mysql> CREATE DATABASE db_lsm;
When we create a table using the RocksDB storage engine, we need to consider:
Notice that by setting rev: before the column family name, the index is mainly created and used for a descending scan (ORDER BY .. DESC). This feature is called Reverse Column Family.
RocksDB data is saved in the /var/lib/mysql/.rocksdb directory in the MySQL data directory. We can find several files such as:
We can also change to another directory by setting rocksdb_datadir and rocksdb_wal_dir parameters in the MySQL configuration file.
Finally, it’s noteable that we can check the status of the RocksDB storage engine and global variables as shown in listing 05.
mysql> SHOW ENGINE rocksdb STATUS;
In this article, I've explained the concepts regarding LSM-tree, a new data structure that ‘rocks’ when there is a high volume of inserts or random updates because it takes advantage of sequential access to the storage.
So, if you need to implement an event-tracking system or similar in your own environment, and you need to have nice throughput as well, now you can use the new MySQL engine called MyRocks, which implements an LSM-tree persistent data structure.
Written by Juan Carlos Olamendy
CEO and Founder of Nubisera Enterprise Software Architect Expert. Advisor. Entrepreneur. Oracle ACE. Microsoft MVP. External technical consultant for Microsoft, Oracle, HP and Dell. Prolific blogger on all subjects related to software technology and entrepreneurship. Graduate degree in Computer Science Masters of Science in Business Informatics(MBA)