An LSM-tree engine for MySQL

Introduction

When we're facing a write-intensive workload where it's required to provide the best possible write throughput (writes per seconds), we need to think out-of-the-box and select a persistent data structure different than the traditional ones.

The persistent data structure in InnoDB is the B+Tree on the primary key. In a few words, it means that the data is physically stored as B+Tree. It’s well-known when we face a very high volume of inserts or random updates on a B+Tree (a heavy-write workload; for example, an event/tracking recording system), then there might be a trend to system degradation; because there is an overhead on the index maintenance while inserting rows in order to keep the tree as balanced as possible.

Log-structured merge tree (or LSM-tree) is a data structure with performance characteristics very attractive on high-insertion scenarios. It’s based on the principle that the most efficient operation on persistent storage (SSD, hard disks) is sequential access, so the strategy to improve the performance on write-heavy workloads is to append data similar to traditional logs, thus eliminating any type of slow random access.

LSM-tree has now matured and is being used in several database engines, such as HBase, Cassandra, LevelDB, SQLite; even MongoDB 3.0 comes with an optional LSM engine.

Facebook has also developed a database engine based on the LSM-tree called RocksDB. And MyRocks is a new storage engine in MySQL based on RocksDB in order to take advantages of LSM-tree data structures while still using the mature features of MySQL such as replication, backups, etc.

In this article, I'd like to introduce the concepts and design of log-structured merge trees and in particular how MyRocks is implemented in MySQL.

LSM-tree data structure

LSM-tree is basically a key/value data structure. It’s an immutable data structure that allows only appending data to files, avoiding any update-in-place operations, thus turning any costly random access into sequential access.

Let’s talk about the implementation of LSM-trees. LSM is based on Sorted String Table (SSTable), which is an immutable map/dictionary containing a set of sorted key/values where both keys and values are strings.

SSTable has two parts: index and data files.

The data files contain a set of the sorted key/values concatenated one after another, enabling fast sequential range scans.

The index files contain keys and offsets, where each the offset references the underlying record in the data file, enabling fast lookup operations. Notice that this index can be implemented using any known data structure as a B+Tree.

The drawback in SSTables is related to the need to rewrite the entire data file when an insert, update or delete operation occurs in order to keep the immutability property. This is where LSM Trees appears on the scene to improve the write path while preserving the fast read path that SSTable gives us.

In LSM-Trees, the write operations are performed first against a mutable in-memory data structure called MemTable. When the MemTable reaches a certain size, then the MemTable is flushed to disk, creating a new SSTable.

The read path requires checking the MemTable first, then searching in the SSTables and finally merging the results together in order to get the most recent data. Notice that each row in the SSTable has a timestamp to identify the most recent change.

Serving a read request can be slow due to the merging step, where in the worst case the system needs to hit several files in the disk or the data files grow a lot over time. In order to mitigate this problem there are two techniques: a bloom filter and a compaction process.

Bloom filters can eliminate unnecessary reads for queries of keys that don't exist. They reduce the number of files to be checked during a lookup query but they can't be used on pure range queries.

The compaction process runs to merge the SSTables and remove the older data.

It’s noteworthy that in order to support durability and failover, the writes are also performed against the write-ahead log (WAL).

MyRocks database engine

Many database systems, such as RocksDB and Cassandra, are using LSM-trees. In particular, RocksDB is an embedded database, written in C++, and widely used within Facebook.

Facebook also created MyRocks, a storage engine for MySQL, based on RocksDB. The idea is to use the nice features of LSM-tree data structure combined with MySQL mature features such as amenability to automation, replication, transactions, optimistic/MVCC locking mechanisms, logical backup using mysqldump, etc.

The key advantages of MyRocks are:

Better compression and storage efficiency compared to other MySQL database engines, because the page size is not fixed. For example, when a page is compressed to 5KB, it will use only 5KB of storage, compared with 8KB with InnoDB.
Random operation reduction, because the access pattern is sequential. This allows us to use almost the full I/O capacity of the storage.

However, the MyRocks engine has some disadvantages, such as:

Not as many features as InnoDB. Key missing features are text, geo, native partitioning, online DDL, and XA recovery on the master.
LSM-trees might use more CPU per query because more data structures can be checked to satisfy a query. It might also increase the number of CPU cache misses.
LSM-trees suffer from the range read penalty because they have to check several files to get data for a range query. Most LSM-trees, including MyRocks, use bloom filters to reduce the number of files to be checked during a point query.

There are some public online performance testing/benchmarks at:

Installing Percona MyRocks

If you want to use the MyRocks engine in MySQL, one option is to use the Percona MySQL distribution. Percona MyRocks is distributed as a separate package that can be enabled as a plugin for Percona Server 5.7 and later versions.

To install the engine in Ubuntu/Debian distribution, run the command as shown in the listing01

$ sudo apt-get install percona-server-rocksdb-5.7

Listing 01

In order to enable the storage engine, we need to run the ps-admin script as system root as shown in the listing 02.

$ sudo ps-admin --enable-rocksdb -u root -p

Listing 02

When MyRocks is successfully enabled on the server, we can verify it from the MySQL console as shown in the listing 03.

mysql> SHOW ENGINES;
+--------------------+---------+----------------------------------------------------------------------------+--------------+------+------------+
| Engine             | Support | Comment                                                                    | Transactions | XA   | Savepoints |
+--------------------+---------+----------------------------------------------------------------------------+--------------+------+------------+
| ROCKSDB            | YES     | RocksDB storage engine                                                     | YES          | YES  | YES        |
  InnoDB             | DEFAULT | Percona-XtraDB, Supports transactions, row-level locking, and foreign keys | YES          | YES  | YES        |
+--------------------+---------+----------------------------------------------------------------------------+--------------+------+------------+

Listing 03

Notice in the listing 03 that the RocksDB storage engine is not set as default. To make the RocksDB storage engine default, set the default-storage-engine=rocksdb variable in the [mysqld] section of the my.cnf configuration file and restart the server.

Creating a table

In order to create a table using the RocksDB storage engine, run the following SQL statement as shown in the listing 04.

mysql> CREATE DATABASE db_lsm;
mysql> USE db_lsm;
 
mysql> CREATE TABLE `lsm_table` (
  `id1` bigint(20) unsigned NOT NULL DEFAULT '0',
  `id1_type` int(10) unsigned NOT NULL DEFAULT '0',
  `id2` bigint(20) unsigned NOT NULL DEFAULT '0',
  `id2_type` int(10) unsigned NOT NULL DEFAULT '0',
  `payload` varchar(255) NOT NULL DEFAULT '',
  `unix_time` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id1`,`id2`) COMMENT 'pk_cf',
  KEY `ndx_id1` (`id1`,`id1_type`,`unix_time`,`payload`) COMMENT 'rev:ndx_id1'
) ENGINE=RocksDB DEFAULT COLLATE=latin1_bin;

Listing 04

When we create a table using the RocksDB storage engine, we need to consider:

Data is stored per index basis. RocksDB internally allocates a column family to store indexes. A column family is key/value data structure where the key contains the primary key and the value contains the remaining columns. By default, all data is stored in the default column family. We can change the column family by setting an index comment. In the former example, primary key is stored in the pk_cf column family, while the ndx_id1secondary index is stored in the rev:cf_ndx_id1 column family.

Notice that by setting rev: before the column family name, the index is mainly created and used for a descending scan (ORDER BY .. DESC). This feature is called Reverse Column Family.

The statement ENGINE=RocksDB creates a RocksDB table
PRIMARY KEY is a clustered key (similar to InnoDB). It helps with fast lookup and range scans.
KEY/INDEX COMMENT is used to specify a column family. Each index belongs to one column family. Multiple indexes can belong to the same column family
Fulltext, foreign key, and tablespaces features are not supported

RocksDB data is saved in the /var/lib/mysql/.rocksdb directory in the MySQL data directory. We can find several files such as:

data files -> *.sst
WAL files -> *.log
Manifiest files
Options files

We can also change to another directory by setting rocksdb_datadir and rocksdb_wal_dir parameters in the MySQL configuration file.

Finally, it’s noteable that we can check the status of the RocksDB storage engine and global variables as shown in listing 05.

mysql> SHOW ENGINE rocksdb STATUS;
mysql> SHOW GLOBAL STATUS LIKE 'rocksdb%';

Listing 05

Conclusion

In this article, I've explained the concepts regarding LSM-tree, a new data structure that ‘rocks’ when there is a high volume of inserts or random updates because it takes advantage of sequential access to the storage.

So, if you need to implement an event-tracking system or similar in your own environment, and you need to have nice throughput as well, now you can use the new MySQL engine called MyRocks, which implements an LSM-tree persistent data structure.

Introduction

LSM-tree data structure

MyRocks database engine

Installing Percona MyRocks

Conclusion

About the Author

Juan Carlos Olamendy

About the Author

Start the discussion at forums.toadworld.com