Log-structured merge-tree

Log-structured merge-tree
Log-structured merge-tree
Type	Hybrid (two tree-like components)
Invented	1996
Invented by	Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil
Operation
Time complexity in big O notation
Operation	Average
Insert	O(T * L / B)
Find-min	O(1)
Space complexity
Space	O((T + 1) / T)

Last updated December 16, 2024

In computer science, the log-structured merge-tree (also known as LSM tree, or LSMT^[1]) is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

One simple version of the LSM tree is a two-level LSM tree.^[2] As described by Patrick O'Neil, a two-level LSM tree comprises two tree-like structures, called C₀ and C₁. C₀ is smaller and entirely resident in memory, whereas C₁ is resident on disk. New records are inserted into the memory-resident C₀ component. If the insertion causes the C₀ component to exceed a certain size threshold, a contiguous segment of entries is removed from C₀ and merged into C₁ on disk. The performance characteristics of LSM trees stem from the fact that each component is tuned to the characteristics of its underlying storage medium, and that data is efficiently migrated across media in rolling batches, using an algorithm reminiscent of merge sort. Such tuning involves writing data in a sequential manner as opposed to as a series of separate random access requests. This optimization reduces seek time in hard-disk drives (HDDs) and latency in solid-state drives (SSDs).

Most LSM trees used in practice employ multiple levels. Level 0 is kept in main memory, and might be represented using a tree. The on-disk data is organized into sorted runs of data. Each run contains data sorted by the index key. A run can be represented on disk as a single file, or alternatively as a collection of files with non-overlapping key ranges. To perform a query on a particular key to get its associated value, one must search in the Level 0 tree and also each run. The Stepped-Merge version of the LSM tree^[3] is a variant of the LSM tree that supports multiple levels with multiple tree structures at each level.

A particular key may appear in several runs, and what that means for a query depends on the application. Some applications simply want the newest key-value pair with a given key. Some applications must combine the values in some way to get the proper aggregate value to return. For example, in Apache Cassandra, each value represents a row in a database, and different versions of the row may have different sets of columns.^[4]

In order to keep down the cost of queries, the system must avoid a situation where there are too many runs.

Extensions to the 'leveled' method to incorporate B+ tree structures have been suggested, for example bLSM^[5] and Diff-Index.^[6] LSM-tree was originally designed for write-intensive workloads. As increasingly more read and write workloads co-exist under an LSM-tree storage structure, read data accesses can experience high latency and low throughput due to frequent invalidations of cached data in buffer caches by LSM-tree compaction operations. To re-enable effective buffer caching for fast data accesses, a Log-Structured buffered-Merged tree (LSbM-tree) is proposed and implemented.^[7]

Operations

Write

In LSM-trees, write operations are designed to optimize performance by reducing random I/O and leveraging sequential disk writes. When a write operation is initiated, the data is first buffered in an in-memory component, often implemented using a sorted data structure such as a Skip list or B+ tree.

Once the in-memory buffer becomes full, it is flushed to the disk as an immutable sorted component at the first level (C1). This flush is performed sequentially, ensuring high I/O efficiency by avoiding the costly random writes typical of traditional indexing methods. To maintain durability, the system may use a write-ahead log (WAL) that records all incoming writes before they are added to the memory buffer. This ensures that no data is lost in the event of a crash during a write.

As data accumulates across levels on the disk, LSM trees employ a merge process similar to merge sort to consolidate entries and maintain a consistent sorted order across levels. This process also handles updates and deletes, removing redundant or obsolete entries. Updates are treated as new writes, while deletes are marked with a tombstone entry, which is a placeholder indicating that the key has been deleted. These tombstones are later removed during the merging process.

Two common merging policies govern how data flows through the levels: levelling and tiering. In levelling, only one component exists per level, and merging happens more frequently, reducing the total number of components but increasing write amplification. In tiering, multiple components can coexist within a level, and merging occurs less frequently, reducing write amplification but increasing read costs because more components need to be searched.

This design leads to amortized write costs of $O\left({T\cdot L \over B}\right)$ for leveling merge policies, where $T$ is the size ratio between levels, $L$ is the number of levels, and $B$ is the number of entries per page.^[8]

Point lookup

A point lookup operation retrieves the value associated with a specific key. In LSM trees, because of the multi-level structure and the immutability of disk components, point lookups involve checking several levels to ensure the most up-to-date value for the key is returned.

The process begins with a search in the in-memory component (C0), which holds the latest data. Since this component is organized as a sorted structure, the search is efficient. If the key is found here, its value is returned right away.

If the key isn’t found in memory, the search moves to the disk components, starting with the first level (C1) and continuing through deeper levels (C2, C3, etc.). Each disk level is also sorted, allowing for efficient searches using methods like binary search or tree search. Newer levels are checked first because they contain the most recent updates and deletions for the key.

To make the search faster, LSM trees often use a bloom filter for each on-disk component. These filters are probabilistic structures that help quickly determine whether a key is definitely absent from a component. Before accessing any disk component, the system checks the Bloom filter. If the filter indicates the key is not present, the component is skipped, saving time. If the filter suggests the key might be there, the system proceeds to search for the component.

The lookup operation stops as soon as the key is found, ensuring the most recent version is returned. If the search reaches the final level without finding the key, the system concludes that the key doesn’t exist.

Point lookup complexity is $O(L)$ without Bloom filters, as each level must be searched. With Bloom filters, the cost for zero-result lookups is significantly reduced to $O\left(L\cdot e^{-{\frac {M}{N}}}\right)$ , where M is the total Bloom filter size and N is the number of keys. For existing key lookups, the cost is $O(1)$ due to the presence of Bloom filters.^[8]

Range query

A range query retrieves all key-value pairs within a specified range. The operation involves scanning multiple components across the LSM tree's hierarchical structure and consolidating entries to ensure accurate and complete results.

A range query begins by searching the in-memory component (C0). Since this component is typically a sorted data structure, range queries can be done efficiently. Once the in-memory component is finished, the query proceeds to the disk components, starting from the first level (C1) and continuing to deeper levels.

Each disk component is also stored as a sorted structure allowing efficient range scans within individual components. To perform the range query, the system locates the starting point in each relevant component and scans sequentially until the end of the range is reached. The results from each component are then merged into a priority queue to reconcile duplicates, updates, and deletes, ensuring the final result only includes the latest version of each key.

For efficiency, LSM trees often divide disk components into smaller, disjoint key ranges. In this way, when processing a range query, the system can search only the partitions that have overlap ranges to reduce the number of components accessed. Similar to point lookup, Bloom filters are sometimes used to quickly decide whether a disk component contains any keys within the queried range, allowing the system to skip components that are guaranteed to be irrelevant.

The performance of a range query in LSM-trees depends on the query's selectivity. For short-range queries, which access fewer keys than twice the number of levels, the query must examine components across all levels, leading to a cost proportional to the number of levels. For long-range queries, which access many keys, the price is dominated by scanning the largest level, as it contains most of the data. For this reason, the runtime for sort-range queries is $O(L)$ and $O\left({\frac {s}{B}}\right)$ for long-range queries, where $s$ is the number of keys in the range.^[8]

Related Research Articles

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

In computer science, a trie, also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store their associated key. Instead, each node's position within the trie determines its associated key, with the connections between nodes defined by individual characters rather than the entire key.

<span class="mw-page-title-main">Treap</span> Random search tree data structure

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height small in the face of arbitrary item insertions and deletions. These operations when designed for a self-balancing binary search tree, contain precautionary measures against boundlessly increasing tree height, so that these abstract data structures receive the attribute "self-balancing".

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed ; the more items added, the larger the probability of false positives.

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjan's 1986 article.

In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate longer strings or entire texts. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device and instead they must reside in the slower external memory, usually a disk drive. Thus, external sorting algorithms are external memory algorithms and thus applicable in the external memory model of computation.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

Database tables and indexes may be stored on disk in one of a number of forms, including ordered/unordered flat files, ISAM, heap files, hash buckets, or B+ trees. Each form has its own particular advantages and disadvantages. The most commonly used forms are B-trees and ISAM. Such forms or structures are one aspect of the overall schema used by a database engine to store information.

Apache Cassandra is a free and open-source database management system designed to handle large volumes of data across multiple commodity servers. The system prioritizes availability and scalability over consistency, making it particularly suited for systems with high write throughput requirements due to its LSM tree indexing storage layer. As a wide-column database, Cassandra supports flexible schemas and efficiently handles data models with numerous sparse columns. The system is optimized for applications with well-defined data access patterns that can be incorporated into the schema design. Cassandra supports computer clusters which may span multiple data centers, featuring asynchronous and masterless replication. It enables low-latency operations for all clients and incorporates Amazon's Dynamo distributed storage and replication techniques, combined with Google's Bigtable data storage engine model.

A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set. There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and "optionally delete". The more elements are added to the set, the larger the probability of false positives.

In computer science, a fractal tree index is a tree data structure that keeps data sorted and allows searches and sequential access in the same time as a B-tree but with insertions and deletions that are asymptotically faster than a B-tree. Like a B-tree, a fractal tree index is a generalization of a binary search tree in that a node can have more than two children. Furthermore, unlike a B-tree, a fractal tree index has buffers at each node, which allow insertions, deletions and other changes to be stored in intermediate locations. The goal of the buffers is to schedule disk writes so that each write performs a large amount of useful work, thereby avoiding the worst-case performance of B-trees, in which each disk write may change a small amount of data on disk. Like a B-tree, fractal tree indexes are optimized for systems that read and write large blocks of data. The fractal tree index has been commercialized in databases by Tokutek. Originally, it was implemented as a cache-oblivious lookahead array, but the current implementation is an extension of the B^ε tree. The B^ε is related to the Buffered Repository Tree. The Buffered Repository Tree has degree 2, whereas the B^ε tree has degree B^ε. The fractal tree index has also been used in a prototype filesystem. An open source implementation of the fractal tree index is available, which demonstrates the implementation details outlined below.

Bitcask is an Erlang application that provides an API for storing and retrieving key/value data into a log-structured hash table. The design owes a lot to the principles found in log-structured file systems and draws inspiration from a number of designs that involve log file merging.

Approximate membership query filters comprise a group of space-efficient probabilistic data structures that support approximate membership queries. An approximate membership query answers whether an element is in a set or not with a false positive rate of $.$

The PH-tree is a tree data structure used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is space partitioning index with a structure similar to that of a quadtree or octree. However, unlike quadtrees, it uses a splitting policy based on tries and similar to Crit bit trees that is based on the bit-representation of the keys. The bit-based splitting policy, when combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerated trees and the need for rebalancing.

References

↑ Zhang, Weitao; Xu, Yinlong; Li, Yongkun; Li, Dinglong (December 2016). "Improving Write Performance of LSMT-Based Key-Value Store". 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS). pp. 553–560. doi:10.1109/ICPADS.2016.0079. ISBN 978-1-5090-4457-3. S2CID 13611447.
↑ O’Neil, Patrick; Cheng, Edward; Gawlick, Dieter; O’Neil, Elizabeth (1996-06-01). "The log-structured merge-tree (LSM-tree)" (PDF). Acta Informatica. 33 (4): 351–385. doi:10.1007/s002360050048. ISSN 1432-0525. S2CID 12627452.
↑ Jagadish, H.V.; Narayan, P.P.S.; Seshadri, S.; Sudarshan, S.; Kanneganti, Rama (1997). "Incremental Organization for Data Recording and Warehousing" (PDF). Proceedings of the VLDB Conference. VLDB Foundation: 16–25.
↑ "Leveled Compaction in Apache Cassandra : DataStax". February 13, 2014. Archived from the original on February 13, 2014.
↑ Sears, Russell; Ramakrishnan, Raghu (2012-05-20). "BLSM". Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD '12. New York, NY, USA: Association for Computing Machinery. pp. 217–228. doi:10.1145/2213836.2213862. ISBN 978-1-4503-1247-9. S2CID 207194816.
↑ Tan, Wei; Tata, Sandeep; Tang, Yuzhe; Fong, Liana (2014), Diff-Index: Differentiated Index in Distributed Log-Structured Data Stores (PDF), OpenProceedings.org, doi:10.5441/002/edbt.2014.76 , retrieved 2022-05-22
↑ Dejun Teng; Lei Guo; Rubao Lee; Feng Chen; Yanfeng Zhang; Siyuan Ma; Xiaodong Zhang (2018). "A Low-cost Disk Solution Enabling LSM-tree to Achieve High Performance for Mixed Read/Write Workloads". ACM Transactions on Storage. pp. 1–26. doi:10.1145/3162615.
1 2 3 Luo, Chen; Carey, Michael J. (July 2019). "LSM-based storage techniques: a survey". The VLDB Journal. 29: 393–418. arXiv: 1812.07527 . doi:10.1007/s00778-019-00555-y. S2CID 56178614.

General

O'Neil, Patrick E.; Cheng, Edward; Gawlick, Dieter; O'Neil, Elizabeth (June 1996). "The log-structured merge-tree (LSM-tree)". Acta Informatica. 33 (4): 351–385. CiteSeerX 10.1.1.44.2782 . doi:10.1007/s002360050048. S2CID 12627452.
Li, Yinan; He, Bingsheng; Luo, Qiong; Yi, Ke (2009). "Tree Indexing on Flash Disks". 2009 IEEE 25th International Conference on Data Engineering. pp. 1303–6. CiteSeerX 10.1.1.144.6961 . doi:10.1109/ICDE.2009.226. ISBN 978-1-4244-3422-0. S2CID 2343303.

External links

An Overview of Log Structured Merge Trees

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Zhang, Weitao; Xu, Yinlong; Li, Yongkun; Li, Dinglong (December 2016). "Improving Write Performance of LSMT-Based Key-Value Store". 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS). pp. 553–560. doi:10.1109/ICPADS.2016.0079. ISBN 978-1-5090-4457-3. S2CID 13611447.

[2] O’Neil, Patrick; Cheng, Edward; Gawlick, Dieter; O’Neil, Elizabeth (1996-06-01). "The log-structured merge-tree (LSM-tree)" (PDF). Acta Informatica. 33 (4): 351–385. doi:10.1007/s002360050048. ISSN 1432-0525. S2CID 12627452.

[3] Jagadish, H.V.; Narayan, P.P.S.; Seshadri, S.; Sudarshan, S.; Kanneganti, Rama (1997). "Incremental Organization for Data Recording and Warehousing" (PDF). Proceedings of the VLDB Conference. VLDB Foundation: 16–25.

[4] "Leveled Compaction in Apache Cassandra : DataStax". February 13, 2014. Archived from the original on February 13, 2014.

[5] Sears, Russell; Ramakrishnan, Raghu (2012-05-20). "BLSM". Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. SIGMOD '12. New York, NY, USA: Association for Computing Machinery. pp. 217–228. doi:10.1145/2213836.2213862. ISBN 978-1-4503-1247-9. S2CID 207194816.

[6] Tan, Wei; Tata, Sandeep; Tang, Yuzhe; Fong, Liana (2014), Diff-Index: Differentiated Index in Distributed Log-Structured Data Stores (PDF), OpenProceedings.org, doi:10.5441/002/edbt.2014.76 , retrieved 2022-05-22

[7] Dejun Teng; Lei Guo; Rubao Lee; Feng Chen; Yanfeng Zhang; Siyuan Ma; Xiaodong Zhang (2018). "A Low-cost Disk Solution Enabling LSM-tree to Achieve High Performance for Mixed Read/Write Workloads". ACM Transactions on Storage. pp. 1–26. doi:10.1145/3162615.

[:0-8] 1 2 3 Luo, Chen; Carey, Michael J. (July 2019). "LSM-based storage techniques: a survey". The VLDB Journal. 29: 393–418. arXiv: 1812.07527 . doi:10.1007/s00778-019-00555-y. S2CID 56178614.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

v t e Tree data structures
Search trees (dynamic sets/associative arrays)	2–3 2–3–4 AA (a,b) AVL B B+ B* B^x (Optimal) Binary search Dancing HTree Interval Order statistic Palindrome (Left-leaning) Red–black Scapegoat Splay T Treap UB Weight-balanced
Heaps	Binary Binomial Brodal d-ary Fibonacci Leftist Pairing Skew binomial Skew van Emde Boas Weak
Tries	Ctrie C-trie (compressed ADT) Hash Radix Suffix Ternary search X-fast Y-fast
Spatial data partitioning trees	Ball BK BSP Cartesian Hilbert R k-d (implicit k-d) M Metric MVP Octree PH Priority R Quad R R+ R* Segment VP X
Other trees	Cover Exponential Fenwick Finger Fractal tree index Fusion Hash calendar iDistance K-ary Left-child right-sibling Link/cut Log-structured merge Merkle PQ Range SPQR Top