Bitmap index

Last updated

A bitmap index is a special kind of database index that uses bitmaps.

Contents

Bitmap indexes have traditionally been considered to work well for low-cardinality columns , which have a modest number of distinct values, either absolutely, or relative to the number of records that contain the data. The extreme case of low cardinality is Boolean data (e.g., does a resident in a city have internet access?), which has two values, True and False. Bitmap indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on these bitmaps. Bitmap indexes have a significant space and performance advantage over other structures for query of such data. Their drawback is they are less efficient than the traditional B-tree indexes for columns whose data is frequently updated: consequently, they are more often employed in read-only systems that are specialized for fast query - e.g., data warehouses, and generally unsuitable for online transaction processing applications.

Some researchers argue that bitmap indexes are also useful for moderate or even high-cardinality data (e.g., unique-valued data) which is accessed in a read-only manner, and queries access multiple bitmap-indexed columns using the AND, OR or XOR operators extensively. [1]

Bitmap indexes are also useful in data warehousing applications for joining a large fact table to smaller dimension tables such as those arranged in a star schema.

Example

Continuing the internet access example, a bitmap index may be logically viewed as follows:

IdentifierHasInternetBitmaps
YN
1Yes10
2No01
3No01
4Unspecified00
5Yes10

On the left, Identifier refers to the unique number assigned to each resident, HasInternet is the data to be indexed, the content of the bitmap index is shown as two columns under the heading bitmaps. Each column in the left illustration under the Bitmaps header is a bitmap in the bitmap index. In this case, there are two such bitmaps, one for "has internet" Yes and one for "has internet" No. It is easy to see that each bit in bitmap Y shows whether a particular row refers to a person who has internet access. This is the simplest form of bitmap index. Most columns will have more distinct values. For example, the sales amount is likely to have a much larger number of distinct values. Variations on the bitmap index can effectively index this data as well. We briefly review three such variations.

Note: Many of the references cited here are reviewed at (John Wu (2007)). [2] For those who might be interested in experimenting with some of the ideas mentioned here, many of them are implemented in open source software such as FastBit, [3] the Lemur Bitmap Index C++ Library, [4] the Roaring Bitmap Java library [5] and the Apache Hive Data Warehouse system.

Compression

For historical reasons, bitmap compression and inverted list compression were developed as separate lines of research, and only later were recognized as solving essentially the same problem. [6]

Software can compress each bitmap in a bitmap index to save space. There has been considerable amount of work on this subject. [7] [8] Though there are exceptions such as Roaring bitmaps, [9] Bitmap compression algorithms typically employ run-length encoding, such as the Byte-aligned Bitmap Code, [10] the Word-Aligned Hybrid code, [11] the Partitioned Word-Aligned Hybrid (PWAH) compression, [12] the Position List Word Aligned Hybrid, [13] the Compressed Adaptive Index (COMPAX), [14] Enhanced Word-Aligned Hybrid (EWAH) [15] and the COmpressed 'N' Composable Integer SEt (CONCISE). [16] [17] These compression methods require very little effort to compress and decompress. More importantly, bitmaps compressed with BBC, WAH, COMPAX, PLWAH, EWAH and CONCISE can directly participate in bitwise operations without decompression. This gives them considerable advantages over generic compression techniques such as LZ77. BBC compression and its derivatives are used in a commercial database management system. BBC is effective in both reducing index sizes and maintaining query performance. BBC encodes the bitmaps in bytes, while WAH encodes in words, better matching current CPUs. "On both synthetic data and real application data, the new word aligned schemes use only 50% more space, but perform logical operations on compressed data 12 times faster than BBC." [18] PLWAH bitmaps were reported to take 50% of the storage space consumed by WAH bitmaps and offer up to 20% faster performance on logical operations. [13] Similar considerations can be done for CONCISE [17] and Enhanced Word-Aligned Hybrid. [15]

The performance of schemes such as BBC, WAH, PLWAH, EWAH, COMPAX and CONCISE is dependent on the order of the rows. A simple lexicographical sort can divide the index size by 9 and make indexes several times faster. [19] The larger the table, the more important it is to sort the rows. Reshuffling techniques have also been proposed to achieve the same results of sorting when indexing streaming data. [14]

Encoding

Basic bitmap indexes use one bitmap for each distinct value. It is possible to reduce the number of bitmaps used by using a different encoding method. [20] [21] For example, it is possible to encode C distinct values using log(C) bitmaps with binary encoding. [22]

This reduces the number of bitmaps, further saving space, but to answer any query, most of the bitmaps have to be accessed. This makes it potentially not as effective as scanning a vertical projection of the base data, also known as a materialized view or projection index. Finding the optimal encoding method that balances (arbitrary) query performance, index size and index maintenance remains a challenge.

Without considering compression, Chan and Ioannidis analyzed a class of multi-component encoding methods and came to the conclusion that two-component encoding sits at the kink of the performance vs. index size curve and therefore represents the best trade-off between index size and query performance. [20]

Binning

For high-cardinality columns, it is useful to bin the values, where each bin covers multiple values and build the bitmaps to represent the values in each bin. This approach reduces the number of bitmaps used regardless of encoding method. [23] However, binned indexes can only answer some queries without examining the base data. For example, if a bin covers the range from 0.1 to 0.2, then when the user asks for all values less than 0.15, all rows that fall in the bin are possible hits and have to be checked to verify whether they are actually less than 0.15. The process of checking the base data is known as the candidate check. In most cases, the time used by the candidate check is significantly longer than the time needed to work with the bitmap index. Therefore, binned indexes exhibit irregular performance. They can be very fast for some queries, but much slower if the query does not exactly match a bin.

History

The concept of bitmap index was first introduced by Professor Israel Spiegler and Rafi Maayan in their research "Storage and Retrieval Considerations of Binary Data Bases", published in 1985. [24] The first commercial database product to implement a bitmap index was Computer Corporation of America's Model 204. Patrick O'Neil published a paper about this implementation in 1987. [25] This implementation is a hybrid between the basic bitmap index (without compression) and the list of Row Identifiers (RID-list). Overall, the index is organized as a B+tree. When the column cardinality is low, each leaf node of the B-tree would contain long list of RIDs. In this case, it requires less space to represent the RID-lists as bitmaps. Since each bitmap represents one distinct value, this is the basic bitmap index. As the column cardinality increases, each bitmap becomes sparse and it may take more disk space to store the bitmaps than to store the same content as RID-lists. In this case, it switches to use the RID-lists, which makes it a B+tree index. [26] [27]

In-memory bitmaps

One of the strongest reasons for using bitmap indexes is that the intermediate results produced from them are also bitmaps and can be efficiently reused in further operations to answer more complex queries. Many programming languages support this as a bit array data structure. For example, Java has the BitSet class.

Some database systems that do not offer persistent bitmap indexes use bitmaps internally to speed up query processing. For example, PostgreSQL versions 8.1 and later implement a "bitmap index scan" optimization to speed up arbitrarily complex logical operations between available indexes on a single table.

For tables with many columns, the total number of distinct indexes to satisfy all possible queries (with equality filtering conditions on either of the fields) grows very fast, being defined by this formula:

. [28] [29]

A bitmap index scan combines expressions on different indexes, thus requiring only one index per column to support all possible queries on a table.

Applying this access strategy to B-tree indexes can also combine range queries on multiple columns. In this approach, a temporary in-memory bitmap is created with one bit for each row in the table (1  MB can thus store over 8 million entries). Next, the results from each index are combined into the bitmap using bitwise operations. After all conditions are evaluated, the bitmap contains a "1" for rows that matched the expression. Finally, the bitmap is traversed and matching rows are retrieved. In addition to efficiently combining indexes, this also improves locality of reference of table accesses, because all rows are fetched sequentially from the main table. [30] The internal bitmap is discarded after the query. If there are too many rows in the table to use 1 bit per row, a "lossy" bitmap is created instead, with a single bit per disk page. In this case, the bitmap is just used to determine which pages to fetch; the filter criteria are then applied to all rows in matching pages.

Related Research Articles

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

The Burrows–Wheeler transform rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.

bzip2 File compression software

bzip2 is a free and open-source file compression program that uses the Burrows–Wheeler algorithm. It only compresses single files and is not a file archiver. It relies on separate external utilities for tasks such as handling multiple files, encryption, and archive-splitting.

The BMP file format or bitmap, is a raster graphics image file format used to store bitmap digital images, independently of the display device, especially on Microsoft Windows and OS/2 operating systems.

The move-to-front (MTF) transform is an encoding of data designed to improve the performance of entropy encoding techniques of compression. When efficiently implemented, it is fast enough that its benefits usually justify including it as an extra step in data compression algorithm.

<span class="mw-page-title-main">Z-order curve</span> Mapping function that preserves data point locality

In mathematical analysis and computer science, functions which are Z-order, Lebesgue curve, Morton space-filling curve, Morton order or Morton code map multidimensional data to one dimension while preserving locality of the data points. It is named in France after Henri Lebesgue, who studied it in 1904, and named in the United States after Guy Macdonald Morton, who first applied the order to file sequencing in 1966. The z-value of a point in multidimensions is simply calculated by interleaving the binary representations of its coordinate values. Once the data are sorted into this ordering, any one-dimensional data structure can be used, such as simple one dimensional arrays, binary search trees, B-trees, skip lists or hash tables. The resulting ordering can equivalently be described as the order one would get from a depth-first traversal of a quadtree or octree.

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

<span class="mw-page-title-main">MonetDB</span> Open source column-oriented relational database management system

MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing, data mining, geographic information system (GIS), Resource Description Framework (RDF), text retrieval and sequence alignment processing.

In computer science, an inverted index is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines. Additionally, several significant general-purpose mainframe-based database management systems have used inverted list architectures, including ADABAS, DATACOM/DB, and Model 204.

A column-oriented DBMS or columnar DBMS is a database management system (DBMS) that stores data tables by column rather than by row. Benefits include more efficient access to data when only querying a subset of columns, and more options for data compression. However, they are typically less efficient for inserting new data.

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

In a SQL database query, a correlated subquery is a subquery that uses values from the outer query. Because the subquery may be evaluated once for each row processed by the outer query, it can be slow.

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

SAND Nucleus CDBMS is a column-oriented DBMS software system optimized for business intelligence applications, delivering the data warehousing component, developed by SAND Technology Inc.

Patrick Eugene O'Neil was an American computer scientist, an expert on databases, and a professor of computer science at the University of Massachusetts Boston.

<span class="mw-page-title-main">Color Cell Compression</span> Lossy color image compression algorithm

Color Cell Compression is a lossy image compression algorithm developed by Campbell et al., in 1986, which can be considered an early forerunner of modern texture compression algorithms, such as S3 Texture Compression and Adaptive Scalable Texture Compression. It is closely related to Block Truncation Coding, another lossy image compression algorithm, which predates Color Cell Compression, in that it uses the dominant luminance of a block of pixels to partition said pixels into two representative colors. The primary difference between Block Truncation Coding and Color Cell Compression is that the former was designed to compress grayscale images and the latter was designed to compress color images. Also, Block Truncation Coding requires that the standard deviation of the colors of pixels in a block be computed in order to compress an image, whereas Color Cell Compression does not use the standard deviation. Both algorithms, though, can compress an image down to effectively 2 bits per pixel.

<span class="mw-page-title-main">Log-structured merge-tree</span> Data structure

In computer science, the log-structured merge-tree is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

<span class="mw-page-title-main">Martin L. Kersten</span> Dutch computer scientist (born 1953)

Martin L. Kersten was a computer scientist with research focus on database architectures, query optimization and their use in scientific databases. He was an architect of the MonetDB system, an open-source column store for data warehouses, online analytical processing (OLAP) and geographic information systems (GIS). He has been (co-) founder of several successful spin-offs of the Centrum Wiskunde & Informatica (CWI).

<span class="mw-page-title-main">Apache Pinot</span> Open-source distributed data store

Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. It is suited in contexts where fast analytics, such as aggregations, are needed on immutable data, possibly, with real-time data ingestion. The name Pinot comes from the Pinot grape vines that are pressed into liquid that is used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources.

References

Notes
  1. Bitmap Index vs. B-tree Index: Which and When?, Vivek Sharma, Oracle Technical Network.
  2. John Wu (2007). "Annotated References on Bitmap Index". Archived from the original on 2012-06-30.
  3. FastBit
  4. Lemur Bitmap Index C++ Library
  5. Roaring bitmaps
  6. Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression". 2017. doi: 10.1145/3035918.3064007
  7. T. Johnson (1999). "Performance Measurements of Compressed Bitmap Indices" (PDF). In Malcolm P. Atkinson; Maria E. Orlowska; Patrick Valduriez; Stanley B. Zdonik; Michael L. Brodie (eds.). VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann. pp. 278–89. ISBN   978-1-55860-615-9.
  8. Wu K, Otoo E, Shoshani A (March 5, 2004). "On the performance of bitmap indices for high cardinality attributes" (PDF).
  9. Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. (2016). "Better bitmap performance with Roaring bitmaps". Software: Practice and Experience. 46 (5): 709–719. arXiv: 1402.6407 . doi:10.1002/spe.2325. S2CID   1139669.
  10. Byte aligned data compression
  11. Word aligned bitmap compression method, data structure, and apparatus
  12. van Schaik, Sebastiaan; de Moor, Oege (2011). "A memory efficient reachability data structure through bit vector compression". Proceedings of the 2011 international conference on Management of data. SIGMOD '11. Athens, Greece: ACM. pp. 913–924. doi:10.1145/1989323.1989419. ISBN   978-1-4503-0661-4.
  13. 1 2 Deliège F, Pedersen TB (2010). "Position list word aligned hybrid: optimizing space and performance for compressed bitmaps" (PDF). In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Leger, Felix Naumann, Anastasia Ailamaki, Fatma Ozcan (eds.). EDBT '10, Proceedings of the 13th International Conference on Extending Database Technology. New York, NY, USA: ACM. pp. 228–39. doi:10.1145/1739041.1739071. ISBN   978-1-60558-945-9. S2CID   12234453.
  14. 1 2 F. Fusco; M. Stoecklin; M. Vlachos (September 2010). "NET-FLi: on-the-fly compression, archiving and indexing of streaming network traffic" (PDF). Proc. VLDB Endow. 3 (1–2): 1382–93. doi:10.14778/1920841.1921011. S2CID   787443.
  15. 1 2 Lemire, D.; Kaser, O.; Aouiche, K. (2010). "Sorting improves word-aligned bitmap indexes". Data & Knowledge Engineering. 69: 3–28. arXiv: 0901.3751 . doi:10.1016/j.datak.2009.08.006. S2CID   6297890.
  16. Concise: Compressed 'n' Composable Integer Set Archived May 28, 2011, at the Wayback Machine
  17. 1 2 Colantonio A, Di Pietro R (31 July 2010). "Concise: Compressed 'n' Composable Integer Set" (PDF). Information Processing Letters. 110 (16): 644–50. arXiv: 1004.0403 . doi:10.1016/j.ipl.2010.05.018. S2CID   8092695. Archived from the original (PDF) on 22 July 2011. Retrieved 2 February 2011.
  18. Wu K, Otoo EJ, Shoshani A (2001). "A Performance comparison of bitmap indexes" (PDF). In Henrique Paques, Ling Liu, David Grossman (eds.). CIKM '01 Proceedings of the tenth international conference on Information and knowledge management. New York, NY, USA: ACM. pp. 559–61. doi:10.1145/502585.502689. ISBN   978-1-58113-436-0. S2CID   10974671.
  19. D. Lemire; O. Kaser; K. Aouiche (January 2010). "Sorting improves word-aligned bitmap indexes". Data & Knowledge Engineering. 69 (1): 3–28. arXiv: 0901.3751 . doi:10.1016/j.datak.2009.08.006. S2CID   6297890.
  20. 1 2 C.-Y. Chan; Y. E. Ioannidis (1998). "Bitmap index design and evaluation" (PDF). In Ashutosh Tiwary; Michael Franklin (eds.). Proceedings of the 1998 ACM SIGMOD international conference on Management of data (SIGMOD '98). New York, NY, USA: ACM. pp. 355–6. doi:10.1145/276304.276336. ISBN   0897919955.
  21. C.-Y. Chan; Y. E. Ioannidis (1999). "An efficient bitmap encoding scheme for selection queries" (PDF). Proceedings of the 1999 ACM SIGMOD international conference on Management of data (SIGMOD '99). New York, NY, USA: ACM. pp. 215–26. doi:10.1145/304182.304201. ISBN   1581130848.
  22. P. E. O'Neil; D. Quass (1997). "Improved Query Performance with Variant Indexes". In Joan M. Peckman; Sudha Ram; Michael Franklin (eds.). Proceedings of the 1997 ACM SIGMOD international conference on Management of data (SIGMOD '97). New York, NY, USA: ACM. pp. 38–49. doi:10.1145/253260.253268. ISBN   0897919114.
  23. N. Koudas (2000). "Space efficient bitmap indexing". Proceedings of the ninth international conference on Information and knowledge management (CIKM '00). New York, NY, USA: ACM. pp. 194–201. doi:10.1145/354756.354819. ISBN   978-1581133202. S2CID   7504216.
  24. Spiegler I; Maayan R (1985). "Storage and retrieval considerations of binary data bases". Information Processing and Management. 21 (3): 233–54. doi:10.1016/0306-4573(85)90108-6.
  25. O'Neil, Patrick (1987). "Model 204 Architecture and Performance". In Dieter Gawlick; Mark N. Haynie; Andreas Reuter (eds.). Proceedings of the 2nd International Workshop on High Performance Transaction Systems. London, UK: Springer-Verlag. pp. 40–59.
  26. D. Rinfret; P. O'Neil; E. O'Neil (2001). "Bit-sliced index arithmetic". In Timos Sellis (ed.). Proceedings of the 2001 ACM SIGMOD international conference on Management of data (SIGMOD '01). New York, NY, USA: ACM. pp. 47–57. doi:10.1145/375663.375669. ISBN   1581133324.
  27. E. O'Neil; P. O'Neil; K. Wu (2007). "Bitmap Index Design Choices and Their Performance Implications" (PDF). 11th International Database Engineering and Applications Symposium (IDEAS 2007). pp. 72–84. doi:10.1109/IDEAS.2007.19. ISBN   978-0-7695-2947-9.
  28. Alex Bolenok (2009-05-09). "Creating indexes".
  29. Egor Timoshenko. "On minimal collections of indexes" (PDF).
  30. Tom Lane (2005-12-26). "Re: Bitmap indexes etc". PostgreSQL mailing lists. Retrieved 2007-04-06.
Bibliography