Succinct data structure

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. The concept was originally introduced by Jacobson [1] to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures can be queried and operated on in place, without decompressing them first. A related notion is that of a compressed data structure, in which the size of the stored or encoded data likewise depends upon the specific content of the data itself.

Suppose that Z is the information-theoretical optimal number of bits needed to store some data. A representation of this data is called:

- implicit if it takes Z + O(1) bits of space,
- succinct if it takes Z + o(Z) bits of space, and
- compact if it takes O(Z) bits of space.

For example, a data structure that uses 2Z bits of storage is compact, Z + sqrt(Z) bits is succinct, Z + lg Z bits is also succinct, and Z + 3 bits is implicit.

Implicit structures are thus usually reduced to storing information using some permutation of the input data; the most well-known example of this is the heap.
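The binary heap illustrates this well: apart from a permutation of the input keys in an array, only O(1) extra words are needed, because parent and child links are pure index arithmetic. The following is a minimal sketch of an implicit binary min-heap (standard textbook operations, not tied to any particular library):

```python
# Minimal sketch of an implicit data structure: a binary min-heap stored
# as a plain array. Beyond the permutation of the keys themselves, only
# O(1) extra words are used -- parent/child links are index arithmetic.

def parent(i): return (i - 1) // 2
def left(i):   return 2 * i + 1
def right(i):  return 2 * i + 2

def heap_push(a, key):
    """Append key and sift it up to restore the heap order."""
    a.append(key)
    i = len(a) - 1
    while i > 0 and a[parent(i)] > a[i]:
        a[i], a[parent(i)] = a[parent(i)], a[i]
        i = parent(i)

def heap_pop_min(a):
    """Remove and return the smallest key, sifting the root down."""
    a[0], a[-1] = a[-1], a[0]
    smallest = a.pop()
    i = 0
    while True:
        m = i
        for c in (left(i), right(i)):
            if c < len(a) and a[c] < a[m]:
                m = c
        if m == i:
            return smallest
        a[i], a[m] = a[m], a[i]
        i = m
```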

Succinct indexable dictionaries

Succinct indexable dictionaries, also called rank/select dictionaries, form the basis of a number of succinct representation techniques, including binary trees, k-ary trees and multisets, [2] as well as suffix trees and arrays. [3] The basic problem is to store a subset S of a universe U = {0, ..., n − 1}, usually represented as a bit array B[0, ..., n − 1] where B[i] = 1 iff i ∈ S. An indexable dictionary supports the usual methods on dictionaries (queries, and insertions/deletions in the dynamic case) as well as the following operations:

rank_q(x) = |{ k ∈ [0, x] : B[k] = q }|
select_q(x) = min{ k : rank_q(k) = x }

for q ∈ {0, 1}.

In other words, rank_q(x) returns the number of elements equal to q up to position x, while select_q(x) returns the position of the x-th occurrence of q.
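The two operations can be stated directly in code. The naive versions below take linear time per query, whereas the structures described in this section answer them in constant time:

```python
# Naive reference implementation of rank/select on a bit array B,
# matching the definitions above. Both run in O(n) time; the succinct
# structures in this section achieve O(1) with o(n) extra bits.

def rank(B, q, x):
    """Number of positions k <= x with B[k] == q."""
    return sum(1 for k in range(x + 1) if B[k] == q)

def select(B, q, x):
    """Position of the x-th occurrence of q (1-indexed)."""
    seen = 0
    for k, b in enumerate(B):
        if b == q:
            seen += 1
            if seen == x:
                return k
    raise IndexError("fewer than x occurrences of q")
```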

There is a simple representation [4] which uses n + o(n) bits of storage space (the original bit array and an o(n) auxiliary structure) and supports rank and select in constant time. It uses an idea similar to that for range-minimum queries; there are a constant number of recursions before stopping at a subproblem of a limited size. The bit array is partitioned into large blocks of size l = lg^2 n bits and small blocks of size s = (lg n)/2 bits. For each large block, the rank of its first bit is stored in a separate table R_l; each such entry takes lg n bits, for a total of O(n/lg n) bits of storage. Within a large block, another directory R_s stores the rank of each of the l/s = 2 lg n small blocks it contains. The difference here is that it only needs lg l = 2 lg lg n bits for each entry, since only the differences from the rank of the first bit in the containing large block need to be stored. Thus, this table takes a total of O(n lg lg n / lg n) bits. A lookup table R_p can then be used that stores the answer to every possible rank query on a bit string of length (lg n)/2; this requires O(2^((lg n)/2) · lg n · lg lg n) = O(sqrt(n) lg n lg lg n) bits of storage space. Thus, since each of these auxiliary tables takes o(n) space, this data structure supports rank queries in O(1) time using n + o(n) bits of space.

To answer a query rank_1(x) in constant time, a constant-time algorithm computes

rank_1(x) = R_l[⌊x/l⌋] + R_s[⌊x/s⌋] + R_p[b, x mod s]

where R_l, R_s and R_p are the large-block, small-block and lookup tables, and b is the bit pattern of the small block containing position x.

In practice, the lookup table can be replaced by bitwise operations and smaller tables that can be used to find the number of bits set in the small blocks. This is often beneficial, since succinct data structures find their uses in large data sets, in which case cache misses become much more frequent and the chances of the lookup table being evicted from closer CPU caches become higher. [5] Select queries can be easily supported by doing a binary search on the same auxiliary structure used for rank; however, this takes O(lg n) time in the worst case. A more complicated structure using o(n) bits of additional storage can be used to support select in constant time. [6] In practice, many of these solutions have hidden constants in the O(·) notation which dominate before any asymptotic advantage becomes apparent; implementations using broadword operations and word-aligned blocks often perform better in practice. [7]
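The two-level directory can be sketched in a few lines. This is a readability-oriented sketch, not a production implementation: the block sizes are small constants rather than lg^2 n and (lg n)/2, the tables hold Python integers rather than packed lg n-bit and lg lg n-bit fields, and a final linear scan of one small block stands in for the R_p lookup table:

```python
# Sketch of a two-level rank directory (after Jacobson [4]), over a
# Python list of 0/1 ints. R_l stores the rank at the start of each
# large block; R_s stores, per small block, the rank relative to the
# start of its containing large block.

class RankDirectory:
    def __init__(self, bits, large=16, small=4):
        assert large % small == 0
        self.bits, self.l, self.s = bits, large, small
        self.R_l = []  # absolute rank at the start of each large block
        self.R_s = []  # relative rank at the start of each small block
        total = within = 0
        for i, b in enumerate(bits):
            if i % large == 0:
                self.R_l.append(total)
                within = 0
            if i % small == 0:
                self.R_s.append(within)
            total += b
            within += b

    def rank1(self, x):
        """Number of 1s in bits[0..x] (inclusive): two table lookups
        plus a scan of one small block (which the lookup table R_p
        would replace with an O(1) table access)."""
        r = self.R_l[x // self.l] + self.R_s[x // self.s]
        start = (x // self.s) * self.s
        return r + sum(self.bits[start:x + 1])
```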

Entropy-compressed solutions

The n + o(n) space approach can be improved by noting that there are C(n, m) = n!/(m!(n − m)!) distinct m-subsets of [n] (or binary strings of length n with exactly m 1s), and thus B(m, n) = ⌈lg C(n, m)⌉ is an information-theoretic lower bound on the number of bits needed to store S. There is a succinct (static) dictionary which attains this bound, namely using B(m, n) + o(B(m, n)) space. [8] This structure can be extended to support rank and select queries and takes B(m, n) + O(n lg lg n / lg n) space. [2] Correct rank queries in this structure are, however, limited to elements contained in the set, analogous to how minimal perfect hashing functions work. This bound can be reduced to a space/time tradeoff by reducing the storage space of the dictionary to B(m, n) + O(n t^t / lg^t n + n^(3/4)) with queries taking O(t) time. [9]

It is also possible to construct an indexable dictionary supporting rank (but not select) that uses fewer than ⌈lg C(n, m)⌉ bits. Such a dictionary is called a monotone minimal perfect hash function, and can be implemented using as few as O(m lg lg lg n) bits. [10] [11]

Succinct hash tables

A succinct hash table, also known as a succinct unordered dictionary, is a data structure that stores n keys from a universe {1, ..., U} using (1 + o(1)) lg C(U, n) bits of space, while supporting membership queries in constant expected time. If a succinct hash table also supports insertions and deletions in constant expected time, then it is referred to as dynamic; otherwise it is referred to as static.

The first dynamic succinct hash table was due to Raman and Rao in 2003. [12] In the case where U = n^O(1), their solution uses (1 + o(1)) lg C(U, n) bits of space. Subsequently, it was shown that this space bound could be improved to lg C(U, n) + O(n lg^(k) n) bits for any constant number k of iterated logarithms, [13] and a little after that this bound was shown to be optimal. [14] [15] The latter solution supports all operations in worst-case constant time with high probability.

The first static succinct hash table was due to Pagh in 1999. [16] [17] In the case where U = n^O(1), their solution uses lg C(U, n) + o(n) bits of space and supports worst-case constant-time queries. This bound was subsequently improved to lg C(U, n) + n/poly(lg n) bits, [18] and then to lg C(U, n) + poly(lg n) bits. [19] Whereas the first two solutions [17] [18] support worst-case constant-time queries, the final one supports constant expected-time queries. [19] The final solution also requires access to a lookup table of size n^ε, but this lookup table is independent of the set of elements being stored. [19]

Other examples

A string with an arbitrary length (a Pascal string) takes Z + log(Z) space, and is thus succinct. If there is a maximum length – which is the case in practice, since 2^32 bytes = 4 GiB of data is a very long string, and 2^64 bytes = 16 EiB of data is larger than any string in practice – then a length-prefixed string is also implicit, taking Z + k space, where k is the number of bits required to represent the maximum length (e.g., 64 bits).

When a sequence of variable-length items (such as strings) needs to be encoded, there are various possibilities. A direct approach is to store a length and an item in each record – these can then be placed one after another. This allows efficient next, but not finding the kth item. An alternative is to place the items in order with a delimiter (e.g., null-terminated strings). This uses a delimiter instead of a length, and is substantially slower, since the entire sequence must be scanned for delimiters. Both of these are space-efficient. An alternative approach is out-of-band separation: the items can simply be placed one after another, with no delimiters. Item bounds can then be stored as a sequence of lengths, or better, offsets into this sequence. Alternatively, a separate binary string consisting of 1s in the positions where an item begins, and 0s everywhere else, is encoded along with it. Given this string, the select function can quickly determine where each item begins, given its index. [20] This is compact but not succinct, as it takes 2Z space, which is O(Z).
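The out-of-band scheme above can be sketched as follows, assuming non-empty items; a naive linear-scan select stands in for a constant-time succinct rank/select structure:

```python
# Sketch of out-of-band separation for variable-length items: items are
# concatenated with no delimiters, and a companion bit string marks
# where each item begins. select(k) on that bit string then locates
# item k. Assumes non-empty items.

def encode(items):
    data = "".join(items)
    bits = [0] * len(data)  # 1 where an item begins, 0 elsewhere
    pos = 0
    for it in items:
        bits[pos] = 1
        pos += len(it)
    return data, bits

def select1(bits, k):
    """Position of the k-th 1 (1-indexed); naive O(n) scan."""
    count = 0
    for i, b in enumerate(bits):
        count += b
        if count == k:
            return i
    raise IndexError("fewer than k items")

def get_item(data, bits, k):
    """Return the k-th item (1-indexed) using two select queries."""
    start = select1(bits, k)
    end = select1(bits, k + 1) if k < sum(bits) else len(data)
    return data[start:end]
```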

Another example is the representation of a binary tree: an arbitrary binary tree on n nodes can be represented in 2n + o(n) bits while supporting a variety of operations on any node, which includes finding its parent, its left and right child, and returning the size of its subtree, each in constant time. The number of different binary trees on n nodes is C(2n, n)/(n + 1). For large n, this is about 4^n; thus we need at least about lg(4^n) = 2n bits to encode it. A succinct binary tree therefore would occupy only 2 + o(1) bits per node.

References

  1. Jacobson, G. J (1988). Succinct static data structures (Ph.D. thesis). Pittsburgh, PA: Carnegie Mellon University.
  2. Raman, R.; V. Raman; S. S. Rao (2002). "Succinct indexable dictionaries with applications to encoding k-ary trees and multisets". Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 233–242. arXiv:0705.0552. CiteSeerX 10.1.1.246.3123. doi:10.1145/1290672.1290680. ISBN 0-89871-513-X.
  3. Sadakane, K.; R. Grossi (2006). "Squeezing succinct data structures into entropy bounds" (PDF). Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm. pp. 1230–1239. ISBN   0-89871-605-5. Archived from the original (PDF) on 2011-09-29.
  4. Jacobson, G. (1 November 1989). Space-efficient static trees and graphs (PDF). 30th IEEE Symposium on Foundations of Computer Science. doi:10.1109/SFCS.1989.63533. Archived from the original (PDF) on 2016-03-12.
  5. González, R.; S. Grabowski; V. Mäkinen; G. Navarro (2005). "Practical implementation of rank and select queries" (PDF). Poster Proceedings Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA). pp. 27–38.
  6. Clark, David (1996). Compact pat trees (PDF) (Ph.D. thesis). University of Waterloo.
  7. Vigna, S. (2008). "Broadword Implementation of Rank/Select Queries" (PDF). Experimental Algorithms. Lecture Notes in Computer Science. Vol. 5038. pp. 154–168. CiteSeerX   10.1.1.649.8950 . doi:10.1007/978-3-540-68552-4_12. ISBN   978-3-540-68548-7. S2CID   13963489.
  8. Brodnik, A.; J. I Munro (1999). "Membership in constant time and almost-minimum space" (PDF). SIAM J. Comput. 28 (5): 1627–1640. CiteSeerX   10.1.1.530.9223 . doi:10.1137/S0097539795294165.
  9. Pătraşcu, M. (2008). "Succincter" (PDF). Foundations of Computer Science, 2008. FOCS'08. IEEE 49th Annual IEEE Symposium on. pp. 305–313.
  10. Belazzougui, Djamal; Boldi, Paolo; Pagh, Rasmus; Vigna, Sebastiano (2009-01-04). "Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses". Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: Society for Industrial and Applied Mathematics: 785–794. doi: 10.1137/1.9781611973068.86 . ISBN   978-0-89871-680-1.
  11. Assadi, Sepehr; Farach-Colton, Martín; Kuszmaul, William (January 2023), "Tight Bounds for Monotone Minimal Perfect Hashing", Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Philadelphia, PA: Society for Industrial and Applied Mathematics, pp. 456–476, arXiv: 2207.10556 , doi:10.1137/1.9781611977554.ch20, ISBN   978-1-61197-755-4 , retrieved 2023-04-28
  12. Raman, Rajeev; Rao, Satti Srinivasa (2003), "Succinct Dynamic Dictionaries and Trees", Automata, Languages and Programming, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 357–368, doi:10.1007/3-540-45061-0_30, ISBN   978-3-540-40493-4 , retrieved 2023-04-28
  13. Bender, Michael A.; Farach-Colton, Martín; Kuszmaul, John; Kuszmaul, William; Liu, Mingmou (2022-06-09). "On the optimal time/Space tradeoff for hash tables". Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing. New York, NY, USA: ACM. pp. 1284–1297. doi:10.1145/3519935.3519969. hdl:1721.1/146419. ISBN   9781450392648. S2CID   240354692.
  14. Li, Tianxiao; Liang, Jingxun; Yu, Huacheng; Zhou, Renfei (2023). "Tight Cell-Probe Lower Bounds for Dynamic Succinct Dictionaries". arXiv:2306.02253.
  15. Nadis, Steve (8 February 2024). "Scientists Find Optimal Balance of Data Storage and Time". Quanta Magazine .
  16. Pagh, Rasmus (1998-01-28). "Low Redundancy in Dictionaries with O(1) Worst Case Lookup Time". BRICS Report Series. 5 (28). doi: 10.7146/brics.v5i28.19434 . ISSN   1601-5355.
  17. Pagh, Rasmus (January 2001). "Low Redundancy in Static Dictionaries with Constant Query Time". SIAM Journal on Computing. 31 (2): 353–363. doi:10.1137/s0097539700369909. ISSN 0097-5397.
  18. Patrascu, Mihai (October 2008). "Succincter". 2008 49th Annual IEEE Symposium on Foundations of Computer Science. IEEE. pp. 305–313. doi:10.1109/focs.2008.83. ISBN 978-0-7695-3436-7. S2CID 257721481.
  19. Yu, Huacheng (2020-06-22). "Nearly optimal static Las Vegas succinct dictionary". Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. New York, NY, USA: ACM. pp. 1389–1401. arXiv:1911.01348. doi:10.1145/3357713.3384274. ISBN 9781450369794. S2CID 207780523.
  20. Belazzougui, Djamal. "Hash, displace, and compress" (PDF).