Retrieval Data Structure


In computer science, a retrieval data structure, also known as a static function, is a space-efficient dictionary-like data type composed of a collection of (key, value) pairs that allows the following operations: [1]

- construction from the given collection of (key, value) pairs,
- retrieve(x), which returns the value associated with the key x and may return an arbitrary value if x is not one of the stored keys, and
- optionally, updating the value associated with a key.


They can also be thought of as a function $b \colon \mathcal{U} \to \{0,1\}^r$ for a universe $\mathcal{U}$ and the set of keys $S \subseteq \mathcal{U}$, where retrieve has to return $b(x)$ for any value $x \in S$ and an arbitrary value from $\{0,1\}^r$ otherwise.

In contrast to static functions, AMQ-filters support (probabilistic) membership queries, and dictionaries additionally allow operations like listing keys or looking up the value associated with a key and returning some other symbol if the key is not contained.

As can be derived from the operations, this data structure does not need to store the keys at all and may actually use less space than would be needed for a simple list of the key-value pairs. This makes it attractive in situations where the associated data is small (e.g. a few bits) compared to the keys, because we can save a lot by not spending space on the keys.

To give a simple example, suppose video game names annotated with a boolean indicating whether the game contains a dog that can be petted are given. A static function built from this database can reproduce the associated flag for all names contained in the original set and an arbitrary one for other names. The size of this static function can be made to be only $(1+\varepsilon) n r$ bits for a small $\varepsilon > 0$, where $n$ is the number of keys and $r$ is the number of bits per value, which is much less than any pair-based representation. [1]
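For instance (with illustrative numbers not taken from the source): for $n = 100{,}000$ game names, $r = 1$ bit per flag and $\varepsilon = 0.1$, such a static function occupies about $(1 + 0.1) \cdot 100{,}000 \cdot 1 = 110{,}000$ bits, i.e. roughly 13.4 KiB, no matter how long the game names themselves are.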

Examples

A trivial example of a static function is a sorted list of the keys and values, which implements all the above operations and many more. However, retrieval from a list is slow, and the list implements many unneeded operations whose removal would allow optimizations. Furthermore, a static function is even allowed to return junk if the queried key is not contained, a freedom the sorted list does not exploit.
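As a minimal illustration, the sorted-list variant can be sketched in Python using the standard bisect module (the class name and the fallback value 0 returned for unknown keys are illustrative choices, not part of the source):

    import bisect

    class SortedListRetrieval:
        """Trivial static function: keys and values kept in two parallel sorted lists."""

        def __init__(self, pairs):
            pairs = sorted(pairs)                    # sort once at construction time
            self._keys = [k for k, _ in pairs]
            self._vals = [v for _, v in pairs]

        def retrieve(self, key):
            i = bisect.bisect_left(self._keys, key)  # O(log n) binary search
            if i < len(self._keys) and self._keys[i] == key:
                return self._vals[i]
            return 0                                 # arbitrary value for unknown keys

Unlike a genuine retrieval data structure, this sketch still stores every key in full, which is exactly the cost the constructions below avoid.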

Perfect hash functions

Another simple way to build a static function is using a perfect hash function: after building the PHF for our keys, store the corresponding values at the position the PHF assigns to each key. As can be seen, this approach also allows updating the associated values, but the key set has to be static. The correctness follows from the correctness of the perfect hash function. Using a minimal perfect hash function gives a big space improvement if the associated values are relatively small.
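A minimal sketch of this construction in Python, assuming a perfect hash function phf is already available that maps each stored key to a distinct slot and any other key to some slot in 0..m-1; phf and m are assumptions, not part of the cited construction:

    class PhfRetrieval:
        """Static function built on top of a (minimal) perfect hash function."""

        def __init__(self, phf, m, pairs):
            self._phf = phf              # injective on the key set S
            self._table = [0] * m        # m == len(S) for a *minimal* PHF
            for key, value in pairs:
                self._table[phf(key)] = value

        def retrieve(self, key):
            # For keys in S this returns the stored value; for any other key
            # it returns whatever happens to sit in the probed slot.
            return self._table[self._phf(key)]

        def update(self, key, value):
            # Values may change, but the key set itself must stay fixed.
            self._table[self._phf(key)] = value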

XOR-retrieval

Hashed filters can be categorized by their queries into OR-, AND- and XOR-filters. For example, the Bloom filter is an AND-filter, since it returns true for a membership query if all probed locations match. XOR-filters work only for static key sets and are the most promising for building retrieval data structures space-efficiently. [2] They are built by solving a linear system which ensures that the query for every key returns the correct value.

Figure: Retrieval of an element.

Construction

Given a hash function $h$ that maps each key $x$ to a bitvector $h(x) \in \{0,1\}^m$ of length $m$, where all $h(x)$ for $x \in S$ are linearly independent, the following system of linear equations has a solution $Z \in \{0,1\}^{m \times r}$:

$$h(x)^T Z = b(x) \quad \text{for all } x \in S$$

Therefore, the static function is given by $h$ and $Z$, and the space usage is dominated by $Z$, which is roughly $(1+\varepsilon) r$ bits per key for $m = (1+\varepsilon) n$; the hash function is assumed to be small.

A retrieval for $x \in \mathcal{U}$ can be expressed as the bitwise XOR of the rows $Z_i$ of $Z$ for all set bits $i$ of $h(x)$. Furthermore, fast queries require a sparse $h(x)$, thus the problems that need to be solved for this method are finding a suitable hash function and still being able to solve the system of linear equations efficiently.
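The following Python sketch illustrates the construction and the query (illustrative only, not taken from the cited sources: the toy hash h sets k = 3 pseudo-random bits per row, values are handled as small integers, and the solver is a simple pivot-table Gaussian elimination over GF(2); m should be chosen somewhat larger than the number of keys, roughly 1.23·n for k = 3, so that the rows are linearly independent with good probability):

    import hashlib

    def h(key, m, k=3, seed=0):
        """Toy sparse hash: set k pseudo-random bits of an m-bit row vector."""
        d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=4 * k).digest()
        mask = 0
        for i in range(k):
            mask |= 1 << (int.from_bytes(d[4 * i:4 * (i + 1)], "little") % m)
        return mask

    def build_xor_retrieval(pairs, m):
        """Solve h(x)^T Z = b(x) over GF(2) for all pairs (x, b(x)); None if unsolvable."""
        pivots = {}                              # leading column -> (row mask, value)
        for key, value in pairs:
            mask, val = h(key, m), value
            while mask:
                col = mask.bit_length() - 1      # current leading 1-bit of the row
                if col not in pivots:
                    pivots[col] = (mask, val)    # row becomes the pivot of this column
                    break
                pmask, pval = pivots[col]
                mask ^= pmask                    # row operation: eliminate the leading bit
                val ^= pval
            else:
                if val:                          # 0 = nonzero: no solution, pick a new seed
                    return None
        Z = [0] * m                              # the m rows of Z; rows without a pivot stay 0
        for col in sorted(pivots):               # back-substitute, lowest pivot column first
            mask, val = pivots[col]
            for j in range(col):                 # all other set bits lie below col
                if mask >> j & 1:
                    val ^= Z[j]
            Z[col] = val
        return Z

    def retrieve(Z, key, m):
        """Query: XOR the rows Z[i] for every set bit i of h(key)."""
        mask, acc = h(key, m), 0
        while mask:
            i = (mask & -mask).bit_length() - 1  # index of the lowest set bit
            acc ^= Z[i]
            mask &= mask - 1                     # clear that bit
        return acc

Note that a query never stores or inspects the keys themselves beyond hashing them: it only XORs a few rows of Z, which is why neither the keys nor a full table of pairs has to be kept.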

Ribbon retrieval

Using a sparse random matrix $(h(x))_{x \in S}$ makes retrievals cache-inefficient, because they access most of $Z$ in a random, non-local pattern. Ribbon retrieval improves on this by giving each $h(x)$ a consecutive "ribbon" of width $w = \mathcal{O}(\log n / \varepsilon)$ in which bits are set at random. [2]

Using the properties of $h$, the matrix $Z$ can be computed in $\mathcal{O}(n/\varepsilon^2)$ expected time: ribbon solving works by first sorting the rows by their starting position (e.g. with counting sort). Then, a row echelon form (REM) can be constructed iteratively by performing row operations on rows strictly below the current row, eliminating all 1-entries in all columns below the first 1-entry of this row. Row operations do not produce any values outside of the ribbon and are very cheap, since they only require an XOR of $w$ bits, which can be done in $\mathcal{O}(w/\log n)$ time on a RAM. It can be shown that the expected number of row operations is $\mathcal{O}(n/\varepsilon)$. Finally, the solution $Z$ is obtained by back substitution. [3]
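A simplified Python sketch of ribbon solving (illustrative only: the toy ribbon_row hash and the bit-packing of each row into a Python integer are assumptions, and a production solver packs rows into machine words as described in the cited papers):

    import hashlib

    def ribbon_row(key, m, w, seed=0):
        """Toy ribbon hash: a start position and a w-bit pattern whose first bit is set."""
        d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=16).digest()
        start = int.from_bytes(d[:8], "little") % (m - w + 1)
        pattern = (int.from_bytes(d[8:], "little") & ((1 << w) - 1)) | 1
        return start, pattern

    def build_ribbon(pairs, m, w):
        """Solve the ribbon system; returns Z (length m) or None if it is unsolvable."""
        rows = sorted((*ribbon_row(key, m, w), value) for key, value in pairs)
        coeff = [0] * m                          # per-column pivot bit pattern
        rhs = [0] * m                            # per-column right-hand side
        for start, c, v in rows:                 # rows sorted by starting position
            while True:
                if c == 0:
                    if v:                        # 0 = nonzero: no solution, pick a new seed
                        return None
                    break                        # redundant row, nothing to store
                shift = (c & -c).bit_length() - 1
                c >>= shift                      # re-align so bit 0 is the leading column
                start += shift
                if coeff[start] == 0:            # free pivot slot: store the row here
                    coeff[start], rhs[start] = c, v
                    break
                c ^= coeff[start]                # cheap row operation, stays inside the ribbon
                v ^= rhs[start]
        Z = [0] * m
        for col in range(m - 1, -1, -1):         # back substitution, right to left
            if coeff[col]:
                acc, c, j = rhs[col], coeff[col] >> 1, 1
                while c:
                    if c & 1:
                        acc ^= Z[col + j]
                    c >>= 1
                    j += 1
                Z[col] = acc
        return Z

    def ribbon_retrieve(Z, key, m, w):
        """Query: XOR the consecutive entries of Z selected by the key's ribbon pattern."""
        start, pattern = ribbon_row(key, m, w)
        acc, j = 0, 0
        while pattern:
            if pattern & 1:
                acc ^= Z[start + j]
            pattern >>= 1
            j += 1
        return acc

Because every row only touches w consecutive positions, both the elimination and the queries work inside one small window of Z, which is what restores cache efficiency compared to the sparse random rows above.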

Applications

Figure: Hash functions that led to the keys' insertions into a cuckoo hash table are used to build a perfect hash function.

Approximate membership

To build an approximate membership data structure, use a fingerprinting function $f \colon \mathcal{U} \to \{0,1\}^r$. Then build a static function $D_f$ on $f$ restricted to the domain of our keys $S$.

Checking the membership of an element $x \in \mathcal{U}$ is done by evaluating $D_f$ with $x$ and returning true if the returned value equals $f(x)$.

The performance of this data structure is exactly the performance of the underlying static function; a false positive, i.e. a key $x \notin S$ for which the retrieved value happens to equal $f(x)$, occurs with probability about $2^{-r}$. [4]
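A minimal sketch, reusing build_xor_retrieval and retrieve from the XOR-retrieval example above as the static function; the fingerprint function and all names are illustrative:

    import hashlib

    def fingerprint(key, r, seed=0):
        """r-bit fingerprint f(key)."""
        d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(d, "little") & ((1 << r) - 1)

    def build_amq(keys, r, m):
        """Static function that stores f(x) for every key x in the set."""
        pairs = [(key, fingerprint(key, r)) for key in keys]
        return build_xor_retrieval(pairs, m)     # static function from the sketch above

    def maybe_contains(Z, key, r, m):
        """No false negatives; false positives occur with probability about 2^-r."""
        return retrieve(Z, key, m) == fingerprint(key, r)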

Perfect hash functions

A retrieval data structure can be used to construct a perfect hash function: first, insert the keys into a cuckoo hash table with $k$ hash functions $h_1, \dots, h_k$ and buckets of size 1. Then, for every key, store the index of the hash function that led to the key's insertion into the hash table in a $\lceil \log_2 k \rceil$-bit retrieval data structure $D$. The perfect hash function is given by $x \mapsto h_{D(x)}(x)$. [5]
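A rough Python sketch of the idea (all names and parameters are illustrative; the eviction policy is a simplification of cuckoo hashing, and a plain dict stands in for the ⌈log₂ k⌉-bit retrieval data structure D):

    def build_perfect_hash(keys, k=3, load=0.9, max_kicks=500, seed=0):
        """Cuckoo-hash the keys into buckets of size 1 and remember, for each key,
        which of the k hash functions placed it (the values one would store in D)."""
        m = int(len(keys) / load) + 1            # load kept below the ~0.91 threshold for k = 3
        hs = [lambda x, i=i: hash((seed, i, x)) % m for i in range(k)]
        table = [None] * m                       # one slot per bucket
        choice = {}                              # key -> index of the hash function used
        for key in keys:
            cur, i = key, 0
            for _ in range(max_kicks):
                pos = hs[i](cur)
                if table[pos] is None:           # free bucket: place the key here
                    table[pos], choice[cur] = cur, i
                    break
                table[pos], evicted = cur, table[pos]   # kick out the current occupant
                choice[cur] = i
                cur = evicted
                i = (choice[cur] + 1) % k        # retry the evicted key with its next function
            else:
                raise RuntimeError("insertion failed; rebuild with a different seed")
        return lambda x: hs[choice[x]](x)        # perfect hash function x -> h_{D(x)}(x)

The returned function is injective on the original key set because every key ends up in its own bucket; storing choice in an actual retrieval data structure instead of a dict makes the construction space-efficient, since only ⌈log₂ k⌉ bits per key are needed.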


References

  1. Walzer, Stefan (2020). Random hypergraphs for hashing-based data structures (PhD thesis). pp. 27–30.
  2. Dillinger, Peter C.; Walzer, Stefan (2021). "Ribbon filter: practically smaller than Bloom and Xor". arXiv: 2103.02515 [cs.DS].
  3. Dietzfelbinger, Martin; Walzer, Stefan (2019). "Efficient Gauss elimination for near-quadratic matrices with one short random block per row, with applications". In Bender, Michael A.; Svensson, Ola; Herman, Grzegorz (eds.). 27th Annual European Symposium on Algorithms, ESA 2019, September 9–11, 2019, Munich/Garching, Germany. LIPIcs. Vol. 144. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. pp. 39:1–39:18. arXiv: 1907.04750 . doi:10.4230/LIPIcs.ESA.2019.39.
  4. Dietzfelbinger, Martin; Pagh, Rasmus (2008). "Succinct data structures for retrieval and approximate membership (extended abstract)". In Aceto, Luca; Damgård, Ivan; Goldberg, Leslie Ann; Halldórsson, Magnús M.; Ingólfsdóttir, Anna; Walukiewicz, Igor (eds.). Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7–11, 2008, Proceedings, Part I: Track A: Algorithms, Automata, Complexity, and Games. Lecture Notes in Computer Science. Vol. 5125. Springer. pp. 385–396. arXiv: 0803.3693 . doi:10.1007/978-3-540-70575-8_32.
  5. Walzer, Stefan (2021). "Peeling close to the orientability threshold – spatial coupling in hashing-based data structures". In Marx, Dániel (ed.). Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10–13, 2021. Society for Industrial and Applied Mathematics. pp. 2194–2211. arXiv: 2001.10500 . doi:10.1137/1.9781611976465.131.