Retrieval Data Structure


In computer science, a retrieval data structure, also known as a static function, is a space-efficient dictionary-like data type composed of a collection of (key, value) pairs that allows the following operations: [1]

- construction from the given collection of (key, value) pairs,
- retrieve(x), which returns the value associated with the key x and may return an arbitrary value if x is not one of the stored keys, and
- optionally, updating the value associated with a key.


They can also be thought of as a function $b \colon \mathcal{U} \to \{0,1\}^r$ for a universe $\mathcal{U}$ and the set of keys $S \subseteq \mathcal{U}$, where retrieve has to return $b(x)$ for any value $x \in S$ and an arbitrary value from $\{0,1\}^r$ otherwise.

In contrast to static functions, AMQ-filters support (probabilistic) membership queries, and dictionaries additionally allow operations like listing keys or looking up the value associated with a key and returning some other symbol if the key is not contained.

As can be derived from the operations, this data structure does not need to store the keys at all and may actually use less space than would be needed for a simple list of the key-value pairs. This makes it attractive in situations where the associated data is small (e.g. a few bits) compared to the keys, because we can save a lot by not spending space on the keys.

To give a simple example, suppose video game names annotated with a boolean indicating whether the game contains a dog that can be petted are given. A static function built from this database can reproduce the associated flag for all names contained in the original set and an arbitrary one for other names. The size of this static function can be made to be only $(1+\varepsilon) n r$ bits for a small $\varepsilon > 0$, where $n$ is the number of keys and $r$ is the number of bits per value, which is much less than any pair-based representation. [1]
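For instance (with illustrative numbers not taken from the source): for $n = 100{,}000$ game names, $r = 1$ bit per flag and $\varepsilon = 0.1$, such a static function occupies about $(1 + 0.1) \cdot 100{,}000 \cdot 1 = 110{,}000$ bits, i.e. roughly 13.4 KiB, no matter how long the game names themselves are.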

Examples

A trivial example of a static function is a sorted list of the keys and values, which implements all the above operations and many more. However, retrieval from a list is slow, and the list implements many unneeded operations whose removal would allow optimizations. Furthermore, a static function is even allowed to return junk if the queried key is not contained, a freedom the sorted list does not exploit.
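As a minimal illustration, the sorted-list variant can be sketched in Python using the standard bisect module (the class name and the fallback value 0 returned for unknown keys are illustrative choices, not part of the source):

    import bisect

    class SortedListRetrieval:
        """Trivial static function: keys and values kept in two parallel sorted lists."""

        def __init__(self, pairs):
            pairs = sorted(pairs)                    # sort once at construction time
            self._keys = [k for k, _ in pairs]
            self._vals = [v for _, v in pairs]

        def retrieve(self, key):
            i = bisect.bisect_left(self._keys, key)  # O(log n) binary search
            if i < len(self._keys) and self._keys[i] == key:
                return self._vals[i]
            return 0                                 # arbitrary value for unknown keys

Unlike a genuine retrieval data structure, this sketch still stores every key in full, which is exactly the cost the constructions below avoid.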

Perfect hash functions

Another simple way to build a static function is using a perfect hash function: after building the PHF for our keys, store the corresponding values at the position the PHF assigns to each key. As can be seen, this approach also allows updating the associated values, but the key set has to be static. The correctness follows from the correctness of the perfect hash function. Using a minimal perfect hash function gives a big space improvement if the associated values are relatively small.
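A minimal sketch of this construction in Python, assuming a perfect hash function phf is already available that maps each stored key to a distinct slot and any other key to some slot in 0..m-1; phf and m are assumptions, not part of the cited construction:

    class PhfRetrieval:
        """Static function built on top of a (minimal) perfect hash function."""

        def __init__(self, phf, m, pairs):
            self._phf = phf              # injective on the key set S
            self._table = [0] * m        # m == len(S) for a *minimal* PHF
            for key, value in pairs:
                self._table[phf(key)] = value

        def retrieve(self, key):
            # For keys in S this returns the stored value; for any other key
            # it returns whatever happens to sit in the probed slot.
            return self._table[self._phf(key)]

        def update(self, key, value):
            # Values may change, but the key set itself must stay fixed.
            self._table[self._phf(key)] = value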

XOR-retrieval

Hashed filters can be categorized by their queries into OR-, AND- and XOR-filters. For example, the Bloom filter is an AND-filter, since it returns true for a membership query if all probed locations match. XOR-filters work only for static key sets and are the most promising for building retrieval data structures space-efficiently. [2] They are built by solving a linear system which ensures that the query for every key returns the correct value.

Figure: Retrieval of an element.

Construction

Given a hash function $h$ that maps each key $x$ to a bitvector $h(x) \in \{0,1\}^m$ of length $m$, where all $h(x)$ for $x \in S$ are linearly independent, the following system of linear equations has a solution $Z \in \{0,1\}^{m \times r}$:

$$h(x)^T Z = b(x) \quad \text{for all } x \in S$$

Therefore, the static function is given by $h$ and $Z$, and the space usage is dominated by $Z$, which is roughly $(1+\varepsilon) r$ bits per key for $m = (1+\varepsilon) n$; the hash function is assumed to be small.

A retrieval for $x \in \mathcal{U}$ can be expressed as the bitwise XOR of the rows $Z_i$ of $Z$ for all set bits $i$ of $h(x)$. Furthermore, fast queries require a sparse $h(x)$, thus the problems that need to be solved for this method are finding a suitable hash function and still being able to solve the system of linear equations efficiently.
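The following Python sketch illustrates the construction and the query (illustrative only, not taken from the cited sources: the toy hash h sets k = 3 pseudo-random bits per row, values are handled as small integers, and the solver is a simple pivot-table Gaussian elimination over GF(2); m should be chosen somewhat larger than the number of keys, roughly 1.23·n for k = 3, so that the rows are linearly independent with good probability):

    import hashlib

    def h(key, m, k=3, seed=0):
        """Toy sparse hash: set k pseudo-random bits of an m-bit row vector."""
        d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=4 * k).digest()
        mask = 0
        for i in range(k):
            mask |= 1 << (int.from_bytes(d[4 * i:4 * (i + 1)], "little") % m)
        return mask

    def build_xor_retrieval(pairs, m):
        """Solve h(x)^T Z = b(x) over GF(2) for all pairs (x, b(x)); None if unsolvable."""
        pivots = {}                              # leading column -> (row mask, value)
        for key, value in pairs:
            mask, val = h(key, m), value
            while mask:
                col = mask.bit_length() - 1      # current leading 1-bit of the row
                if col not in pivots:
                    pivots[col] = (mask, val)    # row becomes the pivot of this column
                    break
                pmask, pval = pivots[col]
                mask ^= pmask                    # row operation: eliminate the leading bit
                val ^= pval
            else:
                if val:                          # 0 = nonzero: no solution, pick a new seed
                    return None
        Z = [0] * m                              # the m rows of Z; rows without a pivot stay 0
        for col in sorted(pivots):               # back-substitute, lowest pivot column first
            mask, val = pivots[col]
            for j in range(col):                 # all other set bits lie below col
                if mask >> j & 1:
                    val ^= Z[j]
            Z[col] = val
        return Z

    def retrieve(Z, key, m):
        """Query: XOR the rows Z[i] for every set bit i of h(key)."""
        mask, acc = h(key, m), 0
        while mask:
            i = (mask & -mask).bit_length() - 1  # index of the lowest set bit
            acc ^= Z[i]
            mask &= mask - 1                     # clear that bit
        return acc

Note that a query never stores or inspects the keys themselves beyond hashing them: it only XORs a few rows of Z, which is why neither the keys nor a full table of pairs has to be kept.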

Ribbon retrieval

Using a sparse random matrix $(h(x))_{x \in S}$ makes retrievals cache-inefficient, because they access most of $Z$ in a random, non-local pattern. Ribbon retrieval improves on this by giving each $h(x)$ a consecutive "ribbon" of width $w = \mathcal{O}(\log n / \varepsilon)$ in which bits are set at random. [2]

Using the properties of $h$, the matrix $Z$ can be computed in $\mathcal{O}(n/\varepsilon^2)$ expected time: ribbon solving works by first sorting the rows by their starting position (e.g. with counting sort). Then, a row echelon form (REM) can be constructed iteratively by performing row operations on rows strictly below the current row, eliminating all 1-entries in all columns below the first 1-entry of this row. Row operations do not produce any values outside of the ribbon and are very cheap, since they only require an XOR of $w$ bits, which can be done in $\mathcal{O}(w/\log n)$ time on a RAM. It can be shown that the expected number of row operations is $\mathcal{O}(n/\varepsilon)$. Finally, the solution $Z$ is obtained by back substitution. [3]
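A simplified Python sketch of ribbon solving (illustrative only: the toy ribbon_row hash and the bit-packing of each row into a Python integer are assumptions, and a production solver packs rows into machine words as described in the cited papers):

    import hashlib

    def ribbon_row(key, m, w, seed=0):
        """Toy ribbon hash: a start position and a w-bit pattern whose first bit is set."""
        d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=16).digest()
        start = int.from_bytes(d[:8], "little") % (m - w + 1)
        pattern = (int.from_bytes(d[8:], "little") & ((1 << w) - 1)) | 1
        return start, pattern

    def build_ribbon(pairs, m, w):
        """Solve the ribbon system; returns Z (length m) or None if it is unsolvable."""
        rows = sorted((*ribbon_row(key, m, w), value) for key, value in pairs)
        coeff = [0] * m                          # per-column pivot bit pattern
        rhs = [0] * m                            # per-column right-hand side
        for start, c, v in rows:                 # rows sorted by starting position
            while True:
                if c == 0:
                    if v:                        # 0 = nonzero: no solution, pick a new seed
                        return None
                    break                        # redundant row, nothing to store
                shift = (c & -c).bit_length() - 1
                c >>= shift                      # re-align so bit 0 is the leading column
                start += shift
                if coeff[start] == 0:            # free pivot slot: store the row here
                    coeff[start], rhs[start] = c, v
                    break
                c ^= coeff[start]                # cheap row operation, stays inside the ribbon
                v ^= rhs[start]
        Z = [0] * m
        for col in range(m - 1, -1, -1):         # back substitution, right to left
            if coeff[col]:
                acc, c, j = rhs[col], coeff[col] >> 1, 1
                while c:
                    if c & 1:
                        acc ^= Z[col + j]
                    c >>= 1
                    j += 1
                Z[col] = acc
        return Z

    def ribbon_retrieve(Z, key, m, w):
        """Query: XOR the consecutive entries of Z selected by the key's ribbon pattern."""
        start, pattern = ribbon_row(key, m, w)
        acc, j = 0, 0
        while pattern:
            if pattern & 1:
                acc ^= Z[start + j]
            pattern >>= 1
            j += 1
        return acc

Because every row only touches w consecutive positions, both the elimination and the queries work inside one small window of Z, which is what restores cache efficiency compared to the sparse random rows above.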

Applications

Figure: Hash functions that led to the keys' insertions into a cuckoo hash table are used to build a perfect hash function.

Approximate membership

To build an approximate membership data structure, use a fingerprinting function $f \colon \mathcal{U} \to \{0,1\}^r$. Then build a static function $D_f$ on $f$ restricted to the domain of our keys $S$.

Checking the membership of an element $x \in \mathcal{U}$ is done by evaluating $D_f$ with $x$ and returning true if the returned value equals $f(x)$.

The performance of this data structure is exactly the performance of the underlying static function; a false positive, i.e. a key $x \notin S$ for which the retrieved value happens to equal $f(x)$, occurs with probability about $2^{-r}$. [4]
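A minimal sketch, reusing build_xor_retrieval and retrieve from the XOR-retrieval example above as the static function; the fingerprint function and all names are illustrative:

    import hashlib

    def fingerprint(key, r, seed=0):
        """r-bit fingerprint f(key)."""
        d = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(d, "little") & ((1 << r) - 1)

    def build_amq(keys, r, m):
        """Static function that stores f(x) for every key x in the set."""
        pairs = [(key, fingerprint(key, r)) for key in keys]
        return build_xor_retrieval(pairs, m)     # static function from the sketch above

    def maybe_contains(Z, key, r, m):
        """No false negatives; false positives occur with probability about 2^-r."""
        return retrieve(Z, key, m) == fingerprint(key, r)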

Perfect hash functions

A retrieval data structure can be used to construct a perfect hash function: first, insert the keys into a cuckoo hash table with $k$ hash functions $h_1, \dots, h_k$ and buckets of size 1. Then, for every key, store the index of the hash function that led to the key's insertion into the hash table in a $\lceil \log_2 k \rceil$-bit retrieval data structure $D$. The perfect hash function is given by $x \mapsto h_{D(x)}(x)$. [5]
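A rough Python sketch of the idea (all names and parameters are illustrative; the eviction policy is a simplification of cuckoo hashing, and a plain dict stands in for the ⌈log₂ k⌉-bit retrieval data structure D):

    def build_perfect_hash(keys, k=3, load=0.9, max_kicks=500, seed=0):
        """Cuckoo-hash the keys into buckets of size 1 and remember, for each key,
        which of the k hash functions placed it (the values one would store in D)."""
        m = int(len(keys) / load) + 1            # load kept below the ~0.91 threshold for k = 3
        hs = [lambda x, i=i: hash((seed, i, x)) % m for i in range(k)]
        table = [None] * m                       # one slot per bucket
        choice = {}                              # key -> index of the hash function used
        for key in keys:
            cur, i = key, 0
            for _ in range(max_kicks):
                pos = hs[i](cur)
                if table[pos] is None:           # free bucket: place the key here
                    table[pos], choice[cur] = cur, i
                    break
                table[pos], evicted = cur, table[pos]   # kick out the current occupant
                choice[cur] = i
                cur = evicted
                i = (choice[cur] + 1) % k        # retry the evicted key with its next function
            else:
                raise RuntimeError("insertion failed; rebuild with a different seed")
        return lambda x: hs[choice[x]](x)        # perfect hash function x -> h_{D(x)}(x)

The returned function is injective on the original key set because every key ends up in its own bucket; storing choice in an actual retrieval data structure instead of a dict makes the construction space-efficient, since only ⌈log₂ k⌉ bits per key are needed.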


References

  1. Walzer, Stefan (2020). Random hypergraphs for hashing-based data structures (PhD thesis). pp. 27–30.
  2. Dillinger, Peter C.; Walzer, Stefan (2021). "Ribbon filter: practically smaller than Bloom and Xor". arXiv: 2103.02515 [cs.DS].
  3. Dietzfelbinger, Martin; Walzer, Stefan (2019). "Efficient Gauss elimination for near-quadratic matrices with one short random block per row, with applications". In Bender, Michael A.; Svensson, Ola; Herman, Grzegorz (eds.). 27th Annual European Symposium on Algorithms, ESA 2019, September 9–11, 2019, Munich/Garching, Germany. LIPIcs. Vol. 144. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. pp. 39:1–39:18. arXiv: 1907.04750 . doi:10.4230/LIPIcs.ESA.2019.39.
  4. Dietzfelbinger, Martin; Pagh, Rasmus (2008). "Succinct data structures for retrieval and approximate membership (extended abstract)". In Aceto, Luca; Damgård, Ivan; Goldberg, Leslie Ann; Halldórsson, Magnús M.; Ingólfsdóttir, Anna; Walukiewicz, Igor (eds.). Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7–11, 2008, Proceedings, Part I: Track A: Algorithms, Automata, Complexity, and Games. Lecture Notes in Computer Science. Vol. 5125. Springer. pp. 385–396. arXiv: 0803.3693 . doi:10.1007/978-3-540-70575-8_32.
  5. Walzer, Stefan (2021). "Peeling close to the orientability threshold – spatial coupling in hashing-based data structures". In Marx, Dániel (ed.). Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10–13, 2021. Society for Industrial and Applied Mathematics. pp. 2194–2211. arXiv: 2001.10500 . doi:10.1137/1.9781611976465.131.