Cell-probe model

Last updated September 12, 2024

In computer science, the cell-probe model is a model of computation similar to the random-access machine, except that all operations are free except memory access. This model is useful for proving lower bounds of algorithms for data structure problems.

Overview

The cell-probe model is a modification of the random-access machine model, in which computational cost is only assigned to accessing memory cells.

The model is intended for proving lower bounds on the complexity of data structure problems. One type of such problems has two phases: the preprocessing phase and the query phase. The input to the first phase, the preprocessing phase, is a set of data from which to build some structure from memory cells. The input to the second phase, the query phase, is a query parameter. The query has to consult the data structure in order to compute its result; for example, a query may be asked to determine if the query parameter was included in the original input data set. Another type of problem involves both update operations, that modify the data structure, and query operations. For example, an update may add an element to the structure, or remove one. In both cases, the cell-probe complexity of the data structure is characterized by the number of memory cells accessed during preprocessing, query and (if relevant) update. The cell probe complexity is a lower bound on the time complexity of the corresponding operations on a random-access machine, where memory transfers are part of the operations counted in measuring time.

An example of such a problem is the dynamic partial sum problem.^[1]^[2]

History

Andrew Yao's 1981 paper "Should Tables Be Sorted?"^[3] is considered as the introduction of the cell-probe model. Yao used it to give a minimum number of memory cell "probes" or accesses necessary to determine whether a given query datum exists within a table stored in memory. In 1989, Fredman and Saks initiated the study of cell probe lower bounds for dynamic data-structure problems (i.e., involving updates and queries), and introduced the notation CPROBE(b) for the cell-probe model assuming that a memory cell (word) consists of b bits.^[4]

Notable results

Searching Tables

Yao^[3] considered a static data-structure problem where one has to build a data structure ("table") to represent a set $S$ of $n$ elements out of $1,\dots ,m$ . The query parameter is a number $x\leq m$ and the query has to report whether $x$ is in the table. A crucial requirement is that the table consist of exactly $n$ entries, where each entry is an integer between $1$ and $m$ . Yao showed that as long as the table size is bounded independently of $m$ and $m$ is large enough, a query must perform $\lceil \log(n+1)\rceil$ probes in the worst case. This shows that a sorted table together with binary search for queries is an optimal scheme, in this restricted setting.

It is worth noting that in the same paper, Yao^[3] also showed, that if the problem is relaxed to allow the data structure to store $n+1$ entries, then the queries can be performed using only two probes.^{[note 1]} This upper bound, similarly to the lower bound described above, also requires $m$ to be sufficiently large, as a function of $n$ . Remarkably, this upper bound uses only one additional table entry than the setting for which the lower bound applies.

Dynamic Partial Sums

The dynamic partial sum problem defines two operations Update $(k,v)$ which sets the value in an array $A$ at index $k$ to be $v$ , and Sum $(k)$ which returns the sum of the values in $A$ at indices 0 through $k$ . A naïve implementation would take $O(1)$ time for Update and $O(n)$ time for Sum.^[5]

Instead, values can be stored as leaves in a tree whose inner nodes store the sum over the subtree rooted at that node. In this structure Update requires $O(\log n)$ time to update each node in the leaf to root path, and Sum similarly requires $O(\log n)$ time to traverse the tree from leaf to root summing the values of all subtrees left of the query index.

Improving on a result of Fredman and Saks, Mihai Pătraşcu used the cell-probe model and an information transfer argument to show that the partial sums problem requires $\Omega \left(\log n\right)$ time per operation in the worst case (i.e., the worst of query and update must consume such time), assuming $b=\Omega (\log n)$ bits per word.^[1]^[2] He further exhibited the trade-off curve between update time and query time and investigated the case that updates are restricted to small numbers (of $\delta =o(b)$ bits).

Disjoint Set Maintenance (Union-Find)

In the disjoint-set data structure, the structure represents a collection of disjoint sets; there is an update operation, called Union, which unites two sets, and a query operation, called Find, which identifies the set to which a given element belongs. Fredman and Saks proved that in the model CPROBE(log n), any solution for this problem requires $\Omega (m\alpha (m,n))$ probes in the worst case (even in expectation) to execute $n-1$ unions and $m\geq n$ finds.^[4] This shows that the classic data structure described in the article on disjoint-set data structure is optimal.

Approximate Nearest Neighbor Searching

The exact nearest neighbor search problem is to determine the closest in a set of input points to a given query point. An approximate version of this problem is often considered since many applications of this problem are in very high dimension spaces and solving the problem in high dimensions requires exponential time or space with respect to the dimension.^[6]

Chakrabarti and Regev proved that the approximate nearest neighbor search problem on the Hamming cube using polynomial storage and $d^{O(1)}$ word size requires a worst-case query time of $\Omega \left({\frac {\log \log d}{\log \log \log d}}\right)$ . This proof used the cell-probe model and information theoretic techniques from communication complexity.

The Cell-Probe Model versus Random Access Machines

In the cell probe model, limiting the range of values that can be stored in a cell is paramount (otherwise one could encode the whole data structure in one cell). The idealized random access machine used as a computational model in Computer Science does not impose a limit on the contents of each cell (in contrast to the word RAM). Thus cell probe lower bounds apply to the word RAM, but do not apply to the idealized RAM. Certain techniques for cell-probe lower bounds can, however, be carried over to the idealized RAM with an algebraic instruction set and similar lower bounds result.^[7]

External links

NIST's Dictionary of Algorithms and Data Structures entry on the cell-probe model

Related Research Articles

In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array.

In computer science, the computational complexity or simply complexity of an algorithm is the amount of resources required to run it. Particular focus is given to computation time and memory storage requirements. The complexity of a problem is the complexity of the best algorithms that allow solving the problem.

In theoretical computer science, communication complexity studies the amount of communication required to solve a problem when the input to the problem is distributed among two or more parties. The study of communication complexity was first introduced by Andrew Yao in 1979, while studying the problem of computation distributed among several machines. The problem is usually stated as follows: two parties each receive a $-bit string and . The goal is for Alice to compute the value of a certain function,, that depends on both and, with the least amount of communication between them.$

In computer science, a perfect hash function $h$ for a set $S$ is a hash function that maps distinct elements in $S$ to a set of $m$ integers, with no collisions. In mathematical terms, it is an injective function.

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjan's 1986 article.

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets, and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

In computational geometry, Klee's measure problem is the problem of determining how efficiently the measure of a union of (multidimensional) rectangular ranges can be computed. Here, a d-dimensional rectangular range is defined to be a Cartesian product of d intervals of real numbers, which is a subset of R^d.

In computational complexity theory, the element distinctness problem or element uniqueness problem is the problem of determining whether all the elements of a list are distinct.

In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but still allows for efficient query operations. The concept was originally introduced by Jacobson to encode bit vectors, (unlabeled) trees, and planar graphs. Unlike general lossless data compression algorithms, succinct data structures retain the ability to use them in-place, without decompressing them first. A related notion is that of a compressed data structure, insofar as the size of the stored or encoded data similarly depends upon the specific content of the data itself.

The dynamic convex hull problem is a class of dynamic problems in computational geometry. The problem consists in the maintenance, i.e., keeping track, of the convex hull for input data undergoing a sequence of discrete changes, i.e., when input data elements may be inserted, deleted, or modified. It should be distinguished from the kinetic convex hull, which studies similar problems for continuously moving points. Dynamic convex hull problems may be distinguished by the types of the input data and the allowed types of modification of the input data.

In theoretical computer science, the Aanderaa–Karp–Rosenberg conjecture is a group of related conjectures about the number of questions of the form "Is there an edge between vertex $and vertex ?" that have to be answered to determine whether or not an undirected graph has a particular property such as planarity or bipartiteness. They are named after Stål Aanderaa, Richard M. Karp, and Arnold L. Rosenberg. According to the conjecture, for a wide class of properties, no algorithm can guarantee that it will be able to skip any questions: any algorithm for determining whether the graph has the property, no matter how clever, might need to examine every pair of vertices before it can give its answer. A property satisfying this conjecture is called evasive.$

In computational complexity the decision tree model is the model of computation in which an algorithm is considered to be basically a decision tree, i.e., a sequence of queries or tests that are done adaptively, so the outcome of previous tests can influence the tests performed next.

Michael Ezra Saks is an American mathematician. He is currently the Department Chair of the Mathematics Department at Rutgers University (2017–) and from 2006 until 2010 was director of the Mathematics Graduate Program at Rutgers University. Saks received his Ph.D. from the Massachusetts Institute of Technology in 1980 after completing his dissertation titled Duality Properties of Finite Set Systems under his advisor Daniel J. Kleitman.

In computer science, integer sorting is the algorithmic problem of sorting a collection of data values by integer keys. Algorithms designed for integer sorting may also often be applied to sorting problems in which the keys are floating point numbers, rational numbers, or text strings. The ability to perform integer arithmetic on the keys allows integer sorting algorithms to be faster than comparison sorting algorithms in many cases, depending on the details of which operations are allowed in the model of computing and how large the integers to be sorted are.

In computer science, the Brodal queue is a heap/priority queue structure with very low worst case time bounds: $for insertion, find-minimum, meld and decrease-key and for delete-minimum and general deletion. They are the first heap variant to achieve these bounds without resorting to amortization of operational costs. Brodal queues are named after their inventor Gerth Stølting Brodal.$

In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array. For example, a common task, known as range minimum query, is finding the smallest value inside a given range within a list of numbers.

In computer science, $sorting is the problem of sorting pairs of numbers by their sums. Applications of the problem include transit fare minimisation, VLSI design, and sparse polynomial multiplication. As with comparison sorting and integer sorting more generally, algorithms for this problem can be based only on comparisons of these sums, or on other operations that work only when the inputs are small integers.$

In theoretical computer science, the word RAM model is a model of computation in which a random-access machine does arithmetic and bitwise operations on a word of $w$ bits. Michael Fredman and Dan Willard created it in 1990 to simulate programming languages like C.

In computer science, the predecessor problem involves maintaining a set of items to, given an element, efficiently query which element precedes or succeeds that element in an order. Data structures used to solve the problem include balanced binary search trees, van Emde Boas trees, and fusion trees. In the static predecessor problem, the set of elements does not change, but in the dynamic predecessor problem, insertions into and deletions from the set are allowed.

In computer science, the list-labeling problem involves maintaining a totally ordered set S supporting the following operations:

References

Notes

↑ See Theorem 3 of Yao's paper.

Citations

1 2 Pătraşcu, Mihai; Demaine, Erik D. (2006). "Logarithmic lower bounds in the cell-probe model". SIAM Journal on Computing. 35 (4): 932–963. arXiv: cs/0502041 . Bibcode:2005cs........2041P. doi:10.1137/s0097539705447256. S2CID 2202874.
1 2 Pătraşcu, Mihai. "Lower Bounds for Dynamic Partial Sums" (PDF). Retrieved 9 April 2014.
1 2 3 Yao, Andrew Chi-Chih (July 1981). "Should Tables Be Sorted?". J. ACM. 28 (3): 615–628. doi: 10.1145/322261.322274 .
1 2 Fredman, Michael; Saks, Michael (1989). "The cell probe complexity of dynamic data structures". Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89. Vol. 21. pp. 345–354. doi:10.1145/73007.73040. ISBN 0897913078. S2CID 13470414.
↑ Zatloukal, Kevin (November 10, 2010). "Notes on "Logarithmic Lower Bounds in the Cell-Probe Model"" (PDF). Retrieved 9 April 2014.
↑ Chakrabarti, Amit; Regev, Oded (2004). "An Optimal Randomised Cell Probe Lower Bound for Approximate Nearest Neighbour Searching". 45th Annual IEEE Symposium on Foundations of Computer Science. pp. 473–482. doi:10.1109/FOCS.2004.12. ISBN 0-7695-2228-9. S2CID 3004976.
↑ Ben-Amram, Amir M.; Galil, Zvi (2002). "Lower bounds for dynamic data structures on algebraic RAMs". Algorithmica. 32 (3): 364–395. doi:10.1007/s00453-001-0079-6. S2CID 22324845.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[5] See Theorem 3 of Yao's paper.

[patrascu2006-1] 1 2 Pătraşcu, Mihai; Demaine, Erik D. (2006). "Logarithmic lower bounds in the cell-probe model". SIAM Journal on Computing. 35 (4): 932–963. arXiv: cs/0502041 . Bibcode:2005cs........2041P. doi:10.1137/s0097539705447256. S2CID 2202874.

[patrascu_morin-2] 1 2 Pătraşcu, Mihai. "Lower Bounds for Dynamic Partial Sums" (PDF). Retrieved 9 April 2014.

[yao81-3] 1 2 3 Yao, Andrew Chi-Chih (July 1981). "Should Tables Be Sorted?". J. ACM. 28 (3): 615–628. doi: 10.1145/322261.322274 .

[FS89-4] 1 2 Fredman, Michael; Saks, Michael (1989). "The cell probe complexity of dynamic data structures". Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89. Vol. 21. pp. 345–354. doi:10.1145/73007.73040. ISBN 0897913078. S2CID 13470414.

[zatloukal2010-6] Zatloukal, Kevin (November 10, 2010). "Notes on "Logarithmic Lower Bounds in the Cell-Probe Model"" (PDF). Retrieved 9 April 2014.

[chakrabarti2004-7] Chakrabarti, Amit; Regev, Oded (2004). "An Optimal Randomised Cell Probe Lower Bound for Approximate Nearest Neighbour Searching". 45th Annual IEEE Symposium on Foundations of Computer Science. pp. 473–482. doi:10.1109/FOCS.2004.12. ISBN 0-7695-2228-9. S2CID 3004976.

[8] Ben-Amram, Amir M.; Galil, Zvi (2002). "Lower bounds for dynamic data structures on algebraic RAMs". Algorithmica. 32 (3): 364–395. doi:10.1007/s00453-001-0079-6. S2CID 22324845.

[1]

[2]

[3]

[4]

[note 1]

[5]

[6]

[7]