Range mode query

Last updated April 17, 2021

In data structures, the range mode query problem asks to build a data structure on some input data to efficiently answer queries asking for the mode of any consecutive subset of the input.

Problem statement

Given an array $A[1:n]=[a_{1},a_{2},...,a_{n}]$ , we wish to answer queries of the form $mode(A,i:j)$ , where $1\leq i\leq j\leq n$ . The mode $mode(S)$ of any array $S=[s_{1},s_{2},...,s_{k}]$ is an element $s_{i}$ such that the frequency of $s_{i}$ is greater than or equal to the frequency of $s_{j}\;\forall j\in \{1,...,k\}$ . For example, if $S=[1,2,4,2,3,4,2]$ , then $mode(S)=2$ because it occurs three times, while all other values occur fewer times. In this problem, the queries ask for the mode of subarrays of the form $A[i:j]=[a_{i},a_{i+1},...,a_{j}]$ .
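As a baseline for the data structures discussed later, the definition can be evaluated directly by counting. The following is a minimal sketch, not part of the original article; the function name `naive_range_mode` is hypothetical, and it takes time linear in the length of the subarray for each query.

```python
from collections import Counter

def naive_range_mode(A, i, j):
    """Return (mode, frequency) of A[i..j], using the 1-based inclusive
    indexing of the problem statement."""
    counts = Counter(A[i - 1:j])
    # most_common(1) yields the single most frequent (value, count) pair.
    return counts.most_common(1)[0]

S = [1, 2, 4, 2, 3, 4, 2]
print(naive_range_mode(S, 1, 7))  # (2, 3): the value 2 occurs three times
```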

Theorem 1

Let $A$ and $B$ be any multisets. If $c$ is a mode of $A\cup B$ and $c\notin A$ , then $c$ is a mode of $B$ .

Proof

Let $c\notin A$ be a mode of $C=A\cup B$ and $f_{c}$ be its frequency in $C$ . Suppose that $c$ is not a mode of $B$ . Then there exists an element $b$ , with frequency $f_{b}$ in $B$ , that is a mode of $B$ . Since $c\notin A$ , every occurrence of $c$ in $C$ lies in $B$ , so the frequency of $c$ in $B$ is $f_{c}$ ; since $b$ is a mode of $B$ and $c$ is not, $f_{b}>f_{c}$ . The frequency of $b$ in $C$ is at least $f_{b}>f_{c}$ , so $c$ cannot be a mode of $C$ , which is a contradiction.
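The theorem can be checked on a small instance. This sketch assumes that the union of multisets adds frequencies, as happens when two subarrays are concatenated; the helper `modes` is a hypothetical name introduced only for illustration.

```python
from collections import Counter

def modes(multiset):
    """Return the set of all modes of a multiset stored as a Counter."""
    top = max(multiset.values())
    return {x for x, f in multiset.items() if f == top}

A = Counter([1, 1, 2])
B = Counter([3, 3, 3, 2])
C = A + B  # multiset union: frequencies of A and B add up

for c in modes(C):
    if c not in A:
        # Theorem 1: a mode of the union that is absent from A is a mode of B.
        assert c in modes(B)
```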

Results

Space	Query Time	Restrictions	Source
$O(n)$	$O({\sqrt {n}})$		^[1]
$O(n)$	$O({\sqrt {n/w}})$	$w$ is the word size	^[1]
$O(n^{2}\log \log n/\log n)$	$O(1)$		^[2]
$O(n^{2-2\epsilon }/\log n)$	$O(n^{\epsilon })$	$0\leq \epsilon \leq 1/2$	^[1]
$O(n^{2-2\epsilon })$	$O(n^{\epsilon }\log n)$	$0\leq \epsilon \leq 1/2$	^[2]

Lower bound

Any data structure using $S$ cells of $w$ bits each needs $\Omega \left({\frac {\log n}{\log(Sw/n)}}\right)$ time to answer a range mode query.^[3]

This contrasts with other range query problems, such as the range minimum query, which have solutions offering constant query time and linear space. This is due to the hardness of the mode problem: even if we know the mode of $A[i:j]$ and the mode of $A[j+1:k]$ , there is no simple way of computing the mode of $A[i:k]$ , because any element of $A[i:j]$ or $A[j+1:k]$ could be the mode. For example, suppose $mode(A[i:j])=a$ with frequency $f_{a}$ , and $mode(A[j+1:k])=b$ , also with frequency $f_{a}$ . There could be an element $c$ , with $c\neq a$ and $c\neq b$ , that has frequency $f_{a}-1$ in $A[i:j]$ and frequency $f_{a}-1$ in $A[j+1:k]$ . Its frequency in $A[i:k]$ is then $2f_{a}-2$ , which exceeds the frequencies of $a$ and $b$ whenever $f_{a}\geq 3$ and neither $a$ nor $b$ occurs in the other subarray, making $c$ a better candidate for $mode(A[i:k])$ than $a$ or $b$ .
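A concrete instance of this situation, with $f_{a}=3$ , is sketched below. The arrays `left` and `right` are illustrative data chosen for this example, not taken from the source.

```python
from collections import Counter

def mode_and_freq(xs):
    """Return the most frequent value of xs and its frequency."""
    return Counter(xs).most_common(1)[0]

left = ["a", "a", "a", "c", "c"]   # mode a with frequency 3; c occurs twice
right = ["b", "b", "b", "c", "c"]  # mode b with frequency 3; c occurs twice

print(mode_and_freq(left))          # ('a', 3)
print(mode_and_freq(right))         # ('b', 3)
print(mode_and_freq(left + right))  # ('c', 4): neither partial mode wins
```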

Linear space data structure with square root query time

This method by Chan et al.^[1] uses $O(n+s^{2})$ space and $O(n/s)$ query time. By setting $s={\sqrt {n}}$ , we get $O(n)$ and $O({\sqrt {n}})$ bounds for space and query time.

Preprocessing

Let $A[1:n]$ be an array, and $D[1:\Delta ]$ be an array that contains the distinct values of A, where $\Delta$ is the number of distinct elements. We define $B[1:n]$ to be an array such that, for each $i$ , $B[i]$ contains the rank (position) of $A[i]$ in $D$ . Arrays $B,D$ can be created by a linear scan of $A$ .

Arrays $Q_{1},Q_{2},...,Q_{\Delta }$ are also created, such that, for each $a\in \{1,...,\Delta \}$ , $Q_{a}$ lists in increasing order the indices $b$ with $B[b]=a$ , that is, $Q_{a}=\{b\;|\;B[b]=a\}$ . We then create an array $B'[1:n]$ , such that, for all $b\in \{1,...,n\}$ , $B'[b]$ contains the rank of $b$ in $Q_{B[b]}$ . Again, a linear scan of $B$ suffices to create arrays $Q_{1},Q_{2},...,Q_{\Delta }$ and $B'$ .

It is now possible to answer queries of the form "is the frequency of $B[i]$ in $B[i:j]$ at least $q$ " in constant time, by checking whether $Q_{B[i]}[B'[i]+q-1]\leq j$ .
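The arrays and the constant-time frequency test can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: indices are 0-based, ranks are assigned in order of first appearance, a dictionary is used to find ranks (so the values of $A$ must be hashable and the linear bound is an expected one), and the names `preprocess`, `at_least_q` and `Bp` (standing for $B'$ ) are hypothetical. The test assumes $q\geq 1$ .

```python
def preprocess(A):
    """Build D, B, Q and Bp (the text's B') in one scan, 0-based."""
    rank = {}                    # value -> rank in D
    D, B, Q, Bp = [], [], [], []
    for pos, value in enumerate(A):
        if value not in rank:
            rank[value] = len(D)
            D.append(value)
            Q.append([])
        a = rank[value]
        B.append(a)
        Bp.append(len(Q[a]))     # rank of pos inside Q[a]
        Q[a].append(pos)         # positions are appended in increasing order
    return D, B, Q, Bp

def at_least_q(Q, B, Bp, i, j, q):
    """Is the frequency of B[i] in B[i..j] (inclusive) at least q >= 1?"""
    occurrences = Q[B[i]]
    k = Bp[i] + q - 1            # index of the q-th occurrence from i onwards
    return k < len(occurrences) and occurrences[k] <= j

D, B, Q, Bp = preprocess([1, 2, 4, 2, 3, 4, 2])
print(at_least_q(Q, B, Bp, 1, 6, 3))  # True: the value 2 occurs 3 times in positions 1..6
```

The bounds check on `k` replaces the implicit convention that an out-of-range entry of $Q$ is larger than any index.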

The array $B$ is split into $s$ blocks $b_{0},b_{1},...,b_{s-1}$ , each of size $t=\lceil n/s\rceil$ . Thus, a block $b_{i}$ spans over $B[i\cdot t+1:(i+1)t]$ . The mode and the frequency of each block or set of consecutive blocks are pre-computed in two tables $S$ and $S'$ . $S[b_{i},b_{j}]$ is the mode of $b_{i}\cup b_{i+1}\cup ...\cup b_{j}$ , or equivalently (identifying each block with its index), the mode of $B[b_{i}t+1:(b_{j}+1)t]$ , and $S'$ stores the corresponding frequency. These two tables can be stored in $O(s^{2})$ space, and can be populated in $O(s\cdot n)$ time by scanning $B$ $s$ times, computing a row of $S,S'$ each time with the following algorithm (written with 0-based array indices):

algorithm computeS_Sprime is
    input: Array B = [0:n - 1],
           Array D = [0:Delta - 1],
           Integer s
    output: Tables S and Sprime

    let t ← ⌈n / s⌉
    let S ← Table(0:s - 1, 0:s - 1)
    let Sprime ← Table(0:s - 1, 0:s - 1)
    let firstOccurence ← Array(0:Delta - 1)
    for all i in {0, ..., Delta - 1} do
        firstOccurence[i] ← -1
    end for

    for i ← 0:s - 1 do
        let j ← i × t
        let c ← 0
        let fc ← 0
        let noBlock ← i
        let block_start ← j
        let block_end ← min{(i + 1) × t - 1, n - 1}
        while j < n do
            if firstOccurence[B[j]] = -1 then
                firstOccurence[B[j]] ← j
            end if
            if atLeastQInstances(firstOccurence[B[j]], block_end, fc + 1) then
                c ← B[j]
                fc ← fc + 1
            end if
            if j = block_end then
                S[i, noBlock] ← c
                Sprime[i, noBlock] ← fc
                noBlock ← noBlock + 1
                block_end ← min{block_end + t, n - 1}
            end if
            j ← j + 1
        end while
        for all j in {0, ..., Delta - 1} do
            firstOccurence[j] ← -1
        end for
    end for

Query

We will define the query algorithm over array $B$ . This can be translated to an answer over $A$ , since for any $a,i,j$ , $B[a]$ is a mode for $B[i:j]$ if and only if $A[a]$ is a mode for $A[i:j]$ . We can convert an answer for $B$ to an answer for $A$ in constant time by looking in $A$ or $B$ at the corresponding index.

Given a query $mode(B,i,j)$ , the query is split into three parts: the prefix, the span and the suffix. Let $b_{i}=\lceil (i-1)/t\rceil$ and $b_{j}=\lfloor j/t\rfloor -1$ . These denote the indices of the first and last blocks that are completely contained in $B[i:j]$ . The range of these blocks is called the span. The prefix is then $B[i:\min\{b_{i}t,j\}]$ (the set of indices before the span), and the suffix is $B[\max\{(b_{j}+1)t+1,i\}:j]$ (the set of indices after the span). The prefix, suffix or span can be empty; the span is empty if $b_{j}<b_{i}$ .
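The decomposition can be written down directly from these formulas. The sketch below keeps the 1-based inclusive positions and 0-based block indices used in the text; the function name `split_query` is hypothetical, and an interval `(lo, hi)` with `lo > hi` denotes an empty part.

```python
def split_query(i, j, t):
    """Split the 1-based inclusive range [i, j] for block size t.

    Returns (bi, bj, prefix, suffix), where blocks bi..bj form the span
    (empty if bj < bi) and prefix/suffix are (lo, hi) position pairs.
    """
    bi = -(-(i - 1) // t)            # ceil((i - 1) / t)
    bj = j // t - 1                  # floor(j / t) - 1
    prefix = (i, min(bi * t, j))
    suffix = (max((bj + 1) * t + 1, i), j)
    return bi, bj, prefix, suffix

# Block size 3: block 1 covers positions 4..6.
print(split_query(2, 8, 3))  # (1, 1, (2, 3), (7, 8))
```

When the span is empty, the prefix and suffix produced by these formulas can overlap (for instance when $i$ and $j$ lie in the same block), so a scan may visit some positions twice.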

For the span, the mode $c$ is already stored in $S[b_{i},b_{j}]$ . Let $f_{c}$ be the frequency of the mode, which is stored in $S'[b_{i},b_{j}]$ . If the span is empty, let $f_{c}=0$ . Recall that, by Theorem 1, the mode of $B[i:j]$ is either the mode of the span or an element of the prefix or suffix. A linear scan is performed over each element in the prefix and in the suffix to check if its frequency in $B[i:j]$ is greater than that of the current candidate $c$ , in which case $c$ and $f_{c}$ are updated to the new values. At the end of the scan, $c$ contains the mode of $B[i:j]$ and $f_{c}$ its frequency.

Scanning procedure

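The procedure is described here for the prefix. The suffix is handled symmetrically, scanning from $j$ towards $i$ and using later rather than earlier occurrences in $Q_{B[x]}$ , so that elements whose first occurrence in $B[i:j]$ lies in the span are still counted correctly.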

Let $x$ be the index of the current element. There are three cases:

If $Q_{B[x]}[B'[x]-1]\geq i$ , then it was present in $B[i:x-1]$ and its frequency has already been counted. Pass to the next element.
Otherwise, check if the frequency of $B[x]$ in $B[i:j]$ is at least $f_{c}$ (this can be done in constant time since it is the equivalent of checking it for $B[x:j]$ ).
1. If it is not, then pass to the next element.
2. If it is, then compute the actual frequency $f_{x}$ of $B[x]$ in $B[i:j]$ by a linear scan (starting at index $B'[x]+f_{c}-1$ ) or a binary search in $Q_{B[x]}$ . Set $c:=B[x]$ and $f_{c}:=f_{x}$ .

This linear scan (excluding the frequency computations) is bounded by the block size $t$ , since neither the prefix nor the suffix can be longer than $t$ . A further analysis of the linear scans done for frequency computations shows that they are also bounded by the block size.^[1] Thus, the query time is $O(t)=O(n/s)$ .
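Putting preprocessing and query together gives the following sketch. It is an illustration under several assumptions, not the implementation of Chan et al.: indices are 0-based and inclusive, the class name `RangeMode` and its attributes are hypothetical, the tables are filled with running counts instead of the `firstOccurence` test, the frequency test is strict (more than $f_{c}$ occurrences), and the suffix is scanned as the mirror image of the prefix, counting backwards from last occurrences.

```python
class RangeMode:
    """Block-decomposition range mode structure, 0-based inclusive queries."""

    def __init__(self, A, s):
        n = len(A)
        self.t = t = -(-n // s)              # block size, ceil(n / s)
        # Rank-compress A into B; Q[a] lists the positions of rank a and
        # Bp[pos] is the rank of pos inside Q[B[pos]] (the text's B').
        rank, self.D, self.B, self.Q, self.Bp = {}, [], [], [], []
        for pos, v in enumerate(A):
            a = rank.setdefault(v, len(rank))
            if a == len(self.Q):             # first time this value is seen
                self.D.append(v)
                self.Q.append([])
            self.B.append(a)
            self.Bp.append(len(self.Q[a]))
            self.Q[a].append(pos)
        # S[bi][bj] and Sp[bi][bj]: mode and frequency of blocks bi..bj.
        nb = -(-n // t)                      # number of non-empty blocks
        self.S = [[None] * nb for _ in range(nb)]
        self.Sp = [[0] * nb for _ in range(nb)]
        for bi in range(nb):
            count, c, fc = [0] * len(self.D), None, 0
            for j in range(bi * t, n):
                count[self.B[j]] += 1
                if count[self.B[j]] > fc:
                    c, fc = self.B[j], count[self.B[j]]
                if (j + 1) % t == 0 or j == n - 1:   # a block ends at j
                    self.S[bi][j // t], self.Sp[bi][j // t] = c, fc

    def query(self, i, j):
        """Return (mode, frequency) of A[i..j]."""
        B, Q, Bp, t = self.B, self.Q, self.Bp, self.t
        bi = -(-i // t)                      # first block starting at or after i
        bj = (j + 1) // t - 1                # last block ending at or before j
        if bi <= bj:
            c, fc = self.S[bi][bj], self.Sp[bi][bj]
            prefix, suffix = range(i, bi * t), range((bj + 1) * t, j + 1)
        else:                                # empty span: scan the whole range
            c, fc = None, 0
            prefix, suffix = range(i, j + 1), range(0)
        for x in prefix:                     # count forwards from first occurrences
            occ, k = Q[B[x]], Bp[x]
            if k > 0 and occ[k - 1] >= i:
                continue                     # an earlier occurrence was handled
            if k + fc < len(occ) and occ[k + fc] <= j:
                c = B[x]                     # more than fc occurrences in B[x..j]
                while k + fc < len(occ) and occ[k + fc] <= j:
                    fc += 1
        for x in suffix:                     # count backwards from last occurrences
            occ, k = Q[B[x]], Bp[x]
            if k + 1 < len(occ) and occ[k + 1] <= j:
                continue                     # a later occurrence will be handled
            if k - fc >= 0 and occ[k - fc] >= i:
                c = B[x]                     # more than fc occurrences in B[i..x]
                while k - fc >= 0 and occ[k - fc] >= i:
                    fc += 1
        return self.D[c], fc

A = [1, 2, 4, 2, 3, 4, 2, 1, 1, 3]
rm = RangeMode(A, 3)
print(rm.query(0, 6))  # (2, 3)
```

Choosing `s` close to $\sqrt{n}$ balances the $O(s^{2})$ table size against the $O(n/s)$ scan length.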

Subquadratic space data structure with constant query time

This method by Krizanc et al.^[2] uses $O\left({\frac {n^{2}\log {\log {n}}}{\log {n}}}\right)$ space for a constant time query. We can observe that, if a constant query time is desired, this is a better solution than the one proposed by Chan et al.,^[1] as the latter gives a space of $O(n^{2})$ for constant query time if $s=n$ .

Preprocessing

Let $A[1:n]$ be an array. The preprocessing is done in three steps:

Split the array $A$ in $s$ blocks $b_{1},b_{2},...,b_{s}$ , where the size of each block is $t=\lceil n/s\rceil$ . Build a table $S$ of size $s\times s$ where $S[i,j]$ is the mode of $b_{i}\cup b_{i+1}\cup ...\cup b_{j}$ . The total space for this step is $O(s^{2})$ .
For any query $mode(A,i,j)$ , let $b_{i'}$ be the block that contains $i$ and $b_{j'}$ be the block that contains $j$ . Let the span be the set of blocks completely contained in $A[i:j]$ . The mode $c$ of the span can be retrieved from $S$ . By Theorem 1, the mode can be either an element of the prefix (indices of $A[i:j]$ before the start of the span), an element of the suffix (indices of $A[i:j]$ after the end of the span), or $c$ . The size of the prefix plus the size of the suffix is bounded by $2t$ , thus the position of the mode is stored as an integer ranging from $0$ to $2t$ , where $[0:2t-1]$ indicates a position in the prefix/suffix and $2t$ indicates that the mode is the mode of the span. There are at most $t^{2}$ possible queries involving blocks $b_{i'}$ and $b_{j'}$ (one for each choice of $i$ in $b_{i'}$ and $j$ in $b_{j'}$ ), so these values are stored in a table of size $t^{2}$ . Furthermore, there are $(2t+1)^{t^{2}}$ possible such tables, so the total space required for this step is $O(t^{2}(2t+1)^{t^{2}})$ . To access those tables, a pointer is added in addition to the mode in the table $S$ for each pair of blocks.
To handle queries $mode(A,i,j)$ where $i$ and $j$ are in the same block, all such solutions are precomputed. There are $O(st^{2})$ of them, and they are stored in a three-dimensional table $T$ of this size.
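The third step is the simplest to illustrate. The following sketch fills a table of same-block answers by extending each range one element at a time; the function name `build_same_block_table`, the 0-based indexing and the tie-breaking rule (the first value to reach the top frequency is kept) are assumptions made for this example.

```python
def build_same_block_table(A, t):
    """T[b][x][y] is a mode of A[b*t + x .. b*t + y] for 0 <= x <= y
    inside block b (0-based); entries with x > y stay None."""
    T = []
    for start in range(0, len(A), t):
        block = A[start:start + t]
        table = [[None] * len(block) for _ in block]
        for x in range(len(block)):
            counts, best = {}, None
            for y in range(x, len(block)):
                v = block[y]
                counts[v] = counts.get(v, 0) + 1
                # Keep the current mode unless v is now strictly more frequent.
                if best is None or counts[v] > counts[best]:
                    best = v
                table[x][y] = best
        T.append(table)
    return T

T = build_same_block_table([1, 2, 2, 3, 3, 3, 1], 4)
print(T[0][0][3])  # 2: mode of the first block [1, 2, 2, 3]
```

Each block contributes at most $t^{2}$ entries, which matches the $O(st^{2})$ bound stated above.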

The total space used by this data structure is $O(s^{2}+t^{2}(2t+1)^{t^{2}}+st^{2})$ , which reduces to $O\left({\frac {n^{2}\log {\log {n}}}{\log {n}}}\right)$ if we take $t={\sqrt {\log {n}/\log {\log {n}}}}$ .
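The reduction can be checked term by term. The following is a sketch of the calculation, assuming logarithms to base 2, $n$ large enough that $t\geq 1$ and $\log \log \log n\geq 0$ , and ignoring the rounding of $t$ to an integer:

```latex
\begin{align*}
s^{2} &= O\!\left(\frac{n^{2}}{t^{2}}\right)
       = O\!\left(\frac{n^{2}\log\log n}{\log n}\right), \\
t^{2}\log(2t+1) &\le t^{2}\log(3t)
       \le \frac{\log n}{\log\log n}\left(\log 3 + \tfrac{1}{2}\log\log n\right)
       = \left(\tfrac{1}{2}+o(1)\right)\log n, \\
t^{2}(2t+1)^{t^{2}} &= t^{2}\cdot 2^{\,t^{2}\log(2t+1)} = n^{1/2+o(1)}, \\
s\,t^{2} &= O(n\,t) = O\!\left(n\sqrt{\frac{\log n}{\log\log n}}\right).
\end{align*}
```

The second and third terms are therefore of lower order than $s^{2}$ , which determines the overall bound.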

Query

Given a query $mode(A,i,j)$ , check if it is completely contained inside a block, in which case the answer is stored in table $T$ . If the query range consists exactly of one or more complete blocks, then the answer is found in table $S$ . Otherwise, use the pointer stored in table $S$ at position $S[b_{i'},b_{j'}]$ , where $b_{i'},b_{j'}$ are the indices of the blocks that contain respectively $i$ and $j$ , to find the table $U_{b_{i'},b_{j'}}$ that contains the positions of the mode for these blocks and use the position to find the mode in $A$ . This can be done in constant time.

References

[1] Chan, Timothy M.; Durocher, Stephane; Larsen, Kasper Green; Morrison, Jason; Wilkinson, Bryan T. (2013). "Linear-Space Data Structures for Range Mode Query in Arrays". Theory of Computing Systems. Springer: 1–23.
[2] Krizanc, Danny; Morin, Pat; Smid, Michiel H. M. (2003). "Range Mode and Range Median Queries on Lists and Trees". ISAAC: 517–526.
[3] Greve, M.; Jørgensen, A.; Larsen, K.; Truelsen, J. (2010). "Cell probe lower bounds and approximations for range mode". Automata, Languages and Programming: 605–616.

