In computer science, the range query problem consists of efficiently answering several queries regarding a given interval of elements within an array. For example, a common task, known as range minimum query, is finding the smallest value inside a given range within a list of numbers.
Given a function $f$ that accepts an array, a range query $f_q(l, r)$ on an array $a = [a_1, \ldots, a_n]$ takes two indices $l$ and $r$ and returns the result of $f$ when applied to the subarray $[a_l, \ldots, a_r]$. For example, for a function $\operatorname{sum}$ that returns the sum of all values in an array, the range query $\operatorname{sum}_q(l, r)$ returns the sum of all values in the range $[l, r]$.[ citation needed ]
Range sum queries may be answered in constant time and linear space by pre-computing an array $p$ of the same length as the input such that for every index $i$, the element $p_i$ is the sum of the first $i$ elements of $a$. Any query may then be computed as follows (taking $p_0 = 0$):

$$\operatorname{sum}_q(l, r) = p_r - p_{l-1}$$
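As a minimal sketch of the prefix-sum strategy, assuming 1-indexed inclusive query bounds: the function names `build_prefix_sums` and `range_sum` are illustrative, and the array `p` is given one extra leading zero (so it is one element longer than the input) to avoid a special case for $l = 1$.

```python
from itertools import accumulate


def build_prefix_sums(a):
    """Return p where p[i] is the sum of the first i elements of a (p[0] = 0)."""
    return [0] + list(accumulate(a))


def range_sum(p, l, r):
    """Sum of a[l..r], 1-indexed and inclusive, using two lookups."""
    return p[r] - p[l - 1]


a = [3, 1, 4, 1, 5, 9, 2, 6]
p = build_prefix_sums(a)
print(range_sum(p, 3, 6))  # 4 + 1 + 5 + 9 = 19
```

Pre-processing takes linear time, and each query afterwards costs two array reads and one subtraction.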
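This strategy may be extended to any other binary operation whose inverse function is well-defined and easily computable. [1] It can also be extended to higher dimensions with a similar pre-processing. [2] For example, if $p_{i,j}$ contains the sum of the first $i \times j$ elements of $a$ (those with first index at most $i$ and second index at most $j$), then a query over first indices $l$ to $r$ and second indices $t$ to $b$ is

$$\operatorname{sum}_q(l, r, t, b) = p_{r,b} - p_{l-1,b} - p_{r,t-1} + p_{l-1,t-1}$$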
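The two-dimensional extension can be sketched in the same way. This is an illustrative example, not a reference implementation: the names `build_prefix_sums_2d` and `range_sum_2d` are assumptions, rows are taken to run from `l` to `r` and columns from `t` to `b` (1-indexed, inclusive), and the table is padded with a zero row and column.

```python
def build_prefix_sums_2d(a):
    """Return p where p[i][j] is the sum of the first i rows and j columns of a."""
    rows, cols = len(a), len(a[0])
    p = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Inclusion-exclusion: add the two neighbours, remove the overlap.
            p[i][j] = a[i - 1][j - 1] + p[i - 1][j] + p[i][j - 1] - p[i - 1][j - 1]
    return p


def range_sum_2d(p, l, r, t, b):
    """Sum over rows l..r and columns t..b (1-indexed, inclusive)."""
    return p[r][b] - p[l - 1][b] - p[r][t - 1] + p[l - 1][t - 1]


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
p2 = build_prefix_sums_2d(grid)
print(range_sum_2d(p2, 2, 3, 2, 3))  # 5 + 6 + 8 + 9 = 28
```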
A more difficult subset of the problem consists of executing range queries on dynamic data; that is, data that may mutate between each query. In order to efficiently update array values, more sophisticated data structures like the segment tree or Fenwick tree are necessary.[ citation needed ]
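For dynamic data, a Fenwick tree supports both point updates and range sum queries in logarithmic time. The class below is a minimal sketch under the assumption of 1-indexed positions; the class and method names are illustrative.

```python
class FenwickTree:
    """Binary indexed tree over positions 1..n supporting point updates."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values, start=1):
            self.add(i, v)

    def add(self, i, delta):
        """Add delta to the element at position i."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # move to the next node responsible for position i

    def prefix_sum(self, i):
        """Sum of the elements at positions 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def range_sum(self, l, r):
        """Sum of the elements at positions l..r, inclusive."""
        return self.prefix_sum(r) - self.prefix_sum(l - 1)


ft = FenwickTree([3, 1, 4, 1, 5])
print(ft.range_sum(2, 4))  # 1 + 4 + 1 = 6
ft.add(3, 10)              # the third element becomes 14
print(ft.range_sum(2, 4))  # 1 + 14 + 1 = 16
```

Compared with a static prefix-sum array, queries become logarithmic rather than constant time, but an update no longer forces the whole table to be rebuilt.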
When the function of interest in a range query is a semigroup operator, the notion of $f^{-1}$ is not always defined, so the strategy in the previous section does not work. Andrew Yao showed [3] that there exists an efficient solution for range queries that involve semigroup operators. He proved that for any constant $c$, pre-processing with $\Theta(c \cdot n)$ time and space allows range queries on lists, where $f$ is a semigroup operator, to be answered in $\Theta(\alpha_c(n))$ time, where $\alpha_c$ is a certain functional inverse of the Ackermann function.
There are some semigroup operators that admit slightly better solutions, for instance when $f \in \{\max, \min\}$. Assume $f = \min$; then $\min(A[1..n])$ returns the index of the minimum element of $A[1..n]$, and $\min_{i,j}(A)$ denotes the corresponding range minimum query. There are several data structures that answer a range minimum query in $O(1)$ time using pre-processing of $O(n)$ time and space. One such solution is based on the equivalence between this problem and the lowest common ancestor problem.
The Cartesian tree $T_A$ of an array $A[1,n]$ has as root $a_i = \min\{a_1, a_2, \ldots, a_n\}$ and as left and right subtrees the Cartesian tree of $A[1,i-1]$ and the Cartesian tree of $A[i+1,n]$ respectively. A range minimum query $\min_{i,j}(A)$ is the lowest common ancestor in $T_A$ of $a_i$ and $a_j$. Because the lowest common ancestor can be solved in constant time using pre-processing of $O(n)$ time and space, range minimum query can as well. The solution when $f = \max$ is analogous. Cartesian trees can be constructed in linear time.
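The connection between Cartesian trees and range minimum queries can be illustrated with a short sketch. It builds the tree in linear time with a stack and then finds the lowest common ancestor by walking down from the root, which takes time proportional to the tree height rather than the constant time achieved by the dedicated lowest common ancestor structures mentioned above. Indices are 0-based here, and the function names are illustrative.

```python
def build_cartesian_tree(a):
    """Build the min-Cartesian tree of a non-empty list in linear time.

    Returns (root, left, right), where left[i] and right[i] are the child
    indices of node i, or -1 when the child is absent.
    """
    n = len(a)
    left = [-1] * n
    right = [-1] * n
    stack = []  # indices on the rightmost path, values non-decreasing
    for i in range(n):
        last = -1
        while stack and a[stack[-1]] > a[i]:
            last = stack.pop()
        left[i] = last            # popped chain becomes the left subtree
        if stack:
            right[stack[-1]] = i  # i is the new right child of the top
        stack.append(i)
    return stack[0], left, right


def range_min_index(root, left, right, i, j):
    """Index of the minimum of a[i..j] (0-based, inclusive, i <= j)."""
    node = root
    # The lowest common ancestor of i and j is the first node on the way
    # down from the root whose index lies inside [i, j].
    while not (i <= node <= j):
        node = right[node] if node < i else left[node]
    return node


a = [5, 2, 7, 1, 8, 3]
root, left, right = build_cartesian_tree(a)
print(range_min_index(root, left, right, 0, 2))  # 1, since a[1] = 2
```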
The mode of an array is the element that appears the most in it. For instance the mode of $A = [4, 5, 6, 7, 4]$ is 4. In case of a tie, any of the most frequent elements might be picked as the mode. A range mode query consists of pre-processing $A[1,n]$ such that the mode of any range of $A[1,n]$ can be found. Several data structures have been devised to solve this problem; some of the results are summarized in the following table. [1]
Space | Query Time | Restrictions |
---|---|---|
Jørgensen et al. proved a lower bound in the cell-probe model of $\Omega\left(\frac{\log n}{\log(Sw/n)}\right)$ on the query time of any data structure that uses $S$ cells. [4]
This particular case is of special interest since finding the median has several applications. [5] On the other hand, the median problem, a special case of the selection problem, is solvable in $O(n)$ time using the median of medians algorithm. [6] However, its generalization through range median queries is recent. [7] A range median query $\operatorname{median}(A, i, j)$, where $A$, $i$ and $j$ have the usual meanings, returns the median element of $A[i,j]$. Equivalently, $\operatorname{median}(A, i, j)$ should return the element of $A[i,j]$ of rank $\frac{j-i}{2}$. Range median queries cannot be solved by any of the methods discussed above, including Yao's approach for semigroup operators. [8]
Two variants of this problem have been studied: the offline version, where all the $k$ queries of interest are given in a batch, and a version where all the pre-processing is done up front. The offline version can be solved with $O(n \log k + k \log n)$ time and $O(n \log k)$ space.
The following pseudocode of the quickselect algorithm shows how to find the element of rank $r$ in $A[i,j]$, where $A$ is an unsorted array of distinct elements. To find the range medians we set $r = \frac{j-i}{2}$. [7]
    rangeMedian(A, i, j, r) {
        if A.length() == 1
            return A[1]
        if A.low is undefined then
            m = median(A)
            A.low  = [e in A | e <= m]
            A.high = [e in A | e > m]
        calculate t, the number of elements of A[i, j] that belong to A.low
        if r <= t then
            return rangeMedian(A.low, i, j, r)
        else
            return rangeMedian(A.high, i, j, r-t)
    }
Procedure `rangeMedian` partitions `A`, using `A`'s median, into two arrays `A.low` and `A.high`, where the former contains the elements of `A` that are less than or equal to the median `m` and the latter the rest of the elements of `A`. If we know that the number of elements of $A[i,j]$ that end up in `A.low` is `t` and this number is at least `r`, then we should keep looking for the element of rank `r` in `A.low`; otherwise we should look for the element of rank $(r - t)$ in `A.high`. To find `t`, it is enough to count the elements of `A.low` whose original index is at most $j$ and those whose original index is at most $i - 1$; `t` is the difference between the two counts. The total cost for any query, without considering the partitioning part, is $O(\log n)$, since at most $\log n$ recursive calls are made and only a constant number of operations are performed in each of them (to get the value of `t`, fractional cascading should be used). If a linear algorithm to find the medians is used, the total cost of pre-processing for $k$ range median queries is $O(n \log k)$. The algorithm can also be modified to solve the online version of the problem. [7]
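The recursion described above can be turned into a small runnable sketch. It departs from the pseudocode in several ways, all of which are assumptions made for brevity: the partition is built eagerly for the whole array rather than lazily per query, the median is found with `statistics.median_low` (which sorts) instead of a linear-time selection algorithm, prefix counts replace fractional cascading for computing `t`, ranges are 0-based and half-open, and the median rank is taken as the lower median. The class name `RangeSelect` is illustrative.

```python
from statistics import median_low


class RangeSelect:
    """Recursive median partition of distinct values, for rank queries on ranges."""

    def __init__(self, values):
        self.values = values
        if len(values) > 1:
            m = median_low(values)
            # low_count[k] = how many of the first k values are <= m
            self.low_count = [0]
            for v in values:
                self.low_count.append(self.low_count[-1] + (v <= m))
            self.low = RangeSelect([v for v in values if v <= m])
            self.high = RangeSelect([v for v in values if v > m])

    def select(self, i, j, r):
        """Element of rank r (1-based) among values[i:j], with 1 <= r <= j - i."""
        if len(self.values) == 1:
            return self.values[0]
        lo_i, lo_j = self.low_count[i], self.low_count[j]
        t = lo_j - lo_i  # number of elements of values[i:j] that went to low
        if r <= t:
            return self.low.select(lo_i, lo_j, r)
        return self.high.select(i - lo_i, j - lo_j, r - t)


def range_median(rs, i, j):
    """Lower median of values[i:j] (an assumed rank convention)."""
    return rs.select(i, j, (j - i + 1) // 2)


a = [7, 2, 9, 4, 1, 8, 3]
rs = RangeSelect(a)
print(range_median(rs, 1, 5))  # a[1:5] = [2, 9, 4, 1], lower median is 2
```

Each query follows a single root-to-leaf path, so it performs a logarithmic number of constant-time steps once the structure has been built.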
Finding frequent elements in a given set of items is one of the most important tasks in data mining. It can be difficult when most items have similar frequencies, so it may be more useful to apply some threshold of significance when detecting such items. One of the most famous algorithms for finding the majority of an array was proposed by Boyer and Moore [9] and is also known as the Boyer–Moore majority vote algorithm. It finds the majority element of a string (if it has one) in $O(n)$ time and using $O(1)$ space. In the context of Boyer and Moore's work, and generally speaking, a majority element in a set of items (for example a string or an array) is one whose number of instances is more than half of the size of that set. A few years later, Misra and Gries [10] proposed a more general version of Boyer and Moore's algorithm that uses $O\left(n \log\left(\frac{1}{\tau}\right)\right)$ comparisons to find all items in an array whose relative frequencies are greater than some threshold $0 < \tau < 1$.

A range $\tau$-majority query is one that, given a subrange of a data structure (for example an array) of size $|R|$, returns the set of all distinct items that appear more than (or in some publications at least) $\tau |R|$ times in that range. In different structures that support range $\tau$-majority queries, $\tau$ can be either static (specified during pre-processing) or dynamic (specified at query time). Many such approaches are based on the fact that, regardless of the size of the range, for a given $\tau$ there can be at most $O(1/\tau)$ distinct candidates with relative frequencies at least $\tau$. By verifying each of these candidates in constant time, $O(1/\tau)$ query time is achieved. A range $\tau$-majority query is decomposable [11] in the sense that a $\tau$-majority in a range $R$ with partitions $R_1$ and $R_2$ must be a $\tau$-majority in either $R_1$ or $R_2$.
Due to this decomposability, some data structures answer $\tau$-majority queries on one-dimensional arrays by finding the lowest common ancestor (LCA) of the endpoints of the query range in a range tree and validating two sets of candidates (of size $O(1/\tau)$), one from each endpoint to the lowest common ancestor, in constant time per candidate, resulting in $O(1/\tau)$ query time.
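The two classical whole-array algorithms mentioned above, the Boyer–Moore majority vote and the Misra–Gries generalization, can be sketched as follows. This is a simplified illustration rather than the authors' exact formulation: the function names are assumptions, the Misra–Gries variant uses a plain dictionary with $\lceil 1/\tau \rceil$ counters, and both functions make a second pass to verify candidates.

```python
import math
from collections import Counter


def boyer_moore_majority(items):
    """Return the element occurring more than len(items)/2 times, or None."""
    candidate, count = None, 0
    for x in items:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # The vote only yields a candidate; a second pass confirms it.
    occurrences = sum(1 for x in items if x == candidate)
    return candidate if 2 * occurrences > len(items) else None


def tau_majorities(items, tau):
    """Return all elements with relative frequency greater than tau (0 < tau < 1)."""
    capacity = math.ceil(1 / tau)
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < capacity:
            counters[x] = 1
        else:
            # No free counter: decrement every counter, dropping those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Surviving keys are only candidates; count them exactly to confirm.
    exact = Counter(x for x in items if x in counters)
    return {x for x in counters if exact[x] > tau * len(items)}


print(boyer_moore_majority("aabab"))                     # a
print(tau_majorities([1, 1, 2, 2, 2, 3, 1, 1], 0.3))     # {1, 2}
```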
Gagie et al. [12] proposed a data structure that supports range $\tau$-majority queries on an $m \times n$ array $A$. For each query $Q = (R, \tau)$ in this data structure a threshold $0 < \tau < 1$ and a rectangular range $R$ are specified, and the set of all elements that have relative frequencies (inside that rectangular range) greater than or equal to $\tau$ is returned as the output. This data structure supports dynamic thresholds (specified at query time) and a pre-processing threshold $\alpha$ based on which it is constructed. During the pre-processing, a set of vertical and horizontal intervals is built on the $m \times n$ array. Together, a vertical and a horizontal interval form a block. Each block is part of a superblock nine times bigger than itself (three times the size of the block's horizontal interval and three times the size of its vertical one). For each block a set of candidates (with at most $\frac{9}{\alpha}$ elements) is stored, consisting of the elements that have relative frequencies at least $\frac{\alpha}{9}$ (derived from the pre-processing threshold mentioned above) in its respective superblock. These elements are stored in non-increasing order according to their frequencies, and it is easy to see that any element that has a relative frequency at least $\alpha$ in a block must appear in its set of candidates. Each $\tau$-majority query is first answered by finding the query block, or the biggest block that is contained in the provided query rectangle, in $O(1)$ time. For the obtained query block, the first $\frac{9}{\tau}$ candidates are returned (without being verified) in $O(1/\tau)$ time, so this process might return some false positives. Many other data structures (as discussed below) have proposed methods for verifying each candidate in constant time and thus maintaining the $O(1/\tau)$ query time while returning no false positives. The cases in which the query block is smaller than $1/\alpha$ are handled by storing $\log\left(\frac{1}{\alpha}\right)$ different instances of this data structure of the following form:

$$\beta = 2^{-i}, \quad i \in \left\{1, \ldots, \log\left(\frac{1}{\alpha}\right)\right\}$$

where $\beta$ is the pre-processing threshold of the $i$-th instance. Thus, for query blocks smaller than $1/\beta$ the $\lceil \log(1/\beta) \rceil$-th instance is queried. As mentioned above, this data structure has query time $O(1/\tau)$ and requires $O\left(mn(H+1)\log^2\left(\frac{1}{\alpha}\right)\right)$ bits of space by storing a Huffman-encoded copy of it (note the $\log\left(\frac{1}{\alpha}\right)$ factor and also see Huffman coding).
Chan et al. [13] proposed a data structure that, given a one-dimensional array $A$, a subrange $R$ of $A$ (specified at query time) and a threshold $\tau$ (specified at query time), is able to return the list of all $\tau$-majorities in $O(1/\tau)$ time, requiring $O(n \log n)$ words of space. To answer such queries, Chan et al. [13] begin by noting that there exists a data structure capable of returning the top-k most frequent items in a range in $O(k)$ time, requiring $O(n)$ words of space. For a one-dimensional array $A[0, \ldots, n-1]$, let a one-sided top-k range query be of the form $A[0..i]$ for $0 \le i \le n-1$. For a maximal range of ranges $A[0..i]$ through $A[0..j]$ in which the frequency of a distinct element $e$ in $A$ remains unchanged (and equal to $f$), a horizontal line segment is constructed. The $x$-interval of this line segment corresponds to $[i, j]$ and it has a $y$-value equal to $f$. Since adding each element to $A$ changes the frequency of exactly one distinct element, the aforementioned process creates $O(n)$ line segments. Moreover, for a vertical line $x = i$ all horizontal line segments intersecting it are sorted according to their frequencies. Note that each horizontal line segment with $x$-interval $[\ell, r]$ corresponds to exactly one distinct element $e$ in $A$, such that $A[\ell] = e$. A top-k query can then be answered by shooting a vertical ray $x = i$ and reporting the first $k$ horizontal line segments that intersect it (recall from above that these line segments are already sorted according to their frequencies) in $O(k)$ time.
Chan et al. [13] first construct a range tree in which each branching node stores one copy of the data structure described above for one-sided range top-k queries and each leaf represents an element from $A$. The top-k data structure at each node is constructed based on the values existing in the subtrees of that node and is meant to answer one-sided range top-k queries. For a one-dimensional array $A$, a range tree can be constructed by dividing $A$ into two halves and recursing on both halves; therefore, each node of the resulting range tree represents a range. This range tree has $O(\log n)$ levels, and level $\ell$ has $2^{\ell}$ nodes. Since at each level $\ell$ all nodes have a total of $n$ elements of $A$ in their subtrees, and since there are $O(\log n)$ levels, the space complexity of this range tree is $O(n \log n)$ words.
Using this structure, a range $\tau$-majority query $A[i..j]$ on $A[0..n-1]$ with $0 \le i \le j \le n-1$ is answered as follows. First, the lowest common ancestor (LCA) of leaf nodes $i$ and $j$ is found in constant time. Note that there exists a data structure requiring $O(n)$ bits of space that is capable of answering LCA queries in $O(1)$ time. [14] Let $z$ denote the LCA of $i$ and $j$. Using $z$, and according to the decomposability of range $\tau$-majority queries (as described above and in [11] ), the two-sided range query $A[i..j]$ can be converted into two one-sided range top-k queries (from $z$ to $i$ and to $j$). These two one-sided range top-k queries return the top-($1/\tau$) most frequent elements in each of their respective ranges in $O(1/\tau)$ time. These frequent elements make up the set of candidates for $\tau$-majorities in $A[i..j]$, in which there are $O(1/\tau)$ candidates, some of which might be false positives. Each candidate is then assessed in constant time using a linear-space data structure (as described in Lemma 3 in [15] ) that is able to determine in $O(1)$ time whether or not a given subrange of an array $A$ contains at least $q$ instances of a particular element $e$.
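The candidate verification step can be illustrated with a simpler substitute for the constant-time structure cited above: storing the sorted positions of every distinct element and counting occurrences in a range with two binary searches. This takes logarithmic rather than constant time per candidate, so it is only a sketch of the idea of verification, not the cited lemma. All names below are illustrative, and ranges are 0-based and inclusive.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_positions(a):
    """Map each distinct element to the sorted list of indices where it occurs."""
    positions = defaultdict(list)
    for idx, value in enumerate(a):
        positions[value].append(idx)
    return positions


def count_in_range(positions, e, i, j):
    """Number of occurrences of e in a[i..j] (0-based, inclusive)."""
    occ = positions.get(e, [])
    return bisect_right(occ, j) - bisect_left(occ, i)


def verify_candidates(positions, candidates, i, j, tau):
    """Keep only the candidates occurring more than tau * (j - i + 1) times."""
    length = j - i + 1
    return {e for e in candidates
            if count_in_range(positions, e, i, j) > tau * length}


a = [1, 2, 1, 3, 1, 2, 2]
pos = build_positions(a)
print(verify_candidates(pos, {1, 2, 3}, 0, 4, 0.5))  # {1}
```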
Gagie et al. [16] proposed a data structure which supports queries such that, given two nodes $u$ and $v$ in a tree, it is able to report the list of elements that have a relative frequency greater than $\tau$ on the path from $u$ to $v$. More formally, let $T$ be a labelled tree in which each node has a label from an alphabet of size $\sigma$. Let $\operatorname{label}(u) \in [1, \ldots, \sigma]$ denote the label of node $u$ in $T$. Let $P_{uv}$ denote the unique path from $u$ to $v$ in $T$, in which middle nodes are listed in the order they are visited. Given $T$ and a fixed threshold $0 < \tau < 1$ (specified during pre-processing), a query $Q(u, v)$ must return the set of all labels that appear more than $\tau |P_{uv}|$ times in $P_{uv}$.
To construct this data structure, first $O(\tau n)$ nodes are marked. This can be done by marking any node that has distance at least $\lceil 1/\tau \rceil$ from the bottom of the tree (height) and whose depth is divisible by $\lceil 1/\tau \rceil$. After doing this, it can be observed that the distance between each node and its nearest marked ancestor is less than $2\lceil 1/\tau \rceil$. For a marked node $x$, $\log(\operatorname{depth}(x))$ different sequences (paths towards the root) are stored,

$$P_i(x) = \left\langle \operatorname{label}(x), \operatorname{par}(x), \operatorname{par}^2(x), \ldots, \operatorname{par}^{2^i}(x) \right\rangle$$

for $0 \le i \le \log(\operatorname{depth}(x))$, where $\operatorname{par}(x)$ returns the label of the direct parent of node $x$. Put another way, for each marked node, the set of all paths with a power of two length (plus one for the node itself) towards the root is stored. Moreover, for each $P_i(x)$, the set of all majority candidates $C_i(x)$ is stored. More specifically, $C_i(x)$ contains the set of all $(\tau/2)$-majorities in $P_i(x)$, that is, labels that appear more than $\frac{\tau}{2}(2^i + 1)$ times in $P_i(x)$. It is easy to see that the set of candidates $C_i(x)$ can have at most $2/\tau$ distinct labels for each $i$. Gagie et al. [16] then note that the set of all $\tau$-majorities in the path from any marked node $x$ to one of its ancestors $z$ is included in some $C_i(x)$ (Lemma 2 in [16] ): since the length of $P_i(x)$ is equal to $2^i + 1$, there exists a $P_i(x)$ for $0 \le i \le \log(\operatorname{depth}(x))$ whose length is between $d_{xz}$ and $2 d_{xz}$, where $d_{xz}$ is the distance between $x$ and $z$. The existence of such a $P_i(x)$ implies that a $\tau$-majority in the path from $x$ to $z$ must be a $(\tau/2)$-majority in $P_i(x)$, and thus must appear in $C_i(x)$. This data structure requires $O(n \log n)$ words of space: as mentioned above, $O(\tau n)$ nodes are marked in the construction phase, and for each marked node $O(\log n)$ candidate sets are stored, each of which contains $O(1/\tau)$ candidates. Therefore, this data structure requires $O\left(\log n \times \frac{1}{\tau} \times \tau n\right) = O(n \log n)$ words of space. Note that each node $x$ also stores $\operatorname{count}(x)$, which is equal to the number of instances of $\operatorname{label}(x)$ on the path from $x$ to the root of $T$; this does not increase the space complexity since it only adds a constant number of words per node.
Each query between two nodes $u$ and $v$ can be answered by using the decomposability property (as explained above) of range $\tau$-majority queries and by breaking the query path between $u$ and $v$ into four subpaths. Let $z$ be the lowest common ancestor of $u$ and $v$, with $x$ and $y$ being the nearest marked ancestors of $u$ and $v$ respectively. The path from $u$ to $v$ is decomposed into the paths from $u$ and $v$ to $x$ and $y$ respectively (the size of these paths is smaller than $2\lceil 1/\tau \rceil$ by definition, so all of their labels are considered as candidates), and the paths from $x$ and $y$ to $z$ (by finding the suitable $C_i(x)$ as explained above and considering all of its labels as candidates). Boundary nodes have to be handled accordingly so that all of these subpaths are disjoint and a set of $O(1/\tau)$ candidates is derived from all of them. Each of these candidates is then verified using a combination of the $\operatorname{labelanc}(x, \ell)$ query, which returns the lowest ancestor of node $x$ that has label $\ell$, and the $\operatorname{count}(x)$ fields of each node. On a $w$-bit RAM and an alphabet of size $\sigma$, the $\operatorname{labelanc}(x, \ell)$ query can be answered in $O(\log \log_w \sigma)$ time with linear space requirements. [17] Therefore, verifying each of the $O(1/\tau)$ candidates in $O(\log \log_w \sigma)$ time results in $O\left(\frac{1}{\tau} \log \log_w \sigma\right)$ total query time for returning the set of all $\tau$-majorities on the path from $u$ to $v$.
All the problems described above have been studied for higher dimensions as well as their dynamic versions. On the other hand, range queries might be extended to other data structures like trees, [8] such as the level ancestor problem. A similar family of problems are orthogonal range queries, also known as counting queries.