V-optimal histograms

Last updated

Histograms are most commonly used as visual representations of data. However, Database systems use histograms to summarize data internally and provide size estimates for queries. These histograms are not presented to users or displayed visually, so a wider range of options are available for their construction. Simple or exotic histograms are defined by four parameters, Sort Value, Source Value, Partition Class and Partition Rule. The most basic histogram is the equi-width histogram, where each bucket represents the same range of values. That histogram would be defined as having a Sort Value of Value, a Source Value of Frequency, be in the Serial Partition Class and have a Partition Rule stating that all buckets have the same range.

Contents

V-optimal histograms are an example of a more "exotic" histogram. V-optimality is a Partition Rule which states that the bucket boundaries are to be placed as to minimize the cumulative weighted variance of the buckets. Implementation of this rule is a complex problem and construction of these histograms is also a complex process.

Definition

A v-optimal histogram is based on the concept of minimizing a quantity which is called the weighted variance in this context. [1] This is defined as

where the histogram consists of J bins or buckets, nj is the number of items contained in the jth bin and where Vj is the variance between the values associated with the items in the jth bin.

Examples

The following example will construct a V-optimal histogram having a Sort Value of Value, a Source Value of Frequency, and a Partition Class of Serial. In practice, almost all histograms used in research or commercial products are of the Serial class, meaning that sequential sort values are placed in either the same bucket, or sequential buckets. For example, values 1, 2, 3 and 4 will be in buckets 1 and 2, or buckets 1, 2 and 3, but never in buckets 1 and 3. That will be taken as an assumption in any further discussion.

Take a simple set of data, for example, a list of integers:

1, 3, 4, 7, 2, 8, 3, 6, 3, 6, 8, 2, 1, 6, 3, 5, 3, 4, 7, 2, 6, 7, 2

Compute the value and frequency pairs (1, 2), (2, 4), (3, 5), (4, 2), (5, 1), (6, 4), (7, 3), (8, 2)

Our V-optimal histogram will have two buckets. Since one bucket must end at the data point for 8, we must decide where to put the other bucket boundary. The V-optimality rule states that the cumulative weighted variance of the buckets must be minimized. We will look at two options and compute the cumulative variance of those options.

Option 1: Bucket 1 contains values 1 through 4. Bucket 2 contains values 5 through 8.

Bucket 1:
Average frequency 3.25
Weighted variance 2.28

Bucket 2:
Average frequency 2.5
Weighted variance 2.19

Sum of Weighted Variance 4.47

Option 2: Bucket 1 contains values 1 through 2. Bucket 2 contains values 3 through 8.

Bucket 1:
Average frequency 3
Weighted variance 1.41

Bucket 2:
Average frequency 2.83
Weighted variance 3.29

Sum of Weighted Variance 4.70

The first choice is better, so the histogram that would wind up being stored is:
Bucket 1: Range (1–4), Average Frequency 3.25
Bucket 2: Range (5–8), Average Frequency 2.5

Advantages of V-optimality vs. equi-width or equi-depth

V-optimal histograms do a better job of estimating the bucket contents. A histogram is an estimation of the base data, and any histogram will have errors. The partition rule used in VOptimal histograms attempts to have the smallest variance possible among the buckets, which provides for a smaller error. Research done by Poosala and Ioannidis 1 has demonstrated that the most accurate estimation of data is done with a VOptimal histogram using value as a sort parameter and frequency as a source parameter.

Disadvantages of V-optimality vs. equi-width or equi-depth

While the V-optimal histogram is more accurate, it does have drawbacks. It is a difficult structure to update. Any changes to the source parameter could potentially result in having to re-build the histogram entirely, rather than updating the existing histogram. An equi-width histogram does not have this problem. Equi-depth histograms will experience this issue to some degree, but because the equi-depth construction is simpler, there is a lower cost to maintain it. The difficulty in updating VOptimal histograms is an outgrowth of the difficulty involved in constructing these histograms.

Computing the V-optimal histogram is computationally expensive to compute compared to other types of histograms. [2]

Construction issues

The above example is a simple one. There are only 7 choices of bucket boundaries. One could compute the cumulative variance for all 7 options easily and choose the absolute best placement. However, as the range of values gets larger and the number of buckets gets larger, the set of possible histograms grows exponentially and it becomes a dauntingly complex problem to find the set of boundaries that provide the absolute minimum variance using the naïve approach. Using dynamic programming, it is possible to compute the V-optimal histogram in where N is the number of data points and B is the number buckets. [3]

Since finding the optimal histogram is quadratic, it is common to instead approximate the V-optimal histogram. By creating random solutions, using those as a starting point and improving upon them, one can find a solution that is a fair approximation of the "best" solution. One construction method used to get around this problem is the Iterative Improvement algorithm. Another is Simulated Annealing. The two may be combined in Two Phase Optimization, or 2PO. These algorithms are put forth in "Randomized Algorithms..." (cited below) as a method to optimize queries, but the general idea may be applied to construction of V-optimal Histograms.

Iterative improvement

Iterative Improvement (II) is a fairly naïve greedy algorithm. Starting from a random state, iterative steps in many directions are considered. The step that offers the best improvement of cost (in this case Total Variance) is taken. The process is repeated until one settles at the local minimum, where no further improvement is possible. Applied to the construction of V-optimal histograms, the initial random state would be a set of values representing the bucket boundary placements. The iterative improvement steps would involve moving each boundary until it was at its local minimum, then moving to the next boundary and adjusting it accordingly.

Simulated annealing

A basic explanation of Simulated Annealing is that it is a lot like II, only instead of taking the greedy step each time, it will sometimes accept a step that results in an increase in cost. In theory, SA will be less likely to stop at a very local minimum, and more likely to find a more global one. A useful piece of imagery is an M-shaped graph, representing overall cost on the Y axis. If the initial state is on the V-shaped part of the "M", II will settle into the high valley, the local minimum. Because SA will accept uphill moves, it is more likely to climb up the slope of the "V" and wind up at the foot of the "M", the global minimum.

Two phase optimization

Two phase optimization, or 2PO, combines the II and SA methods. II is run until a local minimum is reached, then SA is run on that solution in an attempt to find less obvious improvements.

Variation

The idea behind V-optimal histograms is to minimize the variance inside each bucket. In considering this, a thought occurs that the variance of any set with one member is 0. This is the idea behind "End-Biased" V-optimal Histograms. The value with the highest frequency is always placed in its own bucket. This ensures that the estimate for that value (which is likely to be the most frequently requested estimate, since it is the most frequent value) will always be accurate and also removes the value most likely to cause a high variance from the data set.

Another thought that might occur is that variance would be reduced if one were to sort by frequency, instead of value. This would naturally tend to place like values next to each other. Such a histogram can be constructed by using a Sort Value of Frequency and a Source Value of Frequency. At this point, however, the buckets must carry additional information indicating what data values are present in the bucket. These histograms have been shown to be less accurate, due to the additional layer of estimation required.

Related Research Articles

<span class="mw-page-title-main">Histogram</span> Graphical representation of the distribution of numerical data

A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values— divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often of equal size.

The following outline is provided as an overview of and topical guide to statistics:

<span class="mw-page-title-main">Minimum spanning tree</span> Least-weight tree connecting graph vertices

A minimum spanning tree (MST) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. That is, it is a spanning tree whose sum of edge weights is as small as possible. More generally, any edge-weighted undirected graph has a minimum spanning forest, which is a union of the minimum spanning trees for its connected components.

In computer science, counting sort is an algorithm for sorting a collection of objects according to keys that are small positive integers; that is, it is an integer sorting algorithm. It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence. Its running time is linear in the number of items and the difference between the maximum key value and the minimum key value, so it is only suitable for direct use in situations where the variation in keys is not significantly greater than the number of items. It is often used as a subroutine in radix sort, another sorting algorithm, which can handle larger keys more efficiently.

<span class="mw-page-title-main">Fractal flame</span>

Fractal flames are a member of the iterated function system class of fractals created by Scott Draves in 1992. Draves' open-source code was later ported into Adobe After Effects graphics software and translated into the Apophysis fractal flame editor.

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

In computer science, a selection algorithm is an algorithm for finding the th smallest value in a collection of ordered values, such as numbers. The value that it finds is called the th order statistic. Selection includes as special cases the problems of finding the minimum, median, and maximum element in the collection. Selection algorithms include quickselect, and the median of medians algorithm. When applied to a collection of values, these algorithms take linear time, as expressed using big O notation. For data that is already structured, faster algorithms may be possible; as an extreme case, selection in an already-sorted array takes time .

<span class="mw-page-title-main">R-tree</span> Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

<span class="mw-page-title-main">Kernel density estimation</span> Estimator

In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.

<span class="mw-page-title-main">Quicksort</span> Divide and conquer sorting algorithm

Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than merge sort and heapsort for randomized data, particularly on larger distributions.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

Flashsort is a distribution sorting algorithm showing linear computational complexity O(n) for uniformly distributed data sets and relatively little additional memory requirement. The original work was published in 1998 by Karl-Dietrich Neubert.

<span class="mw-page-title-main">Maximum cut</span> Problem of finding a maximum cut in a graph

In a graph, a maximum cut is a cut whose size is at least the size of any other cut. That is, it is a partition of the graph's vertices into two complementary sets S and T, such that the number of edges between S and T is as large as possible. Finding such a cut is known as the max-cut problem.

Clustering is the problem of partitioning data points into groups based on their similarity. Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance.

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

References

  1. Poosala at al. (1996)
  2. Mousavi, Hamid; Zaniolo, Carlo (2011-03-21). "Fast and accurate computation of equi-depth histograms over data streams". Proceedings of the 14th International Conference on Extending Database Technology. EDBT/ICDT '11. Uppsala, Sweden: Association for Computing Machinery. pp. 69–80. doi:10.1145/1951365.1951376. ISBN   978-1-4503-0528-0. S2CID   6790310.
  3. Jagadish, H. V.; Jin, Hui; Ooi, Beng Chin; Tan, Kian-Lee (2001-05-01). "Global optimization of histograms". Proceedings of the 2001 ACM SIGMOD international conference on Management of data. SIGMOD '01. Santa Barbara, California, USA: Association for Computing Machinery. pp. 223–234. doi:10.1145/375663.375687. ISBN   978-1-58113-332-5. S2CID   52857204.

Works cited