Aggregate function


In database management, an aggregate function or aggregation function is a function in which multiple values are processed together to form a single summary statistic.
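For example, in SQL an aggregate function collapses many input rows into one value. A minimal sketch, assuming a hypothetical employees table with a salary column:

```sql
-- One summary statistic (the mean salary) computed from many input rows.
SELECT AVG(salary) AS avg_salary
FROM employees;
```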


(Figure 1) Entity relationship diagram representation of aggregation.

Common aggregate functions include:

    Average (i.e., arithmetic mean)
    Count
    Maximum
    Median
    Minimum
    Mode
    Range
    Sum

Others include:

    Nanmean (mean ignoring NaN values)
    Stddev (standard deviation)

Formally, an aggregate function takes as input a set, a multiset (bag), or a list from some input domain I and outputs an element of an output domain O. [1] The input and output domains may be the same, such as for SUM, or may be different, such as for COUNT.
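As an illustration of these signatures (the Bag notation below is mine, not from the source):

```latex
% COUNT maps bags over any input domain I to the naturals, so its output
% domain differs from its input domain; SUM over the reals stays in the reals.
\mathrm{COUNT}\colon \mathrm{Bag}(I) \to \mathbb{N},
\qquad
\mathrm{SUM}\colon \mathrm{Bag}(\mathbb{R}) \to \mathbb{R}
```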

Aggregate functions occur commonly in numerous programming languages, in spreadsheets, and in relational algebra.

The listagg function, as defined in the SQL:2016 standard, [2] aggregates data from multiple rows into a single concatenated string.
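A minimal sketch of the syntax, assuming a hypothetical employees(dept, name) table (the WITHIN GROUP ordering clause is part of the SQL:2016 syntax):

```sql
-- Concatenate each department's names into a single comma-separated string.
SELECT dept,
       LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name) AS members
FROM employees
GROUP BY dept;
```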

In the entity relationship diagram, aggregation is represented as seen in Figure 1 with a rectangle around the relationship and its entities to indicate that it is being treated as an aggregate entity. [3]

Decomposable aggregate functions

Aggregate functions present a bottleneck, because they potentially require having all input values at once. In distributed computing, it is desirable to divide such computations into smaller pieces, and distribute the work, usually computing in parallel, via a divide and conquer algorithm.

Some aggregate functions can be computed by computing the aggregate for subsets, and then aggregating these aggregates; examples include COUNT, MAX, MIN, and SUM. In other cases the aggregate can be computed by computing auxiliary numbers for subsets, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In still other cases the aggregate cannot be computed without analyzing the entire set at once, though in some cases approximations can be distributed; examples include DISTINCT COUNT (see the count-distinct problem), MEDIAN, and MODE.
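A sketch of the first case in SQL, assuming a hypothetical measurements(shard_id, x) table: each shard produces a partial aggregate, and the partials are then aggregated again.

```sql
-- COUNT and SUM merge by summing the partials;
-- MAX merges by taking the maximum of the partial maxima.
SELECT SUM(part_cnt) AS total_count,
       SUM(part_sum) AS total_sum,
       MAX(part_max) AS total_max
FROM (
    SELECT COUNT(x) AS part_cnt,
           SUM(x)   AS part_sum,
           MAX(x)   AS part_max
    FROM measurements
    GROUP BY shard_id
) AS partials;
```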

Such functions are called decomposable aggregation functions [4] or decomposable aggregate functions. The simplest may be referred to as self-decomposable aggregation functions, which are defined as those functions $f$ such that there is a merge operator $\diamond$ such that

$$f(X \uplus Y) = f(X) \diamond f(Y),$$

where $\uplus$ is the union of multisets (see monoid homomorphism).

For example, SUM:

$\mathrm{SUM}(\{x\}) = x$, for a singleton;
$\mathrm{SUM}(X \uplus Y) = \mathrm{SUM}(X) + \mathrm{SUM}(Y)$, meaning that merge $\diamond$ is simply addition.

COUNT:

$\mathrm{COUNT}(\{x\}) = 1$,
$\mathrm{COUNT}(X \uplus Y) = \mathrm{COUNT}(X) + \mathrm{COUNT}(Y)$.

MAX:

$\mathrm{MAX}(\{x\}) = x$,
$\mathrm{MAX}(X \uplus Y) = \max\bigl(\mathrm{MAX}(X), \mathrm{MAX}(Y)\bigr)$.

MIN:

$\mathrm{MIN}(\{x\}) = x$, [2]
$\mathrm{MIN}(X \uplus Y) = \min\bigl(\mathrm{MIN}(X), \mathrm{MIN}(Y)\bigr)$.

Note that self-decomposable aggregation functions can be combined (formally, taking the product) by applying them separately, so for instance one can compute both the SUM and COUNT at the same time by tracking two numbers.

More generally, one can define a decomposable aggregation function $f$ as one that can be expressed as the composition of a final function $g$ and a self-decomposable aggregation function $h$, $f = g \circ h$. For example, $\mathrm{AVERAGE} = \mathrm{SUM}/\mathrm{COUNT}$ and $\mathrm{RANGE} = \mathrm{MAX} - \mathrm{MIN}$.
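Continuing the sketch from above with the same hypothetical measurements table, the self-decomposable part $h$ is computed per shard and the final function $g$ is applied once to the merged auxiliaries:

```sql
-- Inner query: h, the self-decomposable auxiliaries (SUM, COUNT, MAX, MIN).
-- Outer query: g, the final division (AVERAGE) and subtraction (RANGE).
SELECT SUM(part_sum) / SUM(part_cnt) AS average_x,
       MAX(part_max) - MIN(part_min) AS range_x
FROM (
    SELECT SUM(x)   AS part_sum,
           COUNT(x) AS part_cnt,
           MAX(x)   AS part_max,
           MIN(x)   AS part_min
    FROM measurements
    GROUP BY shard_id
) AS partials;
```

In the MapReduce terminology below, the inner query corresponds to InitialReduce, the outer merge to Combine, and the final division or subtraction to FinalReduce.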

In the MapReduce framework, these steps are known as InitialReduce (value on individual record/singleton set), Combine (binary merge on two aggregations), and FinalReduce (final function on auxiliary values), [5] and moving decomposable aggregation before the Shuffle phase is known as an InitialReduce step. [6]

Decomposable aggregation functions are important in online analytical processing (OLAP), as they allow aggregation queries to be computed on the pre-computed results in the OLAP cube, rather than on the base data. [7] For example, it is easy to support COUNT, MAX, MIN, and SUM in OLAP, since these can be computed for each cell of the OLAP cube and then summarized ("rolled up"), but it is difficult to support MEDIAN, as that must be computed for every view separately.
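A sketch of such a roll-up, assuming a hypothetical pre-aggregated cube table sales_cube(region, month, cnt, total) that stores one COUNT and SUM per (region, month) cell:

```sql
-- Rolling up from (region, month) cells to per-region totals reuses the
-- stored partial aggregates; the base data is never touched.
SELECT region,
       SUM(cnt)   AS cnt,
       SUM(total) AS total
FROM sales_cube
GROUP BY region;
```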

Other decomposable aggregate functions

In order to calculate the average and standard deviation from aggregate data, it is necessary to have available for each group: the total of the values ($\sum x_i = \mathrm{SUM}(x)$), the number of values ($N = \mathrm{COUNT}(x)$), and the total of the squares of the values ($\sum x_i^2 = \mathrm{SUM}(x^2)$). [8]

AVG:

$$\mathrm{AVG}(X) = \frac{\mathrm{SUM}(X)}{\mathrm{COUNT}(X)}$$

or

$$\mathrm{AVG}(X \uplus Y) = \frac{\mathrm{SUM}(X) + \mathrm{SUM}(Y)}{\mathrm{COUNT}(X) + \mathrm{COUNT}(Y)}$$

or, only if $\mathrm{COUNT}(X) = \mathrm{COUNT}(Y)$,

$$\mathrm{AVG}(X \uplus Y) = \frac{\mathrm{AVG}(X) + \mathrm{AVG}(Y)}{2}$$

$\mathrm{SUM}(x^2)$: The sum of the squares of the values is needed in order to calculate the standard deviation of the groups.


STDDEV:
For a finite population with equal probabilities at all points, we have [9] [circular reference]

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N} - \left(\frac{\sum_{i=1}^{N} x_i}{N}\right)^2} = \sqrt{\frac{\mathrm{SUM}(x^2)}{\mathrm{COUNT}(x)} - \left(\frac{\mathrm{SUM}(x)}{\mathrm{COUNT}(x)}\right)^2}$$

This means that the standard deviation is equal to the square root of the difference between the average of the squares of the values and the square of the average value.
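A sketch of this computation, assuming a hypothetical measurements(grp, x) table with numeric x; SQRT and POWER are available in most SQL dialects:

```sql
-- Inner query: per-group auxiliaries n, Σx, Σx².
-- Outer query: AVG and population STDDEV recovered from the merged totals.
SELECT SUM(sx) / SUM(n)                                     AS avg_x,
       SQRT(SUM(sx2) / SUM(n) - POWER(SUM(sx) / SUM(n), 2)) AS stddev_pop
FROM (
    SELECT COUNT(x) AS n,
           SUM(x)   AS sx,
           SUM(x*x) AS sx2
    FROM measurements
    GROUP BY grp
) AS per_group;
```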



References

  1. Jesus, Baquero & Almeida 2011, §2 Problem Definition, p. 3.
  2. Winand, Markus (2017-05-15). "Big News in Databases: New SQL Standard, Cloud Wars, and ACIDRain (Spring 2017)". DZone. Archived from the original on 2017-05-27. Retrieved 2017-06-10. "In December 2016, ISO released a new version of the SQL standard. It introduces new features such as row pattern matching, listagg, date and time formatting, and JSON support."
  3. Elmasri, Ramez; Navathe, Sham (2016). Fundamentals of Database Systems (7th ed.). Hoboken, NJ. p. 133. ISBN 978-0-13-397077-7. OCLC 913842106.
  4. Jesus, Baquero & Almeida 2011, §2.1 Decomposable functions, pp. 3–4.
  5. Yu, Gunda & Isard 2009, §2 Distributed Aggregation, pp. 2–4.
  6. Yu, Gunda & Isard 2009, §2 Distributed Aggregation, p. 1.
  7. Zhang 2017, p. 1.
  8. Ing. Óscar Bonilla, MBA.
  9. Standard deviation § Identities and mathematical properties.

Literature