In statistics, the medcouple is a robust statistic that measures the skewness of a univariate distribution. [1] It is defined as a scaled median difference between the left and right half of a distribution. Its robustness makes it suitable for identifying outliers in adjusted boxplots. [2] [3] Ordinary box plots do not fare well with skew distributions, since they label the longer unsymmetrical tails as outliers. Using the medcouple, the whiskers of a boxplot can be adjusted for skew distributions and thus have a more accurate identification of outliers for non-symmetrical distributions.
As a kind of order statistic, the medcouple belongs to the class of incomplete generalised L-statistics. [1] Like the ordinary median or mean, the medcouple is a nonparametric statistic, thus it can be computed for any distribution.
The following description uses zero-based indexing in order to harmonise with the indexing in many programming languages.
Let be an ordered sample of size , and let be the median of . Define the sets
of sizes and respectively. For and , we define the kernel function
where is the sign function.
The medcouple is then the median of the set [1] : 998
In other words, we split the distribution into all values greater or equal to the median and all values less than or equal to the median. We define a kernel function whose first variable is over the greater values and whose second variable is over the lesser values. For the special case of values tied to the median, we define the kernel by the signum function. The medcouple is then the median over all values of .
Since the medcouple is not a median applied to all couples, but only to those for which , it belongs to the class of incomplete generalised L-statistics. [1] : 998
The medcouple has a number of desirable properties. A few of them are directly inherited from the kernel function.
We make the following observations about the kernel function :
These properties are in turn inherited by the medcouple. Thus, the medcouple is independent of the mean and standard deviation of a distribution, a desirable property for measuring skewness. For ease of computation, these properties enable us to define the two sets
where . This makes the set have range of at most 1, median 0, and keep the same medcouple as .
For , the medcouple kernel reduces to
Using the recentred and rescaled set we can observe the following.
With properties 1, 2, and 4, we can thus define the following matrix,
If we sort the sets and in decreasing order, then the matrix has sorted rows and sorted columns, [1] : 1006
The medcouple is then the median of this matrix with sorted rows and sorted columns. The fact that the rows and columns are sorted allows the implementation of a fast algorithm for computing the medcouple.
The breakdown point is the number of values that a statistic can resist before it becomes meaningless, i.e. the number of arbitrarily large outliers that the data set may have before the value of the statistic is affected. For the medcouple, the breakdown point is 25%, since it is a median taken over the couples such that . [1] : 1002
Like all measures of skewness, the medcouple is positive for distributions that are skewed to the right, negative for distributions skewed to the left, and zero for symmetrical distributions. In addition, the values of the medcouple are bounded by 1 in absolute value. [1] : 998
Before presenting medcouple algorithms, we recall that there exist algorithms for the finding the median. Since the medcouple is a median, ordinary algorithms for median-finding are important.
The naïve algorithm for computing the medcouple is slow. [1] : 1005 It proceeds in two steps. First, it constructs the medcouple matrix which contains all of the possible values of the medcouple kernel. In the second step, it finds the median of this matrix. Since there are entries in the matrix in the case when all elements of the data set are unique, the algorithmic complexity of the naïve algorithm is .
More concretely, the naïve algorithm proceeds as follows. Recall that we are using zero-based indexing.
function naïve_medcouple(vector X): // X is a vector of size n.// Sorting in decreasing order can be done in-place in O(n log n) time sort_decreasing(X) xm := median(X) xscale := 2 * max(abs(X)) // Define the upper and lower centred and rescaled vectors// they inherit X's own decreasing sorting Zplus := [(x - xm)/xscale | x in X such that x >= xm] Zminus := [(x - xm)/xscale | x in X such that x <= xm] p := size(Zplus) q := size(Zminus) // Define the kernel function closing over Zplus and Zminusfunction h(i, j): a := Zplus[i] b := Zminus[j] if a == b: return signum(p - 1 - i - j) else: return (a + b) / (a - b) endifendfunction// O(n^2) operations necessary to form this vector H := [h(i, j) | i in [0, 1, ..., p - 1] and j in [0, 1, ..., q - 1]] return median(H) endfunction
The final call to median
on a vector of size can be done itself in operations, hence the entire naïve medcouple algorithm is of the same complexity.
The fast algorithm outperforms the naïve algorithm by exploiting the sorted nature of the medcouple matrix . Instead of computing all entries of the matrix, the fast algorithm uses the Kth pair algorithm of Johnson & Mizoguchi. [4]
The first stage of the fast algorithm proceeds as the naïve algorithm. We first compute the necessary ingredients for the kernel matrix, , with sorted rows and sorted columns in decreasing order. Rather than computing all values of , we instead exploit the monotonicity in rows and columns, via the following observations.
First, we note that we can compare any with all values of in time. [4] : 150 For example, for determining all and such that , we have the following function:
functiongreater_h(kernelh,intp,intq,realu):// h is the kernel function, h(i,j) gives the ith, jth entry of H// p and q are the number of rows and columns of the kernel matrix H// vector of size pP:=vector(p)// indexing from zeroj:=0// starting from the bottom, compute the [[supremum|least upper bound]] for each rowfori:=p-1,p-2,...,1,0:// search this row until we find a value less than uwhilej<qandh(i,j)>u:j:=j+1endwhile// the entry preceding the one we just found is greater than uP[i]:=j-1endforreturnPendfunction
This greater_h
function is traversing the kernel matrix from the bottom left to the top right, and returns a vector of indices that indicate for each row where the boundary lies between values greater than and those less than or equal to . This method works because of the row-column sorted property of . Since greater_h
computes at most values of , its complexity is . [4] : 150
Conceptually, the resulting vector can be visualised as establishing a boundary on the matrix as suggested by the following diagram, where the red entries are all larger than :
The symmetric algorithm for computing the values of less than is very similar. It instead proceeds along in the opposite direction, from the top right to the bottom left:
functionless_h(kernelh,intp,intq,realu):// vector of size pQ:=vector(p)// last possible row indexj:=q-1// starting from the top, compute the [[infimum|greatest lower bound]] for each rowfori:=0,1,...,p-2,p-1:// search this row until we find a value greater than uwhilej>=0andh(i,j)<u:j:=j-1endwhile// the entry following the one we just found is less than uQ[i]:=j+1endforreturnQendfunction
This lower boundary can be visualised like so, where the blue entries are smaller than :
For each , we have that , with strict inequality occurring only for those rows that have values equal to .
We also have that the sums
give, respectively, the number of elements of that are greater than , and the number of elements that are greater than or equal to . Thus this method also yields the rank of within the elements of . [4] : 149
The second observation is that we can use the sorted matrix structure to instantly compare any element to at least half of the entries in the matrix. For example, the median of the row medians across the entire matrix is less than the upper left quadrant in red, but greater than the lower right quadrant in blue:
More generally, using the boundaries given by the and vectors from the previous section, we can assume that after some iterations, we have pinpointed the position of the medcouple to lie between the red left boundary and the blue right boundary: [4] : 149
The yellow entries indicate the median of each row. If we mentally re-arrange the rows so that the medians align and ignore the discarded entries outside the boundaries,
we can select a weighted median of these medians, each entry weighted by the number of remaining entries on this row. This ensures that we can discard at least 1/4 of all remaining values no matter if we have to discard the larger values in red or the smaller values in blue:
Each row median can be computed in time, since the rows are sorted, and the weighted median can be computed in time, using a binary search. [4] : 148
Putting together these two observations, the fast medcouple algorithm proceeds broadly as follows. [4] : 148
The initial sorting in order to form the function takes time. At each iteration, the weighted median takes time, as well as the computations of the new tentative and left and right boundaries. Since each iteration discards at least 1/4 of all remaining entries, there will be at most iterations. [4] : 150 Thus, the whole fast algorithm takes time. [4] : 150
Let us restate the fast algorithm in more detail.
function medcouple(vector X): // X is a vector of size n// Compute initial ingredients as for the naïve medcouple sort_decreasing(X) xm := median(X) xscale := 2 * max(abs(X)) Zplus := [(x - xm)/xscale | x in X such that x >= xm] Zminus := [(x - xm)/xscale | x in X such that x <= xm] p := size(Zplus) q := size(Zminus) function h(i, j): a := Zplus[i] b := Zminus[j] if a == b: return signum(p - 1 - i - j) else: return (a + b) / (a - b) endifendfunction// Begin Kth pair algorithm (Johnson & Mizoguchi)// The initial left and right boundaries, two vectors of size p L := [0, 0, ..., 0] R := [q - 1, q - 1, ..., q - 1] // number of entries to the left of the left boundary Ltotal := 0 // number of entries to the left of the right boundary Rtotal := p*q // Since we are indexing from zero, the medcouple index is one// less than its rank. medcouple_index := floor(Rtotal / 2) // Iterate while the number of entries between the boundaries is// greater than the number of rows in the matrix.while Rtotal - Ltotal > p: // Compute row medians and their associated weights, but skip// any rows that are already empty. middle_idx := [i | i in [0, 1, ..., p - 1] suchthat L[i] <= R[i]] row_medians := [h(i, floor((L[i] + R[i])/2) | i in middle_idx] weights := [R[i] - L[i] + 1 | i in middle_idx] WM := weighted median(row_medians, weights) // New tentative right and left boundaries P := greater_h(h, p, q, WM) Q := less_h(h, p, q, WM) Ptotal := sum(P) + size(P) Qtotal := sum(Q) // Determine which entries to discard, or if we've found the medcoupleif medcouple_index <= Ptotal - 1: R := P Rtotal := Ptotal else: if medcouple_index > Qtotal - 1: L := Q Ltotal := Qtotal else: // Found the medcouple, rank of the weighted median equals medcouple index return WM endifendifendwhile // Did not find the medcouple, but there are very few tentative entries remaining remaining := [h(i, j) | i in [0, 1, ..., p - 1], j in [L[i], L[i] + 1, ..., R[i]] suchthat L[i] <= R[i] ] // Select the medcouple by rank amongst the remaining entries medcouple := select_nth(remaining, medcouple_index - Ltotal) return medcouple endfunction
In real-world use, the algorithm also needs to account for errors arising from finite-precision floating point arithmetic. For example, the comparisons for the medcouple kernel function should be done within machine epsilon, as well as the order comparisons in the greater_h
and less_h
functions.
In mathematics, a symmetric matrix with real entries is positive-definite if the real number is positive for every nonzero real column vector where is the transpose of . More generally, a Hermitian matrix is positive-definite if the real number is positive for every nonzero complex column vector where denotes the conjugate transpose of
In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues SVMs are one of the most robust prediction methods, being based on statistical learning frameworks or VC theory proposed by Vapnik and Chervonenkis (1974). Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
In mathematics, and in particular linear algebra, the Moore–Penrose inverse of a matrix is the most widely known generalization of the inverse matrix. It was independently described by E. H. Moore in 1920, Arne Bjerhammar in 1951, and Roger Penrose in 1955. Earlier, Erik Ivar Fredholm had introduced the concept of a pseudoinverse of integral operators in 1903. When referring to a matrix, the term pseudoinverse, without further specification, is often used to indicate the Moore–Penrose inverse. The term generalized inverse is sometimes used as a synonym for pseudoinverse.
In mathematical optimization, Dantzig's simplex algorithm is a popular algorithm for linear programming.
In linear algebra, a Hankel matrix, named after Hermann Hankel, is a square matrix in which each ascending skew-diagonal from left to right is constant, e.g.:
In linear algebra and functional analysis, a projection is a linear transformation from a vector space to itself such that . That is, whenever is applied twice to any vector, it gives the same result as if it were applied once. It leaves its image unchanged. This definition of "projection" formalizes and generalizes the idea of graphical projection. One can also consider the effect of a projection on a geometrical object by examining the effect of the projection on points in the object.
In mathematics, a unimodular matrixM is a square integer matrix having determinant +1 or −1. Equivalently, it is an integer matrix that is invertible over the integers: there is an integer matrix N that is its inverse. Thus every equation Mx = b, where M and b both have integer components and M is unimodular, has an integer solution. The n × n unimodular matrices form a group called the n × n general linear group over , which is denoted .
In linear algebra, the Gram matrix of a set of vectors in an inner product space is the Hermitian matrix of inner products, whose entries are given by the inner product . If the vectors are the columns of matrix then the Gram matrix is in the general case that the vector coordinates are complex numbers, which simplifies to for the case that the vector coordinates are real numbers.
In mathematics, the kernel of a linear map, also known as the null space or nullspace, is the linear subspace of the domain of the map which is mapped to the zero vector. That is, given a linear map L : V → W between two vector spaces V and W, the kernel of L is the vector space of all elements v of V such that L(v) = 0, where 0 denotes the zero vector in W, or more symbolically:
A continuous-time Markov chain (CTMC) is a continuous stochastic process in which, for each state, the process will change state according to an exponential random variable and then move to a different state as specified by the probabilities of a stochastic matrix. An equivalent formulation describes the process as changing state according to the least value of a set of exponential random variables, one for each possible state it can move to, with the parameters determined by the current state.
The Lanczos algorithm is an iterative method devised by Cornelius Lanczos that is an adaptation of power methods to find the "most useful" eigenvalues and eigenvectors of an Hermitian matrix, where is often but not necessarily much smaller than . Although computationally efficient in principle, the method as initially formulated was not useful, due to its numerical instability.
In linear algebra, eigendecomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Only diagonalizable matrices can be factorized in this way. When the matrix being factorized is a normal or real symmetric matrix, the decomposition is called "spectral decomposition", derived from the spectral theorem.
The block Wiedemann algorithm for computing kernel vectors of a matrix over a finite field is a generalization by Don Coppersmith of an algorithm due to Doug Wiedemann.
In the mathematical field of linear algebra, an arrowhead matrix is a square matrix containing zeros in all entries except for the first row, first column, and main diagonal, these entries can be any number. In other words, the matrix has the form
The iterative proportional fitting procedure is the operation of finding the fitted matrix which is the closest to an initial matrix but with the row and column totals of a target matrix . The fitted matrix being of the form , where and are diagonal matrices such that has the margins of . Some algorithms can be chosen to perform biproportion. We have also the entropy maximization, information loss minimization or RAS which consists of factoring the matrix rows to match the specified row totals, then factoring its columns to match the specified column totals; each step usually disturbs the previous step’s match, so these steps are repeated in cycles, re-adjusting the rows and columns in turn, until all specified marginal totals are satisfactorily approximated. However, all algorithms give the same solution. In three- or more-dimensional cases, adjustment steps are applied for the marginals of each dimension in turn, the steps likewise repeated in cycles.
DNA code construction refers to the application of coding theory to the design of nucleic acid systems for the field of DNA–based computation.
In mathematics, low-rank approximation is a minimization problem, in which the cost function measures the fit between a given matrix and an approximating matrix, subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure.
In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over
Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.