K q-flats


In data mining and machine learning, the k q-flats algorithm [1] [2] is an iterative method which aims to partition m observations into k clusters, where each cluster is close to a q-flat and q is a given integer.


It is a generalization of the k-means algorithm. In the k-means algorithm, clusters are formed so that each cluster is close to one point, which is a 0-flat. The k q-flats algorithm gives better clustering results than the k-means algorithm for some data sets.

Description

Problem formulation

Given a set A of m observations $a_1, a_2, \dots, a_m$, where each observation is an n-dimensional real vector, the k q-flats algorithm aims to partition the m observation points by generating k q-flats that minimize the sum of the squares of the distances of each observation to its nearest q-flat.

A q-flat is a subset of $\mathbb{R}^n$ that is congruent to $\mathbb{R}^q$. For example, a 0-flat is a point; a 1-flat is a line; a 2-flat is a plane; an $(n-1)$-flat is a hyperplane. A q-flat can be characterized by the solution set of a linear system of equations: $F = \{x \in \mathbb{R}^n \mid W^\top x = \gamma\}$, where $W \in \mathbb{R}^{n \times (n-q)}$ has orthonormal columns and $\gamma \in \mathbb{R}^{n-q}$.

Denote a partition of $\{1, 2, \dots, m\}$ as $S = \{S_1, S_2, \dots, S_k\}$. The problem can be formulated as

$$ \min_{F_1, \dots, F_k,\; S} \; \sum_{l=1}^{k} \sum_{i \in S_l} \big\| P_{F_l}(a_i) - a_i \big\|_2^2, \qquad (\mathrm{P1}) $$

where $P_{F_l}(a_i)$ is the projection of $a_i$ onto $F_l$. Note that $\| P_{F_l}(a_i) - a_i \|_2$ is the distance from $a_i$ to $F_l$.

Algorithm

The algorithm is similar to the k-means algorithm (i.e. Lloyd's algorithm) in that it alternates between cluster assignment and cluster update. Specifically, the algorithm starts with an initial set of q-flats $F_l^{(0)} = \{x \in \mathbb{R}^n \mid (W_l^{(0)})^\top x = \gamma_l^{(0)}\}$, $l = 1, \dots, k$, and proceeds by alternating between the following two steps:

Cluster Assignment
(given the q-flats, assign each point to the closest q-flat): the l-th cluster is updated as
$$ S_l^{(t+1)} = \Big\{ i \;:\; l = \arg\min_{j = 1, \dots, k} \big\| (W_j^{(t)})^\top a_i - \gamma_j^{(t)} \big\| \Big\}. $$
Cluster Update
(given the cluster assignment, update the q-flats): for $l = 1, \dots, k$, let $A^{(l)} \in \mathbb{R}^{m_l \times n}$ be the matrix whose rows are the $a_i$ assigned to cluster $l$. Set $W_l^{(t+1)}$ to be the matrix whose columns are the orthonormal eigenvectors corresponding to the $(n - q)$ smallest eigenvalues of $(A^{(l)})^\top \big( I - \tfrac{1}{m_l} e e^\top \big) A^{(l)}$, and set $\gamma_l^{(t+1)} = \tfrac{1}{m_l} (W_l^{(t+1)})^\top (A^{(l)})^\top e$, where $e$ is the vector of all ones.

Stop whenever the assignments no longer change.
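The following is a minimal NumPy sketch of this alternating scheme. It is not taken from the references; the function name kq_flats, the random initial assignment, and the handling of empty clusters are illustrative choices, and the update step solves the cluster update problem via an eigendecomposition as described above.

import numpy as np

def kq_flats(A, k, q, max_iter=100, seed=0):
    """Minimal sketch of the k q-flats algorithm.

    A : (m, n) array, one observation per row.
    Returns cluster labels and the flats as (W_l, gamma_l) pairs with
    F_l = {x : W_l.T @ x = gamma_l} and W_l having orthonormal columns.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=m)          # random initial assignment

    for _ in range(max_iter):
        flats = []
        for l in range(k):
            A_l = A[labels == l]
            if len(A_l) == 0:                 # keep empty clusters alive
                A_l = A[rng.integers(m, size=1)]
            mean = A_l.mean(axis=0)
            # Eigenvectors of the centered scatter matrix; the (n - q)
            # smallest eigenvalues give the normal directions of the flat.
            scatter = (A_l - mean).T @ (A_l - mean)
            eigvals, eigvecs = np.linalg.eigh(scatter)
            W = eigvecs[:, : n - q]           # orthonormal columns
            gamma = W.T @ mean
            flats.append((W, gamma))

        # Cluster assignment: distance from a_i to flat l is ||W_l^T a_i - gamma_l||.
        dists = np.stack(
            [np.linalg.norm(A @ W - gamma, axis=1) for (W, gamma) in flats],
            axis=1,
        )
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                             # assignments no longer change
        labels = new_labels

    return labels, flats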

The cluster assignment step uses the following fact: given a q-flat $F = \{x \mid W^\top x = \gamma\}$ and a vector $a$, where $W^\top W = I$, the distance from $a$ to the q-flat $F$ is
$$ \operatorname{dist}(a, F)^2 = \min_{x : W^\top x = \gamma} \|x - a\|_2^2 = \big\| W W^\top a - W\gamma \big\|_2^2 = \big\| W^\top a - \gamma \big\|_2^2. $$
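As a quick numerical check of this identity (a sketch, not from the references), one can compare the closed-form distance with a direct projection onto the flat:

import numpy as np

rng = np.random.default_rng(1)
n, q = 5, 2
# Random flat F = {x : W^T x = gamma} with orthonormal W (n - q columns).
W, _ = np.linalg.qr(rng.standard_normal((n, n - q)))
gamma = rng.standard_normal(n - q)
a = rng.standard_normal(n)

# Closed form: dist(a, F) = ||W^T a - gamma||.
closed_form = np.linalg.norm(W.T @ a - gamma)

# Direct projection: x_star = a - W (W^T a - gamma) satisfies W^T x_star = gamma
# and is the closest point of F to a.
x_star = a - W @ (W.T @ a - gamma)
direct = np.linalg.norm(x_star - a)

assert np.allclose(closed_form, direct)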

The key part of this algorithm is the cluster update step, i.e. given m points, how to find a q-flat that minimizes the sum of the squares of the distances of each point to the q-flat. Mathematically, this problem is: given $A \in \mathbb{R}^{m \times n}$, solve the quadratic optimization problem

$$ \min_{W, \gamma} \; \big\| A W - e \gamma^\top \big\|_F^2 \quad \text{subject to } W^\top W = I, \qquad (\mathrm{P2}) $$

where $A$ is given, $W \in \mathbb{R}^{n \times (n-q)}$, $\gamma \in \mathbb{R}^{n-q}$, and $e \in \mathbb{R}^m$ is the vector of all ones.

The problem can be solved using the method of Lagrange multipliers, and the solution is as given in the cluster update step.

It can be shown that the algorithm will terminate in a finite number of iterations (no more than the total number of possible assignments, which is bounded by $k^m$). In addition, the algorithm will terminate at a point where the overall objective cannot be decreased either by a different assignment or by defining new cluster planes for these clusters (such a point is called "locally optimal" in the references).

This convergence result is a consequence of the fact that problem (P2) can be solved exactly. The same convergence result holds for the k-means algorithm because its cluster update problem can also be solved exactly.

Relation to other machine learning methods

k-means algorithm

The k q-flats algorithm is a generalization of the k-means algorithm. In fact, the k-means algorithm is the k 0-flats algorithm, since a point is a 0-flat. Despite their connection, they should be used in different scenarios. The k q-flats algorithm is suited to data that lie in a union of a few low-dimensional subspaces, whereas the k-means algorithm is preferable when the clusters are full-dimensional point clouds around their centers. For example, if all observations lie on two lines, the k q-flats algorithm with $q = 1$ may be used; if the observations are two Gaussian clouds, the k-means algorithm may be used. Both cases are illustrated in the sketch below.
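As a synthetic illustration (not from the references, and reusing the hypothetical kq_flats function from the algorithm sketch above), data sampled from two lines is clustered well by k 1-flats, while two Gaussian clouds are the natural setting for k-means:

import numpy as np

rng = np.random.default_rng(2)

# Two lines through the origin: points t*d for random scalars t and fixed directions d.
d1, d2 = np.array([1.0, 0.2]), np.array([-0.3, 1.0])
line_data = np.vstack([
    rng.standard_normal((100, 1)) * d1,
    rng.standard_normal((100, 1)) * d2,
]) + 0.01 * rng.standard_normal((200, 2))      # small noise

labels, flats = kq_flats(line_data, k=2, q=1)   # k 1-flats: fit two lines

# Two Gaussian clouds around distinct centers: here k-means is the natural model.
cloud_data = np.vstack([
    rng.standard_normal((100, 2)) + np.array([5.0, 0.0]),
    rng.standard_normal((100, 2)) + np.array([-5.0, 0.0]),
])
labels0, flats0 = kq_flats(cloud_data, k=2, q=0)  # with q = 0 this reduces to k-means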

Sparse Dictionary Learning

Natural signals lie in a high-dimensional space. For example, the dimension of a 1024-by-1024 image is about $10^6$, which is far too high for most signal processing algorithms. One way to deal with the high dimensionality is to find a set of basis functions such that the high-dimensional signal can be represented by only a few of them. In other words, the coefficients of the signal representation lie in a low-dimensional space, in which it is easier to apply signal processing algorithms. In the literature, the wavelet transform is usually used in image processing, and the Fourier transform is usually used in audio processing. The set of basis functions is usually called a dictionary.

However, it is not clear what the best dictionary is for a given signal data set. One popular approach is to learn a dictionary from the data set using the idea of sparse dictionary learning, which aims to find a dictionary such that each signal can be sparsely represented over it. With the data collected as the columns of a matrix $X$, the optimization problem can be written as

$$ \min_{B, R} \; \| X - B R \|_F^2 \quad \text{subject to } \; \| R_i \|_0 \le q \ \text{ for every column } R_i \text{ of } R, $$

where $B$ is the dictionary, $R$ is the matrix of representation coefficients, and $\| \cdot \|_0$ counts the number of nonzero entries.
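For comparison, a hedged sketch of sparse dictionary learning using scikit-learn's DictionaryLearning (an off-the-shelf solver for a closely related formulation: its fitting stage uses an l1 penalty rather than the l0 constraint above, and all parameter values here are illustrative):

import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 20))             # 200 toy signals of dimension 20

q = 3                                          # target sparsity per signal
dico = DictionaryLearning(
    n_components=40,                           # number of dictionary atoms (illustrative)
    transform_algorithm="omp",                 # sparse coding by orthogonal matching pursuit
    transform_n_nonzero_coefs=q,               # at most q nonzero coefficients per signal
    max_iter=20,
    random_state=0,
)
R = dico.fit_transform(X)                      # sparse codes, shape (200, 40)
B = dico.components_                           # learned dictionary, shape (40, 20)

# scikit-learn stores signals as rows, so the model is X ~ R @ B,
# the transpose of the column convention X ~ B R used above.
reconstruction_error = np.linalg.norm(X - R @ B)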

The idea of the k q-flats algorithm is similar in nature to sparse dictionary learning. If we restrict each q-flat to be a q-dimensional subspace, then the k q-flats algorithm simply finds the closest q-dimensional subspace to a given signal. Sparse dictionary learning does the same thing, except for an additional constraint on the sparsity of the representation. Mathematically, it is possible to show that the k q-flats algorithm is of the form of sparse dictionary learning with an additional block structure on R.

Let $B_l$ be an $n \times q$ matrix whose columns form an orthonormal basis of the l-th flat. Then the projection of a signal $x$ onto the l-th flat is $B_l r_l$, where $r_l \in \mathbb{R}^q$ is a coefficient vector. Let $B = [B_1, B_2, \dots, B_k]$ denote the concatenation of the bases of the k flats. It is easy to show that the k q-flats algorithm is the same as the following:

$$ \min_{B, R} \; \| X - B R \|_F^2 \quad \text{subject to } \; \| R_i \|_0 \le q \ \text{ and } R \text{ has a block structure}. $$

The block structure of R refers to the fact that each signal is assigned to, and represented by, only one flat. Comparing the two formulations, the k q-flats problem is the same as sparse dictionary modeling with the dictionary $B = [B_1, \dots, B_k]$ and sparsity level q, together with the additional block structure on R. Readers may refer to Szlam's paper [3] for more discussion about the relationship between the two concepts.

Applications and variations

Classification

Classification is a procedure that assigns an input signal to one of several classes. One example is to classify an email as spam or non-spam. Classification algorithms usually require a supervised learning stage. In the supervised learning stage, training data for each class are used by the algorithm to learn the characteristics of the class. In the classification stage, a new observation is classified into a class using the characteristics that were already learned.

The k q-flats algorithm can be used for classification. Suppose there are a total of m classes. For each class, k flats are trained a priori on a training data set. When a new data point arrives, find the flat that is closest to it and assign it to the class of that closest flat. A minimal sketch of this procedure is given below.
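The following sketch is not from the references; it reuses the hypothetical kq_flats function from the algorithm sketch above, and all names are illustrative.

import numpy as np

def train_flat_classifier(class_data, k, q):
    """Fit k q-flats per class.  class_data maps a class label to an
    (m_c, n) array of training observations for that class."""
    return {c: kq_flats(A_c, k, q)[1] for c, A_c in class_data.items()}

def classify(x, class_flats):
    """Assign x to the class owning the closest flat, using
    dist(x, F) = ||W^T x - gamma||."""
    best_class, best_dist = None, np.inf
    for c, flats in class_flats.items():
        for W, gamma in flats:
            d = np.linalg.norm(W.T @ x - gamma)
            if d < best_dist:
                best_class, best_dist = c, d
    return best_class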

However, the classification performance can be further improved if we impose some structure on the flats. One possible choice is to require flats from different classes to be sufficiently far apart. Some researchers [4] use this idea to develop a discriminative k q-flats algorithm.

K-metrics [3]

In the k q-flats algorithm, $\| x - P_F(x) \|^2$ is used to measure the representation error, where $P_F(x)$ denotes the projection of x onto the flat F. If the data lie in a q-dimensional flat, then a single q-flat can represent the data very well. On the contrary, if the data lie in a very high-dimensional space but near a common center, then the k-means algorithm is a better way than the k q-flats algorithm to represent the data, because the k-means algorithm uses $\| x - \mu \|^2$ to measure the error, where $\mu$ denotes the cluster center. K-metrics is a generalization that uses both the idea of a flat and that of a mean. In k-metrics, the error is measured by the following Mahalanobis metric:

$$ \| x - \mu \|_A^2 = (x - \mu)^\top A \, (x - \mu), $$

where A is a positive semi-definite matrix.

If A is the identity matrix, then the Mahalanobis metric is exactly the error measure used in k-means. If A is not the identity matrix, then $\| x - \mu \|_A^2$ will favor certain directions, in the same spirit as the k q-flats error measure. A small numerical sketch of this metric follows.
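This sketch is illustrative only: it checks that the Mahalanobis error reduces to the k-means error when A is the identity, and shows how a non-identity A weights directions differently.

import numpy as np

def mahalanobis_error(x, mu, A):
    """Error ||x - mu||_A^2 = (x - mu)^T A (x - mu) for positive semi-definite A."""
    d = x - mu
    return d @ A @ d

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])

identity = np.eye(2)
assert np.isclose(mahalanobis_error(x, mu, identity), np.sum((x - mu) ** 2))

# A non-identity PSD matrix penalizes deviations along the second coordinate more,
# i.e. it favors the first coordinate direction.
A = np.diag([1.0, 10.0])
print(mahalanobis_error(x, mu, A))   # 1*1 + 10*4 = 41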


References

  1. Bradley, P.S.; Mangasarian, O.L. (2000). "k-Plane Clustering". Journal of Global Optimization. 16 (1): 23–32. doi:10.1023/A:1008324625522. S2CID 913034.
  2. Tseng, P. (2000). "Nearest q-Flat to m Points". Journal of Optimization Theory and Applications. 105 (1): 249–252. doi:10.1023/A:1004678431677. ISSN 0022-3239. S2CID 118142932.
  3. Szlam, Arthur; Sapiro, Guillermo (2009-06-14). "Discriminative k-metrics". In Bottou, Léon; Littman, Michael (eds.). Proceedings of the 26th Annual International Conference on Machine Learning. ACM. pp. 1009–1016. doi:10.1145/1553374.1553503. hdl:11299/180116. ISBN 978-1-60558-516-1. S2CID 2509292.
  4. Szlam, A.; Sapiro, G. (2008). "Supervised Learning via Discriminative k q-Flats" (PDF).