Count sketch

Last updated September 26, 2024

Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms.^[1]^[2] It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton^[3] in an effort to speed up the AMS Sketch by Alon, Matias and Szegedy for approximating the frequency moments of streams^[4] (these calculations require counting of the number of occurrences for the distinct elements of the stream).

The sketch is nearly identical^[citation needed] to the feature hashing algorithm by John Moody,^[5] but differs in its use of hash functions with low dependence, which makes it more practical. To retain a high probability of success, multiple count sketches are aggregated with the median trick rather than the mean.

These properties allow count sketch to be used for explicit kernel methods and bilinear pooling in neural networks, and make it a cornerstone of many numerical linear algebra algorithms.^[6]

Intuitive explanation

The inventors of this data structure offer the following iterative explanation of its operation:^[3]

at the simplest level, the output of a single hash function $s$ mapping stream elements $q$ into {+1, -1} feeds a single up/down counter $C$ . After a single pass over the data, the frequency $n(q)$ of a stream element $q$ can be approximated, although extremely poorly, by the value $C\cdot s(q)$ , whose expected value ${\mathbf {E}}[C\cdot s(q)]$ equals $n(q)$ ;
a straightforward way to reduce the variance of the previous estimate is to use an array of different hash functions $s_{i}$ , each connected to its own counter $C_{i}$ . For each element $q$ , the identity ${\mathbf {E}}[C_{i}\cdot s_{i}(q)]=n(q)$ still holds, so averaging across the $i$ range will tighten the approximation;
the previous construct still has a major deficiency: if a lower-frequency-but-still-important output element $a$ exhibits a hash collision with a high-frequency element, the estimate of $n(a)$ can be significantly affected. Avoiding this requires reducing how often any two distinct elements update the same counter. This is achieved by replacing each $C_{i}$ in the previous construct with an array of $m$ counters (making the counter set into a two-dimensional matrix $C_{i,j}$ ), with the index $j$ of the particular counter to be incremented/decremented selected via another set of hash functions $h_{i}$ that map element $q$ into the range $\{1,\dots ,m\}$ . Since ${\mathbf {E}}[C_{i,h_{i}(q)}\cdot s_{i}(q)]=n(q)$ , averaging across all values of $i$ will work.
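
The first two steps of this explanation can be illustrated with a minimal Python sketch. It is not taken from the cited paper: the helper `sign`, which derives a sign hash $s_{i}(q)$ from BLAKE2b, and the function names are assumptions made for illustration, and the hash is only a stand-in for a properly chosen low-dependence hash family.

```python
import hashlib


def sign(i, q):
    """Hypothetical sign hash s_i(q) in {+1, -1} (illustrative stand-in only)."""
    digest = hashlib.blake2b(f"{i}:{q}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def single_counter_estimate(stream, q):
    """Step 1: one up/down counter C driven by one sign hash."""
    C = 0
    for x in stream:
        C += sign(0, x)
    return C * sign(0, q)  # unbiased for n(q), but with high variance


def averaged_estimate(stream, q, k):
    """Step 2: k independent counters C_i, estimates averaged over i."""
    total = 0
    for i in range(k):
        C_i = sum(sign(i, x) for x in stream)
        total += C_i * sign(i, q)
    return total / k


stream = ["a"] * 100 + ["b"] * 3
estimate_a = averaged_estimate(stream, "a", k=9)
```

With only the two elements `"a"` and `"b"` in the stream, each per-counter estimate of `"a"` is either 97 or 103, so the average lies between those values. The third step, which adds the bucket hashes $h_{i}$ , corresponds to the formal definition in the next section.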

Mathematical definition

1. For constants $w$ and $t$ (to be defined later) independently choose $d=2t+1$ random hash functions $h_{1},\dots ,h_{d}$ and $s_{1},\dots ,s_{d}$ such that $h_{i}:[n]\to [w]$ and $s_{i}:[n]\to \{\pm 1\}$ . It is necessary that the hash families from which $h_{i}$ and $s_{i}$ are chosen be pairwise independent.

2. For each item $q_{k}$ in the stream, add $s_{i}(q_{k})$ to the $h_{i}(q_{k})$ th bucket of the $i$ th hash, for every $i\in [d]$ .

At the end of this process, one has $wd$ sums $(C_{i,j})$ where

C_{i,j}=\sum _{k:\,h_{i}(q_{k})=j}s_{i}(q_{k}).

To estimate the count of $q$ , one computes the following value:

r_{q}={\text{median}}_{i=1}^{d}\,s_{i}(q)\cdot C_{i,h_{i}(q)}.

The values $s_{i}(q)\cdot C_{i,h_{i}(q)}$ are unbiased estimates of how many times $q$ has appeared in the stream.

The estimate $r_{q}$ has variance $O(\mathrm {min} \{m_{1}^{2}/w^{2},m_{2}^{2}/w\})$ , where $m_{1}$ is the length of the stream and $m_{2}^{2}$ is $\sum _{q}(\sum _{k}[q_{k}=q])^{2}$ .^[7]

Furthermore, with probability $1-e^{-O(t)}$ , the estimate $r_{q}$ differs from the true value by at most $2m_{2}/{\sqrt {w}}$ .
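
The definition above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the class name `CountSketch`, the prime `P`, and the choice of the hash family $x\mapsto (ax+b){\bmod {P}}$ are assumptions. That family is pairwise independent over the integers modulo `P`; the final reduction to a bucket index or a sign makes it only approximately uniform. Items are assumed to be non-negative integers smaller than `P`.

```python
import random
from statistics import median

P = 2_147_483_647  # Mersenne prime 2^31 - 1; item ids are assumed to be below P


class CountSketch:
    def __init__(self, w, t, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.d = 2 * t + 1  # odd number of rows, so the median is a row estimate
        # One (a, b) pair per row for the bucket hash h_i and the sign hash s_i.
        self.h_params = [(rng.randrange(P), rng.randrange(P)) for _ in range(self.d)]
        self.s_params = [(rng.randrange(P), rng.randrange(P)) for _ in range(self.d)]
        self.C = [[0] * w for _ in range(self.d)]

    def _h(self, i, q):
        a, b = self.h_params[i]
        return ((a * q + b) % P) % self.w  # bucket index in 0..w-1

    def _s(self, i, q):
        a, b = self.s_params[i]
        return 1 if ((a * q + b) % P) & 1 else -1

    def add(self, q, count=1):
        """Process `count` occurrences of item q (step 2 of the definition)."""
        for i in range(self.d):
            self.C[i][self._h(i, q)] += self._s(i, q) * count

    def estimate(self, q):
        """Return r_q, the median over rows of s_i(q) * C[i][h_i(q)]."""
        return median(self._s(i, q) * self.C[i][self._h(i, q)] for i in range(self.d))
```

For example, after `cs = CountSketch(w=16, t=2)` and seven calls to `cs.add(42)`, the call `cs.estimate(42)` returns exactly 7, because no other item is present to collide with it. With more distinct items, the estimate is approximate, with the error behaviour described above.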

Vector formulation

Alternatively, count sketch can be seen as a linear mapping with a non-linear reconstruction function. Let $M^{(i)}\in \{-1,0,1\}^{w\times n}$ , for $i\in [d]$ , be a collection of $d=2t+1$ matrices, defined by

M_{h_{i}(j),j}^{(i)}=s_{i}(j)

for $j\in [n]$ and 0 everywhere else.

Then a vector $v\in \mathbb {R} ^{n}$ is sketched by $C^{(i)}=M^{(i)}v\in \mathbb {R} ^{w}$ . To reconstruct $v$ we take $v_{j}^{*}={\text{median}}_{i}\,C_{h_{i}(j)}^{(i)}s_{i}(j)$ . This gives the same guarantees as stated above, if we take $m_{1}=\|v\|_{1}$ and $m_{2}=\|v\|_{2}$ .
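
The matrix view can be made concrete with a small Python sketch. The function names are hypothetical, indices are 0-based, and the hash functions are stored as explicit fully random tables, which is a stronger assumption than the pairwise independence required above and is chosen only to keep the example short.

```python
import random
from statistics import median


def make_hashes(n, w, d, seed=0):
    """Draw d bucket maps h_i: [n] -> [w] and d sign maps s_i: [n] -> {-1, +1}."""
    rng = random.Random(seed)
    h = [[rng.randrange(w) for _ in range(n)] for _ in range(d)]
    s = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(d)]
    return h, s


def sketch_matrix(h_i, s_i, w):
    """Build the w-by-n matrix M with M[h_i(j)][j] = s_i(j) and 0 elsewhere."""
    n = len(h_i)
    M = [[0] * n for _ in range(w)]
    for j in range(n):
        M[h_i[j]][j] = s_i[j]
    return M


def matvec(M, v):
    """Plain matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def reconstruct(sketches, h, s, j):
    """Estimate v_j as the median over i of C^(i)[h_i(j)] * s_i(j)."""
    return median(C[h_i[j]] * s_i[j] for C, h_i, s_i in zip(sketches, h, s))


n, w, d = 8, 4, 3
h, s = make_hashes(n, w, d)
matrices = [sketch_matrix(h_i, s_i, w) for h_i, s_i in zip(h, s)]
v = [0, 0, 5, 0, 0, 0, 0, 0]
sketches = [matvec(M, v) for M in matrices]
v2_estimate = reconstruct(sketches, h, s, 2)
```

Because `v` has a single non-zero coordinate, nothing collides with it and `v2_estimate` recovers the value 5 exactly. Each column of every matrix contains exactly one non-zero entry, so a sketch can be computed in time proportional to the number of non-zero entries of `v` without materialising the matrix.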

Relation to Tensor sketch

The count sketch projection of the outer product of two vectors is equivalent to the convolution of two component count sketches.

The count sketch computes a vector convolution $C^{(1)}x\ast C^{(2)}x^{T}$ , where $C^{(1)}$ and $C^{(2)}$ are independent count sketch matrices.

Pham and Pagh^[8] show that this equals $C(x\otimes x^{T})$ – a count sketch $C$ of the outer product of vectors, where $\otimes$ denotes Kronecker product.

The fast Fourier transform can be used to do fast convolution of count sketches. By using the face-splitting product ^[9]^[10]^[11] such structures can be computed much faster than normal matrices.
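
The identity can be checked numerically with a short Python sketch. It assumes that the count sketch $C$ of the outer product uses the combined bucket hash $(h_{1}(i)+h_{2}(j)){\bmod {w}}$ and the combined sign $s_{1}(i)s_{2}(j)$ , and that the convolution is circular. The function names and the random test data are illustrative assumptions. The convolution is evaluated directly in $O(w^{2})$ time for clarity; a fast Fourier transform would give the same result faster.

```python
import random


def count_sketch(v, h, s, w):
    """Single-row count sketch of v with bucket table h and sign table s."""
    C = [0] * w
    for j, x in enumerate(v):
        C[h[j]] += s[j] * x
    return C


def circular_convolution(a, b):
    """Direct circular convolution of two length-w vectors."""
    w = len(a)
    out = [0] * w
    for i in range(w):
        for j in range(w):
            out[(i + j) % w] += a[i] * b[j]
    return out


def outer_product_sketch(x, y, h1, s1, h2, s2, w):
    """Count sketch of the outer product of x and y with the combined hashes."""
    C = [0] * w
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            C[(h1[i] + h2[j]) % w] += s1[i] * s2[j] * xi * yj
    return C


rng = random.Random(0)
n, w = 6, 4
h1 = [rng.randrange(w) for _ in range(n)]
h2 = [rng.randrange(w) for _ in range(n)]
s1 = [rng.choice((-1, 1)) for _ in range(n)]
s2 = [rng.choice((-1, 1)) for _ in range(n)]
x = [rng.randint(-3, 3) for _ in range(n)]

via_convolution = circular_convolution(count_sketch(x, h1, s1, w),
                                       count_sketch(x, h2, s2, w))
direct = outer_product_sketch(x, x, h1, s1, h2, s2, w)
```

Both `via_convolution` and `direct` hold the same $w$ integers, although the first never forms the $n^{2}$ entries of the outer product.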

References

1. Faisal M. Algashaam; Kien Nguyen; Mohamed Alkanhal; Vinod Chandran; Wageeh Boles. "Multispectral Periocular Classification With Multimodal Compact Multi-Linear Pooling". IEEE Access, Vol. 5. 2017.
2. Ahle, Thomas; Knudsen, Jakob (2019-09-03). "Almost Optimal Tensor Sketch". ResearchGate. Retrieved 2020-07-11.
3. Charikar, Chen & Farach-Colton 2004.
4. Alon, Noga, Yossi Matias, and Mario Szegedy. "The space complexity of approximating the frequency moments." Journal of Computer and System Sciences 58.1 (1999): 137–147.
5. Moody, John. "Fast learning in multi-resolution hierarchies." Advances in Neural Information Processing Systems. 1989.
6. Woodruff, David P. "Sketching as a Tool for Numerical Linear Algebra." Theoretical Computer Science 10.1-2 (2014): 1–157.
7. Larsen, Kasper Green, Rasmus Pagh, and Jakub Tětek. "CountSketches, Feature Hashing and the Median of Three." International Conference on Machine Learning. PMLR, 2021.
8. Ninh, Pham; Pagh, Rasmus (2013). Fast and scalable polynomial kernels via explicit feature maps. SIGKDD international conference on Knowledge discovery and data mining. Association for Computing Machinery. doi:10.1145/2487575.2487591.
9. Slyusar, V. I. (1998). "End products in matrices in radar applications" (PDF). Radioelectronics and Communications Systems. 41 (3): 50–53.
10. Slyusar, V. I. (1997-05-20). "Analytical model of the digital antenna array on a basis of face-splitting matrix products" (PDF). Proc. ICATT-97, Kyiv: 108–109.
11. Slyusar, V. I. (March 13, 1998). "A Family of Face Products of Matrices and its Properties" (PDF). Cybernetics and Systems Analysis C/C of Kibernetika I Sistemnyi Analiz. 1999. 35 (3): 379–384. doi:10.1007/BF02733426. S2CID 119661450.