Autocorrelation (words)

In combinatorics, a branch of mathematics, the autocorrelation of a word is the set of periods of this word. More precisely, it is a sequence of values which indicates how much the end of a word looks like its beginning. These values can be used to compute, for example, the average position of the first occurrence of this word in a random string.

Definition

In this article, $A$ is an alphabet and $w$ a word on $A$ of length $n$. The autocorrelation of $w$ can be defined as the correlation of $w$ with itself. However, we redefine this notion below.

Autocorrelation vector

The autocorrelation vector of $w$ is $c = (c_0, c_1, \dots, c_{n-1})$, with $c_i$ being 1 if the prefix of length $n - i$ of $w$ equals the suffix of length $n - i$ of $w$, and with $c_i$ being 0 otherwise. That is, $c_i$ indicates whether $w_{i+1} \dots w_n = w_1 \dots w_{n-i}$.

For example, the autocorrelation vector of $aaa$ is $(1,1,1)$ since, clearly, for $i$ being 0, 1 or 2, the prefix of length $3 - i$ is equal to the suffix of length $3 - i$. The autocorrelation vector of $abb$ is $(1,0,0)$ since no strict prefix is equal to a strict suffix. Finally, the autocorrelation vector of $aabbaa$ is $(1,0,0,0,1,1)$, as shown in the following table:

```
w        : a a b b a a
shift 0  : a a b b a a              c_0 = 1
shift 1  :   a a b b a a            c_1 = 0
shift 2  :     a a b b a a          c_2 = 0
shift 3  :       a a b b a a        c_3 = 0
shift 4  :         a a b b a a      c_4 = 1
shift 5  :           a a b b a a    c_5 = 1
```

Each row slides a copy of $w$ by $i$ positions; $c_i$ is 1 exactly when the letters in the overlap agree.

Note that $c_0$ is always equal to 1, since the prefix and the suffix of length $n$ are both equal to the word $w$. Similarly, $c_{n-1}$ is 1 if and only if the first and the last letters are the same.
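
As a concrete illustration, here is a minimal Python sketch of this definition (the function name autocorrelation_vector is our own choice, not standard terminology):

```python
def autocorrelation_vector(w):
    """Return c = (c_0, ..., c_{n-1}) for the word w, where c_i is 1
    exactly when the prefix of length n - i equals the suffix of
    length n - i, i.e. when w[:n-i] == w[i:]."""
    n = len(w)
    return [1 if w[: n - i] == w[i:] else 0 for i in range(n)]

print(autocorrelation_vector("aaa"))     # [1, 1, 1]
print(autocorrelation_vector("abb"))     # [1, 0, 0]
print(autocorrelation_vector("aabbaa"))  # [1, 0, 0, 0, 1, 1]
```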


Autocorrelation polynomial

The autocorrelation polynomial of $w$ is defined as $c(z) = c_0 z^0 + c_1 z^1 + \dots + c_{n-1} z^{n-1}$. It is a polynomial of degree at most $n - 1$.

For example, the autocorrelation polynomial of $aaa$ is $1 + z + z^2$ and the autocorrelation polynomial of $abb$ is $1$. Finally, the autocorrelation polynomial of $aabbaa$ is $1 + z^4 + z^5$.
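
Continuing the sketch above, the polynomial simply reads the autocorrelation vector as a list of coefficients, so evaluating it at a point is a one-liner (again, the names are ours):

```python
def autocorrelation_polynomial(w, z):
    """Evaluate c(z) = c_0 + c_1 z + ... + c_{n-1} z^{n-1} for the word w."""
    n = len(w)
    c = [1 if w[: n - i] == w[i:] else 0 for i in range(n)]  # autocorrelation vector
    return sum(ci * z ** i for i, ci in enumerate(c))

print(autocorrelation_polynomial("aaa", 0.5))     # 1 + 1/2 + 1/4   = 1.75
print(autocorrelation_polynomial("aabbaa", 0.5))  # 1 + 1/16 + 1/32 = 1.09375
```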

Properties

We now present some quantities that can be computed using the autocorrelation polynomial.

First occurrence of a word in a random string

Suppose that you choose an infinite sequence $s$ of letters of $A$ uniformly at random, each letter with probability $\frac{1}{k}$, where $k$ is the number of letters of $A$. Let us call $E$ the expected position of the first occurrence of $w$ in $s$. Then $E$ equals $k^n \, c\!\left(\frac{1}{k}\right)$. That is, each subword $v$ of $w$ which is both a prefix and a suffix causes the average position of the first occurrence of $w$ to occur $k^{|v|}$ letters later. Here $|v|$ is the length of $v$.

For example, over the binary alphabet $A = \{a, b\}$, the average position of the first occurrence of $abb$ is $2^3 \cdot 1 = 8$, while the average position of the first occurrence of $aaa$ is $2^3 (1 + 2^{-1} + 2^{-2}) = 14$. Intuitively, the fact that the first occurrence of $aaa$ tends to be later than the first occurrence of $abb$ can be explained in two ways:

  1. Occurrences of $aaa$ can overlap each other (for instance, $aaaa$ contains two occurrences), so they arrive in clusters. Since $aaa$ and $abb$ occur equally often on average in a long random string, the word whose occurrences cluster must, on average, start appearing later.
  2. A failed attempt at $aaa$ loses all progress: as soon as a $b$ is read, no prefix of the attempt can be reused. A failed attempt at $abb$ can salvage progress, since a mismatching $a$ immediately starts a new candidate occurrence.
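
The sketch below checks the formula $E = k^n c(1/k)$ against a Monte Carlo simulation. Here we take "position of the first occurrence" to mean the position of the letter that completes it, which is the convention under which the values 8 and 14 above come out; the helper names are ours.

```python
import random

def expected_first_occurrence(w, alphabet):
    """E = k**n * c(1/k), with c the autocorrelation polynomial of w."""
    k, n = len(alphabet), len(w)
    c = [1 if w[: n - i] == w[i:] else 0 for i in range(n)]
    return k ** n * sum(ci * (1 / k) ** i for i, ci in enumerate(c))

def simulated_first_occurrence(w, alphabet, trials=100_000):
    """Average (1-based) position at which w is first completed
    in a stream of uniformly random letters."""
    n, total = len(w), 0
    for _ in range(trials):
        window, pos = "", 0
        while window != w:
            window = (window + random.choice(alphabet))[-n:]  # keep last n letters
            pos += 1
        total += pos
    return total / trials

print(expected_first_occurrence("aaa", "ab"))             # 14.0
print(expected_first_occurrence("abb", "ab"))             # 8.0
print(round(simulated_first_occurrence("aaa", "ab"), 1))  # close to 14
```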

Ordinary generating functions

The autocorrelation polynomial makes it possible to write simple equations for the ordinary generating functions (OGF) of many natural counting questions. For example, by a classical result of Guibas and Odlyzko, the numbers of words over a $k$-letter alphabet that avoid $w$ as a factor have the ordinary generating function $\frac{c(z)}{z^n + (1 - kz)\,c(z)}$.
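
As a sanity check, the sketch below extracts the Taylor coefficients of this generating function by long division and compares them against a brute-force count; the function names are ours.

```python
from itertools import product

def counts_brute(w, alphabet, max_len):
    """Number of words of each length 0..max_len avoiding w as a factor."""
    return [sum(w not in "".join(u) for u in product(alphabet, repeat=L))
            for L in range(max_len + 1)]

def counts_ogf(w, k, max_len):
    """Taylor coefficients of c(z) / (z^n + (1 - k z) c(z)) up to max_len."""
    n = len(w)
    c = [1 if w[: n - i] == w[i:] else 0 for i in range(n)]
    d = [0] * (n + 1)                  # denominator z^n + (1 - k z) c(z)
    for i, ci in enumerate(c):
        d[i] += ci                     # contribution of 1 * c(z)
        d[i + 1] -= k * ci             # contribution of -k z * c(z)
    d[n] += 1                          # contribution of z^n
    a = []                             # long division: solve a(z) * d(z) = c(z)
    for m in range(max_len + 1):
        s = sum(a[j] * d[m - j] for j in range(max(0, m - n), m))
        a.append((c[m] if m < n else 0) - s)  # valid since d[0] = c_0 = 1
    return a

print(counts_brute("aa", "ab", 6))  # [1, 2, 3, 5, 8, 13, 21]
print(counts_ogf("aa", 2, 6))       # [1, 2, 3, 5, 8, 13, 21]
```

For $w = aa$ the counts are the Fibonacci numbers, matching the well-known count of binary strings with no two consecutive $a$'s.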

