Complexity function

Last updated

In computer science, the complexity function of a string, a finite or infinite sequence of letters from some alphabet, is the function that counts the number of distinct factors (substrings of consecutive symbols) from that string. More generally, the complexity function of a language, a set of finite words over an alphabet, counts the number of distinct words of given length.

Contents

Complexity function of a word

Let u be a (possibly infinite) sequence of symbols from an alphabet. Define the function pu(n) of a positive integer n to be the number of different factors (consecutive substrings) of length n from the string u. [1] [2] [3] [4] [5]

For a string u of length at least n over an alphabet of size k we clearly have

the bounds being achieved by the constant word and a disjunctive word, [6] for example, the Champernowne word respectively. [7] For infinite words u, we have pu(n) bounded if u is ultimately periodic (a finite, possibly empty, sequence followed by a finite cycle). Conversely, if pu(n) ≤ n for some n, then u is ultimately periodic. [3] [8]

An aperiodic sequence is one which is not ultimately periodic. An aperiodic sequence has strictly increasing complexity function (this is the Morse–Hedlund theorem), [9] [10] so p(n) is at least n+1. [11]

A set S of finite binary words is balanced if for each n the subset Sn of words of length n has the property that the Hamming weight of the words in Sn takes at most two distinct values. A balanced sequence is one for which the set of factors is balanced. [12] A balanced sequence has complexity function at most n+1. [13]

A Sturmian word over a binary alphabet is one with complexity function n + 1. [14] A sequence is Sturmian if and only if it is balanced and aperiodic. [2] [15] An example is the Fibonacci word. [14] [16] More generally, a Sturmian word over an alphabet of size k is one with complexity n+k−1. An Arnoux-Rauzy word over a ternary alphabet has complexity 2n + 1: [14] an example is the Tribonacci word. [17]

For recurrent words, those in which each factor appears infinitely often, the complexity function almost characterises the set of factors: if s is a recurrent word with the same complexity function as t are then s has the same set of factors as t or δt where δ denotes the letter doubling morphism aaa. [18]

Complexity function of a language

Let L be a language over an alphabet and define the function pL(n) of a positive integer n to be the number of different words of length n in L [9] The complexity function of a word is thus the complexity function of the language consisting of the factors of that word.

The complexity function of a language is less constrained than that of a word. For example, it may be bounded but not eventually constant: the complexity function of the regular language takes values 3 and 4 on odd and even n≥2 respectively. There is an analogue of the Morse–Hedlund theorem: if the complexity of L satisfies pL(n) ≤ n for some n, then pL is bounded and there is a finite language F such that [9]

A polynomial or sparse language is one for which the complexity function p(n) is bounded by a fixed power of n. A regular language which is not polynomial is exponential: there are infinitely many n for which p(n) is greater than kn for some fixed k > 1. [19]

The topological entropy of an infinite sequence u is defined by

The limit exists as the logarithm of the complexity function is subadditive. [20] [21] Every real number between 0 and 1 occurs as the topological entropy of some sequence is applicable, [22] which may be taken to be uniformly recurrent [23] or even uniquely ergodic. [24]

For x a real number and b an integer ≥ 2 then the complexity function of x in base b is the complexity function p(x,b,n) of the sequence of digits of x written in base b. If x is an irrational number then p(x,b,n) ≥ n+1; if x is rational then p(x,b,n) ≤ C for some constant C depending on x and b. [6] It is conjectured that for algebraic irrational x the complexity is bn (which would follow if all such numbers were normal) but all that is known in this case is that p grows faster than any linear function of n. [25]

The abelian complexity functionpab(n) similarly counts the number of occurrences of distinct factors of given length n, where now we identify factors that differ only by a permutation of the positions. Clearly pab(n) ≤ p(n). The abelian complexity of a Sturmian sequence satisfies pab(n) = 2. [26]

Related Research Articles

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In mathematics, the Thue–Morse sequence, or Prouhet–Thue–Morse sequence, is the binary sequence obtained by starting with 0 and successively appending the Boolean complement of the sequence obtained thus far. The first few steps of this procedure yield the strings 0 then 01, 0110, 01101001, 0110100110010110, and so on, which are prefixes of the Thue–Morse sequence. The full sequence begins:

In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero elements, often called the empty string and denoted by ε or λ, as the identity element. The free monoid on a set A is usually denoted A. The free semigroup on A is the subsemigroup of A containing all elements except the empty string. It is usually denoted A+.

In mathematics, the Prouhet–Thue–Morse constant, named for Eugène Prouhet, Axel Thue, and Marston Morse, is the number—denoted by τ—whose binary expansion .01101001100101101001011001101001... is given by the Thue–Morse sequence. That is,

Sturmian word

In mathematics, a Sturmian word, named after Jacques Charles François Sturm, is a certain kind of infinitely long sequence of characters. Such a sequence can be generated by considering a game of English billiards on a square table. The struck ball will successively hit the vertical and horizontal edges labelled 0 and 1 generating a sequence of letters. This sequence is a Sturmian word.

Fibonacci word

A Fibonacci word is a specific sequence of binary digits. The Fibonacci word is formed by repeated concatenation in the same way that the Fibonacci numbers are formed by repeated addition.

In combinatorics, a squarefree word is a word that does not contain any squares. A square is a word of the form XX, where X is not empty. Thus, a squarefree word can also be defined as a word that avoids the pattern XX.

In mathematics and theoretical computer science, an automatic sequence is an infinite sequence of terms characterized by a finite automaton. The n-th term of an automatic sequence a(n) is a mapping of the final state reached in a finite automaton accepting the digits of the number n in some fixed base k.

A Markov partition is a tool used in dynamical systems theory, allowing the methods of symbolic dynamics to be applied to the study of hyperbolic dynamics. By using a Markov partition, the system can be made to resemble a discrete-time Markov process, with the long-term dynamical characteristics of the system represented as a Markov shift. The appellation 'Markov' is appropriate because the resulting dynamics of the system obeys the Markov property. The Markov partition thus allows standard techniques from symbolic dynamics to be applied, including the computation of expectation values, correlations, topological entropy, topological zeta functions, Fredholm determinants and the like.

Combinatorics on words is a fairly new field of mathematics, branching from combinatorics, which focuses on the study of words and formal languages. The subject looks at letters or symbols, and the sequences they form. Combinatorics on words affects various areas of mathematical study, including algebra and computer science. There have been a wide range of contributions to the field. Some of the first work was on square-free words by Axel Thue in the early 1900s. He and colleagues observed patterns within words and tried to explain them. As time went on, combinatorics on words became useful in the study of algorithms and coding. It led to developments in abstract algebra and answering open questions.

Rauzy fractal

In mathematics, the Rauzy fractal is a fractal set associated with the Tribonacci substitution

M. Lothaire is the pseudonym of a group of mathematicians, many of whom were students of Marcel-Paul Schützenberger. The name is used as the author of several of their joint books about combinatorics on words. The group is named for Lothair I.

Cutting sequence Records individual grid lines crossed ("cut") as a curve crosses a square grid

In digital geometry, a cutting sequence is a sequence of symbols whose elements correspond to the individual grid lines crossed ("cut") as a curve crosses a square grid.

In mathematics and computer science, a morphic word or substitutive word is an infinite sequence of symbols which is constructed from a particular class of endomorphism of a free monoid.

In mathematics, a sesquipower or Zimin word is a string over an alphabet with identical prefix and suffix. Sesquipowers are unavoidable patterns, in the sense that all sufficiently long strings contain one.

In mathematics and theoretical computer science, a pattern is an unavoidable pattern if it is unavoidable on any finite alphabet.

In mathematics, a recurrent word or sequence is an infinite word over a finite alphabet in which every factor occurs infinitely many times. An infinite word is recurrent if and only if it is a sesquipower.

In mathematics and computer science, the critical exponent of a finite or infinite sequence of symbols over a finite alphabet describes the largest number of times a contiguous subsequence can be repeated. For example, the critical exponent of "Mississippi" is 7/3, as it contains the string "ississi", which is of length 7 and period 3.

In mathematics, Ostrowski numeration, named after Alexander Ostrowski, is either of two related numeration systems based on continued fractions: a non-standard positional numeral system for integers and a non-integer representation of real numbers.

Valérie Berthé French mathematician

Valérie Berthé is a French mathematician who works as a director of research for the Centre national de la recherche scientifique (CNRS) at the Institut de Recherche en Informatique Fondamentale (IRIF), a joint project between CNRS and Paris Diderot University. Her research involves symbolic dynamics, combinatorics on words, discrete geometry, numeral systems, tessellations, and fractals.

References

  1. Lothaire (2011) p.7
  2. 1 2 Lothaire (2011) p.46
  3. 1 2 Pytheas Fogg (2002) p.3
  4. Berstel et al (2009) p.82
  5. Allouche & Shallit (2003) p.298
  6. 1 2 Bugeaud (2012) p.91
  7. Cassaigne & Nicolas (2010) p.165
  8. Allouche & Shallit (2003) p.302
  9. 1 2 3 Berthé & Rigo (2010) p.166
  10. Cassaigne & Nicolas (2010) p.166
  11. Lothaire (2011) p.22
  12. Allouche & Shallit (2003) p.313
  13. Lothaire (2011) p.48
  14. 1 2 3 Pytheas Fogg (2002) p.6
  15. Allouche & Shallit (2003) p.318
  16. de Luca, Aldo (1995). "A division property of the Fibonacci word". Information Processing Letters. 54 (6): 307–312. doi:10.1016/0020-0190(95)00067-M.
  17. Pytheas Fogg (2002) p.368
  18. Berstel et al (2009) p.84
  19. Berthé & Rigo (2010) p.136
  20. Pytheas Fogg (2002) p.4
  21. Allouche & Shallit (2003) p.303
  22. Cassaigne & Nicolas (2010) p.169
  23. Berthé & Rigo (2010) p.391
  24. Berthé & Rigo (2010) p.169
  25. Berthé & Rigo (2010) p.414
  26. Blanchet-Sadri, Francine; Fox, Nathan (2013). "On the Asymptotic Abelian Complexity of Morphic Words". In Béal, Marie-Pierre; Carton, Olivier (eds.). Developments in Language Theory. Proceedings, 17th International Conference, DLT 2013, Marne-la-Vallée, France, June 18-21, 2013. Lecture Notes in Computer Science. 7907. Berlin, Heidelberg: Springer-Verlag. pp. 94–105. doi:10.1007/978-3-642-38771-5_10. ISBN   978-3-642-38770-8. ISSN   0302-9743.