Combinatorics on words

[Figure: Construction of a Thue–Morse infinite word]

Combinatorics on words is a fairly new field of mathematics, branching from combinatorics, which focuses on the study of words and formal languages. The subject looks at letters or symbols, and the sequences they form. Combinatorics on words affects various areas of mathematical study, including algebra and computer science. There has been a wide range of contributions to the field. Some of the first work was on square-free words by Axel Thue in the early 1900s. He and colleagues observed patterns within words and tried to explain them. As time went on, combinatorics on words became useful in the study of algorithms and coding, and led to developments in abstract algebra and the resolution of open questions.


Definition

Combinatorics is an area of discrete mathematics, the study of countable structures: objects that have a definite beginning and end. The study of such enumerable objects contrasts with disciplines such as analysis, where calculus and infinite structures are studied. Combinatorics studies how to count these objects using various representations. Combinatorics on words is a recent development in this field that focuses on the study of words and formal languages. A formal language is any set of symbols and combinations of symbols that people use to communicate information. [1]

Some terminology relevant to the study of words should first be explained. First and foremost, a word is basically a sequence of symbols, or letters, in a finite set. [1] One of these sets is known by the general public as the alphabet. For example, the word "encyclopedia" is a sequence of symbols in the English alphabet, a finite set of twenty-six letters. Since a word can be described as a sequence, other basic mathematical descriptions can be applied. Among the words over a given alphabet there exists a unique word of length zero, the empty word. The length of a word w is the number of symbols that make up the sequence, and is denoted |w|. [1] Again looking at the example "encyclopedia", |w| = 12, since encyclopedia has twelve letters. The idea of factoring large numbers can be applied to words, where a factor of a word is a block of consecutive symbols. [1] Thus, "cyclop" is a factor of "encyclopedia".
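
To make these definitions concrete, the short sketch below (an illustrative example, not drawn from the cited sources) models a word as a Python string, its length as len, and its factors as substrings:

    def factors(word):
        """Return all factors (blocks of consecutive symbols) of a word,
        including the empty word of length zero."""
        return {word[i:j] for i in range(len(word) + 1)
                          for j in range(i, len(word) + 1)}

    w = "encyclopedia"
    print(len(w))                  # |w| = 12
    print("cyclop" in factors(w))  # True: "cyclop" is a factor
    print("" in factors(w))        # True: the empty word is a factor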

In addition to examining sequences in themselves, another area of combinatorics on words to consider is how words can be represented visually. In mathematics various structures are used to encode data. A common structure used in combinatorics is the tree structure. A tree structure is a connected graph in which any two vertices are joined by exactly one path; the connecting lines are called edges. Trees contain no cycles, and may or may not be complete. Since a word is constructed from symbols, it is possible to encode a word, and thus the data, by using a tree. [1] This gives a visual representation of the object.
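
One common word-encoding tree (chosen here as an illustration; the source does not name a specific construction) is the prefix tree, or trie, in which each edge carries a symbol and each root-to-marker path spells out a word:

    def build_trie(words):
        """Encode a set of words as a tree of nested dicts: each edge is
        labeled with a symbol, and "$" marks where a word ends."""
        root = {}
        for word in words:
            node = root
            for symbol in word:
                node = node.setdefault(symbol, {})
            node["$"] = {}
        return root

    trie = build_trie(["ten", "tea", "to"])
    print(sorted(trie["t"]))       # ['e', 'o']: branches after the shared "t"
    print(sorted(trie["t"]["e"]))  # ['a', 'n']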

Major contributions

The first books on combinatorics on words that summarize the origins of the subject were written by a group of mathematicians that collectively went by the name of M. Lothaire. Their first book was published in 1983, when combinatorics on words became more widespread. [1]

Patterns

Patterns within words

A main contributor to the development of combinatorics on words was Axel Thue (1863–1922); he researched repetition. Thue's main contribution was the proof of the existence of infinite square-free words. Square-free words do not have adjacent repeated factors. [1] To clarify, "dining" is not square-free, since the factor "in" is repeated consecutively, while "servers" is square-free, its two "er" factors not being adjacent. Thue proved the existence of infinite square-free words by using substitutions. A substitution is a way to take a symbol and replace it with a word. He used this technique to describe his other contribution, the Thue–Morse sequence, or Thue–Morse word. [1]
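
The sketch below (a minimal illustration; the function names are this text's own) generates a prefix of the Thue–Morse word by iterating the substitution 0 → 01, 1 → 10, and separately checks sample English words for squares:

    def apply_substitution(word, rules):
        """Apply a substitution (symbol -> word) to every symbol of a word."""
        return "".join(rules[symbol] for symbol in word)

    def has_square(word):
        """Check for a square: some nonempty factor XX repeated back to back."""
        n = len(word)
        return any(word[i:i + l] == word[i + l:i + 2 * l]
                   for l in range(1, n // 2 + 1)
                   for i in range(n - 2 * l + 1))

    # Thue's substitution generating the Thue-Morse word: 0 -> 01, 1 -> 10.
    rules = {"0": "01", "1": "10"}
    word = "0"
    for _ in range(4):
        word = apply_substitution(word, rules)
    print(word)                   # 0110100110010110
    print(has_square("dining"))   # True: "inin" is the square "in" + "in"
    print(has_square("servers"))  # False: square-free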

Thue wrote two papers on square-free words, the second of which was on the Thue–Morse word. Marston Morse is included in the name because he discovered the same sequence as Thue did, yet they worked independently. Thue also proved the existence of an overlap-free word. An overlap-free word is one in which, for a symbol x and a word y, the pattern xyxyx does not occur within the word. He continued in his second paper to prove a relationship between infinite overlap-free words and square-free words: he took overlap-free words created using two different letters and demonstrated how they can be transformed into square-free words of three letters using substitution. [1]
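
One classical way to pass from the two-letter word to a three-letter one, consistent with Thue's result (the code itself is an illustrative sketch, not taken from the cited source), is to count the 1s between consecutive 0s of the Thue–Morse word; the resulting word over {0, 1, 2} is square-free:

    def thue_morse(n):
        """First n letters of the Thue-Morse word: t[i] is the parity
        of the number of 1-bits in the binary expansion of i."""
        return "".join(str(bin(i).count("1") % 2) for i in range(n))

    def ternary_from_thue_morse(n):
        """Count the 1s between consecutive 0s of the Thue-Morse word,
        producing a square-free word over {0, 1, 2}."""
        t = thue_morse(n)
        zeros = [i for i, c in enumerate(t) if c == "0"]
        return "".join(str(b - a - 1) for a, b in zip(zeros, zeros[1:]))

    print(ternary_from_thue_morse(32))  # begins 2102012...

Applying the square check from the previous sketch to prefixes of this ternary word finds no square, in line with Thue's theorem.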

As was previously described, words are studied by examining the sequences made by the symbols. Patterns are found, and they can be described mathematically. Patterns can be either avoidable or unavoidable. A significant contributor to the work on unavoidable patterns, or regularities, was Frank Ramsey in 1930. His important theorem states that for integers k and m, there exists a least positive integer R(k, m) such that however the edges of the complete graph on R(k, m) vertices are colored with two colors, there always exists either a complete subgraph on k vertices in the first color or a complete subgraph on m vertices in the second color. [1]
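
For a small, checkable instance, R(3, 3) = 6; the brute-force sketch below (an illustration, not from the cited source) verifies that every two-coloring of the edges of K6 contains a single-colored triangle:

    from itertools import combinations, product

    def has_mono_triangle(coloring, edges, vertices):
        """True if some triangle has all three edges the same color."""
        color = dict(zip(edges, coloring))
        return any(color[(a, b)] == color[(a, c)] == color[(b, c)]
                   for a, b, c in combinations(vertices, 3))

    vertices = range(6)
    edges = list(combinations(vertices, 2))
    # Every 2-coloring of K6's 15 edges contains a monochromatic triangle,
    # so R(3, 3) <= 6. (On K5, by contrast, colorings without one exist.)
    print(all(has_mono_triangle(c, edges, vertices)
              for c in product((0, 1), repeat=len(edges))))  # True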

Other contributors to the study of unavoidable patterns include van der Waerden. His theorem states that if the positive integers are partitioned into finitely many classes, then at least one class contains arithmetic progressions of every finite length. An arithmetic progression is a sequence of numbers in which the difference between adjacent numbers remains constant. [1]
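
The smallest non-trivial case can be checked directly: the van der Waerden number W(2, 3) is 9, meaning every two-coloring of {1, ..., 9} contains a monochromatic three-term arithmetic progression, while some two-coloring of {1, ..., 8} does not. The sketch below (an illustration, assuming the standard finite form of the theorem) verifies this by brute force:

    from itertools import product

    def has_mono_ap(coloring, length):
        """True if the coloring of 1..n contains a single-colored
        arithmetic progression with `length` terms."""
        n = len(coloring)
        return any(len({coloring[a + i * d - 1] for i in range(length)}) == 1
                   for d in range(1, n)
                   for a in range(1, n - (length - 1) * d + 1))

    print(all(has_mono_ap(c, 3) for c in product((0, 1), repeat=9)))  # True
    print(all(has_mono_ap(c, 3) for c in product((0, 1), repeat=8)))  # False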

When examining unavoidable patterns, sesquipowers are also studied. For patterns x, y, z, ..., a sesquipower is a word of the form x, xyx, xyxzxyx, ..., each term built by inserting a new pattern between two copies of the previous term. This is another pattern, like the square-free and other unavoidable patterns. Coudrain and Schützenberger mainly studied these sesquipowers for group theory applications. In addition, Zimin proved that sesquipowers are all unavoidable: whether the entire pattern shows up, or only some piece of the sesquipower shows up repetitively, it is not possible to avoid it. [1]
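
A short sketch (illustrative; it uses the equivalent Zimin-word recursion Z(1) = x1, Z(k+1) = Z(k) x_{k+1} Z(k)) generates the first few sesquipowers:

    def zimin(n):
        """n-th sesquipower (Zimin word) over the symbols x1, x2, ...:
        Z(1) = x1 and Z(k+1) = Z(k) x_{k+1} Z(k)."""
        word = ["x1"]
        for k in range(2, n + 1):
            word = word + [f"x{k}"] + word
        return " ".join(word)

    print(zimin(1))  # x1
    print(zimin(2))  # x1 x2 x1
    print(zimin(3))  # x1 x2 x1 x3 x1 x2 x1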

Patterns within alphabets

Necklaces are constructed from words read as circular sequences. They are most frequently used in music and astronomy. Flye Sainte-Marie proved in 1894 that there are 2^(2^(n−1) − n) binary de Bruijn necklaces of length 2^n. A de Bruijn necklace of order n over a given alphabet contains every word of length n exactly once as a circular factor. [1]
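
For illustration, a classical construction via Lyndon words (not described in the cited history, but standard) produces a binary de Bruijn sequence, i.e. a de Bruijn necklace cut open at one point:

    def de_bruijn(k, n):
        """Lexicographically least de Bruijn sequence over k symbols with
        window length n, built by concatenating Lyndon words."""
        a = [0] * k * n
        sequence = []

        def db(t, p):
            if t > n:
                if n % p == 0:
                    sequence.extend(a[1:p + 1])
            else:
                a[t] = a[t - p]
                db(t + 1, p)
                for j in range(a[t - p] + 1, k):
                    a[t] = j
                    db(t + 1, t)

        db(1, 1)
        return "".join(str(s) for s in sequence)

    print(de_bruijn(2, 3))  # 00010111: every binary word of length 3
                            # occurs exactly once as a circular factor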

In 1874, Baudot developed the code that would eventually take the place of Morse code by applying the theory of binary de Bruijn necklaces. The problem continued from Sainte-Marie to Martin in 1934, who began looking at algorithms to make words of the de Bruijn structure. It was then worked on by Posthumus in 1943. [1]

Language hierarchy

Possibly the most applied result in combinatorics on words is the Chomsky hierarchy, developed by Noam Chomsky, who studied formal language in the 1950s. [2] His way of looking at language simplified the subject: he disregarded the actual meaning of a word, set aside factors such as frequency and context, and applied patterns observed on short terms to terms of all lengths. The basic idea of Chomsky's work is to divide language into four levels, the language hierarchy. The four levels are: regular, context-free, context-sensitive, and computably enumerable or unrestricted. [2] Regular is the least complex while computably enumerable is the most complex. While his work grew out of combinatorics on words, it drastically affected other disciplines, especially computer science. [3]
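
A tiny pair of examples (chosen here for illustration) separates the first two levels: the language (ab)* is regular and is matched by a regular expression, while {a^n b^n : n ≥ 0} is context-free but provably not regular:

    import re

    # (ab)* is regular: a regular expression defines it exactly.
    print(bool(re.fullmatch(r"(ab)*", "ababab")))  # True
    print(bool(re.fullmatch(r"(ab)*", "abba")))    # False

    def is_anbn(word):
        """Recognizer for {a^n b^n}: context-free but not regular."""
        n = len(word) // 2
        return len(word) % 2 == 0 and word == "a" * n + "b" * n

    print(is_anbn("aaabbb"))  # True
    print(is_anbn("aabbb"))   # False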

Word types

Sturmian words

Sturmian words, named after Jacques Charles François Sturm, have roots in combinatorics on words. There exist several equivalent definitions of Sturmian words. For example, an infinite word is Sturmian if and only if it has exactly n + 1 distinct factors of length n, for every non-negative integer n. [1]
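
The Fibonacci word, generated by the substitution 0 → 01, 1 → 0, is a standard example of a Sturmian word. The sketch below (an illustration, not from the source) counts the distinct factors of each small length in a long prefix and observes the characteristic complexity n + 1:

    def fibonacci_word(iterations):
        """Prefix of the Fibonacci word, a standard Sturmian word,
        generated by the substitution 0 -> 01, 1 -> 0."""
        word = "0"
        for _ in range(iterations):
            word = "".join({"0": "01", "1": "0"}[c] for c in word)
        return word

    w = fibonacci_word(15)  # a long prefix (1597 letters)
    for n in range(1, 8):
        factors_of_length_n = {w[i:i + n] for i in range(len(w) - n + 1)}
        print(n, len(factors_of_length_n))  # each line shows n, n + 1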

Lyndon word

A Lyndon word is a word over a given alphabet that is written in its simplest and most ordered form out of its respective conjugacy class; equivalently, it is strictly smaller in lexicographic order than all of its rotations. Lyndon words are important because, for any given Lyndon word w of length greater than one, there exist Lyndon words u and v with u < v and w = uv. Further, a theorem by Chen, Fox, and Lyndon states that any word has a unique factorization into a non-increasing sequence of Lyndon words. Due to this property, Lyndon words are used to study algebra, specifically group theory. They form the basis for the idea of commutators. [1]
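
This factorization can be computed in linear time by Duval's algorithm; the following is a standard implementation (the algorithm is classical, though not part of the cited source):

    def chen_fox_lyndon(s):
        """Duval's algorithm: factor s into a non-increasing sequence of
        Lyndon words (the unique Chen-Fox-Lyndon factorization)."""
        factorization = []
        i = 0
        while i < len(s):
            j, k = i + 1, i
            while j < len(s) and s[k] <= s[j]:
                k = i if s[k] < s[j] else k + 1
                j += 1
            while i <= k:
                factorization.append(s[i:i + j - k])
                i += j - k
        return factorization

    print(chen_fox_lyndon("bananas"))  # ['b', 'ananas']: "b" > "ananas"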

Visual representation

Cobham contributed work relating Eugène Prouhet's work with finite automata. A mathematical graph is made of edges and nodes. With finite automata, the edges are labeled with a letter in an alphabet. To use the graph, one starts at a node and travels along the edges to reach a final node. The path taken along the graph forms the word. It is a finite graph because there are a finite number of nodes and edges, and each word traces out only one path through the graph. [1]
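
A minimal sketch (the automaton below is this text's own example) shows the idea: states are nodes, edges are labeled with letters, and reading a word traces one path through the graph:

    # A finite automaton as a labeled graph. This illustrative machine
    # accepts exactly the words over {a, b} with an even number of 'a's.
    transitions = {
        ("even", "a"): "odd", ("even", "b"): "even",
        ("odd", "a"): "even", ("odd", "b"): "odd",
    }

    def accepts(word, start="even", final=("even",)):
        state = start
        for letter in word:
            state = transitions[(state, letter)]
        return state in final

    print(accepts("abab"))  # True: two 'a's
    print(accepts("ab"))    # False: one 'a'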

Gauss codes, created by Carl Friedrich Gauss in 1838, are developed from graphs. Specifically, a closed curve on a plane is needed. If the curve crosses over itself only a finite number of times, then one labels each intersection with a letter of the alphabet used. Traveling along the curve, the word is determined by recording each letter as an intersection is passed. Gauss noticed that, between the two occurrences of the same symbol in such a word, an even number of letters always appears. [1]
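
This parity observation is easy to check mechanically; the sketch below (an illustration, treating the observation as a necessary condition on codes of closed curves) tests it on small codes:

    def satisfies_gauss_parity(code):
        """Gauss's condition on a closed-curve code: an even number of
        letters lies between the two occurrences of each symbol."""
        first = {}
        for position, symbol in enumerate(code):
            if symbol in first:
                if (position - first[symbol] - 1) % 2 != 0:
                    return False
            else:
                first[symbol] = position
        return True

    print(satisfies_gauss_parity("123123"))  # True: "23" lies between the 1s
    print(satisfies_gauss_parity("1212"))    # False: a lone "2" between the 1s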

Group theory

Walther Franz Anton von Dyck began the work of combinatorics on words in group theory with his published work in 1882 and 1883. He began by using words as group elements. Lagrange had also contributed, in 1771, with his work on permutation groups. [1]

One aspect of combinatorics on words studied in group theory is reduced words. A group is constructed from words on some alphabet of generators and their inverse elements, excluding factors of the form aā or āa for some a in the alphabet. Reduced words are formed by using the factors aā and āa to cancel out elements until a unique word is reached. [1]
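
Reduction can be implemented with a stack, cancelling adjacent inverse pairs as they appear; in the sketch below (an illustration, with uppercase letters standing in for the barred inverses) the result is the unique reduced word:

    def reduce_word(word):
        """Cancel adjacent inverse pairs until no factor of the form
        a a^{-1} or a^{-1} a remains; uppercase marks the inverse."""
        stack = []
        for letter in word:
            if stack and stack[-1] == letter.swapcase():
                stack.pop()  # cancel the adjacent inverse pair
            else:
                stack.append(letter)
        return "".join(stack)

    print(reduce_word("abBAa"))  # "a": abBAa -> aAa -> a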

Nielsen transformations were also developed. For a set of elements of a free group, a Nielsen transformation is achieved by three transformations: replacing an element with its inverse, replacing an element with the product of itself and another element, and eliminating any element equal to 1. By applying these transformations, Nielsen reduced sets are formed. A reduced set means that no element can be multiplied by other elements to cancel out completely. There are also connections between Nielsen transformations and Sturmian words. [1]

Considered problems

One problem considered in the study of combinatorics on words in group theory is the following: for two elements x, y of a semigroup, does x = y modulo the defining relations of x and y? Post and Markov studied this problem and determined it undecidable, meaning that there is no possible algorithm that can answer the question in all cases (because any such algorithm could be encoded into a word problem which that algorithm could not solve). [1]

The Burnside question was settled using the existence of an infinite cube-free word. This question asks whether a group must be finite if it has a finite number of generators and every element x in the group satisfies x^n = 1 for some fixed n. [1]

Many word problems are undecidable based on the Post correspondence problem. Any two homomorphisms g, h with a common domain and a common codomain form an instance of the Post correspondence problem, which asks whether there exists a word w in the domain such that g(w) = h(w). Post proved that this problem is undecidable; consequently, any word problem that can be reduced to this basic problem is likewise undecidable. [1]
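
An instance can be searched by brute force over words of increasing length; since the problem is undecidable, such a search can only semi-decide "yes" instances and may never terminate on "no" instances. The sketch below (with homomorphisms g and h of this text's own choosing) illustrates this:

    from itertools import product

    def apply_hom(hom, word):
        """Image of a word under a homomorphism given letterwise."""
        return "".join(hom[letter] for letter in word)

    def pcp_search(g, h, max_length):
        """Look for a nonempty word w with g(w) == h(w), up to a length
        bound. No bound works in general: the search only semi-decides."""
        alphabet = sorted(g)
        for length in range(1, max_length + 1):
            for w in product(alphabet, repeat=length):
                if apply_hom(g, w) == apply_hom(h, w):
                    return "".join(w)
        return None

    g = {"a": "ab", "b": "b"}
    h = {"a": "a", "b": "bb"}
    print(pcp_search(g, h, 6))  # "ab": g(ab) = "abb" = h(ab)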

Other applications

Combinatorics on words has applications to equations. Makanin proved that it is possible to decide whether a finite system of equations constructed from words has a solution. [1]


Related Research Articles

Formal language

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense used in theoretical computer science.

In mathematics, the Thue–Morse sequence, or Prouhet–Thue–Morse sequence, or parity sequence, is the binary sequence obtained by starting with 0 and successively appending the Boolean complement of the sequence obtained thus far. The first few steps of this procedure yield the strings 0, then 01, 0110, 01101001, 0110100110010110, and so on, which are prefixes of the Thue–Morse sequence. The full sequence begins 01101001100101101001011001101001...

In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero elements, often called the empty string and denoted by ε or λ, as the identity element. The free monoid on a set A is usually denoted A∗. The free semigroup on A is the subsemigroup of A∗ containing all elements except the empty string. It is usually denoted A+.

In mathematics, the lexicographic or lexicographical order is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a totally ordered set.

Sturmian word

In mathematics, a Sturmian word, named after Jacques Charles François Sturm, is a certain kind of infinitely long sequence of characters. Such a sequence can be generated by considering a game of English billiards on a square table. The struck ball will successively hit the vertical and horizontal edges labelled 0 and 1 generating a sequence of letters. This sequence is a Sturmian word.

In mathematics, subshifts of finite type are used to model dynamical systems, and in particular are the objects of study in symbolic dynamics and ergodic theory. They also describe the set of all possible sequences executed by a finite state machine. The most widely studied shift spaces are the subshifts of finite type.

In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a nonempty string that is strictly smaller in lexicographic order than all of its rotations. Lyndon words are named after mathematician Roger Lyndon, who investigated them in 1954, calling them standard lexicographic sequences. Anatoly Shirshov introduced Lyndon words in 1953 calling them regular words. Lyndon words are a special case of Hall words; almost all properties of Lyndon words are shared by Hall words.

In combinatorics, a squarefree word is a word that does not contain any squares. A square is a word of the form XX, where X is not empty. Thus, a squarefree word can also be defined as a word that avoids the pattern XX.

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite, countable, or even uncountable.

In mathematics and theoretical computer science, an automatic sequence (also called a k-automatic sequence or a k-recognizable sequence when one wants to indicate that the base of the numerals used is k) is an infinite sequence of terms characterized by a finite automaton. The n-th term of an automatic sequence a(n) is a mapping of the final state reached in a finite automaton accepting the digits of the number n in some fixed base k.

In computer science, the complexity function of a word or string is the function that counts the number of distinct factors of that string. More generally, the complexity function of a formal language counts the number of distinct words of given length.

In mathematics, a shuffle algebra is a Hopf algebra with a basis corresponding to words on some set, whose product is given by the shuffle product X ⧢ Y of two words X, Y: the sum of all ways of interlacing them. The interlacing is given by the riffle shuffle permutation.

In mathematics, a factorisation of a free monoid is a sequence of subsets of words with the property that every word in the free monoid can be written as a concatenation of elements drawn from the subsets. The Chen–Fox–Lyndon theorem states that the Lyndon words furnish a factorisation. The Schützenberger theorem relates the definition in terms of a multiplicative property to an additive property.

In mathematics and computer science, a morphic word or substitutive word is an infinite sequence of symbols which is constructed from a particular class of endomorphism of a free monoid.

In mathematics, a sesquipower or Zimin word is a string over an alphabet with identical prefix and suffix. Sesquipowers are unavoidable patterns, in the sense that all sufficiently long strings contain one.

In mathematics and theoretical computer science, a pattern is an unavoidable pattern if it is unavoidable on any finite alphabet.

In mathematics, a recurrent word or sequence is an infinite word over a finite alphabet in which every factor occurs infinitely many times. An infinite word is recurrent if and only if it is a sesquipower.

In mathematics and computer science, the critical exponent of a finite or infinite sequence of symbols over a finite alphabet describes the largest number of times a contiguous subsequence can be repeated. For example, the critical exponent of "Mississippi" is 7/3, as it contains the string "ississi", which is of length 7 and period 3.

Dejean's theorem is a statement about repetitions in infinite strings of symbols. It belongs to the field of combinatorics on words; it was conjectured in 1972 by Françoise Dejean and proven in 2009 by Currie and Rampersad and, independently, by Rao.

References

  1. Berstel, Jean; Perrin, Dominique (April 2007). "The origins of combinatorics on words". European Journal of Combinatorics. 28 (3): 996–1022. CiteSeerX 10.1.1.361.7000. doi:10.1016/j.ejc.2005.07.019.
  2. Jäger, Gerhard; Rogers, James (July 2012). "Formal language theory: refining the Chomsky hierarchy". Philosophical Transactions of the Royal Society B. 367 (1598): 1956–1970. doi:10.1098/rstb.2012.0077. PMC 3367686. PMID 22688632.
  3. Morales-Bueno, Rafael; Baena-Garcia, Manuel; Carmona-Cejudo, Jose M.; Castillo, Gladys (2010). "Counting Word Avoiding Factors". Electronic Journal of Mathematics and Technology. 4 (3): 251.
