Square-free word

Last updated

In combinatorics, a squarefree word is a word (a sequence of symbols) that does not contain any squares. A square is a word of the form XX, where X is not empty. Thus, a squarefree word can also be defined as a word that avoids the pattern XX.

Contents

Finite squarefree words

Binary alphabet

Over a binary alphabet , the only squarefree words are the empty word , and .

Ternary alphabet

Over a ternary alphabet , there are infinitely many squarefree words. It is possible to count the number of ternary squarefree words of length n.

The number of ternary squarefree words of length n [1]
n0123456789101112
136121830426078108144204264

This number is bounded by , where . [2] The upper bound on can be found via Fekete's Lemma and approximation by automata. The lower bound can be found by finding a substitution that preserves squarefreeness. [2]

Alphabet with more than three letters

Since there are infinitely many squarefree words over three-letter alphabets, this implies there are also infinitely many squarefree words over an alphabet with more than three letters.

The following table shows the exact growth rate of the k-ary squarefree words:

The growth rate of the k-ary squarefree words [2]
alphabet size (k)456789
growth rate2.62150803.73253864.79140695.82846616.85411737.8729902
alphabet size (k)101112131415
growth rate8.88748569.898981310.908327911.916080412.922616713.9282035

2-dimensional words

Consider a map from to A, where A is an alphabet and is called a 2-dimensional word. Let be the entry . A word is a line of if there exists such that , and for . [3]

Carpi [4] proves that there exists a 2-dimensional word over a 16-letter alphabet such that every line of is squarefree. A computer search shows that there are no 2-dimensional words over a 7-letter alphabet, such that every line of is squarefree.

Generating finite squarefree words

Shur [5] proposes an algorithm called R2F (random-t(w)o-free) that can generate a squarefree word of length n over any alphabet with three or more letters. This algorithm is based on a modification of entropy compression: it randomly selects letters from a k-letter alphabet to generate a -ary squarefree word.

algorithm R2F isinput: alphabet size ,            word length output: a -ary squarefree word wof length n.      (Note that  is the alphabet with letters .)     (For a word ,  is the permutation of  such that a precedes b in  if the       right most position of a in w is to the right of the rightmost position of b in w.      For example,   has .)      choose  in  uniformly at random      setto followed by all other letters of  in increasing order     set the number N of iterations to 0      whiledo         choose j in  uniformly at random         append  to the end of w         update  shifting the first j elements to the right and setting          increment N by 1ifw ends with a square of rank rthen             delete the last r letters of wreturnw

Every (k+1)-ary squarefree word can be the output of Algorithm R2F, because on each iteration it can append any letter except for the last letter of w.

The expected number of random k-ary letters used by Algorithm R2F to construct a -ary squarefree word of length n is

Note that there exists an algorithm that can verify the squarefreeness of a word of length n in time. Apostolico and Preparata [6] give an algorithm using suffix trees. Crochemore [7] uses partitioning in his algorithm. Main and Lorentz [8] provide an algorithm based on the divide-and-conquer method. A naive implementation may require time to verify the squarefreeness of a word of length n.

Infinite squarefree words

There exist arbitrarily long squarefree words in any alphabet with three or more letters, as proved by Axel Thue. [9]

Examples

First difference of the Thue–Morse sequence

One example of an infinite squarefree word over an alphabet of size 3 is the word over the alphabet obtained by taking the first difference of the Thue–Morse sequence. [9] That is, from the Thue–Morse sequence

one forms a new sequence in which each term is the difference of two consecutive terms of the Thue–Morse sequence. The resulting squarefree word is

(sequence A029883 in the OEIS ).

Leech's morphism

Another example found by John Leech [10] is defined recursively over the alphabet . Let be any squarefree word starting with the letter 0. Define the words recursively as follows: the word is obtained from by replacing each 0 in with 0121021201210, each 1 with 1202102012021, and each 2 with 2010210120102. It is possible to prove that the sequence converges to the infinite squarefree word

0121021201210120210201202120102101201021202102012021...

Generating infinite squarefree words

Infinite squarefree words can be generated by squarefree morphism. A morphism is called squarefree if the image of every squarefree word is squarefree. A morphism is called k–squarefree if the image of every squarefree word of length k is squarefree.

Crochemore [11] proves that a uniform morphism h is squarefree if and only if it is 3-squarefree. In other words, h is squarefree if and only if is squarefree for all squarefree w of length 3. It is possible to find a squarefree morphism by brute-force search.

algorithm squarefree_morphism isoutput: a squarefree morphism with the lowest possible rank k.      setwhile True dosetk_sf_wordsto the list of all squarefree words of length k over a ternary alphabet         for eachink_sf_wordsdofor eachink_sf_wordsdofor eachink_sf_wordsdoifthenbreak from the current loop (advance to next )                     ifandthenifis squarefree for all squarefree w of length 3thenreturn         increment k by 1

Over a ternary alphabet, there are exactly 144 uniform squarefree morphisms of rank 11 and no uniform squarefree morphisms with a lower rank than 11.

To obtain an infinite squarefree words, start with any squarefree word such as 0, and successively apply a squarefree morphism h to it. The resulting words preserve the property of squarefreeness. For example, let h be a squarefree morphism, then as , is an infinite squarefree word.

Note that, if a morphism over a ternary alphabet is not uniform, then this morphism is squarefree if and only if it is 5-squarefree. [11]

Letter combinations in squarefree words

Extending a squarefree word to avoid ab. Squarefreeness-3--Bound.png
Extending a squarefree word to avoid ab.

Avoid two-letter combinations

Over a ternary alphabet, a squarefree word of length more than 13 contains all the squarefree two-letter combinations. [12]

This can be proved by constructing a squarefree word without the two-letter combination ab. As a result, bcbacbcacbaca is the longest squarefree word without the combination ab and its length is equal to 13.

Note that over a more than three-letter alphabet there are squarefree words of any length without an arbitrary two-letter combination.

Avoid three-letter combinations

Over a ternary alphabet, a squarefree word of length more than 36 contains all the squarefree three-letter combinations. [12]

However, there are squarefree words of any length without the three-letter combination aba.

Note that over a more than three-letter alphabet there are squarefree words of any length without an arbitrary three-letter combination.

Density of a letter

The density of a letter a in a finite word w is defined as where is the number of occurrences of a in and is the length of the word. The density of a letter a in an infinite word is where is the prefix of the word w of length l. [13]

The minimal density of a letter a in an infinite ternary squarefree word is equal to . [13]

The maximum density of a letter a in an infinite ternary squarefree word is equal to . [14]

Notes

  1. "A006156 - OEIS". oeis.org. Retrieved 2019-03-28.
  2. 1 2 3 Shur, Arseny (2011). "Growth properties of power-free languages". Computer Science Review. 6 (5–6): 28–43. doi:10.1016/j.cosrev.2012.09.001.
  3. Berthe, Valerie; Rigo, Michel, eds. (2016), "Preface", Combinatorics, Words and Symbolic Dynamics, Cambridge University Press, pp. xi–xviii, doi:10.1017/cbo9781139924733.001, ISBN   9781139924733
  4. Carpi, Arturo (1988). "Multidimensional unrepetitive configurations". Theoretical Computer Science. 56 (2): 233–241. doi: 10.1016/0304-3975(88)90080-1 . ISSN   0304-3975.
  5. Shur, Arseny (2015). "Generating square-free words efficiently". Theoretical Computer Science. 601: 67–72. doi: 10.1016/j.tcs.2015.07.027 .
  6. Apostolico, A.; Preparata, F.P. (Feb 1983). "Optimal off-line detection of repetitions in a string". Theoretical Computer Science. 22 (3): 297–315. doi: 10.1016/0304-3975(83)90109-3 . ISSN   0304-3975.
  7. Crochemore, Max (Oct 1981). "An optimal algorithm for computing the repetitions in a word". Information Processing Letters. 12 (5): 244–250. doi:10.1016/0020-0190(81)90024-7. ISSN   0020-0190.
  8. Main, Michael G; Lorentz, Richard J (Sep 1984). "An O(n log n) algorithm for finding all repetitions in a string". Journal of Algorithms. 5 (3): 422–432. doi:10.1016/0196-6774(84)90021-x. ISSN   0196-6774.
  9. 1 2 Berstel, Jean (1994). Axel Thue's papers on repetitions in words a translation. Départements de mathématiques et d'informatique, Université du Québec à Montréal. ISBN   978-2892761405. OCLC   494791187.
  10. Leech, J. (1957). "A problem on strings of beads". Math. Gaz. 41: 277–278. doi:10.1017/S0025557200236115. S2CID   126406225. Zbl   0079.01101.
  11. 1 2 Berstel, Jean (April 1984). "Some Recent Results on Squarefree Words". Annual Symposium on Theoretical Aspects of Computer Science. Lecture Notes in Computer Science. 166: 14–25. doi:10.1007/3-540-12920-0_2. ISBN   978-3-540-12920-2.
  12. 1 2 Zolotov, Boris (2015). "Another Solution to the Thue Problem of Non-Repeating Words". arXiv: 1505.00019 [math.CO].
  13. 1 2 Khalyavin, Andrey (2007). "The minimal density of a letter in an infinite ternary square-free word is 883/3215" (PDF). Journal of Integer Sequences. 10 (2): 3. Bibcode:2007JIntS..10...65K.
  14. Ochem, Pascal (2007). "Letter frequency in infinite repetition-free words". Theoretical Computer Science. 380 (3): 388–392. doi: 10.1016/j.tcs.2007.03.027 . ISSN   0304-3975.

Related Research Articles

<span class="mw-page-title-main">Formal language</span> Sequence of words formed by specific rules

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In mathematics, a real number is said to be simply normal in an integer base b if its infinite sequence of digits is distributed uniformly in the sense that each of the b digit values has the same natural density 1/b. A number is said to be normal in base b if, for every positive integer n, all possible strings n digits long have density bn.

In statistics, the Wishart distribution is a generalization to multiple dimensions of the gamma distribution. It is named in honor of John Wishart, who first formulated the distribution in 1928.

In mathematics, computer science, and logic, rewriting covers a wide range of methods of replacing subterms of a formula with other terms. Such methods may be achieved by rewriting systems. In their most basic form, they consist of a set of objects, plus relations on how to transform those objects.

In mathematics, the Thue–Morse sequence, or Prouhet–Thue–Morse sequence, is the binary sequence obtained by starting with 0 and successively appending the Boolean complement of the sequence obtained thus far. The first few steps of this procedure yield the strings 0 then 01, 0110, 01101001, 0110100110010110, and so on, which are prefixes of the Thue–Morse sequence. The full sequence begins:

In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero elements, often called the empty string and denoted by ε or λ, as the identity element. The free monoid on a set A is usually denoted A. The free semigroup on A is the subsemigroup of A containing all elements except the empty string. It is usually denoted A+.

<span class="mw-page-title-main">Sturmian word</span> Kind of infinitely long sequence of characters

In mathematics, a Sturmian word, named after Jacques Charles François Sturm, is a certain kind of infinitely long sequence of characters. Such a sequence can be generated by considering a game of English billiards on a square table. The struck ball will successively hit the vertical and horizontal edges labelled 0 and 1 generating a sequence of letters. This sequence is a Sturmian word.

<span class="mw-page-title-main">Fibonacci word</span> Binary sequence from Fibonacci recurrence

A Fibonacci word is a specific sequence of binary digits. The Fibonacci word is formed by repeated concatenation in the same way that the Fibonacci numbers are formed by repeated addition.

In theoretical computer science and mathematical logic a string rewriting system (SRS), historically called a semi-Thue system, is a rewriting system over strings from a alphabet. Given a binary relation between fixed strings over the alphabet, called rewrite rules, denoted by , an SRS extends the rewriting relation to all strings in which the left- and right-hand side of the rules appear as substrings, that is , where , , , and are strings.

In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a nonempty string that is strictly smaller in lexicographic order than all of its rotations. Lyndon words are named after mathematician Roger Lyndon, who investigated them in 1954, calling them standard lexicographic sequences. Anatoly Shirshov introduced Lyndon words in 1953 calling them regular words. Lyndon words are a special case of Hall words; almost all properties of Lyndon words are shared by Hall words.

In formal language theory, an alphabet is a non-empty set of symbols/glyphs, typically thought of as representing letters, characters, or digits but among other possibilities the "symbols" could also be a set of phonemes. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and depending on its purpose maybe be finite, countable, or even uncountable.

<span class="mw-page-title-main">Thue number</span>

In the mathematical area of graph theory, the Thue number of a graph is a variation of the chromatic index, defined by Alon et al. (2002) and named after mathematician Axel Thue, who studied the squarefree words used to define this number.

In mathematics and theoretical computer science, an automatic sequence (also called a k-automatic sequence or a k-recognizable sequence when one wants to indicate that the base of the numerals used is k) is an infinite sequence of terms characterized by a finite automaton. The n-th term of an automatic sequence a(n) is a mapping of the final state reached in a finite automaton accepting the digits of the number n in some fixed base k.

Combinatorics on words is a fairly new field of mathematics, branching from combinatorics, which focuses on the study of words and formal languages. The subject looks at letters or symbols, and the sequences they form. Combinatorics on words affects various areas of mathematical study, including algebra and computer science. There have been a wide range of contributions to the field. Some of the first work was on square-free words by Axel Thue in the early 1900s. He and colleagues observed patterns within words and tried to explain them. As time went on, combinatorics on words became useful in the study of algorithms and coding. It led to developments in abstract algebra and answering open questions.

In computer science, the complexity function of a word or string is the function that counts the number of distinct factors of that string. More generally, the complexity function of a formal language counts the number of distinct words of given length.

In mathematics and computer science, a morphic word or substitutive word is an infinite sequence of symbols which is constructed from a particular class of endomorphism of a free monoid.

In mathematics and theoretical computer science, a pattern is an unavoidable pattern if it is unavoidable on any finite alphabet.

In mathematics and computer science, the critical exponent of a finite or infinite sequence of symbols over a finite alphabet describes the largest number of times a contiguous subsequence can be repeated. For example, the critical exponent of "Mississippi" is 7/3, as it contains the string "ississi", which is of length 7 and period 3.

Dejean's theorem is a statement about repetitions in infinite strings of symbols. It belongs to the field of combinatorics on words; it was conjectured in 1972 by Françoise Dejean and proven in 2009 by Currie and Rampersad and, independently, by Rao.

References