In combinatorics, a square-free word is a word (a sequence of symbols) that does not contain any squares. A square is a word of the form XX, where X is not empty. Thus, a square-free word can also be defined as a word that avoids the pattern XX.
Over a binary alphabet , the only square-free words are the empty word , and .
Over a ternary alphabet , there are infinitely many square-free words. It is possible to count the number of ternary square-free words of length n.
n | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 6 | 12 | 18 | 30 | 42 | 60 | 78 | 108 | 144 | 204 | 264 |
This number is bounded by , where . [2] The upper bound on can be found via Fekete's Lemma and approximation by automata. The lower bound can be found by finding a substitution that preserves square-freeness. [2]
Since there are infinitely many square-free words over three-letter alphabets, this implies there are also infinitely many square-free words over an alphabet with more than three letters.
The following table shows the exact growth rate of the k-ary square-free words, rounded off to 7 digits after the decimal point, for k in the range from 4 to 15: [2]
alphabet size (k) | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|
growth rate | 2.6215080 | 3.7325386 | 4.7914069 | 5.8284661 | 6.8541173 | 7.8729902 |
alphabet size (k) | 10 | 11 | 12 | 13 | 14 | 15 |
growth rate | 8.8874856 | 9.8989813 | 10.9083279 | 11.9160804 | 12.9226167 | 13.9282035 |
Consider a map from to A, where A is an alphabet and is called a 2-dimensional word. Let be the entry . A word is a line of if there exists such that , and for . [3]
Carpi [4] proves that there exists a 2-dimensional word over a 16-letter alphabet such that every line of is square-free. A computer search shows that there are no 2-dimensional words over a 7-letter alphabet, such that every line of is square-free.
Shur [5] proposes an algorithm called R2F (random-t(w)o-free) that can generate a square-free word of length n over any alphabet with three or more letters. This algorithm is based on a modification of entropy compression: it randomly selects letters from a k-letter alphabet to generate a -ary square-free word.
algorithm R2F isinput: alphabet size , word length output: a -ary square-free word wof length n. (Note that is the alphabet with letters .) (For a word , is the permutation of such that a precedes b in if the right most position of a in w is to the right of the rightmost position of b in w. For example, has .) choose in uniformly at random setto followed by all other letters of in increasing order set the number N of iterations to 0 whiledo choose j in uniformly at random append to the end of w update shifting the first j elements to the right and setting increment N by 1ifw ends with a square of rank rthen delete the last r letters of wreturnw
Every (k+1)-ary square-free word can be the output of Algorithm R2F, because on each iteration it can append any letter except for the last letter of w.
The expected number of random k-ary letters used by Algorithm R2F to construct a -ary square-free word of length n isNote that there exists an algorithm that can verify the square-freeness of a word of length n in time. Apostolico and Preparata [6] give an algorithm using suffix trees. Crochemore [7] uses partitioning in his algorithm. Main and Lorentz [8] provide an algorithm based on the divide-and-conquer method. A naive implementation may require time to verify the square-freeness of a word of length n.
There exist infinitely long square-free words in any alphabet with three or more letters, as proved by Axel Thue. [9]
One example of an infinite square-free word over an alphabet of size 3 is the word over the alphabet obtained by taking the first difference of the Thue–Morse sequence. [9] That is, from the Thue–Morse sequence
one forms a new sequence in which each term is the difference of two consecutive terms of the Thue–Morse sequence. The resulting square-free word is
Another example found by John Leech [10] is defined recursively over the alphabet . Let be any square-free word starting with the letter 0. Define the words recursively as follows: the word is obtained from by replacing each 0 in with 0121021201210, each 1 with 1202102012021, and each 2 with 2010210120102. It is possible to prove that the sequence converges to the infinite square-free word
Infinite square-free words can be generated by square-free morphism. A morphism is called square-free if the image of every square-free word is square-free. A morphism is called k–square-free if the image of every square-free word of length k is square-free.
Crochemore [11] proves that a uniform morphism h is square-free if and only if it is 3-square-free. In other words, h is square-free if and only if is square-free for all square-free w of length 3. It is possible to find a square-free morphism by brute-force search.
algorithm square-free_morphism isoutput: a square-free morphism with the lowest possible rank k. setwhile True dosetk_sf_wordsto the list of all square-free words of length k over a ternary alphabet for eachink_sf_wordsdofor eachink_sf_wordsdofor eachink_sf_wordsdoifthenbreak from the current loop (advance to next ) ifandthenifis square-free for all square-free w of length 3thenreturn increment k by 1
Over a ternary alphabet, there are exactly 144 uniform square-free morphisms of rank 11 and no uniform square-free morphisms with a lower rank than 11.
To obtain an infinite square-free words, start with any square-free word such as 0, and successively apply a square-free morphism h to it. The resulting words preserve the property of square-freeness. For example, let h be a square-free morphism, then as , is an infinite square-free word.
Note that, if a morphism over a ternary alphabet is not uniform, then this morphism is square-free if and only if it is 5-square-free. [11]
Over a ternary alphabet, a square-free word of length more than 13 contains all the square-free two-letter combinations. [12]
This can be proved by constructing a square-free word without the two-letter combination ab. As a result, bcbacbcacbaca is the longest square-free word without the combination ab and its length is equal to 13.
Note that over a more than three-letter alphabet there are square-free words of any length without an arbitrary two-letter combination.
Over a ternary alphabet, a square-free word of length more than 36 contains all the square-free three-letter combinations. [12]
However, there are square-free words of any length without the three-letter combination aba.
Note that over a more than three-letter alphabet there are square-free words of any length without an arbitrary three-letter combination.
The density of a letter a in a finite word w is defined as where is the number of occurrences of a in and is the length of the word. The density of a letter a in an infinite word is where is the prefix of the word w of length l. [13]
The minimal density of a letter a in an infinite ternary square-free word is equal to . [13]
The maximum density of a letter a in an infinite ternary square-free word is equal to . [14]
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.
In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.
In mathematics, a real number is said to be simply normal in an integer base b if its infinite sequence of digits is distributed uniformly in the sense that each of the b digit values has the same natural density 1/b. A number is said to be normal in base b if, for every positive integer n, all possible strings n digits long have density b−n.
In statistics, the Wishart distribution is a generalization of the gamma distribution to multiple dimensions. It is named in honor of John Wishart, who first formulated the distribution in 1928. Other names include Wishart ensemble, or Wishart–Laguerre ensemble, or LOE, LUE, LSE.
In mathematics, computer science, and logic, rewriting covers a wide range of methods of replacing subterms of a formula with other terms. Such methods may be achieved by rewriting systems. In their most basic form, they consist of a set of objects, plus relations on how to transform those objects.
In mathematics, the Thue–Morse or Prouhet–Thue–Morse sequence is the binary sequence that can be obtained by starting with 0 and successively appending the Boolean complement of the sequence obtained thus far. It is sometimes called the fair share sequence because of its applications to fair division or parity sequence. The first few steps of this procedure yield the strings 0, 01, 0110, 01101001, 0110100110010110, and so on, which are the prefixes of the Thue–Morse sequence. The full sequence begins:
In abstract algebra, the free monoid on a set is the monoid whose elements are all the finite sequences of zero or more elements from that set, with string concatenation as the monoid operation and with the unique sequence of zero elements, often called the empty string and denoted by ε or λ, as the identity element. The free monoid on a set A is usually denoted A∗. The free semigroup on A is the subsemigroup of A∗ containing all elements except the empty string. It is usually denoted A+.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
In mathematics, a Sturmian word, named after Jacques Charles François Sturm, is a certain kind of infinitely long sequence of characters. Such a sequence can be generated by considering a game of English billiards on a square table. The struck ball will successively hit the vertical and horizontal edges labelled 0 and 1 generating a sequence of letters. This sequence is a Sturmian word.
Geometry processing is an area of research that uses concepts from applied mathematics, computer science and engineering to design efficient algorithms for the acquisition, reconstruction, analysis, manipulation, simulation and transmission of complex 3D models. As the name implies, many of the concepts, data structures, and algorithms are directly analogous to signal processing and image processing. For example, where image smoothing might convolve an intensity signal with a blur kernel formed using the Laplace operator, geometric smoothing might be achieved by convolving a surface geometry with a blur kernel formed using the Laplace-Beltrami operator.
In theoretical computer science and mathematical logic a string rewriting system (SRS), historically called a semi-Thue system, is a rewriting system over strings from a alphabet. Given a binary relation between fixed strings over the alphabet, called rewrite rules, denoted by , an SRS extends the rewriting relation to all strings in which the left- and right-hand side of the rules appear as substrings, that is , where , , , and are strings.
In mathematics, Higman's lemma states that the set of finite sequences over a finite alphabet , as partially ordered by the subsequence relation, is well-quasi-ordered. That is, if is an infinite sequence of words over a finite alphabet , then there exist indices such that can be obtained from by deleting some symbols. More generally this remains true when is not necessarily finite, but is itself well-quasi-ordered, and the subsequence relation is generalized into an "embedding" relation that allows the replacement of symbols by earlier symbols in the well-quasi-ordering of . This is a special case of the later Kruskal's tree theorem. It is named after Graham Higman, who published it in 1952.
In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/characters/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite, countable, or even uncountable.
In mathematics and theoretical computer science, an automatic sequence (also called a k-automatic sequence or a k-recognizable sequence when one wants to indicate that the base of the numerals used is k) is an infinite sequence of terms characterized by a finite automaton. The n-th term of an automatic sequence a(n) is a mapping of the final state reached in a finite automaton accepting the digits of the number n in some fixed base k.
Combinatorics on words is a fairly new field of mathematics, branching from combinatorics, which focuses on the study of words and formal languages. The subject looks at letters or symbols, and the sequences they form. Combinatorics on words affects various areas of mathematical study, including algebra and computer science. There have been a wide range of contributions to the field. Some of the first work was on square-free words by Axel Thue in the early 1900s. He and colleagues observed patterns within words and tried to explain them. As time went on, combinatorics on words became useful in the study of algorithms and coding. It led to developments in abstract algebra and answering open questions.
In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.
In mathematics and computer science, a morphic word or substitutive word is an infinite sequence of symbols which is constructed from a particular class of endomorphism of a free monoid.
In mathematics and theoretical computer science, a pattern is an unavoidable pattern if it is unavoidable on any finite alphabet.
In mathematics and computer science, the critical exponent of a finite or infinite sequence of symbols over a finite alphabet describes the largest number of times a contiguous subsequence can be repeated. For example, the critical exponent of "Mississippi" is 7/3, as it contains the string "ississi", which is of length 7 and period 3.
Dejean's theorem is a statement about repetitions in infinite strings of symbols. It belongs to the field of combinatorics on words; it was conjectured in 1972 by Françoise Dejean and proven in 2009 by Currie and Rampersad and, independently, by Rao.