Suffix automaton  

Type  Substring index  
Invented  1983  
Invented by  Anselm Blumer; Janet Blumer; Andrzej Ehrenfeucht; David Haussler; Ross McConnell  
Time complexity in big O notation  

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string. Formally speaking, a suffix automaton is defined by the following set of properties:
Speaking in the terms of automata theory, a suffix automaton is the minimal partial deterministic finite automaton that recognizes the set of suffixes of a given string . The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any deterministic acyclic finite state automaton.
Suffix automata were introduced in 1983 by a group of scientists from the University of Denver and the University of Colorado Boulder. They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string having length at least two characters has at most states and at most transitions. Further works have shown a close connection between suffix automata and suffix trees, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc.
Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings.
The concept of suffix automaton was introduced in 1983^{ [1] } by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner,^{ [2] } Vaughan Pratt^{ [3] } and Anatol Slissenko.^{ [4] } In their initial work, Blumer et al. showed a suffix automaton built for the string of length greater than has at most states and at most transitions, and suggested a linear algorithm for automaton construction.^{ [5] }
In 1983, MuTian Chen and Joel Seiferas independently showed that Weiner's 1973 suffixtree construction algorithm^{ [2] } while building a suffix tree of the string constructs a suffix automaton of the reversed string as an auxiliary structure.^{ [6] } In 1987, Blumer et al. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG).^{ [7] } In 1997, Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction.^{ [1] } In 2001, Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a trie.^{ [8] }
Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular:^{ [9] }
Formally, deterministic finite automaton is determined by 5tuple , where:^{ [10] }
Most commonly, deterministic finite automaton is represented as a directed graph ("diagram") such that:^{ [10] }
In terms of its diagram, the automaton recognizes the word only if there is a path from the initial vertex to some final vertex such that concatenation of characters on this path forms . The set of words recognized by an automaton forms a language that is set to be recognized by the automaton. In these terms, the language recognized by a suffix automaton of is the language of its (possibly empty) suffixes.^{ [9] }
"Right context" of the word with respect to language is a set that is a set of words such that their concatenation with forms a word from . Right contexts induce a natural equivalence relation on the set of all words. If language is recognized by some deterministic finite automaton, there exists unique up to isomorphism automaton that recognizes the same language and has the minimum possible number of states. Such an automaton is called a minimal automaton for the given language . Myhill–Nerode theorem allows it to define it explicitly in terms of right contexts:^{ [11] }^{ [12] }
Theorem — Minimal automaton recognizing language over the alphabet may be explicitly defined in the following way:
In these terms, "suffix automaton" is the minimal deterministic finite automaton recognizing the language of suffixes of the word . Right context of the word with respect to this language constitutes of words , such that is a suffix of . It allows one to formulate the following lemma defining bijection between right context of the word and the set of right positions of its occurrences in :^{ [13] }^{ [14] }
Theorem — Let be the set of right positions of occurrences of in .
There is a following bijection between and :
For example, for the word and its subword , it holds and . Informally, is formed by words that follow occurrences of to the end of and is formed by right positions of those occurrences. In this example, the element corresponds with the word while the word corresponds with the element .
It implies several structure properties of suffix automaton states. Let , then:^{ [14] }
Any state of the suffix automaton recognizes some continuous chain of nested suffixes of the longest word recognized by this state.^{ [14] }
"Left extension" of the string is the longest string that has the same right context as . Length of the longest string recognized by is denoted by . It holds:^{ [15] }
Theorem — Left extension of may be represented as , where is the longest word such that any occurrence of in is preceded by .
"Suffix link" of the state is the pointer to the state that contains the largest suffix of that is not recognized by .
In this terms it can be said recognizes exactly all suffixes of that is longer than and not longer than . It also holds:^{ [15] }
A "prefix tree" (or "trie") is a rooted directed tree in which arcs are marked by characters in such a way no vertex of such tree has two outgoing arcs marked with the same character. Some vertices in trie are marked as final. Trie is said to recognize a set of words defined by paths from its root to final vertices. In this way prefix trees are a special kind of deterministic finite automata if you perceive its root as an initial vertex.^{ [16] } The "suffix trie" of the word is a prefix tree recognizing a set of its suffixes. "A suffix tree" is a tree obtained from a suffix trie via the compaction procedure, during which consequent edges are merged if the degree of the vertex between them is equal to two.^{ [15] }
By its definition, a suffix automaton can be obtained via minimization of the suffix trie. It may be shown that a compacted suffix automaton is obtained by both minimization of the suffix tree (if one assumes each string on the edge of the suffix tree is a solid character from the alphabet) and compaction of the suffix automaton.^{ [17] } Besides this connection between the suffix tree and the suffix automaton of the same string there is as well a connection between the suffix automaton of the string and the suffix tree of the reversed string .^{ [18] }
Similarly to right contexts one may introduce "left contexts" , "right extensions" corresponding to the longest string having same left context as and the equivalence relation . If one considers right extensions with respect to the language of "prefixes" of the string it may be obtained:^{ [15] }
Theorem — Suffix tree of the string may be defined explicitly in the following way:
Here triplet means there is an edge from to with the string written on it
, which implies the suffix link tree of the string and the suffix tree of the string are isomorphic:^{ [18] }
Suffix structures of words "abbcbc" and "cbcbba" 


Similarly to the case of left extensions, the following lemma holds for right extensions:^{ [15] }
Theorem — Right extension of the string may be represented as , where is the longest word such that every occurrence of in is succeeded by .
A suffix automaton of the string of length has at most states and at most transitions. These bounds are reached on strings and correspondingly.^{ [13] } This may be formulated in a stricter way as where and are the numbers of transitions and states in automaton correspondingly.^{ [14] }
Maximal suffix automata 


Initially the automaton only consists of a single state corresponding to the empty word, then characters of the string are added one by one and the automaton is rebuilt on each step incrementally.^{ [19] }
After a new character is appended to the string, some equivalence classes are altered. Let be the right context of with respect to the language of suffixes. Then the transition from to after is appended to is defined by lemma:^{ [14] }
Theorem — Let be some words over and be some character from this alphabet. Then there is a following correspondence between and :
After adding to the current word the right context of may change significantly only if is a suffix of . It implies equivalence relation is a refinement of . In other words, if , then . After the addition of a new character at most two equivalence classes of will be split and each of them may split in at most two new classes. First, equivalence class corresponding to empty right context is always split into two equivalence classes, one of them corresponding to itself and having as a right context. This new equivalence class contains exactly and all its suffixes that did not occur in , as the right context of such words was empty before and contains only empty word now.^{ [14] }
Given the correspondence between states of the suffix automaton and vertices of the suffix tree, it is possible to find out the second state that may possibly split after a new character is appended. The transition from to corresponds to the transition from to in the reversed string. In terms of suffix trees it corresponds to the insertion of the new longest suffix into the suffix tree of . At most two new vertices may be formed after this insertion: one of them corresponding to , while the other one corresponds to its direct ancestor if there was a branching. Returning to suffix automata, it means the first new state recognizes and the second one (if there is a second new state) is its suffix link. It may be stated as a lemma:^{ [14] }
Theorem — Let , be some word and character over . Also let be the longest suffix of , which occurs in , and let . Then for any substrings of it holds:
It implies that if (for example, when didn't occur in at all and ), then only the equivalence class corresponding to the empty right context is split.^{ [14] }
Besides suffix links it is also needed to define final states of the automaton. It follows from structure properties that all suffixes of a word recognized by are recognized by some vertex on suffix path of . Namely, suffixes with length greater than lie in , suffixes with length greater than but not greater than lie in and so on. Thus if the state recognizing is denoted by , then all final states (that is, recognizing suffixes of ) form up the sequence .^{ [19] }
After the character is appended to possible new states of suffix automaton are and . Suffix link from goes to and from it goes to . Words from occur in only as its suffixes therefore there should be no transitions at all from while transitions to it should go from suffixes of having length at least and be marked with the character . State is formed by subset of , thus transitions from should be same as from . Meanwhile, transitions leading to should go from suffixes of having length less than and at least , as such transitions have lead to before and corresponded to seceded part of this state. States corresponding to these suffixes may be determined via traversal of suffix link path for .^{ [19] }
Theoretical results above lead to the following algorithm that takes character and rebuilds the suffix automaton of into the suffix automaton of :^{ [19] }
The whole procedure is described by the following pseudocode:^{ [19] }
functionadd_letter(x): definep = lastassignlast = new_state()assignlen(last) = len(p) + 1whileδ(p, x) is undefined: assignδ(p, x) = last, p = link(p)defineq = δ(p, x)ifq = last: assignlink(last) = q_{0}else iflen(q) = len(p) + 1: assignlink(last) = qelse: definecl = new_state()assignlen(cl) = len(p) + 1assignδ(cl) = δ(q), link(cl) = link(q)assignlink(last) = link(q) = clwhileδ(p, x) = q: assignδ(p, x) = cl, p = link(p)
Here is the initial state of the automaton and is a function creating new state for it. It is assumed , , and are stored as global variables.^{ [19] }
Complexity of the algorithm may vary depending on the underlying structure used to store transitions of the automaton. It may be implemented in with memory overhead or in with memory overhead if one assumes that memory allocation is done in . To obtain such complexity, one has to use the methods of amortized analysis. The value of strictly reduces with each iteration of the cycle while it may only increase by as much as one after the first iteration of the cycle on the next add_letter call. Overall value of never exceeds and it is only increased by one between iterations of appending new letters that suggest total complexity is at most linear as well. The linearity of the second cycle is shown in a similar way.^{ [19] }
The suffix automaton is closely related to other suffix structures and substring indices. Given a suffix automaton of a specific string one may construct its suffix tree via compacting and recursive traversal in linear time.^{ [20] } Similar transforms are possible in both directions to switch between the suffix automaton of and the suffix tree of reversed string .^{ [18] } Other than this several generalizations were developed to construct an automaton for the set of strings given by trie,^{ [8] } compacted suffix automation (CDAWG),^{ [7] } to maintain the structure of the automaton on the sliding window,^{ [21] } and to construct it in a bidirectional way, supporting the insertion of a characters to both the beginning and the end of the string.^{ [22] }
As was already mentioned above, a compacted suffix automaton is obtained via both compaction of a regular suffix automaton (by removing states which are nonfinal and have exactly one outgoing arc) and the minimization of a suffix tree. Similarly to the regular suffix automaton, states of compacted suffix automaton may be defined in explicit manner. A twoway extension of a word is the longest word , such that every occurrence of in is preceded by and succeeded by . In terms of left and right extensions it means that twoway extension is the left extension of the right extension or, which is equivalent, the right extension of the left extension, that is . In terms of twoway extensions compacted automaton is defined as follows:^{ [15] }
Theorem — Compacted suffix automaton of the word is defined by a pair , where:
Twoway extensions induce an equivalence relation which defines the set of words recognized by the same state of compacted automaton. This equivalence relation is a transitive closure of the relation defined by , which highlights the fact that a compacted automaton may be obtained by both gluing suffix tree vertices equivalent via relation (minimization of the suffix tree) and gluing suffix automaton states equivalent via relation (compaction of suffix automaton).^{ [23] } If words and have same right extensions, and words and have same left extensions, then cumulatively all strings , and have same twoway extensions. At the same time it may happen that neither left nor right extensions of and coincide. As an example one may take , and , for which left and right extensions are as follows: , but and . That being said, while equivalence relations of oneway extensions were formed by some continuous chain of nested prefixes or suffixes, bidirectional extensions equivalence relations are more complex and the only thing one may conclude for sure is that strings with the same twoway extension are substrings of the longest string having the same twoway extension, but it may even happen that they don't have any nonempty substring in common. The total number of equivalence classes for this relation does not exceed which implies that compacted suffix automaton of the string having length has at most states. The amount of transitions in such automaton is at most .^{ [15] }
Consider a set of words . It is possible to construct a generalization of suffix automaton that would recognize the language formed up by suffixes of all words from the set. Constraints for the number of states and transitions in such automaton would stay the same as for a singleword automaton if you put .^{ [23] } The algorithm is similar to the construction of singleword automaton except instead of state, function add_letter would work with the state corresponding to the word assuming the transition from the set of words to the set .^{ [24] }^{ [25] }
This idea is further generalized to the case when is not given explicitly but instead is given by a prefix tree with vertices. Mohri et al. showed such an automaton would have at most and may be constructed in linear time from its size. At the same time, the number of transitions in such automaton may reach , for example for the set of words over the alphabet the total length of workds is equal to , the number of vertices in corresponding suffix trie is equal to and corresponding suffix automaton is formed of states and transitions. Algorithm suggested by Mohri mainly repeats the generic algorithm for building automaton of several strings but instead of growing words one by one, it traverses the trie in a breadthfirst search order and append new characters as it meet them in the traversal, which guarantees amortized linear complexity.^{ [26] }
Some compression algorithms, such as LZ77 and RLE may benefit from storing suffix automaton or similar structure not for the whole string but for only last its characters while the string is updated. This is because compressing data is usually expressively large and using memory is undesirable. In 1985, Janet Blumer developed an algorithm to maintain a suffix automaton on a sliding window of size in worstcase and on average, assuming characters are distributed independently and uniformly. She also showed complexity cannot be improved: if one considers words construed as a concatenation of several words, where , then the number of states for the window of size would frequently change with jumps of order , which renders even theoretical improvement of for regular suffix automata impossible.^{ [27] }
The same should be true for the suffix tree because its vertices correspond to states of the suffix automaton of the reversed string but this problem may be resolved by not explicitly storing every vertex corresponding to the suffix of the whole string, thus only storing vertices with at least two outgoing edges. A variation of McCreight's suffix tree construction algorithm for this task was suggested in 1989 by Edward Fiala and Daniel Greene;^{ [28] } several years later a similar result was obtained with the variation of Ukkonen's algorithm by Jesper Larsson.^{ [29] }^{ [30] } The existence of such an algorithm, for compacted suffix automaton that absorbs some properties of both suffix trees and suffix automata, was an open question for a long time until it was discovered by Martin Senft and Tomasz Dvorak in 2008, that it is impossible if the alphabet's size is at least two.^{ [31] }
One way to overcome this obstacle is to allow window width to vary a bit while staying . It may be achieved by an approximate algorithm suggested by Inenaga et al. in 2004. The window for which suffix automaton is built in this algorithm is not guaranteed to be of length but it is guaranteed to be at least and at most while providing linear overall complexity of the algorithm.^{ [32] }
Suffix automaton of the string may be used to solve such problems as:^{ [33] }^{ [34] }
It is assumed here that is given on the input after suffix automaton of is constructed.^{ [33] }
Suffix automata are also used in data compression,^{ [35] } music retrieval^{ [36] }^{ [37] } and matching on genome sequences.^{ [38] }
In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.
In quantum field theory, the Dirac spinor is the spinor that describes all known fundamental particles that are fermions, with the possible exception of neutrinos. It appears in the planewave solution to the Dirac equation, and is a certain combination of two Weyl spinors, specifically, a bispinor that transforms "spinorially" under the action of the Lorentz group.
In mathematics, and especially differential geometry and gauge theory, a connection on a fiber bundle is a device that defines a notion of parallel transport on the bundle; that is, a way to "connect" or identify fibers over nearby points. The most common case is that of a linear connection on a vector bundle, for which the notion of parallel transport must be linear. A linear connection is equivalently specified by a covariant derivative, an operator that differentiates sections of the bundle along tangent directions in the base manifold, in such a way that parallel sections have derivative zero. Linear connections generalize, to arbitrary vector bundles, the LeviCivita connection on the tangent bundle of a Riemannian manifold, which gives a standard way to differentiate vector fields. Nonlinear connections generalize this concept to bundles whose fibers are not necessarily linear.
In field theory, the primitive element theorem or Artin's theorem on primitive elements is a result characterizing the finite degree field extensions that possess a primitive element, or simple extensions. It says in that a finite extension is simple if and only if there are only finitely many intermediate fields. In particular, finite separable extensions are simple, more general.
In physics, the Hamilton–Jacobi equation, named after William Rowan Hamilton and Carl Gustav Jacob Jacobi, is an alternative formulation of classical mechanics, equivalent to other formulations such as Newton's laws of motion, Lagrangian mechanics and Hamiltonian mechanics. The Hamilton–Jacobi equation is particularly useful in identifying conserved quantities for mechanical systems, which may be possible even when the mechanical problem itself cannot be solved completely.
In mathematical logic and type theory, the λcube is a framework introduced by Henk Barendregt to investigate the different dimensions in which the calculus of constructions is a generalization of the simply typed λcalculus. Each dimension of the cube corresponds to a new kind of dependency between terms and types. Here, "dependency" refers to the capacity of a term or type to bind a term or type. The respective dimensions of the λcube correspond to:
In quantum field theory, the LSZ reduction formula is a method to calculate Smatrix elements from the timeordered correlation functions of a quantum field theory. It is a step of the path that starts from the Lagrangian of some quantum field theory and leads to prediction of measurable quantities. It is named after the three German physicists Harry Lehmann, Kurt Symanzik and Wolfhart Zimmermann.
In general relativity, the Gibbons–Hawking–York boundary term is a term that needs to be added to the Einstein–Hilbert action when the underlying spacetime manifold has a boundary.
In differential geometry and mathematical physics, a spin connection is a connection on a spinor bundle. It is induced, in a canonical manner, from the affine connection. It can also be regarded as the gauge field generated by local Lorentz transformations. In some canonical formulations of general relativity, a spin connection is defined on spatial slices and can also be regarded as the gauge field generated by local rotations.
In set theory, Silver machines are devices used for bypassing the use of fine structure in proofs of statements holding in L. They were invented by set theorist Jack Silver as a means of proving global square holds in the constructible universe.
In the theory of general relativity, a stress–energy–momentum pseudotensor, such as the Landau–Lifshitz pseudotensor, is an extension of the nongravitational stress–energy tensor that incorporates the energy–momentum of gravity. It allows the energy–momentum of a system of gravitating matter to be defined. In particular it allows the total of matter plus the gravitating energy–momentum to form a conserved current within the framework of general relativity, so that the total energy–momentum crossing the hypersurface of any compact space–time hypervolume vanishes.
Continuous wavelets of compact support can be built, which are related to the beta distribution. The process is derived from probability distributions using blur derivative. These new wavelets have just one cycle, so they are termed unicycle wavelets. They can be viewed as a soft variety of Haar wavelets whose shape is finetuned by two parameters and . Closedform expressions for beta wavelets and scale functions as well as their spectra are derived. Their importance is due to the Central Limit Theorem by Gnedenko and Kolmogorov applied for compactly supported signals.
A yield surface is a fivedimensional surface in the sixdimensional space of stresses. The yield surface is usually convex and the state of stress of inside the yield surface is elastic. When the stress state lies on the surface the material is said to have reached its yield point and the material is said to have become plastic. Further deformation of the material causes the stress state to remain on the yield surface, even though the shape and size of the surface may change as the plastic deformation evolves. This is because stress states that lie outside the yield surface are nonpermissible in rateindependent plasticity, though not in some models of viscoplasticity.
The Birnbaum–Saunders distribution, also known as the fatigue life distribution, is a probability distribution used extensively in reliability applications to model failure times. There are several alternative formulations of this distribution in the literature. It is named after Z. W. Birnbaum and S. C. Saunders.
In crystallography, a fractional coordinate system is a coordinate system in which the edges of the unit cell are used as the basic vectors to describe the positions of atomic nuclei. The unit cell is a parallelepiped defined by the lengths of its edges and angles between them .
The table of chords, created by the Greek astronomer, geometer, and geographer Ptolemy in Egypt during the 2nd century AD, is a trigonometric table in Book I, chapter 11 of Ptolemy's Almagest, a treatise on mathematical astronomy. It is essentially equivalent to a table of values of the sine function. It was the earliest trigonometric table extensive enough for many practical purposes, including those of astronomy. Centuries passed before more extensive trigonometric tables were created. One such table is the Canon Sinuum created at the end of the 16th century.
In statistics, the matrix tdistribution is the generalization of the multivariate tdistribution from vectors to matrices. The matrix tdistribution shares the same relationship with the multivariate tdistribution that the matrix normal distribution shares with the multivariate normal distribution. For example, the matrix tdistribution is the compound distribution that results from sampling from a matrix normal distribution having sampled the covariance matrix of the matrix normal from an inverse Wishart distribution.
In mathematics, the exterior algebra has a rich algebraic structure. The exterior algebra of vector fields on manifolds has an even richer structure driven by the interplay of differentiation on the manifold with the properties of the exterior algebra. This article summarizes several identities in exterior calculus.
In mathematical logic, the intersection type discipline is a branch of type theory encompassing type systems that use the intersection type constructor to assign multiple types to a single term. In particular, if a term can be assigned both the type and the type , then can be assigned the intersection type . Therefore, the intersection type constructor can be used to express finite heterogeneous ad hoc polymorphism . For example, the λterm can be assigned the type in most intersection type systems, assuming for the term variable both the function type and the corresponding argument type .