Lexicographically minimal string rotation

In computer science, the lexicographically minimal string rotation or lexicographically least circular substring is the problem of finding the rotation of a string possessing the lowest lexicographical order of all such rotations. For example, the lexicographically minimal rotation of "bbaaccaadd" would be "aaccaaddbb". It is possible for a string to have multiple lexicographically minimal rotations, but for most applications this does not matter as the rotations must be equivalent. Finding the lexicographically minimal rotation is useful as a way of normalizing strings. If the strings represent potentially isomorphic structures such as graphs, normalizing in this way allows for simple equality checking. [1] A common implementation trick when dealing with circular strings is to concatenate the string to itself instead of having to perform modular arithmetic on the string indices.

Algorithms

The Naive Algorithm

The naive algorithm for finding the lexicographically minimal rotation of a string is to iterate through successive rotations while keeping track of the lexicographically smallest rotation encountered so far. If the string is of length n, this algorithm runs in O(n²) time in the worst case.
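
A minimal sketch of this approach (the function name is illustrative), using the doubling trick mentioned in the introduction:

def least_rotation_naive(s: str) -> int:
    # Compare every rotation against the best one seen so far.
    # Slicing the doubled string avoids modular arithmetic on the indices.
    n = len(s)
    doubled = s + s
    best = 0
    for i in range(1, n):
        if doubled[i:i + n] < doubled[best:best + n]:
            best = i
    return best

Each of the n candidate rotations may require comparing up to n characters, which gives the quadratic worst case.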

Booth's Algorithm

An efficient algorithm was proposed by Booth (1980). [2] The algorithm uses a modified preprocessing function from the Knuth–Morris–Pratt string search algorithm. The failure function for the string is computed as normal, but the string is rotated during the computation, so some indices must be computed more than once as they wrap around. Once all indices of the failure function have been computed without the string rotating again, the lexicographically minimal rotation is known to have been found and its starting index is returned. The correctness of the algorithm is somewhat difficult to understand, but it is easy to implement.

def least_rotation(s: str) -> int:
    """Booth's lexicographically minimal string rotation algorithm."""
    n = len(s)
    f = [-1] * (2 * n)  # failure function over the doubled string
    k = 0               # starting index of the least rotation found so far
    for j in range(1, 2 * n):
        i = f[j - k - 1]
        while i != -1 and s[j % n] != s[(k + i + 1) % n]:
            if s[j % n] < s[(k + i + 1) % n]:
                k = j - i - 1
            i = f[i]
        if i == -1 and s[j % n] != s[(k + i + 1) % n]:
            if s[j % n] < s[(k + i + 1) % n]:
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k

Of interest is that removing all lines of code which modify the value of k results in the original Knuth–Morris–Pratt preprocessing function, as k (representing the rotation) will remain zero. Booth's algorithm runs in O(n) time, where n is the length of the string. The algorithm performs at most 3n character comparisons in the worst case, and requires auxiliary memory of length n to hold the failure function table.
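
For example, on the string from the introduction, the function above returns the starting index of the minimal rotation, which can then be read off the doubled string:

s = "bbaaccaadd"
k = least_rotation(s)          # 2
print((s + s)[k:k + len(s)])   # prints "aaccaaddbb"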

Shiloach's Fast Canonization Algorithm

Shiloach (1981) [3] proposed an algorithm improving on Booth's result in terms of performance. It was observed that if there are q equivalent lexicographically minimal rotations of a string of length n, then the string must consist of q equal substrings of length d=n/q. The algorithm requires only n + d/2 comparisons and constant space in the worst case.
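
As a brute-force illustration of this observation (not Shiloach's algorithm itself):

s = "abcabcabc"                                # n = 9
n = len(s)
rotations = [s[i:] + s[:i] for i in range(n)]
m = min(rotations)
q = rotations.count(m)                         # 3 equivalent minimal rotations
d = n // q                                     # d = 3
assert s == s[:d] * q                          # the string is q copies of a length-d block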

The algorithm is divided into two phases. The first phase is a quick sieve which rules out indices that are obviously not starting locations for the lexicographically minimal rotation. The second phase then finds the lexicographically minimal rotation start index from the indices which remain.

Duval's Lyndon Factorization Algorithm

Duval (1983) [4] proposed an efficient algorithm involving the factorization of the string into its component Lyndon words, which runs in linear time with a constant memory requirement.
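
A sketch of the commonly used formulation of this idea, which scans the doubled string while maintaining a candidate Lyndon prefix (the identifier names are illustrative, and this follows the standard textbook presentation rather than Duval's original paper):

def least_rotation_duval(s: str) -> int:
    n = len(s)
    doubled = s + s
    i, ans = 0, 0
    while i < n:
        ans = i          # current candidate start of the minimal rotation
        j, k = i + 1, i
        # Extend while the characters keep the candidate prefix minimal.
        while j < 2 * n and doubled[k] <= doubled[j]:
            if doubled[k] < doubled[j]:
                k = i    # restart the comparison from the candidate start
            else:
                k += 1
            j += 1
        # Skip past the whole repeated block; no rotation starting inside it
        # can be smaller than the current candidate.
        while i <= k:
            i += j - k
    return ans

The scan is linear in n, and only a constant number of indices are stored.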

Variants

Shiloach (1979) [5] proposed an algorithm to efficiently compare two circular strings for equality without a normalization requirement. An additional application which arises from the algorithm is the fast generation of certain chemical structures without repetitions.

See also

Related Research Articles

Hash function

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

String (computer science)

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

Trie

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

The Burrows–Wheeler transform rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.
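
For illustration, a minimal quadratic sketch that builds the transform from the sorted rotations of the input with an appended sentinel (the sentinel character and function name are illustrative; practical implementations use the suffix-array construction instead):

def bwt(s: str, sentinel: str = "$") -> str:
    # Append a sentinel assumed to sort before every other character,
    # sort all rotations, and read off the last column.
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

# bwt("banana") == "annb$aa"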

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

Algorithmic probability

In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. It is used in inductive inference theory and analyses of algorithms. In his general theory of inductive inference, Solomonoff uses the method together with Bayes' rule to obtain probabilities of prediction for an algorithm's future outputs.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer science, the Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin (1987) that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

Suffix tree

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations required to change one word into the other.

Approximate string matching

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

In mathematics, in the areas of combinatorics and computer science, a Lyndon word is a nonempty string that is strictly smaller in lexicographic order than all of its rotations. Lyndon words are named after mathematician Roger Lyndon, who investigated them in 1954, calling them standard lexicographic sequences. Anatoly Shirshov introduced Lyndon words in 1953 calling them regular words. Lyndon words are a special case of Hall words; almost all properties of Lyndon words are shared by Hall words.
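
For instance, a direct (quadratic) check of this defining property, with an illustrative function name:

def is_lyndon(s: str) -> bool:
    # A nonempty string is a Lyndon word if and only if it is strictly
    # smaller than each of its proper rotations.
    n = len(s)
    return n > 0 and all(s < s[i:] + s[:i] for i in range(1, n))

# is_lyndon("aab") is True; is_lyndon("abab") is False, since "abab" equals its own rotation by 2.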

Necklace (combinatorics)

In combinatorics, a k-ary necklace of length n is an equivalence class of n-character strings over an alphabet of size k, taking all rotations as equivalent. It represents a structure with n circularly connected beads which have k available colors.
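
A common way to pick a canonical representative of such an equivalence class is exactly the lexicographically minimal rotation discussed above; a brute-force sketch (the function name is illustrative):

def necklace_canonical(s: str) -> str:
    # Two strings represent the same necklace iff their canonical forms match.
    n = len(s)
    return min(s[i:] + s[:i] for i in range(n))

# necklace_canonical("dba") == necklace_canonical("adb") == "adb"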

In coding theory, the Sardinas–Patterson algorithm is a classical algorithm for determining in polynomial time whether a given variable-length code is uniquely decodable, named after August Albert Sardinas and George W. Patterson, who published it in 1953. The algorithm carries out a systematic search for a string which admits two different decompositions into codewords. As Knuth reports, the algorithm was rediscovered about ten years later in 1963 by Floyd, despite the fact that it was at the time already well known in coding theory.

In computer science, the Wagner–Fischer algorithm is a dynamic programming algorithm that computes the edit distance between two strings of characters.

In computer science, the two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991. It takes a pattern of size m, called a “needle”, preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string, taking only linear time O(n) with n being the haystack's length.

References

  1. Booth, Kellogg S.; Colbourn, Charles J. (1980). "Linear Time Automorphism Algorithms for Trees, Interval Graphs, and Planar Graphs". SIAM Journal on Computing. Society for Industrial and Applied Mathematics. 10 (1): 203–225. doi:10.1137/0210015. ISSN 0097-5397.
  2. Booth, Kellogg S. (1980). "Lexicographically least circular substrings". Information Processing Letters. Elsevier. 10 (4–5): 240–242. doi:10.1016/0020-0190(80)90149-0. ISSN 0020-0190.
  3. Shiloach, Yossi (1981). "Fast canonization of circular strings". Journal of Algorithms. Elsevier. 2 (2): 107–121. doi:10.1016/0196-6774(81)90013-4. ISSN 0196-6774.
  4. Duval, Jean Pierre (1983). "Factorizing words over an ordered alphabet". Journal of Algorithms. Elsevier. 4 (4): 363–381. doi:10.1016/0196-6774(83)90017-2. ISSN 0196-6774.
  5. Shiloach, Yossi (1979). "A fast equivalence-checking algorithm for circular lists". Information Processing Letters. Elsevier. 8 (5): 236–238. doi:10.1016/0020-0190(79)90114-5. ISSN 0020-0190.