Substring

Last updated December 21, 2023

In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. [citation needed] For instance, "the best of" is a substring of "It was the best of times". In contrast, "Itwastimes" is a subsequence of "It was the best of times", but not a substring.

Substring

A string $u$ is a substring (or factor)^[1] of a string $t$ if there exist two strings $p$ and $s$ such that $t=pus$. In particular, the empty string is a substring of every string.

Example: The string $u=$ ana is equal to substrings (and subsequences) of $t=$ banana at two different offsets:

banana
 |||||
 ana||
   |||
   ana

The first occurrence is obtained with $p=$ b and $s=$ na, while the second occurrence is obtained with $p=$ ban and $s$ being the empty string.
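The definition $t=pus$ can be checked directly by trying every offset. The following is a minimal sketch, not an efficient search algorithm; the function name `occurrences` is chosen here for illustration only.

```python
def occurrences(u, t):
    """Return every pair (p, s) with t == p + u + s, ordered by offset."""
    pairs = []
    # An occurrence can start at any offset that leaves room for all of u.
    for i in range(len(t) - len(u) + 1):
        if t[i:i + len(u)] == u:
            pairs.append((t[:i], t[i + len(u):]))
    return pairs


print(occurrences("ana", "banana"))  # [('b', 'na'), ('ban', '')]
```

With the empty string as `u`, the function reports one occurrence at each of the `len(t) + 1` offsets, which matches the remark that the empty string is a substring of every string.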

A substring of a string is a prefix of a suffix of the string, and equivalently a suffix of a prefix; for example, nan is a prefix of nana, which is in turn a suffix of banana. If $u$ is a substring of $t$, it is also a subsequence, which is a more general concept. The occurrences of a given pattern in a given string can be found with a string searching algorithm. Finding the longest string which is equal to a substring of two or more strings is known as the longest common substring problem. In the mathematical literature, substrings are also called subwords (in America) or factors (in Europe). [citation needed]
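The difference between the two notions, and the "prefix of a suffix" characterization, can be illustrated with a short sketch. The helper names below are hypothetical; the substring test relies on Python's built-in `in` operator for strings.

```python
def is_substring(u, t):
    """True if u occurs contiguously in t."""
    return u in t


def is_prefix_of_a_suffix(u, t):
    """Same test, phrased as: u is a prefix of some suffix t[i:]."""
    return any(t[i:].startswith(u) for i in range(len(t) + 1))


def is_subsequence(u, t):
    """True if u can be obtained from t by deleting characters."""
    remaining = iter(t)
    # 'ch in remaining' advances the iterator past the first match,
    # so the characters of u must be found in order.
    return all(ch in remaining for ch in u)


text = "It was the best of times"
print(is_substring("the best of", text))    # True
print(is_substring("Itwastimes", text))     # False
print(is_subsequence("Itwastimes", text))   # True
```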

Prefix

A string $p$ is a prefix^[1] of a string $t$ if there exists a string $s$ such that $t=ps$ . A proper prefix of a string is not equal to the string itself;^[2] some sources^[3] in addition restrict a proper prefix to be non-empty. A prefix can be seen as a special case of a substring.

Example: The string ban is equal to a prefix (and substring and subsequence) of the string banana:

banana
|||
ban

The square subset symbol is sometimes used to indicate a prefix, so that $p\sqsubseteq t$ denotes that $p$ is a prefix of $t$ . This defines a binary relation on strings, called the prefix relation, which is a particular kind of prefix order.
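As an illustration of the definitions in this section, the sketch below enumerates prefixes and tests for proper prefixes. The names `prefixes` and `is_proper_prefix` and the `allow_empty` flag are assumptions made for this example; the flag switches between the two conventions for proper prefixes mentioned above.

```python
def prefixes(t):
    """All prefixes of t, from the empty string up to t itself."""
    return [t[:i] for i in range(len(t) + 1)]


def is_proper_prefix(p, t, allow_empty=True):
    """True if p is a prefix of t and p != t.

    With allow_empty=False, the stricter convention is used and the
    empty string does not count as a proper prefix.
    """
    if not t.startswith(p) or p == t:
        return False
    return allow_empty or p != ""


print(prefixes("ban"))                    # ['', 'b', 'ba', 'ban']
print(is_proper_prefix("ban", "banana"))  # True
```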

Suffix

A string $s$ is a suffix^[1] of a string $t$ if there exists a string $p$ such that $t=ps$. A proper suffix of a string is not equal to the string itself. Under a more restrictive interpretation, a proper suffix must also be non-empty. A suffix can be seen as a special case of a substring.

Example: The string nana is equal to a suffix (and substring and subsequence) of the string banana:

banana
  ||||
  nana

A suffix tree for a string is a trie data structure that represents all of its suffixes. Suffix trees have large numbers of applications in string algorithms. The suffix array is a simplified version of this data structure that lists the start positions of the suffixes in alphabetically sorted order; it has many of the same applications.
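To make the suffix array concrete, here is a naive sketch that sorts the start positions by comparing the suffixes themselves, then uses binary search to locate a pattern. It is meant only as an illustration: practical constructions are considerably faster, and the function names are hypothetical.

```python
def suffix_array(t):
    """Start positions of the non-empty suffixes of t, in sorted order."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def find_occurrences(u, t, sa):
    """Offsets at which u occurs in t, using the suffix array sa of t.

    Suffixes that start with u form one contiguous block in sa, so a
    binary search for the start of that block is enough.
    """
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(u)] < u:
            lo = mid + 1
        else:
            hi = mid
    found = []
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(u)] == u:
        found.append(sa[lo])
        lo += 1
    return sorted(found)


sa = suffix_array("banana")
print(sa)                                   # [5, 3, 1, 0, 4, 2]
print(find_occurrences("ana", "banana", sa))  # [1, 3]
```

The sorted order corresponds to the suffixes a, ana, anana, banana, na, nana.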

Border

A border of a string is a string that is both a prefix and a suffix of it, e.g. "bab" is a border of "babab" (and also of "baboon eating a kebab"). [citation needed]
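The borders of a string can be listed by comparing each prefix with the suffix of the same length. The sketch below assumes a convention the text does not fix: it reports only non-empty borders that are shorter than the string itself. The function name is hypothetical.

```python
def borders(t):
    """Non-empty strings, shorter than t, that are both a prefix and a suffix of t."""
    return [t[:k] for k in range(1, len(t)) if t.endswith(t[:k])]


print(borders("babab"))  # ['b', 'bab']
```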

Superstring

A superstring of a finite set $P$ of strings is a single string that contains every string in $P$ as a substring. For example, ${\text{bcclabccefab}}$ is a superstring of $P=\{{\text{abcc}},{\text{efab}},{\text{bccla}}\}$, and ${\text{efabccla}}$ is a shorter one. Concatenating all members of $P$, in arbitrary order, always yields a trivial superstring of $P$. Finding superstrings whose length is as small as possible is a more interesting problem.
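The example above can be verified mechanically. The sketch below checks the superstring property and finds a shortest superstring by brute force: it tries every ordering of the strings and merges neighbours with the largest possible overlap. This is an assumption-laden illustration (exponential running time, hypothetical function names), suitable only for very small sets.

```python
from itertools import permutations


def is_superstring(w, strings):
    """True if every member of strings occurs in w as a substring."""
    return all(p in w for p in strings)


def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b


def shortest_superstring(strings):
    """Brute-force search over all orderings; exponential in len(strings)."""
    unique = set(strings)
    # Strings contained in another member are covered automatically.
    items = [p for p in unique if not any(p != q and p in q for q in unique)]
    if not items:
        return ""
    best = None
    for order in permutations(items):
        w = order[0]
        for nxt in order[1:]:
            w = merge(w, nxt)
        if best is None or len(w) < len(best):
            best = w
    return best


P = {"abcc", "efab", "bccla"}
print(is_superstring("bcclabccefab", P))  # True
print(shortest_superstring(P))            # efabccla
```

For this particular $P$, the search confirms that no superstring is shorter than the 8-character string efabccla.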

A string that contains every possible permutation of a specified character set is called a superpermutation.
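A candidate superpermutation can be tested by checking each permutation in turn. The sketch below assumes that every symbol is a single character and that each permutation must occur as a substring; the function name is hypothetical.

```python
from itertools import permutations


def is_superpermutation(w, symbols):
    """True if every permutation of symbols occurs in w as a substring."""
    return all("".join(p) in w for p in permutations(symbols))


print(is_superpermutation("123121321", "123"))  # True
print(is_superpermutation("123123", "123"))     # False
```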

Related Research Articles

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In mathematics, a subsequence of a given sequence is a sequence that can be derived from the given sequence by deleting some or no elements without changing the order of the remaining elements. The relation of one sequence being the subsequence of another is a preorder.

A longest common subsequence (LCS) is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem, the basis of data comparison programs such as the diff utility, and has applications in computational linguistics and bioinformatics. It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

In computer science, the shortest common supersequence of two sequences X and Y is the shortest sequence which has X and Y as subsequences. This is a problem closely related to the longest common subsequence problem. Given two sequences X = < x₁,...,x_m > and Y = < y₁,...,y_n >, a sequence U = < u₁,...,u_k > is a common supersequence of X and Y if items can be removed from U to produce X and Y.

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given a set of strings, it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.

In machine learning and data mining, a string kernel is a kernel function that operates on strings, i.e. finite sequences of symbols that need not be of the same length. String kernels can be intuitively understood as functions measuring the similarity of pairs of strings: the more similar two strings a and b are, the higher the value of a string kernel K(a, b) will be.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

Gestalt pattern matching, also Ratcliff/Obershelp pattern recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988.

In computer science, a generalized suffix array (GSA) is a suffix array containing all suffixes for a set of strings. Given a set of strings, it is a lexicographically sorted array of all suffixes of each string in the set. It is primarily used in bioinformatics and string processing.

References

[1] Lothaire, M. (1997). Combinatorics on words. Cambridge: Cambridge University Press. ISBN 0-521-59924-5.
[2] Kelley, Dean (1995). Automata and Formal Languages: An Introduction. London: Prentice-Hall International. ISBN 0-13-497777-7.
[3] Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. US: Cambridge University Press. ISBN 0-521-58519-8.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
