Jewels of Stringology

Last updated

Jewels of Stringology: Text Algorithms is a book on algorithms for pattern matching in strings and related problems. It was written by Maxime Crochemore and Wojciech Rytter, and published by World Scientific in 2003.

Contents

Topics

The first topics of the book are two basic string-searching algorithms for finding exactly-matching substrings, the Knuth–Morris–Pratt algorithm and the Boyer–Moore string-search algorithm. It then describes the suffix tree, an index for quickly looking up matching substrings, and two algorithms for constructing it. Other topics in the book include the construction of deterministic finite automata for pattern recognition, the discovery of repeated patterns in strings, constant-space string matching algorithms, and the lossless compression of strings. Approximate string matching is covered in several variations including edit distance and the longest common subsequence problem. The book concludes with advanced topics including two-dimensional pattern matching, parallel algorithms for pattern matching, the shortest common superstring problem, parameterized pattern matching and duplicate code detection, and the Rabin–Karp algorithm. [1]

Audience and reception

The book is written for an audience familiar with algorithm design and analysis, but not necessarily familiar with string algorithms. [1] Reviewer Rolf Klein suggests that this target audience may be narrow, as he evaluates the book as being too difficult for many students, but not supplying as much depth for experts as the same authors' earlier book Text Algorithms (1994). [2]

Reviewer Shoshana Marcus writes that the algorithms chosen for inclusion in the book are "elegant yet fundamental" but have often been overlooked by more general algorithms textbooks. She writes that the book itself should become a valuable reference for researchers in this area, and that it could also be used to supplement undergraduate or graduate course material in algorithms. [1] Reviewer Ricardo Baeza-Yates suggests that the book's omission of bit-level parallel programming techniques reflects its bias towards theoretical rather than practical methods, but nevertheless agrees with its suitability for graduate courses. [3]

Related Research Articles

String (computer science) Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, the Knuth–Morris–Pratt string-searching algorithm searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence.

In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

Suffix tree computer science term: compressed trie containing all the suffixes of the given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full text indices, data compression algorithms and within the field of bibliometrics.

The bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

In computer science, an inverted index is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines. Additionally, several significant general-purpose mainframe-based database management systems have used inverted list architectures, including ADABAS, DATACOM/DB, and Model 204.

Approximate string matching finding strings that match a pattern approximately

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.

In computer science, compressed pattern matching is the process of searching for patterns in compressed data with little or no decompression. Searching in a compressed string is faster than searching an uncompressed string and requires less space.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

In computer science, a suffix automaton is the smallest partial deterministic finite automaton that recognizes the set of suffixes of a given string. The state graph of the suffix automaton is called the directed acyclic word graph (DAWG). The term DAWG is also sometimes used for any deterministic acyclic finite state automaton. The term suffix automaton is also sometimes used for any automaton recognizing suffixes of a text.

Gad Landau

Gad Menahem Landau is an Israeli computer scientist noted for his contributions to combinatorial pattern matching and string algorithms and is the founding department chair of the Computer Science Department at the University of Haifa.

Maxime Crochemore is a French computer scientist known for his numerous contributions to algorithms on strings. He is currently a professor at King's College London.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

Gonzalo Navarro Badino is a full professor of computer science at the University of Chile and ACM Distinguished Member, whose interests include algorithms and data structures, data compression and text searching. He also participates in the Center for Biotechnology and Bioengineering and the Millennium Institute for Foundational Research on Date .. He obtained his PhD at the University of Chile in 1998 under the supervision of Ricardo Baeza-Yates with the thesis Approximate Text Searching, then worked as a post-doctoral researcher with Esko Ukkonen and Maxime Crochemore.

Sophie Schbath is a French statistician whose research concerns the statistics of pattern matching in strings and formal languages, particularly as applied to genomics. She is a director of research for the French National Institute for Research in Agriculture, Food, and Environment (INRAE), and a former president of the French BioInformatics Society.

References

  1. 1 2 3 Marcus, Shoshana (September 2015), "Review of Jewels of Stringology" (PDF), ACM SIGACT News , 46 (3): 11–14, doi:10.1145/2818936.2818940
  2. Klein, Rolf, zbMATH , Zbl   1078.68151 CS1 maint: untitled periodical (link)
  3. Baeza-Yates, Ricardo (2005), Mathematical Reviews , MR   2012571 CS1 maint: untitled periodical (link)