Jewels of Stringology

Last updated
Jewels of Stringology
Author Maxime Crochemore, Wojciech Rytter
GenreAlgorithms
PublisherWorld Scientific
Publication date
2003

Jewels of Stringology: Text Algorithms is a book on algorithms for pattern matching in strings and related problems. It was written by Maxime Crochemore and Wojciech Rytter, and published by World Scientific in 2003.

Contents

Topics

The first topics of the book are two basic string-searching algorithms for finding exactly-matching substrings, the Knuth–Morris–Pratt algorithm and the Boyer–Moore string-search algorithm. It then describes the suffix tree, an index for quickly looking up matching substrings, and two algorithms for constructing it. Other topics in the book include the construction of deterministic finite automata for pattern recognition, the discovery of repeated patterns in strings, constant-space string matching algorithms, and the lossless compression of strings. Approximate string matching is covered in several variations including edit distance and the longest common subsequence problem. The book concludes with advanced topics including two-dimensional pattern matching, parallel algorithms for pattern matching, the shortest common superstring problem, parameterized pattern matching and duplicate code detection, and the Rabin–Karp algorithm. [1]

Audience and reception

The book is written for an audience familiar with algorithm design and analysis, but not necessarily familiar with string algorithms. [1] Reviewer Rolf Klein suggests that this target audience may be narrow, as he evaluates the book as being too difficult for many students, but not supplying as much depth for experts as the same authors' earlier book Text Algorithms (1994). [2]

Reviewer Shoshana Marcus writes that the algorithms chosen for inclusion in the book are "elegant yet fundamental" but have often been overlooked by more general algorithms textbooks. She writes that the book itself should become a valuable reference for researchers in this area, and that it could also be used to supplement undergraduate or graduate course material in algorithms. [1] Reviewer Ricardo Baeza-Yates suggests that the book's omission of bit-level parallel programming techniques reflects its bias towards theoretical rather than practical methods, but nevertheless agrees with its suitability for graduate courses. [3]

Related Research Articles

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

In computer science, the Boyer–Moore–Horspool algorithm or Horspool's algorithm is an algorithm for finding substrings in strings. It was published by Nigel Horspool in 1980 as SBM.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

The bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance – if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

<span class="mw-page-title-main">Ricardo Baeza-Yates</span> Chilean computer scientist

Ricardo A. Baeza-Yates is a Chilean computer scientist that currently is the Director of Research of the Institute for Experiential AI at Northeastern University in the Silicon Valley campus. He is also part-time professor at Universitat Pompeu Fabra in Barcelona and Universidad de Chile in Santiago. He is an expert member of the Global Partnership on Artificial Intelligence, a member of the Association for Computing Machinery's US Technology Policy Committee as well as IEEE's Ethics Committee.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

<span class="mw-page-title-main">Zvi Galil</span> Israeli mathematician and computer scientist

Zvi Galil is an Israeli-American computer scientist and mathematician. He has served as the dean of the Columbia University School of Engineering and as president of Tel Aviv University from 2007 through 2009. From 2010 to 2019, he was the dean of the Georgia Institute of Technology College of Computing.

In computer science and the study of combinatorics on words, a partial word is a string that may contain a number of "do not know" or "do not care" symbols i.e. placeholders in the string where the symbol value is not known or not specified. More formally, a partial word is a partial function where is some finite alphabet. If u(k) is not defined for some then the unknown element at place k in the string is called a "hole". In regular expressions (following the POSIX standard) a hole is represented by the metacharacter ".". For example, aab.ab.b is a partial word of length 8 over the alphabet A ={a,b} in which the fourth and seventh characters are holes.

In computer science, compressed pattern matching is the process of searching for patterns in compressed data with little or no decompression. Searching in a compressed string is faster than searching an uncompressed string and requires less space.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

<span class="mw-page-title-main">Gad Landau</span> Israeli computer scientist

Gad Menahem Landau is an Israeli computer scientist noted for his contributions to combinatorial pattern matching and string algorithms and is the founding department chair of the Computer Science Department at the University of Haifa.

Maxime Crochemore is a French computer scientist known for his numerous contributions to algorithms on strings. He is currently a professor at King's College London.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

Gonzalo Navarro Badino is a full professor of computer science at the University of Chile and ACM Distinguished Member, whose interests include algorithms and data structures, data compression and text searching. He also participates in the Center for Biotechnology and Bioengineering and the Millennium Institute for Foundational Research on Data .. He obtained his PhD at the University of Chile in 1998 under the supervision of Ricardo Baeza-Yates with the thesis Approximate Text Searching, then worked as a post-doctoral researcher with Esko Ukkonen and Maxime Crochemore.

Sophie Schbath is a French statistician whose research concerns the statistics of pattern matching in strings and formal languages, particularly as applied to genomics. She is a director of research for the French National Institute for Research in Agriculture, Food, and Environment (INRAE), and a former president of the French BioInformatics Society.

Algorithmic Combinatorics on Partial Words is a book in the area of combinatorics on words, and more specifically on partial words. It was written by Francine Blanchet-Sadri, and published in 2008 by Chapman & Hall/CRC in their Discrete Mathematics and its Applications book series.

References

  1. 1 2 3 Marcus, Shoshana (September 2015), "Review of Jewels of Stringology" (PDF), ACM SIGACT News , 46 (3): 11–14, doi:10.1145/2818936.2818940, S2CID   29751366
  2. Klein, Rolf, zbMATH , Zbl   1078.68151 {{citation}}: CS1 maint: untitled periodical (link)
  3. Baeza-Yates, Ricardo (2005), Mathematical Reviews , MR   2012571 {{citation}}: CS1 maint: untitled periodical (link)