Bitap algorithm

Last updated

The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates-Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance   if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal. The algorithm begins by precomputing a set of bitmasks containing one bit for each element of the pattern. Then it is able to do most of the work with bitwise operations, which are extremely fast.

Contents

The bitap algorithm is perhaps best known as one of the underlying algorithms of the Unix utility agrep, written by Udi Manber, Sun Wu, and Burra Gopal. Manber and Wu's original paper gives extensions of the algorithm to deal with fuzzy matching of general regular expressions.

Due to the data structures required by the algorithm, it performs best on patterns less than a constant length (typically the word length of the machine in question), and also prefers inputs over a small alphabet. Once it has been implemented for a given alphabet and word length m, however, its running time is completely predictable it runs in O(mn) operations, no matter the structure of the text or the pattern.

The bitap algorithm for exact string searching was invented by Bálint Dömölki in 1964 and extended by R. K. Shyamasundar in 1977 , before being reinvented by Ricardo Baeza-Yates and Gaston Gonnet in 1989 (one chapter of first author's PhD thesis ) which also extended it to handle classes of characters, wildcards, and mismatches. In 1991, it was extended by Manber and Wu to handle also insertions and deletions (full fuzzy string searching). This algorithm was later improved by Baeza-Yates and Navarro in 1996.

Exact searching

The bitap algorithm for exact string searching, in full generality, looks like this in pseudocode:

algorithm bitap_search isinput:text as a string.            pattern as a string.     output: string     m := length(pattern)      ifm = 0 thenreturntext      /* Initialize the bit array R. */     R := new array[m+1] of bit, initially all 0     R[0] := 1      fori := 0; i < length(text); i += 1 do         /* Update the bit array. */         fork := m; k ≥ 1; k -= 1 doR[k] := R[k - 1] & (text[i] = pattern[k - 1])          ifR[m] thenreturn (text + i - m) + 1      return null

Bitap distinguishes itself from other well-known string searching algorithms in its natural mapping onto simple bitwise operations, as in the following modification of the above program. Notice that in this implementation, counterintuitively, each bit with value zero indicates a match, and each bit with value 1 indicates a non-match. The same algorithm can be written with the intuitive semantics for 0 and 1, but in that case we must introduce another instruction into the inner loop to set R |= 1. In this implementation, we take advantage of the fact that left-shifting a value shifts in zeros on the right, which is precisely the behavior we need.

Notice also that we require CHAR_MAX additional bitmasks in order to convert the (text[i] == pattern[k-1]) condition in the general implementation into bitwise operations. Therefore, the bitap algorithm performs better when applied to inputs over smaller alphabets.

#include<string.h>#include<limits.h>constchar*bitap_bitwise_search(constchar*text,constchar*pattern){intm=strlen(pattern);unsignedlongR;unsignedlongpattern_mask[CHAR_MAX+1];inti;if(pattern[0]=='\0')returntext;if(m>31)return"The pattern is too long!";/* Initialize the bit array R */R=~1;/* Initialize the pattern bitmasks */for(i=0;i<=CHAR_MAX;++i)pattern_mask[i]=~0;for(i=0;i<m;++i)pattern_mask[pattern[i]]&=~(1UL<<i);for(i=0;text[i]!='\0';++i){/* Update the bit array */R|=pattern_mask[text[i]];R<<=1;if(0==(R&(1UL<<m)))return(text+i-m)+1;}returnNULL;}

Fuzzy searching

To perform fuzzy string searching using the bitap algorithm, it is necessary to extend the bit array R into a second dimension. Instead of having a single array R that changes over the length of the text, we now have k distinct arrays R1..k. Array Ri holds a representation of the prefixes of pattern that match any suffix of the current string with i or fewer errors. In this context, an "error" may be an insertion, deletion, or substitution; see Levenshtein distance for more information on these operations.

The implementation below performs fuzzy matching (returning the first match with up to k errors) using the fuzzy bitap algorithm. However, it only pays attention to substitutions, not to insertions or deletions in other words, a Hamming distance of k. As before, the semantics of 0 and 1 are reversed from their conventional meanings.

#include<stdlib.h>#include<string.h>#include<limits.h>constchar*bitap_fuzzy_bitwise_search(constchar*text,constchar*pattern,intk){constchar*result=NULL;intm=strlen(pattern);unsignedlong*R;unsignedlongpattern_mask[CHAR_MAX+1];inti,d;if(pattern[0]=='\0')returntext;if(m>31)return"The pattern is too long!";/* Initialize the bit array R */R=malloc((k+1)*sizeof*R);for(i=0;i<=k;++i)R[i]=~1;/* Initialize the pattern bitmasks */for(i=0;i<=CHAR_MAX;++i)pattern_mask[i]=~0;for(i=0;i<m;++i)pattern_mask[pattern[i]]&=~(1UL<<i);for(i=0;text[i]!='\0';++i){/* Update the bit arrays */unsignedlongold_Rd1=R[0];R[0]|=pattern_mask[text[i]];R[0]<<=1;for(d=1;d<=k;++d){unsignedlongtmp=R[d];/* Substitution is all we care about */R[d]=(old_Rd1&(R[d]|pattern_mask[text[i]]))<<1;old_Rd1=tmp;}if(0==(R[k]&(1UL<<m))){result=(text+i-m)+1;break;}}free(R);returnresult;}

See also

  1. ^ Bálint Dömölki, An algorithm for syntactical analysis, Computational Linguistics 3, Hungarian Academy of Science pp. 29–46, 1964.
  2. ^ Bálint Dömölki, A universal compiler system based on production rules, BIT Numerical Mathematics, 8(4), pp 262275, 1968. doi : 10.1007/BF01933436
  3. ^ R. K. Shyamasundar, Precedence parsing using Dömölki's algorithm, International Journal of Computer Mathematics, 6(2)pp 105–114, 1977.
  4. ^ Ricardo Baeza-Yates. "Efficient Text Searching." PhD Thesis, University of Waterloo, Canada, May 1989.
  5. ^ Udi Manber, Sun Wu. "Fast text searching with errors." Technical Report TR-91-11. Department of Computer Science, University of Arizona, Tucson, June 1991. (gzipped PostScript)
  6. ^ Ricardo Baeza-Yates, Gastón H. Gonnet. "A New Approach to Text Searching." Communications of the ACM , 35(10): pp. 7482, October 1992.
  7. ^ Udi Manber, Sun Wu. "Fast text search allowing errors." Communications of the ACM , 35(10): pp. 8391, October 1992, doi : 10.1145/135239.135244.
  8. ^ R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 123, Irvine, CA, June 1996.
  9. ^ G. Myers. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM 46 (3), May 1999, 395415.
  10. libbitap, a free implementation that shows how the algorithm can easily be extended for most regular expressions. Unlike the code above, it places no limit on the pattern length.
  11. Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval. 1999. ISBN   0-201-39829-X.
  12. bitap.py - Python implementation of Bitap algorithm with Wu-Manber modifications.

Related Research Articles

<span class="mw-page-title-main">Hash function</span> Mapping arbitrary data to fixed-size values

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output. The values returned by a hash function are called hash values, hash codes, hash digests, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

<span class="mw-page-title-main">Trie</span> K-ary search tree data structure

In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key, the trie is traversed depth-first, following the links between nodes, which represent each character in the key.

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

agrep is an open-source approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

In computer science, a mask or bitmask is data that is used for bitwise operations, particularly in a bit field. Using a mask, multiple bits in a byte, nibble, word, etc. can be set either on or off, or inverted from on to off in a single bitwise operation. An additional use of masking involves predication in vector processing, where the bitmask is used to select which element operations in the vector are to be executed and which are not.

In computer science, the Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In computer science, a fusion tree is a type of tree data structure that implements an associative array on w-bit integers on a finite universe, where each of the input integers has size less than 2w and is non-negative. When operating on a collection of n key–value pairs, it uses O(n) space and performs searches in O(logwn) time, which is asymptotically faster than a traditional self-balancing binary search tree, and also better than the van Emde Boas tree for large values of w. It achieves this speed by using certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael Fredman and Dan Willard.

<span class="mw-page-title-main">Hamming weight</span> Number of nonzero symbols in a string

The Hamming weight of a string is the number of symbols that are different from the zero-symbol of the alphabet used. It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in the string, or the digit sum of the binary representation of a given number and the ₁ norm of a bit vector. In this binary case, it is also called the population count, popcount, sideways sum, or bit summation.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

<span class="mw-page-title-main">Circular shift</span> Circular shifts: Mathematical concept and applications in software development

In combinatorial mathematics, a circular shift is the operation of rearranging the entries in a tuple, either by moving the final entry to the first position, while shifting all other entries to the next position, or by performing the inverse operation. A circular shift is a special kind of cyclic permutation, which in turn is a special kind of permutation. Formally, a circular shift is a permutation σ of the n entries in the tuple such that either

The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.

A bit field is a data structure that consists of one or more adjacent bits which have been allocated for specific purposes, so that any single bit or group of bits within the structure can be set or inspected. A bit field is most commonly used to represent integral types of known, fixed bit-width, such as single-bit Booleans.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

In computer science and the study of combinatorics on words, a partial word is a string that may contain a number of "do not know" or "do not care" symbols i.e. placeholders in the string where the symbol value is not known or not specified. More formally, a partial word is a partial function where is some finite alphabet. If u(k) is not defined for some then the unknown element at place k in the string is called a "hole". In regular expressions (following the POSIX standard) a hole is represented by the metacharacter ".". For example, aab.ab.b is a partial word of length 8 over the alphabet A ={a,b} in which the fourth and seventh characters are holes.

In computer science, a compressed suffix array is a compressed data structure for pattern matching. Compressed suffix arrays are a general class of data structure that improve on the suffix array. These data structures enable quick search for an arbitrary string with a comparatively small index.

In the C programming language, operations can be performed on a bit level using bitwise operators.

Jewels of Stringology: Text Algorithms is a book on algorithms for pattern matching in strings and related problems. It was written by Maxime Crochemore and Wojciech Rytter, and published by World Scientific in 2003.