Gestalt Pattern Matching

Gestalt pattern matching, [1] also known as Ratcliff/Obershelp pattern recognition, [2] is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in Dr. Dobb's Journal in July 1988. [2]

Algorithm

The similarity of two strings $S_1$ and $S_2$ is determined by the formula, calculating twice the number of matching characters $K_m$ divided by the total number of characters of both strings. The matching characters are defined as the longest common substring (LCS) plus, recursively, the number of matching characters in the non-matching regions on both sides of the LCS: [2]

$D_{ro} = \frac{2 K_m}{|S_1| + |S_2|}$ [3]

where the similarity metric can take a value between zero and one:

$0 \le D_{ro} \le 1$

The value 1 stands for a complete match of the two strings, whereas the value 0 means there is no match, not even a single common letter.
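The recursive definition translates almost directly into code. The following is a minimal, unoptimized sketch; the helper names longest_match, matching_characters and similarity are illustrative only, and Python's production implementation in difflib additionally applies junk heuristics and other optimizations:

def longest_match(s1: str, s2: str) -> tuple[int, int, int]:
    """Naive search for a longest common substring.
    Returns (start in s1, start in s2, length)."""
    best_i, best_j, best_k = 0, 0, 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            if k > best_k:
                best_i, best_j, best_k = i, j, k
    return best_i, best_j, best_k

def matching_characters(s1: str, s2: str) -> int:
    """K_m: length of the LCS plus the matches found recursively
    in the non-matching regions to its left and to its right."""
    i, j, k = longest_match(s1, s2)
    if k == 0:
        return 0
    return (k
            + matching_characters(s1[:i], s2[:j])
            + matching_characters(s1[i + k:], s2[j + k:]))

def similarity(s1: str, s2: str) -> float:
    """D_ro = 2 * K_m / (|S1| + |S2|)."""
    length = len(s1) + len(s2)
    if not length:
        return 1.0
    return 2.0 * matching_characters(s1, s2) / length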

Sample

S1 = WIKIMEDIA
S2 = WIKIMANIA

The longest common substring is WIKIM with 5 characters. There is no further substring on the left. The non-matching substrings on the right side are EDIA and ANIA. They in turn have a longest common substring IA with length 2. The similarity metric is determined by:

$D_{ro} = \frac{2 K_m}{|S_1| + |S_2|} = \frac{2 \cdot (5 + 2)}{9 + 9} = \frac{14}{18} \approx 0.78$
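The same value is produced by Python's difflib, whose SequenceMatcher implements this metric (see the Applications section below):

from difflib import SequenceMatcher

SequenceMatcher(None, "WIKIMEDIA", "WIKIMANIA").ratio()  # 0.7777... = 14/18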

Properties

Complexity

The execution time of the algorithm is $O(n^3)$ in the worst case and $O(n^2)$ in the average case. By changing the computing method, the execution time can be improved significantly. [1]

Commutative property

It can be shown that the Gestalt pattern matching algorithm is not commutative: [4]

$D_{ro}(S_1, S_2) \neq D_{ro}(S_2, S_1)$

Sample

For the two strings

S1 = GESTALT PATTERN MATCHING

and

S2 = GESTALT PRACTICE

the metric result for $D_{ro}(S_1, S_2)$ is $\frac{24}{40} = 0.6$ with the matching substrings GESTALT P, A, T, E, while for $D_{ro}(S_2, S_1)$ the metric is $\frac{26}{40} = 0.65$ with the matching substrings GESTALT P, R, A, C, I.
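The asymmetry can be observed directly with difflib's SequenceMatcher, which arrives at exactly the substring decompositions listed above:

from difflib import SequenceMatcher

SequenceMatcher(None, "GESTALT PATTERN MATCHING", "GESTALT PRACTICE").ratio()  # 0.6  = 24/40
SequenceMatcher(None, "GESTALT PRACTICE", "GESTALT PATTERN MATCHING").ratio()  # 0.65 = 26/40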

Applications

The algorithm became the basis of the Python difflib library, which was introduced in version 2.1. [1] Due to the unfavourable runtime behaviour of this similarity metric, three methods have been implemented. Two of them return an upper bound on the similarity in a faster execution time. [1] The fastest variant only compares the lengths of the two strings: [5]

$D_{rqr} = \frac{2 \cdot \min(|S_1|, |S_2|)}{|S_1| + |S_2|}$

# Drqr Implementation in Python
def real_quick_ratio(s1: str, s2: str) -> float:
    """Return an upper bound on ratio() very quickly."""
    l1, l2 = len(s1), len(s2)
    length = l1 + l2
    if not length:
        return 1.0
    return 2.0 * min(l1, l2) / length

The second upper bound calculates twice the number of characters of $S_1$ that also occur in $S_2$ (counted with multiplicity, i.e. a multiset intersection), divided by the length of both strings; the order of the characters is ignored:

$D_{qr} = \frac{2 \cdot |M(S_1) \cap M(S_2)|}{|S_1| + |S_2|}$

where $M(S)$ denotes the multiset of characters of $S$.
# Dqr Implementation in Python
import collections  # needed for Counter; not shown in the excerpted difflib source

def quick_ratio(s1: str, s2: str) -> float:
    """Return an upper bound on ratio() relatively quickly."""
    length = len(s1) + len(s2)
    if not length:
        return 1.0
    intersect = collections.Counter(s1) & collections.Counter(s2)
    matches = sum(intersect.values())
    return 2.0 * matches / length

Trivially the following applies:

$0 \le D_{ro} \le D_{qr} \le 1$

and

$0 \le D_{ro} \le D_{rqr} \le 1$.
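All three variants are exposed as methods of difflib's SequenceMatcher, so the cheap upper bounds can be used to discard unpromising candidates before the expensive ratio() is computed; this is, for instance, how difflib.get_close_matches filters its candidates. A small illustration:

from difflib import SequenceMatcher

m = SequenceMatcher(None, "WIKIMEDIA", "WIKIMANIA")
m.real_quick_ratio()  # 1.0       (the strings have equal length)
m.quick_ratio()       # 0.7777... (7 characters in the multiset intersection)
m.ratio()             # 0.7777... (here the quick bound happens to be tight)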

References

  1. difflib — Helpers for computing deltas, Python documentation.
  2. National Institute of Standards and Technology: Ratcliff/Obershelp pattern recognition.
  3. Ilya Ilyankou: Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms in spell check, May 2014 (PDF).
  4. How does Python's SequenceMatcher work?, stackoverflow.com.
  5. Borrowed from Python 3.7.0, difflib.py, lines 38–41 and 676–686.

See also

String-searching algorithm
Longest common subsequence
Longest common substring
Edit distance
Damerau–Levenshtein distance
Jaro–Winkler distance
Approximate string matching
Cosine similarity
String kernel
Information distance
BLEU
Rabin–Karp algorithm
Boyer–Moore string-search algorithm
Boyer–Moore–Horspool algorithm
Suffix tree
Suffix array
Suffix automaton