Gestalt pattern matching

Last updated

Gestalt pattern matching, [1] also Ratcliff/Obershelp pattern recognition, [2] is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in the Dr. Dobb's Journal in July 1988. [2]

Contents

Algorithm

The similarity of two strings and is determined by the formula, calculating twice the number of matching characters divided by the total number of characters of both strings. The matching characters are defined as some longest common substring [3] plus recursively the number of matching characters in the non-matching regions on both sides of the longest common substring: [2] [4]

where the similarity metric can take a value between zero and one:

The value of 1 stands for the complete match of the two strings, whereas the value of 0 means there is no match and not even one common letter.

Sample

S1WIKIMEDIA
S2WIKIMANIA

The longest common substring is WIKIM (light grey) with 5 characters. There is no further substring on the left. The non-matching substrings on the right side are EDIA and ANIA. They again have a longest common substring IA (dark gray) with length 2. The similarity metric is determined by:

Properties

The Ratcliff/Obershelp matching characters can be substantially different from each longest common subsequence of the given strings. For example and have as their only longest common substring, and no common characters right of its occurrence, and likewise left, leading to . However, the longest common subsequence of and is , with a total length of .

Complexity

The execution time of the algorithm is in a worst case and in an average case. By changing the computing method, the execution time can be improved significantly. [1]

Commutative property

The Python library implementation of the gestalt pattern matching algorithm is not commutative: [5]

Sample

For the two strings

and

the metric result for

is with the substrings GESTALT P, A, T, E and for
the metric is with the substrings GESTALT P, R, A, C, I.[ why? ]

Applications

The Python difflib library, which was introduced in version 2.1, [1] implements a similar algorithm that predates the Ratcliff-Obershelp algorithm. Due to the unfavourable runtime behaviour of this similarity metric, three methods have been implemented. Two of them return an upper bound in a faster execution time. [1] The fastest variant only compares the length of the two substrings: [6]

,

The second upper bound calculates twice the sum of all used characters which occur in divided by the length of both strings but the sequence is ignored.

[ clarification needed ]
# Dqr Implementation in Pythondefquick_ratio(s1:str,s2:str)->float:"""Return an upper bound on ratio() relatively quickly."""length=len(s1)+len(s2)ifnotlength:return1.0intersect=collections.Counter(s1)&collections.Counter(s2)matches=sum(intersect.values())return2.0*matches/length

Trivially the following applies:

and
.

Related Research Articles

In computer science, the Knuth–Morris–Pratt algorithm is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another, that is measured by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

<span class="mw-page-title-main">Rope (data structure)</span> Data structure for storing strings

In computer programming, a rope, or cord, is a data structure composed of smaller strings that is used to efficiently store and manipulate a very long string. For example, a text editing program may use a rope to represent the text being edited, so that operations such as insertion, deletion, and random access can be done efficiently.

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

A fully polynomial-time approximation scheme (FPTAS) is an algorithm for finding approximate solutions to function problems, especially optimization problems. An FPTAS takes as input an instance of the problem and a parameter ε > 0. It returns as output a value which is at least times the correct value, and at most times the correct value.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

In computer science, a substring index is a data structure which gives substring search in a text or text collection in sublinear time. If you have a document of length , or a set of documents of total length , you can locate all occurrences of a pattern in time.

<span class="mw-page-title-main">Generalized suffix tree</span>

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

BLEU is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. Invented at IBM in 2001, BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

In computer science and statistics, the Jaro–Winkler similarity is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric proposed in 1990 by William E. Winkler.

The Price of Anarchy (PoA) is a concept in economics and game theory that measures how the efficiency of a system degrades due to selfish behavior of its agents. It is a general notion that can be extended to diverse systems and notions of efficiency. For example, consider the system of transportation of a city and many agents trying to go from some initial location to a destination. Here, efficiency means the average time for an agent to reach the destination. In the 'centralized' solution, a central authority can tell each agent which path to take in order to minimize the average travel time. In the 'decentralized' version, each agent chooses its own path. The Price of Anarchy measures the ratio between average travel time in the two cases.

In computer science, the longest palindromic substring or longest symmetric factor problem is the problem of finding a maximum-length contiguous substring of a given string that is also a palindrome. For example, the longest palindromic substring of "bananas" is "anana". The longest palindromic substring is not guaranteed to be unique; for example, in the string "abracadabra", there is no palindromic substring with length greater than three, but there are two palindromic substrings with length three, namely, "aca" and "ada". In some applications it may be necessary to return all maximal palindromic substrings rather than returning only one substring or returning the maximum length of a palindromic substring.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

In stochastic analysis, a rough path is a generalization of the notion of smooth path allowing to construct a robust solution theory for controlled differential equations driven by classically irregular signals, for example a Wiener process. The theory was developed in the 1990s by Terry Lyons. Several accounts of the theory are available.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

<span class="mw-page-title-main">Suffix automaton</span> Deterministic finite automaton accepting set of all suffixes of particular string

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

In mathematics, the Chvátal–Sankoff constants are mathematical constants that describe the lengths of longest common subsequences of random strings. Although the existence of these constants has been proven, their exact values are unknown. They are named after Václav Chvátal and David Sankoff, who began investigating them in the mid-1970s.

References

  1. 1 2 3 4 difflib — Helpers for computing deltas inside the Python documentation
  2. 1 2 3 National Institute of Standards and Technology Ratcliff/Obershelp pattern recognition
  3. While two strings may have several longest common substrings, Ratcliff (1988) apparently assumes that there is only one.
  4. Ilya Ilyankou: Comparison of Jaro-Winkler and Ratcliff/Obershelp algorithms in spell check, May 2014 (PDF)
  5. How does Pythons SequenceMatcher work? at stackoverflow.com
  6. Borrowed from Python 3.7.0, difflib.py Lines 38–41 and 676–686

Further reading

See also