Match rating approach

The match rating approach (MRA) is a phonetic algorithm for indexing words by their pronunciation, developed by Western Airlines in 1977 for the indexing and comparison of homophonous names. [1]

The algorithm itself has a simple set of encoding rules but a lengthier set of comparison rules. The main mechanism is the similarity comparison, which calculates the number of unmatched characters by comparing the strings from left to right and then from right to left, removing identical characters as it goes. The number of unmatched characters is subtracted from 6 and the result is compared to a minimum threshold. The minimum threshold is defined in Table A and depends on the lengths of the strings.

The encoded name is known (perhaps incorrectly) as a personal numeric identifier (PNI). The encoded name can never contain more than 6 alphabetic characters.

The match rating approach performs well with names containing the letter "y", unlike the original flavor of the NYSIIS algorithm; for example, the surnames "Smith" and "Smyth" are successfully matched. However, MRA does not perform well with encoded names that differ in length by more than 2.

Encoding rules

  1. Delete all vowels unless the vowel begins the word.
  2. Remove the second consonant of any double consonants present.
  3. Reduce the codex to 6 letters by joining the first 3 and last 3 letters only (see the sketch after this list).
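
In code, these rules might look like the following Python sketch. The function name encode_mra is an illustrative choice, not an established API, and a single alphabetic name is assumed as input:

  def encode_mra(name: str) -> str:
      # Illustrative sketch of the MRA encoding rules; assumes a
      # non-empty, purely alphabetic name.
      name = name.upper()
      vowels = set("AEIOU")

      # Rule 1: delete all vowels unless the vowel begins the word.
      codex = name[0] + "".join(c for c in name[1:] if c not in vowels)

      # Rule 2: remove the second consonant of any double consonants.
      deduped = codex[0]
      for c in codex[1:]:
          if c != deduped[-1]:
              deduped += c

      # Rule 3: reduce the codex to 6 letters by joining the first 3
      # and last 3 letters only.
      if len(deduped) > 6:
          return deduped[:3] + deduped[-3:]
      return deduped

For example, encode_mra("Catherine") returns "CTHRN" and encode_mra("Kathryn") returns "KTHRYN", as in the examples table below.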

Comparison rules

In this section, the words "string(s)" and "name(s)" mean "encoded string(s)" and "encoded name(s)".

  1. If the length difference between the encoded strings is 3 or greater, then no similarity comparison is done.
  2. Obtain the minimum rating value by calculating the length sum of the encoded strings and using Table A.
  3. Process the encoded strings from left to right and remove any identical characters found from both strings respectively.
  4. Process the unmatched characters from right to left and remove any identical characters found from both strings respectively.
  5. Subtract the number of unmatched characters in the longer string from 6. This is the similarity rating.
  6. If the similarity rating is equal to or greater than the minimum rating, then the match is considered good (see the sketch after this list).
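
In the same sketch style, the comparison rules might look as follows. The minimum_rating helper implements Table A from the next section; reading rules 3 and 4 as position-by-position removal follows common open-source implementations (for example, the jellyfish Python library), and both function names are illustrative rather than standard:

  from itertools import zip_longest

  def minimum_rating(length_sum: int) -> int:
      # Table A: minimum rating keyed on the sum of the codex lengths.
      if length_sum <= 4:
          return 5
      if length_sum <= 7:
          return 4
      if length_sum <= 11:
          return 3
      return 2  # length_sum == 12

  def similarity(code1: str, code2: str):
      # Returns (similarity rating, is_good_match), or None when the
      # length difference rules out a comparison.

      # Rule 1: no comparison when the lengths differ by 3 or more.
      if abs(len(code1) - len(code2)) >= 3:
          return None

      # Rule 2: minimum rating from the sum of the lengths (Table A).
      threshold = minimum_rating(len(code1) + len(code2))

      # Rule 3: left-to-right pass, removing characters that are
      # identical at the same position in both strings.
      rest1, rest2 = [], []
      for c1, c2 in zip_longest(code1, code2):
          if c1 != c2:
              if c1:
                  rest1.append(c1)
              if c2:
                  rest2.append(c2)

      # Rule 4: right-to-left pass over the leftovers, counting the
      # characters that still fail to match.
      unmatched1 = unmatched2 = 0
      for c1, c2 in zip_longest(reversed(rest1), reversed(rest2)):
          if c1 != c2:
              if c1:
                  unmatched1 += 1
              if c2:
                  unmatched2 += 1

      # Rule 5: subtract the unmatched count in the longer string from 6.
      rating = 6 - max(unmatched1, unmatched2)

      # Rule 6: the match is good when the rating meets the threshold.
      return rating, rating >= threshold

For the encoded pair BYRN/BRN this returns (5, True): the left-to-right pass leaves YRN and RN, the right-to-left pass leaves only Y unmatched, and 6 - 1 = 5 meets the minimum rating of 4.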

Minimum threshold

The following table shows the mapping between the minimum rating and the string lengths.

Table A
  Sum of lengths   Minimum rating
  sum ≤ 4          5
  4 < sum ≤ 7      4
  7 < sum ≤ 11     3
  sum = 12         2

Match rating approach examples

The table below displays the output of the match rating approach algorithm for some common homophonous names; the minimum rating and similarity comparison rating columns apply to each pair of names.

  Name        MRA codex   Minimum rating   Similarity comparison rating
  Byrne       BYRN        4                5
  Boern       BRN
  Smith       SMTH        3                5
  Smyth       SMYTH
  Catherine   CTHRN       3                4
  Kathryn     KTHRYN
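
Combining the two sketches above reproduces this table, assuming both functions are in scope:

  for a, b in [("Byrne", "Boern"), ("Smith", "Smyth"), ("Catherine", "Kathryn")]:
      print(a, b, similarity(encode_mra(a), encode_mra(b)))
  # Byrne Boern (5, True)
  # Smith Smyth (5, True)
  # Catherine Kathryn (4, True)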

See also

Soundex

Related Research Articles

<span class="mw-page-title-main">Arabic alphabet</span> Alphabets for Arabic and other languages

The Arabic alphabet, or Arabic abjad, is the Arabic script as specifically codified for writing the Arabic language. It is written from right-to-left in a cursive style, and includes 28 letters, of which most have contextual letterforms. The Arabic alphabet is considered an abjad, with only consonants required to be written; due to its optional use of diacritics to notate vowels, it is considered an impure abjad.

<span class="mw-page-title-main">Huffman coding</span> Technique to compress data

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

<span class="mw-page-title-main">String (computer science)</span> Sequence of characters, data type

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed. A string is generally considered as a data type and is often implemented as an array data structure of bytes that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence data types and structures.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

The Burrows–Wheeler transform rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the Unix file compression utility compress and is used in the GIF image format.

Punycode is a representation of Unicode with the limited ASCII character subset used for Internet hostnames. Using Punycode, host names containing Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphens, which is called the letter–digit–hyphen (LDH) subset. For example, München is encoded as Mnchen-3ya.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965.

Lao script or Akson Lao is the primary script used to write the Lao language and other minority languages in Laos. Its earlier form, the Tai Noi script, was also used to write the Isan language, but was replaced by the Thai script. It has 27 consonants, 7 consonantal ligatures, 33 vowels, and 4 tone marks.

In computer science, the Rabin–Karp algorithm or Karp–Rabin algorithm is a string-searching algorithm created by Richard M. Karp and Michael O. Rabin (1987) that uses hashing to find an exact match of a pattern string in a text. It uses a rolling hash to quickly filter out positions of the text that cannot match the pattern, and then checks for a match at the remaining positions. Generalizations of the same idea can be used to find more than one match of a single pattern, or to find matches for more than one pattern.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms Improvements to Soundex are the basis for many modern phonetic algorithms.

<span class="mw-page-title-main">Needleman–Wunsch algorithm</span> Method for aligning biological sequences

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of dynamic programming to compare biological sequences. The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. The algorithm essentially divides a large problem into a series of smaller problems, and it uses the solutions to the smaller problems to find an optimal solution to the larger problem. It is also sometimes referred to as the optimal matching algorithm and the global alignment technique. The Needleman–Wunsch algorithm is still widely used for optimal global alignment, particularly when the quality of the global alignment is of the utmost importance. The algorithm assigns a score to every possible alignment, and the purpose of the algorithm is to find all possible alignments having the highest score.

<span class="mw-page-title-main">Radix tree</span> Data structure

In computer science, a radix tree is a data structure that represents a space-optimized trie in which each node that is the only child is merged with its parent. The result is that the number of children of every internal node is at most the radix r of the radix tree, where r is a positive integer and a power x of 2, having x ≥ 1. Unlike regular trees, edges can be labeled with sequences of elements as well as single elements. This makes radix trees much more efficient for small sets and for sets of strings that share long prefixes.

Algorithmic information theory (AIT) is a branch of theoretical computer science that concerns itself with the relationship between computation and information of computably generated objects (as opposed to stochastically generated), such as strings or any other data structure. In other words, it is shown within algorithmic information theory that computational incompressibility "mimics" (except for a constant that only depends on the chosen universal programming language) the relations or inequalities found in information theory. According to Gregory Chaitin, it is "the result of putting Shannon's information theory and Turing's computability theory into a cocktail shaker and shaking vigorously."

In computer science and statistics, the Jaro–Winkler similarity is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric metric proposed in 1990 by William E. Winkler.

<span class="mw-page-title-main">Meteg</span> Hebrew punctuation mark (ֽ)

Meteg is a punctuation mark used in Biblical Hebrew for stress marking. It is a vertical bar placed under the affected syllable.

Tamil All Character Encoding (TACE16) is a 16-bit Unicode-based character encoding scheme for Tamil language.

Cologne phonetics is a phonetic algorithm which assigns to words a sequence of digits, the phonetic code. The aim of this procedure is that identical sounding words have the same code assigned to them. The algorithm can be used to perform a similarity search between words. For example, it is possible in a name list to find entries like "Meier" under different spellings such as "Maier", "Mayer", or "Mayr". The Cologne phonetics is related to the well known Soundex phonetic algorithm but is optimized to match the German language. The algorithm was published in 1969 by Hans Joachim Postel.

References

  1. Moore, G. B.; Kuhns, J. L.; Treffzs, J. L.; Montgomery, C. A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP 500-2.