Phonetic algorithm


A phonetic algorithm is an algorithm for indexing words by their pronunciation. Most phonetic algorithms were developed for English and are not useful for indexing words in other languages.[1] Because English spelling varies significantly depending on multiple factors, such as a word's origin, its usage over time, and borrowings from other languages, phonetic algorithms necessarily take into account numerous rules and exceptions.[2]


Algorithms

Among the best-known phonetic algorithms are Soundex, the Daitch–Mokotoff Soundex, Metaphone, the New York State Identification and Intelligence System (NYSIIS) code, the match rating approach, and Cologne phonetics, each of which is described in more detail below.

Common uses

Phonetic algorithms are commonly used in spell checkers and spelling-suggestion features, and in searching and matching records where names may be spelled in different ways, for example surname matching in genealogy databases. They are often combined with approximate string matching techniques such as edit distance.

See also

Related Research Articles

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.
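
Philips's full Metaphone rule set is fairly extensive; the toy sketch below is not Metaphone itself, but it illustrates the general idea behind such encodings under a handful of assumed rules: normalise a few spelling patterns, then strip the letters that vary most between spellings, so that variant spellings of a name collapse to the same key.

```python
import re

def toy_phonetic_key(word: str) -> str:
    """Toy phonetic key in the spirit of Metaphone (NOT Philips's rule set)."""
    w = word.lower()
    if not w:
        return ""
    # A few illustrative spelling normalisations.
    w = re.sub(r"^kn", "n", w)        # silent K as in "knight"
    w = re.sub(r"^wr", "r", w)        # silent W as in "write"
    w = w.replace("ph", "f")          # PH sounds like F
    w = w.replace("ck", "k")          # CK sounds like K
    w = re.sub(r"(.)\1+", r"\1", w)   # collapse doubled letters
    # Keep the first letter, drop the remaining vowels.
    return w[0] + re.sub(r"[aeiou]", "", w[1:])

print(toy_phonetic_key("Philips"))   # flps
print(toy_phonetic_key("Phillips"))  # flps
print(toy_phonetic_key("Filips"))    # flps
```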

A phonemic orthography is an orthography in which the graphemes correspond to the language's phonemes. Natural languages rarely have perfectly phonemic orthographies; a high degree of grapheme–phoneme correspondence can be expected in orthographies based on alphabetic writing systems, but they differ in how complete this correspondence is. English orthography, for example, is alphabetic but highly nonphonemic; it was once mostly phonemic during the Middle English stage, when the modern spellings originated, but spoken English changed rapidly while the orthography remained much more stable, resulting in the modern nonphonemic situation. In contrast, the Albanian, Serbian/Croatian/Bosnian/Montenegrin, Romanian, Italian, Turkish, Spanish, Finnish, Czech, Latvian, Esperanto, Korean and Swahili orthographic systems come much closer to being consistent phonemic representations.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965.
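
The distance is usually computed with the textbook dynamic-programming recurrence; the minimal sketch below keeps only one row of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    # prev[j] holds the distance between a[:i-1] and b[:j] (the previous row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```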

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
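
As a rough illustration of the spelling-correction use just mentioned, the sketch below scores a small made-up dictionary against a misspelled word and keeps the entries within a chosen distance threshold; the word list, threshold, and function names are assumptions for the example only.

```python
from functools import lru_cache

def correction_candidates(word, dictionary, max_distance=2):
    """Rank dictionary words by edit distance to a (possibly misspelled) word."""
    @lru_cache(maxsize=None)
    def dist(a, b):
        if not a:
            return len(b)
        if not b:
            return len(a)
        if a[0] == b[0]:
            return dist(a[1:], b[1:])
        return 1 + min(dist(a[1:], b),       # deletion
                       dist(a, b[1:]),       # insertion
                       dist(a[1:], b[1:]))   # substitution
    scored = [(dist(word, entry), entry) for entry in dictionary]
    return sorted((d, w) for d, w in scored if d <= max_distance)

# Hypothetical mini-dictionary, just for illustration.
print(correction_candidates("graffe", ["giraffe", "graft", "grail", "coffee"]))
# [(1, 'giraffe'), (2, 'graft')]
```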


In software, a spell checker is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, and improvements to Soundex are the basis for many modern phonetic algorithms.
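
A minimal sketch of the commonly described American Soundex rules (first letter kept, remaining consonants mapped to digits, runs of equal digits collapsed, H and W not breaking such a run, result padded or truncated to four characters) might look like this:

```python
def soundex(name: str) -> str:
    """American Soundex: a 4-character key (initial letter plus three digits)."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    first, digits = name[0].upper(), []
    prev = codes.get(name[0], "")      # code of the first letter
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":             # H and W do not break a run of equal codes
            prev = code                # vowels reset prev, so repeats after a vowel count
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))  # R163 R163 A261
```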

A letter bank is a relative of the anagram where all the letters of one word can be used as many times as desired to make a new word or phrase. For example, IMPS is a bank of MISSISSIPPI and SPROUT is a bank of SUPPORT OUR TROOPS.
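
Under one minimal reading of that definition, checking whether one word is a letter bank of a phrase amounts to comparing the sets of letters used; the helper below is an illustrative assumption, not an established algorithm.

```python
def letter_set(s: str) -> set:
    """Distinct letters in a string, ignoring case, spaces, and punctuation."""
    return {c for c in s.lower() if c.isalpha()}

def is_letter_bank(bank: str, phrase: str) -> bool:
    """True if 'phrase' uses exactly the letters of 'bank' (each as often as desired)."""
    return letter_set(bank) == letter_set(phrase)

print(is_letter_bank("IMPS", "MISSISSIPPI"))           # True
print(is_letter_bank("SPROUT", "SUPPORT OUR TROOPS"))  # True
```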

Daitch–Mokotoff Soundex is a phonetic algorithm invented in 1985 by Jewish genealogists Gary Mokotoff and Randy Daitch. It is a refinement of the Russell and American Soundex algorithms designed to allow greater accuracy in matching of Slavic and Yiddish surnames with similar pronunciation but differences in spelling.

The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System. It features an accuracy increase of 2.7% over the traditional Soundex algorithm.

Spelling suggestion is a feature of many computer software applications used to suggest plausible replacements for words that are likely to have been misspelled.
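
For a quick illustration, Python's standard difflib module can produce such suggestions; it ranks candidates by a similarity ratio rather than a phonetic code or an edit distance, so this is just one possible approach, shown with a made-up vocabulary.

```python
import difflib

# Hypothetical vocabulary; get_close_matches returns the best-scoring entries.
vocabulary = ["receive", "believe", "separate", "definitely", "necessary"]
for typo in ["recieve", "seperate", "definately"]:
    print(typo, "->", difflib.get_close_matches(typo, vocabulary, n=1))
# recieve -> ['receive'], seperate -> ['separate'], definately -> ['definitely']
```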

In computer science, a Levenshtein automaton for a string w and a number n is a finite-state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.
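
One direct, unoptimized way to realise this definition is to simulate the nondeterministic automaton whose states are pairs (i, e), meaning that i characters of w have been accounted for using e errors. The sketch below does exactly that; it is not one of the optimized table-driven constructions used in practice.

```python
def levenshtein_automaton_accepts(w: str, n: int, x: str) -> bool:
    """Simulate the NFA for (w, n): accept x iff its distance to w is at most n."""
    def close(states):
        # Epsilon moves: deleting a character of w costs one error, consumes no input.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    states = close({(0, 0)})
    for c in x:
        nxt = set()
        for i, e in states:
            if i < len(w) and c == w[i]:
                nxt.add((i + 1, e))          # match
            if e < n:
                nxt.add((i, e + 1))          # c is an extra (inserted) character
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution
        states = close(nxt)
        if not states:
            return False
    return any(i == len(w) for i, e in states)

print(levenshtein_automaton_accepts("food", 1, "good"))   # True  (one substitution)
print(levenshtein_automaton_accepts("food", 1, "flood"))  # True  (one insertion)
print(levenshtein_automaton_accepts("food", 1, "fudge"))  # False
```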

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
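
The sketch below implements the widely used "optimal string alignment" variant, which adds a transposition case to the Levenshtein recurrence but never edits a transposed pair again; the unrestricted Damerau–Levenshtein distance needs a slightly more involved recurrence.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions,
    and transpositions of two adjacent characters each cost 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("ca", "ac"))      # 1 (one transposition)
print(osa_distance("abcd", "acbd"))  # 1
```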

Gary Mokotoff (born April 26, 1937) is an author, lecturer, and Jewish genealogy researcher. Mokotoff is the publisher of AVOTAYNU, the International Review of Jewish Genealogy, and is the former president of the International Association of Jewish Genealogical Societies (IAJGS). He is the creator of JewishGen's Jewish Genealogical Family Finder and the Jewish Genealogical People Finder. He co-authored the Daitch–Mokotoff Soundex system. Mokotoff is co-author of Where Once We Walked: A Guide to the Jewish Communities Destroyed in the Holocaust.


In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.
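
For the substring sub-problem, one standard dynamic-programming approach initialises the first row of the edit-distance table to zero so that a match may begin at any text position, and reports every position where the final row falls within an error threshold k. The sketch below follows that idea; the threshold and example strings are chosen arbitrarily.

```python
def approx_substring_ends(pattern: str, text: str, k: int):
    """End positions in text of substrings within edit distance k of pattern."""
    m = len(pattern)
    prev_col = list(range(m + 1))  # column for the empty text prefix: D[i][0] = i
    ends = []
    for j, tc in enumerate(text, start=1):
        col = [0] * (m + 1)        # D[0][j] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            col[i] = min(prev_col[i] + 1,         # deletion
                         col[i - 1] + 1,          # insertion
                         prev_col[i - 1] + cost)  # substitution (or match)
        if col[m] <= k:
            ends.append(j)         # pattern matches some text[?:j] with <= k edits
        prev_col = col
    return ends

print(approx_substring_ends("survey", "surgery", 2))  # [5, 6, 7]
```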

The match rating approach (MRA) is a phonetic algorithm for indexing words by their pronunciation, developed by Western Airlines in 1977 for the indexing and comparison of homophonous names.

In mathematics and computer science, a string metric is a metric that measures the distance between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number that serves as an algorithm-specific indication of distance.


Where Once We Walked, compiled by noted genealogist Gary Mokotoff and Sallyann Amdur Sack with Alexander Sharon, is a gazetteer of 37,000 town names in Central and Eastern Europe focusing on those with Jewish populations in the 19th and first half of the 20th centuries and most of whose Jewish communities were almost or completely destroyed during The Holocaust.

TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.

Cologne phonetics is a phonetic algorithm which assigns to words a sequence of digits, the phonetic code. The aim of this procedure is that identically sounding words receive the same code, so the algorithm can be used to perform a similarity search between words. For example, it is possible in a list of names to find entries like "Meier" under different spellings such as "Maier", "Mayer", or "Mayr". Cologne phonetics is related to the well-known Soundex phonetic algorithm but is optimized for the German language. The algorithm was published in 1969 by Hans Joachim Postel.
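
A heavily simplified sketch of the procedure, using only part of the digit table and omitting the algorithm's context-dependent rules (for example for C, X, P before H, and D/T before C/S/Z), is enough to reproduce the Meier/Maier/Mayer/Mayr example.

```python
# Partial digit table; the letters with context-dependent rules are left out.
DIGITS = {**dict.fromkeys("aeijouy", "0"),
          **dict.fromkeys("bp", "1"),
          **dict.fromkeys("dt", "2"),
          **dict.fromkeys("fvw", "3"),
          **dict.fromkeys("gkq", "4"),
          "l": "5",
          **dict.fromkeys("mn", "6"),
          "r": "7",
          **dict.fromkeys("sz", "8")}

def cologne_code(word: str) -> str:
    """Simplified Cologne phonetics: map letters to digits, collapse runs of
    identical digits, then drop every '0' except a leading one."""
    raw = "".join(DIGITS.get(ch, "") for ch in word.lower())  # 'h' and unknowns drop out
    collapsed = "".join(d for i, d in enumerate(raw) if i == 0 or d != raw[i - 1])
    return collapsed[:1] + collapsed[1:].replace("0", "")

for name in ["Meier", "Maier", "Mayer", "Mayr"]:
    print(name, cologne_code(name))  # all four yield 67
```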

References

  1. Li, Nan; Hitchcock, Peter; Blustein, James; Bliemel, Michael (2011). H. Raghav Rao; Raj Sharman; T. S. Raghu (eds.). Exploring the Grand Challenges for Next Generation E-Business: 8th Workshop on E-Business, WEB 2009, Phoenix, AZ, USA, December 15, 2009, Revised Selected Papers. Berlin: Springer. p. 232. ISBN 9783642174483. Retrieved 31 December 2020.
  2. Cohen, Eli B. (2009). Growing Information: Part 2. Santa Rosa, Calif.: Informing Science. p. 498. ISBN   978-1-932886-17-7.