Phonetic algorithm


A phonetic algorithm is an algorithm for indexing words by their pronunciation. Most phonetic algorithms were developed for English and are not useful for indexing words in other languages.[1] Because English spelling varies significantly depending on multiple factors, such as a word's origin, its usage over time, and borrowings from other languages, phonetic algorithms necessarily take into account numerous rules and exceptions.[2]


Algorithms

Among the best-known phonetic algorithms are Soundex, the Daitch–Mokotoff Soundex, Metaphone, the New York State Identification and Intelligence System (NYSIIS) code, the match rating approach, and Cologne phonetics, each of which is described in more detail below.

Common uses

Phonetic algorithms are commonly used in spell checkers and spelling-suggestion features, and in searching and matching records where names may be spelled in different ways, for example surname matching in genealogy databases. They are often combined with approximate string matching techniques such as edit distance.

See also

Related Research Articles

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.
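
Philips's full Metaphone rule set is fairly extensive; the toy sketch below is not Metaphone itself, but it illustrates the general idea behind such encodings under a handful of assumed rules: normalise a few spelling patterns, then strip the letters that vary most between spellings, so that variant spellings of a name collapse to the same key.

```python
import re

def toy_phonetic_key(word: str) -> str:
    """Toy phonetic key in the spirit of Metaphone (NOT Philips's rule set)."""
    w = word.lower()
    if not w:
        return ""
    # A few illustrative spelling normalisations.
    w = re.sub(r"^kn", "n", w)        # silent K as in "knight"
    w = re.sub(r"^wr", "r", w)        # silent W as in "write"
    w = w.replace("ph", "f")          # PH sounds like F
    w = w.replace("ck", "k")          # CK sounds like K
    w = re.sub(r"(.)\1+", r"\1", w)   # collapse doubled letters
    # Keep the first letter, drop the remaining vowels.
    return w[0] + re.sub(r"[aeiou]", "", w[1:])

print(toy_phonetic_key("Philips"))   # flps
print(toy_phonetic_key("Phillips"))  # flps
print(toy_phonetic_key("Filips"))    # flps
```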

A phonemic orthography is an orthography in which the graphemes correspond to the language's phonemes. Natural languages rarely have perfectly phonemic orthographies; a high degree of grapheme–phoneme correspondence can be expected in orthographies based on alphabetic writing systems, but they differ in how complete this correspondence is. English orthography, for example, is alphabetic but highly nonphonemic; it was once mostly phonemic during the Middle English stage, when the modern spellings originated, but spoken English changed rapidly while the orthography remained much more stable, resulting in the modern nonphonemic situation. In contrast, the Albanian, Serbian/Croatian/Bosnian/Montenegrin, Romanian, Italian, Turkish, Spanish, Finnish, Czech, Latvian, Esperanto, Korean and Swahili orthographic systems come much closer to being consistent phonemic representations.

In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. The Levenshtein distance between two words is the minimum number of single-character edits required to change one word into the other. It is named after Soviet mathematician Vladimir Levenshtein, who defined the metric in 1965.
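
The distance is usually computed with the textbook dynamic-programming recurrence; the minimal sketch below keeps only one row of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    # prev[j] holds the distance between a[:i-1] and b[:j] (the previous row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```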

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
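
As a rough illustration of the spelling-correction use just mentioned, the sketch below scores a small made-up dictionary against a misspelled word and keeps the entries within a chosen distance threshold; the word list, threshold, and function names are assumptions for the example only.

```python
from functools import lru_cache

def correction_candidates(word, dictionary, max_distance=2):
    """Rank dictionary words by edit distance to a (possibly misspelled) word."""
    @lru_cache(maxsize=None)
    def dist(a, b):
        if not a:
            return len(b)
        if not b:
            return len(a)
        if a[0] == b[0]:
            return dist(a[1:], b[1:])
        return 1 + min(dist(a[1:], b),       # deletion
                       dist(a, b[1:]),       # insertion
                       dist(a[1:], b[1:]))   # substitution
    scored = [(dist(word, entry), entry) for entry in dictionary]
    return sorted((d, w) for d, w in scored if d <= max_distance)

# Hypothetical mini-dictionary, just for illustration.
print(correction_candidates("graffe", ["giraffe", "graft", "grail", "coffee"]))
# [(1, 'giraffe'), (2, 'graft')]
```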


In software, a spell checker is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, and improvements to Soundex are the basis for many modern phonetic algorithms.
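
A minimal sketch of the commonly described American Soundex rules (first letter kept, remaining consonants mapped to digits, runs of equal digits collapsed, H and W not breaking such a run, result padded or truncated to four characters) might look like this:

```python
def soundex(name: str) -> str:
    """American Soundex: a 4-character key (initial letter plus three digits)."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"),
             "l": "4",
             **dict.fromkeys("mn", "5"),
             "r": "6"}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    first, digits = name[0].upper(), []
    prev = codes.get(name[0], "")      # code of the first letter
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":             # H and W do not break a run of equal codes
            prev = code                # vowels reset prev, so repeats after a vowel count
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))  # R163 R163 A261
```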

A letter bank is a relative of the anagram where all the letters of one word can be used as many times as desired to make a new word or phrase. For example, IMPS is a bank of MISSISSIPPI and SPROUT is a bank of SUPPORT OUR TROOPS.
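
Under one minimal reading of that definition, checking whether one word is a letter bank of a phrase amounts to comparing the sets of letters used; the helper below is an illustrative assumption, not an established algorithm.

```python
def letter_set(s: str) -> set:
    """Distinct letters in a string, ignoring case, spaces, and punctuation."""
    return {c for c in s.lower() if c.isalpha()}

def is_letter_bank(bank: str, phrase: str) -> bool:
    """True if 'phrase' uses exactly the letters of 'bank' (each as often as desired)."""
    return letter_set(bank) == letter_set(phrase)

print(is_letter_bank("IMPS", "MISSISSIPPI"))           # True
print(is_letter_bank("SPROUT", "SUPPORT OUR TROOPS"))  # True
```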

Daitch–Mokotoff Soundex is a phonetic algorithm invented in 1985 by Jewish genealogists Gary Mokotoff and Randy Daitch. It is a refinement of the Russell and American Soundex algorithms designed to allow greater accuracy in matching of Slavic and Yiddish surnames with similar pronunciation but differences in spelling.

The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System. It features an accuracy increase of 2.7% over the traditional Soundex algorithm.

Spelling suggestion is a feature of many computer software applications used to suggest plausible replacements for words that are likely to have been misspelled.
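
For a quick illustration, Python's standard difflib module can produce such suggestions; it ranks candidates by a similarity ratio rather than a phonetic code or an edit distance, so this is just one possible approach, shown with a made-up vocabulary.

```python
import difflib

# Hypothetical vocabulary; get_close_matches returns the best-scoring entries.
vocabulary = ["receive", "believe", "separate", "definitely", "necessary"]
for typo in ["recieve", "seperate", "definately"]:
    print(typo, "->", difflib.get_close_matches(typo, vocabulary, n=1))
# recieve -> ['receive'], seperate -> ['separate'], definately -> ['definitely']
```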

In computer science, a Levenshtein automaton for a string w and a number n is a finite-state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n. That is, a string x is in the formal language recognized by the Levenshtein automaton if and only if x can be transformed into w by at most n single-character insertions, deletions, and substitutions.
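
One direct, unoptimized way to realise this definition is to simulate the nondeterministic automaton whose states are pairs (i, e), meaning that i characters of w have been accounted for using e errors. The sketch below does exactly that; it is not one of the optimized table-driven constructions used in practice.

```python
def levenshtein_automaton_accepts(w: str, n: int, x: str) -> bool:
    """Simulate the NFA for (w, n): accept x iff its distance to w is at most n."""
    def close(states):
        # Epsilon moves: deleting a character of w costs one error, consumes no input.
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < len(w) and e < n and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    states = close({(0, 0)})
    for c in x:
        nxt = set()
        for i, e in states:
            if i < len(w) and c == w[i]:
                nxt.add((i + 1, e))          # match
            if e < n:
                nxt.add((i, e + 1))          # c is an extra (inserted) character
                if i < len(w):
                    nxt.add((i + 1, e + 1))  # substitution
        states = close(nxt)
        if not states:
            return False
    return any(i == len(w) for i, e in states)

print(levenshtein_automaton_accepts("food", 1, "good"))   # True  (one substitution)
print(levenshtein_automaton_accepts("food", 1, "flood"))  # True  (one insertion)
print(levenshtein_automaton_accepts("food", 1, "fudge"))  # False
```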

In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
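
The sketch below implements the widely used "optimal string alignment" variant, which adds a transposition case to the Levenshtein recurrence but never edits a transposed pair again; the unrestricted Damerau–Levenshtein distance needs a slightly more involved recurrence.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions,
    and transpositions of two adjacent characters each cost 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(osa_distance("ca", "ac"))      # 1 (one transposition)
print(osa_distance("abcd", "acbd"))  # 1
```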

Gary Mokotoff (born April 26, 1937) is an author, lecturer, and Jewish genealogy researcher. Mokotoff is the publisher of AVOTAYNU, the International Review of Jewish Genealogy, and is the former president of the International Association of Jewish Genealogical Societies (IAJGS). He is the creator of JewishGen's Jewish Genealogical Family Finder and the Jewish Genealogical People Finder. He co-authored the Daitch–Mokotoff Soundex system. Mokotoff is co-author of Where Once We Walked: A Guide to the Jewish Communities Destroyed in the Holocaust.


In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.
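
For the substring sub-problem, one standard dynamic-programming approach initialises the first row of the edit-distance table to zero so that a match may begin at any text position, and reports every position where the final row falls within an error threshold k. The sketch below follows that idea; the threshold and example strings are chosen arbitrarily.

```python
def approx_substring_ends(pattern: str, text: str, k: int):
    """End positions in text of substrings within edit distance k of pattern."""
    m = len(pattern)
    prev_col = list(range(m + 1))  # column for the empty text prefix: D[i][0] = i
    ends = []
    for j, tc in enumerate(text, start=1):
        col = [0] * (m + 1)        # D[0][j] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            col[i] = min(prev_col[i] + 1,         # deletion
                         col[i - 1] + 1,          # insertion
                         prev_col[i - 1] + cost)  # substitution (or match)
        if col[m] <= k:
            ends.append(j)         # pattern matches some text[?:j] with <= k edits
        prev_col = col
    return ends

print(approx_substring_ends("survey", "surgery", 2))  # [5, 6, 7]
```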

The match rating approach (MRA) is a phonetic algorithm for indexing words by their pronunciation, developed by Western Airlines in 1977 for the indexing and comparison of homophonous names.

In mathematics and computer science, a string metric is a metric that measures the distance between two text strings for approximate string matching or comparison and in fuzzy string searching. A requirement for a string metric is fulfillment of the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number that serves as an algorithm-specific indication of distance.


Where Once We Walked, compiled by noted genealogist Gary Mokotoff and Sallyann Amdur Sack with Alexander Sharon, is a gazetteer of 37,000 town names in Central and Eastern Europe focusing on those with Jewish populations in the 19th and first half of the 20th centuries and most of whose Jewish communities were almost or completely destroyed during The Holocaust.

TRE is an open-source library for pattern matching in text, which works like a regular expression engine with the ability to do approximate string matching. It was developed by Ville Laurikari and is distributed under a 2-clause BSD-like license.

Cologne phonetics is a phonetic algorithm which assigns to words a sequence of digits, the phonetic code. The aim of this procedure is that identically sounding words receive the same code, so the algorithm can be used to perform a similarity search between words. For example, it is possible in a list of names to find entries like "Meier" under different spellings such as "Maier", "Mayer", or "Mayr". Cologne phonetics is related to the well-known Soundex phonetic algorithm but is optimized for the German language. The algorithm was published in 1969 by Hans Joachim Postel.
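
A heavily simplified sketch of the procedure, using only part of the digit table and omitting the algorithm's context-dependent rules (for example for C, X, P before H, and D/T before C/S/Z), is enough to reproduce the Meier/Maier/Mayer/Mayr example.

```python
# Partial digit table; the letters with context-dependent rules are left out.
DIGITS = {**dict.fromkeys("aeijouy", "0"),
          **dict.fromkeys("bp", "1"),
          **dict.fromkeys("dt", "2"),
          **dict.fromkeys("fvw", "3"),
          **dict.fromkeys("gkq", "4"),
          "l": "5",
          **dict.fromkeys("mn", "6"),
          "r": "7",
          **dict.fromkeys("sz", "8")}

def cologne_code(word: str) -> str:
    """Simplified Cologne phonetics: map letters to digits, collapse runs of
    identical digits, then drop every '0' except a leading one."""
    raw = "".join(DIGITS.get(ch, "") for ch in word.lower())  # 'h' and unknowns drop out
    collapsed = "".join(d for i, d in enumerate(raw) if i == 0 or d != raw[i - 1])
    return collapsed[:1] + collapsed[1:].replace("0", "")

for name in ["Meier", "Maier", "Mayer", "Mayr"]:
    print(name, cologne_code(name))  # all four yield 67
```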

References

  1. Li, Nan; Hitchcock, Peter; Blustein, James; Bliemel, Michael (2011). H. Raghav Rao; Raj Sharman; T. S. Raghu (eds.). Exploring the Grand Challenges for Next Generation E-Business: 8th Workshop on E-Business, WEB 2009, Phoenix, AZ, USA, December 15, 2009, Revised Selected Papers. Berlin: Springer. p. 232. ISBN 9783642174483. Retrieved 31 December 2020.
  2. Cohen, Eli B. (2009). Growing Information: Part 2. Santa Rosa, Calif.: Informing Science. p. 498. ISBN   978-1-932886-17-7.