Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. [1] It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which matches similar-sounding words and names more reliably. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

Philips later produced a new version of the algorithm, which he named Double Metaphone. Unlike the original algorithm, which is limited to English, this version takes into account spelling peculiarities of a number of other languages. In 2009 Philips released a third version, called Metaphone 3, which achieves an accuracy of approximately 99% for English words, non-English words familiar to Americans, and first names and family names commonly found in the United States. It was developed according to modern engineering standards against a test harness of prepared correct encodings.

Procedure

Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code. [2] The following list summarizes most of the rules in the original implementation:

  1. Drop duplicate adjacent letters, except for C.
  2. If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.
  3. Drop 'B' if after 'M' at the end of the word.
  4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.
  5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.
  6. Drop 'G' if followed by 'H' and the 'H' is neither at the end of the word nor before a vowel. Drop 'G' if followed by 'N' or 'NED' at the end of the word.
  7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'.
  8. Drop 'H' if after vowel and not before a vowel.
  9. 'CK' transforms to 'K'.
  10. 'PH' transforms to 'F'.
  11. 'Q' transforms to 'K'.
  12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.
  13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.
  14. 'V' transforms to 'F'.
  15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.
  16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.
  17. Drop 'Y' if not followed by a vowel.
  18. 'Z' transforms to 'S'.
  19. Drop every vowel unless it is the first letter.

This list does not constitute a complete description of the original Metaphone algorithm, and the algorithm cannot be coded correctly from it. Original Metaphone contained many errors and was superseded by Double Metaphone. Both were in turn superseded by Metaphone 3, which corrects thousands of miscodings produced by the first two versions.
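To make the flavour of these rules concrete, here is a minimal Python sketch that applies a simplified reading of them one letter at a time. It is an illustration only, not the original Metaphone and not any reference implementation: the function name `metaphone_sketch`, the order in which rules are checked, and the simplified handling of 'H' and 'GH' are all assumptions made for this example, and its output can differ from real Metaphone libraries.

```python
VOWELS = set("AEIOU")
FRONT = set("IEY")  # letters that soften a preceding C or G


def metaphone_sketch(word: str) -> str:
    """Encode `word` with a simplified subset of the rules listed above."""
    w = "".join(ch for ch in word.upper() if "A" <= ch <= "Z")
    if not w:
        return ""
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):   # rule 2
        w = w[1:]
    if w[0] == "X":                               # rule 16, initial X
        w = "S" + w[1:]
    elif w[:2] == "WH":                           # rule 15, initial WH
        w = "W" + w[2:]
    if w.endswith("MB"):                          # rule 3
        w = w[:-1]

    out = []
    i, n = 0, len(w)
    while i < n:
        c = w[i]
        prev = w[i - 1] if i > 0 else ""
        nxt = w[i + 1] if i + 1 < n else ""
        after = w[i + 1:i + 3]                    # up to two following letters
        step = 1
        if c == prev and c != "C":                # rule 1: skip doubled letter
            pass
        elif c in VOWELS:
            if i == 0:                            # rule 19
                out.append(c)
        elif c == "C":
            if nxt == "K":                        # rule 9: the K is emitted next
                pass
            elif nxt == "H":                      # rule 4: CH -> X, SCH -> K
                out.append("K" if prev == "S" else "X")
                step = 2                          # the H is consumed
            elif after == "IA":
                out.append("X")
            else:
                out.append("S" if nxt in FRONT else "K")
        elif c == "D":
            if after in ("GE", "GY", "GI"):       # rule 5; the G is consumed
                out.append("J")
                step = 2
            else:
                out.append("T")
        elif c == "G":
            silent_gh = nxt == "H" and i + 2 < n and w[i + 2] not in VOWELS
            silent_gn = w[i:] in ("GN", "GNED")
            if not (silent_gh or silent_gn):      # rule 6, else rule 7
                out.append("J" if nxt in FRONT else "K")
        elif c == "H":
            if nxt in VOWELS:                     # rule 8, simplified (assumption)
                out.append("H")
        elif c == "P":
            out.append("F" if nxt == "H" else "P")  # rule 10
            step = 2 if nxt == "H" else 1
        elif c == "S":
            if nxt == "H":                        # rule 12
                out.append("X")
                step = 2
            else:
                out.append("X" if after in ("IO", "IA") else "S")
        elif c == "T":
            if nxt == "H":                        # rule 13
                out.append("0")
                step = 2
            elif after in ("IA", "IO"):
                out.append("X")
            elif after != "CH":                   # T before CH is dropped
                out.append("T")
        elif c in "WY":
            if nxt in VOWELS:                     # rules 15 and 17
                out.append(c)
        elif c == "X":
            out.append("KS")                      # rule 16, non-initial X
        else:                                     # rules 11, 14, 18; others kept
            out.append({"Q": "K", "V": "F", "Z": "S"}.get(c, c))
        i += step
    return "".join(out)


for name in ("Knight", "Thomas", "School", "Edge", "Smith"):
    print(name, metaphone_sketch(name))
# Knight NT, Thomas 0MS, School SKL, Edge EJ, Smith SM0
```

Because the list is incomplete, this sketch should only be used to see how individual rules interact, not as a substitute for a tested implementation.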

To implement Metaphone without purchasing a (source code) copy of Metaphone 3, the reference implementation of Double Metaphone can be used. [3] Alternatively, version 2.1.3 of Metaphone 3, an earlier 2009 version without a number of encoding corrections made in the current version, version 2.5.4, has been made available under the terms of the BSD License via the OpenRefine project. [4]

Double Metaphone

The Double Metaphone phonetic encoding algorithm is the second generation of this algorithm. Its implementation was described in the June 2000 issue of C/C++ Users Journal. [5] It makes a number of fundamental design improvements over the original Metaphone algorithm.

It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT—both have XMT in common.
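One way to use the two codes for matching is to index every name under both of its keys and treat two names as candidates when they share any key. The Python sketch below shows that idea using only the key pairs quoted in the example above; it does not run a Double Metaphone encoder, and the helper names `build_index` and `candidates` are hypothetical.

```python
from collections import defaultdict

# (primary, secondary) key pairs copied from the example above.
ENTRIES = {
    "Smith": ("SM0", "XMT"),
    "Schmidt": ("XMT", "SMT"),
}


def build_index(entries):
    """Map each non-empty key to the set of names that produce it."""
    index = defaultdict(set)
    for name, keys in entries.items():
        for key in keys:
            if key:
                index[key].add(name)
    return index


def candidates(index, keys):
    """Return every indexed name sharing at least one key with `keys`."""
    found = set()
    for key in keys:
        if key:
            found |= index.get(key, set())
    return found


INDEX = build_index(ENTRIES)
print(sorted(candidates(INDEX, ENTRIES["Schmidt"])))  # ['Schmidt', 'Smith']
```

Matching on any shared key favours recall; an application that wants stricter matching could, for example, require the primary keys to agree.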

Double Metaphone tries to account for myriad irregularities in English of Slavic, Germanic, Celtic, Greek, French, Italian, Spanish, Chinese, and other origins. Thus it uses a much more complex ruleset for coding than its predecessor; for example, it tests for approximately 100 different contexts of the use of the letter C alone.

Metaphone 3

Metaphone 3, a professional version developed by the same author, Lawrence Philips, was released in October 2009. It is a commercial product sold as source code. Metaphone 3 further improves phonetic encoding of words in the English language, non-English words familiar to Americans, and first names and family names commonly found in the United States. It improves encoding for proper names in particular to a considerable extent. [6] The author claims that, overall, it raises accuracy from the approximately 89% of Double Metaphone to 98%.

Developers can also set switches in code that cause the algorithm to 1) take non-initial vowels into account and 2) encode voiced and unvoiced consonants differently. This allows the result set to be focused more closely if the developer finds that the search results include too many words that do not resemble the search term closely enough. [7]

Metaphone 3 is sold as C++, Java, C#, PHP, Perl, and PL/SQL source, with Ruby and Python wrappers that access a Java jar; versions for Spanish and German pronunciation are available as Java and C# source. [8] The latest revision of the Metaphone 3 algorithm is v2.5.4, released in March 2015. The Metaphone 3 Java source code for an earlier version, 2.1.3, which lacks a large number of encoding corrections made in version 2.5.4, was included as part of the OpenRefine project and is publicly viewable. [9]

Common misconceptions

Some common misconceptions about the Metaphone algorithms are worth correcting. The following statements are true:

  1. All of the algorithms are designed to handle regular "dictionary" words, not just names.
  2. Metaphone algorithms do not produce exact phonetic representations of the input words and names; rather, the output is an intentionally approximate phonetic representation, according to this standard:
  • words that start with a vowel sound will have an 'A', representing any vowel, as the first character of the encoding (in Double Metaphone and Metaphone 3; original Metaphone just preserves the actual vowel),
  • vowels after an initial vowel sound will be disregarded and not encoded, and
  • voiced/unvoiced consonant pairs will be mapped to the same encoding. (Examples of voiced/unvoiced consonant pairs are D/T, B/P, Z/S, G/K, etc.)

This approximate encoding is necessary to account for the way English speakers vary their pronunciations and misspell or otherwise vary words and names they are trying to spell. Vowels, of course, are notoriously highly variable. British speakers often complain that Americans seem to pronounce 'T's the same as 'D'. Consider, also, that all English speakers often pronounce 'Z' where 'S' is spelled, almost always when a noun ending in a voiced consonant or a liquid is pluralized, for example "seasons", "beams", "examples", etc. Not encoding vowels after an initial vowel sound will help to group words where a vowel and a consonant may be transposed in the misspelling or alternative pronunciation.
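The effect of these conventions can be seen in a toy Python sketch. It is not a Metaphone implementation: it works on letters rather than sounds, the function name `fold` is hypothetical, and it applies only the three conventions listed above (initial vowel written as 'A', later vowels dropped, and the voiced consonants D, B, Z, G folded onto T, P, S, K).

```python
VOWELS = set("AEIOU")
# Voiced consonant -> unvoiced partner, using the pairs named above.
UNVOICED = {"D": "T", "B": "P", "Z": "S", "G": "K"}


def fold(word: str) -> str:
    """Letter-level illustration of the three approximation conventions."""
    letters = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    out = []
    for i, ch in enumerate(letters):
        if ch in VOWELS:
            if i == 0:
                out.append("A")      # any initial vowel is written as 'A'
            # non-initial vowels are not encoded
        else:
            out.append(UNVOICED.get(ch, ch))
    return "".join(out)


print(fold("Adam"), fold("Atom"))        # ATM ATM  (D and T share a code)
print(fold("seasons"), fold("seazonz"))  # SSNS SSNS  (Z and S share a code)
print(fold("perform"), fold("preform"))  # PRFRM PRFRM  (transposed vowel)
```

The last pair shows why dropping non-initial vowels helps: swapping a vowel with a neighbouring consonant leaves the consonant skeleton, and therefore the key, unchanged.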

Metaphone of other languages

Metaphone is useful for English variants and other languages, and has been preferred to Soundex in several Indo-European languages. On the other hand, because the phonetic encoding is rough, it depends on the language and, within a language variant, on how an average speaker of that variant pronounces words; this matters mainly for non-English variants.

Perhaps the first stable adaptation of Metaphone to a language other than English was for Brazilian Portuguese. It originated in about 2008 as a database solution in the municipality of Várzea Paulista, Brazil, and evolved into the current metaphone-ptbr algorithm.

References

  1. Philips, Lawrence (December 1990). "Hanging on the Metaphone". Computer Language. 7 (12).
  2. "Morfoedro - Technology". www.morfoedro.it. Retrieved 16 May 2018.
  3. http://aspell.net/metaphone/dmetaph.cpp
  4. "OpenRefine". GitHub. 19 May 2022.
  5. Philips, Lawrence (June 2000). "The double metaphone search algorithm". C/C++ Users Journal. 18 (6): 38–43.
  6. Guy, I.; Ur, S.; Ronen, I.; Weber, S.; ... (2012). "Best Faces Forward: A Large-scale Study of People Search in the Enterprise". http://www.research.ibm.com/haifa/dept/imt/papers/guyCHI12.pdf
  7. Atkinson, Kevin. "Lawrence Philips' Metaphone Algorithm". aspell.net. Retrieved 16 May 2018.
  8. "Anthropomorphic Software". www.amorphics.com. Retrieved 16 May 2018.
  9. "OpenRefine source for Metaphone3". github.com. Retrieved 2 Nov 2020.
