Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. [1] The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms, in part because it is a standard feature of popular database software such as IBM Db2, PostgreSQL, [2] MySQL, [3] SQLite, [4] Ingres, MS SQL Server, [5] Oracle, [6] ClickHouse, [7] Snowflake, [8] and SAP ASE. [9] Improvements to Soundex are the basis for many modern phonetic algorithms. [10]

History

Soundex was developed by Robert C. Russell and Margaret King Odell [11] and patented in 1918 [12] and 1922. [13] A variation, American Soundex, was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery, and especially when it was described in Donald Knuth's The Art of Computer Programming. [14]

The National Archives and Records Administration (NARA) maintains the current rule set for the official implementation of Soundex used by the U.S. government. [1] These encoding rules are available from NARA, upon request, in the form of General Information Leaflet 55, "Using the Census Soundex".

American Soundex

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

  1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
  2. Replace consonants with digits as follows (after the first letter):
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h', 'w' or 'y' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. If there are too few letters in the word to assign three numbers, append zeros until there are three numbers. If there are four or more numbers, retain only the first three.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261". "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P'), and "Honeyman" yields "H555".
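The steps above can be sketched in Python as follows. This is a minimal illustration, not the official NARA implementation: the names `CODES`, `TRANSPARENT` and `american_soundex` are hypothetical, non-letters are simply ignored, and 'y' is treated like 'h' and 'w' because that is how step 3 is worded here (some implementations skip only 'h' and 'w').

```python
# Digit assigned to each consonant group (step 2).
CODES = {
    ch: digit
    for letters, digit in (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    )
    for ch in letters
}

# Letters that do not separate two consonants with the same digit (step 3).
# Assumption: 'Y' is included because the rule above names it.
TRANSPARENT = "HWY"


def american_soundex(name: str) -> str:
    """Return the 4-character code for `name`, or "" if it has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    # The duplicate rule also applies to the first letter, so start from its digit.
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in TRANSPARENT:
            continue  # skipped without resetting the previous digit
        code = CODES.get(ch, "")  # "" for vowels
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated digit is coded again
    # Step 4: pad with zeros, then keep the letter and three digits.
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for n in ("Robert", "Rupert", "Rubin", "Ashcraft", "Tymczak", "Pfister", "Honeyman"):
        print(n, american_soundex(n))
```

Tracking the previous digit while scanning the original spelling avoids a separate "remove vowels" pass, which is what lets a vowel separate two equal digits while 'h', 'w' and 'y' do not.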

Most SQL implementations (excluding PostgreSQL [example needed]) follow this algorithm:

  1. Save the first letter. Map all occurrences of a, e, i, o, u, y, h, w to zero (0).
  2. Replace all consonants (including the first letter) with digits as in step 2 above.
  3. Replace all adjacent same digits with one digit, and then remove all the zero (0) digits.
  4. If the saved letter's digit is the same as the resulting first digit, remove the digit (keep the letter).
  5. Append three zeros if the result contains fewer than three digits. Remove everything except the first letter and the three digits after it (this step is the same as step 4 above).

The two algorithms above do not return the same results in all cases, primarily because they remove vowels at different points. The first algorithm is used by most programming languages and the second by SQL. For example, "Tymczak" yields "T522" in the first algorithm, but "T520" in the algorithm used by SQL. Often, both algorithms generate the same code: both "Robert" and "Rupert" yield "R163", and "Honeyman" yields "H555". When designing an application that combines SQL and a programming language, the architect must decide whether to do all of the Soundex encoding in the SQL server or all of it in the programming language. The MySQL implementation can return more than 4 characters. [15] [16]
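Because the result depends on when the vowel placeholders are dropped, it can help to make that ordering an explicit switch. The sketch below follows the five numbered steps of the second algorithm; the function name `soundex_by_steps` and the `zeros_first` flag are hypothetical and are not taken from any database engine. It is an illustration of the ordering effect only, not a reproduction of a particular product's `SOUNDEX` function.

```python
from itertools import groupby

# Digit for each consonant; any other letter is mapped to "0" (steps 1 and 2).
DIGIT = {
    ch: digit
    for letters, digit in (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    )
    for ch in letters
}


def soundex_by_steps(name: str, zeros_first: bool = False) -> str:
    """Encode `name` by mapping every letter to a digit, then cleaning up.

    zeros_first=False: collapse repeated digits, then drop zeros (the listed order).
    zeros_first=True:  drop zeros, then collapse repeated digits.
    """
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = [DIGIT.get(c, "0") for c in letters]
    if zeros_first:
        digits = [d for d in digits if d != "0"]
        digits = [d for d, _ in groupby(digits)]
    else:
        digits = [d for d, _ in groupby(digits)]  # step 3, first half
        digits = [d for d in digits if d != "0"]  # step 3, second half
    # Step 4: the first letter is kept as a letter, so drop its own digit.
    if digits and digits[0] == DIGIT.get(first):
        digits = digits[1:]
    # Step 5: pad with zeros, then keep the letter and three digits.
    return (first + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for flag in (False, True):
        print(flag, soundex_by_steps("Tymczak", zeros_first=flag))
```

In this sketch "Robert" and "Rupert" give "R163" under either ordering. "Tymczak" gives "T522" when repeated digits are collapsed before the zeros are dropped, and "T520" when the zeros are dropped first, since the 'a' between 'z' and 'k' then no longer separates the two 2s.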

Variants

A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.

The New York State Identification and Intelligence System (NYSIIS) algorithm was introduced in 1970 as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.

Daitch–Mokotoff Soundex (D–M Soundex) was developed in 1985 by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D–M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex", [17] although the authors discourage the use of those names. The D–M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D–M Soundex are returned in an all-numeric format between 100000 and 999999. This algorithm is much more complex than Russell Soundex.

In response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm in 1990. In 2000 Philips developed an improvement to Metaphone, which he called Double Metaphone. Double Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English. In 2009 Philips created Metaphone 3, a further revision offered as a professional version that gives a much higher percentage of correct encodings for English words, non-English words familiar to Americans, and first and last names found in the United States. It also provides settings for more exact consonant and internal vowel matching, letting the programmer tune the precision of matches.

References

  1. "The Soundex Indexing System". National Archives. National Archives and Records Administration. 30 May 2007. Archived from the original on 12 March 2020. Retrieved 24 December 2010.
  2. "Documentation: 9.1: fuzzystrmatch". PostgreSQL. Archived from the original on 23 July 2020. Retrieved 3 November 2012.
  3. "MySQL 5.5 Reference Manual :: 12.5 String Functions". MySQL. SOUNDEX(str). Archived from the original on 15 September 2016.
  4. "Built-In Scalar SQL Functions". SQLite. 16 July 2022. soundex(X). Archived from the original on 20 December 2022. Retrieved 24 December 2022.
  5. "SOUNDEX (Transact-SQL)". Microsoft Learn. 10 January 2010. Archived from the original on 23 October 2022. Retrieved 3 November 2012.
  6. "SOUNDEX". Database SQL Reference. Archived from the original on 21 October 2017. Retrieved 20 October 2017.
  7. "SOUNDEX". Functions for Working with Strings.
  8. "SOUNDEX — Snowflake Documentation". docs.snowflake.com. Retrieved 2023-01-16.
  9. "soundex". SAP Software Solutions. 28 May 2014. Archived from the original on 25 December 2022. Retrieved 24 May 2021.
  10. "Phonetic Matching: A Better Soundex". Retrieved 2012-11-03.
  11. Odell, Margaret King (1956). "The profit in records management". Systems. 20. New York: 20.
  12. US patent 1261167, R. C. Russell, "(untitled)", issued 1918-04-02 (Archived)
  13. US patent 1435663, R. C. Russell, "(untitled)", issued 1922-11-14 (Archived)
  14. Knuth, Donald E. (1973). The Art of Computer Programming: Volume 3, Sorting and Searching. Addison-Wesley. pp. 391–92. ISBN 978-0-201-03803-3. OCLC 39472999. Archived from the original on 2008-09-04. Retrieved 2010-09-17.
  15. CodingForums.com
  16. "MySQL :: MySQL 5.5 Reference Manual :: 12.5 String Functions - SOUNDEX". dev.mysql.com.
  17. Mokotoff, Gary (2007-09-08). "Soundexing and Genealogy". Retrieved 2008-01-27.