Caverphone

Caverphone, within linguistics and computing, is a phonetic matching algorithm [1] [2] devised to match English names by their sounds. It was originally built to process a custom dataset compiled between 1893 and 1938 in southern Dunedin, New Zealand. [3] It started from a concept similar to Metaphone and has since been developed to accommodate general English. [3]

Etymology

Caverphone was created by David Hood of the Caversham Project at the University of Otago in New Zealand in 2002 and revised in 2004. It was created to assist in data matching between late 19th-century and early 20th-century electoral rolls, where a name only needed to be in a "commonly recognisable form". The algorithm was intended for names that could not easily be matched between electoral rolls, after the exact matches had been removed from the pool of potential matches. It is optimised for the accents present in the study area (the southern part of the city of Dunedin, New Zealand).

Procedure

Caverphone 1.0

The rules of the algorithm are applied consecutively to any particular name, as a series of replacements.

The algorithm is as follows:

  1. Convert to lowercase
  2. Remove anything not a-z
  3. If the name starts with...
    1. cough, replace it by cou2f
    2. rough, replace it by rou2f
    3. tough, replace it by tou2f
    4. enough, replace it by enou2f
    5. gn, replace it by 2n
  4. If the name ends with
    1. mb, replace it by m2
  5. Replace
    1. cq with 2q
    2. ci with si
    3. ce with se
    4. cy with sy
    5. tch with 2ch
    6. c with k
    7. q with k
    8. x with k
    9. v with f
    10. dg with 2g
    11. tio with sio
    12. tia with sia
    13. d with t
    14. ph with fh
    15. b with p
    16. sh with s2
    17. z with s
    18. any initial vowel with an A
    19. all other vowels with a 3
    20. 3gh3 with 3kh3
    21. gh with 22
    22. g with k
    23. groups of the letter s with a S
    24. groups of the letter t with a T
    25. groups of the letter p with a P
    26. groups of the letter k with a K
    27. groups of the letter f with a F
    28. groups of the letter m with a M
    29. groups of the letter n with a N
    30. w3 with W3
    31. wy with Wy
    32. wh3 with Wh3
    33. why with Why
    34. w with 2
    35. any initial h with an A
    36. all other occurrences of h with a 2
    37. r3 with R3
    38. ry with Ry
    39. r with 2
    40. l3 with L3
    41. ly with Ly
    42. l with 2
    43. j with y
    44. y3 with Y3
    45. y with 2
  6. remove all
    1. 2
    2. 3
  7. put six 1s on the end
  8. take the first six characters as the code
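The steps above can be sketched in Python as an ordered table of regular-expression replacements. This is an illustrative reading of the list, not a reference implementation. The function name `caverphone1` and the table names are hypothetical, "groups of the letter x" is assumed to mean one or more consecutive occurrences, and step 2 is assumed to keep only a-z after lowercasing.

```python
import re

# Step 3: replacements applied only at the start of the name.
PREFIXES_V1 = [("cough", "cou2f"), ("rough", "rou2f"), ("tough", "tou2f"),
               ("enough", "enou2f"), ("gn", "2n")]

# Step 5: the 45 rules in list order, as (regex, replacement) pairs.
# "^" anchors a rule to the start of the name; "x+" matches a group of x.
RULES_V1 = [
    ("cq", "2q"), ("ci", "si"), ("ce", "se"), ("cy", "sy"), ("tch", "2ch"),
    ("c", "k"), ("q", "k"), ("x", "k"), ("v", "f"), ("dg", "2g"),
    ("tio", "sio"), ("tia", "sia"), ("d", "t"), ("ph", "fh"), ("b", "p"),
    ("sh", "s2"), ("z", "s"),
    ("^[aeiou]", "A"), ("[aeiou]", "3"),
    ("3gh3", "3kh3"), ("gh", "22"), ("g", "k"),
    ("s+", "S"), ("t+", "T"), ("p+", "P"), ("k+", "K"),
    ("f+", "F"), ("m+", "M"), ("n+", "N"),
    ("w3", "W3"), ("wy", "Wy"), ("wh3", "Wh3"), ("why", "Why"), ("w", "2"),
    ("^h", "A"), ("h", "2"),
    ("r3", "R3"), ("ry", "Ry"), ("r", "2"),
    ("l3", "L3"), ("ly", "Ly"), ("l", "2"),
    ("j", "y"), ("y3", "Y3"), ("y", "2"),
]


def caverphone1(name: str) -> str:
    """Illustrative Caverphone 1.0 encoder (hypothetical helper name)."""
    s = re.sub("[^a-z]", "", name.lower())       # steps 1-2
    for prefix, repl in PREFIXES_V1:             # step 3
        if s.startswith(prefix):
            s = repl + s[len(prefix):]
            break
    if s.endswith("mb"):                         # step 4
        s = s[:-2] + "m2"
    for pattern, repl in RULES_V1:               # step 5, strictly in order
        s = re.sub(pattern, repl, s)
    s = s.replace("2", "").replace("3", "")      # step 6
    return (s + "111111")[:6]                    # steps 7-8


print(caverphone1("Lee"))       # L11111
print(caverphone1("Thompson"))  # TMPSN1
```

Keeping the rules in a single ordered table mirrors the way the algorithm is stated: each rule sees the output of the previous one, so reordering the table would change the resulting codes.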

Caverphone 2.0

  1. Start with a word
  2. Convert to lowercase
  3. Remove anything not in the standard alphabet (typically a-z) [note 1]
  4. Remove final e
  5. If the name starts with
    1. cough make it cou2f
    2. rough make it rou2f
    3. tough make it tou2f
    4. enough make it enou2f
    5. trough make it trou2f
    6. gn make it 2n
  6. If the name ends with
    1. mb make it m2
  7. Replace
    1. cq with 2q
    2. ci with si
    3. ce with se
    4. cy with sy
    5. tch with 2ch
    6. c with k
    7. q with k
    8. x with k
    9. v with f
    10. dg with 2g
    11. tio with sio
    12. tia with sia
    13. d with t
    14. ph with fh
    15. b with p
    16. sh with s2
    17. z with s
    18. an initial vowel [note 2] with an A
    19. all other vowels with a 3
    20. j with y
    21. an initial y3 with Y3
    22. an initial y with A
    23. y with 3
    24. 3gh3 with 3kh3
    25. gh with 22
    26. g with k
    27. groups of the letter s with a S
    28. groups of the letter t with a T
    29. groups of the letter p with a P
    30. groups of the letter k with a K
    31. groups of the letter f with a F
    32. groups of the letter m with a M
    33. groups of the letter n with a N
    34. w3 with W3
    35. wh3 with Wh3
    36. if the name ends in w replace the final w with 3
    37. w with 2
    38. an initial h with an A
    39. all other occurrences of h with a 2
    40. r3 with R3
    41. if the name ends in r replace the final r with 3
    42. r with 2
    43. l3 with L3
    44. if the name ends in l replace the final l with 3
    45. l with 2
  8. remove all 2s
  9. if the name ends in 3, replace the final 3 with A
  10. remove all 3s
  11. put ten 1s on the end
  12. take the first ten characters as the code

  [note 1] This may vary if the set of letters includes characters such as æ, ā, or ø
  [note 2] Vowels are normally a, e, i, o, u but depending on the data might include characters such as æ, ā, or ø
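A minimal Python sketch of the Caverphone 2.0 steps follows, again as an ordered table of regular-expression replacements. It is one reading of the list above rather than a reference implementation. The names `caverphone2`, `PREFIXES_V2` and `RULES_V2` are hypothetical, and the alphabet and vowel set are assumed to be plain a-z and a, e, i, o, u, without the extended characters mentioned in the notes.

```python
import re

# Step 5: replacements applied only at the start of the name.
PREFIXES_V2 = [("cough", "cou2f"), ("rough", "rou2f"), ("tough", "tou2f"),
               ("enough", "enou2f"), ("trough", "trou2f"), ("gn", "2n")]

# Step 7: the 45 rules in list order, as (regex, replacement) pairs.
# "^" anchors to the start of the name, "$" to the end, "x+" is a group of x.
RULES_V2 = [
    ("cq", "2q"), ("ci", "si"), ("ce", "se"), ("cy", "sy"), ("tch", "2ch"),
    ("c", "k"), ("q", "k"), ("x", "k"), ("v", "f"), ("dg", "2g"),
    ("tio", "sio"), ("tia", "sia"), ("d", "t"), ("ph", "fh"), ("b", "p"),
    ("sh", "s2"), ("z", "s"),
    ("^[aeiou]", "A"), ("[aeiou]", "3"),
    ("j", "y"), ("^y3", "Y3"), ("^y", "A"), ("y", "3"),
    ("3gh3", "3kh3"), ("gh", "22"), ("g", "k"),
    ("s+", "S"), ("t+", "T"), ("p+", "P"), ("k+", "K"),
    ("f+", "F"), ("m+", "M"), ("n+", "N"),
    ("w3", "W3"), ("wh3", "Wh3"), ("w$", "3"), ("w", "2"),
    ("^h", "A"), ("h", "2"),
    ("r3", "R3"), ("r$", "3"), ("r", "2"),
    ("l3", "L3"), ("l$", "3"), ("l", "2"),
]


def caverphone2(name: str) -> str:
    """Illustrative Caverphone 2.0 encoder (hypothetical helper name)."""
    s = re.sub("[^a-z]", "", name.lower())       # steps 2-3
    if s.endswith("e"):                          # step 4
        s = s[:-1]
    for prefix, repl in PREFIXES_V2:             # step 5
        if s.startswith(prefix):
            s = repl + s[len(prefix):]
            break
    if s.endswith("mb"):                         # step 6
        s = s[:-2] + "m2"
    for pattern, repl in RULES_V2:               # step 7, strictly in order
        s = re.sub(pattern, repl, s)
    s = s.replace("2", "")                       # step 8
    if s.endswith("3"):                          # step 9
        s = s[:-1] + "A"
    s = s.replace("3", "")                       # step 10
    return (s + "1" * 10)[:10]                   # steps 11-12


print(caverphone2("Lee"))       # LA11111111
print(caverphone2("Thompson"))  # TMPSN11111
```

Names that produce the same code are candidates for a match. Under this sketch, for instance, `Stevenson` and `Stephenson` both encode to `STFNSN1111`.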

Examples

Caverphone 1.0

Lee -> lee -> l33 -> L33 -> L -> L111111 -> L11111
Thompson -> thompson -> th3mps3n -> th3mpS3n -> Th3mpS3n -> Th3mPS3n -> Th3MPS3n -> Th3MPS3N -> T23MPS3N -> TMPSN -> TMPSN111111 -> TMPSN1

Caverphone 2.0

Lee -> lee -> le -> l3 -> L3 -> LA -> LA1111111111 -> LA11111111
Thompson -> thompson -> th3mps3n -> th3mpS3n -> Th3mpS3n -> Th3mPS3n -> Th3MPS3n -> Th3MPS3N -> T23MPS3N -> TMPSN -> TMPSN1111111111 -> TMPSN11111

References

  1. Milette, Greg; Stroud, Adam (2012-05-18). Professional Android Sensor Programming. John Wiley & Sons. pp. 421–. ISBN 9781118240458. Retrieved 19 February 2013.
  2. Phua, Clifton; Lee, Vincent; Smith, Kate (2006). "The Personal Name Problem And a Recommended Data Mining Solution". Encyclopedia of Data Warehousing and Mining. CiteSeerX 10.1.1.127.5111.
  3. "Caverphone". National Institute of Standards and Technology. Retrieved 2018-08-20.