New York State Identification and Intelligence System

The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System (now part of the New York State Division of Criminal Justice Services). It is reported to be 2.7% more accurate than the traditional Soundex algorithm. [1]

Procedure

The algorithm, as described in Name Search Techniques, [2] is:

  1. If the first letters of the name are
    'MAC' then change these letters to 'MCC'
    'KN' then change these letters to 'NN'
    'K' then change this letter to 'C'
    'PH' then change these letters to 'FF'
    'PF' then change these letters to 'FF'
    'SCH' then change these letters to 'SSS'
  2. If the last letters of the name are [3]
    'EE' then change these letters to 'Y␢'
    'IE' then change these letters to 'Y␢'
    'DT' or 'RT' or 'RD' or 'NT' or 'ND' then change these letters to 'D␢'
  3. The first character of the NYSIIS code is the first character of the name.
  4. In the following rules, a scan is performed on the characters of the name. This is described in terms of a program loop. A pointer is used to point to the current position under consideration in the name. Step 4 is to set this pointer to point to the second character of the name.
  5. Considering the position of the pointer, only one of the following statements can be executed.
    1. If the current position is blank then go to step 7.
    2. If the current position is a vowel (AEIOU) then, if the current and next letters are 'EV', change them to 'AF'; otherwise change the current position to 'A'.
    3. If the current position is the letter
      'Q' then change the letter to 'G'
      'Z' then change the letter to 'S'
      'M' then change the letter to 'N'
    4. If the current position is the letter 'K' then if the next letter is 'N' then replace the current position by 'N' otherwise replace the current position by 'C'
    5. If the current position points to the letter string
      'SCH' then replace the string with 'SSS'
      'PH' then replace the string with 'FF'
    6. If the current position is the letter 'H' and either the preceding or following letter is not a vowel (AEIOU) then replace the current position with the preceding letter.
    7. If the current position is the letter 'W' and the preceding letter is a vowel then replace the current position with the preceding letter.
    8. If none of these rules applies, then retain the current position letter value.
  6. If the current position letter is equal to the last letter placed in the code then set the pointer to point to the next letter and go to step 5.
    The next character of the NYSIIS code is the current position letter.
    Increment the pointer to point at the next letter.
    Go to step 5.
  7. If the last character of the NYSIIS code is the letter 'S' then remove it.
  8. If the last two characters of the NYSIIS code are the letters 'AY' then replace them with the single character 'Y'.
  9. If the last character of the NYSIIS code is the letter 'A' then remove this letter.
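
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation, and it rests on several assumptions that are not part of the description: the function name nysiis is hypothetical, the blank symbol '␢' is represented by a space, characters other than A to Z are discarded before encoding, reaching the end of the name is treated like reaching a blank, steps 7 and 9 are skipped when they would leave an empty code, and the result is not truncated to a fixed length.

```python
VOWELS = set("AEIOU")
BLANK = " "  # assumption: a space stands in for the blank symbol used in step 2

PREFIXES = (("MAC", "MCC"), ("KN", "NN"), ("K", "C"),
            ("PH", "FF"), ("PF", "FF"), ("SCH", "SSS"))
SINGLE = {"Q": "G", "Z": "S", "M": "N"}


def nysiis(name: str) -> str:
    """Return an untruncated NYSIIS-style code for `name` (illustrative sketch)."""
    # Assumption: keep only the letters A-Z, upper-cased.
    s = "".join(c for c in name.upper() if "A" <= c <= "Z")
    if not s:
        return ""

    # Step 1: rewrite the first matching prefix.
    for old, new in PREFIXES:
        if s.startswith(old):
            s = new + s[len(old):]
            break

    # Step 2: rewrite the suffix, padding with a blank.
    if s.endswith(("EE", "IE")):
        s = s[:-2] + "Y" + BLANK
    elif s.endswith(("DT", "RT", "RD", "NT", "ND")):
        s = s[:-2] + "D" + BLANK

    w = list(s)        # the name is modified in place during the scan
    code = [w[0]]      # step 3
    i = 1              # step 4
    # Step 5.1: stop at a blank (or, by assumption, at the end of the name).
    while i < len(w) and w[i] != BLANK:
        c, prev = w[i], w[i - 1]
        nxt = w[i + 1] if i + 1 < len(w) else BLANK
        if c in VOWELS:                                   # 5.2
            if c == "E" and nxt == "V":
                w[i:i + 2] = "AF"
            else:
                w[i] = "A"
        elif c in SINGLE:                                 # 5.3
            w[i] = SINGLE[c]
        elif c == "K":                                    # 5.4
            w[i] = "N" if nxt == "N" else "C"
        elif "".join(w[i:i + 3]) == "SCH":                # 5.5
            w[i:i + 3] = "SSS"
        elif c == "P" and nxt == "H":                     # 5.5
            w[i:i + 2] = "FF"
        elif c == "H" and (prev not in VOWELS or nxt not in VOWELS):  # 5.6
            w[i] = prev
        elif c == "W" and prev in VOWELS:                 # 5.7
            w[i] = prev
        # Step 6: append unless it repeats the last letter of the code.
        if w[i] != code[-1]:
            code.append(w[i])
        i += 1

    # Steps 7-9; the length guards are an assumption to avoid an empty code.
    if len(code) > 1 and code[-1] == "S":
        code.pop()
    if code[-2:] == ["A", "Y"]:
        code[-2:] = ["Y"]
    if len(code) > 1 and code[-1] == "A":
        code.pop()
    return "".join(code)


print(nysiis("Knight"))                       # NAGT
print(nysiis("Stevens"), nysiis("Stephens"))  # STAFAN STAFAN
```

In this sketch, "Stevens" and "Stephens" receive the same code because the 'EV' and 'PH' rules both produce an 'F' at the same position, which is the kind of spelling variation the algorithm is meant to absorb.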

References

  1. Rajkovic, P.; Jankovic, D. (2007), "Adaptation and Application of Daitch-Mokotoff Soundex Algorithm on Serbian Names" (PDF), XVII Conference on Applied Mathematics, Novi Sad, Serbia, archived from the original (PDF) on August 27, 2011.
  2. Taft, R. L. (1970), "Name Search Techniques", New York State Identification and Intelligence System, Albany, New York.
  3. "Unicode Character 'BLANK SYMBOL' (U+2422)".