Text normalization

Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure. [1]

Applications

Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context. [2] For example:

  - "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan. [3]
  - "Dr." would be expanded to "Doctor" in "Dr. Smith" but to "Drive" in "123 Main Dr." [4]
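A minimal sketch of one such expansion rule, in Python (hypothetical code; number_to_words and expand_currency are illustrative names, not part of any cited system):

    import re

    ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
            "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]

    def number_to_words(n):
        """Spell out an integer below 1000 in English."""
        if n < 20:
            return ONES[n]
        if n < 100:
            tens, rest = divmod(n, 10)
            return TENS[tens] + ("-" + ONES[rest] if rest else "")
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")

    def expand_currency(text):
        """Replace "$N" with its spoken English form, e.g. "$200" -> "two hundred dollars"."""
        return re.sub(r"\$(\d{1,3})\b",
                      lambda m: number_to_words(int(m.group(1))) + " dollars",
                      text)

    print(expand_currency("I have $200."))  # I have two hundred dollars.

A real text-to-speech front end would need many more rules of this kind, plus context to choose between readings (e.g. "1995" as a year versus a quantity).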

Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be stemmed (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
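The diacritic-removal and case-folding steps can be sketched in a few lines of Python (a minimal illustration using the standard unicodedata module; normalize_for_search is a hypothetical helper name):

    import unicodedata

    def normalize_for_search(text):
        """Strip combining marks and fold case so "résumé" matches "Resume"."""
        # NFD decomposition splits a character like "é" into "e" plus a combining acute accent.
        decomposed = unicodedata.normalize("NFD", text)
        # Drop combining marks (Unicode category "Mn"), then fold case.
        stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        return stripped.casefold()

    assert normalize_for_search("Résumé") == normalize_for_search("resume")

Stemming and stop-word removal would typically be layered on top of a step like this before indexing.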

Techniques

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions suffice. For example, the sed script sed -E 's/\s+/ /g' inputfile would normalize runs of whitespace characters into a single space. More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text [5] and as a special case of machine translation. [6] [7]
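As a toy illustration of the tokenize-and-tag formulation, the following sketch assigns each token a coarse category and expands it accordingly (the categories, lexicon, and digit-by-digit number reading are illustrative simplifications, not the method of [5]):

    # Hypothetical expansion table; a real system would use a much larger lexicon.
    ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
    DIGITS = "zero one two three four five six seven eight nine".split()

    def tag(token):
        """Assign a coarse non-standard-word category to a token."""
        if token.isdigit():
            return "NUM"
        if token.lower() in ABBREVIATIONS:
            return "ABBR"
        return "PLAIN"

    def normalize(text):
        """Tokenize, tag, then expand each token according to its tag."""
        out = []
        for token in text.split():
            kind = tag(token)
            if kind == "NUM":
                out.append(" ".join(DIGITS[int(d)] for d in token))  # digit-by-digit reading
            elif kind == "ABBR":
                out.append(ABBREVIATIONS[token.lower()])
            else:
                out.append(token)
        return " ".join(out)

    print(normalize("Dr. Smith arrived at 10"))  # doctor Smith arrived at one zero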

Textual scholarship

In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A normalized edition is therefore distinguished from a diplomatic edition (or semi-diplomatic edition), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not. [8]

References

  1. Richard Sproat and Steven Bedrick (September 2011). "CS506/606: Txt Nrmlztn". Retrieved October 2, 2012.
  2. Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words". Computer Speech and Language 15: 287–333. doi:10.1006/csla.2001.0169.
  3. "Samoan Numbers". MyLanguages.org. Retrieved October 2, 2012.
  4. "Text-to-Speech Engines Text Normalization". MSDN. Retrieved October 2, 2012.
  5. Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization". Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics: 688–695. CiteSeerX 10.1.1.72.8138.
  6. Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). "Text Normalization as a Special Case of Machine Translation". Proceedings of the International Multiconference on Computer Science and Information Technology 1: 51–56.
  7. Mosquera, A.; Lloret, E.; Moreda, P. (2012). "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation". Proceedings of the LREC Workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA): 9–14.
  8. Harvey, P. D. A. (2001). Editing Historical Records. London: British Library. pp. 40–46. ISBN 0-7123-4684-8.