Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure. [1]
Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context. [2] For example, "$200" is read aloud as "two hundred dollars", and "Dr." may be expanded to "Doctor" or "Drive" depending on the surrounding words.
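A minimal Python sketch of this kind of expansion, covering only simple dollar amounts with hand-picked number words (the function and its rules are illustrative assumptions, not any production text-to-speech grammar):

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def expand_dollars(token: str) -> str:
    """Expand simple amounts such as "$200" into words (illustrative only)."""
    if token.startswith("$") and token[1:].isdigit():
        n = int(token[1:])
        if n < 10:
            words = ONES[n]
        elif n % 100 == 0 and n < 1000:
            words = ONES[n // 100] + " hundred"
        else:
            # Fallback for amounts this toy grammar does not cover: digit by digit.
            words = " ".join(ONES[int(d)] for d in token[1:])
        return words + (" dollar" if n == 1 else " dollars")
    return token  # leave non-matching tokens unchanged

print(expand_dollars("$200"))  # two hundred dollars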
Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be stemmed (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
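As a sketch of the diacritic-removal and case-folding steps (stemming and stop-word removal would need additional tooling), Unicode decomposition via Python's standard unicodedata module can be used to strip combining marks:

import unicodedata

def normalize_for_search(text: str) -> str:
    """Strip diacritical marks and fold case, e.g. "Résumé" -> "resume"."""
    decomposed = unicodedata.normalize("NFKD", text)  # "é" becomes "e" + combining acute
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()  # "John" -> "john"

print(normalize_for_search("Résumé"))  # resume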
For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice. For example, the sed script sed -E "s/[[:space:]]+/ /g" inputfile
would normalize runs of whitespace characters into a single space. More complex normalization requires correspondingly complex algorithms and domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text [5] and as a special case of machine translation. [6] [7]
In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A normalized edition is therefore distinguished from a diplomatic edition (or semi-diplomatic edition), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not. [8]
In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication channel or storage in a storage medium. An early example is the invention of language, which enabled a person, through speech, to communicate what they thought, saw, heard, or felt to others. But speech limits the range of communication to the distance a voice can carry and limits the audience to those present when the speech is uttered. The invention of writing, which converted spoken language into visual symbols, extended the range of communication across space and time.
A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.
Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.
A cedilla, or cedille, is a hook or tail added under certain letters as a diacritical mark to modify their pronunciation. In Catalan, French, and Portuguese it is used only under the letter c, and the entire letter is called, respectively, c trencada, c cédille, and c cedilhado. It is used to mark vowel nasalization in many languages of Sub-Saharan Africa, including Vute from Cameroon.
Thorn or þorn is a letter in the Old English, Old Norse, Old Swedish and modern Icelandic alphabets, as well as modern transliterations of the Gothic alphabet, Middle Scots, and some dialects of Middle English. It was also used in medieval Scandinavia but was later replaced with the digraph th, except in Iceland, where it survives. The letter originated from the rune ᚦ in the Elder Futhark and was called thorn in the Anglo-Saxon and thorn or thurs in the Scandinavian rune poems. It is similar in appearance to the archaic Greek letter sho (ϸ), although the two are historically unrelated. The only language in which þ is currently in use is Icelandic.
The orthography of the Old Norse language was diverse, being written in both Runic and Latin alphabets, with many spelling conventions, variant letterforms, and unique letters and signs. In modern times, scholars established a standardized spelling for the language. When Old Norse names are used in texts in other languages, modifications to this spelling are often made. In particular, the names of Old Norse mythological figures often have several different spellings.
Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information".
Church Slavonic is the conservative Slavic liturgical language used by the Eastern Orthodox Church in Belarus, Bulgaria, North Macedonia, Montenegro, Poland, Ukraine, Russia, Serbia, the Czech Republic and Slovakia, Slovenia and Croatia. The language appears also in the services of the Russian Orthodox Church Outside of Russia, the American Carpatho-Russian Orthodox Diocese, and occasionally in the services of the Orthodox Church in America.
In digital typography, combining characters are characters that are intended to modify other characters. The most common combining characters in the Latin script are the combining diacritical marks.
A paraphrase or rephrase is the rendering of the same text in different words without losing the meaning of the text itself. More often than not, a paraphrased text can convey its meaning better than the original words. In other words, it is a copy of the text in meaning, expressed in different words from the original. For example, when someone retells a story they heard in their own words, they paraphrase, with the meaning being the same. The term itself is derived via Latin paraphrasis, from Ancient Greek παράφρασις (paráphrasis) 'additional manner of expression'. The act of paraphrasing is also called paraphrasis.
In Hebrew orthography, niqqud or nikud is a system of diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet. Several such diacritical systems were developed in the Early Middle Ages. The most widespread system, and the only one still used to a significant degree today, was created by the Masoretes of Tiberias in the second half of the first millennium AD in the Land of Israel. Text written with niqqud is called ktiv menuqad.
The Harvard-Kyoto Convention is a system for transliterating Sanskrit and other languages that use the Devanāgarī script into ASCII. It is predominantly used informally in e-mail, and for electronic texts.
In computer science, canonicalization is a process for converting data that has more than one possible representation into a "standard", "normal", or canonical form. This can be done to compare different representations for equivalence, to count the number of distinct data structures, to improve the efficiency of various algorithms by eliminating repeated calculations, or to make it possible to impose a meaningful sorting order.
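For a concrete, simplified illustration: a fraction has many equivalent representations (2/4, 50/100), and reducing it to lowest terms with a positive denominator yields a single canonical form that makes equality checks and sorting straightforward. A small Python sketch (the function name is an illustrative assumption):

from math import gcd

def canonical_fraction(num: int, den: int) -> tuple[int, int]:
    """Reduce a fraction to lowest terms with a positive denominator."""
    if den == 0:
        raise ValueError("denominator must be nonzero")
    if den < 0:  # keep any sign on the numerator
        num, den = -num, -den
    g = gcd(num, den)
    return num // g, den // g

print(canonical_fraction(2, 4) == canonical_fraction(50, 100))  # True: both are (1, 2)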
Scribal abbreviations, or sigla, are abbreviations used by ancient and medieval scribes writing in various languages, including Latin, Greek, Old English and Old Norse.
English rarely uses diacritics, which are symbols indicating the modification of a letter's sound when spoken. Most of the affected words are terms imported from other languages. The diaeresis (two dots), the grave accent, and the acute accent are the only diacritics native to Modern English, and their usage has tended to fall off except in certain publications and particular cases.
In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar but may have differing meaning. The designation is also applied to sequences of characters sharing these properties.
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.
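A short Python illustration, using the standard unicodedata module, of two code point sequences that are canonically equivalent:

import unicodedata

precomposed = "\u00e9"  # "é" as one code point (LATIN SMALL LETTER E WITH ACUTE)
combining = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(precomposed == combining)  # False: the code point sequences differ
print(unicodedata.normalize("NFC", combining) == precomposed)  # True after normalization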
Paraphrase or paraphrasing in computational linguistics is the natural language processing task of detecting and generating paraphrases. Applications of paraphrasing are varied, including information retrieval, question answering, text summarization, and plagiarism detection. Paraphrasing is also useful in the evaluation of machine translation, as well as in semantic parsing and the generation of new samples to expand existing corpora.
Richard Sproat is a computational linguist currently working for Sakana AI as a research scientist. Prior to joining Sakana AI, Sproat worked for Google between 2012 and 2024 on text normalization and speech recognition.