Truecasing

Last updated February 19, 2024

Truecasing, also called capitalization recovery,^[1] capitalization correction,^[2] or case restoration,^[3] is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. The problem commonly arises because of the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence, which hides whether that word would be capitalized elsewhere. It also arises in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages).

Techniques

Neural networks that operate at the word level or the character level have been trained to recover capitalization with greater than 90% accuracy.
Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
Part-of-speech tagging can be used to identify proper nouns (such as Africa, Jupiter, Sarah, or Amazon), which must be capitalized. In some cases, the same word can serve as different parts of speech and is capitalized differently in each role. For example, Xerox the company, as a proper noun, is capitalized, but to xerox a document, as a verb, is not. A xerox, meaning a copy of a document, can be recognized by the preceding determiner, which does not normally accompany a proper noun.
Named entity recognition can be used to identify proper nouns, which must be capitalized.
A spell checker can be used to identify words that are always capitalized.
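
Several of the techniques above can be combined in a small rule-based sketch. The following is a minimal illustration, not a production truecaser: the word lists are hypothetical stand-ins for a spell checker's dictionary, splitting on ".", "!" and "?" is a naive substitute for real sentence segmentation, and checking the preceding word is only a crude proxy for part-of-speech tagging.

```python
import re

# Hypothetical lexicon standing in for a spell checker's list of words
# that are always capitalized.
PROPER_NOUNS = {"africa": "Africa", "jupiter": "Jupiter", "sarah": "Sarah"}

# Hypothetical list of words capitalized only when used as a proper noun.
AMBIGUOUS = {"xerox": "Xerox"}

# A preceding determiner or "to" suggests a common noun or a verb
# (a crude proxy for part-of-speech tagging).
LOWERCASE_CUES = {"a", "an", "the", "to"}

SENTENCE_END = ".!?"


def truecase(text: str) -> str:
    """Restore capitalization of badly cased text with simple rules."""
    out = []
    sentence_start = True
    prev_word = ""
    # Alternate between runs of alphanumeric characters (words) and
    # runs of everything else (spaces, punctuation).
    for chunk in re.findall(r"[^\W_]+|[\W_]+", text.lower()):
        if not chunk[0].isalnum():
            # Naive sentence segmentation: ., ! or ? ends a sentence.
            if any(ch in SENTENCE_END for ch in chunk):
                sentence_start = True
            out.append(chunk)
            continue
        word = chunk
        if word in PROPER_NOUNS:
            word = PROPER_NOUNS[word]
        elif word in AMBIGUOUS and prev_word not in LOWERCASE_CUES:
            word = AMBIGUOUS[word]
        if sentence_start:
            # The first word of every sentence is capitalized.
            word = word[0].upper() + word[1:]
            sentence_start = False
        out.append(word)
        prev_word = chunk
    return "".join(out)


print(truecase("SARAH WENT TO AFRICA. SHE HAD TO XEROX A MAP OF JUPITER! XEROX MAKES COPIERS."))
```

The rules fail in predictable ways: an abbreviation such as "e.g." is mistaken for a sentence boundary, and "the xerox manual" stays lowercase even when the company is meant. Statistical and neural approaches are intended to handle such cases.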

Applications

Truecasing aids in other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation.^[4] Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.
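
The role of capitalization in finding proper nouns can be illustrated with a toy heuristic. This is a minimal sketch under simple assumptions, not how any particular NER or ACE system works: the hypothetical function `proper_noun_candidates` treats every capitalized word that does not begin a sentence as a candidate proper noun, and it detects sentence boundaries naively from ".", "!" and "?".

```python
import re


def proper_noun_candidates(text: str) -> list:
    """Return capitalized words that do not start a sentence."""
    candidates = []
    sentence_start = True
    # Alternate between words and runs of spaces or punctuation.
    for chunk in re.findall(r"[^\W_]+|[\W_]+", text):
        if not chunk[0].isalnum():
            if any(ch in ".!?" for ch in chunk):
                sentence_start = True
            continue
        # Sentence-initial capitals carry no information, so skip them.
        if chunk[0].isupper() and not sentence_start:
            candidates.append(chunk)
        sentence_start = False
    return candidates


print(proper_noun_candidates("She flew from Africa to see Sarah."))
print(proper_noun_candidates("she flew from africa to see sarah."))
```

On the properly cased sentence the heuristic returns Africa and Sarah. On the lowercase version it returns nothing, which is the gap truecasing is meant to close before such a step runs.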

References

[1] Brown, Eric W.; Coden, Anni R. (2002). "Capitalization Recovery for Text". Information Retrieval Techniques for Speech Applications. Lecture Notes in Computer Science. Vol. 2273. pp. 11–22. doi:10.1007/3-540-45637-6_2. ISBN 978-3-540-43156-5.
[2] US patent 7,827,025 B2, Peter K. L. Mau & Dong Yu, "Efficient capitalization through user modeling", issued 2010-11-02, assigned to Microsoft Corporation.
[3] US patent 8,972,855 B2, Zhu Liu; David Gibbon & Behzad Shahraray, "Method and apparatus for providing case restoration", issued 2015-03-03, assigned to AT&T Intellectual Property I, L.P.
[4] Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
