Truecasing

Truecasing, also called capitalization recovery,[1] capitalization correction,[2] or case restoration,[3] is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. The problem commonly arises from the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence, which hides whether that word would be capitalized in another position. It also arises in badly cased or uncased text (for example, all-lowercase or all-uppercase text messages).

Truecasing is unnecessary in languages whose scripts do not distinguish between uppercase and lowercase letters. This includes most languages not written in the Latin, Greek, Cyrillic, or Armenian alphabets, such as Korean, Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.

Techniques
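
As a minimal sketch of one simple baseline (an assumption for illustration, not a method taken from this article or its references), a truecaser can pick, for each word, the surface form seen most often in a well-cased training corpus. Sentence-initial tokens are skipped during counting because their capitalization is uninformative. The function names `train_truecaser` and `truecase` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Count the cased forms of each word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # The first token is always capitalized, so it carries no case evidence.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts


def truecase(text, counts):
    """Restore case token by token, then capitalize the first token."""
    restored = []
    for i, token in enumerate(text.split()):
        forms = counts.get(token.lower())
        # Unknown words fall back to lowercase (an assumption of this sketch).
        best = forms.most_common(1)[0][0] if forms else token.lower()
        if i == 0:
            best = best[0].upper() + best[1:]
        restored.append(best)
    return " ".join(restored)


# Toy training corpus (hypothetical data, assumed to be correctly cased).
corpus = [
    "Yesterday we flew to Paris with Alice",
    "The apple pie that Alice baked was great",
    "People say Paris is lovely in spring",
]
model = train_truecaser(corpus)
print(truecase("ALICE WENT TO PARIS IN SPRING", model))
# prints "Alice went to Paris in spring"
```

A word-frequency baseline like this ignores context, so it cannot separate words whose case depends on meaning (for example "apple" the fruit versus "Apple" the company); practical systems would likely need contextual models for such cases.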

Applications

Truecasing aids in other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation. [4] Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.

References

  1. Brown, Eric W.; Coden, Anni R. (2002). "Capitalization Recovery for Text". Information Retrieval Techniques for Speech Applications. Lecture Notes in Computer Science. Vol. 2273. pp. 11–22. doi:10.1007/3-540-45637-6_2. ISBN 978-3-540-43156-5.
  2. US patent 7,827,025 B2, Peter K. L. Mau & Dong Yu, "Efficient capitalization through user modeling", issued 2010-11-02, assigned to Microsoft Corporation.
  3. US patent 8,972,855 B2, Zhu Liu; David Gibbon & Behzad Shahraray, "Method and apparatus for providing case restoration", issued 2015-03-03, assigned to AT&T Intellectual Property I, L.P.
  4. Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.