Truecasing

Last updated

Truecasing is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages).

Contents

Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This includes all languages not written in the Latin, Greek, Cyrillic or Armenian alphabets, such as Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.

Techniques

Applications

Truecasing aids in other NLP tasks, such as named entity recognition, automatic content extraction, and machine translation. [1] Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.

See also

Related Research Articles

Camel case Writing words with internal uppercase letters

Camel case is the practice of writing phrases without spaces or punctuation. It indicates the separation of words with a single capitalized letter, and the first word starting with either case. Common examples include "iPhone" and "eBay". It is also sometimes used in online usernames such as "johnSmith", and to make multi-word domain names more legible, for example in promoting "EasyWidgetCompany.com".

Natural language processing Field of computer science and linguistics

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Eponym Person or thing after which something is named

An eponym is a person, place, or thing after whom or which someone or something is, or is believed to be, named. The adjectives derived from eponym include eponymous and eponymic.

In computers, case sensitivity defines whether uppercase and lowercase letters are treated as distinct (case-sensitive) or equivalent (case-insensitive). For instance, when users interested in learning about dogs search an e-book, "dog" and "Dog" are of the same significance to them. Thus, they request a case-insensitive search. But when they search an online encyclopedia for information about the United Nations, for example, or something with no ambiguity regarding capitalization and ambiguity between two or more terms cut down by capitalization, they may prefer a case-sensitive search.

A proper noun is a noun that identifies a single entity and is used to refer to that entity as distinguished from a common noun, which is a noun that refers to a class of entities and may be used when referring to instances of a specific class. Some proper nouns occur in plural form, and then they refer to groups of entities considered as unique. Proper nouns can also occur in secondary applications, for example modifying nouns, or in the role of common nouns. The detailed definition of the term is problematic and, to an extent, governed by convention.

Capitalization or capitalisation is writing a word with its first letter as a capital letter and the remaining letters in lower case, in writing systems with a case distinction. The term also may refer to the choice of the casing applied to text.

Letter case Distinction between alphabetic letters in taller, "upper" case and shorter "lower" case

Letter case is the distinction between the letters that are in larger uppercase or capitals and smaller lowercase in the written representation of certain languages. The writing systems that distinguish between the upper and lowercase have two parallel sets of letters, with each letter in one set usually having an equivalent in the other set. The two case variants are alternative representations of the same letter: they have the same name and pronunciation and are treated identically when sorting in alphabetical order.

Title case or headline case is a style of capitalization used for rendering the titles of published works or works of art in English. When using title case, all words are capitalized except for minor words unless they are the first or last word of the title. There are different rules for which words are major, hence capitalized. As an example, a headline might be written like this: "The Quick Brown Fox Jumps over the Lazy Dog".

Iota subscript

The iota subscript is a diacritic mark in the Greek alphabet shaped like a small vertical stroke or miniature iota ⟨ι⟩ placed below the letter. It can occur with the vowel letters eta ⟨η⟩, omega ⟨ω⟩, and alpha ⟨α⟩. It represents the former presence of an offglide after the vowel, forming a so‐called "long diphthong". Such diphthongs —phonologically distinct from the corresponding normal or "short" diphthongs —were a feature of ancient Greek in the pre-classical and classical eras.

In computer programming, a naming convention is a set of rules for choosing the character sequence to be used for identifiers which denote variables, types, functions, and other entities in source code and documentation.

Conventions for the capitalization of Internet when referring to the global system of interconnected computer networks have varied over time, and vary by publishers, authors, and regional preferences. Increasingly, the proper noun sense of the word takes a lowercase i, in orthographic parallel with similar examples of how the proper names for the Sun, the Moon, the Universe, and the World are variably capitalized in English orthography.

The nouns of the German language have several properties, some unique. First off, they all have a grammatical gender, like many related Indo-European languages. The three genders are masculine, feminine, and neuter, and even words for objects without obvious masculine or feminine characteristics like 'bridge' or 'rock' can be masculine or feminine. German nouns are declined depending on their grammatical case and whether they are singular or plural. German has four cases: nominative, accusative, dative and genitive.

Hungarian orthography Standard written Hungarian

Hungarian orthography consists of rules defining the standard written form of the Hungarian language. It includes the spelling of lexical words, proper nouns and foreign words (loanwords) in themselves, with suffixes, and in compounds, as well as the hyphenation of words, punctuation, abbreviations, collation, and other information.

English orthography sometimes uses the term proper adjective to mean adjectives that take initial capital letters, and common adjective to mean those that do not. For example, a person from India is Indian—Indian is a proper adjective.

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks can be similarly ambiguous due to use in emoticons, computer code, and slang.

Latin script Writing system used for most European languages

The Latin script, also known as Roman script, is an alphabetic writing system based on the letters of the classical Latin alphabet, derived from a form of the Cumaean Greek version of the Greek alphabet used by the Etruscans. Several Latin-script alphabets exist, which differ in graphemes, collation and phonetic values from the classical Latin alphabet.

Capitalization or capitalisation in English grammar is the use of a capital letter at the head of a word. English usage varies from capitalization in other languages.

The following outline is provided as an overview of and topical guide to natural-language processing:

Reverential capitalization is the practice of capitalizing religious words that refer to deities or divine beings in cases where the words would not otherwise have been capitalized. Pronouns are also particularly included in reverential capitalization:

and God calleth to the light 'Day,' and to the darkness He hath called 'Night;' and there is an evening, and there is a morning — day one.

In computer programming languages, an identifier is a lexical token that names the language's entities. Some of the kinds of entities an identifier might denote include variables, data types, labels, subroutines, and modules.

References

  1. Lita, L. V.; Ittycheriah, A.; Roukos, S.; Kambhatla, N. (2003). "tRuEcasIng". Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. pp. 152–159.