In linguistics, realization is the process by which some kind of surface representation is derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis comes to be produced in actual language. Phonemes are often said to be realized by speech sounds. The different sounds that can realize a particular phoneme are called its allophones.
Realization is also a subtask of natural language generation, which involves creating an actual text in a human language (English, French, etc.) from a syntactic representation. There are a number of software packages available for realization, most of which have been developed by academic research groups in NLG. The remainder of this article concerns realization of this kind.
For example, the following Java code causes the simplenlg system [1] to print out the text "The women do not smoke.":
    NPPhraseSpec subject = nlgFactory.createNounPhrase("the", "woman");
    subject.setPlural(true);
    SPhraseSpec sentence = nlgFactory.createClause(subject, "smoke");
    sentence.setFeature(Feature.NEGATED, true);
    System.out.println(realiser.realiseSentence(sentence));
In this example, the computer program has specified the linguistic constituents of the sentence (verb, subject), and also linguistic features (plural subject, negated), and from this information the realiser has constructed the actual sentence.
Realisation involves three kinds of processing:
Syntactic realisation: Using grammatical knowledge to choose inflections, add function words and decide the order of components. For example, in English the subject usually precedes the verb, and the negated form of "smoke" is "do not smoke".
Morphological realisation: Computing inflected forms. For example, the plural form of "woman" is "women" (not "womans").
Orthographic realisation: Dealing with casing, punctuation, and formatting. For example, capitalising "The" because it is the first word of the sentence.
The above examples are very basic; most realisers are capable of considerably more complex processing.
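The three kinds of processing can be illustrated with a deliberately tiny, self-contained sketch. It is not simplenlg's implementation or API: the class ToyRealiser, its method names, and its small table of irregular plurals are all hypothetical, and it assumes a present-tense clause with a definite third-person subject and no object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy three-stage realiser. Every name here is hypothetical, not simplenlg's API. */
final class ToyRealiser {

    // Irregular plurals known to this sketch; any other noun simply takes "-s".
    private static final Map<String, String> IRREGULAR_PLURALS =
            Map.of("woman", "women", "man", "men", "child", "children");

    /** Morphological realisation: compute the inflected form of a noun. */
    static String inflectNoun(String noun, boolean plural) {
        if (!plural) {
            return noun;
        }
        return IRREGULAR_PLURALS.getOrDefault(noun, noun + "s");
    }

    /** Syntactic realisation: add function words and place the subject before the verb. */
    static List<String> orderWords(String noun, boolean plural, String verb, boolean negated) {
        List<String> words = new ArrayList<>();
        words.add("the");
        words.add(inflectNoun(noun, plural));
        if (negated) {
            // Negation in English needs the auxiliary "do", which agrees with the subject.
            words.add(plural ? "do" : "does");
            words.add("not");
            words.add(verb);
        } else {
            // Naive third-person singular agreement: append "-s" to the verb.
            words.add(plural ? verb : verb + "s");
        }
        return words;
    }

    /** Orthographic realisation: join the words, capitalise the first, add a full stop. */
    static String punctuate(List<String> words) {
        String text = String.join(" ", words);
        return Character.toUpperCase(text.charAt(0)) + text.substring(1) + ".";
    }

    /** Runs the three stages in sequence. */
    static String realise(String noun, boolean plural, String verb, boolean negated) {
        return punctuate(orderWords(noun, plural, verb, negated));
    }

    public static void main(String[] args) {
        // Prints: The women do not smoke.
        System.out.println(realise("woman", true, "smoke", true));
    }
}
```

Keeping the stages as separate methods mirrors the division described above. A real realiser would replace the lookup table with a full lexicon and the fixed word order with a grammar.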
A number of realisers have been developed over the past 20 years. These systems differ in terms of complexity and sophistication of their processing, robustness in dealing with unusual cases, and whether they are accessed programmatically via an API or whether they take a textual representation of a syntactic structure as their input.
There are also major differences in pragmatic factors such as documentation, support, licensing terms, speed and memory usage.
It is not possible to describe all realisers here, but a few of the emerging areas are: