Interlinear gloss

In linguistics and pedagogy, an interlinear gloss is a gloss (a series of brief explanations, such as definitions or pronunciations) placed between lines, such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more corresponding lines of transcription, known as an interlinear text or interlinear glossed text (IGT), or interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, as well as the structure of the original language. In its simplest form, an interlinear gloss is a literal, word-for-word translation of the source text.

History

Interlinear text in Toussaint-Langenscheidt Spanisch, a Spanish-language textbook for German speakers, 1910

Interlinear glosses have been used for a variety of purposes over a long period of time. One common usage has been to annotate bilingual textbooks for language education. This sort of interlinearization serves to help make the meaning of a source text explicit without attempting to formally model the structural characteristics of the source language.

Such annotations have occasionally been expressed not through interlinear layout but through enumeration of words in the object language and the metalanguage. One such example is Wilhelm von Humboldt's annotation of Classical Nahuatl: [1]

1 2 3 4 5 6 7 8 9

ni- c- chihui -lia in no- piltzin ce calli

1 3 2 4 5 6 7 8 9

ich mache es für der mein Sohn ein Haus

This "inline" style allows examples to be included within the flow of text, and allows the words of the gloss to be written in an order that approximates the syntax of the glossing language. (In the gloss here, mache es is reordered from the corresponding source order to approximate German syntax more naturally.) Even so, this approach requires readers to "re-align" the correspondences between source and target forms.

Later 19th- and 20th-century approaches took to glossing vertically, aligning the same sort of word-by-word content so that the metalanguage terms were placed directly below the source language terms. In this style, the given example might be rendered thus (here with an English gloss):

ni- c- chihui -lia in no- piltzin ce calli

I it make for to-the my son a house

"I made my son a house."

Here word ordering is determined by the syntax of the object language.

Finally, modern linguists have adopted the practice of using abbreviated grammatical category labels. A 2008 publication which repeats this example labels it as follows: [2]

ni-c-chihui-lia in no-piltzin ce calli

1SG.SUBJ-3SG.OBJ-mach-APPL DET 1SG.POSS-Sohn ein Haus

This approach is denser and takes more effort to read, but it relies less on the grammatical structure of the metalanguage to express the semantics of the object-language forms.

In computing, special text markers are provided in the Specials Unicode block to indicate the start and end of interlinear glosses.
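As a small illustration of these markers, the following Python sketch wraps a word and its gloss in the three interlinear annotation characters of the Specials block (U+FFF9, U+FFFA and U+FFFB). The helper name `annotate` is hypothetical, and most renderers do not display these characters, so the sketch only shows how the markers bracket the text.

```python
import unicodedata

ANCHOR = "\ufff9"      # INTERLINEAR ANNOTATION ANCHOR: starts the annotated text
SEPARATOR = "\ufffa"   # INTERLINEAR ANNOTATION SEPARATOR: starts the annotation
TERMINATOR = "\ufffb"  # INTERLINEAR ANNOTATION TERMINATOR: ends the annotation


def annotate(text: str, gloss: str) -> str:
    """Wrap `text` and its `gloss` in the interlinear annotation characters."""
    return f"{ANCHOR}{text}{SEPARATOR}{gloss}{TERMINATOR}"


marked = annotate("calli", "house")

# Print the code point and official Unicode name of each marker.
for ch in (ANCHOR, SEPARATOR, TERMINATOR):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```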

Structure

Though there is no formal specification for the IGT format, the Leipzig Glossing Rules [3] are a set of guidelines that aim to standardize the format as much as possible.

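An interlinear text for linguistics will commonly consist of some or all of the following lines, usually in this order from top to bottom: the original text or a transliteration of it, one or more transcription lines, a word-by-word or morpheme-by-morpheme gloss, and finally a free translation.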

As an example, the following Taiwanese Minnan clause has been transcribed with five lines of text:

1. the standard pe̍h-ōe-jī transliteration,
2. a gloss using tone numbers for the surface tones,
3. a gloss showing the underlying tones in citation form (before undergoing tone sandhi),
4. a morpheme-by-morpheme gloss in English, and
5. an English translation: [4]

(1.) goá iáu-boē koat-tēng tang-sî boeh tńg-khì

(2.) goa1 iau1-boe3 koat2-teng3 tang7-si5 boeh2 tng1-khi3.

(3.) goa2 iau2-boe7 koat4-teng7 tang1-si5 boeh4 tng2-khi3.

(4.) I not-yet decide when want return.

(5.) "I have not yet decided when I shall return."

Word-by-word alignment. According to the Leipzig Glossing Rules, it is standard to left-align the words in the object language with the corresponding words in the metalanguage; this alignment can be seen between lines (1-3) and line (4).
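The left-alignment described here can be sketched in a few lines of Python. The function name `align_rows` is hypothetical, and the sketch measures width with `len()`, which counts code points; text with combining diacritics (such as line 1 of the example) would need a display-width library to align exactly, so the ASCII lines (2) and (4) are used instead.

```python
def align_rows(rows: list[str]) -> list[str]:
    """Left-align the whitespace-separated words of several gloss lines in columns."""
    table = [row.split() for row in rows]
    n_cols = max(len(cells) for cells in table)
    # Each column is as wide as its longest cell across all rows.
    widths = [
        max(len(cells[i]) for cells in table if i < len(cells))
        for i in range(n_cols)
    ]
    return [
        "  ".join(cell.ljust(widths[i]) for i, cell in enumerate(cells)).rstrip()
        for cells in table
    ]


lines = align_rows([
    "goa1 iau1-boe3 koat2-teng3 tang7-si5 boeh2 tng1-khi3.",
    "I not-yet decide when want return.",
])
print("\n".join(lines))
```

Each gloss word then starts in the same column as the object-language word it translates.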

Morpheme-by-morpheme correspondence. At the sub-word level, segmentable morphemes are separated by hyphens, both in the example and in the gloss. There should be the same number of hyphens in the example and in the gloss, as shown in the following example:

Gila abur-u-n ferma hamišaluǧ güǧüna amuqʼ-da-č

now they-OBL-GEN farm forever behind stay-FUT-NEG

'Now their farm will not stay behind forever.'
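The hyphen-count rule lends itself to a simple automatic check. The following is a minimal sketch, assuming words are separated by spaces and morphemes by plain hyphens; the function name `check_morpheme_counts` is hypothetical.

```python
def check_morpheme_counts(text: str, gloss: str) -> list[tuple[str, str]]:
    """Return the (word, gloss) pairs whose hyphen counts differ."""
    words, glosses = text.split(), gloss.split()
    if len(words) != len(glosses):
        raise ValueError("text and gloss have different numbers of words")
    return [(w, g) for w, g in zip(words, glosses) if w.count("-") != g.count("-")]


mismatches = check_morpheme_counts(
    "Gila abur-u-n ferma hamišaluǧ güǧüna amuqʼ-da-č",
    "now they-OBL-GEN farm forever behind stay-FUT-NEG",
)
print(mismatches)  # an empty list means every word matches its gloss
```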

Grammatical category labels. In amuqʼ-da-č, the stem (amuqʼ) is translated into the corresponding English lexeme (stay), while the inflectional affixes (-da) and (-č) mark future tense and negation. These affixes are glossed as FUT and NEG; a list of standard abbreviations for grammatical categories that are widely used in linguistics can be found in the Leipzig Glossing Rules.

One-to-many correspondences. When a single object-language element corresponds to several metalanguage elements, they are separated by periods. [3] E.g.,

çık-mak

come.out-INF

'to come out'

Non-overt elements. If the morpheme-by-morpheme gloss (middle line) contains an element that does not correspond to an overt element in the example, a standard strategy is to include an overt "ø" in the object-language text, [3] which is separated by a hyphen like an overt element would be:

puer-ø

boy-NOM

'boy'

Reduplication is treated similarly to affixation but with a tilde (instead of the standard hyphen) that connects the copied element to the stem: [3]

bi~bili

IPFV~buy

'is buying'

Punctuation

In interlinear morphological glosses, various forms of punctuation separate the glosses. Typically, the words are aligned with their glosses; within words, a hyphen is used when a boundary is marked in both the text and its gloss, a period when a boundary appears in only one. That is, there should be the same number of words separated with spaces in the text and its gloss, as well as the same number of hyphenated morphemes within a word and its gloss. This is the basic system, and can be applied universally. For example,

Odadan hızlı çıktım. (Turkish)

oda-dan hız-lı çık-tı-m

room-ABL speed-COM go.out-PFV-1sg

room-from speed-with go_out-perfective-I

'I left the room quickly.'

An underscore may be used instead of a period, as in go_out-PFV, when a single word in the source language happens to correspond to a phrase in the glossing language, though a period would still be used for other situations, such as Greek oikíais house.FEM.PL.DAT 'to the houses'.

However, sometimes finer distinctions may be made. For example, clitics may be separated with a double hyphen (or, for ease of typing, an equal sign) rather than a hyphen:

Je t'aime. (French)

je⹀te⹀aime

I⹀you⹀love

'I love you.'

Affixes which cause discontinuity (infixes, circumfixes, transfixes, etc.) may be set off by angle brackets, and reduplication with tildes, rather than with hyphens:

sulat, susulat, sumulat, sumusulat (verbal declensions) (Tagalog)

sulat su~sulat s⟨um⟩ulat s⟨um⟩u~sulat

write contemplative mood~write ⟨agent trigger.past⟩write ⟨agent trigger⟩contemplative~write

(See affix for other examples.)

Morphemes which cannot be easily separated out, such as umlaut, may be marked with a backslash rather than a period:

unser-n Väter-n (German)

our-DAT.PL father\PL-DAT.PL

'to our fathers' (the singular of Väter 'fathers' is Vater)
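The boundary symbols in this section can be handled mechanically. The sketch below is only an illustration: it treats the hyphen, the equals sign (the typed variant of the clitic boundary) and the tilde as boundaries that must appear in both a word and its gloss, and leaves periods, underscores and backslashes inside a segment. Angle brackets for infixes are not handled, and the function names are hypothetical.

```python
import re

# Boundaries that must appear in both text and gloss: hyphen (affix),
# equals sign (clitic), tilde (reduplication).
BOUNDARY = re.compile(r"([-=~])")


def split_segments(word: str) -> list[str]:
    """Split a word or gloss into segments and the boundary symbols between them."""
    return [part for part in BOUNDARY.split(word) if part]


def boundaries_match(word: str, gloss: str) -> bool:
    """True if word and gloss use the same boundary symbols in the same order."""
    def marks(s: str) -> list[str]:
        return [p for p in split_segments(s) if BOUNDARY.fullmatch(p)]
    return marks(word) == marks(gloss)


print(split_segments("go.out-PFV-1sg"))
print(boundaries_match("bi~bili", "IPFV~buy"))
print(boundaries_match("Väter-n", "father\\PL-DAT.PL"))
```

Because periods and backslashes mark distinctions that exist only in the gloss, they are deliberately ignored when the two sides are compared.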

A few other conventions which are sometimes seen are illustrated in the Leipzig Glossing Rules. [3]

Interlinear gloss resources

Efforts have been undertaken to digitize IGT for hundreds of the world's languages. [5]

Online Database of Interlinear Text

The Online Database of Interlinear Text (ODIN) is a database of over 200,000 instances of interlinear glosses for more than 1,500 languages extracted from scholarly linguistic research. [6] The database was constructed in two phases: automatic construction followed by manual correction. The automatic construction stage itself was completed in three steps:

  1. First, search engines (e.g., Google, Bing) were queried to retrieve scholarly documents that were likely to contain interlinear glosses. The queries comprised terms relevant to linguistic research, such as grammatical category labels (e.g., "NOM", shorthand for nominative; "3SG", shorthand for third person singular).
  2. Second, each line in an extracted document was tagged as belonging to an interlinear gloss or not, using sequence-labeling methods from machine learning.
  3. Third, each interlinear gloss instance was assigned a language name (e.g., Tagalog) and an ISO 639-3 language ID. Names and IDs were assigned automatically using coreference resolution models from natural language processing: each instance was tagged with the language name (and ID) that appears in the scholarly document from which it was extracted. [6]

In the manual correction phase, the database creators manually corrected the boundaries of the interlinear gloss instances discovered by the sequence-labeling method in step 2 of the automatic construction phase. They then verified the language names and language codes in a second and third pass over the data, respectively.

Language distribution of interlinear gloss instances in the Online Database of Interlinear Text after phase 1 (phase 2 in parentheses)

Range of instances    Number of languages    Number of instances    Percent of instances
>10,000               3 (1)                  36,691 (10,814)        19.39 (6.88)
1000-9999             37 (31)                97,158 (81,218)        51.34 (51.69)
100-999               122 (139)              40,260 (46,420)        21.27 (29.55)
10-99                 326 (460)              12,822 (15,560)        6.78 (9.96)
1-9                   838 (862)              2,313 (3,012)          1.22 (1.92)
Total                 1,326 (1,493)          189,244 (157,114)      100 (100)

Automatic processing of interlinear gloss instances

Natural language processing models that draw on interlinear gloss resources, such as the Online Database of Interlinear Text, have been developed. [7] [8]

Automatic glossing

For example, natural language processing systems have been developed to produce interlinear glosses automatically: [7]

mi-s ħumukuli elu-ab-ok'ek'-asi anu

you-GEN camel we.OBL-ERG.1.PL-steal-PRT be.NEG

'We didn't steal your camel.'

Given the morpheme-segmented line (first line above) and the free translation line (third line above), the task is to produce the middle gloss line, comprising stem translations (e.g., mi:you) and the grammatical category labels corresponding to affixes (e.g., ab:ERG.1.PL). Sequence prediction models from natural language processing have been used to perform this task. [7] Two factors contribute to the difficulty of this task:

  1. The translation is not necessarily in alignment with the morpheme segmented line (e.g., camel is the last word in the translation but the second word in the morpheme segmented line).
  2. Some words in the morpheme segmented line have multiple correspondences in the gloss (e.g., anu:be.NEG).
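To make the input and output of the task concrete, here is a deliberately naive sketch that glosses each morpheme by table lookup. It is not the sequence prediction approach of the cited work: the `LEXICON` table and the `gloss_line` helper are hypothetical, and the table is filled in by hand from the example above.

```python
# Hypothetical hand-built lexicon taken from the example sentence above.
LEXICON = {
    "mi": "you", "s": "GEN", "ħumukuli": "camel",
    "elu": "we.OBL", "ab": "ERG.1.PL", "ok'ek'": "steal", "asi": "PRT",
    "anu": "be.NEG",
}


def gloss_line(segmented: str, lexicon: dict[str, str], unknown: str = "?") -> str:
    """Gloss a morpheme-segmented line by looking up each morpheme."""
    glossed_words = []
    for word in segmented.split():
        morphemes = word.split("-")
        # Unseen morphemes are marked with the `unknown` placeholder.
        glossed_words.append("-".join(lexicon.get(m, unknown) for m in morphemes))
    return " ".join(glossed_words)


result = gloss_line("mi-s ħumukuli elu-ab-ok'ek'-asi anu", LEXICON)
print(result)
```

A lookup like this cannot resolve ambiguous morphemes or gloss unseen ones, which is where the free translation line and a learned model become useful.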

Automatic discovery of morphological structure from glosses

Researchers have used interlinear glosses to obtain the morphological paradigms of the object language (i.e., the language being glossed). To create morphological paradigms automatically from interlinear glosses, they build a table for every stem in the gloss, with a (possibly empty) slot for every grammatical category (e.g., ERG) in the gloss. For instance, given the glossed sentence below: [7]

Vecher-om ya pobeja-la v magazin

evening-INS 1.SG.NOM run-PFV.PST.SG.FEM in store.ACC

'In the evening I ran to the store.'

There would be a paradigm for the stem pobeja with slots for PFV.PST.SG.FEM and PFV.PST.SG.MASC:

(Partial) paradigm for pobeja

Slot               Inflection
PFV.PST.SG.FEM     pobeja-la
PFV.PST.SG.MASC    ?

The slot for PFV.PST.SG.FEM would be filled (since it was observed in the interlinear gloss data) but the slot for PFV.PST.SG.MASC would be empty (assuming that no other interlinear gloss instance contains pobeja inflected for the PFV.PST.SG.MASC grammatical category). A statistical machine learning model for morphological inflection can be used to fill in the missing entries. [8] [9] [10] [11] [12]
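The table-building step can be sketched as follows. This is a simplified illustration rather than the cited method: it assumes the stem is the first morpheme of a word and that everything after it is inflectional, and the function names `build_paradigms` and `paradigm_table` are hypothetical.

```python
from collections import defaultdict


def build_paradigms(glossed_words):
    """Collect observed inflections per stem from (segmented_word, gloss) pairs.

    Assumes the stem is the first morpheme and every later morpheme is an
    inflectional suffix, which is a simplification made for this sketch.
    """
    paradigms = defaultdict(dict)
    for word, gloss in glossed_words:
        stem = word.split("-")[0]
        # The slot is built from the category labels after the stem gloss.
        slot = ".".join(gloss.split("-")[1:])
        if slot:
            paradigms[stem][slot] = word
    return paradigms


def paradigm_table(paradigms, stem, slots):
    """List each slot with the observed form, or '?' for an empty slot."""
    return [(slot, paradigms[stem].get(slot, "?")) for slot in slots]


observed = build_paradigms([("pobeja-la", "run-PFV.PST.SG.FEM")])
table = paradigm_table(observed, "pobeja", ["PFV.PST.SG.FEM", "PFV.PST.SG.MASC"])
print(table)
```

The entries marked "?" are the cells that an inflection model would then be asked to predict.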

References

  1. Lehmann, Christian (2004-01-23). "Directions for interlinear morphemic translations". In Geert Booij; Christian Lehmann; Joachim Mugdan; Stavros Skopeteas (eds.). Morphologie. Ein internationales Handbuch zur Flexion und Wortbildung. Handbücher der Sprach- und Kommunikationswissenschaft. Vol. 2. Berlin: W. de Gruyter. pp. 1834–1857.
  2. Haspelmath, Martin (2008). Language typology and language universals: an international handbook. Walter de Gruyter. p. 715. ISBN 978-3-11-011423-2.
  3. Bickel, Balthasar; Bernard Comrie; Martin Haspelmath (February 2008). "The Leipzig Glossing Rules. Conventions for Interlinear Morpheme by Morpheme Glosses". Dept. of Linguistics – Resources – Glossing Rules. Retrieved 2010-06-30.
  4. Example from A Basic Vocabulary for a Beginner in Taiwanese by Ko Chek Hoan and Tan Pang Tin.
  5. Georgi, Ryan (2016). From Aari to Zulu: massively multilingual creation of language tools using interlinear glossed text (PhD). University of Washington.
  6. Xia, Fei; Lewis, William; Wayne, Michael; Slayden, Glenn; Georgi, Ryan; Crowgey, Joshua; Bender, Emily (2016). "Enriching a massively multilingual database of interlinear glossed text". Language Resources and Evaluation. 50 (2): 321–349. doi:10.1007/s10579-015-9325-4. S2CID 2674996. Retrieved 2021-12-15.
  7. Zhao, Xingyuan; Ozaki, Satoru; Anastasopoulos, Antonios; Neubig, Graham; Levin, Lori (2020). "Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations". COLING. Proceedings of the 28th International Conference on Computational Linguistics: 5397–5408. doi:10.18653/v1/2020.coling-main.471. S2CID 227231816. Retrieved 2021-12-15.
  8. Moeller, Sarah; Liu, Ling; Yang, Changbing; Kann, Katharina; Hulden, Mans (2020). "IG2P: From Interlinear Glossed Texts to Paradigms". EMNLP. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): 5251–5262. doi:10.18653/v1/2020.emnlp-main.424. S2CID 226262296. Retrieved 2021-12-15.
  9. Silfverberg, Miikka; Hulden, Mans (2018). "An Encoder-Decoder Approach to the Paradigm Cell Filling Problem". Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics: 2883–2889. doi:10.18653/v1/D18-1315. S2CID 53082616.
  10. Wu, Shijie; Cotterell, Ryan; Hulden, Mans (2021). "Applying the Transformer to Character-level Transduction". Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics: 1901–1907. arXiv:2005.10213. doi:10.18653/v1/2021.eacl-main.163. S2CID 218718982.
  11. Nicolai, Garrett; Cherry, Colin; Kondrak, Grzegorz (2015). "Inflection Generation as Discriminative String Transduction". Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: Association for Computational Linguistics: 922–931. doi:10.3115/v1/N15-1093. S2CID 14929030.
  12. Bhargava, Aditya; Kondrak, Grzegorz (2012). "Leveraging supplemental representations for sequential transduction". Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada: Association for Computational Linguistics: 396–406.