International Corpus of English

The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

History

Sidney Greenbaum's goal of compiling corpora to compare the syntax of world English became the ICE project, which was brought to completion by Professor Charles F. Meyer. Greenbaum envisaged international teams of researchers collecting comparable national varieties of English, both written and spoken. [1] Varieties such as British English, American English, and Indian English would each be represented by a computer corpus. [2] Researchers use the corpora to compare the syntax of the varieties of English. [3] Once completed, the ICE corpora would support comprehensive linguistic analysis of the varieties of English that have emerged. [4] Ongoing research for ICE is carried out by international teams in diverse regions. [5]

The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. [6] For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team follows a common corpus design, as well as a common scheme for grammatical annotation.

Description

Each corpus contains one million words in 500 texts of 2000 words, [7] following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, the majority of texts are derived from spoken data.

With only one million words per corpus, ICE corpora are considered very small by modern standards. [8] In each ICE corpus, 60% of the material (600,000 words) is orthographically transcribed spoken English. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription distinguishes ICE from many other corpora, including those that contain, for example, parliamentary or legal paraphrases.
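The size figures above can be checked with simple arithmetic. The following is a minimal sketch and not part of any ICE tooling: the constant and function names are illustrative assumptions, and the text length is treated as a nominal 2000 words.

```python
# Illustrative arithmetic only; all names here are hypothetical.
TEXTS_PER_CORPUS = 500
WORDS_PER_TEXT = 2000        # nominal length of each text
SPOKEN_SHARE_PERCENT = 60    # spoken share stated in the article


def corpus_size() -> int:
    """Nominal word count of a single ICE corpus."""
    return TEXTS_PER_CORPUS * WORDS_PER_TEXT


def spoken_words() -> int:
    """Nominal number of words of transcribed speech."""
    return corpus_size() * SPOKEN_SHARE_PERCENT // 100


def spoken_texts() -> int:
    """Number of texts that come from spoken data."""
    return TEXTS_PER_CORPUS * SPOKEN_SHARE_PERCENT // 100


if __name__ == "__main__":
    print(corpus_size(), spoken_words(), spoken_texts())
```

Under these assumptions the spoken component amounts to 300 of the 500 texts, which matches the corpus design.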

The corpora consist entirely of data from 1990 or later. The subjects from whom the data was collected are all adults who were educated in English and were either born in, or moved at an early age to, the country to which their data is attributed. [7] There are speech and text samples from both men and women of many age groups, but the corpus website notes: "The proportions, however, are not representative of the proportions in the population as a whole: women are not equally represented in professions such as politics and law, and so do not produce equal amounts of discourse in these fields." [7]

The British component of ICE, ICE-GB, is fully parsed with a detailed Quirk et al. [9] phrase structure grammar, and the analyses have been thoroughly checked and completed. This analysis includes part-of-speech tagging and parsing of the entire corpus. The treebank can be searched and explored with the ICE Corpus Utility Program (ICECUP). Further details are given in the handbook. [10]

To ensure compatibility between the individual corpora in ICE, each team follows a common corpus design, as well as a common scheme for grammatical annotation. [11] Many corpora are currently available for download from the official ICE website, though some require a license. Others are not yet ready for publication. [12]

Textual and Grammatical Annotation

Researchers and linguists follow specific guidelines when annotating data for the corpus. These guidelines are set out in the International Corpus of English Manuals and Documentation. The three levels of annotation are textual markup, word class tagging, and syntactic parsing. [13]

Textual Markup

Original markup and layout, such as sentence and paragraph divisions, are preserved, with special markers indicating that they are original. Spoken data is transcribed orthographically, with indicators for hesitations, false starts, and pauses. [13]

Word Class Tagging

Word classes, also called parts of speech, are grammatical categories for words based upon their function in a sentence.

British texts are automatically tagged for wordclass by the ICE tagger, developed at University College London, which uses a comprehensive grammar of the English language.

Texts from all other varieties are tagged automatically using the Penn Treebank and CLAWS tagsets. While the tags are not corrected manually, they are checked regularly for quality. [13]

Syntactic Parsing

Sentences are parsed automatically and, where necessary, corrected manually with ICECUP, a syntax tree editor created specifically for the corpus.

Dependency parsing is also done automatically, using the dependency parser Pro3Gres. The results are not manually verified. [13]

Pragmatic Parsing

Ireland is currently the only participating country that includes pragmatic annotation in its data.

Design of the Corpora

Below are the subsections of the ICE corpus design, with the number of texts in each category and sub-category given in parentheses. [7]

Spoken (300)
    Dialogues (180)
        Private (100)
            Face-to-face conversations (90)
            Phonecalls (10)
        Public (80)
            Classroom Lessons (20)
            Broadcast Discussions (20)
            Broadcast Interviews (10)
            Parliamentary Debates (10)
            Legal cross-examinations (10)
            Business Transactions (10)
    Monologues (120)
        Unscripted (70)
            Spontaneous commentaries (20)
            Unscripted Speeches (30)
            Demonstrations (10)
            Legal Presentations (10)
        Scripted (50)
            Broadcast News (20)
            Broadcast Talks (20)
            Non-broadcast Talks (10)

Written (200)
    Non-Printed (50)
        Student Writing (20)
            Student Essays (10)
            Exam Scripts (10)
        Letters (30)
            Social Letters (15)
            Business Letters (15)
    Printed (150)
        Academic Writing (40)
            Humanities (10)
            Social Sciences (10)
            Natural Sciences (10)
            Technology (10)
        Popular Writing (40)
            Humanities (10)
            Social Sciences (10)
            Natural Sciences (10)
            Technology (10)
        Reportage (20)
            Press news reports (20)
        Instructional Writing (20)
            Administrative Writing (10)
            Skills/hobbies (10)
        Persuasive Writing (10)
            Press editorials (10)
        Creative Writing (20)
            Novels & short stories (20)
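The design above is a hierarchy in which each category's figure should equal the sum of its sub-categories. The following is a minimal sketch of how that can be checked. The nested-dictionary layout and the function names are assumptions made for illustration and are not an official ICE data format; the counts are copied from the listing above.

```python
# Hypothetical encoding of the ICE design: name -> (text_count, sub-categories).
# Leaf categories carry an empty dict. Counts are taken from the listing above.
DESIGN = {
    "Spoken": (300, {
        "Dialogues": (180, {
            "Private": (100, {
                "Face-to-face conversations": (90, {}),
                "Phonecalls": (10, {}),
            }),
            "Public": (80, {
                "Classroom Lessons": (20, {}),
                "Broadcast Discussions": (20, {}),
                "Broadcast Interviews": (10, {}),
                "Parliamentary Debates": (10, {}),
                "Legal cross-examinations": (10, {}),
                "Business Transactions": (10, {}),
            }),
        }),
        "Monologues": (120, {
            "Unscripted": (70, {
                "Spontaneous commentaries": (20, {}),
                "Unscripted Speeches": (30, {}),
                "Demonstrations": (10, {}),
                "Legal Presentations": (10, {}),
            }),
            "Scripted": (50, {
                "Broadcast News": (20, {}),
                "Broadcast Talks": (20, {}),
                "Non-broadcast Talks": (10, {}),
            }),
        }),
    }),
    "Written": (200, {
        "Non-Printed": (50, {
            "Student Writing": (20, {"Student Essays": (10, {}),
                                     "Exam Scripts": (10, {})}),
            "Letters": (30, {"Social Letters": (15, {}),
                             "Business Letters": (15, {})}),
        }),
        "Printed": (150, {
            "Academic Writing": (40, {"Humanities": (10, {}),
                                      "Social Sciences": (10, {}),
                                      "Natural Sciences": (10, {}),
                                      "Technology": (10, {})}),
            "Popular Writing": (40, {"Humanities": (10, {}),
                                     "Social Sciences": (10, {}),
                                     "Natural Sciences": (10, {}),
                                     "Technology": (10, {})}),
            "Reportage": (20, {"Press news reports": (20, {})}),
            "Instructional Writing": (20, {"Administrative Writing": (10, {}),
                                           "Skills/hobbies": (10, {})}),
            "Persuasive Writing": (10, {"Press editorials": (10, {})}),
            "Creative Writing": (20, {"Novels & short stories": (20, {})}),
        }),
    }),
}


def find_mismatches(tree):
    """Return (name, stated, summed) for categories whose children do not add up."""
    problems = []
    for name, (count, children) in tree.items():
        if children:
            total = sum(child_count for child_count, _ in children.values())
            if total != count:
                problems.append((name, count, total))
            problems.extend(find_mismatches(children))
    return problems


def leaf_total(node):
    """Sum the text counts of all leaf categories under a node."""
    count, children = node
    if not children:
        return count
    return sum(leaf_total(child) for child in children.values())


if __name__ == "__main__":
    print(find_mismatches(DESIGN))
    print(leaf_total(DESIGN["Spoken"]) + leaf_total(DESIGN["Written"]))
```

With the figures as listed, every category agrees with the sum of its sub-categories, and the leaf categories add up to the 500 texts of a full corpus.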

Publications

A number of books have been published about the International Corpus of English, along with books based in part on the corpora. [14]

Participants

The current participant countries are (* = available):

References

  1. "The ICE Project" (PDF).
  2. "The ICE Project" (PDF).
  3. Nelson, Gerald (May 2004). "Introduction". World Englishes. 23 (2): 225–226. doi:10.1111/j.0883-2919.2004.00347.x. ISSN 0883-2919.
  4. "The ICE Project" (PDF).
  5. "The ICE Project" (PDF).
  6. "International Corpus of English (ICE) Homepage @ ICE-corpora.net".
  7. "Corpus Design @ ICE-corpora.net". ice-corpora.net. Retrieved 2018-03-03.
  8. Nelson, Gerald (2017). "The ICE project and world Englishes". World Englishes. 36 (3): 367–370. doi:10.1111/weng.12276.
  9. Quirk, Randolph; Greenbaum, Sidney; Leech, Geoffrey; Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. London: Longman.
  10. Nelson, Gerald; Wallis, Sean; Aarts, Bas (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
  11. "The International Corpus of English website". Archived from the original on 2009-02-04. Retrieved 2008-01-13.
  12. "International Corpus of English (ICE) Homepage @ ICE-corpora.net". ice-corpora.net. Retrieved 2018-03-03.
  13. "Annotation". www.ice-corpora.uzh.ch. Retrieved 2018-03-29.
  14. "Publications @ ICE-corpora.net". ice-corpora.net. Retrieved 2018-04-22.