The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language. [1] It has an overall accuracy rate of 96–97%; the latest version (CLAWS4) was used to tag around 100 million words of the British National Corpus. [1]
A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token), although computational applications generally use more fine-grained POS tags such as 'noun-plural'. [2] Developed in the early 1980s, [1] [3] CLAWS was built to meet the growing need for automatic grammatical annotation of large corpora. Originally created to add part-of-speech tags to the LOB corpus of British English, the CLAWS tagset has since been adapted to other languages as well, including Urdu and Arabic. [4]
Since its inception, CLAWS has been praised for its functionality and adaptability. Still, it is not without flaws: although it has an error rate of only 1.5% when judged on major categories, it leaves roughly 3.3% of words with unresolved ambiguities. Ambiguity arises in cases such as the word flies, which may be either a noun or a verb depending on context. [5] These ambiguities have driven the successive upgrades and tagsets that CLAWS has undergone.
CLAWS uses a hidden Markov model, which estimates the likelihood of sequences of tags, to choose the most probable part-of-speech label for each word.
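The idea can be sketched with a toy hidden Markov model decoded by the Viterbi algorithm. The tags below are drawn from the C5 tagset, but the probabilities and mini-lexicon are invented for illustration; CLAWS's real model is trained on large tagged corpora and is far richer than this.

```python
# Toy hidden-Markov-model tagger illustrating how context resolves an
# ambiguous word such as "flies".  All probabilities and the tiny
# lexicon are invented for illustration only.

TAGS = ["NN2", "VVZ", "AT0", "PNP"]

# Transition probabilities P(tag | previous tag); "<s>" is the start state.
trans = {
    "<s>": {"AT0": 0.4, "PNP": 0.4, "NN2": 0.1, "VVZ": 0.1},
    "AT0": {"NN2": 0.8, "VVZ": 0.05, "AT0": 0.05, "PNP": 0.1},
    "PNP": {"VVZ": 0.7, "NN2": 0.2, "AT0": 0.05, "PNP": 0.05},
}

# Emission probabilities P(word | tag); "flies" is deliberately ambiguous.
emit = {
    "the": {"AT0": 1.0},
    "he": {"PNP": 1.0},
    "flies": {"NN2": 0.5, "VVZ": 0.5},
}

def viterbi(words):
    # Probability of the best tag sequence ending in each tag.
    best = {t: trans["<s>"].get(t, 0.0) * emit.get(words[0], {}).get(t, 0.0)
            for t in TAGS}
    back = []
    for w in words[1:]:
        new, ptrs = {}, {}
        for t in TAGS:
            p, prev = max(
                ((best[s] * trans.get(s, {}).get(t, 0.0)
                  * emit.get(w, {}).get(t, 0.0), s) for s in TAGS),
                key=lambda x: x[0])
            new[t], ptrs[t] = p, prev
        best = new
        back.append(ptrs)
    # Recover the best path by walking the back-pointers.
    tag = max(best, key=best.get)
    path = [tag]
    for ptrs in reversed(back):
        tag = ptrs[tag]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["he", "flies"]))   # ['PNP', 'VVZ'] -- "flies" read as a verb
print(viterbi(["the", "flies"]))  # ['AT0', 'NN2'] -- "flies" read as a noun
```

The same surface word receives different tags depending on its left context, which is exactly the kind of disambiguation a tag-sequence model provides.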
| Tagset | Tagged text |
|---|---|
| C5 | -----_PUN "_PUQ Welcome_VVB to_PRP my_DPS house_NN1 !_SENT -----_PUN Enter_VVB freely_AV0 and_CJC of_PRF your_DPS own_DT0 will_NN1 !_PUN "_SENT -----_PUN He_PNP made_VVD no_AT0 motion_NN1 of_PRF stepping_VVG to_TO0 meet_VVI me_PNP ,_PUN but_CJC stood_VVD like_PRP a_AT0 statue_NN1 ,_PUN as_CJS though_CJS his_DPS gesture_NN1 of_PRF welcome_NN1 had_VHD fixed_VVN him_PNP into_PRP stone_SENT ._PUN |
| C7 | "_" Welcome_VV0 to_II my_APPGE house_NN1 !_! Enter_VV0 freely_RR and_CC of_IO your_APPGE own_DA will_NN1 !_! "_" He_PPHS1 made_VVD no_AT motion_NN1 of_IO stepping_VVG to_TO meet_VVI me_PPIO1 ,_, but_CCB stood_VVD like_II a_AT1 statue_NN1 ,_, as_CS21 though_CS22 his_APPGE gesture_NN1 of_IO welcome_NN1 had_VHD fixed_VVN him_PPHO1 into_II stone_NN1 ._. |
This excerpt from Bram Stoker's Dracula (1897) has been tagged using both the CLAWS C5 and C7 tagsets. This is what CLAWS output generally looks like, with the most likely part-of-speech tag appended to each word.
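Output in this horizontal word_TAG format is straightforward to post-process. The small helper below is our own sketch, not part of CLAWS, and it handles only the underscore-delimited style shown above; real CLAWS output comes in several formats.

```python
# Split horizontal word_TAG output, as in the excerpt above, into
# (word, tag) pairs.  This minimal reader handles only the
# underscore-delimited style; real CLAWS output has other formats too.

def parse_tagged(text):
    pairs = []
    for token in text.split():
        # rpartition splits at the LAST underscore, so any underscores
        # inside the word itself are kept with the word.
        word, sep, tag = token.rpartition("_")
        if not sep:                # token with no tag attached
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs

sample = "Welcome_VVB to_PRP my_DPS house_NN1 !_PUN"
print(parse_tagged(sample))
# [('Welcome', 'VVB'), ('to', 'PRP'), ('my', 'DPS'), ('house', 'NN1'), ('!', 'PUN')]
```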
The first tagset developed for CLAWS, the CLAWS1 (C1) tagset, has 132 word tags. In form and application, the C1 tagset is similar to the Brown Corpus tags. [6] See the table of tags in the C1 tagset here. [7]
From 1983 to 1986, updated versions leading to CLAWS2 formed part of a larger effort to handle tasks such as recognizing sentence breaks automatically, so that a text no longer needed manual pre-processing before tagging; instead, optional manual post-editing could be used to adjust the output of the automatic annotation where needed. [8] The CLAWS2 tagset has 166 word tags. [6] [9] See the table of tags in the C2 tagset here. [10]
CLAWS4 was used to tag the 100-million-word British National Corpus (BNC). A general-purpose grammatical tagger, it is a successor to the CLAWS1 tagger. [11] In tagging the BNC, much of the work that went into CLAWS4 focused on making the CLAWS program independent of any particular tagset. For example, the BNC project used two tagset versions: "a main tagset (C5) with 62 tags with which the whole of the corpus has been tagged, and a larger (C7) tagset with 152 tags, which has been used to make a selected 'core' sample corpus of two million words." [12] The latest version of CLAWS4 is offered by UCREL, a research centre at Lancaster University. [6] [13]
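One practical consequence of tagset independence is that a finer tagset can be collapsed onto a coarser one. The sketch below is ours, not part of CLAWS; its mapping entries are read off the parallel Dracula excerpt above and are nowhere near a complete C7-to-C5 conversion table.

```python
# Collapse a few C7 tags onto their coarser C5 counterparts.
# The mapping is derived from the parallel tagged excerpt above and
# covers only the tags that appear there; it is NOT a full C7-to-C5 table.

C7_TO_C5 = {
    "AT": "AT0", "AT1": "AT0",         # articles ("no", "a")
    "APPGE": "DPS",                    # possessive determiners ("my", "his")
    "II": "PRP",                       # general prepositions ("to", "like", "into")
    "IO": "PRF",                       # the preposition "of"
    "CC": "CJC", "CCB": "CJC",         # coordinating conjunctions ("and", "but")
    "PPHS1": "PNP", "PPHO1": "PNP",    # personal pronouns ("He", "him")
    "PPIO1": "PNP",                    # personal pronoun ("me")
}

def coarsen(tag):
    # Fall back to the original tag when no coarser category is listed.
    return C7_TO_C5.get(tag, tag)

print(coarsen("APPGE"))  # DPS
print(coarsen("NN1"))    # NN1 (shared by both tagsets, passes through)
```

Going the other way, from coarse to fine, is not possible without re-tagging, which is one reason the BNC core sample was tagged with the richer C7 set directly.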
The CLAWS5 tagset, which was used for the BNC, has over 60 tags. [6] See the table of tags in the C5 tagset here. [14]
The CLAWS6 tagset was used for the BNC sampler corpus and the COLT corpus. It has over 160 tags, including 13 determiner subtypes. [6] See the table of tags in the C6 tagset here. [15]
The standard CLAWS7 tagset is the one currently in use. It differs from the CLAWS6 tagset only in its punctuation tags. [6] See the table of tags in the C7 tagset here. [16]
The CLAWS8 tagset extends the C7 tagset with further distinctions in the determiner and pronoun categories, as well as 37 new auxiliary tags for forms of be, do, and have. [6] See the table of tags in the C8 tagset here.