Sketch Engine

Original author(s) Adam Kilgarriff, Pavel Rychlý
Developer(s) Lexical Computing CZ s.r.o.
Initial release 23 July 2003 [1]
Written in Go, JavaScript, jQuery, C++, Python
Operating system Linux, Mac OS X
Platform IA-32, x64 or IA-64
Standard(s) Unicode
Available in 11 languages: Arabic, Crimean Tatar, Czech, English, French, German, Irish, Italian, Nko, Spanish, Ukrainian
Type Corpus manager for 90+ languages, database management system
License Proprietary software; both commercial and freeware editions are available
Website www.sketchengine.eu

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour (lexicographers, researchers in corpus linguistics, translators or language learners) to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. [2] Currently, it supports and provides corpora in 90+ languages. [3]

History of development

Sketch Engine is a product of Lexical Computing Limited, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff. [4] He started a collaboration with Pavel Rychlý, a computer scientist working at the Natural Language Processing Centre, Masaryk University, [5] and the developer of Manatee and Bonito (two major parts of the software suite), and introduced the concept of word sketches.

Since then, Sketch Engine has been commercial software. However, all the core features of Manatee and Bonito that were developed by 2003 (and extended since then) are freely available under the GPL license within the NoSketch Engine suite. [6]

Features

A list of tools available in Sketch Engine:

Keywords and terminology extraction

It is a tool for automatic term extraction for identifying words typical of a particular corpus, document, or text. It supports extracting one-word and multi-word units from monolingual and bilingual texts. The terminology extraction feature provides a list of relevant terms based on comparison with a large corpus of general language. This tool is also a separate service operating as OneClick terms with a dedicated interface. [8]
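The comparison with a general-language corpus can be illustrated with a small sketch. It ranks words by the ratio of their smoothed relative frequencies in a focus text and in a reference text. The scoring formula, the smoothing constant and all function names here are assumptions chosen for illustration; the source does not state which score Sketch Engine actually uses.

```python
from collections import Counter


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank words of the focus text by how typical they are of it.

    Score = (focus_fpm + smoothing) / (reference_fpm + smoothing).
    This ratio is an illustrative assumption, not a confirmed
    description of the Sketch Engine implementation.
    """
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(reference_tokens)
    focus_size = len(focus_tokens)
    ref_size = len(reference_tokens)

    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = per_million(count, focus_size)
        ref_fpm = per_million(ref_counts.get(word, 0), ref_size)
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)

    # Highest score first: words most characteristic of the focus text.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: a tiny "specialised" text and a tiny "general" text.
focus = "the corpus index stores the corpus positions".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()
ranked = keyness_scores(focus, reference)
```

With this toy data, the domain word "corpus" ranks first, while "the", which is frequent in both texts, ranks last.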

List of text corpora

Sketch Engine provides access to more than 700 text corpora. There are monolingual as well as multilingual corpora of different sizes (from thousands of words up to 60 billion words) and from various sources (web, books, subtitles, legal documents, etc.). The list of corpora includes the British National Corpus, the Brown Corpus, the Cambridge Academic English Corpus and the Cambridge Learner Corpus, CHILDES corpora of child language, OpenSubtitles (a set of 60 parallel corpora), 24 multilingual corpora of EUR-Lex documents, the TenTen Corpus Family (multi-billion-word web corpora), trends corpora (monitor corpora with daily updates), etc.

Architecture

Figure: Thesaurus cloud of the lemma "work" in Sketch Engine

Sketch Engine consists of three main components: an underlying database management system called Manatee, a web interface search front-end called Bonito and a web interface for corpus building and management called Corpus Architect. [9]

Manatee

Manatee is a database management system specifically devised for effective indexing of large text corpora. It is based on the idea of inverted indexing (keeping an index of all positions of a given word in the text). It has been used to index text corpora comprising tens of billions of words. [10]
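The idea of inverted indexing can be shown with a minimal sketch: map each word to the list of positions where it occurs, then answer phrase lookups from those lists instead of scanning the text. This is a toy in-memory illustration with hypothetical function names, not Manatee's actual data structures or API.

```python
from collections import defaultdict


def build_inverted_index(tokens):
    """Map each token to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(index)


def find_phrase(index, phrase):
    """Return start positions where the words of `phrase` occur in sequence."""
    if not phrase:
        return []
    # One set of positions per phrase word, for fast membership tests.
    position_sets = [set(index.get(word, ())) for word in phrase]
    return [
        start
        for start in index.get(phrase[0], [])
        if all(start + offset in position_sets[offset]
               for offset in range(1, len(phrase)))
    ]


tokens = "to be or not to be".split()
index = build_inverted_index(tokens)
hits = find_phrase(index, ["to", "be"])  # start positions of "to be"
```

Only the position lists of the queried words are consulted, which is why this kind of index scales to corpora far larger than the toy text used here.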

Searching corpora indexed by Manatee is performed by formulating queries in the Corpus Query Language (CQL). [11]
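A CQL query describes a sequence of token positions, each constrained by attribute patterns, for example `[lemma="work" & tag="V.*"] [tag="RB"]`. The following sketch mimics that idea on a list of annotated tokens. It is a simplified illustration only: the matcher, the whole-value regular expression matching, and the Penn-Treebank-style tags are assumptions, and it covers none of the further CQL operators.

```python
import re


def token_matches(token, constraints):
    """True if every attribute of the token fully matches its regex."""
    return all(
        re.fullmatch(pattern, token.get(attribute, "")) is not None
        for attribute, pattern in constraints.items()
    )


def match_query(tokens, query):
    """Return start positions where consecutive tokens satisfy the query.

    `query` is a list of dicts, one per token position; each dict maps
    an attribute name to a regular expression (a hypothetical encoding).
    """
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(tok, cons) for tok, cons in zip(window, query)):
            hits.append(start)
    return hits


# Hypothetical annotated sentence: "She works hard at work"
tokens = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "works", "lemma": "work", "tag": "VBZ"},
    {"word": "hard", "lemma": "hard", "tag": "RB"},
    {"word": "at", "lemma": "at", "tag": "IN"},
    {"word": "work", "lemma": "work", "tag": "NN"},
]

# Rough analogue of: [lemma="work" & tag="V.*"] [tag="RB"]
query = [{"lemma": "work", "tag": "V.*"}, {"tag": "RB"}]
hits = match_query(tokens, query)
```

The query matches the verb "works" followed by an adverb, but not the noun "work" at the end of the sentence.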

Manatee is written in C++ and offers an API for a number of other programming languages including Python, Java, Perl and Ruby. It was later rewritten in Go for faster processing of corpus queries. [12]

Bonito

Bonito is a web interface for Manatee providing access to corpus search. In the client–server model, Manatee is the server and Bonito plays the client part. It is written in Python. [9]

Corpus Architect

Corpus Architect is a web interface providing corpus building and management features. It is also written in Python.

Applications

Sketch Engine has been used by major British and other publishing houses to produce dictionaries; its users include Macmillan English Dictionary, Dictionnaires Le Robert, Oxford University Press and Shogakukan. Four of the UK's five biggest dictionary publishers use Sketch Engine. [13]

See also

Related Research Articles

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.

Collocation: Frequent occurrence of words next to each other

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

Dictionary-based machine translation

Machine translation can use a method based on dictionary entries, which means that the words will be translated as a dictionary does – word by word, usually without much correlation of meaning between them. Dictionary lookups may be done with or without morphological analysis or lemmatisation. While this approach to machine translation is probably the least sophisticated, dictionary-based machine translation is ideally suitable for the translation of long lists of phrases on the subsentential level, e.g. inventories or simple catalogs of products and services.

Concordance (publishing): List of words or terms in a published book

A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.

Croatian National Corpus is the biggest and the most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb following the ideas of Marko Tadić. The theoretical foundations and the expression of the need for a general-purpose, representative and multi-million corpus of Croatian started to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to text-books, newspaper, user-groups and chat rooms.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).

The knowledge acquisition bottleneck is perhaps the major impediment to solving the word sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises.

Macmillan English Dictionary for Advanced Learners, also known as MEDAL, is an advanced learner's dictionary first published in 2002 by Macmillan Education. It shares most of the features of this type of dictionary: it provides definitions in simple language, using a controlled defining vocabulary; most words have example sentences to illustrate how they are typically used; and information is given about how words combine grammatically or in collocations. MEDAL also introduced a number of innovations. These include:

Classic monolingual Word Sense Disambiguation evaluation tasks use WordNet as their sense inventory and are largely based on supervised or semi-supervised classification with manually sense-annotated corpora:

A corpus manager is a tool for multilingual corpus analysis, which allows effective searching in corpora.

The Bulgarian WordNet (BulNet) is an electronic multilingual dictionary of synonym sets along with their explanatory definitions and sets of semantic relations with other words in the language.

Adam Kilgarriff

Adam Kilgarriff was a corpus linguist, lexicographer, and co-author of Sketch Engine.

Word sketch

A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations. The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score.

SkELL

SkELL is a free corpus-based web tool that allows language learners and teachers to find authentic sentences for specific target words. For any word or phrase, SkELL displays a concordance that lists example sentences drawn from a special text corpus crawled from the World Wide Web, which has been cleaned of spam and includes only high-quality texts covering everyday, standard, formal, and professional language. There are versions of SkELL for English, Russian, German, Italian, Czech and Estonian.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.

References

  1. Companies House. Searched on the United Kingdom's registrar of companies (company name: LEXICAL COMPUTING LIMITED; company number: 04841901).
  2. Kilgarriff, Adam; Baisa, Vít; Bušta, Jan; Jakubíček, Miloš; Kovář, Vojtěch; Michelfeit, Jan; Rychlý, Pavel; Suchomel, Vít (10 July 2014). "The Sketch Engine: ten years on". Lexicography. 1 (1): 7–36. doi:10.1007/s40607-014-0009-9. ISSN 2197-4292.
  3. "Languages in Sketch Engine". Sketch Engine. Lexical Computing s.r.o. 7 June 2016. Retrieved 22 January 2018.
  4. Adam Kilgarriff's home page
  5. Natural Language Processing Centre, Masaryk University
  6. NoSketch Engine
  7. Kilgarriff, Adam; Herman, Ondřej; Bušta, Jan; Rychlý, Pavel; Jakubíček, Miloš (2015). "DIACRAN: a framework for diachronic analysis" (PDF). Corpus Linguistics 2015: 65–70.
  8. Baisa, Vít (2017). "Simplifying terminology extraction: OneClick Terms" (PDF). Proceedings of the 9th International Corpus Linguistics Conference.
  9. Rychlý, Pavel (2007). "Manatee/bonito–a modular corpus manager" (PDF). 1st Workshop on Recent Advances in Slavonic Natural Language Processing: 65–70.
  10. Pomikálek, Jan; Jakubíček, Miloš; Rychlý, Pavel (2012). "Building a 70 billion word corpus of English from ClueWeb" (PDF). Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12).
  11. "CQL – Corpus Query Language". Sketch Engine. Lexical Computing s.r.o. 15 May 2015. Retrieved 22 January 2018.
  12. Rychlý, Pavel; Rábara, Radoslav (2015). "Concurrent Processing of Text Corpus Queries" (PDF). Workshop on Recent Advances in Slavonic Natural Language Processing: 49–58.
  13. "Using Computational Lexicography for Dictionary Production with the Sketch Engine". REF Impact Case Studies. University of Brighton. Retrieved 18 April 2015.