The Slovenian National Corpus FidaPLUS is a 621-million-word (token) corpus of the Slovenian language, compiled from selected Slovenian texts of different genres and styles, mainly books and newspapers. [1]
FidaPLUS is an upgrade of the older FIDA corpus, which was developed between 1997 and 2000, extended with texts published up to 2006. It was the result of an applied research project of the Faculty of Arts and the Faculty of Social Sciences, both of the University of Ljubljana, and the Department of Knowledge Technologies of the Jožef Stefan Institute. [2]
The corpus is available via the corpus manager Sketch Engine. [3] This version of the FidaPLUS corpus contains word sketches, automatic corpus-derived overviews of a word's grammatical and collocational behaviour.
Year of publication | Number of words | Percent |
---|---|---|
1979–1990 | 262,708 | 0.04% |
1991 | 1,487,895 | 0.24% |
1992 | 2,256,692 | 0.36% |
1993 | 3,208,687 | 0.52% |
1994 | 7,534,689 | 1.21% |
1995 | 7,433,897 | 1.20% |
1996 | 16,913,916 | 2.72% |
1997 | 31,589,250 | 5.09% |
1998 | 43,512,041 | 7.01% |
1999 | 54,711,630 | 8.81% |
2000 | 57,677,534 | 9.29% |
2001 | 74,720,532 | 12.03% |
2002 | 72,802,484 | 11.72% |
2003 | 82,897,097 | 13.35% |
2004 | 67,041,167 | 10.79% |
2005 | 39,086,695 | 6.29% |
2006 | 44,526,825 | 7.17% |
N/A | 13,486,261 | 2.17% |
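The percentage column can be recomputed from the word counts. The following is a minimal Python sketch, assuming the counts are copied from the table above; the names `counts` and `share` are illustrative and not part of any FidaPLUS tooling.

```python
# Word counts per year of publication, copied from the table above.
counts = {
    "1979-1990": 262_708,
    "1991": 1_487_895,
    "1992": 2_256_692,
    "1993": 3_208_687,
    "1994": 7_534_689,
    "1995": 7_433_897,
    "1996": 16_913_916,
    "1997": 31_589_250,
    "1998": 43_512_041,
    "1999": 54_711_630,
    "2000": 57_677_534,
    "2001": 74_720_532,
    "2002": 72_802_484,
    "2003": 82_897_097,
    "2004": 67_041_167,
    "2005": 39_086_695,
    "2006": 44_526_825,
    "N/A": 13_486_261,
}

# The rows sum to 621,150,000 words, in line with the stated corpus size.
total = sum(counts.values())


def share(words: int, total_words: int) -> float:
    """Return a row's share of the corpus in percent, rounded to two decimals."""
    return round(100 * words / total_words, 2)


for year, words in counts.items():
    print(f"{year:>9} | {words:>11,} | {share(words, total):.2f}%")
print(f"{'total':>9} | {total:>11,}")
```

Summing the rows first, rather than hard-coding 621 million, keeps the shares consistent with the table itself.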
Slovene, or Slovenian, is a South Slavic language of the western subgroup; the South Slavic languages belong to the Balto-Slavic branch of the Indo-European language family. Most of its 2.5 million speakers are inhabitants of Slovenia, the majority of them ethnic Slovenes. As Slovenia is part of the European Union, Slovene is also one of its 24 official and working languages. Its grammar is highly fusional and is characterized by a dual grammatical number. Two accentual norms are used. Its flexible word order is often adjusted for emphasis or stylistic reasons, although it is basically an SVO language. It has a T–V distinction: the use of the V-form demonstrates a respectful attitude towards superiors and the elderly, while it can be sidestepped through the passive form.
In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.
Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.
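The idea of co-occurring "more often than would be expected by chance" can be made concrete by comparing observed and expected pair counts. The Python sketch below is a rough illustration only: it assumes adjacent word pairs, a whitespace-tokenized text, and an independence baseline of count(a) × count(b) / N. The function name `bigram_ratios` and the toy sentence are invented for this example.

```python
from collections import Counter


def bigram_ratios(tokens):
    """Map each adjacent word pair to its observed / expected frequency.

    The expected frequency assumes the two words occur independently:
    count(a) * count(b) / N, where N is the number of tokens.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: observed / (unigrams[pair[0]] * unigrams[pair[1]] / n)
        for pair, observed in bigrams.items()
    }


# Toy text, invented for illustration; real corpora are far larger.
text = "strong tea and strong coffee but never powerful tea with strong tea"
ratios = bigram_ratios(text.split())

# Pairs with a ratio well above 1 are collocation candidates.
for pair, ratio in sorted(ratios.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(ratio, 2))
```

On a text this small the ratios are not statistically meaningful; the sketch only shows the shape of the computation.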
The Brown University Standard Corpus of Present-Day American English is an electronic collection of text samples of American English, the first major structured corpus of varied genres. This corpus first set the bar for the scientific study of the frequency and distribution of word categories in everyday language use. Compiled by Henry Kučera and W. Nelson Francis at Brown University, in Rhode Island, it is a general language corpus containing 500 samples of English, totaling roughly one million words, compiled from works published in the United States in 1961.
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Historically, concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.
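In corpus tools, a concordance is usually shown in keyword-in-context (KWIC) form: each hit with a few words on either side. A minimal Python sketch follows, assuming a pre-tokenized text and case-insensitive matching; the function `concordance` and its `width` parameter are illustrative and not taken from any particular tool.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for every occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Toy sentence, invented for illustration.
tokens = "the cat sat on the mat while the dog slept".split()
for left, hit, right in concordance(tokens, "the", width=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>12} [{hit}] {right}")
```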
The Croatian National Corpus is the largest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations and the case for a general-purpose, representative, multi-million-word corpus of Croatian had begun to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.
The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the Oxford English Dictionary and by Oxford University Press' language research programme. It is the largest corpus of its kind, containing nearly 2.1 billion words. It includes language from the UK, the United States, Ireland, Australia, New Zealand, the Caribbean, Canada, India, Singapore, and South Africa. The text is mainly collected from web pages; some printed texts, such as academic journals, have been collected to supplement particular subject areas. The sources are writings of all sorts, from "literary novels and specialist journals to everyday newspapers and magazines and from Hansard to the language of blogs, emails, and social media". This may be contrasted with similar databases that sample only a specific kind of writing. The corpus is generally available only to researchers at Oxford University Press, but other researchers who can demonstrate a strong need may apply for access.
Primož Jakopin (born 30 June 1949) is a Slovenian computer scientist, known for his work in the field of language technology and his contribution to speleology.
The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.
The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.
Amebis, based in Kamnik, is a major company in Slovenia in the field of language technologies. Its current manager is Miro Romih. The company has published a number of machine-readable dictionaries and encyclopedic dictionaries, and has developed spell checkers, grammar checkers, hyphenators and lemmatizers for the Slovene, Serbian and Albanian languages. In co-operation with the Jožef Stefan Institute, it has developed the speech synthesiser and screen reader Govorec (Speaker). It has also provided technical support for the largest text corpus of Slovene, FidaPLUS.
Miran Hladnik is a Slovenian literary historian, specializing in quantitative analysis of Slovene rural stories and in the Slovene historical novel.
Sketch Engine is corpus management and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.
A corpus manager is a tool for multilingual corpus analysis, which allows effective searching in corpora.
Adam Kilgarriff was a corpus linguist, lexicographer, and co-author of Sketch Engine.
A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations. The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score.
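The association scores named above can be computed from four counts: the pair frequency, the frequency of each word, and the corpus size. The Python sketch below uses common textbook definitions of Dice, T-score and MI-score; the exact variants used inside Sketch Engine may differ (for example, a log-scaled Dice), and the function name `association_scores`, its parameters, and the candidate counts are hypothetical.

```python
import math


def association_scores(f_ab, f_a, f_b, n):
    """Score a collocation candidate from raw frequencies.

    f_ab: co-occurrence count of the two words
    f_a, f_b: individual counts of each word
    n: corpus size in tokens
    """
    expected = f_a * f_b / n  # co-occurrences expected under independence
    return {
        "dice": 2 * f_ab / (f_a + f_b),
        "t_score": (f_ab - expected) / math.sqrt(f_ab),
        "mi_score": math.log2(f_ab / expected),
    }


# Hypothetical candidates for one headword: (f_ab, f_a, f_b).
n = 1_000_000
candidates = {"strong tea": (50, 1_000, 400), "the tea": (120, 60_000, 400)}

# Rank candidates by MI-score, highest first.
ranked = sorted(
    candidates,
    key=lambda c: association_scores(*candidates[c], n)["mi_score"],
    reverse=True,
)
print(ranked)
```

Raw frequency would rank the more frequent pair first, while an association score rewards pairs that occur far more often than their parts would predict, which is why the choice of sort key changes the sketch.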
SkELL is a free corpus-based web tool that allows language learners and teachers to find authentic sentences for specific target words. For any word or phrase, SkELL displays a concordance that lists example sentences drawn from a special text corpus crawled from the World Wide Web, which has been cleaned of spam and includes only high-quality texts covering everyday, standard, formal, and professional language. There are versions of SkELL for English, Russian, German, Italian, Czech and Estonian.
The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.