Hermann Moisl

Last updated January 16, 2025

Hermann Moisl is a retired academic.[1] He was a senior lecturer and visiting fellow in linguistics at Newcastle University.

Education

He received his BA from McGill University, his MPhil from Trinity College Dublin, his DPhil from the University of Oxford, and his MSc from Newcastle University.

Career

Moisl's research interests include computational linguistics, natural language processing and text processing, corpus linguistics, the cultural role of literacy, and Celtic languages and history. He has conducted multivariate analysis of text corpora. He was a key investigator on the Newcastle Electronic Corpus of Tyneside English project, alongside his colleague Karen Corrigan.

Related Research Articles

Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed using rule-based, statistical, or neural approaches from machine learning and deep learning.

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated.

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

The Croatian National Corpus is the largest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations for a general-purpose, representative, multi-million-word corpus of Croatian, and statements of the need for one, had begun to appear even earlier. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

Linguistic categories include

Stefan Th. Gries is Professor of Linguistics in the Department of Linguistics at the University of California, Santa Barbara (UCSB), Honorary Liebig-Professor of the Justus-Liebig-Universität Giessen, and since 1 April 2018 also Chair of English Linguistics in the Department of English at the Justus-Liebig-Universität Giessen.

Internet linguistics is a domain of linguistics advocated by the English linguist David Crystal. It studies new language styles and forms that have arisen under the influence of the Internet and of other new media, such as Short Message Service (SMS) text messaging. Since the beginning of human–computer interaction (HCI), leading to computer-mediated communication (CMC) and Internet-mediated communication (IMC), experts such as Gretchen McCulloch have acknowledged that linguistics has a contributing role in it, in terms of web interface and usability. Studying the emerging language on the Internet can help improve conceptual organization, translation and web usability. Such study aims to benefit both linguists and web users.

The Bijankhan corpus is a tagged corpus that is suitable for natural language processing (NLP) research on the Persian language. The collection is gathered from daily news and common texts. All documents in it are categorized by subject (political, cultural, and so on), in about 4300 different subject categories. The corpus contains about 2.6 million manually tagged words with a tag set that contains 550 Persian part-of-speech tags.

In natural language processing, semantic role labeling is the process of assigning labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.

Contrastive linguistics is a practice-oriented linguistic approach that seeks to describe the differences and similarities between a pair of languages.

Linguistics is the scientific study of language. The areas of linguistic analysis are syntax, semantics (meaning), morphology, phonetics, phonology, and pragmatics. Subdisciplines such as biolinguistics and psycholinguistics bridge many of these divisions.

The English-Arabic Parallel Corpus of United Nations Texts (EAPCOUNT) is one of the biggest available parallel corpora involving the Arabic language. It is intended as a general research tool, available beyond the present project for applied and theoretical linguistic research. It was started in 2006 as a PhD research project at the Department of Linguistics, University of Carthage, by Dr. Hammouda Salhi, in collaboration with some of his students, and was completed in 2010. The full description of the corpus was completed in 2009 and revised in 2010.

The following outline is provided as an overview of and topical guide to natural-language processing:

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.

Adam Kilgarriff was a corpus linguist, lexicographer, and co-author of Sketch Engine.

Jost Gippert (born 1956) is a German linguist, Caucasiologist, author, and professor of Comparative Linguistics at the Institute of Empirical Linguistics at the Goethe University of Frankfurt.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10¹⁰) words per language, which gave rise to the corpus family's name.

References

[1] "Hermann Moisl home page".

External links

Personal Newcastle University Homepage
