Arvi Hurskainen

Born: January 25, 1941, Kitee, Finland
Nationality: Finnish
Known for: developing SALAMA, a computational environment for language technology
Scientific career
Fields: Linguistics, language technology, machine translation
Institutions: University of Helsinki
Doctoral advisors: Juha Pentikäinen and Marja-Liisa Swantz

Arvi Johannes Hurskainen (born January 25, 1941, in Kitee) is a Finnish scholar of language technology and linguistics. Since 1985, he has developed rule-based language technology, mainly for Swahili but also for other languages, including machine translation from English to Finnish. He created a development environment called SALAMA (an acronym for Swahili Language Manager), which is suitable for any language. The major applications developed so far include a spell checker for Swahili, [1] an annotator of corpus texts, [2] an advanced dictionary between Swahili and English, [3] and translators [4] from Swahili to English, from English to Swahili, and from English to Finnish. He has also developed an advanced learning system for Swahili [5] and a system for producing targeted vocabularies for language learners. [2] Hurskainen has compiled two annotated corpora, Helsinki Corpus of Swahili 1.0 and Helsinki Corpus of Swahili 2.0. [6]

Study and work history

He first studied theology at the University of Helsinki. Later, after working in Tanzania, he studied anthropology and published his PhD dissertation, Cattle and Culture: The Structure of a Pastoral Parakuyo Society. [7] In 1976, he worked as a researcher in the Jipemoyo Project, sponsored by the Academy of Finland, in Tanzania, and from 1977 to 1980 he was in the service of the Finnish Lutheran Mission in Helsinki.

Hurskainen worked at the University of Helsinki, first as a lecturer in 1981–1989 and then as a professor in 1989–2006. In between, in 1984–1985, he worked at Tumaini University in Tanzania. Before his university career, he had worked in Tanzania for eight years in various teaching positions. He was the director of the Department of Asian and African Studies in 1999–2001. He retired in 2006.

In 1988–1992, he directed the fieldwork project Swahili Language and Folklore, sponsored by the Ministry for Foreign Affairs of Finland and the University of Dar-es-Salaam. The project produced the speech corpus DAHE (Dar-es-Salaam – Helsinki), which was later digitized.

Language technology

Hurskainen has developed language technology that relies on detailed language analysis. The basic description of a language is made using finite-state transducers, following the two-level model first developed by Kimmo Koskenniemi. The readings of individual words are then disambiguated, and syntactic mapping is performed in the same phase. Both steps use Constraint Grammar 3.0, originally developed by Fred Karlsson and implemented by Pasi Tapanainen from Connexor. [8]
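The division of labour described above can be illustrated with a small sketch. It is not SALAMA code and does not use Constraint Grammar 3.0 or any real finite-state toolkit: the dictionary `LEXICON` is a hypothetical stand-in for a transducer lookup, the tag names (`N`, `V`, `IMP`, `FIN`, `@SUBJ`, `@FMAINV`) are illustrative, and the single rule is invented for the example. The Swahili word form "paka" is assumed to have two readings, the noun 'cat' and the imperative of the verb 'smear'; the rule removes the imperative reading when a finite verb follows, and a mapping step then attaches a syntactic label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One possible analysis of a word form: a lemma plus a set of tags."""
    lemma: str
    tags: frozenset


# Hypothetical toy lexicon standing in for a finite-state transducer lookup.
LEXICON = {
    "paka": [
        Reading("paka", frozenset({"N"})),         # noun, 'cat'
        Reading("paka", frozenset({"V", "IMP"})),  # verb, imperative 'smear'
    ],
    "analala": [Reading("lala", frozenset({"V", "FIN"}))],  # 'is sleeping'
}


def analyse(token):
    """Return every reading of a token; unknown words get a placeholder."""
    return list(LEXICON.get(token, [Reading(token, frozenset({"UNKNOWN"}))]))


def disambiguate(cohorts):
    """Drop imperative readings of a word directly followed by a finite verb."""
    result = []
    for i, readings in enumerate(cohorts):
        following = cohorts[i + 1] if i + 1 < len(cohorts) else []
        if any("FIN" in r.tags for r in following):
            kept = [r for r in readings if "IMP" not in r.tags]
            if kept:  # a rule never removes the last remaining reading
                readings = kept
        result.append(readings)
    return result


def map_syntax(cohorts):
    """Pair each surviving reading with an illustrative syntactic label."""
    labelled = []
    for i, readings in enumerate(cohorts):
        following = cohorts[i + 1] if i + 1 < len(cohorts) else []
        before_finite = any("FIN" in r.tags for r in following)
        cohort = []
        for r in readings:
            if "FIN" in r.tags:
                label = "@FMAINV"    # finite main verb
            elif "N" in r.tags and before_finite:
                label = "@SUBJ"      # noun directly before a finite verb
            else:
                label = "@UNMAPPED"
            cohort.append((r, label))
        labelled.append(cohort)
    return labelled


if __name__ == "__main__":
    cohorts = [analyse(t) for t in "paka analala".split()]
    for cohort in map_syntax(disambiguate(cohorts)):
        for reading, label in cohort:
            print(reading.lemma, sorted(reading.tags), label)
```

Keeping analysis, disambiguation, and mapping as separate passes mirrors the pipeline in the paragraph above: the analyser is allowed to over-generate, and context-dependent rules then prune the readings without ever leaving a word unanalysed.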

The rule-based approach developed by Hurskainen has similarities with other rule-based systems, such as Grammatical Framework [9] and NooJ. [10] Rule-based approaches to language technology, especially as applied to machine translation, are considered suitable for low-resource languages with rich morphology, such as the Bantu languages. [11]

Production

Web material

References

  1. "Zana za Uhakiki za Microsoft Office 2013 – Swahili". Microsoft. Retrieved 16 April 2018.
  2. Hurskainen, Arvi. "Tagger". Retrieved 16 April 2018.
  3. Hurskainen, Arvi. "Dictionary". Retrieved 16 April 2018.
  4. Hurskainen, Arvi. "Translator". Retrieved 16 April 2018.
  5. Hurskainen, Arvi. "Learn Swahili". Retrieved 16 April 2018.
  6. "Hcs2-group | Kielipankki".
  7. Suomen professorit 1640–2007. Jyväskylä: Professoriliitto. 2008.
  8. "Natural Knowledge". Connexor. 2011–2016. Retrieved 16 April 2018.
  9. "GF – Grammatical Framework - A programming language for multilingual grammar applications". GF – Grammatical Framework. Retrieved 16 April 2018.
  10. "A Linguistic Development Environment". NooJ. Retrieved 16 April 2018.
  11. Hurskainen, Arvi. 2018. Sustainable language technology for African languages. In Agwuele, Augustine and Bodomo, Adams (eds), The Routledge Handbook of African Linguistics, 359–375. London: Routledge Publishers. ISBN 978-1-138-22829-0.
  12. Hurskainen, Arvi. "Welcome to Salama". Retrieved 25 June 2018. Salama (Swahili Language Manager) is an environment for language technology applications. All applications in Salama make use of rule-based language technology, started in 1985.
  13. Hurskainen, Arvi. "Technical reports on LT". Salama - Swahili Language Manager. Retrieved 25 June 2018.