Data-driven learning (DDL) is an approach to foreign language learning. Whereas most language learning is guided by teachers and textbooks, data-driven learning treats language as data and students as researchers undertaking guided discovery tasks. Underpinning this pedagogical approach is the data-information-knowledge paradigm (see DIKW pyramid). It is informed by a pattern-based approach to grammar and vocabulary, and by a lexicogrammatical approach to language in general. Thus the basic task in DDL is to identify patterns at all levels of language. From their findings, foreign language students can see how an aspect of language is typically used, which in turn informs how they can use it in their own speaking and writing. Learning how to frame language questions and how to use the available resources to obtain and interpret data is fundamental to learner autonomy. When students arrive at their own conclusions through such procedures, they use higher-order thinking skills (see Bloom's taxonomy) and create knowledge (see Vygotsky).
In DDL, students use the same types of tools that professional linguists use, namely a corpus of texts that have been sampled and stored electronically, and a concordancer, which is a search engine designed for linguistic analysis. Some tools have been specifically created for data-driven learning, such as SkELL, WriteBetter, and Micro-concord.
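To illustrate what a concordancer does, the following is a minimal keyword-in-context (KWIC) sketch in Python. The corpus file name, the tokenisation and the window size are illustrative assumptions, not features of any of the tools named above.

```python
# Minimal keyword-in-context (KWIC) concordance sketch.
# Assumptions: a plain-text corpus file named "corpus.txt" and a very
# simple tokeniser; real concordancers such as those named above offer
# far richer query languages and much larger corpora.
import re

def kwic(text, keyword, window=5):
    """Return keyword-in-context lines for every hit of `keyword`."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40}  [{tok}]  {right}")
    return lines

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read()
    for line in kwic(text, "evidence"):
        print(line)
```

Reading down the aligned column of hits is what allows learners to notice recurring patterns around the search word.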
Micro-concord was the first significant software designed for classroom use. It was developed by Tim Johns and Mike Scott for MS-DOS microcomputers and published by Oxford University Press in 1993. It evolved into the widely used WordSmith Tools.
Johns (1936–2009) pioneered data-driven learning and coined the term. It first appeared in an article, Should you be persuaded: Two examples of data-driven learning (1991).[1] His paper, From Printout to Handout,[2] is reprinted and discussed at length in Volume 2 of Hubbard's Computer-Assisted Language Learning.[3] Thomas' task-based Discovering English with Sketch Engine[4] exemplifies DDL and acknowledges Johns throughout. Other recent books on DDL which credit Johns as the originator of the approach include those by Anderson and Corbett (2009),[5] Reppen (2010),[6] Bennett (2010),[7] Flowerdew (2012),[8] Boulton and Tyne (2014),[9] and Friginal (2018).[10]
Johns worked at the English for Overseas Students Unit of Birmingham University from 1971 until the end of his career. During this period, John Sinclair led a large team of linguists at Birmingham University working on the COBUILD project, which delivered the first major corpus-based dictionaries and grammars of English for foreign students. COBUILD, however, never tasked students with exploring language data themselves.
Johns referred to his specific DDL approach as kibbitzing: when he returned his students' written work, together they would explore the errors using corpus data. A selection of these Kibbitzer tutorials is accessible on Mike Scott's website.
Despite widespread awareness of corpora among leading figures in foreign language teaching, DDL is not widely embraced by practitioners. One of the main reasons for this is an incompatibility of views on language and language learning: traditional language teachers and textbooks take a prescriptive view of language, treating it as a system of rules to be memorised and engaging only lower-order thinking skills. A descriptive view of language permits the observation of the patterns and outliers that exist in language itself. DDL positions students to use higher-order thinking skills and to learn how to learn by making, and learning from, their own observations. Such guided discovery leads to fuzzy results, which are incompatible with prescriptive linguistics and teaching.
There is a considerable body of research into DDL, as evidenced by professional bodies, books, journal articles and conference presentations. TaLC (Teaching and Language Corpora) is a biennial conference that provides a platform for corpus-based research with a pedagogical focus. CorpusCALL is a special interest group within EuroCALL and is mostly active through its Facebook group. The online teaching journal Humanising Language Teaching hosts a section called Corpus Ideas.
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world" spoken or written text that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious and automatic, but it can come to conscious attention when ambiguity impairs the clarity of communication, given the pervasive polysemy of natural language. In computational linguistics, it is an open problem that affects other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution, coherence, and inference.
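As a rough illustration of one classic WSD heuristic (gloss overlap in the style of the Lesk algorithm), the sketch below chooses between two hand-written senses of "bank". The glosses and scoring are invented for illustration and do not represent a production WSD system.

```python
# Simplified Lesk-style disambiguation: pick the sense whose gloss
# shares the most words with the context. Glosses are hand-written
# here purely for illustration.
SENSES = {
    "bank#finance": "an institution where customers deposit and borrow money",
    "bank#river": "the sloping land alongside a river or stream",
}

def lesk_like(context, senses=SENSES):
    ctx = set(context.lower().split())
    def overlap(gloss):
        return len(ctx & set(gloss.split()))
    return max(senses, key=lambda s: overlap(senses[s]))

print(lesk_like("she sat on the bank of the river watching the stream"))
# -> bank#river
print(lesk_like("he went to the bank to deposit his money"))
# -> bank#finance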
Computer-assisted language learning (CALL), the British term, or computer-aided instruction (CAI)/computer-aided language instruction (CALI), the American terms, is briefly defined in a seminal work by Levy as "the search for and study of applications of the computer in language teaching and learning". CALL embraces a wide range of information and communications technology applications and approaches to teaching and learning foreign languages, from the "traditional" drill-and-practice programs that characterised CALL in the 1960s and 1970s to more recent manifestations, such as virtual learning environments and Web-based distance learning. It also extends to the use of corpora and concordancers, interactive whiteboards, computer-mediated communication (CMC), language learning in virtual worlds, and mobile-assisted language learning (MALL).
The Bank of English (BoE) is a representative subset of the 4.5-billion-word COBUILD corpus, a collection of English texts. The texts are mainly British in origin, but content from North America, Australia, New Zealand, South Africa and other Commonwealth countries is also included.
John McHardy Sinclair was a Professor of Modern English Language at Birmingham University from 1965 to 2000. He pioneered work in corpus linguistics, discourse analysis, lexicography, and language teaching.
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
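To show the kind of annotation a treebank stores, the sketch below reads one Penn-Treebank-style bracketed parse with NLTK's Tree class; the sentence and its analysis are invented for illustration.

```python
# One Penn-Treebank-style bracketed parse, read with NLTK's Tree class.
# The sentence and its syntactic analysis are invented for illustration.
from nltk import Tree

bracketed = "(S (NP (DT the) (NN student)) (VP (VBD searched) (NP (DT the) (NN corpus))))"
tree = Tree.fromstring(bracketed)

tree.pretty_print()   # draw the tree as ASCII art
print(tree.leaves())  # ['the', 'student', 'searched', 'the', 'corpus']
print(tree.pos())     # [('the', 'DT'), ('student', 'NN'), ('searched', 'VBD'), ...]
```

A treebank is essentially a large collection of such analysed sentences, which is what makes large-scale empirical work on syntax possible.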
A foreign language writing aid is a computer program or other instrument that assists a non-native language user in writing appropriately in their target language. Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks. Assisted aspects of writing include lexical, syntactic, lexical-semantic and idiomatic expression transfer. Different types of foreign language writing aids include automated proofreading applications, text corpora, dictionaries, translation aids and orthography aids.
The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is widely used in corpus linguistics for the analysis of English.
Language and Computers: Studies in Practical Linguistics is a book series on corpus linguistics and related areas. As studies in linguistics, volumes in the series have, by definition, their foundations in linguistic theory; however, they are not concerned with theory for theory's sake, but always with a definite direct or indirect interest in the possibilities of practical application in the dynamic area where language and computers meet.
A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used for research into phonetics, conversation analysis, dialectology and other fields.
Michael Hoey was a British linguist and Baines Professor of English Language. He lectured in applied linguistics in over 40 countries.
Mark E. Davies is an American linguist. He specializes in corpus linguistics and language variation and change. He is the creator of most of the text corpora from English-Corpora.org as well as the Corpus del español and the Corpus do português. He has also created large datasets of word frequency, collocates, and n-grams, which have been used by many large companies in the fields of technology and language learning.
Pattern Grammar is a model for describing the syntactic environments of individual lexical items, derived from studying their occurrences in authentic linguistic corpora. It was developed by Hunston and Francis as part of the COBUILD project. It is a highly informal account that suggests a linear view of grammar.
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase "tatoeba" (例えば), meaning "for example". It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as Tatoebans. It is run by Association Tatoeba, a French non-profit organization funded through donations.
English Profile is an interdisciplinary research programme designed to enhance the learning, teaching and assessment of English worldwide. The aim of the programme is to provide a clear benchmark for progress in English by clearly describing the language that learners need at each level of the Common European Framework of Reference for Languages (CEFR). By making the CEFR more accessible, English Profile will provide support for the development of curricula and teaching materials, and in assessing students' language proficiency.
Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.
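Word sketches are built on top of collocation statistics. As a rough illustration of the underlying idea (not Sketch Engine's own algorithm), the sketch below scores collocates of a node word by pointwise mutual information over a toy tokenised corpus.

```python
# Toy collocate scoring by pointwise mutual information (PMI).
# This illustrates the general idea behind collocation statistics;
# it is not Sketch Engine's word-sketch algorithm.
import math
from collections import Counter

def pmi_collocates(tokens, node, window=2, min_count=2):
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pair[tokens[j]] += 1
    n = len(tokens)
    scores = {}
    for w, c in pair.items():
        if c >= min_count:
            scores[w] = math.log2((c / n) / ((freq[node] / n) * (freq[w] / n)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("strong tea and strong coffee but powerful computer and "
          "strong tea again strong coffee").split()
print(pmi_collocates(tokens, "strong"))
```

Real word sketches additionally group collocates by grammatical relation (object, modifier, and so on), which is what makes the one-page summaries readable.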
Susan Elizabeth Hunston is a British linguist. She received her PhD in English under the supervision of Michael Hoey at the University of Birmingham in 1989. She does research in the areas of corpus linguistics and applied linguistics. She is one of the primary developers of the Pattern Grammar model of linguistic analysis, which is a way of describing the syntactic environments of individual words, based on studying their occurrences in large sets of authentic examples, i.e. language corpora. The Pattern Grammar model was developed as part of the COBUILD project, where Hunston worked for several years as a senior grammarian for the Collins Cobuild English Dictionary.
Svenja Adolphs is a British linguist whose research involves the analysis of corpus data, including multimodal material such as the Nottingham Multimodal Corpus (NMMC), to examine communication in new forms of digital records. Her work uses visual mark-up systems to allow a better understanding of the nature of natural language use. She is a co-founder of the Health Language Research Group at the University of Nottingham, bringing together academics and clinicians to advance the work of applied linguistics in health care settings.
Timothy Francis Johns was a British academic, strongly associated with the origins and development of data-driven learning (DDL), an approach to learning foreign languages which has learners use the output of computer concordancers, either interactively on screen or via paper printouts, to discover grammar rules and facts about word associations and meanings.
CorCenCC or the National Corpus of Contemporary Welsh is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication, and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.