Corpus-assisted discourse studies


Corpus-assisted discourse studies (abbr.: CADS) is related historically and methodologically to the discipline of corpus linguistics. The principal endeavour of corpus-assisted discourse studies is the investigation and comparison of features of particular discourse types, integrating into the analysis the techniques and tools developed within corpus linguistics. These include the compilation of specialised corpora and analyses of word and word-cluster frequency lists, comparative keyword lists and, above all, concordances.
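As a rough illustration of what word and word-cluster frequency lists involve, the following minimal Python sketch counts single words and contiguous three-word clusters in a short invented text. The function names (tokenize, frequency_list, cluster_list), the simple regular-expression tokeniser and the sample sentence are assumptions made for this example only; dedicated corpus software handles tokenisation and counting with far more care.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple scheme)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokens)


def cluster_list(tokens, n=2):
    """Return a Counter of n-word clusters, i.e. contiguous sequences of n tokens."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented sample text, used only to exercise the functions.
sample = "The minister said the plan would work. The minister said nothing else."
tokens = tokenize(sample)
words = frequency_list(tokens)
clusters = cluster_list(tokens, n=3)

print(words.most_common(3))     # most frequent single words
print(clusters.most_common(2))  # most frequent three-word clusters
```

In practice the same counts would be produced over a whole corpus rather than a single sentence, and the resulting lists serve as a starting point for closer reading rather than as findings in themselves.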


A broader conceptualisation of corpus-assisted discourse studies would include any study that aims to bring together corpus linguistics and discourse analysis. Such research is often labelled as corpus-based or corpus-assisted discourse analysis, with the term CADS coined by a research group in Italy (Partington 2004) for a specific type of corpus-assisted discourse analysis (see the section 'in different countries' below).


Corpus-assisted discourse studies aim to uncover non-obvious meaning, that is, meaning which might not be readily available to naked-eye perusal. Much of what carries meaning in texts is not open to direct observation: “you cannot understand the world just by looking at it” (Stubbs [after Gellner 1959] 1996: 92). We use language “semi-automatically”, in the sense that speakers and writers make semi-conscious choices within the various complex overlapping systems of which language is composed, including those of transitivity, modality (Halliday 1994), lexical sets (e.g. freedom, liberty, deliverance), modification, and so on. Authors themselves are, famously, generally unaware of all the meanings their texts convey. CADS combines two approaches. The first is the quantitative research approach, that is, statistical analysis of large amounts of the discourse in question (more precisely, large numbers of tokens of the discourse type under study contained in a corpus). The second is the more qualitative research approach typical of discourse analysis, that is, the close, detailed examination of particular stretches of discourse. By combining the two, it may be possible to better understand the processes at play in the discourse type and to gain access to non-obvious meanings.

Aims can differ in other types of corpus-based or corpus-assisted discourse analysis; but in general such studies combine quantitative and qualitative research and aim to shed light on discourses, registers, discourse patterns, etc., with the help of a corpus linguistic approach. Specific aims and techniques depend on the relevant project.

In different countries

Comparison with traditional corpus linguistics

Traditional corpus linguistics has, quite naturally, tended to privilege the quantitative approach. In the drive to produce more authentic dictionaries and grammars of a language, it has been characterised by the compilation of some very large corpora of heterogeneric discourse types in the desire to obtain an overview of the greatest quantity and variety of discourse types possible, in other words, of the chimerical but useful fiction called the “general language” (“general English”, “general Italian”, and so on). This has led to the construction of immensely valuable research tools such as the Bank of English and the British National Corpus. Some branches of corpus linguistics have also promoted an approach that is "corpus-driven", in which we need, grammatically speaking, a mental tabula rasa to free ourselves of the baleful prejudice exerted by traditional models and allow the data to speak entirely for itself.

The aim of corpus-assisted discourse studies and related approaches is radically different. Here the aim of the exercise is to acquaint oneself as much as possible with the discourse type(s) in hand. Researchers typically engage with their corpus in a variety of ways. As well as via wordlists and concordancing, intuitions for further research can also arise from reading or watching or listening to parts of the data-set, a process which can help provide a feel for how things are done linguistically in the discourse-type being studied.
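Concordancing, mentioned above, amounts to listing every occurrence of a search word together with a window of surrounding co-text. The sketch below is one minimal way this might be done in Python; the function names, the tokeniser, the window width and the sample text are all assumptions introduced for illustration and do not reproduce the behaviour of any particular concordancer.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple scheme)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def concordance(tokens, node, width=4):
    """Return (left co-text, node word, right co-text) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines, pad=25):
    """Align the node word in a central column, key-word-in-context style."""
    return "\n".join(f"{left:>{pad}}  {node}  {right}" for left, node, right in lines)


# Invented sample text, used only to exercise the functions.
sample = "They promised freedom for all. Critics said freedom of speech was at risk."
hits = concordance(tokenize(sample), "freedom", width=2)
print(format_kwic(hits))
```

Changing the width parameter corresponds to viewing the same item with differing quantities of co-text, which is often what prompts a return to the full text for closer reading.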

Corpus-assisted discourse analysis is also typically characterised by the compilation of ad hoc specialised corpora, since very frequently there exists no previously available collection of the discourse type in question. Often, other corpora are used in the course of a study for purposes of comparison. These may include pre-existing corpora or may themselves need to be compiled by the researcher. In some sense, all work with corpora, just as all work with discourse, is properly comparative. Even when a single corpus is employed, it is used to test the data it contains against another body of data. This may consist of the researcher's intuitions, or the data found in reference works such as dictionaries and grammars, or it may be statements made by previous authors in the field.

CADS as a specific type of corpus-based discourse analysis

Researchers in Italy have developed CADS as a specific type of corpus-based discourse analysis, creating a standard set of methods:

'A basic, standard methodology in CADS may resemble the following:'

  1. Decide upon the research question;
  2. Choose, compile or edit an appropriate corpus;
  3. Choose, compile or edit an appropriate reference corpus / corpora;
  4. Make frequency lists and run a keywords comparison of the corpora;
  5. Determine the existence of sets of key items;
  6. Concordance interesting key items (with differing quantities of co-text);
  7. (Possibly) refine the research question and return to step 2.

This basic procedure can of course vary according to individual research circumstances and requirements.
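The keywords comparison in the procedure above can be sketched as follows. The methodology quoted here does not specify a statistic, so this example assumes the log-likelihood measure that is widely used for keyness in corpus linguistics; the function names and the toy token lists are likewise assumptions for illustration, and both corpora are assumed to be non-empty.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for a word occurring `a` times in a study corpus
    of `c` tokens and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_tokens, reference_tokens, top=10):
    """Rank words that are relatively more frequent in the study corpus."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:  # keep only positively key items
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


# Toy corpora, invented for the example.
study = ["tax"] * 5 + ["the"] * 5
reference = ["the"] * 9 + ["tax"] * 1
for word, score in keywords(study, reference):
    print(f"{word}\t{score:.2f}")
```

The ranked list only suggests candidates. Deciding whether the key items form meaningful sets, and concordancing them, remain interpretive steps carried out by the researcher.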

A particular way of conceptualising research questions has also been proposed in such CADS projects:

  1. How does P achieve G with language?
  2. What does this tell us about P?
  3. Comparative studies: how do P1 and P2 differ in their use of language? Does this tell us anything about their different principles and objectives?

A second general type of CADS research question, which might be asked of interactive discourse data, has been conceptualised as follows:

Another common type of research question has been conceptualised thus:

This is a classic “hypothesis-testing” research question: we test the hypothesis that whatever practice has been observed by a previous author in some discourse type will be observable in another. It is a process we might call para-replication, that is, the replication of an experiment with either a fresh set of texts of the same discourse type or of a related discourse type, “in order to see whether [findings] were an artefact of one single data set” (Stubbs 2001: 124).

A final example of conceptualising a CADS research question is the following:

Such research aims to ascertain whether different participants use a particular linguistic feature in the same or different ways. The research may proceed to attempt to explain why this is the case.

Some research to date

Studies that bring together corpus linguistics and discourse analysis include the following:

A comprehensive bibliography of discourse-oriented corpus studies is compiled at Edge Hill University. Currently, it contains 1120 entries.



