A concordancer is a computer program that automatically constructs a concordance. The output of a concordancer may serve as input to a translation memory system for computer-assisted translation, or as an early step in machine translation.
Concordancers are also used in corpus linguistics to retrieve alphabetically or otherwise sorted lists of linguistic data from the corpus in question, which the corpus linguist then analyzes.
A number of concordancers have been published, [1] notably the Oxford Concordance Program (OCP), [2] first released in 1981 by Oxford University Computing Services and claimed to be in use at over 200 organisations worldwide. [3] [4]
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real-world" texts of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
In linguistics and natural language processing, a corpus or text corpus is a dataset consisting of natively digital and older, digitized language resources, either annotated or unannotated.
A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery made it possible to begin deciphering the Ancient Egyptian language.
Roberto Busa was an Italian Jesuit priest and one of the pioneers in the usage of computers for linguistic and literary analysis. He was the author of the Index Thomisticus, a complete lemmatization of the works of Saint Thomas Aquinas and of a few related authors.
Stylometry is the application of the study of linguistic style, usually to written language. It has also been applied successfully to music, paintings, and chess.
A concordance is an alphabetical list of the principal words used in a book or body of work, listing every instance of each word with its immediate context. Historically, concordances have been compiled only for works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, James Joyce or classical Latin and Greek authors, because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era.
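The definition above can be illustrated with a minimal Python sketch of a keyword-in-context concordance. It is only an illustration under stated assumptions, not the method of any concordancer named here: the function names `build_concordance` and `format_concordance`, the `width` parameter, the crude `\w+` tokenizer, and the use of a stop-word set to approximate "principal words" are all hypothetical choices.

```python
import re
from collections import defaultdict


def build_concordance(text, width=3, stopwords=frozenset()):
    """Return an alphabetically ordered dict mapping each word to its hits.

    Each hit is a (left context, keyword, right context) tuple, where the
    context is up to `width` tokens on either side of the keyword.
    Assumption: tokens are lowercased runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    entries = defaultdict(list)
    for i, token in enumerate(tokens):
        if token in stopwords:
            continue  # skip words not treated as "principal" words
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        entries[token].append((left, token, right))
    # Sort the headwords alphabetically, as in a printed concordance.
    return dict(sorted(entries.items()))


def format_concordance(concordance):
    """Render every hit on one line with the keyword in a central column."""
    lines = []
    for hits in concordance.values():
        for left, keyword, right in hits:
            lines.append(f"{left:>30}  {keyword.upper()}  {right}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog sat too."
    print(format_concordance(build_concordance(sample, width=2, stopwords={"the", "on"})))
```

Every occurrence of a word is kept, so the entry for "sat" in the sample lists both sentences with their surrounding tokens. A real concordancer would also handle lemmatization, punctuation, and non-Latin scripts, which this sketch ignores.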
Digital humanities (DH) is an area of scholarly activity at the intersection of computing or digital technologies and the disciplines of the humanities. It includes the systematic use of digital resources in the humanities, as well as the analysis of their application. DH can be defined as new ways of doing scholarship that involve collaborative, transdisciplinary, and computationally engaged research, teaching, and publishing. It brings digital tools and methods to the study of the humanities with the recognition that the printed word is no longer the main medium for knowledge production and distribution.
Beryl T. "Sue" Atkins was a British lexicographer, specialising in computational lexicography, who pioneered the creation of bilingual dictionaries from corpus data.
Geoffrey Sampson is Professor of Natural Language Computing in the Department of Informatics, University of Sussex. He produces annotation standards for compiling corpora (databases) of ordinary usage of the English language. His work has been applied in automatic language-understanding software, and in writing-skills training. He has also analysed Ronald Coase's "theory of the firm" and the economic and political implications of e-business.
WordSmith Tools is a software package primarily for linguists, in particular for work in the field of corpus linguistics. It is a collection of modules for searching patterns in a language. The software handles many languages.
The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The project aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran.
Steven J DeRose is a computer scientist noted for his contributions to computational linguistics and to key standards related to document processing, mostly around ISO's Standard Generalized Markup Language (SGML) and W3C's Extensible Markup Language (XML).
COCOA was an early text file utility and associated file format for digital humanities, then known as humanities computing. It consisted of approximately 4000 punched cards of FORTRAN and was created in the late 1960s and early 1970s at University College London and the Atlas Computer Laboratory in Harwell, Oxfordshire. Functionality included word-counting and concordance building.
OCP Art Studio or Art Studio was a popular bitmap graphics editor for home computers released in 1985, created by Oxford Computer Publishing and written by James Hutchby.
Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.
Susan Hockey is an English computer scientist. She is Emeritus Professor of Library and Information Studies at University College London. She has written about the history of digital humanities, the development of text analysis applications, electronic textual mark-up, teaching computing in the humanities, and the role of libraries in managing digital resources. In 2014, University College London created a Digital Humanities lecture series in her honour.
The Oxford Concordance Program (OCP) was first released in 1981 and was a result of a project started in 1978 by Oxford University Computing Services (OUCS) to create a machine independent text analysis program for producing word lists, indexes and concordances in a variety of languages and alphabets.
CLOC was a first-generation, general-purpose text analyzer program. It was produced at the University of Birmingham and could produce concordances as well as word lists and collocational analysis of text. First-generation concordancers were typically held on a mainframe computer and used at a single site; individual research teams would build their own concordancer and use it on the data they had access to locally. Any further analysis was done by separate programs.
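As a rough illustration of what a collocational analysis involves, the following Python sketch counts the words that occur within a fixed window around a chosen node word. It is a simplified assumption about the general technique, not a description of how CLOC worked; the function name `collocates`, the `window` parameter, and the `\w+` tokenizer are hypothetical.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count the tokens found within `window` positions of each `node` hit.

    Assumption: tokens are lowercased runs of word characters, and raw
    co-occurrence counts stand in for a proper association measure.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        # Neighbours on the left and right, excluding the node itself.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts


if __name__ == "__main__":
    sample = "strong tea and strong coffee but weak tea"
    for word, freq in collocates(sample, "tea", window=1).most_common():
        print(word, freq)
```

Raw counts favour frequent function words, so practical tools usually rank collocates with a statistical association score instead; that step is left out here.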
Lou Burnard is an internationally recognised expert in digital humanities, particularly in the area of text encoding and digital libraries. He was assistant director of Oxford University Computing Services (OUCS) from 2001 to September 2010, when he officially retired from OUCS. Before that, he was manager of the Humanities Computing Unit at OUCS for five years. He has worked in ICT support for research in the humanities since the 1990s. He was one of the founding editors of the Text Encoding Initiative (TEI) and continues to play an active part in its maintenance and development, as a consultant to the TEI Technical Council and as an elected TEI board member. He has played a key role in the establishment of many other activities and initiatives in this area, such as the UK Arts and Humanities Data Service and the British National Corpus, and has published and lectured widely. Since 2008 he has worked as a Member of the Conseil Scientifique for the CNRS-funded "Adonis" TGE.
Roy Albert Wisbey was a British medievalist, Professor of German at King's College, London, and one of the leading figures in British German studies. He was also a pioneer in the field of digital humanities, founding the Literary and Linguistic Computing Centre in Cambridge in 1964 and later promoting the establishment of the Centre for Computing in the Humanities at King's. Over a period of 40 years he led the transformation of the Modern Humanities Research Association (MHRA) into a major scholarly publisher. He was recognised by both the German and Austrian governments for his contribution to German Studies.