Variable rules analysis

Variable rules analysis is a set of statistical analysis methods in linguistics that are commonly used in sociolinguistics and historical linguistics to describe patterns of variation between alternative forms in language use. It is also sometimes known as Varbrul analysis, after the name of a software package dedicated to carrying out the relevant statistical computations (Varbrul, from "variable rule"). The method goes back to a theoretical approach developed by the sociolinguist William Labov in the late 1960s and early 1970s, and its mathematical implementation was developed by Henrietta Cedergren and David Sankoff in 1974. [1]

A variable rules analysis provides a quantitative model of a situation in which speakers alternate between different forms that have the same meaning and stand in free variation, but in which the probability of choosing one form or the other is conditioned by a variety of contextual factors or social characteristics. Such a situation, where variation is not entirely random but rule-governed, is also known as "structured variation" or "orderly heterogeneity". From observed token counts, the analysis computes a multivariate statistical model in which each determining factor is assigned a numerical factor weight that describes how it influences the probability of each form being chosen. This is done by means of stepwise logistic regression, using a maximum likelihood algorithm.
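The core of such a model can be illustrated with a minimal sketch in Python. It is not the Varbrul implementation: the token counts are invented for illustration, the two factor groups (following context and style) are hypothetical, the fit uses plain gradient ascent on the log-likelihood, and the stepwise selection of factor groups is omitted. The sketch also assumes the common convention of reporting each factor weight as the inverse logit of a sum-coded log-odds coefficient, so that 0.5 is neutral.

```python
import math

# Hypothetical token counts, invented for illustration only:
# (following context, style) -> (tokens showing the variant, total tokens)
TOKENS = {
    ("consonant", "casual"): (80, 100),
    ("consonant", "formal"): (60, 100),
    ("vowel", "casual"): (40, 100),
    ("vowel", "formal"): (20, 100),
}


def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def code_cell(cell):
    """Sum-code a cell: intercept, then +1/-1 for each two-level factor group."""
    context, style = cell
    return [
        1.0,
        1.0 if context == "consonant" else -1.0,
        1.0 if style == "casual" else -1.0,
    ]


def fit(tokens, steps=5000, rate=0.5):
    """Estimate coefficients by gradient ascent on the binomial log-likelihood."""
    beta = [0.0, 0.0, 0.0]
    n = sum(total for _, total in tokens.values())
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for cell, (hits, total) in tokens.items():
            x = code_cell(cell)
            p = inv_logit(sum(b * xi for b, xi in zip(beta, x)))
            for j in range(3):
                # Score: observed minus expected applications, per predictor
                grad[j] += (hits - total * p) * x[j]
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta


def factor_weights(beta):
    """Convert sum-coded log-odds into an input probability and factor weights."""
    return {
        "input": inv_logit(beta[0]),
        "consonant": inv_logit(beta[1]),
        "vowel": inv_logit(-beta[1]),
        "casual": inv_logit(beta[2]),
        "formal": inv_logit(-beta[2]),
    }


if __name__ == "__main__":
    for name, weight in factor_weights(fit(TOKENS)).items():
        print(f"{name}: {weight:.3f}")
```

With these toy counts the sketch yields an input probability of about 0.50 and factor weights of about 0.71 (consonant), 0.29 (vowel), 0.62 (casual) and 0.38 (formal). Under the convention assumed here, a weight above 0.5 indicates that the factor favours the variant and a weight below 0.5 that it disfavours it.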

Although the computations required for a variable rules analysis can be carried out with mainstream general-purpose statistics packages such as SPSS, they are more often done with Varbrul, a specialised software package dedicated to the needs of linguists. It was originally written by David Sankoff and currently exists in freeware implementations for Mac OS and Microsoft Windows, under the title Goldvarb X. [2] There are also versions implemented in the statistical language R and therefore available on most platforms. These include R-Varb and Rbrul. [3]

Variable rules approaches are commonly employed for the analysis of data in sociolinguistic research, especially in studies that aim to investigate how reflexes of linguistic change through time appear in the shape of structured variation patterns within a speech community. [4]

References

  1. Cedergren, H.; Sankoff, D. (1974). "Variable rules: Performance as a statistical reflection of competence". Language. 50 (2): 333–355. doi:10.2307/412441. JSTOR 412441.
  2. "Sali A. Tagliamonte: Goldvarb". individual.utoronto.ca.
  3. "Rbrul". www.danielezrajohnson.com.
  4. Tagliamonte, S. (2006). Analysing Sociolinguistic Variation. Cambridge: Cambridge University Press. ISBN 0-521-77115-3.