Variable rules analysis is a set of statistical analysis methods in linguistics that are commonly used in sociolinguistics and historical linguistics to describe patterns of variation between alternative forms in language use. It is also sometimes known as Varbrul analysis, after the name of a software package dedicated to carrying out the relevant statistical computations (Varbrul, from "variable rule"). The method goes back to a theoretical approach developed by the sociolinguist William Labov in the late 1960s and early 1970s, and its mathematical implementation was developed by Henrietta Cedergren and David Sankoff in 1974. [1]
A variable rules analysis is designed to provide a quantitative model of a situation in which speakers alternate between different forms that have the same meaning and stand in free variation, but where the probability of choosing one form over the other is conditioned by a variety of contextual factors or social characteristics. Such a situation, where variation is not entirely random but rule-governed, is also known as "structured variation" or "orderly heterogeneity". On the basis of observed token counts, a variable rules analysis computes a multivariate statistical model in which each determining factor is assigned a numerical factor weight describing how it influences the probability of choosing either form. This is done by means of stepwise logistic regression, using a maximum likelihood algorithm.
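As a rough illustration of the fitting step, the following Python sketch estimates factor weights by maximum likelihood for two binary factor groups. The token counts, the group names and the function names are hypothetical. Sum coding (+1/-1) is assumed, stepwise factor selection is left out, and plain gradient ascent stands in for whatever optimiser a given Varbrul implementation actually uses.

```python
import math

def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical token counts, one row per cell:
# (style, following_segment, applications, total), each group coded +1 / -1.
CELLS = [
    (+1, +1, 80, 100),
    (+1, -1, 50, 100),
    (-1, +1, 50, 100),
    (-1, -1, 20, 100),
]

def fit(cells, steps=5000, rate=0.5):
    """Maximise the binomial log-likelihood by gradient ascent.

    Returns [intercept, style coefficient, segment coefficient] in log-odds.
    """
    b = [0.0, 0.0, 0.0]
    n = sum(total for _, _, _, total in cells)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x1, x2, k, total in cells:
            p = sigmoid(b[0] + b[1] * x1 + b[2] * x2)
            resid = k - total * p          # observed minus expected applications
            grad[0] += resid
            grad[1] += resid * x1
            grad[2] += resid * x2
        b = [bi + rate * g / n for bi, g in zip(b, grad)]
    return b

def factor_weights(b):
    """Convert log-odds coefficients to an input probability and factor weights."""
    input_prob = sigmoid(b[0])
    weights = {
        "style": (sigmoid(b[1]), sigmoid(-b[1])),
        "segment": (sigmoid(b[2]), sigmoid(-b[2])),
    }
    return input_prob, weights

input_prob, weights = factor_weights(fit(CELLS))
print(round(input_prob, 3), weights)
```

With these made-up counts the fit gives an input probability of about 0.5 and factor weights of about 0.67 and 0.33 within each group. A weight above 0.5 indicates a factor that favours the form counted as an application, and a weight below 0.5 one that disfavours it.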
Although the computations required for a variable rules analysis can be carried out with mainstream general-purpose statistics packages such as SPSS, they are more often done with Varbrul, specialised software dedicated to the needs of linguists. It was originally written by David Sankoff and currently exists in freeware implementations for Mac OS and Microsoft Windows, under the title of Goldvarb X. [2] There are also versions implemented in the statistical language R and therefore available on most platforms. These include R-Varb and Rbrul. [3]
Variable rules approaches are commonly employed for the analysis of data in sociolinguistic research, especially in studies that aim to investigate how reflexes of linguistic change through time appear in the shape of structured variation patterns within a speech community. [4]
Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world" text, spoken or written, that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
Sociolinguistics is the descriptive study of the effect of any or all aspects of society, including cultural norms, expectations, and context, on language and the ways it is used. It can overlap with the sociology of language, which focuses on the effect of language on society. Sociolinguistics overlaps considerably with pragmatics and is closely related to linguistic anthropology.
In statistics, Markov chain Monte Carlo (MCMC) is a class of algorithms used to draw samples from a probability distribution. Given a probability distribution, one can construct a Markov chain whose elements' distribution approximates it – that is, the Markov chain's equilibrium distribution matches the target distribution. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
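One simple member of this class is the random-walk Metropolis algorithm. The sketch below is a minimal illustration, not a production sampler: the target (a standard normal distribution), the step size, the burn-in length and all function names are assumptions chosen for the example.

```python
import math
import random

def log_std_normal(x):
    """Log-density of the standard normal, up to an additive constant."""
    return -0.5 * x * x

def metropolis(log_density, start, steps, step_size, rng):
    """Random-walk Metropolis: propose a nearby point, accept or reject it."""
    x = start
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_ratio = log_density(proposal) - log_density(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)   # a rejected proposal repeats the current state
    return samples

rng = random.Random(0)                      # fixed seed for reproducibility
chain = metropolis(log_std_normal, 0.0, 20000, 1.0, rng)
samples = chain[1000:]                      # drop an (assumed) burn-in period

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)
```

The sample mean and variance should come out close to the target's 0 and 1, and the approximation generally improves as more steps are run.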
William Labov is an American linguist widely regarded as the founder of the discipline of variationist sociolinguistics. He has been described as "an enormously original and influential figure who has created much of the methodology" of sociolinguistics.
Anthropological linguistics is the subfield of linguistics and anthropology which deals with the place of language in its wider social and cultural context, and its role in making and maintaining cultural practices and societal structures. While many linguists believe that a true field of anthropological linguistics is nonexistent, preferring the term linguistic anthropology to cover this subfield, many others regard the two as interchangeable.
In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
In sociolinguistics, a register is a variety of language used for a particular purpose or particular communicative situation. For example, when speaking officially or in a public setting, an English speaker may be more likely to follow prescriptive norms for formal usage than in a casual setting, for example, by pronouncing words ending in -ing with a velar nasal instead of an alveolar nasal, choosing words that are considered more formal, such as father vs. dad or child vs. kid, and refraining from using words considered nonstandard, such as ain't and y'all.
When classification is performed by a computer, statistical methods are normally used to develop the algorithm.
An interlanguage is an idiolect developed by a learner of a second language (L2) that preserves some features of the learner's first language (L1) and can overgeneralize some L2 writing and speaking rules. These two characteristics give an interlanguage its unique linguistic organization. It is idiosyncratically based on the learner's experiences with L2. An interlanguage can fossilize, or cease developing, in any of its developmental stages. It is claimed that several factors shape interlanguage rules, including L1 transfer, previous learning strategies, strategies of L2 acquisition, L2 communication strategies, and the overgeneralization of L2 language patterns.
In statistics, kernel density estimation (KDE) is the application of kernel smoothing for probability density estimation, i.e., a non-parametric method to estimate the probability density function of a random variable based on kernels as weights. KDE answers a fundamental data smoothing problem where inferences about the population are made based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form. One of the famous applications of kernel density estimation is in estimating the class-conditional marginal densities of data when using a naive Bayes classifier, which can improve its prediction accuracy.
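The estimator itself is short to write down. The following sketch assumes a Gaussian kernel and a hand-picked bandwidth; the sample values and the function name `gaussian_kde` are hypothetical and are not taken from any particular library.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density from `data`.

    Each observation contributes a Gaussian bump of width `bandwidth`;
    the estimate at x is the average of those bumps.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

SAMPLE = [-1.0, 0.0, 2.0]                 # made-up observations
density = gaussian_kde(SAMPLE, bandwidth=0.5)

# Crude Riemann sum over [-10, 10] to check that the estimate integrates to ~1.
step = 0.01
area = sum(density(-10.0 + i * step) for i in range(2001)) * step
print(density(0.0), area)
```

The bandwidth controls the amount of smoothing: small values give a spiky estimate that follows individual observations, large values give a flatter one.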
In robust statistics, robust regression seeks to overcome some limitations of traditional regression analysis. A regression analysis models the relationship between one or more independent variables and a dependent variable. Standard types of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results otherwise. Robust regression methods are designed to limit the effect that violations of assumptions by the underlying data-generating process have on regression estimates.
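To make the contrast concrete, the sketch below compares ordinary least squares with the Theil-Sen estimator, chosen here only as one simple example of a robust method. The data are made up: ten points lying exactly on y = 2x + 1, with the last one replaced by a gross outlier. The function names are hypothetical.

```python
import statistics
from itertools import combinations

def ols(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen(xs, ys):
    """Theil-Sen fit: median of pairwise slopes, then median residual offset."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in points)
    return slope, intercept

xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[-1] = 100.0                      # one gross outlier (true value would be 19)

ols_slope, ols_intercept = ols(xs, ys)
ts_slope, ts_intercept = theil_sen(xs, ys)
print(ols_slope, ols_intercept)
print(ts_slope, ts_intercept)
```

On this data the least-squares slope is pulled to roughly 6.4 by the single outlier, while the Theil-Sen fit recovers the slope of 2 and the intercept of 1.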
Variation is a characteristic of language: there is more than one way of saying the same thing in a given language. Variation can exist in domains such as pronunciation, lexicon, grammar, and other features. Different communities or individuals speaking the same language may differ from each other in their choices of which of the available linguistic features to use, and how often, and the same speaker may make different choices on different occasions.
The apparent-time hypothesis is a methodological construct in sociolinguistics whereby language change is studied by comparing the speech of individuals of different ages. If language change is taking place, the apparent-time hypothesis assumes that older generations will represent an earlier form of the language and that younger generations will represent a later form. Thus, by comparing younger and older speakers, the direction of language change can be detected.
Linguistics is the scientific study of language. Linguistics is based on a theoretical as well as a descriptive study of language and is also interlinked with the applied fields of language studies and language learning, which entails the study of specific languages. Before the 20th century, linguistics evolved in conjunction with literary study and did not employ scientific methods. Modern-day linguistics is considered a science because it entails a comprehensive, systematic, objective, and precise analysis of all aspects of language – i.e., the cognitive, the social, the cultural, the psychological, the environmental, the biological, the literary, the grammatical, the paleographical, and the structural.
New Ways of Analyzing Variation (NWAV) is an annual academic conference in sociolinguistics. NWAV attracts researchers and students conducting linguistic scientific investigations into patterns of language variation, the study of language change in progress, and the interrelationship between language and society, including how language variation is shaped by and continually shapes societal institutions, social and interpersonal relationships, and individual and group identities.
Gillian Elizabeth Sankoff is a Canadian-American sociolinguist, and professor emerita of linguistics at the University of Pennsylvania. Sankoff's notable former students include Miriam Meyerhoff.
David Sankoff is a Canadian mathematician, bioinformatician, computer scientist and linguist. He holds the Canada Research Chair in Mathematical Genomics in the Mathematics and Statistics Department at the University of Ottawa, and is cross-appointed to the Biology Department and the School of Information Technology and Engineering. He was founding editor of the scientific journal Language Variation and Change (Cambridge) and serves on the editorial boards of a number of bioinformatics, computational biology and linguistics journals. Sankoff is best known for his pioneering contributions in computational linguistics and computational genomics. He is considered to be one of the founders of bioinformatics. In particular, he had a key role in introducing dynamic programming for sequence alignment and other problems in computational biology. In Pavel Pevzner's words, "Michael Waterman and David Sankoff are responsible for transforming bioinformatics from a 'stamp collection' of ill-defined problems into a rigorous discipline with important biological applications."