Latent variable model

A latent variable model is a statistical model that relates a set of observable variables (also called manifest variables or indicators) [1] to a set of latent variables. Latent variable models are applied in a wide range of fields, including biology, computer science, and social science [2]. Common applications include psychometrics (e.g., a factor analysis model that summarizes responses to a set of survey questions by positing a smaller number of psychological attributes, such as the trait extraversion, that are presumed to cause the responses) [3] and natural language processing (e.g., a topic model that summarizes a corpus of texts with a number of "topics") [4].

A latent variable model assumes that the responses on the indicators (manifest variables) result from an individual's position on the latent variable(s), and that the manifest variables have nothing in common once the latent variable is controlled for (local independence).
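Local independence can be illustrated with a small simulation. The sketch below assumes a one-factor model with two continuous indicators; the loadings, noise level, sample size, and function names are hypothetical choices made for illustration, not part of any particular method described here. The two indicators are correlated overall, but the correlation largely disappears once the latent variable's contribution is removed.

```python
import random
import statistics


def simulate_one_factor(n, loadings, noise_sd=1.0, seed=0):
    """Draw n observations from a one-factor model: x_j = loading_j * z + e_j."""
    rng = random.Random(seed)
    latent, observed = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # latent variable (known here only because we simulate it)
        latent.append(z)
        observed.append([lam * z + rng.gauss(0.0, noise_sd) for lam in loadings])
    return latent, observed


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


loadings = [0.9, 0.8]  # hypothetical loadings
latent, observed = simulate_one_factor(5000, loadings)
x1 = [row[0] for row in observed]
x2 = [row[1] for row in observed]

# Marginally, the indicators are correlated because they share the latent cause.
marginal_corr = correlation(x1, x2)

# After subtracting the latent variable's contribution, only independent noise remains.
r1 = [x - loadings[0] * z for x, z in zip(x1, latent)]
r2 = [x - loadings[1] * z for x, z in zip(x2, latent)]
residual_corr = correlation(r1, r2)

print(f"marginal correlation: {marginal_corr:.3f}")
print(f"residual correlation: {residual_corr:.3f}")
```

In real data the latent variable is not observed, so this check is only possible in simulation; fitted models assess local independence indirectly.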

Different types of latent variable model can be grouped according to whether the manifest and latent variables are categorical or continuous: [5]

Latent variables | Manifest variables: Continuous | Manifest variables: Categorical
Continuous | Factor analysis | Item response theory
Categorical | Latent profile analysis | Latent class analysis

The Rasch model represents the simplest form of item response theory. Mixture models are central to latent profile analysis.
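As an illustration of the Rasch model mentioned above, its usual dichotomous form gives the probability of a correct (or endorsing) response as a logistic function of the difference between a person's ability and an item's difficulty. The following is a minimal sketch; the function and parameter names are illustrative assumptions, not a reference implementation.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a positive response under the dichotomous Rasch model.

    P(X = 1) = exp(ability - difficulty) / (1 + exp(ability - difficulty))
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A person whose ability equals the item difficulty has an even chance of success.
p_equal = rasch_probability(0.0, 0.0)

# Higher ability relative to difficulty raises the probability.
p_higher = rasch_probability(1.0, 0.0)

print(p_equal, p_higher)
```

Because only the difference between ability and difficulty enters the formula, each item is described by a single parameter.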

In factor analysis and latent trait analysis [note 1], the latent variables are treated as continuous, normally distributed variables; in latent profile analysis and latent class analysis, they are treated as following a multinomial distribution. [7] The manifest variables in factor analysis and latent profile analysis are continuous, and in most cases their conditional distribution given the latent variables is assumed to be normal. In latent trait analysis and latent class analysis, the manifest variables are discrete: they may be dichotomous, ordinal, or nominal. Their conditional distributions are assumed to be binomial or multinomial.
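To make the categorical case concrete, the sketch below computes the marginal probability of a dichotomous response pattern under a latent class model: each class has a prior weight, and within a class the items are treated as independent Bernoulli variables (local independence). The class weights and item probabilities are made-up numbers, and the function name is hypothetical.

```python
def latent_class_pattern_probability(pattern, class_weights, item_probs):
    """Marginal probability of a binary response pattern under a latent class model.

    pattern:       list of 0/1 responses, one per item
    class_weights: prior probability of each latent class (should sum to 1)
    item_probs:    item_probs[c][j] = P(item j = 1 | class c)
    """
    total = 0.0
    for weight, probs in zip(class_weights, item_probs):
        conditional = 1.0
        # Within a class, items are independent, so probabilities multiply.
        for x, p in zip(pattern, probs):
            conditional *= p if x == 1 else (1.0 - p)
        # Mix over classes, weighting by the class's prior probability.
        total += weight * conditional
    return total


class_weights = [0.6, 0.4]              # hypothetical class sizes
item_probs = [[0.9, 0.8], [0.2, 0.1]]   # hypothetical P(item = 1 | class)

p11 = latent_class_pattern_probability([1, 1], class_weights, item_probs)
print(round(p11, 4))
```

With these numbers the pattern (1, 1) has probability 0.6 × 0.9 × 0.8 + 0.4 × 0.2 × 0.1 = 0.44, and the probabilities of all four possible patterns sum to 1.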

Notes

  1. The terms "latent trait analysis" and "item response theory" are often used interchangeably. [6]

References

  1. "Latent Variable Models". Statistics.com: Data Science, Analytics & Statistics Courses. Archived from the original on 2022-11-01. Retrieved 2022-11-01.
  2. Blei, David M. (2014-01-03). "Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models". Annual Review of Statistics and Its Application. 1 (1): 203–232. Bibcode:2014AnRSA...1..203B. doi:10.1146/annurev-statistics-022513-115657. ISSN 2326-8298.
  3. Borsboom, Denny; Mellenbergh, Gideon J.; van Heerden, Jaap (April 2003). "The theoretical status of latent variables". Psychological Review. 110 (2): 203–219. doi:10.1037/0033-295X.110.2.203. ISSN 1939-1471. PMID 12747522.
  4. Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (2003). "Latent Dirichlet allocation". Journal of Machine Learning Research. 3 (3/1/2003): 993–1022. ISSN 1532-4435.
  5. Bartholomew, David J.; Steel, Fiona; Moustaki, Irini; Galbraith, Jane I. (2002). The Analysis and Interpretation of Multivariate Data for Social Scientists. Chapman & Hall/CRC. p. 145. ISBN 1-58488-295-6.
  6. Uebersax, John. "Latent Trait Analysis and Item Response Theory (IRT) Models". John-Uebersax.com. Archived from the original on 2022-11-01. Retrieved 2022-11-01.
  7. Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman & Hall. ISBN 0-412-25310-0.
