Confirmatory composite analysis

In statistics, confirmatory composite analysis (CCA) is a sub-type of structural equation modeling (SEM). [1] [2] [3] Although CCA historically emerged from a re-orientation and re-start of partial least squares path modeling (PLS-PM), [4] [5] [6] [7] it has become an independent approach, and the two should not be confused. In many ways it is similar to, but also quite distinct from, confirmatory factor analysis (CFA). It shares with CFA the steps of model specification, model identification, model estimation, and model assessment. However, in contrast to CFA, which always assumes the existence of latent variables, in CCA all variables can be observable, with their interrelationships expressed in terms of composites, i.e., linear compounds of subsets of the variables. The composites are treated as the fundamental objects, and path diagrams can be used to illustrate their relationships. This makes CCA particularly useful for disciplines examining theoretical concepts that are designed to attain certain goals, so-called artifacts, [8] and their interplay with theoretical concepts of the behavioral sciences. [9]

Development

The initial idea of CCA was sketched by Theo K. Dijkstra and Jörg Henseler in 2014. [4] Owing to the pace of the scholarly publishing process, the first full description of CCA was not published until 2018, by Florian Schuberth, Jörg Henseler and Theo K. Dijkstra. [2] As is common for statistical developments, interim results of CCA were shared with the scientific community in written form. [10] [9] Moreover, CCA was presented at several conferences, including the 5th Modern Modeling Methods Conference, the 2nd International Symposium on Partial Least Squares Path Modeling, the 5th CIM Community Workshop, and the Meeting of the SEM Working Group in 2018.

Statistical model

[Figure: Example of a model containing 3 composites]

A composite is typically a linear combination of observable random variables. [11] However, so-called second-order composites, i.e., linear combinations of latent variables or of other composites, are also conceivable. [9] [12] [3] [13]

For a random column vector of observable variables $x = (x_1', \ldots, x_I')'$ that is partitioned into $I$ sub-vectors $x_i$, composites can be defined as weighted linear combinations. So the $i$-th composite equals:

$c_i = w_i' x_i\,, \quad i = 1, \ldots, I\,,$

where the weights $w_i$ of each composite are appropriately normalized (see Confirmatory composite analysis#Model identification). In the following, it is assumed that the weights are scaled in such a way that each composite has a variance of one, i.e., $w_i' \Sigma_{ii} w_i = 1$. Moreover, it is assumed that the observable random variables are standardized, having a mean of zero and a unit variance. Generally, the variance-covariance matrices $\Sigma_{ii}$ of the sub-vectors $x_i$ are not constrained beyond being positive definite. Similar to the latent variables of a factor model, the composites explain the covariances between the sub-vectors, leading to the following inter-block covariance matrix:

$\Sigma_{ij} = \operatorname{cov}(x_i, x_j) = \rho_{ij}\, \Sigma_{ii} w_i w_j' \Sigma_{jj} = \rho_{ij}\, \lambda_i \lambda_j'\,,$

where $\rho_{ij} = \operatorname{cor}(c_i, c_j)$ is the correlation between the composites $c_i$ and $c_j$, and $\lambda_i = \Sigma_{ii} w_i$ contains the composite loadings. The composite model imposes rank-one constraints on the inter-block covariance matrices $\Sigma_{ij}$, i.e., $\operatorname{rank}(\Sigma_{ij}) = 1$. Generally, the variance-covariance matrix of $x$ is positive definite iff the correlation matrix of the composites and the variance-covariance matrices $\Sigma_{ii}$ are both positive definite. [7]
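
A minimal numerical sketch (in Python with NumPy) may help to illustrate this structure; the weights, intra-block covariance matrices, and composite correlation below are made-up example values, not estimates from any data set:

```python
import numpy as np

# Hypothetical example: two blocks of two standardized indicators each.
S11 = np.array([[1.0, 0.5],
                [0.5, 1.0]])
S22 = np.array([[1.0, 0.4],
                [0.4, 1.0]])

# Weight vectors, rescaled so that each composite has unit variance: w' S w = 1.
w1 = np.array([0.6, 0.6])
w1 = w1 / np.sqrt(w1 @ S11 @ w1)
w2 = np.array([0.7, 0.5])
w2 = w2 / np.sqrt(w2 @ S22 @ w2)

rho12 = 0.3  # assumed correlation between the two composites

# Composite loadings and the rank-one inter-block covariance matrix.
lam1, lam2 = S11 @ w1, S22 @ w2
S12 = rho12 * np.outer(lam1, lam2)   # Sigma_12 = rho_12 * lambda_1 * lambda_2'

# Assemble the model-implied covariance matrix of all indicators.
Sigma = np.block([[S11, S12],
                  [S12.T, S22]])
print(np.linalg.matrix_rank(S12))     # 1: the rank-one constraint holds
print(w1 @ S11 @ w1, w2 @ S22 @ w2)   # both 1.0: unit-variance composites
```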

In addition, the composites can be related via a structural model which constrains the correlation matrix of the composites indirectly via a set of simultaneous equations: [7]

$c_{\text{en}} = B\, c_{\text{en}} + \Gamma\, c_{\text{ex}} + \zeta\,,$

where the vector $c$ is partitioned into an exogenous part $c_{\text{ex}}$ and an endogenous part $c_{\text{en}}$, and the matrices $B$ and $\Gamma$ contain the so-called path (and feedback) coefficients. Moreover, the vector $\zeta$ contains the structural error terms, which have a zero mean and are uncorrelated with $c_{\text{ex}}$. As the model need not be recursive, the matrix $B$ is not necessarily triangular, and the elements of $\zeta$ may be correlated.
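
To make these constraints concrete, the following sketch (with made-up coefficients for a hypothetical recursive model) derives the implied correlation matrix of the composites from the reduced form $c_{\text{en}} = (I - B)^{-1}(\Gamma c_{\text{ex}} + \zeta)$:

```python
import numpy as np

# Hypothetical recursive model: one exogenous composite c1, two
# endogenous composites with equations
#   c2 = 0.5*c1 + zeta2,   c3 = 0.4*c1 + 0.3*c2 + zeta3.
Gamma = np.array([[0.5], [0.4]])   # paths from c_ex to c_en
B     = np.array([[0.0, 0.0],
                  [0.3, 0.0]])     # paths among c_en (triangular: recursive)
Phi   = np.array([[1.0]])          # covariance matrix of c_ex
Psi   = np.diag([0.75, 0.63])      # error variances, chosen so all variances are 1

# Reduced form: c_en = (I - B)^{-1} (Gamma c_ex + zeta)
A = np.linalg.inv(np.eye(2) - B)
C_en_ex = A @ Gamma @ Phi                          # cov(c_en, c_ex)
C_en    = A @ (Gamma @ Phi @ Gamma.T + Psi) @ A.T  # cov(c_en)

R = np.block([[Phi, C_en_ex.T],
              [C_en_ex, C_en]])
print(np.round(R, 3))   # implied correlation matrix of (c1, c2, c3)
```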

Model identification

To ensure identification of the composite model, each composite must be correlated with at least one variable not forming that composite. In addition to this non-isolation condition, each composite needs to be normalized, e.g., by fixing one weight per composite, the length of each weight vector, or the composite's variance to a certain value. [2] If the composites are embedded in a structural model, the structural model also needs to be identified. [7] Finally, since the weight signs are still undetermined, it is recommended to select a dominant indicator per block of indicators that dictates the orientation of the composite. [3]

The degrees of freedom of the basic composite model, i.e., with no constraints imposed on the composites' correlation matrix, are calculated as follows: [2]

df = number of non-redundant off-diagonal elements of the indicator covariance matrix
   − number of free correlations among the composites
   − number of free covariances between the composites and indicators not forming a composite
   − number of covariances among the indicators not forming a composite
   − number of free non-redundant off-diagonal elements of each intra-block covariance matrix
   − number of weights
   + number of blocks
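
As an illustration, the formula can be evaluated for a hypothetical basic composite model with three blocks of three standardized indicators each and no indicators outside the blocks:

```python
# Degrees of freedom of a basic composite model, following the formula above.
# Hypothetical example: 3 blocks of 3 indicators each, no isolated indicators.
p = 9                      # total number of indicators
n_blocks = 3
block_sizes = [3, 3, 3]

df = (
    p * (p - 1) // 2                    # non-redundant off-diagonal elements
    - n_blocks * (n_blocks - 1) // 2    # free correlations among the composites
    - 0                                 # covariances composites <-> isolated indicators
    - 0                                 # covariances among isolated indicators
    - sum(k * (k - 1) // 2 for k in block_sizes)  # free intra-block elements
    - sum(block_sizes)                  # number of weights
    + n_blocks                          # number of blocks
)
print(df)  # 36 - 3 - 0 - 0 - 9 - 9 + 3 = 18
```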

Model estimation

To estimate the parameters of a composite model, various methods that create composites can be used, [6] such as generalized canonical correlation analysis, principal component analysis, and linear discriminant analysis. Moreover, a maximum-likelihood estimator [14] [15] [16] and composite-based methods for SEM, such as partial least squares path modeling and generalized structured component analysis, [17] can be employed to estimate the weights and the correlations among the composites.
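
As a minimal sketch of one such composite-creating approach (a per-block principal component analysis, not the estimator any particular study prescribes; the function name and data are illustrative), weights can be taken from each block's first eigenvector and rescaled to yield unit-variance composites, with the sign fixed via a dominant indicator as recommended above:

```python
import numpy as np

def estimate_composites(X_blocks, dominant=None):
    """Form one composite per block from the first principal component.

    X_blocks: list of (n, k_i) arrays of standardized indicators.
    dominant: optional list with the index of a dominant indicator per
              block, used to fix the otherwise arbitrary weight signs.
    """
    scores, weights = [], []
    for i, X in enumerate(X_blocks):
        S = np.cov(X, rowvar=False)
        _, eigvec = np.linalg.eigh(S)
        w = eigvec[:, -1]                 # first principal component
        w = w / np.sqrt(w @ S @ w)        # rescale: unit-variance composite
        if dominant is not None and w[dominant[i]] < 0:
            w = -w                        # orient composite via dominant indicator
        weights.append(w)
        scores.append(X @ w)
    return weights, np.corrcoef(np.column_stack(scores), rowvar=False)

# Usage with made-up data: three blocks of three standardized indicators.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 9))
blocks = [X[:, :3], X[:, 3:6], X[:, 6:]]
w, R = estimate_composites(blocks, dominant=[0, 0, 0])
print(np.round(R, 2))   # estimated correlations among the composites
```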

Evaluating model fit

In CCA, the model fit, i.e., the discrepancy between the estimated model-implied variance-covariance matrix $\hat{\Sigma}$ and its sample counterpart $S$, can be assessed in two non-exclusive ways. On the one hand, measures of fit can be employed; on the other hand, a test for overall model fit can be used. While the former relies on heuristic rules, the latter is based on statistical inference.

Fit measures for composite models comprise statistics such as the standardized root mean square residual (SRMR) [18] [4] and the root mean squared error of outer residuals (RMS). [19] In contrast to fit measures for common factor models, fit measures for composite models are relatively unexplored, and reliable thresholds still need to be determined. To assess the overall model fit by means of statistical testing, the bootstrap test for overall model fit, [20] also known as the Bollen-Stine bootstrap test, [21] can be used to investigate whether a composite model fits the data. [4] [2]
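
As a sketch of the first route, the SRMR can be computed as the square root of the mean squared difference between the sample and model-implied matrices over their non-redundant elements (assuming standardized indicators, so that covariances equal correlations; the function name is illustrative):

```python
import numpy as np

def srmr(S, Sigma_hat):
    """SRMR for standardized indicators: root mean square of the
    differences between sample and model-implied correlations,
    taken over the non-redundant elements."""
    d = S - Sigma_hat
    idx = np.tril_indices_from(d)    # lower triangle incl. diagonal
    return np.sqrt(np.mean(d[idx] ** 2))
```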

Alternative views on CCA

Besides the originally proposed CCA, the evaluation steps known from partial least squares structural equation modeling [22] (PLS-SEM) have also been dubbed CCA. [23] [24] It has been emphasized that PLS-SEM's evaluation steps, in the following called PLS-CCA, differ from CCA in many regards: [25] (i) while PLS-CCA aims at confirming reflective and formative measurement models, CCA aims at assessing composite models; (ii) PLS-CCA omits overall model fit assessment, which is a crucial step in CCA as well as in SEM; (iii) PLS-CCA is strongly linked to PLS-PM, whereas for CCA, PLS-PM can be employed as one estimator, but this is in no way mandatory. Hence, researchers who employ CCA need to be aware of which technique they are referring to.


References

  1. Henseler, Jörg; Schuberth, Florian (2020). "Using confirmatory composite analysis to assess emergent variables in business research". Journal of Business Research. 120: 147–156. doi:10.1016/j.jbusres.2020.07.026. hdl:10362/103667.
  2. Schuberth, Florian; Henseler, Jörg; Dijkstra, Theo K. (2018). "Confirmatory Composite Analysis". Frontiers in Psychology. 9: 2541. doi:10.3389/fpsyg.2018.02541. PMC 6300521. PMID 30618962.
  3. Henseler, Jörg; Hubona, Geoffrey; Ray, Pauline Ash (2016). "Using PLS path modeling in new technology research: updated guidelines". Industrial Management & Data Systems. 116 (1): 2–20. doi:10.1108/IMDS-09-2015-0382.
  4. Henseler, Jörg; Dijkstra, Theo K.; Sarstedt, Marko; Ringle, Christian M.; Diamantopoulos, Adamantios; Straub, Detmar W.; Ketchen, David J.; Hair, Joseph F.; Hult, G. Tomas M.; Calantone, Roger J. (2014). "Common Beliefs and Reality About PLS". Organizational Research Methods. 17 (2): 182–209. doi:10.1177/1094428114526928. hdl:10362/117915.
  5. Dijkstra, Theo K. (2010). "Latent Variables and Indices: Herman Wold's Basic Design and Partial Least Squares". In Esposito Vinzi, Vincenzo; Chin, Wynne W.; Henseler, Jörg; Wang, Huiwen (eds.). Handbook of Partial Least Squares. Berlin, Heidelberg: Springer Handbooks of Computational Statistics. pp. 23–46. CiteSeerX 10.1.1.579.8461. doi:10.1007/978-3-540-32827-8_2. ISBN 978-3-540-32825-4.
  6. Dijkstra, Theo K.; Henseler, Jörg (2011). "Linear indices in nonlinear structural equation models: best fitting proper indices and other composites". Quality & Quantity. 45 (6): 1505–1518. doi:10.1007/s11135-010-9359-z. S2CID 120868602.
  7. Dijkstra, Theo K. (2017). "A Perfect Match Between a Model and a Mode". In Latan, Hengky; Noonan, Richard (eds.). Partial Least Squares Path Modeling: Basic Concepts, Methodological Issues and Applications. Cham: Springer International Publishing. pp. 55–80. doi:10.1007/978-3-319-64069-3_4. ISBN 978-3-319-64068-6.
  8. Simon, Herbert A. (1969). The Sciences of the Artificial (3rd ed.). Cambridge, MA: MIT Press.
  9. Henseler, Jörg (2017). "Bridging Design and Behavioral Research With Variance-Based Structural Equation Modeling". Journal of Advertising. 46 (1): 178–192. doi:10.1080/00913367.2017.1281780.
  10. Henseler, Jörg (2015). Is the Whole More than the Sum of Its Parts? On the Interplay of Marketing and Design Research. Enschede: University of Twente.
  11. Bollen, Kenneth A.; Bauldry, Shawn (2011). "Three Cs in measurement models: Causal indicators, composite indicators, and covariates". Psychological Methods. 16 (3): 265–284. doi:10.1037/a0024448. PMC 3889475. PMID 21767021.
  12. van Riel, Allard C. R.; Henseler, Jörg; Kemény, Ildikó; Sasovova, Zuzana (2017). "Estimating hierarchical constructs using consistent partial least squares: The case of second-order composites of common factors". Industrial Management & Data Systems. 117 (3): 459–477. doi:10.1108/IMDS-07-2016-0286.
  13. Schuberth, Florian; Rademaker, Manuel E.; Henseler, Jörg (2020). "Estimating and assessing second-order constructs using PLS-PM: the case of composites of composites". Industrial Management & Data Systems. 120 (12): 2211–2241. doi:10.1108/IMDS-12-2019-0642. hdl:10362/104253. S2CID 225288321.
  14. Henseler, Jörg; Schuberth, Florian (2021). "Chapter 8: Confirmatory Composite Analysis". In Henseler, Jörg (ed.). Composite-based Structural Equation Modeling: Analyzing Latent and Emergent Variables. The Guilford Press. pp. 179–201. ISBN 9781462545605.
  15. Schuberth, Florian (2023). "The Henseler-Ogasawara specification of composites in structural equation modeling: A tutorial". Psychological Methods. 28 (4): 843–859. doi:10.1037/met0000432. PMID 34914475. S2CID 237984577.
  16. Yu, Xi; Schuberth, Florian; Henseler, Jörg (2023). "Specifying composites in structural equation modeling: A refinement of the Henseler-Ogasawara specification". Statistical Analysis and Data Mining. 16 (4): 348–357. doi:10.1002/sam.11608. hdl:10362/148024.
  17. Hwang, Heungsun; Takane, Yoshio (2004). "Generalized structured component analysis". Psychometrika. 69 (1): 81–99. doi:10.1007/BF02295841. S2CID 120403741.
  18. Hu, Li-tze; Bentler, Peter M. (1998). "Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification". Psychological Methods. 3 (4): 424–453. doi:10.1037/1082-989X.3.4.424.
  19. Lohmöller, Jan-Bernd (1989). Latent Variable Path Modeling with Partial Least Squares. Heidelberg: Physica-Verlag. ISBN 9783642525148.
  20. Beran, Rudolf; Srivastava, Muni S. (1985). "Bootstrap Tests and Confidence Regions for Functions of a Covariance Matrix". The Annals of Statistics. 13 (1): 95–115. doi:10.1214/aos/1176346579.
  21. Bollen, Kenneth A.; Stine, Robert A. (1992). "Bootstrapping Goodness-of-Fit Measures in Structural Equation Models". Sociological Methods & Research. 21 (2): 205–229. doi:10.1177/0049124192021002004. S2CID 121228129.
  22. Hair, Joe F.; Hult, G. Tomas M.; Ringle, Christian M.; Sarstedt, Marko (2014). A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM). Thousand Oaks: Sage.
  23. Hair, Joseph F.; Anderson, Drexel; Babin, Barry; Black, William (2018). Multivariate Data Analysis (8th ed.). Cengage Learning EMEA. ISBN 978-1473756540.
  24. Hair, Joe F.; Howard, Matt C.; Nitzl, Christian (2020). "Assessing measurement model quality in PLS-SEM using confirmatory composite analysis". Journal of Business Research. 109: 101–110. doi:10.1016/j.jbusres.2019.11.069. S2CID 214571652.
  25. Schuberth, Florian (2021). "Confirmatory composite analysis using partial least squares: Setting the record straight". Review of Managerial Science. 15 (5): 1311–1345. doi:10.1007/s11846-020-00405-0.