Chi-square automatic interaction detection

Chi-square automatic interaction detection (CHAID)[1][2][3] is a decision tree technique based on adjusted significance testing (Bonferroni correction, Holm–Bonferroni testing). The technique was developed in South Africa in 1975 and published in 1980 by Gordon V. Kass, who had completed a PhD thesis on the topic. CHAID can be used for prediction (in a similar fashion to regression analysis, this version of CHAID being originally known as XAID) as well as classification, and for detection of interaction between variables. CHAID is based on a formal extension of the AID (Automatic Interaction Detection)[4] and THAID (THeta Automatic Interaction Detection)[5][6] procedures of the 1960s and 1970s, which in turn were extensions of earlier research, including that performed by Belson in the UK in the 1950s.[7] A history of earlier supervised tree methods, together with a detailed description of the original CHAID algorithm and the exhaustive CHAID extension by Biggs, De Ville, and Suen,[2] can be found in Ritschard.[3]
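
The core splitting step can be sketched in a few lines of Python. The sketch below is illustrative rather than a faithful reproduction of Kass's algorithm: each categorical predictor is tested against the target with a chi-square test of independence, and a simplified Bonferroni correction over the number of predictors is applied (the original correction instead accounts for the number of ways a predictor's categories can be merged). The data frame and column names are invented for the example.

```python
# Illustrative sketch of CHAID-style split selection (not Kass's exact
# procedure): test each categorical predictor against the target with a
# chi-square test of independence, Bonferroni-adjust the p-values, and
# pick the most significant predictor.
import pandas as pd
from scipy.stats import chi2_contingency

def choose_split(df: pd.DataFrame, target: str, alpha: float = 0.05):
    predictors = [c for c in df.columns if c != target]
    best = None
    for col in predictors:
        table = pd.crosstab(df[col], df[target])   # contingency table
        _, p, _, _ = chi2_contingency(table)       # test of independence
        p_adj = min(p * len(predictors), 1.0)      # simplified Bonferroni
        if p_adj < alpha and (best is None or p_adj < best[1]):
            best = (col, p_adj)
    return best  # None means no significant split: the node becomes a leaf

# Invented example data: "region" separates the response far better than
# "segment", so it should be chosen for the split.
example = pd.DataFrame({
    "region":   ["N", "N", "S", "S", "E", "E", "W", "W"] * 10,
    "segment":  ["a", "b", "a", "b", "a", "b", "a", "b"] * 10,
    "response": ["yes", "no", "yes", "yes", "no", "no", "yes", "no"] * 10,
})
print(choose_split(example, "response"))
```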

In practice, CHAID is often used in direct marketing to select groups of consumers and predict how their responses to some variables affect other variables, although early applications also appeared in medical and psychiatric research.

Like other decision trees, CHAID produces output that is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively: with small samples the respondent groups can quickly become too small for reliable analysis.
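
The multiway splits come from CHAID's merging step: categories of a predictor whose outcome distributions do not differ significantly are pooled before the node is split, and each surviving group becomes a branch. A rough sketch of that heuristic follows; it is deliberately simplified relative to Kass's original procedure (no re-splitting of merged groups, a crude stopping rule), and the function and parameter names are invented.

```python
# Simplified sketch of CHAID's category-merging heuristic: repeatedly pool
# the pair of predictor categories that is least significantly different
# with respect to the target, stopping once every remaining pair differs
# at the chosen level.
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency

def merge_categories(df: pd.DataFrame, predictor: str, target: str,
                     merge_alpha: float = 0.05) -> dict:
    labels = df[predictor].astype(str).copy()
    groups = {cat: [cat] for cat in labels.unique()}
    while len(groups) > 2:
        least = None  # (p-value, pair) of the least distinguishable pair
        for a, b in combinations(list(groups), 2):
            mask = labels.isin([a, b])
            table = pd.crosstab(labels[mask], df.loc[mask, target])
            _, p, _, _ = chi2_contingency(table)
            if least is None or p > least[0]:
                least = (p, (a, b))
        p, (a, b) = least
        if p < merge_alpha:                # all remaining pairs differ
            break
        groups[a].extend(groups.pop(b))    # pool b's categories into a
        labels = labels.replace(b, a)
    return groups  # merged-group label -> original categories
```

Each key of the returned mapping becomes one branch of the multiway split; with small samples, thinly populated categories either merge away or leave groups too small to test reliably, which is the sample-size limitation noted above.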

One important advantage of CHAID over alternatives such as multiple regression is that it is non-parametric.

Related Research Articles

Multivariate statistics is a subdivision of statistics encompassing the simultaneous observation and analysis of more than one outcome variable, i.e., multivariate random variables. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied.

Chemometrics is the science of extracting information from chemical systems by data-driven means. Chemometrics is inherently interdisciplinary, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering. In this way, it mirrors other interdisciplinary fields, such as psychometrics and econometrics.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
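As a concrete instance, a basic tree can be fit in a few lines with scikit-learn; note that its DecisionTreeClassifier implements CART-style binary splits rather than CHAID's multiway, significance-tested splits.

```python
# Minimal decision-tree example with scikit-learn (CART-style binary
# splits, not CHAID), on the built-in iris data set.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf))  # text rendering of the learned splits
```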

<span class="mw-page-title-main">Mathematical statistics</span> Branch of statistics

Mathematical statistics is the application of probability theory, a branch of mathematics, to statistics, as opposed to techniques for collecting statistical data. Specific mathematical techniques which are used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure theory.

<span class="mw-page-title-main">Kruskal–Wallis test</span> Non-parametric method for testing whether samples originate from the same distribution

The Kruskal–Wallis test by ranks, Kruskal–Wallis test, or one-way ANOVA on ranks is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes. It extends the Mann–Whitney U test, which is used for comparing only two groups. The parametric equivalent of the Kruskal–Wallis test is the one-way analysis of variance (ANOVA).
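
A minimal SciPy illustration with invented samples:

```python
# Kruskal-Wallis H-test across three independent samples (invented data).
from scipy.stats import kruskal

group_a = [6.2, 5.9, 7.1, 6.8]
group_b = [7.9, 8.4, 8.1, 7.6]
group_c = [5.1, 4.8, 5.6, 5.3]
stat, p = kruskal(group_a, group_b, group_c)
print(f"H = {stat:.2f}, p = {p:.4f}")  # small p: distributions likely differ
```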

<span class="mw-page-title-main">Linear discriminant analysis</span> Method used in statistics, pattern recognition, and other fields

Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
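
A short scikit-learn sketch showing both uses, classification and dimensionality reduction, on the built-in iris data:

```python
# Linear discriminant analysis as a classifier and as a projection onto
# at most (number of classes - 1) discriminant axes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print(lda.score(X, y))         # training accuracy as a classifier
print(lda.transform(X).shape)  # (150, 2): reduced representation
```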

In statistics, classification is the problem of identifying to which of a set of categories a new observation belongs. When classification is performed by a computer, statistical methods are normally used to develop the algorithm.

The term kernel is used in statistical analysis to refer to a window function. The term "kernel" has several distinct meanings in different branches of statistics.

In statistics, multivariate adaptive regression splines (MARS) is a form of regression analysis introduced by Jerome H. Friedman in 1991. It is a non-parametric regression technique and can be seen as an extension of linear models that automatically models nonlinearities and interactions between variables.

In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.

In statistics, one purpose for the analysis of variance (ANOVA) is to analyze differences in means between groups. The test statistic, F, assumes independence of observations, homogeneous variances, and population normality. ANOVA on ranks is a statistic designed for situations when the normality assumption has been violated.
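
One simple variant, sketched below with SciPy on invented data, pools the observations, replaces them by their ranks, and runs an ordinary one-way ANOVA on the ranked values:

```python
# ANOVA on ranks: rank the pooled observations, then apply a one-way
# ANOVA to the rank-transformed groups (invented data).
import numpy as np
from scipy.stats import f_oneway, rankdata

a = [12.1, 14.3, 13.8, 15.0]
b = [18.2, 17.5, 19.1, 18.8]
c = [11.0, 10.4, 12.2, 11.7]
ranks = rankdata(np.concatenate([a, b, c]))
ra, rb, rc = np.split(ranks, [len(a), len(a) + len(b)])
print(f_oneway(ra, rb, rc))  # F-test on the ranks
```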

<span class="mw-page-title-main">Bivariate analysis</span> Concept in statistical analysis

Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis. It involves the analysis of two variables, for the purpose of determining the empirical relationship between them.

<span class="mw-page-title-main">Peter Rousseeuw</span> Belgian statistician (born 1956)

Peter J. Rousseeuw is a statistician known for his work on robust statistics and cluster analysis. He obtained his PhD in 1981 at the Vrije Universiteit Brussel, following research carried out at the ETH in Zurich, which led to a book on influence functions. Later he was professor at the Delft University of Technology, The Netherlands, at the University of Fribourg, Switzerland, and at the University of Antwerp, Belgium. Next he was a senior researcher at Renaissance Technologies. He then returned to Belgium as professor at KU Leuven, until becoming emeritus in 2022. His former PhD students include Annick Leroy, Hendrik Lopuhaä, Geert Molenberghs, Christophe Croux, Mia Hubert, Stefan Van Aelst, Tim Verdonck and Jakob Raymaekers.

In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA not only aims at assessing the main effect of each independent variable but also if there is any interaction between them.
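
A minimal statsmodels sketch with invented data, fitting both main effects and the interaction:

```python
# Two-way ANOVA: two categorical factors, one continuous response,
# including the interaction term (invented data).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "dose":  ["low", "low", "high", "high"] * 6,
    "sex":   ["m", "f"] * 12,
    "score": [4.1, 3.8, 6.2, 5.9, 4.4, 4.0, 6.5, 6.1] * 3,
})
model = ols("score ~ C(dose) * C(sex)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # main effects and interaction
```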

Log-linear analysis is a technique used in statistics to examine the relationship between more than two categorical variables. The technique is used for both hypothesis testing and model building. In both these uses, models are tested to find the most parsimonious model that best accounts for the variance in the observed frequencies.
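
Log-linear models are commonly fit as Poisson regressions on the cell counts of a contingency table. The statsmodels sketch below, using invented counts, compares the saturated model with one that drops the three-way interaction; the deviance difference can be referred to a chi-square distribution to test that term:

```python
# Log-linear analysis of a three-way table via Poisson GLMs on the cell
# counts (invented data). Dropping terms and comparing deviances screens
# for the most parsimonious adequate model.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import glm

cells = pd.DataFrame({
    "A": ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B": ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "C": ["c1", "c2", "c1", "c2", "c1", "c2", "c1", "c2"],
    "n": [25, 17, 14, 30, 22, 13, 19, 28],
})
saturated = glm("n ~ A * B * C", data=cells,
                family=sm.families.Poisson()).fit()
reduced = glm("n ~ (A + B + C)**2", data=cells,
              family=sm.families.Poisson()).fit()
print(reduced.deviance - saturated.deviance)  # ~chi-square with 1 dof
```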

<span class="mw-page-title-main">Sequence analysis in social sciences</span> Analysis of sets of categorical sequences

In social sciences, sequence analysis (SA) is concerned with the analysis of sets of categorical sequences that typically describe longitudinal data. Analyzed sequences are encoded representations of, for example, individual life trajectories such as family formation, school-to-work transitions, and working careers, but they may also describe daily or weekly time use or represent the evolution of observed or self-reported health, of political behaviors, or of the developmental stages of organizations. Unlike words or DNA sequences, for example, such sequences are chronologically ordered.

The outline of machine learning provides an overview of, and topical guide to, machine learning.

The first edition of the textbook Data Science and Predictive Analytics: Biomedical and Health Applications using R, authored by Ivo D. Dinov, was published in August 2018 by Springer. The second edition of the book was printed in 2023.

<span class="mw-page-title-main">Homoscedasticity and heteroscedasticity</span> Statistical property

In statistics, a sequence of random variables is homoscedastic if all its random variables have the same finite variance; this is also known as homogeneity of variance. The complementary notion is called heteroscedasticity, also known as heterogeneity of variance. The spellings homoskedasticity and heteroskedasticity are also frequently used. Skedasticity comes from the Ancient Greek word skedánnymi, meaning “to scatter”. Assuming a variable is homoscedastic when in reality it is heteroscedastic results in unbiased but inefficient point estimates and in biased estimates of standard errors, and may result in overestimating the goodness of fit as measured by the Pearson coefficient.
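
Heteroscedasticity is often screened for with Levene's test; a minimal SciPy example with invented samples:

```python
# Levene's test for equality of variances across groups; a small p-value
# suggests heteroscedasticity (invented data).
from scipy.stats import levene

low_spread  = [5.0, 5.2, 4.9, 5.1, 5.0]
high_spread = [3.1, 7.4, 5.0, 8.8, 1.9]
stat, p = levene(low_spread, high_spread)
print(f"W = {stat:.2f}, p = {p:.3f}")
```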

References

  1. Kass, G. V. (1980). "An Exploratory Technique for Investigating Large Quantities of Categorical Data". Applied Statistics. 29 (2): 119–127. doi:10.2307/2986296. JSTOR 2986296.
  2. Biggs, David; De Ville, Barry; Suen, Ed (1991). "A method of choosing multiway partitions for classification and decision trees". Journal of Applied Statistics. 18 (1): 49–62. doi:10.1080/02664769100000005. ISSN 0266-4763.
  3. Ritschard, Gilbert (2013). "CHAID and Earlier Supervised Tree Methods". In McArdle, J. J.; Ritschard, G. (eds.). Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences. New York: Routledge. pp. 48–74.
  4. Morgan, James N.; Sonquist, John A. (1963). "Problems in the Analysis of Survey Data, and a Proposal". Journal of the American Statistical Association. 58 (302): 415–434. doi:10.1080/01621459.1963.10500855. ISSN 0162-1459.
  5. Messenger, Robert; Mandell, Lewis (1972). "A Modal Search Technique for Predictive Nominal Scale Multivariate Analysis". Journal of the American Statistical Association. 67 (340): 768–772. doi:10.1080/01621459.1972.10481290. ISSN 0162-1459.
  6. Morgan, James N.; Messenger, Robert C. (1973). THAID, a sequential analysis program for the analysis of nominal scale dependent variables. Ann Arbor, Mich. ISBN 0-87944-137-2. OCLC 666930.
  7. Belson, William A. (1959). "Matching and Prediction on the Principle of Biological Classification". Applied Statistics. 8 (2): 65–75. doi:10.2307/2985543. JSTOR 2985543.
