Item tree analysis

Item tree analysis (ITA) is a data-analytic method that constructs a hierarchical structure on the items of a questionnaire or test from observed response patterns.
Assume that we have a questionnaire with m items to which subjects can respond positively (1) or negatively (0), i.e. the items are dichotomous. If n subjects answer the items, this results in a binary data matrix D with m columns and n rows. Typical examples of this data format are test items that subjects can solve (1) or fail (0), and questionnaires whose items are statements to which subjects can agree (1) or disagree (0).
Depending on the content of the items, it is possible that the response of a subject to an item j determines her or his responses to other items. It is, for example, possible that each subject who agrees to item j will also agree to item i. In this case we say that item j implies item i (short j → i). The goal of an ITA is to uncover such deterministic implications from the data set D.
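To make the data format and the notion of an implication concrete, the following minimal Python sketch builds a small invented data matrix D and counts the counterexamples to an implication j → i; the implication holds deterministically in the sample exactly when this count is zero. The data and the function name are illustrative assumptions, not part of any published ITA implementation.

```python
import numpy as np

# Invented example data: n = 4 subjects (rows), m = 3 items (columns);
# 1 = agree / solved, 0 = disagree / failed.
D = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
])

def counterexamples(D, j, i):
    """Number of subjects who answer item j with 1 but item i with 0.

    A count of 0 means the sample contains no violation of "j implies i".
    """
    return int(np.sum((D[:, j] == 1) & (D[:, i] == 0)))

print(counterexamples(D, 1, 0))  # 0 -> "item 1 implies item 0" holds here
print(counterexamples(D, 2, 1))  # 0 -> "item 2 implies item 1" holds here
print(counterexamples(D, 0, 2))  # 3 subjects violate "item 0 implies item 2"
```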

Algorithms for ITA

ITA was originally developed by Van Leeuwe in 1974.[1] The result of his algorithm, which we refer to in the following as classical ITA, is a logically consistent set of implications i → j. Logically consistent means that if i implies j and j implies k, then i implies k for each triple i, j, k of items. Thus the outcome of an ITA is a reflexive and transitive relation on the item set, i.e. a quasi-order on the items.
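As an illustration of this consistency requirement, the following small helper (an invented sketch, not code from any of the cited papers) checks whether a set of implications, given as pairs (j, i) read as "j implies i", is reflexive and transitive, i.e. a quasi-order:

```python
def is_quasi_order(items, implies):
    """Check that the relation is reflexive and transitive on `items`.

    `implies` is a set of pairs (j, i), each read as "item j implies item i".
    """
    reflexive = all((i, i) in implies for i in items)
    transitive = all(
        (a, c) in implies
        for (a, b) in implies
        for (b2, c) in implies
        if b == b2
    )
    return reflexive and transitive

items = {0, 1, 2}
rel = {(i, i) for i in items} | {(2, 1), (1, 0), (2, 0)}  # chain 2 -> 1 -> 0
print(is_quasi_order(items, rel))  # True: the chain is transitively closed
```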
A different algorithm for performing an ITA, called inductive ITA, was suggested in Schrepp (1999).
Classical ITA and inductive ITA both construct a quasi-order on the item set by exploratory data analysis, but they use different algorithms to do so. For a given data set, the quasi-orders resulting from classical and from inductive ITA will usually differ.
A detailed description of the algorithms used in classical and inductive ITA can be found in Schrepp (2003) or Schrepp (2006). In a more recent paper (Sargin & Ünlü, 2009), some modifications to the algorithm of inductive ITA are proposed that improve the method's ability to detect the correct implications from data, especially in the case of higher random response error rates.
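A heavily simplified sketch of the basic idea behind inductive ITA is given below: count the counterexamples b[j, i] for every candidate implication j → i, relax a tolerance level L step by step, and keep the first resulting relation that is transitive. All names here are invented, and the final selection step replaces the fit-based criterion of the published method with a much cruder rule; the sketch is meant only to make the counterexample-counting idea concrete, not to reproduce Schrepp's algorithm.

```python
import numpy as np

def counterexample_counts(D):
    """b[j, i] = number of subjects answering item j with 1 and item i with 0,
    i.e. counterexamples to the candidate implication "j implies i"."""
    m = D.shape[1]
    b = np.zeros((m, m), dtype=int)
    for j in range(m):
        for i in range(m):
            b[j, i] = np.sum((D[:, j] == 1) & (D[:, i] == 0))
    return b

def relation_at_tolerance(b, L):
    """Accept j -> i whenever at most L counterexamples are observed."""
    m = b.shape[0]
    return {(j, i) for j in range(m) for i in range(m) if b[j, i] <= L}

def is_transitive(rel):
    return all((a, c) in rel for (a, b) in rel for (b2, c) in rel if b == b2)

def simplified_inductive_ita(D):
    """Return the relation at the smallest transitive tolerance level.

    The diagonal b[j, j] = 0 guarantees reflexivity, so the result is a
    quasi-order; at L = b.max() the full relation is reached, which is
    trivially transitive, so the loop always returns.
    """
    b = counterexample_counts(D)
    for L in range(int(b.max()) + 1):
        rel = relation_at_tolerance(b, L)
        if is_transitive(rel):
            return rel
```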

Relation to other methods

ITA belongs to a group of data analysis methods called Boolean analysis of questionnaires. Boolean analysis was introduced by Flament in 1976.[2] The goal of a Boolean analysis is to detect deterministic dependencies between the items of a questionnaire or test, i.e. formulas from Boolean logic connecting the items, such as i → j, i ∧ j → k, and i ∨ j → k. Since the basic work of Flament (1976), a number of different methods for Boolean analysis have been developed; see, for example, Van Buggenhaut and Degreef (1987), Duquenne (1987), or Theuns (1994). These methods share the goal of deriving deterministic dependencies between the items of a questionnaire from data, but they differ in the algorithms used to reach this goal. A comparison of ITA to other methods of Boolean data analysis can be found in Schrepp (2003).
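A compound dependency such as i ∧ j → k can be tested against a binary data matrix in the same counterexample-counting style as above; the following is again an illustrative sketch, not one of the cited Boolean-analysis methods:

```python
import numpy as np

def violations_of_and_implication(D, i, j, k):
    """Subjects who answer both i and j with 1 but k with 0 violate i ∧ j → k."""
    return int(np.sum((D[:, i] == 1) & (D[:, j] == 1) & (D[:, k] == 0)))
```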

Applications

Several research papers describe concrete applications of item tree analysis. Held and Korossy (1998) analyze implications on a set of algebra problems with classical ITA. Item tree analysis is also used in a number of social science studies to gain insight into the structure of dichotomous data. In Bart and Krus (1973), for example, a predecessor of ITA is used to establish a hierarchical order on items that describe socially unaccepted behavior. In Janssens (1999) a method of Boolean analysis is used to investigate the integration process of minorities into the value system of the dominant culture. Schrepp[3] describes several applications of inductive ITA in the analysis of dependencies between items of social science questionnaires.

Example of an application

To show the possibilities of an analysis of a data set by ITA, we analyze the statements of question 4 of the International Social Survey Programme (ISSP) for the year 1995 by inductive and classical ITA. The ISSP is a continuing annual program of cross-national collaboration on surveys covering topics important for social science research. Each year, the program conducts one survey with comparable questions in each of the participating nations. The theme of the 1995 survey was national identity. We analyze the results for question 4 for the data set of Western Germany. The statement for question 4 was:

Some people say the following things are important for being truly German. Others say they are not important. How important do you think each of the following is:
1. to have been born in Germany
2. to have German citizenship
3. to have lived in Germany for most of one’s life
4. to be able to speak German
5. to be a Christian
6. to respect Germany’s political institutions
7. to feel German

Subjects could respond to each statement with Very important, Important, Not very important, Not important at all, or Can't choose. To apply ITA to this data set, we recoded the answer categories:
Very important and Important are coded as 1; Not very important and Not important at all are coded as 0; Can't choose is handled as missing data.
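In code, this recoding step could look as follows; the column names and raw answers are invented for illustration, assuming the survey responses are stored as strings in a pandas DataFrame:

```python
import pandas as pd

recode = {
    "Very important": 1,
    "Important": 1,
    "Not very important": 0,
    "Not important at all": 0,
    "Can't choose": pd.NA,  # handled as missing data
}

raw = pd.DataFrame({
    "born_in_germany": ["Very important", "Can't choose", "Not very important"],
    "citizenship": ["Important", "Important", "Not important at all"],
})
binary = raw.replace(recode)  # dichotomous 0/1 matrix with missing values
print(binary)
```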
The following figure shows the resulting quasi-orders from inductive ITA and from classical ITA.
[Figure: Results of ITA for ISSP 1995, question 4: the quasi-orders obtained from inductive and from classical ITA.]

Available software

The program ITA 2.0 implements both classical and inductive ITA. The program and a short documentation of it are available online.

See also

Item response theory

Notes

  1. See Van Leeuwe (1974)
  2. See Flament (1976)
  3. See Schrepp (2002) and Schrepp (2003)

