Item-total correlation

From Wikipedia, the free encyclopedia

The item–total correlation is the correlation between a scored item and the total test score. It is an item statistic used in psychometric analysis to diagnose assessment items that fail to indicate the underlying psychological trait so that they can be removed or revised. [1]

The item-total correlation in item analysis

In item analysis, an item–total correlation is usually calculated for each item of a scale or test to diagnose the degree to which assessment items indicate the underlying trait. Assuming that most of the items of an assessment do indicate the underlying trait, each item should have a reasonably strong positive correlation with the total score on that assessment. An important goal of item analysis is to identify and remove or revise items that are not good indicators of the underlying trait. [2]

A small or negative item–total correlation provides empirical evidence that the item is not measuring the same construct as the rest of the assessment. Exact values depend on the type of measure, but as a heuristic, a correlation below 0.2 indicates that the item does not correlate well with the scale overall and may therefore be dropped. A negative value indicates that the item may be damaging the overall psychometric reliability of the measure. [3] [4] Identifying and removing (or revising) poorly performing items is a critical way in which psychometric analysis can improve the quality of a measure.
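The screening heuristic described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the function names, the hypothetical response matrix, and the default threshold argument are assumptions made for the example, and the correlation is the plain (uncorrected) Pearson correlation between each item and the total score.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each item with the total score.

    responses: list of rows, one row per respondent, one score per item.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson([row[j] for row in responses], totals)
            for j in range(n_items)]


def flag_items(responses, threshold=0.2):
    """Return indices of items whose item-total correlation is below threshold."""
    correlations = item_total_correlations(responses)
    return [j for j, r in enumerate(correlations) if r < threshold]


# Hypothetical data: 6 respondents, 4 items scored 1-5.
# The last item is deliberately keyed against the other three.
scores = [
    [5, 4, 5, 1],
    [4, 4, 4, 2],
    [3, 3, 4, 3],
    [2, 3, 2, 4],
    [2, 1, 2, 4],
    [1, 2, 1, 5],
]

flagged = flag_items(scores)  # indices of items to review, revise, or drop
```

With these illustrative scores, only the last item falls below the 0.2 heuristic, and its correlation is negative, which is the pattern the text associates with damage to reliability.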

When items are scored dichotomously, as in exams with correct and incorrect answers, the item–total correlation may be calculated as either a point-biserial correlation or a biserial correlation. The choice matters because items vary in difficulty, and the point-biserial correlation cannot attain its theoretical limits of −1 and +1 unless the proportion correct is 0.50 (50% of examinees answering the item correctly). The biserial correlation includes a correction that, in theory, avoids this issue. [1] In practice, analysts should choose either the point-biserial or the biserial and not compare values across the two, because the biserial correction always produces a larger magnitude than the point-biserial. [5]
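The relationship between the two coefficients can be illustrated with a short Python sketch. It uses the common textbook formulas rather than anything specified in this article: the point-biserial is computed from the difference in mean total score between the two response groups, and the biserial rescales it under the assumption that a normally distributed latent variable underlies the right/wrong score. The function names and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev


def point_biserial(item, total):
    """Point-biserial correlation between a 0/1 item and a total score."""
    ones = [t for i, t in zip(item, total) if i == 1]
    zeros = [t for i, t in zip(item, total) if i == 0]
    p = len(ones) / len(item)  # proportion answering correctly
    # Population SD (divide by n) makes this equal to the Pearson correlation.
    return (fmean(ones) - fmean(zeros)) / pstdev(total) * sqrt(p * (1 - p))


def biserial(item, total):
    """Biserial correlation, assuming a normal latent trait behind the item.

    Requires 0 < p < 1; inv_cdf raises an error otherwise.
    """
    p = sum(item) / len(item)
    std_normal = NormalDist()
    # Height of the standard normal density at the cut point that splits
    # the distribution into proportions p and 1 - p.
    y = std_normal.pdf(std_normal.inv_cdf(p))
    return point_biserial(item, total) * sqrt(p * (1 - p)) / y


# Hypothetical data: 10 examinees, one dichotomous item, total test scores.
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
total = [9, 8, 7, 7, 6, 5, 5, 4, 3, 2]

r_pb = point_biserial(item, total)
r_b = biserial(item, total)
```

Because the rescaling factor `sqrt(p * (1 - p)) / y` is always greater than 1, the biserial value exceeds the point-biserial in magnitude for the same data, which is why the two should not be compared directly. With small or strongly separated samples, a biserial estimate computed this way can even exceed 1.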

The item-reliability index (IRI) is defined as the product of the point-biserial item–total correlation and the item standard deviation. In classical test theory, the IRI indexes the degree to which an item contributes true score variance to the exam's observed score variance. In practice, a negative IRI indicates the relative degree to which an item damages the reliability estimate, and a positive value indicates the relative degree to which it contributes towards a high reliability estimate. [5]
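The definition above translates directly into code. The following Python sketch is illustrative only: the function names and data are hypothetical, and it assumes the population standard deviation (dividing by n), which for a 0/1 item equals the square root of p(1 − p). The text does not state which form of the standard deviation is intended.

```python
from math import sqrt
from statistics import pstdev


def pearson(x, y):
    """Pearson correlation; for a 0/1 item this is the point-biserial."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_reliability_index(item, total):
    """IRI = item-total correlation multiplied by the item standard deviation.

    Assumption: population SD (pstdev) is used for the item.
    """
    return pearson(item, total) * pstdev(item)


# Hypothetical data: 10 examinees, one dichotomous item, total test scores.
item = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
total = [9, 8, 7, 7, 6, 5, 5, 4, 3, 2]

iri = item_reliability_index(item, total)

# Reversing the item's scoring flips the sign of the correlation, and so of the IRI.
reversed_item = [1 - score for score in item]
iri_reversed = item_reliability_index(reversed_item, total)
```

The sign of the result follows the sign of the item–total correlation, so the reversed item in this example yields a negative IRI of the same magnitude.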

Related Research Articles

Psychological statistics is the application of formulas, theorems, numbers and laws to psychology. Statistical methods for psychology include the development and application of statistical theory and methods for modeling psychological data. These methods include psychometrics, factor analysis, experimental designs, and Bayesian statistics. The article also discusses journals in the same field.

Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.

In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions:

"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 and 1.00, are usually used to indicate the amount of error in the scores."

Cronbach's alpha, also known as tau-equivalent reliability or coefficient alpha, is a reliability coefficient and a measure of the internal consistency of tests and measures. It was named after the American psychologist Lee Cronbach.

Classical test theory (CTT) is a body of related psychometric theory that predicts outcomes of psychological testing such as the difficulty of items or the ability of test-takers. It is a theory of testing based on the idea that a person's observed or obtained score on a test is the sum of a true score (error-free score) and an error score. Generally speaking, the aim of classical test theory is to understand and improve the reliability of psychological tests.

In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.

A Likert scale is a psychometric scale named after its inventor, American social psychologist Rensis Likert, which is commonly used in research questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

In statistics and research, internal consistency is typically a measure based on the correlations between different items on the same test. It measures whether several items that propose to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test.

In psychometrics, the Kuder–Richardson formulas, first published in 1937, are a measure of internal consistency reliability for measures with dichotomous choices. They were developed by Kuder and Richardson.

A self-report inventory is a type of psychological test in which a person fills out a survey or questionnaire with or without the help of an investigator. Self-report inventories often ask direct questions about personal interests, values, symptoms, behaviors, and traits or personality types. Inventories are different from tests in that there is no objectively correct answer; responses are based on opinions and subjective perceptions. Most self-report inventories are brief and can be taken or administered within five to 15 minutes, although some, such as the Minnesota Multiphasic Personality Inventory (MMPI), can take several hours to fully complete. They are popular because they can be inexpensive to give and to score, and their scores can often show good reliability.

In statistics, inter-rater reliability is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.

The Millon Clinical Multiaxial Inventory – Fourth Edition (MCMI-IV) is the most recent edition of the Millon Clinical Multiaxial Inventory. The MCMI is a psychological assessment tool intended to provide information on personality traits and psychopathology, including specific mental disorders outlined in the DSM-5. It is intended for adults with at least a 5th grade reading level who are currently seeking mental health services. The MCMI was developed and standardized specifically on clinical populations, and the authors state explicitly that it should not be used with the general population or adolescents. However, there is an evidence base showing that it may still retain validity in non-clinical populations, and so psychologists will sometimes administer the test to members of the general population, with caution. The concepts involved in the questions and their presentation make it unsuitable for those with below average intelligence or reading ability.

Differential item functioning (DIF) is a statistical property of a test item that indicates how likely it is for individuals from distinct groups, possessing similar abilities, to respond differently to the item. It manifests when individuals from different groups, with comparable skill levels, do not have an equal likelihood of answering a question correctly. There are two primary types of DIF: uniform DIF, where one group consistently has an advantage over the other, and nonuniform DIF, where the advantage varies based on the individual's ability level. The presence of DIF requires review and judgment, but it does not always signify bias. DIF analysis provides an indication of unexpected behavior of items on a test. An item does not show DIF merely because individuals from different groups have different probabilities of selecting a specific response. Rather, DIF is present when individuals from different groups who possess the same underlying true ability have different probabilities of giving a certain response. Even when uniform bias is present, test developers sometimes resort to assumptions, such as that DIF biases may offset each other, because of the extensive work required to address it, compromising test ethics and perpetuating systemic biases. Common procedures for assessing DIF are the Mantel-Haenszel procedure, logistic regression, item response theory (IRT) based methods, and confirmatory factor analysis (CFA) based methods.

The multitrait-multimethod (MTMM) matrix is an approach to examining construct validity developed by Campbell and Fiske (1959). It organizes convergent and discriminant validity evidence for comparison of how a measure relates to other measures. The conceptual approach has influenced experimental design and measurement theory in psychology, including applications in structural equation models.

Psychometric software refers to specialized programs used for the psychometric analysis of data obtained from tests, questionnaires, polls or inventories that measure latent psychoeducational variables. Although some psychometric analyses can be performed using general statistical software such as SPSS, most require specialized tools designed specifically for psychometric purposes.

The Narcissistic Personality Inventory (NPI) was developed in 1979 by Raskin and Hall, and since then, has become one of the most widely utilized personality measures for non-clinical levels of the trait narcissism. Since its initial development, the NPI has evolved from 220 items to the more commonly employed NPI-40 (1984) and NPI-16 (2006), as well as the novel NPI-1 inventory (2014). Derived from the DSM-III criteria for Narcissistic personality disorder (NPD), the NPI has been employed heavily by personality and social psychology researchers.

The Psychopathic Personality Inventory (PPI-Revised) is a personality test for traits associated with psychopathy in adults. The PPI was developed by Scott Lilienfeld and Brian Andrews to assess these traits in non-criminal populations, though it is still used in clinical populations as well. In contrast to other psychopathy measures, such as the Hare Psychopathy Checklist (PCL), the PPI is a self-report scale, rather than an interview-based assessment. It is intended to comprehensively index psychopathic personality traits without assuming particular links to anti-social or criminal behaviors. It also includes measures to detect impression management or careless responding.

Empathy quotient (EQ) is a psychological self-report measure of empathy developed by Simon Baron-Cohen and Sally Wheelwright at the Autism Research Centre at the University of Cambridge. EQ is based on a definition of empathy that includes cognition and affect.

The Dark Triad Dirty Dozen (DTDD) is a brief 12-question personality inventory test to assess the possible presence of the three subclinical dark triad traits: Machiavellianism, narcissism, and psychopathy. The DTDD was developed to identify the dark triad traits among subclinical adult populations. It is a screening test.

References

  1. Henrysson, Sten (1963-06-01). "Correction of item-total correlations in item analysis". Psychometrika. 28 (2): 211–218. doi:10.1007/BF02289618. ISSN 1860-0980. S2CID 120534016.
  2. Churchill, G.A. (1979). "A paradigm for developing better measures of marketing constructs". Journal of Marketing Research. 16 (1): 64–73. doi:10.1177/002224377901600110. JSTOR 3150876.
  3. Everitt, B.S. (2002). The Cambridge Dictionary of Statistics, 2nd Edition. CUP. ISBN 0-521-81099-X.
  4. Field, A. (2005). Discovering Statistics Using SPSS. 2nd ed. London: Sage.
  5. Allen, M.J., & Yen, W.M. (1979). Introduction to Measurement Theory. Wadsworth. ISBN 0-8185-0283-5.