Reliability (statistics)

In statistics and psychometrics, reliability is the overall consistency of a measure. [1] A measure is said to have high reliability if it produces similar results under consistent conditions:

"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 (much error) and 1.00 (no error), are usually used to indicate the amount of error in the scores." [2]

For example, measurements of people's height and weight are often extremely reliable. [3] [4]

Types

There are several general classes of reliability estimates, including:

Test-retest reliability: the consistency of a measure from one administration of the same test to the next.

Parallel-forms reliability: the consistency of scores across two equivalent forms of a test.

Inter-rater reliability: the degree of agreement among independent observers who rate or assess the same phenomenon.

Internal consistency reliability: the consistency of results across items within a test.

Difference from validity

Reliability does not imply validity. That is, a reliable measure that is measuring something consistently is not necessarily measuring what you want it to measure. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance.

While reliability does not imply validity, reliability does place a limit on the overall validity of a test. A test that is not perfectly reliable cannot be perfectly valid, either as a means of measuring attributes of a person or as a means of predicting scores on a criterion. While a reliable test may provide useful valid information, a test that is not reliable cannot possibly be valid. [7]

For example, if a set of weighing scales consistently measured the weight of an object as 500 grams over the true weight, then the scale would be very reliable, but it would not be valid (as the returned weight is not the true weight). For the scale to be valid, it should return the true weight of an object. This example demonstrates that a perfectly reliable measure is not necessarily valid, but that a valid measure necessarily must be reliable.

Difference from reproducibility


General model

In practice, testing measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of two sorts of factors: [7]

1. Consistency factors: stable characteristics of the individual or the attribute that one is trying to measure.

2. Inconsistency factors: features of the individual or the situation that can affect test scores but have nothing to do with the attribute being measured.

These factors include: [7]

The goal of estimating reliability is to determine how much of the variability in test scores is due to measurement errors and how much is due to variability in true scores. [7]

A true score is the replicable feature of the concept being measured. It is the part of the observed score that would recur across different measurement occasions in the absence of error.

Errors of measurement are composed of both random error and systematic error. They represent the discrepancies between scores obtained on tests and the corresponding true scores.

This conceptual breakdown is typically represented by the simple equation:

Observed test score = true score + errors of measurement
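This additive decomposition, together with the assumption (introduced below) that errors are uncorrelated with true scores, can be sketched in a short simulation. The score distributions here are invented purely for illustration:

```python
import random
import statistics

random.seed(0)

# Hypothetical true scores and zero-mean random errors (illustrative values).
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
errors = [random.gauss(0, 5) for _ in range(10_000)]

# Observed test score = true score + error of measurement.
observed = [t + e for t, e in zip(true_scores, errors)]

# With errors uncorrelated with true scores, observed-score variance is
# approximately the sum of true-score variance and error variance.
var_true = statistics.pvariance(true_scores)
var_error = statistics.pvariance(errors)
var_observed = statistics.pvariance(observed)
print(var_true, var_error, var_observed)
```

The printed observed-score variance comes out close to the sum of the other two, anticipating the variance decomposition discussed under classical test theory below.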

Classical test theory

The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized.

The central assumption of reliability theory is that measurement errors are essentially random. This does not mean that errors arise from random processes. For any individual, an error in measurement is not a completely random event. However, across a large number of individuals, the causes of measurement error are assumed to be so varied that measurement errors act as random variables. [7]

If errors have the essential characteristics of random variables, then it is reasonable to assume that errors are equally likely to be positive or negative, and that they are not correlated with true scores or with errors on other tests.

It is assumed that: [8]

1. Mean error of measurement = 0

2. True scores and errors are uncorrelated

3. Errors on different measures are uncorrelated

Reliability theory shows that the variance of obtained scores is simply the sum of the variance of true scores plus the variance of errors of measurement. [7]

This equation suggests that test scores vary as the result of two factors:

1. Variability in true scores

2. Variability due to errors of measurement.

The reliability coefficient provides an index of the relative influence of true and error scores on attained test scores. In its general form, the reliability coefficient is defined as the ratio of true score variance to the total variance of test scores, or, equivalently, one minus the ratio of error score variance to observed score variance:

Reliability = true score variance / observed score variance = 1 − (error score variance / observed score variance)
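As a numeric sketch of this definition (the variance components below are invented; in practice the true-score variance is not directly observable):

```python
# Hypothetical variance components (illustrative numbers only).
var_true = 225.0    # variance of true scores
var_error = 25.0    # variance of measurement errors
var_observed = var_true + var_error  # errors assumed uncorrelated with true scores

# Reliability coefficient: ratio of true-score variance to observed-score variance.
reliability = var_true / var_observed
# Equivalent form: one minus the error-to-observed variance ratio.
reliability_alt = 1 - var_error / var_observed
print(reliability)  # 0.9
```

A coefficient of 0.9 would mean that 90% of the observed-score variance reflects true-score differences and 10% reflects measurement error.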

Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test.

Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method approaches the problem of identifying the source of error in the test somewhat differently.

Item response theory

It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better for test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function: the more information a test provides at a given trait level, the smaller the conditional standard error of measurement at that level.
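In standard IRT notation (not given in the source), the conditional standard error of measurement at trait level θ is the reciprocal square root of the test information function, which is the sum of the item information functions:

```latex
\mathrm{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}}, \qquad I(\theta) = \sum_{i} I_i(\theta)
```

Measurement precision is therefore highest at trait levels where the test information function peaks, typically in the middle of the trait range.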

Estimation

The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores.

Four practical strategies have been developed that provide workable methods of estimating test reliability. [7]

1. Test-retest reliability method: directly assesses the degree to which test scores are consistent from one test administration to the next.

It involves administering a test to a group of individuals, re-administering the same test to the same group at some later time, and correlating the first set of scores with the second.

The correlation between scores on the first test and the scores on the retest is used to estimate the reliability of the test using the Pearson product-moment correlation coefficient: see also item-total correlation.

2. Parallel-forms method:

The key to this method is the development of alternate test forms that are equivalent in terms of content, response processes and statistical characteristics. For example, alternate forms exist for several tests of general intelligence, and these tests are generally seen as equivalent. [7]

With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a person's true score on form A would be identical to their true score on form B. If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only. [7]

It involves administering one form of the test to a group of individuals, administering the alternate form to the same group, and correlating the two sets of scores.

The correlation between scores on the two alternate forms is used to estimate the reliability of the test.

This method provides a partial solution to many of the problems inherent in the test-retest reliability method. For example, since the two forms of the test are different, the carryover effect is less of a problem. Reactivity effects are also partially controlled: although taking the first test may change responses to the second test, it is reasonable to assume that the effect will not be as strong with alternate forms of the test as with two administrations of the same test. [7]

However, this technique has its disadvantages:

3. Split-half method:

This method treats the two halves of a measure as alternate forms. It provides a simple solution to the problem that the parallel-forms method faces: the difficulty in developing alternate forms. [7]

It involves administering a test to a group of individuals, splitting the test in half, and correlating scores on one half of the test with scores on the other half.

The correlation between these two split halves is used in estimating the reliability of the test. This half-test reliability estimate is then stepped up to the full test length using the Spearman–Brown prediction formula.

There are several ways of splitting a test to estimate reliability. For example, a 40-item vocabulary test could be split into two subtests, the first one made up of items 1 through 20 and the second made up of items 21 through 40. However, the responses from the first half may be systematically different from responses in the second half due to an increase in item difficulty and fatigue. [7]

In splitting a test, the two halves would need to be as similar as possible, both in terms of their content and in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which the odd-numbered items form one half of the test and the even-numbered items form the other. This arrangement guarantees that each half will contain an equal number of items from the beginning, middle, and end of the original test. [7]

4. Internal consistency: assesses the consistency of results across items within a test. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients. [9] Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder–Richardson Formula 20. [9] Although it is the most commonly used estimate, there are some misconceptions regarding Cronbach's alpha. [10] [11]
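As a sketch, Cronbach's alpha can be computed directly from its variance form, alpha = k/(k−1) × (1 − Σ item variances / total-score variance). The rating data below are invented for illustration:

```python
import statistics

# Hypothetical responses of five respondents to a four-item scale.
items = [
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
]
k = len(items[0])                      # number of items
totals = [sum(row) for row in items]   # each respondent's total score

# Sample variance of each item across respondents, and of the totals.
item_vars = [statistics.variance(col) for col in zip(*items)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / statistics.variance(totals))
print(round(alpha, 3))
```

When the items covary strongly, as here, the total-score variance greatly exceeds the sum of the item variances and alpha approaches 1.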

These measures of reliability differ in their sensitivity to different sources of error and so need not be equal. Also, reliability is a property of the scores of a measure rather than the measure itself, and is thus said to be sample-dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variation) if the second sample is drawn from a different population, because the true variability is different in this second population. (This is true of measures of all types—yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure, [9] and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of item difficulties and item discrimination indices, the latter index involving computation of correlations between each item and the sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.

See also

References


  1. William M.K. Trochim, Reliability
  2. National Council on Measurement in Education. http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Glossary1.aspx?hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorR
  3. Carlson, Neil R., et al. (2009). Psychology: The Science of Behaviour (4th Canadian ed.). Toronto: Pearson. ISBN 978-0-205-64524-4.
  4. The Marketing Accountability Standards Board (MASB) endorses this definition as part of its ongoing Common Language: Marketing Activities and Metrics Project. Archived 12 February 2013 at the Wayback Machine.
  5. Durand, V. Mark (2015). Essentials of Abnormal Psychology. Cengage Learning. ISBN 978-1305633681. OCLC 884617637.
  6. Types of Reliability. The Research Methods Knowledge Base. Last revised: 20 October 2006.
  7. Murphy, Kevin R.; Davidshofer, Charles O. (2005). Psychological Testing: Principles and Applications (6th ed.). Upper Saddle River, N.J.: Pearson/Prentice Hall. ISBN 0-13-189172-3.
  8. Gulliksen, Harold (1987). Theory of Mental Tests. Hillsdale, N.J.: L. Erlbaum Associates. ISBN 978-0-8058-0024-1.
  9. Cortina, J.M. (1993). "What Is Coefficient Alpha? An Examination of Theory and Applications". Journal of Applied Psychology, 78(1), 98–104.
  10. Ritter, N. (2010). "Understanding a widely misunderstood statistic: Cronbach's alpha". Paper presented at Southwestern Educational Research Association (SERA) Conference 2010, New Orleans, LA (ED526237).
  11. Eisinga, R.; Te Grotenhuis, M.; Pelzer, B. (2012). "The reliability of a two-item scale: Pearson, Cronbach or Spearman-Brown?" (PDF). International Journal of Public Health, 58(4): 637–642. doi:10.1007/s00038-012-0416-3. PMID 23089674.