Content validity

In psychometrics, content validity (also known as logical validity) refers to the extent to which a measure represents all facets of a given construct. For example, a depression scale may lack content validity if it only assesses the affective dimension of depression but fails to take into account the behavioral dimension. Determining content validity involves an element of subjectivity, because it requires a degree of agreement about what a particular construct such as extraversion represents; where experts disagree about the nature of a trait, high content validity cannot be established. [1]

Description

Content validity is different from face validity, which refers not to what the test actually measures, but to what it superficially appears to measure. Face validity assesses whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Assessing content validity requires recognized subject matter experts to evaluate whether test items assess defined content, as well as more rigorous statistical tests than the assessment of face validity calls for. Content validity is most often addressed in academic and vocational testing, where test items need to reflect the knowledge actually required for a given topic area (e.g., history) or job skill (e.g., accounting). In clinical settings, content validity refers to the correspondence between test items and the symptom content of a syndrome.

Measurement

One widely used method of measuring content validity was developed by C. H. Lawshe. It is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is. In an article regarding pre-employment testing, Lawshe (1975) [2] proposed that each of the subject matter expert raters (SMEs) on the judging panel respond to the following question for each item: "Is the skill or knowledge measured by this item 'essential,' 'useful, but not essential,' or 'not necessary' to the performance of the job?" According to Lawshe, if more than half the panelists indicate that an item is essential, that item has at least some content validity. Greater levels of content validity exist as larger numbers of panelists agree that a particular item is essential. Using these assumptions, Lawshe developed a formula termed the content validity ratio: CVR = (n_e − N/2) / (N/2), where n_e is the number of SME panelists indicating "essential" and N is the total number of SME panelists. This formula yields values ranging from +1 to −1; positive values indicate that at least half the SMEs rated the item as essential. The mean CVR across items may be used as an indicator of overall test content validity.
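For illustration, the following is a minimal Python sketch of the calculation; the panel size and the per-item counts of "essential" votes are hypothetical.

    from statistics import mean

    def cvr(n_essential: int, n_panelists: int) -> float:
        """Lawshe's content validity ratio: (n_e - N/2) / (N/2)."""
        half = n_panelists / 2
        return (n_essential - half) / half

    # Hypothetical 10-person SME panel; one "essential" vote count per item.
    panel_size = 10
    essential_counts = [9, 7, 5, 10]

    item_cvrs = [cvr(n, panel_size) for n in essential_counts]
    print(item_cvrs)        # [0.8, 0.4, 0.0, 1.0]
    print(mean(item_cvrs))  # mean CVR across items (0.55 here)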

Lawshe (1975) provided a table of critical values for the CVR by which a test evaluator could determine, for a panel of SMEs of a given size, how large a calculated CVR must be to exceed chance expectation. The table was calculated for Lawshe by his friend, Lowell Schipper. Close examination of the published table reveals an anomaly: the critical value increases monotonically from the case of 40 SMEs (minimum value = .29) to the case of 9 SMEs (minimum value = .78), unexpectedly drops at 8 SMEs (minimum value = .75), and then reaches its ceiling at 7 SMEs (minimum value = .99). Applying the formula to a panel of 8 raters, 7 "essential" ratings and 1 other rating yield a CVR of exactly .75, so the tabled value corresponds to 7 of 8 raters. If .75 were not the critical value, restoring the ascending order of the table would require all 8 of 8 raters to rate the item essential, for a CVR of 1.00; but that would itself break the pattern, since a "perfect" value would be demanded of 8 raters yet of no larger or smaller panel. Whether this departure from the otherwise monotonic progression was due to a calculation error on Schipper's part or to a typing or typesetting error is unclear.

Wilson, Pan & Schumsky (2012), seeking to correct the error, found no explanation in Lawshe's writings and no publication by Schipper describing how the table of critical values was computed. Wilson and colleagues determined that Schipper's values were close approximations to the normal approximation to the binomial distribution. By comparing Schipper's values to newly calculated binomial values, they also found that Lawshe and Schipper had erroneously labeled their published table as representing a one-tailed test, when in fact the values mirrored the binomial values for a two-tailed test. Wilson and colleagues published a recalculation of critical values for the content validity ratio, providing critical values in unit steps at multiple alpha levels. [3]

The published table of values is as follows: [2]

No. of panelists   Min. value
 5                 .99
 6                 .99
 7                 .99
 8                 .75
 9                 .78
10                 .62
11                 .59
12                 .56
20                 .42
30                 .33
40                 .29
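As a rough illustration of where such critical values come from, the following Python sketch derives a critical CVR from the exact binomial distribution, assuming that under chance each panelist independently rates an item "essential" with probability one half, and using a one-tailed alpha of .05. As the discussion above indicates, the handling of tails is precisely where the published table is in doubt, so this sketch is not Wilson and colleagues' recalculation, and its output need not match Schipper's table.

    from math import comb

    def binomial_tail(n: int, k: int) -> float:
        """P(X >= k) for X ~ Binomial(n, 0.5): chance 'essential' agreement."""
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    def critical_cvr(n_panelists: int, alpha: float = 0.05) -> float:
        """Smallest CVR whose 'essential' count exceeds chance at level alpha."""
        for n_e in range(n_panelists + 1):
            if binomial_tail(n_panelists, n_e) <= alpha:
                return (2 * n_e - n_panelists) / n_panelists
        return 1.0

    for n in (7, 8, 9, 10, 20, 30, 40):
        print(n, round(critical_cvr(n), 2))

Run as-is, this one-tailed criterion reproduces the .75 (8 SMEs) and .78 (9 SMEs) pair at the heart of the anomaly, but yields noticeably different values at other panel sizes (for example, .80 at 10 SMEs against the published .62), consistent with Wilson and colleagues' finding that the published table does not follow a single consistent criterion.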

References

  1. Pennington, Donald (2003). Essential Personality. Arnold. p. 37. ISBN 0-340-76118-0.
  2. Lawshe, Charles H. (1975). "A Quantitative Approach to Content Validity". Personnel Psychology. 28 (4): 563–575. CiteSeerX 10.1.1.460.9380. doi:10.1111/j.1744-6570.1975.tb01393.x. S2CID 34660500.
  3. Wilson, F. Robert; Pan, Wei; Schumsky, Donald A. (2012). "Recalculation of the Critical Values for Lawshe's Content Validity Ratio". Measurement and Evaluation in Counseling and Development. 45 (3): 197–210. doi:10.1177/0748175612440286. ISSN 0748-1756. S2CID 145201317.