Ceiling effect (statistics)

The "ceiling effect" is one type of scale attenuation effect; [1] the other scale attenuation effect is the "floor effect". The term refers either to the point at which an independent variable no longer has an effect on a dependent variable, or to the level above which variance in an independent variable is no longer measurable. [2] The term is used somewhat differently in two areas: pharmacology and statistics. An example in the first area, a ceiling effect in treatment, is pain relief by some kinds of analgesic drugs, which have no further effect on pain above a particular dosage level (see also: ceiling effect in pharmacology). An example in the second area, a ceiling effect in data-gathering, is a survey that groups all respondents into income categories without distinguishing the incomes of respondents above the highest level measured by the survey instrument. The maximum reportable income level creates a "ceiling" that results in measurement inaccuracy, because the range of the dependent variable does not include the true values above that point. A ceiling effect can occur whenever a measure has a fixed range and a normal distribution predicts multiple scores at or above the maximum value for the dependent variable.
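The last point can be made concrete with a short calculation. The sketch below is illustrative only: it assumes true scores follow a normal distribution, and the mean, standard deviation, and ceiling are hypothetical numbers that do not come from the sources cited here.

```python
from statistics import NormalDist


def expected_share_at_ceiling(mean, sd, ceiling):
    """Share of a normal population whose true value is at or above the ceiling."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(ceiling)


# Hypothetical instrument: true scores ~ N(100, 15), highest reportable score 130.
share = expected_share_at_ceiling(mean=100, sd=15, ceiling=130)
print(f"{share:.1%} of respondents are expected to reach the ceiling")
```

The larger this share, the more of the true variation the instrument cannot report.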

Data-gathering

A ceiling effect in data-gathering, when variance in a dependent variable is not measured or estimated above a certain level, is a commonly encountered practical issue in gathering data in many scientific disciplines. Such an effect is often the result of constraints on data-gathering instruments. When a ceiling effect occurs in data-gathering, there is a bunching of scores at the upper level reported by an instrument. [3]

Response bias constraints

Response bias occurs commonly in research on issues that may have an ethical basis or are generally perceived as having negative connotations. [4] Participants may fail to respond accurately to a measure if they believe the accurate response is viewed negatively. A population survey about lifestyle variables influencing health outcomes might include a question about smoking habits. To guard against the possibility that a respondent who is a heavy smoker might decline to give an accurate response about smoking, the highest level of smoking asked about in the survey instrument might be "two packs a day or more". This results in a ceiling effect in that persons who smoke three packs or more a day are not distinguished from persons who smoke exactly two packs. A population survey about income similarly might have a highest response level of "$100,000 per year or more", rather than including higher income ranges, as respondents might decline to answer at all if the survey questions identify their income too specifically. This too results in a ceiling effect, not distinguishing persons who have an income of $500,000 per year or higher from those whose income is exactly $100,000 per year. Response bias can also cause ceiling effects directly: when survey respondents believe the desirable response to be the maximum reportable value, data points cluster at that value. In the smoking survey, by contrast, it is the attempt to prevent response bias that produces the ceiling effect, through the basic design of the measure.
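The income example can be sketched in a few lines. This is a minimal illustration, not a description of any real survey: the function name `top_code` and the list of incomes are hypothetical, and only the $100,000 cap comes from the example above.

```python
def top_code(value, cap):
    """Record a value as a capped survey item would: anything above the cap is stored as the cap."""
    return min(value, cap)


# Hypothetical true incomes in dollars per year.
incomes = [42_000, 100_000, 250_000, 500_000]

# What a survey whose highest category is "$100,000 per year or more" can report.
recorded = [top_code(income, 100_000) for income in incomes]
print(recorded)  # [42000, 100000, 100000, 100000]
```

Three respondents with very different incomes become indistinguishable in the recorded data.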

Range-of-instrument constraints

The range of data that can be gathered by a particular instrument may be constrained by inherent limits in the instrument's design. Often the design of a particular instrument involves trade-offs between ceiling effects and floor effects. If a dependent variable measured on an ordinal scale does not have response categories that appropriately cover the upper end of the sample's distribution, the maximum-value response will have to include all values above the end of the scale. This results in a ceiling effect, because respondents are grouped into the single maximum category, which prevents an accurate representation of the variation beyond that point. This issue occurs in many types of surveys that use pre-determined bracket-style responses. When many subjects have scores on a variable at the upper limit of what an instrument reports, data analysis provides inaccurate information because some actual variation in the data is not reflected in the scores obtained from that instrument. [5]

A ceiling effect is said to occur when a high proportion of subjects in a study have maximum scores on the observed variable. This makes discrimination among subjects at the top end of the scale impossible. For example, an examination paper may lead to, say, 50% of the students scoring 100%. While such a paper may serve as a useful threshold test, it does not allow ranking of the top performers. For this reason, examination of test results for a possible ceiling effect, and the converse floor effect, is often built into the validation of instruments such as those used for measuring quality of life. [6]
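A simple screening check of this kind might look like the sketch below. The function name, the exam marks, and the 15% flagging threshold are all assumptions made for illustration; the text above does not specify a cut-off.

```python
def ceiling_share(scores, max_score):
    """Fraction of observations at (or above) the maximum reportable score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= max_score) / len(scores)


# Hypothetical exam marks in percent; half of the students score 100%.
exam = [100, 100, 100, 100, 100, 97, 92, 88, 75, 60]

FLAG_ABOVE = 0.15  # assumed cut-off for flagging a possible ceiling effect

share = ceiling_share(exam, max_score=100)
flagged = share > FLAG_ABOVE
print(f"share at maximum: {share:.0%}, flagged: {flagged}")
```

The same function applied to the minimum score (with the comparison reversed) would screen for a floor effect.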

In such a case, the ceiling effect keeps the instrument from noting a measurement or estimate higher than some limit not related to the phenomenon being observed, but rather related to the design of the instrument. A crude example would be measuring the heights of trees with a ruler only 20 meters in length, if it is apparent on the basis of other evidence that there are trees much taller than 20 meters. Using the 20-meter ruler as the sole means of measuring trees would impose a ceiling on gathering data about tree height. Ceiling effects and floor effects both limit the range of data reported by the instrument, reducing variability in the gathered data. Limited variability in the data gathered on one variable may reduce the power of statistics on correlations between that variable and another variable.
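The effect on variability and on correlations can be illustrated with the tree example. The sketch below uses made-up data: the tree heights, the trunk diameters, and the perfectly linear relationship between them are assumptions chosen so that the distortion comes only from the 20-meter ruler.

```python
from math import sqrt
from statistics import pvariance


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


RULER_LENGTH = 20  # meters; the instrument cannot report anything taller

# Hypothetical true heights in meters (5, 7, ..., 43) and trunk diameters
# that are, by assumption, exactly proportional to height.
true_heights = list(range(5, 45, 2))
trunk_diameters = [0.03 * h for h in true_heights]

# Heights as the 20-meter ruler would record them.
measured_heights = [min(h, RULER_LENGTH) for h in true_heights]

r_true = pearson_r(trunk_diameters, true_heights)
r_measured = pearson_r(trunk_diameters, measured_heights)

print(f"variance of true heights:     {pvariance(true_heights):.1f}")
print(f"variance of measured heights: {pvariance(measured_heights):.1f}")
print(f"correlation with true heights:     {r_true:.3f}")
print(f"correlation with measured heights: {r_measured:.3f}")
```

Even though the underlying relationship is perfect, the capped measurements show less variance and a weaker correlation.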

College admission tests

In the various countries that use admission tests as the main element or an important element for determining eligibility for college or university study, the data gathered relates to the differing levels of performance of applicants on the tests. When a college admission test has a maximum possible score that can be attained without perfect performance on the test's item content, the test's scoring scale has a ceiling effect. Moreover, if the test's item content is easy for many test-takers, the test may not reflect actual differences in performance (as would be detected with other instruments) among test-takers at the high end of the test performance range. Mathematics tests used for college admission in the United States and similar tests used for university admission in Britain illustrate both phenomena.

Cognitive psychology

In cognitive psychology, mental processes such as problem solving and memorization are studied experimentally by using operational definitions that allow for clear measurements. A common measurement of interest is the time taken to respond to a given stimulus. In studying this variable, a ceiling may be the lowest possible number (the fewest milliseconds to a response), rather than the highest value, as is the usual interpretation of "ceiling". In response time studies, it may appear that a ceiling has occurred in the measurements because of an apparent clustering around some minimum amount of time (such as the fastest time recorded in an experiment). [7] However, this clustering could actually represent a natural physiological limit of response time, rather than an artifact of the stopwatch sensitivity (which would be a ceiling effect). Further statistical study and scientific judgment can resolve whether the observations are due to a ceiling or reflect a true limit.

Validity of instrument constraints

IQ testing

Some authors on gifted education write about ceiling effects in IQ testing having negative consequences for individuals. Those authors sometimes claim such ceilings produce systematic underestimation of the IQs of intellectually gifted people. In this case, it is necessary to distinguish carefully two different ways the term "ceiling" is used in writings about IQ testing.

IQ scores can differ to some degree for the same individual on different IQ tests (age 12–13 years). (IQ score table data and pupil pseudonyms adapted from description of KABC-II norming study cited in Kaufman 2009. [8] )

Pupil      KABC-II   WISC-III   WJ-III
Asher      90        95         111
Brianna    125       110        105
Colin      100       93         101
Danica     116       127        118
Elpha      93        105        93
Fritz      106       105        105
Georgi     95        100        90
Hector     112       113        103
Imelda     104       96         97
Jose       101       99         86
Keoku      81        78         75
Leo        116       124        102

The ceilings of IQ subtests are imposed by their ranges of progressively more difficult items. An IQ test with a wide range of progressively more difficult questions will have a higher ceiling than one with a narrow range and few difficult items. Ceiling effects result, first, in an inability to distinguish among the gifted (whether moderately gifted, profoundly gifted, etc.), and second, in the erroneous classification of some gifted people as above average, but not gifted.

Suppose that an IQ test has three subtests: vocabulary, arithmetic, and picture analogies. The scores on each of the subtests are normalized (see standard score) and then added together to produce a composite IQ score. Now suppose that Joe obtains the maximum score of 20 on the arithmetic test, but gets 10 out of 20 on the vocabulary and analogies tests. Is it fair to say that Joe's total score of 20+10+10, or 40, represents his total ability? The answer is no, because Joe achieved the maximum possible score of 20 on the arithmetic test. Had the arithmetic test included additional, more difficult items, Joe might have gotten 30 points on that subtest, producing a "true" score of 30+10+10 or 50. Compare Joe's performance with that of Jim, who scored 15+15+15 = 45, without running into any subtest ceilings. In the original formulation of the test, Jim did better than Joe (45 versus 40), whereas it is Joe who actually should have gotten the higher "total" intelligence score than Jim (score of 50 for Joe versus 45 for Jim) using a reformulated test that includes more difficult arithmetic items.
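The arithmetic of the Joe and Jim example can be checked with a short sketch. The helper name `composite` is hypothetical, and Joe's uncapped arithmetic score of 30 is the supposition made in the paragraph above, not a measured value.

```python
SUBTEST_MAX = 20  # ceiling of each subtest in the original test


def composite(subtest_scores, cap=None):
    """Sum subtest scores, capping each one at the subtest ceiling if a cap is given."""
    if cap is None:
        return sum(subtest_scores.values())
    return sum(min(score, cap) for score in subtest_scores.values())


# Scores each test-taker would earn if no subtest had a ceiling.
joe = {"arithmetic": 30, "vocabulary": 10, "analogies": 10}
jim = {"arithmetic": 15, "vocabulary": 15, "analogies": 15}

# Original test: every subtest tops out at 20 points.
print(composite(joe, cap=SUBTEST_MAX), composite(jim, cap=SUBTEST_MAX))  # 40 45

# Reformulated test with harder arithmetic items (no ceiling reached).
print(composite(joe), composite(jim))  # 50 45
```

The ranking of the two test-takers reverses once the arithmetic ceiling is removed.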

Writings on gifted education bring up two reasons for supposing that some IQ scores underestimate the intelligence of gifted test-takers:

  1. gifted test-takers tend to perform better on all subtests than less talented people;
  2. they tend to do much better on some subtests than others, raising the inter-subtest variability and the chance that a ceiling will be encountered.

Statistical analysis

Ceiling effects on measurement compromise scientific truth and understanding through a number of related statistical aberrations.

First, ceilings impair the ability of investigators to determine the central tendency of the data. When a ceiling effect relates to data gathered on a dependent variable, failure to recognize that ceiling effect may "lead to the mistaken conclusion that the independent variable has no effect." [3] Ceilings also compress the scores at the top of the scale, which reduces the observed variance. For mathematical reasons beyond the scope of this article (see analysis of variance), this inhibited variance reduces the sensitivity of scientific experiments designed to determine if the average of one group is significantly different from the average of another group. For example, a treatment given to one group may produce an effect, but the effect may escape detection because the mean of the treated group will not look different enough from the mean of the untreated group.
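The masking of a treatment effect can be illustrated with a small deterministic sketch. All numbers are hypothetical: the control scores, the assumed true treatment effect of 10 points, and the instrument maximum of 100 are chosen only to show the mechanism.

```python
from statistics import mean

MAX_SCORE = 100       # highest score the hypothetical instrument can report
TREATMENT_GAIN = 10   # assumed true effect of the treatment, in points

# Hypothetical true scores of an untreated group that is already near the top.
control_true = [90, 94, 96, 98, 100]
treated_true = [score + TREATMENT_GAIN for score in control_true]


def observe(scores, ceiling=MAX_SCORE):
    """Scores as the instrument reports them: nothing above the ceiling is recorded."""
    return [min(score, ceiling) for score in scores]


true_effect = mean(treated_true) - mean(control_true)
observed_effect = mean(observe(treated_true)) - mean(observe(control_true))

print(f"true difference in means:     {true_effect:.1f}")
print(f"observed difference in means: {observed_effect:.1f}")
```

Under these assumptions the instrument reports less than half of the true difference between the groups, because every treated score is recorded as the maximum.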

Thus "ceiling effects are a complex of matters and their avoidance a matter of careful evaluation of a range of issues." [3]

Prevention

Because ceiling effects prevent accurate interpretation of data, researchers should try either to prevent them or to use their presence as a signal to adjust the instrument and procedures that were used. Several methods are available. The first is to choose a previously validated measure by reviewing past research. If no validated measure exists, pilot testing may be conducted using the proposed methods. Pilot testing, or conducting a pilot experiment, involves a small-scale trial of instruments and procedures before the actual experiment, which allows researchers to recognize adjustments needed for efficient and accurate data collection. If researchers are using a design that has not been validated, a combination of surveys, pairing the originally proposed measure with another supported by past literature, may be used to assess for the presence of ceiling effects. [9] If any research, especially the pilot study, shows a ceiling effect, efforts should be made to adjust the instrument so that the effect is mitigated and informative research can be conducted. [2]

Notes

  1. "Scale Attenuation Effect - SAGE Research Methods". methods.sagepub.com. Retrieved 22 October 2020.
  2. "Ceiling Effect". Encyclopedia of Research Design. Thousand Oaks, California: SAGE Publications, Inc. 2010. doi:10.4135/9781412961288.n44. ISBN 9781412961271.
  3. Cramer 2005, p. 21
  4. Randall, D.M.; Fernandes, M.F. (1991). "The social desirability response bias in ethics research". Journal of Business Ethics. 10 (11): 805–817. doi:10.1007/BF00383696. S2CID 189901264.
  5. Vogt 2005, p. 40
  6. Po 1998, p. 20
  7. Dykiert, Dominika; Der, Geoff; Starr, John M.; Deary, Ian J. (11 October 2012). "Age Differences in Intra-Individual Variability in Simple and Choice Reaction Time: Systematic Review and Meta-Analysis". PLOS One. 7 (10): e45759. Bibcode:2012PLoSO...745759D. doi:10.1371/journal.pone.0045759. PMC 3469552. PMID 23071524.
  8. Kaufman 2009, pp. 151–153
  9. Privitera, Gregory J. (27 January 2016). Research Methods for the Behavioral Sciences (Second ed.). Los Angeles. ISBN 9781506326573. OCLC 915250239.
