Scale (social sciences)

Last updated April 26, 2024

In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

Scale construction decisions

What level (level of measurement) of data is involved (nominal, ordinal, interval, or ratio)?
What will the results be used for?
What should be used - a scale, index, or typology?
What types of statistical analysis would be useful?
Should a comparative or a noncomparative scale be used?
How many scale divisions or categories should be used (1 to 10; 1 to 7; −3 to +3)?
Should there be an odd or even number of divisions? (Odd gives neutral center value; even forces respondents to take a non-neutral position.)
What should the nature and descriptiveness of the scale labels be?
What should the physical form or layout of the scale be? (graphic, simple linear, vertical, horizontal)
Should a response be forced or be left optional?

Multi-Item and Single-Item Scales

In most practical situations, multi-item scales predict outcomes more effectively than single items. Single-item measures should be used cautiously and only in specific circumstances.^[2]^[3]

Criterion	Multi-item scale	Single-item scale
Construct concreteness	Abstract	Concrete
Construct dimensionality/complexity	Multidimensional, moderately complex	Unidimensional or extremely complex
Semantic redundancy	Low	High
Primary role of construct	Dependent or independent variable	Moderator or control variable
Desired precision	High	Low
Monitoring changes	Appropriate	Problematic
Sampled population	Homogeneous	Diverse
Sample size	Large	Limited

Table: Criteria for Assessing the Potential Use of Single-Item Measures^[3]

Data types

The type of information collected can influence scale construction. Different types of information are measured in different ways.

Some data are measured at the nominal level. That is, any numbers used are mere labels; they express no mathematical properties. Examples are SKU inventory codes and UPC bar codes.
Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. An example is a preference ranking.
Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. Examples are attitude scales and opinion scales.
Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume, and market share.
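
As a rough illustration of how the level of measurement constrains analysis, the sketch below picks a summary statistic by level. The mapping (mode for nominal, median for ordinal, mean for interval and ratio) follows a conventional reading of the four levels rather than anything stated above, and the function name `typical_value` and the sample data are hypothetical.

```python
from statistics import mean, median_low, mode


def typical_value(values, level):
    """Return a summary statistic suited to the stated level of measurement."""
    if level == "nominal":
        # Labels only: the most frequent category is the only sensible summary.
        return mode(values)
    if level == "ordinal":
        # Order is meaningful but distances are not; median_low always
        # returns a value that was actually observed.
        return median_low(values)
    if level in ("interval", "ratio"):
        # Differences are meaningful, so an arithmetic mean is defensible.
        return mean(values)
    raise ValueError(f"unknown level of measurement: {level}")


print(typical_value(["A17", "B02", "A17"], "nominal"))   # product codes
print(typical_value([1, 2, 2, 5], "ordinal"))            # preference ranks
print(typical_value([19.99, 24.99, 29.99], "ratio"))     # prices
```

Treating attitude ratings as interval data, as the list above does, is itself a modelling choice; a stricter analyst might pass "ordinal" for the same responses.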

Composite measures

Composite measures of variables are created by combining two or more separate empirical indicators into a single measure. Composite measures capture complex concepts more adequately than single indicators, extend the range of scores available, and handle multiple items more efficiently.

In addition to scales, there are two other types of composite measures. Indexes are similar to scales except multiple indicators of a variable are combined into a single measure. The index of consumer confidence, for example, is a combination of several measures of consumer attitudes. A typology is similar to an index except the variable is measured at the nominal level.

Indexes are constructed by accumulating scores assigned to individual attributes, while scales are constructed through the assignment of scores to patterns of attributes.

While indexes and scales provide measures of a single dimension, typologies are often employed to examine the intersection of two or more dimensions. Typologies are very useful analytical tools and can be easily used as independent variables, although since they are not unidimensional it is difficult to use them as a dependent variable.
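
The difference between an index and a typology can be sketched in a few lines. This is only an illustration under simple assumptions: the index is an unweighted sum of indicator scores, and the typology crosses two dichotomous dimensions. The function names and the "high"/"low" labels are hypothetical.

```python
def index_score(indicator_scores):
    """Index: accumulate the scores assigned to individual indicators."""
    return sum(indicator_scores)


def typology_cell(high_on_first, high_on_second):
    """Typology: a nominal category at the intersection of two dimensions."""
    first = "high" if high_on_first else "low"
    second = "high" if high_on_second else "low"
    return f"{first}/{second}"


# Three attitude indicators combined into one ordered score.
print(index_score([3, 4, 2]))

# Two dimensions crossed into one of four unordered categories.
print(typology_cell(True, False))
```

The index yields a number that can be ranked, while the typology yields a label with no inherent order, which is why the latter is easier to use as an independent variable than as a dependent one.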

Comparative and noncomparative scaling

With comparative scaling, the items are directly compared with each other (example: Does one prefer Pepsi or Coke?). In noncomparative scaling each item is scaled independently of the others. (Example: How does one feel about Coke?)

Comparative scaling techniques

Pairwise comparison scale – a respondent is presented with two items at a time and asked to select one (example: Does one prefer Pepsi or Coke?). This is an ordinal level technique when a measurement model is not applied. Krus and Kennedy (1977) elaborated paired comparison scaling within their domain-referenced model. The Bradley–Terry–Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959) can be applied to derive measurements, provided the data from the paired comparisons possess an appropriate structure. Thurstone's law of comparative judgment can also be applied in such contexts.
Rasch model scaling – respondents interact with items and comparisons are inferred between items from the responses to obtain scale values. Respondents are subsequently also scaled based on their responses to items given the item scale values. The Rasch model has a close relation to the BTL model.
Rank-ordering – a respondent is presented with several items simultaneously and asked to rank them (example: Rank the following advertisements from 1 to 10.). This is an ordinal level technique.
Bogardus social distance scale – measures the degree to which a person is willing to associate with a class or type of people. It asks how willing the respondent is to make various associations. The results are reduced to a single score on a scale. There are also non-comparative versions of this scale.
Q-Sort – Up to 140 items are sorted into groups based on a rank-order procedure.
Guttman scale – This is a procedure to determine whether a set of items can be rank-ordered on a unidimensional scale. It utilizes the intensity structure among several indicators of a given variable. Statements are listed in order of importance. The rating is scaled by summing all responses until the first negative response in the list. The Guttman scale is related to Rasch measurement; specifically, Rasch models bring the Guttman approach within a probabilistic framework.
Constant sum scale – a respondent is given a constant sum of money, scrip, credits, or points and asked to allocate these to various items (example: If one had 100 Yen to spend on food products, how much would one spend on product A, on product B, on product C, etc.). This is an ordinal level technique.
Magnitude estimation scale – In a psychophysics procedure invented by S. S. Stevens, people simply assign numbers to the dimension of judgment. The geometric mean of those numbers usually produces a power law with a characteristic exponent. In cross-modality matching, instead of assigning numbers, people manipulate another dimension, such as loudness or brightness, to match the items. Typically the exponent of the psychometric function can be predicted from the magnitude estimation exponents of each dimension.
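
To make the pairwise comparison entry more concrete, the following is a minimal sketch of fitting Bradley–Terry–Luce strengths to a table of paired-comparison counts. The text above does not prescribe an estimation method; this sketch uses a standard fixed-point iteration for the maximum-likelihood estimate. The function name `fit_bradley_terry`, the iteration count, and the sample counts are assumptions, and the sketch assumes every item has been preferred at least once.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate BTL strengths; wins[i][j] counts how often i beat j."""
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = 0.0
            for j in range(n):
                comparisons = wins[i][j] + wins[j][i]
                if j != i and comparisons > 0:
                    denom += comparisons / (strengths[i] + strengths[j])
            # Keep the old value if item i was never compared with anything.
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        total = sum(updated)
        # Strengths are only identified up to scale, so normalise to sum to 1.
        strengths = [s / total for s in updated]
    return strengths


# Hypothetical counts: each pair of three items was compared 10 times.
counts = [
    [0, 6, 8],
    [4, 0, 7],
    [2, 3, 0],
]
for item, strength in zip(["A", "B", "C"], fit_bradley_terry(counts)):
    print(item, round(strength, 3))
```

Under the model, the probability that item i is preferred over item j is strength_i / (strength_i + strength_j), so the fitted values lie on a ratio-type continuum rather than giving only a rank order.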

Noncomparative scaling techniques

Visual analogue scale (also called the continuous rating scale and the graphic rating scale) – respondents rate items by placing a mark on a line. The line is usually labeled at each end. There is sometimes a series of numbers, called scale points (say, from zero to 100), under the line. Scoring and codification are difficult for paper-and-pencil scales, but not for computerized and Internet-based visual analogue scales.^[4]
Likert scale – Respondents are asked to indicate their degree of agreement or disagreement (from strongly agree to strongly disagree) on a five- to nine-point response scale. This response format should not be confused with the Likert scale itself: the same format is used for multiple questions, and it is the combination of these questions that forms the Likert scale. This categorical scaling procedure can easily be extended to a magnitude estimation procedure that uses the full scale of numbers rather than verbal categories.
Phrase completion scales – Respondents are asked to complete a phrase on an 11-point response scale in which 0 represents the absence of the theoretical construct and 10 represents the theorized maximum amount of the construct being measured. The same basic format is used for multiple questions.
Semantic differential scale – Respondents are asked to rate on a 7-point scale an item on various attributes. Each attribute requires a scale with bipolar terminal labels.
Stapel scale – This is a unipolar ten-point rating scale. It ranges from +5 to −5 and has no neutral zero point.
Thurstone scale – This is a scaling technique that incorporates the intensity structure among indicators.
Mathematically derived scale – Researchers infer respondents' evaluations mathematically. Two examples are multidimensional scaling and conjoint analysis.
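
As an illustration of how several items with the same response format combine into a Likert scale score, here is a minimal sketch. It assumes responses are coded 1 to `points` and that negatively worded items are reverse-keyed before summing. Reverse keying is a common convention that the text above does not discuss, and the function and parameter names are hypothetical.

```python
def likert_scale_score(responses, reverse_keyed=(), points=5):
    """Sum item responses, flipping items listed in reverse_keyed."""
    total = 0
    for position, response in enumerate(responses):
        if not 1 <= response <= points:
            raise ValueError(f"response {response} is outside 1..{points}")
        if position in reverse_keyed:
            # On a 5-point format, 1 becomes 5, 2 becomes 4, and so on.
            total += points + 1 - response
        else:
            total += response
    return total


# Three items; the second (position 1) is negatively worded.
print(likert_scale_score([5, 1, 4], reverse_keyed={1}))
```

The single-item responses are ordinal categories; it is the summed score across items that is usually treated as the scale value.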

Scale evaluation

Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population, given the scale that has been selected. Reliability is the extent to which a scale will produce consistent results. Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Alternative forms reliability checks how similar the results are if the research is repeated using different forms of the scale. Internal consistency reliability checks how well the individual measures included in the scale are converted into a composite measure.
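
Internal consistency is often summarized with a single coefficient. The sketch below computes Cronbach's alpha, one widely used choice that the paragraph above does not itself name, as k/(k−1) × (1 − sum of item variances / variance of total scores). The function name and the sample responses are hypothetical, and the sketch assumes complete data with at least two items and two respondents.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows[r][i] is respondent r's score on item i."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from four people to three items.
responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
]
print(round(cronbach_alpha(responses), 3))
```

Values closer to 1 indicate that the items vary together; the threshold that counts as acceptable depends on the purpose of the scale.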

Scales and indexes have to be validated. Internal validation checks the relation between the individual measures included in the scale and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable, indicators not included in the scale. Content validation (also called face validity) checks how well the scale measures what it is supposed to measure. Criterion validation checks how meaningful the scale criteria are relative to other possible criteria. Construct validation checks what underlying construct is being measured. There are three variants of construct validity: convergent validity, discriminant validity, and nomological validity (Campbell and Fiske, 1959; Krus and Ney, 1978). The coefficient of reproducibility indicates how well the data from the individual measures included in the scale can be reconstructed from the composite scale.
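
The coefficient of reproducibility can be sketched for dichotomous, Guttman-type items. The version below uses one common error-counting convention, which is an assumption here since several conventions exist: each respondent's observed pattern is compared with the ideal pattern implied by their total score, and the coefficient is 1 minus the share of mismatched responses. It also assumes the item columns are already ordered from most to least frequently endorsed. The function name and data are hypothetical.

```python
def coefficient_of_reproducibility(responses):
    """responses[r][i] is 1 if respondent r endorsed item i, else 0."""
    errors = 0
    total = 0
    for row in responses:
        score = sum(row)
        # A perfect scale type endorses exactly the first `score` items.
        ideal = [1] * score + [0] * (len(row) - score)
        errors += sum(observed != expected
                      for observed, expected in zip(row, ideal))
        total += len(row)
    return 1 - errors / total


# Hypothetical answers from four respondents to three ordered items.
answers = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],  # deviates from the ideal pattern [1, 1, 0]
    [0, 0, 0],
]
print(round(coefficient_of_reproducibility(answers), 3))
```

A value of 1 means every response pattern can be rebuilt exactly from the composite score alone.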

References

[1] Babbie, Earl (1 January 2012). The Practice of Social Research. Cengage Learning. p. 162. ISBN 978-1-133-04979-1.
[2] Diamantopoulos, Adamantios; Sarstedt, Marko; Fuchs, Christoph (2012). "Guidelines for choosing between multi-item and single-item scales for construct measurement: a predictive validity perspective". Journal of the Academy of Marketing Science. 40. doi:10.1007/s11747-011-0300-3. hdl:1959.13/1052296.
[3] Fuchs, Christoph; Diamantopoulos, Adamantios (2009). "Using single-item measures for construct measurement in management research: Conceptual issues and application guidelines" (PDF). Die Betriebswirtschaft. 69 (2).
[4] Reips, U.-D.; Funke, F. (2008). "Interval level measurement with visual analogue scales in Internet-based research: VAS Generator". doi:10.3758/BRM.40.3.699.

McDonald, Roderick P. (2013-06-17). Test Theory: A Unified Treatment. Psychology Press. ISBN 978-1-135-67531-8.

External links

Handbook of Management Scales – Multi-item metrics to be used in research, Wikibooks

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
