Scale (social sciences)

Last updated December 23, 2024

In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

Scale construction decisions

What level (level of measurement) of data is involved (nominal, ordinal, interval, or ratio)?^[2]

What will the results be used for?
What should be used - a scale, index, or typology?^[3]

What types of statistical analysis would be useful?
Choose to use a comparative scale or a non-comparative scale.^[4]

How many scale divisions or categories should be used (1 to 10; 1 to 7; −3 to +3)?^[5]
Should there be an odd or even number of divisions? (Odd gives neutral center value; even forces respondents to take a non-neutral position.)^[5]
What should the nature and descriptiveness of the scale labels be?
What should the physical form or layout of the scale be? (graphic, simple linear, vertical, horizontal)
Should a response be forced or be left optional?

Scale construction method

Scales constructed should be representative of the construct that it intends to measure.^[6] It is possible that something similar to the scale a person intends to create will already exist, so including those scale(s) and possible dependent variables in one's survey may increase validity of one's scale.

Begin by generating at least ten items to represent each of the sub-scales. Administer the survey; the more representative and larger the sample, the more credibility one will have in the scales.
Review the means and standard deviations for the items, dropping any items with skewed means or very low variance.
Run an exploratory factor analysis with oblique rotation on items for the scales - it is important to differentiate them based on their loading on factors to create sub-scales that represents the construct. Request factors with eigenvalues (for calculating eigenvalue for each factor square the factor loading's and sum down the columns) greater than 1. It is easier to group the items by targeted scales. The more distinct the other items, the better the chances the items will load better in one's own scale.
“Cleanly loaded items” are those items that load at least .40 on one factor and more than .10 greater on that factor than on any others. Identify those in the factor pattern.
“Cross loaded items” are those that do not meet the above criterion. These are candidates to drop.
Identify factors with only a few items that do not represent clear concepts, these are “uninterpretable scales.” Also identify any factors with only one item. These factors and their items are candidates to drop.
Look at the candidates to drop and the factors to be dropped. Is there anything that needs to be retained because it is critical to one's construct. For example, if a conceptually important item only cross loads on a factor to be dropped, it is good to keep it for the next round.
Drop the items, and run a confirmatory factor analysis asking the program to give only the number of factors after dropping the uninterpretable and single-item ones. Go through the process again starting at Step 3. Here various test reliability measures could also be taken.
Keep running through the process until one get “clean factors” (until all factors have cleanly loaded items).
Run the Alpha in the statistical program (asking for the Alpha's if each item is dropped). Any scales with insufficient Alphas should be dropped and the process repeated from Step 3. [Coefficient alpha=number of items² x average correlation between different items/sum of all correlations in the correlation matrix (including the diagonal values)]
Run correlational or regressional statistics to ensure the validity of the scale. For better practices, keep the final factors and all loadings of yours and similar scales selected in the Appendix of the created scale.

Multi-Item and Single-Item Scales

In most practical situations, multi-item scales are more effective in predicting outcomes compared to single items. The use of single-item measures in research is advised cautiously, their use should be limited to specific circumstances. ^[7]^[8]

Criterion	Multi-item scale	Single-item scale
Construct concreteness	Abstract	Concrete
Construct dimensionality/complexity	Multidimensional, moderately complex	Unidimensional or extremely complex
Semantic redundancy	Low	High
Primary role of construct	Dependent or independent variable	Moderator or control variable
Desired precision	High	Low
Monitoring changes	Appropriate	Problematic
Sampled population	Homogenous	Diverse
Sample size	Large	Limited

Table: Criteria for Assessing the Potential Use of Single-Item Measures^[8]

Data types

The type of information collected can influence scale construction. Different types of information are measured in different ways.

Some data are measured at the nominal level . That is, any numbers used are mere labels; they express no mathematical properties. Examples are SKU inventory codes and UPC bar codes.
Some data are measured at the ordinal level . Numbers indicate the relative position of items, but not the magnitude of difference. An example is a preference ranking.
Some data are measured at the interval level . Numbers indicate the magnitude of difference between items, but there is no absolute zero point. Examples are attitude scales and opinion scales.
Some data are measured at the ratio level . Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include: age, income, price, costs, sales revenue, sales volume, and market share.

Composite measures

Composite measures of variables are created by combining two or more separate empirical indicators into a single measure. Composite measures measure complex concepts more adequately than single indicators, extend the range of scores available and are more efficient at handling multiple items.

In addition to scales, there are two other types of composite measures. Indexes are similar to scales except multiple indicators of a variable are combined into a single measure. The index of consumer confidence, for example, is a combination of several measures of consumer attitudes. A typology is similar to an index except the variable is measured at the nominal level.

Indexes are constructed by accumulating scores assigned to individual attributes, while scales are constructed through the assignment of scores to patterns of attributes.

While indexes and scales provide measures of a single dimension, typologies are often employed to examine the intersection of two or more dimensions. Typologies are very useful analytical tools and can be easily used as independent variables, although since they are not unidimensional it is difficult to use them as a dependent variable.

Comparative and non comparative scaling

With comparative scaling, the items are directly compared with each other (example: Does one prefer Pepsi or Coke?). In noncomparative scaling each item is scaled independently of the others. (Example: How does one feel about Coke?)

Comparative scaling techniques

Pairwise comparison scale – a respondent is presented with two items at a time and asked to select one (example : does one prefer Pepsi or Coke?). This is an ordinal level technique when a measurement model is not applied. Krus and Kennedy (1977) elaborated the paired comparison scaling within their domain-referenced model. The Bradley–Terry–Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959) can be applied in order to derive measurements provided the data derived from paired comparisons possess an appropriate structure. Thurstone's Law of comparative judgment can also be applied in such contexts.
Rasch model scaling – respondents interact with items and comparisons are inferred between items from the responses to obtain scale values. Respondents are subsequently also scaled based on their responses to items given the item scale values. The Rasch model has a close relation to the BTL model.
Rank-ordering – a respondent is presented with several items simultaneously and asked to rank them (example : Rate the following advertisements from 1 to 10.). This is an ordinal level technique.
Bogardus social distance scale – measures the degree to which a person is willing to associate with a class or type of people. It asks how willing the respondent is to make various associations. The results are reduced to a single score on a scale. There are also non-comparative versions of this scale.
Q-Sort – Up to 140 items are sorted into groups based on rank-order procedure.
Guttman scale – This is a procedure to determine whether a set of items can be rank-ordered on a unidimensional scale. It utilizes the intensity structure among several indicators of a given variable. Statements are listed in order of importance. The rating is scaled by summing all responses until the first negative response in the list. The Guttman scale is related to Rasch measurement; specifically, Rasch models bring the Guttman approach within a probabilistic framework.
Constant sum scale – a respondent is given a constant sum of money, script, credits, or points and asked to allocate these to various items (example : If one had 100 Yen to spend on food products, how much would one spend on product A, on product B, on product C, etc.). This is an ordinal level technique.
Magnitude estimation scale – In a psychophysics procedure invented by S. S. Stevens people simply assign numbers to the dimension of judgment. The geometric mean of those numbers usually produces a power law with a characteristic exponent. In cross-modality matching instead of assigning numbers, people manipulate another dimension, such as loudness or brightness to match the items. Typically the exponent of the psychometric function can be predicted from the magnitude estimation exponents of each dimension.

Non-comparative scaling techniques

Visual analogue scale (also called the Continuous rating scale and the graphic rating scale) – respondents rate items by placing a mark on a line. The line is usually labeled at each end. There are sometimes a series of numbers, called scale points, (say, from zero to 100) under the line. Scoring and codification is difficult for paper-and-pencil scales, but not for computerized and Internet-based visual analogue scales.^[9]
Likert scale – Respondents are asked to indicate the amount of agreement or disagreement (from strongly agree to strongly disagree) on a five- to nine-point response scale (not to be confused with a Likert scale). The same format is used for multiple questions. It is the combination of these questions that forms the Likert scale. This categorical scaling procedure can easily be extended to a magnitude estimation procedure that uses the full scale of numbers rather than verbal categories.
Phrase completion scales – Respondents are asked to complete a phrase on an 11-point response scale in which 0 represents the absence of the theoretical construct and 10 represents the theorized maximum amount of the construct being measured. The same basic format is used for multiple questions.
Semantic differential scale – Respondents are asked to rate on a 7-point scale an item on various attributes. Each attribute requires a scale with bipolar terminal labels.
Stapel scale – This is a unipolar ten-point rating scale. It ranges from +5 to −5 and has no neutral zero point.
Thurstone scale – This is a scaling technique that incorporates the intensity structure among indicators.
Mathematically derived scale – Researchers infer respondents’ evaluations mathematically. Two examples are multi dimensional scaling and conjoint analysis.

Scale evaluation

Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population, given the scale one have selected. Reliability is the extent to which a scale will produce consistent results. Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Alternative forms reliability checks how similar the results are if the research is repeated using different forms of the scale. Internal consistency reliability checks how well the individual measures included in the scale are converted into a composite measure.

Scales and indexes have to be validated. Internal validation checks the relation between the individual measures included in the scale, and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable, indicators not included in the scale. Content validation (also called face validity) checks how well the scale measures what is supposed to measured. Criterion validation checks how meaningful the scale criteria are relative to other possible criteria. Construct validation checks what underlying construct is being measured. There are three variants of construct validity. They are convergent validity, discriminant validity, and nomological validity (Campbell and Fiske, 1959; Krus and Ney, 1978). The coefficient of reproducibility indicates how well the data from the individual measures included in the scale can be reconstructed from the composite scale.

Related Research Articles

Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.

In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions:

"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 and 1.00, are usually used to indicate the amount of error in the scores."

Validity is the main extent to which a concept, conclusion, or measurement is well-founded and likely corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. The validity of a measurement tool is the degree to which the tool measures what it claims to measure. Validity is based on the strength of a collection of different types of evidence described in greater detail below.

Questionnaire construction refers to the design of a questionnaire to gather statistically useful information about a given topic. When properly constructed and responsibly administered, questionnaires can provide valuable data about any given subject.

Survey methodology is "the study of survey methods". As a field of applied statistics concentrating on human-research surveys, survey methodology studies the sampling of individual units from a population and associated techniques of survey data collection, such as questionnaire construction and methods for improving the number and accuracy of responses to surveys. Survey methodology targets instruments or procedures that ask one or more questions that may or may not be answered.

Quantitative marketing research is the application of quantitative research techniques to the field of marketing research. It has roots in both the positivist view of the world, and the modern marketing viewpoint that marketing is an interactive process in which both the buyer and seller reach a satisfying agreement on the "four Ps" of marketing: Product, Price, Place (location) and Promotion.

A Likert scale is a psychometric scale named after its inventor, American social psychologist Rensis Likert, which is commonly used in research questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

In statistics and research, internal consistency is typically a measure based on the correlations between different items on the same test. It measures whether several items that propose to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test.

Construct validity concerns how well a set of indicators represent or reflect a concept that is not directly measurable. Construct validation is the accumulation of evidence to support the interpretation of what a measure reflects. Modern validity theory defines construct validity as the overarching concern of validity research, subsuming all other types of validity evidence such as content validity and criterion validity.

Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and has since had a complex history, being adopted and extended in some disciplines and by some scholars, and criticized or rejected by others. Other classifications include those by Mosteller and Tukey, and by Chrisman.

A questionnaire is a research instrument that consists of a set of questions for the purpose of gathering information from respondents through survey or statistical study. A research questionnaire is typically a mix of close-ended questions and open-ended questions. Open-ended, long-term questions offer the respondent the ability to elaborate on their thoughts. The Research questionnaire was developed by the Statistical Society of London in 1838.

Personality Assessment Inventory (PAI), developed by Leslie Morey, is a self-report 344-item personality test that assesses a respondent's personality and psychopathology. Each item is a statement about the respondent that the respondent rates with a 4-point scale. It is used in various contexts, including psychotherapy, crisis/evaluation, forensic, personnel selection, pain/medical, and child custody assessment. The test construction strategy for the PAI was primarily deductive and rational. It shows good convergent validity with other personality tests, such as the Minnesota Multiphasic Personality Inventory and the Revised NEO Personality Inventory.

Phrase completion scales are a type of psychometric scale used in questionnaires. Developed in response to the problems associated with Likert scales, phrase completions are concise, unidimensional measures that tap ordinal level data in a manner that approximates interval level data.

A rating scale is a set of categories designed to obtain information about a quantitative or a qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert response scale and 0-10 rating scales, where a person selects the number that reflecting the perceived quality of a product.

In psychology, discriminant validity tests whether concepts or measurements that are not supposed to be related are actually unrelated.

In statistics, confirmatory factor analysis (CFA) is a special form of factor analysis, most commonly used in social science research. It is used to test whether measures of a construct are consistent with a researcher's understanding of the nature of that construct. As such, the objective of confirmatory factor analysis is to test whether the data fit a hypothesized measurement model. This hypothesized model is based on theory and/or previous analytic research. CFA was first developed by Jöreskog (1969) and has built upon and replaced older methods of analyzing construct validity such as the MTMM Matrix as described in Campbell & Fiske (1959).

The multitrait-multimethod (MTMM) matrix is an approach to examining construct validity developed by Campbell and Fiske (1959). It organizes convergent and discriminant validity evidence for comparison of how a measure relates to other measures. The conceptual approach has influenced experimental design and measurement theory in psychology, including applications in structural equation models.

Measurement invariance or measurement equivalence is a statistical property of measurement that indicates that the same construct is being measured across some specified groups. For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different genders or cultural backgrounds. Violations of measurement invariance may preclude meaningful interpretation of measurement data. Tests of measurement invariance are increasingly used in fields such as psychology to supplement evaluation of measurement quality rooted in classical test theory.

In statistics and research design, an index is a composite statistic – a measure of changes in a representative group of individual data points, or in other words, a compound measure that aggregates multiple indicators. Indices – also known as indexes and composite indicators – summarize and rank specific observations.

The Comrey Personality Scales is a personality test developed by Andrew L. Comrey in 1970. The CPT measures eight main scales and two validity scales. The test is currently distributed by Educational and Industrial Testing Service. The test consists of 180 items rated on a seven-point scale.

References

↑ Earl Babbie (1 January 2012). The Practice of Social Research. Cengage Learning. p. 162. ISBN 978-1-133-04979-1.
↑ Bandalos, Deborah L. (2018). Measurement Theory and Applications for the Social Sciences. New York: Guilford Press. pp. 6–8. ISBN 9781462532131.
↑ Bhattacherjee, Anol (2012). Social Science Research: Principles, Methods, and Practices. University of South Florida. pp. 58–63. Retrieved 2024-11-12.
↑ Naresh K. Malhotra, Daniel Nunan, David F. Birks (2017). "Measurement and scaling: fundamentals, comparative and non-comparative scaling". Marketing Research: An Applied Approach (5th ed.). Pearson. ISBN 978-1-292-10312-9.{{cite book}}: CS1 maint: multiple names: authors list (link)
1 2 Bhattacherjee 2012, p. 54.
↑ McDonald, Roderick P. (2013-06-17). Test Theory: A Unified Treatment. Psychology Press. ISBN 978-1-135-67531-8.
↑ Diamantopoulos, Adamantio; Sarstedt, Marko; Fuchs, Christoph (2012). "Guidelines for choosing between multi-item and single-item scales for construct measurement: a predictive validity perspective". Journal of the Academy of Marketing Science. 40 (3): 434–449. doi:10.1007/s11747-011-0300-3. hdl: 1959.13/1052296 .
1 2 Fuchs, Christoph; Diamantopoulos, Adamantios (2009). "Using single-item measures for construct measurement in management research: Conceptual issues and application guidelines" (PDF). Die Betriebswirtschaft. 69 (2).
↑ U.-D. Reips and F. Funke (2008) "Interval level measurement with visual analogue scales in Internet-based research: VAS Generator." doi : 10.3758/BRM.40.3.699

External links

Handbook of Management Scales – Multi-item metrics to be used in research, Wikibooks

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Babbie2012-162-1] Earl Babbie (1 January 2012). The Practice of Social Research. Cengage Learning. p. 162. ISBN 978-1-133-04979-1.

[2] Bandalos, Deborah L. (2018). Measurement Theory and Applications for the Social Sciences. New York: Guilford Press. pp. 6–8. ISBN 9781462532131.

[3] Bhattacherjee, Anol (2012). Social Science Research: Principles, Methods, and Practices. University of South Florida. pp. 58–63. Retrieved 2024-11-12.

[4] Naresh K. Malhotra, Daniel Nunan, David F. Birks (2017). "Measurement and scaling: fundamentals, comparative and non-comparative scaling". Marketing Research: An Applied Approach (5th ed.). Pearson. ISBN 978-1-292-10312-9.{{cite book}}: CS1 maint: multiple names: authors list (link)

[FOOTNOTEBhattacherjee201254-5] 1 2 Bhattacherjee 2012, p. 54.

[6] McDonald, Roderick P. (2013-06-17). Test Theory: A Unified Treatment. Psychology Press. ISBN 978-1-135-67531-8.

[7] Diamantopoulos, Adamantio; Sarstedt, Marko; Fuchs, Christoph (2012). "Guidelines for choosing between multi-item and single-item scales for construct measurement: a predictive validity perspective". Journal of the Academy of Marketing Science. 40 (3): 434–449. doi:10.1007/s11747-011-0300-3. hdl: 1959.13/1052296 .

[:0-8] 1 2 Fuchs, Christoph; Diamantopoulos, Adamantios (2009). "Using single-item measures for construct measurement in management research: Conceptual issues and application guidelines" (PDF). Die Betriebswirtschaft. 69 (2).

[Reips_Funke-9] U.-D. Reips and F. Funke (2008) "Interval level measurement with visual analogue scales in Internet-based research: VAS Generator." doi : 10.3758/BRM.40.3.699

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]