Rating scale

Last updated
Concerning rating scales as systems of educational marks, see more articles about education in different countries (named "Education in ..."), for example, Education in Ukraine.
Concerning rating scales used in the practice of medicine, see articles about diagnoses, for example, Major depressive disorder.

A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert response scale and 1-10 rating scales in which a person selects the number that is considered to reflect the perceived quality of a product.

Contents

Background

A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute

Types of rating scales

All rating scales can be classified into one of these types:

  1. Numeric Rating Scale (NRS)
  2. Verbal Rating Scale (VRS)
  3. Visual Analogue Scale (VAS)
  4. Likert
  5. Graphic rating scale
  6. Descriptive graphic rating scale

Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. Attitude and opinion scales are usually ordinal; one example is a Likert response scale:

Statement
e.g. "I could not live without my computer".
Response options
  1. Strongly disagree
  2. Disagree
  3. Neutral
  4. Agree
  5. Strongly agree

Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. A good example is a Fahrenheit/Celsius temperature scale where the differences between numbers matter, but placement of zero does not.

Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume and market share.

More than one rating scale question is required to measure an attitude or perception due to the requirement for statistical comparisons between the categories in the polytomous Rasch model for ordered categories. [1] In terms of Classical test theory, more than one question is required to obtain an index of internal reliability such as Cronbach's alpha, [2] which is a basic criterion for assessing the effectiveness of a rating scale and, more generally, a psychometric instrument.

Rating scales used online

Rating scales are used widely online in an attempt to provide indications of consumer opinions of products. Examples of sites which employ ratings scales are IMDb, Epinions.com, Yahoo! Movies, Amazon.com, BoardGameGeek and TV.com which use a rating scale from 0 to 100 in order to obtain "personalised film recommendations".

In almost all cases, online rating scales only allow one rating per user per product, though there are exceptions such as Ratings.net, which allows users to rate products in relation to several qualities. Most online rating facilities also provide few or no qualitative descriptions of the rating categories, although again there are exceptions such as Yahoo! Movies, which labels each of the categories between F and A+ and BoardGameGeek, which provides explicit descriptions of each category from 1 to 10. Often, only the top and bottom category is described, such as on IMDb's online rating facility.

Validity

Validity refers to how well a tool measures what it intends to measure. With each user rating a product only once, for example in a category from 1 to 10, there is no means for evaluating internal reliability using an index such as Cronbach's alpha. It is therefore impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic/or statistical procedures. "A measurement procedure is valid to the degree that if measures what it proposes to measure."

Another fundamental issue is that online ratings usually involve convenience sampling much like television polls, i.e. they represent only the opinions of those inclined to submit ratings.

Validity is concerned with different aspects of the measurement process. Each of these types uses logic, statistical verification or both to determine the degree of validity and has special value under certain conditions. Types of validity include content validity, predictive validity, and construct validity.

Sampling

Sampling errors can lead to results which have a specific bias, or are only relevant to a specific subgroup. Consider this example: suppose that a film only appeals to a specialist audience—90% of them are devotees of this genre, and only 10% are people with a general interest in movies. Assume the film is very popular among the audience that views it, and that only those who feel most strongly about the film are inclined to rate the film online; hence the raters are all drawn from the devotees. This combination may lead to very high ratings of the film, which do not generalize beyond the people who actually see the film (or possibly even beyond those who actually rate it).

Qualitative description

Qualitative description of categories improve the usefulness of a rating scale. For example, if only the points 1-10 are given without description, some people may select 10 rarely, whereas others may select the category often. If, instead, "10" is described as "near flawless", the category is more likely to mean the same thing to different people. This applies to all categories, not just the extreme points.

The above issues are compounded, when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified because in calculating averages, equal intervals are required to represent the same difference between levels of perceived quality. The key issues with aggregate data based on the kinds of rating scales commonly used online are as follow:

More developed methodologies include Choice Modelling or Maximum Difference methods, the latter being related to the Rasch model due to the connection between Thurstone's law of comparative judgement[ clarification needed ] and the Rasch model.

Rating scale reduction

An international collaborative research effort [3] has introduced a data-driven algorithm for a rating scale reduction. It is based on the area under the receiver operating characteristic.

Origins

The historical origins of rating scales were reevaluated following a significant archaeological discovery in Tbilisi, Georgia, in 2010. Excavators unearthed a tablet dating back to the early medieval period, marked with ancient Georgian script. [4] This tablet showcased a series of linear markings, interpreted as an early form of a rating scale. The inscriptions provided insights into medieval methods of quantification and evaluation, suggesting an embryonic version of modern rating scales. This discovery is currently preserved at the National Museum of Georgia. [5]

See also

Related Research Articles

Psychological statistics is application of formulas, theorems, numbers and laws to psychology. Statistical methods for psychology include development and application statistical theory and methods for modeling psychological data. These methods include psychometrics, factor analysis, experimental designs, and Bayesian statistics. The article also discusses journals in the same field.

Psychometrics is a field of study within psychology concerned with the theory and technique of measurement. Psychometrics generally covers specialized fields within psychology and education devoted to testing, measurement, assessment, and related activities. Psychometrics is concerned with the objective measurement of latent constructs that cannot be directly observed. Examples of latent constructs include intelligence, introversion, mental disorders, and educational achievement. The levels of individuals on nonobservable latent variables are inferred through mathematical modeling based on what is observed from individuals' responses to items on tests and scales.

In the social sciences, scaling is the process of measuring or ordering entities with respect to quantitative attributes or traits. For example, a scaling technique might involve estimating individuals' levels of extraversion, or the perceived quality of products. Certain methods of scaling permit estimation of magnitudes on a continuum, while other methods provide only for relative ordering of the entities.

Quantitative marketing research is the application of quantitative research techniques to the field of marketing research. It has roots in both the positivist view of the world, and the modern marketing viewpoint that marketing is an interactive process in which both the buyer and seller reach a satisfying agreement on the "four Ps" of marketing: Product, Price, Place (location) and Promotion.

Cronbach's alpha, also known as tau-equivalent reliability or coefficient alpha, is a reliability coefficient and a measure of the internal consistency of tests and measures.

In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.

<span class="mw-page-title-main">Likert scale</span> Psychometric measurement scale

A Likert scale is a psychometric scale named after its inventor, American social psychologist Rensis Likert, which is commonly used in research questionnaires. It is the most widely used approach to scaling responses in survey research, such that the term is often used interchangeably with rating scale, although there are other types of rating scales.

In statistics and research, internal consistency is typically a measure based on the correlations between different items on the same test. It measures whether several items that propose to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test.

Level of measurement or scale of measure is a classification that describes the nature of information within the values assigned to variables. Psychologist Stanley Smith Stevens developed the best-known classification with four levels, or scales, of measurement: nominal, ordinal, interval, and ratio. This framework of distinguishing levels of measurement originated in psychology and has since had a complex history, being adopted and extended in some disciplines and by some scholars, and criticized or rejected by others. Other classifications include those by Mosteller and Tukey, and by Chrisman.

In the analysis of multivariate observations designed to assess subjects with respect to an attribute, a Guttman scale is a single (unidimensional) ordinal scale for the assessment of the attribute, from which the original observations may be reproduced. The discovery of a Guttman scale in data depends on their multivariate distribution's conforming to a particular structure. Hence, a Guttman scale is a hypothesis about the structure of the data, formulated with respect to a specified attribute and a specified population and cannot be constructed for any given set of observations. Contrary to a widespread belief, a Guttman scale is not limited to dichotomous variables and does not necessarily determine an order among the variables. But if variables are all dichotomous, the variables are indeed ordered by their sensitivity in recording the assessed attribute, as illustrated by Example 1.

Mean opinion score (MOS) is a measure used in the domain of Quality of Experience and telecommunications engineering, representing overall quality of a stimulus or system. It is the arithmetic mean over all individual "values on a predefined scale that a subject assigns to his opinion of the performance of a system quality". Such ratings are usually gathered in a subjective quality evaluation test, but they can also be algorithmically estimated.

The polytomous Rasch model is generalization of the dichotomous Rasch model. It is a measurement model that has potential application in any context in which the objective is to measure a trait or ability through a process in which responses to items are scored with successive integers. For example, the model is applicable to the use of Likert scales, rating scales, and to educational assessment items for which successively higher integer scores are intended to indicate increasing levels of competence or attainment.

A self-report study is a type of survey, questionnaire, or poll in which respondents read the question and select a response by themselves without any outside interference. A self-report is any method which involves asking a participant about their feelings, attitudes, beliefs and so on. Examples of self-reports are questionnaires and interviews; self-reports are often used as a way of gaining participants' responses in observational studies and experiments.

In statistics, scale analysis is a set of methods to analyze survey data, in which responses to questions are combined to measure a latent variable. These items can be dichotomous or polytomous. Any measurement for such data is required to be reliable, valid, and homogeneous with comparable results over different studies.

In statistics, inter-rater reliability is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.

Generalizability theory, or G theory, is a statistical framework for conceptualizing, investigating, and designing reliable observations. It is used to determine the reliability of measurements under specific conditions. It is particularly useful for assessing the reliability of performance assessments. It was originally introduced in Cronbach, L.J., Rajaratnam, N., & Gleser, G.C. (1963).

Psychometric software is software that is used for psychometric analysis of data from tests, questionnaires, or inventories reflecting latent psychoeducational variables. While some psychometric analyses can be performed with standard statistical software like SPSS, most analyses require specialized tools.

Krippendorff's alpha coefficient, named after academic Klaus Krippendorff, is a statistical measure of the agreement achieved when coding a set of units of analysis. Since the 1970s, alpha has been used in content analysis where textual units are categorized by trained readers, in counseling and survey research where experts code open-ended interview data into analyzable terms, in psychological testing where alternative tests of the same phenomena need to be compared, or in observational studies where unstructured happenings are recorded for subsequent analysis.

The Questionnaire For User Interaction Satisfaction (QUIS) is a tool developed to assess users' subjective satisfaction with specific aspects of the human-computer interface. It was developed in 1987 by a multi-disciplinary team of researchers at the University of Maryland Human–Computer Interaction Lab. The QUIS is currently at Version 7.0 with demographic questionnaire, a measure of overall system satisfaction along 6 scales, and measures of 9 specific interface factors. These 9 factors are: screen factors, terminology and system feedback, learning factors, system capabilities, technical manuals, on-line tutorials, multimedia, teleconferencing, and software installation. Currently available in: German, Italian, Portuguese, and Spanish.

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known. These data exist on an ordinal scale, one of four levels of measurement described by S. S. Stevens in 1946. The ordinal scale is distinguished from the nominal scale by having a ranking. It also differs from the interval scale and ratio scale by not having category widths that represent equal increments of the underlying attribute.

References

  1. Andrich, David (December 1978). "A rating formulation for ordered response categories". Psychometrika. 43 (4): 561–573. doi:10.1007/BF02293814. S2CID   120687848.
  2. Cronbach, Lee J. (September 1951). "Coefficient alpha and the internal structure of tests". Psychometrika. 16 (3): 297–334. CiteSeerX   10.1.1.452.6417 . doi:10.1007/BF02310555. S2CID   13820448.
  3. Koczkodaj, Waldemar W; Kakiashvili, T.; Szymańska, A.; Montero-Marin, J.; Araya, R.; Garcia-Campayo, J.; Rutkowski, K.; Strzałka, D. (2017). "How to reduce the number of rating scale items without predictability loss?". Scientometrics. 111 (2): 581–593(2017). doi: 10.1007/s11192-017-2283-4 . PMC   5400800 . PMID   28490822.
  4. "მსოფლიოში ერთ-ერთი უძველესი კბილის აღმომჩენები შარში ეხვევიან - სად არის ოროზმანელი ადამიანის კბილი?". რადიო თავისუფლება (in Georgian). 2022-09-21. Retrieved 2024-01-17.
  5. ""არ არის აუცილებელი, მთელ საქართველოში ერთდროულად გათხრები ტარდებოდეს" - არქეოლოგები გათხრის უფლებას ვერ იღებენ". რადიო თავისუფლება (in Georgian). 2022-06-21. Retrieved 2024-01-17.