Discrimination testing

Discrimination testing is a technique employed in sensory analysis to determine whether there is a detectable difference between two or more products. The test uses a group of assessors (panellists), with a degree of training appropriate to the complexity of the test, to discriminate one product from another through one of a variety of experimental designs. Though useful, these tests typically do not quantify or describe any differences; a more specifically trained panel, under a different study design, is required to describe the differences and assess their significance.

Statistical basis

The statistical principle behind any discrimination test is to test a null hypothesis (H0) that states there is no detectable difference between two (or more) products. If there is sufficient evidence to reject H0 in favor of the alternative hypothesis, HA: there is a detectable difference, then a difference can be recorded. However, failure to reject H0 should not be taken as sufficient evidence to accept it. H0 is formulated on the premise that all of the assessors guessed when they made their response. The statistical test chosen should give the probability of obtaining a result at least as extreme as the one observed if the assessors were purely guessing. If this probability is sufficiently low (usually below 0.05, or 5%), then H0 can be rejected in favor of HA.

Tests used to decide whether or not to reject H0 include the binomial test, the χ² (chi-squared) test, and the t-test.
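
As an illustration of the binomial approach, the following sketch (Python, standard library only) computes the probability of obtaining at least the observed number of correct responses if every assessor were guessing. The panel size and number of correct responses are hypothetical; the guessing probability depends on the test design, for example 1/2 for a paired comparison or duo-trio and 1/3 for a triangle test, as described below.

```python
from math import comb

def upper_tail_p(correct, n, p_guess):
    """P(X >= correct) for X ~ Binomial(n, p_guess): the chance of seeing
    at least this many correct responses under H0 (pure guesswork)."""
    return sum(comb(n, k) * p_guess**k * (1 - p_guess)**(n - k)
               for k in range(correct, n + 1))

# Hypothetical triangle test: 30 assessors, 16 identify the odd sample.
p_value = upper_tail_p(correct=16, n=30, p_guess=1/3)
print(f"p = {p_value:.4f}")  # reject H0 in favor of HA if p < 0.05
```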

Types of test

A number of tests can be classified as discrimination tests: any test designed to detect a difference between products falls into this category. The type of test determines the number of samples presented to each member of the panel and the question(s) they are asked to respond to.

Schematically, these tests may be described as follows: A and B are used for known samples, X and Y are used for different unknowns, while parentheses, as in (AB), mean that the order of presentation is unknown:

Paired comparison
XY or (AB) – two unknown samples, known to be different, test is which satisfies some criterion (X or Y); unlike the others this is not an equality test.
Duo-trio
AXY – one known, two unknown, test is which unknown is the known (X = A or Y = A).
Triangle
(XXY) – three unknowns, test is which is odd one out (Y = 1, Y = 2, or Y = 3).
ABX
ABX – two knowns, one unknown, test is which of the knowns the unknown is (X = A or X = B).
Duo-trio in constant reference mode
(AB)X – three unknowns, where it is stated that the first two are different but it is not identified which is which; the test is which of the first two the third matches (X = 1 or X = 2).

Paired comparison

In this type of test the assessors are presented with two products and are asked to state which product fulfils a certain condition. This condition will usually be some attribute such as sweetness, sourness, intensity of flavor, etc. The probability of each assessor arriving at a correct response by guessing is 1/2 (50%).

Advantages

Minimum number of samples required. The most straightforward approach when the question is "Which sample is more ____?"

Disadvantages

Need to know in advance the attribute that is likely to change. Not statistically powerful, so relatively large panel sizes are required to obtain sufficient confidence.
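
As a rough illustration of why large panels are needed, the sketch below (a simplified calculation, not part of any standard) searches for the smallest panel size at which a one-sided binomial test against the 1/2 guessing probability reaches a chosen power; the assumed true proportion of correct answers (65%) is hypothetical.

```python
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def panel_size(p_true, p_guess=0.5, alpha=0.05, power=0.80, n_max=500):
    """Smallest panel size at which a one-sided binomial test of
    H0: proportion correct = p_guess reaches the requested power,
    assuming the true proportion of correct answers is p_true."""
    for n in range(5, n_max + 1):
        # critical count: smallest k with P(X >= k | H0) <= alpha
        crit = next(k for k in range(n + 1) if tail(k, n, p_guess) <= alpha)
        if tail(crit, n, p_true) >= power:
            return n, crit
    return None

# Hypothetical: 65% of assessors answer correctly (chance alone would give 50%).
print(panel_size(p_true=0.65))
```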

Duo-trio

The assessors are presented with three products, one of which is identified as the control. Of the other two, one is identical to the control and the other is the test product. The assessors are asked to state which product more closely resembles the control.

The probability of each assessor arriving at a correct response by guessing is 1/2 (50%).

Advantages

Quick to set up and execute. No need for prior knowledge of the nature of the difference.

Disadvantages

Not statistically powerful; therefore relatively large panel sizes are required to obtain sufficient confidence.

Triangle

The assessors are presented with three products, two of which are identical and the other one different. The assessors are asked to state which product they believe is the odd one out.[1]

The probability of each assessor arriving at a correct response by guessing is 1/3 (approximately 33%).

Advantages

Can be quick to execute and offers greater power than paired comparison or duo-trio.
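
The claim of greater power can be illustrated with a simple "proportion of discriminators" model, in which a proportion d of assessors genuinely perceive the difference and the remainder guess. The sketch below compares the power of the triangle test (guessing probability 1/3) with that of the duo-trio test (guessing probability 1/2) at the same panel size; the panel size, proportion of discriminators, and the model itself are illustrative assumptions.

```python
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power(n, p_guess, d, alpha=0.05):
    """Power of a one-sided binomial test with n assessors when a
    proportion d of them genuinely detect the difference and the
    remainder guess with success probability p_guess."""
    crit = next(k for k in range(n + 1) if tail(k, n, p_guess) <= alpha)
    p_correct = d + (1 - d) * p_guess  # discriminators are always correct
    return tail(crit, n, p_correct)

n, d = 36, 0.30  # hypothetical panel size and proportion of discriminators
print("triangle :", round(power(n, 1/3, d), 3))
print("duo-trio :", round(power(n, 1/2, d), 3))
```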

Disadvantages

Several sources of error can occur:

  • Expectation error: This error occurs when the panelists are given more than enough information about the test before actually doing it. Too many facts or hints cause panelists to make a judgment on expectation rather than intuition. For this reason it is important to provide only the facts necessary to complete the test (e.g. random three-digit codes on the samples, because people generally associate "1" or "A" with "best").
  • Stimulus error: It is important to mask all differences between the two samples. This is because people generally aspire to get the correct answer and any visible differences will "stimulate" error. Lighting, uniformity of size and shape of samples, the use of transparent or opaque cups, etc. must all be taken into account if this error is to be avoided.
  • Logical error: This error can cause panelists to evaluate samples according to particular qualities because those qualities appear to be logically associated with other characteristics. To avoid this error, uniformity of appearance and disguising of disparities must be dealt with before the experiment takes place.
  • Leniency error: Error based on the panelists' opinions of the researcher(s). Tests must be conducted in an organized, professional manner.
  • Suggestion effect: Panelists can influence each other by voicing their opinions or making known their reactions. Silence and separation of panelists by booth-like partitions help decrease the suggestion effect enormously.
  • Positional bias (order effect): Usually the middle sample is chosen as the odd one. This is common in the triangle test, especially when the samples look close to identical. This can be avoided by presenting the samples randomly (e.g. in a triangle shape so that there is no middle sample).
  • Contrast effect and convergence error: The juxtaposition of two noticeably diverse samples commonly causes the panelists to exaggerate the contrasts, hence the contrast effect. But this can also incur the opposite effect, whereby a significant difference can camouflage the more minute unlikeness — the convergence error. In order to correct and prevent these errors, there must be randomized arrangements of samples for each panelist, so as to balance both effects.
  • Central tendency error: Occurs when the panelists rate a sample mid-range, to avoid extremes. Consequently, results may suggest that samples are more comparable than they actually are. This becomes apparent especially when the panelist is not familiar with the products or the test procedure. Prevention of this flaw can be achieved by acquainting panelists with the test approach and products and by randomizing the order of arrangement of samples.
  • Motivation: Motivation of panel members affects their sensory acuity. It is therefore important to maintain the interest of the panelists. This can be achieved just by conducting the experiment in a professional, controlled manner, or even by offering a report of their results. Usually trained panelists are more motivated than those who are not.

Many other errors can occur, but the above are the main ones. It is evident that randomization, control and professional conduct of the experiment are essential for obtaining the most accurate results.

Important uses

Triangle testing is used to assist research and development in formulating and reformulating products, for example to determine whether a particular ingredient change, or a change in processing, creates a detectable difference in the final product. Triangle taste testing is also used in quality control to determine whether a particular production run (or production from different factories) meets the quality-control standard, i.e. is not detectably different from the product standard in a triangle taste test using trained discriminators.

ABX

The assessors are presented with three products, two of which are identified as reference A and alternative B; the third, unknown X, is identical to either A or B. The assessors are asked to state which of A and B the unknown is; the test may also be described as "matching-to-sample", or "duo-trio in balanced reference mode" (both knowns are presented as references, rather than only one).

ABX testing is widely used in comparison of audio compression algorithms, but less used in food science.

ABX testing differs from the other listed tests in that subjects are given two known different samples, and thus are able to compare them with an eye towards differences – there is an "inspection phase". While this may be hypothesized to make discrimination easier, no advantage has been observed in discrimination performance in ABX testing compared with other testing methods.[2]

Duo-trio in constant reference mode

Like the triangle test, but the third sample is known not to be the odd one out. This design is intermediate between ABX, where the identity of each of the first two samples (which is the control and which is the proposed new product) is stated, and the triangle test, where any of the three samples could be the odd one out.

Degree of difference (DoD)

Signal Detection Theory

Experimental design

Notes and references

  1. ISO 4120:2004 Sensory analysis - Methodology - Triangle test
  2. Huang, Y. T.; Lawless, H. T. (1998). "Sensitivity of the ABX Discrimination Test". Journal of Sensory Studies. 13 (2): 229–239. doi:10.1111/j.1745-459X.1998.tb00085.x.
