Gold standard (test)

In medicine and medical statistics, the gold standard, criterion standard, [1] or reference standard [2] is the diagnostic test or benchmark that is the best available under reasonable conditions. [3] It is the test against which new tests are compared to gauge their validity, and it is used to evaluate the efficacy of treatments. [1]

The meaning may differ between practical medicine and the statistical ideal. For some conditions only an autopsy can guarantee diagnostic certainty, so in practice the gold standard is the best test that can be performed on a living patient rather than the autopsy itself. In these cases, even so-called "gold standard" tests require follow-up to confirm or refute the diagnosis. [4]

History

The term 'gold standard' in its current sense in medical research was coined by Rudd in 1979, in reference to the monetary gold standard. [5]

In medicine

"Gold standard" can refer to the criteria by which scientific evidence is evaluated. For example, in resuscitation research, the "gold standard" test of a medication or procedure is whether or not it leads to an increase in the number of neurologically intact survivors who walk out of the hospital. [6] Other types of medical research might regard a significant decrease in 30-day mortality as the gold standard. [citation needed]

The AMA Style Guide has preferred the phrase "criterion standard" to "gold standard". Other journals have also mandated the term in their instructions for contributors; for instance, the Archives of Physical Medicine and Rehabilitation specifies this usage. [7] In practice, however, uptake of the term by authors and enforcement by editorial staff are notably poor, at least for AMA journals. [8]

When the criterion is a whole clinical testing procedure, it is usually referred to as a clinical case definition. Differing case definitions can produce markedly different results when used as the basis for evaluating a given diagnostic method. [9]

A hypothetical ideal "gold standard" test has a sensitivity of 100% for the presence of the disease (it identifies every individual with a well-defined disease process, so it has no false-negative results) and a specificity of 100% (it never identifies someone as having a condition they do not have, so it has no false-positive results). In practice, there is sometimes no true gold standard test. [10]
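As an illustration of these two definitions, the following Python sketch computes sensitivity and specificity from the four counts obtained when a test is compared with a gold standard. The class name, function names, and counts are hypothetical and chosen only for the example; they are not taken from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Results of a test, classified against the gold standard."""
    true_pos: int   # test positive, gold standard positive
    false_neg: int  # test negative, gold standard positive
    true_neg: int   # test negative, gold standard negative
    false_pos: int  # test positive, gold standard negative


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of gold-standard-positive cases that the test detects."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Fraction of gold-standard-negative cases that the test reports as negative."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Hypothetical counts, for illustration only.
counts = ConfusionCounts(true_pos=90, false_neg=10, true_neg=160, false_pos=40)
print(f"sensitivity = {sensitivity(counts):.2f}")
print(f"specificity = {specificity(counts):.2f}")

# An ideal test has no false negatives and no false positives.
ideal = ConfusionCounts(true_pos=100, false_neg=0, true_neg=200, false_pos=0)
print(sensitivity(ideal), specificity(ideal))
```

With the hypothetical counts above, 90 of 100 diseased individuals are detected and 160 of 200 healthy individuals are cleared; the ideal case reaches 1.0 on both measures.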

As new diagnostic methods become available, the "gold standard" test may change over time. For instance, for the diagnosis of aortic dissection, the gold standard test used to be the aortogram, which had a sensitivity as low as 83% and a specificity as low as 87%. With advances in magnetic resonance imaging, the magnetic resonance angiogram (MRA) has become the new gold standard test for aortic dissection, with a sensitivity of 95% and a specificity of 92%. [citation needed] Until a new test is widely accepted, the former test retains its status as the "gold standard".

Test calibration

Because tests can be incorrect (yielding a false-negative or a false-positive result), results should be interpreted in the context of the history, physical findings, and other test results of the individual being tested. It is within this context that the sensitivity and specificity of the "gold standard" test are determined. [citation needed]
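One common way to make "context" concrete is to treat the clinical picture as a pre-test probability and update it with Bayes' theorem. The sketch below is an illustration under that assumption, not a clinical tool: the function name and the pre-test probabilities are hypothetical, and the sensitivity and specificity are the MRA figures quoted above.

```python
def post_test_probability(pre_test: float, sens: float, spec: float,
                          positive: bool) -> float:
    """Probability of disease after one test result, by Bayes' theorem."""
    if positive:
        p_result_if_diseased = sens        # true-positive rate
        p_result_if_healthy = 1.0 - spec   # false-positive rate
    else:
        p_result_if_diseased = 1.0 - sens  # false-negative rate
        p_result_if_healthy = spec         # true-negative rate
    diseased = pre_test * p_result_if_diseased
    healthy = (1.0 - pre_test) * p_result_if_healthy
    return diseased / (diseased + healthy)


# Same positive result, two hypothetical clinical contexts.
for pre_test in (0.05, 0.50):
    post = post_test_probability(pre_test, sens=0.95, spec=0.92, positive=True)
    print(f"pre-test {pre_test:.2f} -> post-test {post:.2f}")
```

The same positive result leaves disease far less likely when the pre-test probability is low than when it is high, which is why a result cannot be read in isolation.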

When the gold standard is not perfect, its sensitivity and specificity must be calibrated against more accurate tests or against the definition of the condition. [11] This calibration is especially important when a perfect test is available only by autopsy. A test must also show adequate interobserver agreement, to avoid bias introduced by the study itself. [12]

Calibration errors can lead to misdiagnosis. [13] [ dubious ]

Ambiguity

Sometimes "gold standard test" refers to the best-performing test available. In that case there is no other criterion against which it can be compared, and it is equivalent to a definition. Gold standard tests in this sense are normally not performed at all, because they may be difficult to perform or impossible to perform on a living person (for example, the test is part of an autopsy, or its results take too long to be clinically useful).

Other times, the "gold standard" does not refer to the best-performing test available, but to the best available under reasonable conditions. For example, in this sense, an MRI is the gold standard for brain tumor diagnosis, though it is not as good as a biopsy. In this case, the sensitivity and specificity of the gold standard are not 100% and it is said to be an "imperfect gold standard" or "alloyed gold standard". [11]
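The practical consequence of an imperfect gold standard can be sketched numerically. The Python example below computes the sensitivity and specificity that a new test would appear to have when scored against an imperfect reference. It assumes that the errors of the two tests are independent given the true disease status, which is a simplification that may not hold in practice; the function name and all numbers are hypothetical.

```python
def apparent_accuracy(prevalence: float,
                      test_sens: float, test_spec: float,
                      ref_sens: float, ref_spec: float) -> tuple[float, float]:
    """Sensitivity and specificity of a test as measured against a reference.

    Assumes test and reference errors are conditionally independent
    given the true disease status.
    """
    p, q = prevalence, 1.0 - prevalence
    # Probability that test and reference are both positive,
    # summed over truly diseased and truly healthy individuals.
    both_pos = p * test_sens * ref_sens + q * (1 - test_spec) * (1 - ref_spec)
    ref_pos = p * ref_sens + q * (1 - ref_spec)
    # Probability that test and reference are both negative.
    both_neg = p * (1 - test_sens) * (1 - ref_sens) + q * test_spec * ref_spec
    ref_neg = p * (1 - ref_sens) + q * ref_spec
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical new test with true sensitivity and specificity of 0.95.
perfect = apparent_accuracy(0.2, 0.95, 0.95, ref_sens=1.0, ref_spec=1.0)
alloyed = apparent_accuracy(0.2, 0.95, 0.95, ref_sens=0.90, ref_spec=0.90)
print("against a perfect reference:   sens=%.3f spec=%.3f" % perfect)
print("against an imperfect reference: sens=%.3f spec=%.3f" % alloyed)
```

Against a perfect reference the measured values equal the true ones. Against the imperfect reference in this scenario, the new test appears less accurate than it really is, because some of its correct results are scored as errors.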

The term ground truth refers to the underlying absolute state of information; the gold standard strives to represent the ground truth as closely as possible. While the gold standard is the best effort to obtain the truth, ground truth is typically collected by direct observation. In machine learning and information retrieval, "ground truth" is the preferred term even when classifications may be imperfect; the gold standard is assumed to be the ground truth. [citation needed]

Some authors use the term "golden standard". Claassen argues this usage is incorrect, as "golden standard" implies a level of perfection that is unattainable in medical science. [5]

References

  1. Borowitz D, Aronoff N, Cummings LC, Maqbool A, Mulberg AE (April 2022). "Coefficient of Fat Absorption to Measure the Efficacy of Pancreatic Enzyme Replacement Therapy in People With Cystic Fibrosis: Gold Standard or Coal Standard?". Pancreas. 51 (4): 310–318. doi:10.1097/MPA.0000000000002016. PMC 9257055. PMID 35695742.
  2. Gold, R; Reichman, M; Greenberg, E; Ivanidze, J; Elias, E; Tsiouris, AJ; Comunale, JP; Johnson, CE; Sanelli, PC (September 2010). "Developing a new reference standard: is validation necessary?". Academic Radiology. 17 (9): 1079–82. doi:10.1016/j.acra.2010.05.021. PMC 2919497. PMID 20692619.
  3. Versi E (July 1992). "'Gold standard' is an appropriate term". BMJ. 305 (6846): 187. doi:10.1136/bmj.305.6846.187-b. PMC 1883235. PMID 1515860.
  4. Fardy, John M.; Barrett, Brendan J. (2015). "Evaluation of Diagnostic Tests". Clinical Epidemiology (PDF). Methods in Molecular Biology. Vol. 1281. pp. 289–300. doi:10.1007/978-1-4939-2428-8_17. ISBN 978-1-4939-2427-1. PMID 25694317.
  5. Claassen, JA (24 December 2005). "['Gold standard', not 'golden standard']". Nederlands Tijdschrift voor Geneeskunde. 149 (52): 2937. PMID 16402524.
  6. ACLS: Principles and Practice. p. 62. Dallas: American Heart Association, 2003. ISBN 0-87493-341-2.
  7. "Guide for Authors". Archives of Physical Medicine and Rehabilitation. Elsevier.
  8. "Criterion Standard - AMA Style Insider". 21 June 2011. Retrieved 2021-05-18.
  9. Bachmann, Lucas M; Jüni, Peter; Reichenbach, Stephan; Ziswiler, Hans-Rudolf; Kessels, Alfons G; Vögelin, Esther (1 August 2005). "Consequences of different diagnostic 'gold standards' in test accuracy research: Carpal Tunnel Syndrome as an example". International Journal of Epidemiology. 34 (4): 953–955. doi:10.1093/ije/dyi105. PMID 15911545.
  10. Troy LM, Michels KB, Hunter DJ, Spiegelman D, Manson JE, Colditz GA, et al. (February 1996). "Self-reported birthweight and history of having been breastfed among younger women: an assessment of validity". International Journal of Epidemiology. 25 (1): 122–127. doi:10.1093/ije/25.1.122. PMID 8666479.
  11. Spiegelman D, Schneeweiss S, McDermott A (January 1997). "Measurement error correction for logistic regression models with an 'alloyed gold standard'". American Journal of Epidemiology. 145 (2): 184–196. doi:10.1093/oxfordjournals.aje.a009089. PMID 9006315.
  12. Stein PD, Athanasoulis C, Alavi A, Greenspan RH, Hales CA, Saltzman HA, et al. (February 1992). "Complications and validity of pulmonary angiography in acute pulmonary embolism". Circulation. 85 (2): 462–468. doi:10.1161/01.CIR.85.2.462. PMID 1735144.
  13. Gallaher MP, Mobley LR, Klee GG, Schryver P (April 2004). The Impact of Calibration Error in Medical Decision Making (PDF) (Report). Washington (DC): National Institute of Standards and Technology.