Hit selection


In high-throughput screening (HTS), one of the major goals is to select compounds (including small molecules, siRNAs, shRNAs, genes, etc.) with a desired size of inhibition or activation effects. A compound with a desired size of effects in a screen is called a hit. The process of selecting hits is called hit selection.[citation needed]


Methods for hit selection in general

HTS experiments can rapidly screen tens of thousands (or even millions) of compounds. Hence, it is a challenge to glean chemical or biochemical significance from the resulting mass of data during hit selection. To address this challenge, appropriate analytic methods have been adopted for hit selection. There are two main strategies for selecting hits with large effects. [1] One is to use one or more metrics to rank and/or classify the compounds by their effects and then to select the largest number of potent compounds that is practical for validation assays. [2] [3] The other strategy is to test whether a compound has effects strong enough to reach a pre-set level. In this strategy, false-negative rates (FNRs) and/or false-positive rates (FPRs) must be controlled. [4] [5] [6] [7] [8] [9] [10] [11]
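The two strategies can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the cited references: the function names, the example scores, and the cutoff of 3.0 are hypothetical, and the sketch assumes each compound has already been given a single effect score whose magnitude reflects potency.

```python
def select_top_k(scores, k):
    """Strategy 1: rank compounds by effect magnitude and keep the top k."""
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k]


def select_by_cutoff(scores, cutoff):
    """Strategy 2: keep every compound whose effect magnitude reaches a pre-set level."""
    return [name for name, score in scores.items() if abs(score) >= cutoff]


# Hypothetical effect scores for four compounds (illustrative values only).
scores = {"cmpd_A": -5.2, "cmpd_B": 0.4, "cmpd_C": 3.1, "cmpd_D": -1.0}
top_two = select_top_k(scores, 2)
above_cutoff = select_by_cutoff(scores, 3.0)
```

In the first strategy, k is set by how many compounds the validation assays can handle. In the second, the cutoff determines the error rates, which is why FNRs and FPRs have to be controlled.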

There are two major types of HTS experiments: those without replicates (usually primary screens) and those with replicates (usually confirmatory screens). The analytic methods for hit selection differ between the two types. For example, the z-score method is suitable for screens without replicates, whereas the t-statistic is suitable for screens with replicates. The calculation of SSMD for screens without replicates also differs from that for screens with replicates. [1]

Screens without replicates

Many metrics are used for hit selection in primary screens without replicates. The most easily interpretable are fold change, mean difference, percent inhibition, and percent activity. However, all of these metrics share a drawback: they do not capture data variability effectively. To address this issue, researchers turned to the z-score method and SSMD, both of which capture data variability in negative references. [12] [13]
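As a rough sketch of how these metrics differ, the snippet below computes them for a single well. The function names and plate values are hypothetical, and the percent-inhibition formula (scaling between the negative and positive control means) is one common convention that is assumed here, not stated in the text above.

```python
import statistics


def simple_metrics(x, neg, pos):
    """Easily interpretable metrics; none of them uses the spread of the data."""
    neg_mean = statistics.mean(neg)
    pos_mean = statistics.mean(pos)
    return {
        "fold_change": x / neg_mean,
        "mean_difference": x - neg_mean,
        # Assumed convention: 0% at the negative control, 100% at the positive control.
        "percent_inhibition": 100.0 * (neg_mean - x) / (neg_mean - pos_mean),
    }


def z_score(x, neg):
    """Distance from the negative-reference mean in units of its standard deviation."""
    return (x - statistics.mean(neg)) / statistics.stdev(neg)


# Hypothetical measurements from one plate (illustrative values only).
neg = [100.0, 110.0, 90.0, 105.0, 95.0]  # negative reference wells
pos = [20.0, 22.0, 18.0]                 # positive control wells
metrics = simple_metrics(60.0, neg, pos)
z = z_score(60.0, neg)
```

Two plates with the same fold change but different spread in the negative reference would give the same simple metrics and different z-scores, which is the sense in which the z-score captures variability.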

The z-score method is based on the assumption that the measured values (usually fluorescent intensity on a log scale) of all investigated compounds in a plate have a normal distribution. SSMD also works best under the normality assumption. However, true hits with large effects should behave very differently from the majority of the compounds and are thus outliers. Strong assay artifacts may also behave as outliers. Outliers are therefore not uncommon in HTS experiments. The regular versions of the z-score and SSMD are sensitive to outliers and can be problematic. Consequently, robust methods such as the z*-score method, SSMD*, the B-score method, and quantile-based methods have been proposed and adopted for hit selection in primary screens without replicates. [14] [15]
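The effect of outliers can be seen in a small sketch. It assumes the usual form of the z*-score, in which the mean is replaced by the median and the standard deviation by 1.4826 times the median absolute deviation (MAD); the function names and data are hypothetical.

```python
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def classic_z_score(x, reference):
    return (x - statistics.mean(reference)) / statistics.stdev(reference)


def robust_z_score(x, reference):
    """z*-score: median and scaled MAD in place of mean and standard deviation."""
    scale = 1.4826 * mad(reference)  # makes the MAD comparable to an SD under normality
    return (x - statistics.median(reference)) / scale


# Hypothetical reference wells; the value 300.0 plays the role of an outlier.
reference = [98.0, 100.0, 102.0, 99.0, 101.0, 100.0, 300.0]
z_classic = classic_z_score(90.0, reference)
z_robust = robust_z_score(90.0, reference)
```

With these values the single outlier inflates the standard deviation so much that the classic z-score of the well at 90.0 stays below 1 in magnitude, while the robust score is well beyond a typical cutoff.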

In a primary screen without replicates, every compound is measured only once. Consequently, we cannot directly estimate the data variability for each compound. Instead, we estimate it indirectly by making the strong assumption that every compound has the same variability as a negative reference in a plate in the screen. The z-score, z*-score and B-score rely on this strong assumption, as do SSMD and SSMD* for cases without replicates.
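Under that shared-variability assumption, the difference between a compound well and a negative-reference well has twice the reference variance, which gives a simplified SSMD estimate. The sketch below uses this simplified form; published estimators add a small-sample correction factor that is omitted here, and the function names and data are hypothetical.

```python
import math
import statistics


def ssmd_no_replicates(x, neg):
    """Simplified SSMD: the difference variance is taken as 2 * var(neg)."""
    return (x - statistics.mean(neg)) / (math.sqrt(2.0) * statistics.stdev(neg))


def ssmd_star_no_replicates(x, neg):
    """Robust analogue (SSMD*): median and scaled MAD replace mean and SD."""
    med = statistics.median(neg)
    mad = statistics.median([abs(v - med) for v in neg])
    return (x - med) / (math.sqrt(2.0) * 1.4826 * mad)


# Hypothetical negative-reference wells and one compound well.
neg = [100.0, 110.0, 90.0, 105.0, 95.0]
ssmd = ssmd_no_replicates(60.0, neg)
ssmd_star = ssmd_star_no_replicates(60.0, neg)
```

In this simplified form the SSMD of a single well is its z-score divided by the square root of 2, so both statistics rank wells identically and differ only in scale and interpretation.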

Screens with replicates

In a screen with replicates, we can directly estimate data variability for each compound, and thus we can use more powerful methods, such as SSMD for cases with replicates and the t-statistic, which do not rely on the strong assumption that the z-score and z*-score rely on. One issue with the t-statistic and its associated p-values is that they are affected by both sample size and effect size. [16] They come from testing for no mean difference and thus are not designed to measure the size of small-molecule or siRNA effects. For hit selection, the major interest is the size of the effect of a tested small molecule or siRNA. SSMD directly assesses the size of effects. [17] SSMD has also been shown to be better than other commonly used effect sizes. [18] The population value of SSMD is comparable across experiments, and thus we can use the same cutoff for the population value of SSMD to measure the size of siRNA effects. [19]
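The dependence of the t-statistic on sample size can be illustrated with paired differences (compound minus negative reference, one per replicate). The sketch uses the simple method-of-moments form of SSMD, the mean of the differences divided by their standard deviation; published estimators include a correction factor that is omitted here, and the function names and data are hypothetical.

```python
import math
import statistics


def ssmd_paired(diffs):
    """Method-of-moments SSMD estimate for paired differences."""
    return statistics.mean(diffs) / statistics.stdev(diffs)


def t_statistic_paired(diffs):
    """Paired t-statistic: the same ratio scaled by sqrt(n)."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Hypothetical log-scale differences for one siRNA, with 3 and then 6 replicates.
three = [-2.0, -1.5, -2.5]
six = three * 2

ssmd_3, t_3 = ssmd_paired(three), t_statistic_paired(three)
ssmd_6, t_6 = ssmd_paired(six), t_statistic_paired(six)
```

Because the t-statistic equals the SSMD estimate multiplied by the square root of the number of replicates, adding replicates with the same mean and similar spread raises the magnitude of t substantially while the SSMD estimate changes only a little.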

SSMD overcomes the drawback that average fold change does not capture data variability. On the other hand, because SSMD is the ratio of the mean to the standard deviation, we may get a large SSMD value when the standard deviation is very small, even if the mean is small. In some cases, a mean value that is too small may not have a biological impact. As such, compounds with large SSMD values (or differentiations) but very small mean values may not be of interest. The dual-flashlight plot has been proposed to address this issue. In a dual-flashlight plot, we plot the SSMD versus the average log fold-change (or average percent inhibition/activation) on the y- and x-axes, respectively, for all compounds investigated in an experiment. [19] With the dual-flashlight plot, we can see how the genes or compounds are distributed into each category of effect size, as shown in the figure. At the same time, we can see the average fold-change for each compound. [19] [20]
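The coordinates of a dual-flashlight plot, and a selection that requires both a large SSMD and a large average log fold-change, can be sketched as follows. The function names, the data, and both cutoffs are hypothetical, and the SSMD uses the simple mean-over-standard-deviation form; no plotting library is used, so the output is just the (x, y) pair per compound.

```python
import statistics


def dual_flashlight_points(log_fold_changes):
    """Map each compound to (average log fold-change, SSMD), the x and y coordinates."""
    points = {}
    for name, values in log_fold_changes.items():
        mean_lfc = statistics.mean(values)
        ssmd = mean_lfc / statistics.stdev(values)
        points[name] = (mean_lfc, ssmd)
    return points


def select_hits(points, lfc_cutoff, ssmd_cutoff):
    """Keep compounds that are large on both axes (illustrative cutoffs)."""
    return [
        name
        for name, (mean_lfc, ssmd) in points.items()
        if abs(mean_lfc) >= lfc_cutoff and abs(ssmd) >= ssmd_cutoff
    ]


# Hypothetical replicate log fold-changes for three compounds.
data = {
    "cmpd_A": [-2.0, -1.5, -2.5],     # large mean, consistent
    "cmpd_B": [-0.10, -0.11, -0.09],  # tiny mean, tiny spread: large SSMD only
    "cmpd_C": [-2.0, 0.5, -4.5],      # large mean, very noisy: small SSMD
}
points = dual_flashlight_points(data)
hits = select_hits(points, lfc_cutoff=1.0, ssmd_cutoff=3.0)
```

Here cmpd_B has the largest SSMD in magnitude but a negligible mean, and cmpd_C has a large mean but a small SSMD, so only cmpd_A passes both criteria.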


References

  1. Zhang XHD (2011). Optimal High-Throughput Screening: Practical Experimental Design and Data Analysis for Genome-scale RNAi Research. Cambridge University Press. ISBN 978-0-521-73444-8.
  2. Birmingham A, Selfors LM, Forster T, Wrobel D, Kennedy CJ, Shanks E, Santoyo-Lopez J, Dunican DJ, Long A, Kelleher D, Smith Q, Beijersbergen RL, Ghazal P, Shamu CE (2009). "Statistical methods for analysis of high-throughput RNA interference screens". Nature Methods. 6 (8): 569–75. doi:10.1038/nmeth.1351. PMC 2789971. PMID 19644458.
  3. Zhang XHD (2010). "Genome-wide screens for effective siRNAs through assessing the size of siRNA effects". BMC Research Notes. 1: 33. doi:10.1186/1756-0500-1-33. PMC 2526086. PMID 18710486.
  4. Malo N, Hanley JA, Cerquozzi S, Pelletier J, Nadon R (2006). "Statistical practice in high-throughput screening data analysis". Nature Biotechnology. 24 (2): 167–75. doi:10.1038/nbt1186. PMID 16465162. S2CID 6158255.
  5. Zhang XH, Kuan PF, Ferrer M, Shu X, Liu YC, Gates AT, Kunapuli P, Stec EM, Xu M, Marine SD, Holder DJ, Stulovici B, Heyse JF, Espeseth AS (2009). "Hit selection with false discovery rate control in genome-scale RNAi screens". Nucleic Acids Research. 36 (14): 4667–79. doi:10.1093/nar/gkn435. PMC 2504311. PMID 18628291.
  6. Klinghoffer RA, Frazier J, Annis J, Berndt JD, Roberts BS, Arthur WT, Lacson R, Zhang XH, Ferrer M, Moon RT, Cleary MA (2010). "A lentivirus-mediated genetic screen identifies dihydrofolate reductase (DHFR) as a modulator of beta-catenin/GSK3 signaling". PLOS ONE. 4 (9): e6892. doi:10.1371/journal.pone.0006892. PMC 2731218. PMID 19727391.
  7. Zhang XHD (2010). "An effective method controlling false discoveries and false non-discoveries in genome-scale RNAi screens". Journal of Biomolecular Screening. 15 (9): 1116–22. doi:10.1177/1087057110381783. PMID 20855561.
  8. Malo N, Hanley JA, Carlile G, Liu J, Pelletier J, Thomas D, Nadon R (2010). "Experimental design and statistical methods for improved hit detection in high-throughput screening". Journal of Biomolecular Screening. 15 (8): 990–1000. doi:10.1177/1087057110377497. PMID 20817887. S2CID 41358896.
  9. Zhang XH, Lacson R, Yang R, Marine SD, McCampbell, Toolan DM, Hare TR, Kajdas J, Berger JP, Holder DJ, Heyse JF, Ferrer M (2010). "The use of SSMD-based false discovery and false non-discovery rates in genome-scale RNAi screens". Journal of Biomolecular Screening. 15 (9): 1123–31. doi:10.1177/1087057110381919. PMID 20852024.
  10. Quon K, Kassner PD (2009). "RNA interference screening for the discovery of oncology targets". Expert Opinion on Therapeutic Targets. 13 (9): 1027–35. doi:10.1517/14728220903179338. PMID 19650760. S2CID 10714162.
  11. Zhang XH, Marine SD, Ferrer M (2009). "Error rates and power in genome-scale RNAi screens". Journal of Biomolecular Screening. 14 (3): 230–38. doi:10.1177/1087057109331475. PMID 19211781.
  12. Zhang XHD (2007). "A new method with flexible and balanced control of false negatives and false positives for hit selection in RNA interference high-throughput screening assays". Journal of Biomolecular Screening. 12 (5): 645–55. doi:10.1177/1087057107300645. PMID 17517904.
  13. Zhang XH, Ferrer M, Espeseth AS, Marine SD, Stec EM, Crackower MA, Holder DJ, Heyse JF, Strulovici B (2007). "The use of strictly standardized mean difference for hit selection in primary RNA interference high-throughput screening experiments". Journal of Biomolecular Screening. 12 (4): 645–55. doi:10.1177/1087057107300646. PMID 17435171. S2CID 7542230.
  14. Zhang XH, Yang XC, Chung N, Gates A, Stec E, Kunapuli P, Holder DJ, Ferrer M, Espeseth AS (2006). "Robust statistical methods for hit selection in RNA interference high-throughput screening experiments". Pharmacogenomics. 7 (3): 299–09. doi:10.2217/14622416.7.3.299. PMID 16610941.
  15. Brideau C, Gunter G, Pikounis B, Liaw A (2003). "Improved statistical methods for hit selection in high-throughput screening". Journal of Biomolecular Screening. 8 (6): 634–47. doi:10.1177/1087057103258285. PMID 14711389.
  16. Cohen J (1994). "The Earth Is Round (P-Less-Than.05)". American Psychologist. 49 (12): 997–1003. doi:10.1037/0003-066X.49.12.997. ISSN 0003-066X.
  17. Zhang XHD (2009). "A method for effectively comparing gene effects in multiple conditions in RNAi and expression-profiling research". Pharmacogenomics. 10 (3): 345–58. doi:10.2217/14622416.10.3.345. PMID 20397965.
  18. Zhang XHD (2010). "Strictly standardized mean difference, standardized mean difference and classical t-test for the comparison of two groups". Statistics in Biopharmaceutical Research. 2 (2): 292–99. doi:10.1198/sbr.2009.0074. S2CID 119825625.
  19. Zhang XHD (2010). "Assessing the size of gene or RNAi effects in multifactor high-throughput experiments". Pharmacogenomics. 11 (2): 199–213. doi:10.2217/PGS.09.136. PMID 20136359.
  20. Zhao WQ, Santini F, Breese R, Ross D, Zhang XD, Stone DJ, Ferrer M, Townsend M, Wolfe AL, Seager MA, Kinney GG, Shughrue PJ, Ray WJ (2010). "Inhibition of calcineurin-mediated endocytosis and alpha-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid (AMPA) receptors prevents amyloid beta oligomer-induced synaptic disruption". Journal of Biological Chemistry. 285 (10): 7619–32. doi:10.1074/jbc.M109.057182. PMC 2844209. PMID 20032460.
