Scan statistic

In statistics, a scan statistic or window statistic arises in problems concerning the clustering of randomly positioned points. Typical examples are the maximum size of a cluster of points on a line and the longest series of successes recorded by a moving window of fixed length. [1]
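
As a rough illustration of the first example, the sketch below counts the largest number of points that any window of fixed length can cover on a line. The function name `scan_statistic` and the sample data are hypothetical, chosen only for this illustration; the sketch computes the statistic itself, not its null distribution.

```python
from bisect import bisect_right


def scan_statistic(points, window):
    """Largest number of points in any closed interval [t, t + window]."""
    pts = sorted(points)
    best = 0
    for i, start in enumerate(pts):
        # The maximum is always attained by a window whose left edge
        # sits on one of the points, so only those windows are checked.
        j = bisect_right(pts, start + window)
        best = max(best, j - i)
    return best


# Hypothetical event times on the unit interval.
events = [0.05, 0.11, 0.12, 0.14, 0.40, 0.72, 0.74, 0.90]
print(scan_statistic(events, 0.10))  # prints 4
```

Whether a count of 4 is unusually large depends on the distribution of the statistic under a no-clustering model, which is the question that the approximations in [1] address.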

Joseph Naus first published on the problem in the 1960s, [2] and has been called the "father of the scan statistic" in honour of his early contributions. [3] The results can be applied in epidemiology, public health and astronomy to find unusual clusters of events. [4]

The scan statistic was extended by Martin Kulldorff to multidimensional settings and varying window sizes in a 1997 paper, [5] which is (as of 11 October 2015) the most cited article in its journal, Communications in Statistics – Theory and Methods. [6] This work led to the creation of the software SaTScan, a program trademarked by Martin Kulldorff that applies his methods to data.
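
A minimal sketch of a spatial scan with varying window sizes follows. It assumes a Poisson model in which each candidate window is scored by a log likelihood ratio comparing observed and expected case counts inside and outside the window. The function names, the circular windows centred on the sites and the toy data are assumptions made for illustration; this is not SaTScan's implementation, and the Monte Carlo step normally used to assess significance is omitted.

```python
import math


def poisson_llr(c, e, total):
    """Log likelihood ratio for a window with c observed and e expected cases.

    Assumes e > 0 and that expected counts are scaled to sum to `total`,
    the overall number of cases. Only excesses (c > e) receive a score.
    """
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    # When every case lies inside the window the outside term vanishes.
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside


def spatial_scan(sites, radii):
    """sites: list of (x, y, cases, expected). Returns (llr, centre_index, radius)."""
    total = sum(s[2] for s in sites)
    best = (0.0, None, None)
    for i, (cx, cy, _, _) in enumerate(sites):
        for r in radii:
            # Circular window: all sites within distance r of the centre.
            members = [s for s in sites if math.hypot(s[0] - cx, s[1] - cy) <= r]
            c = sum(s[2] for s in members)
            e = sum(s[3] for s in members)
            score = poisson_llr(c, e, total)
            if score > best[0]:
                best = (score, i, r)
    return best


# Hypothetical sites: (x, y, observed cases, expected cases).
sites = [(0, 0, 10, 2.0), (1, 0, 1, 3.0), (5, 5, 1, 3.0), (6, 5, 0, 4.0)]
print(spatial_scan(sites, radii=[0.5, 1.5]))
```

In this toy data the highest-scoring window is the smallest circle around the first site, where 10 cases are observed against 2 expected.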

Recent results have shown that using scale-dependent critical values for the scan statistic makes it possible to attain asymptotically optimal detection simultaneously for all signal lengths, thereby improving on the traditional scan, but this procedure has been criticized for losing too much power for short signals. Walther and Perry (2022) considered the problem of detecting an elevated mean on an interval with unknown location and length in the univariate Gaussian sequence model. [7] They explain the discrepancy between the optimality results and this criticism by showing that asymptotic optimality results are necessarily too imprecise to discern the performance of scan statistics in a practically relevant way, even in large samples. Instead, they propose assessing performance with a new finite-sample criterion. They also present three new calibration techniques for scan statistics that perform well across a range of relevant signal lengths, with the aim of improving performance for short signals.
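
To make the idea of a scale-dependent calibration concrete, the sketch below scans every interval of a sequence with unit-variance Gaussian noise and compares the traditional scan with a penalized variant. The penalty term sqrt(2 log(e n / |I|)) is one commonly used scale-dependent correction and is an assumption here; it is not one of the three calibrations proposed in [7]. The function name `scan_gaussian` and the simulated data are likewise hypothetical.

```python
import math
import random
from itertools import accumulate


def scan_gaussian(y, penalized=True):
    """Scan all intervals of y for an elevated mean (unit noise variance assumed).

    Returns (statistic, start, end) for the best interval, end exclusive.
    """
    n = len(y)
    csum = [0.0] + list(accumulate(y))
    best = (-math.inf, 0, 0)
    for length in range(1, n + 1):
        # Assumed scale-dependent penalty: short intervals are far more
        # numerous, so they must clear a higher bar than long ones.
        penalty = math.sqrt(2.0 * math.log(math.e * n / length)) if penalized else 0.0
        for start in range(n - length + 1):
            # Standardized interval sum, one-sided for an elevated mean.
            z = (csum[start + length] - csum[start]) / math.sqrt(length)
            if z - penalty > best[0]:
                best = (z - penalty, start, start + length)
    return best


# Simulated example: noise with the mean raised by 1.5 on [80, 100).
random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(200)]
for i in range(80, 100):
    y[i] += 1.5
print(scan_gaussian(y, penalized=False))
print(scan_gaussian(y, penalized=True))
```

With `penalized=False` the maximum is dominated by the many short intervals, which is the behaviour that scale-dependent critical values are meant to correct. The trade-off discussed above is that such corrections can cost power when the true signal is short.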

Scan-statistic-based methods have been developed specifically to detect rare-variant associations in the noncoding genome, especially in intergenic regions. Compared with fixed-size sliding-window analysis, these methods scan the genome continuously with dynamic windows whose sizes adapt to the data, and increase analysis power by flexibly selecting the locations and sizes of the signal regions. [8] Examples of these methods include Q-SCAN, [9] SCANG [10] and WGScan. [11]

References

  1. Naus, J. I. (1982). "Approximations for Distributions of Scan Statistics". Journal of the American Statistical Association. 77 (377): 177–183. doi:10.1080/01621459.1982.10477783. JSTOR 2287786.
  2. Naus, Joseph Irwin (1964). Clustering of random points in line and plane (Ph.D.). Retrieved 6 January 2014.
  3. Wallenstein, S. (2009). "Joseph Naus: Father of the Scan Statistic". Scan Statistics. pp. 1–25. doi:10.1007/978-0-8176-4749-0_1. ISBN 978-0-8176-4748-3.
  4. Glaz, J.; Naus, J.; Wallenstein, S. (2001). "Introduction". Scan Statistics. Springer Series in Statistics. pp. 3–9. doi:10.1007/978-1-4757-3460-7_1. ISBN 978-1-4419-3167-2.
  5. Kulldorff, Martin (1997). "A spatial scan statistic". Communications in Statistics – Theory and Methods. 26 (6): 1481–1496. doi:10.1080/03610929708831995.
  6. "Most Cited Articles". Communications in Statistics – Theory and Methods. Retrieved 11 October 2015.
  7. Walther, Guenther; Perry, Andrew (November 2022). "Calibrating the scan statistic: Finite sample performance versus asymptotics". Journal of the Royal Statistical Society, Series B (Statistical Methodology). 84 (5): 1608–1639. doi:10.1111/rssb.12549. ISSN 1369-7412. S2CID 221713232.
  8. Li, Zilin; Li, Xihao; Zhou, Hufeng; Gaynor, Sheila M.; Selvaraj, Margaret Sunitha; Arapoglou, Theodore; Quick, Corbin; Liu, Yaowu; Chen, Han; Sun, Ryan; Dey, Rounak; Arnett, Donna K.; Auer, Paul L.; Bielak, Lawrence F.; Bis, Joshua C.; Blackwell, Thomas W.; Blangero, John; Boerwinkle, Eric; Bowden, Donald W.; Brody, Jennifer A.; Cade, Brian E.; Conomos, Matthew P.; Correa, Adolfo; Cupples, L. Adrienne; Curran, Joanne E.; de Vries, Paul S.; Duggirala, Ravindranath; Franceschini, Nora; Freedman, Barry I.; Goring, Harald H.H.; Guo, Xiuqing; Kalyani, Rita R.; Kooperberg, Charles; Kral, Brian G.; Lange, Leslie A.; Lin, Bridget; Manichaikul, Ani; Martin, Lisa W.; Mathias, Rasika A.; Meigs, James B.; Mitchell, Braxton D.; Montasser, May E.; Morrison, Alanna C.; Naseri, Take; O’Connell, Jeffrey R.; Palmer, Nicholette D.; Reupena, Muagututi’a Sefuiva; Rice, Kenneth M.; Rich, Stephen S.; Smith, Jennifer A.; Taylor, Kent D.; Taub, Margaret A.; Vasan, Ramachandran S.; Weeks, Daniel E.; Wilson, James G.; Yanek, Lisa R.; Zhao, Wei; NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium; TOPMed Lipids Working Group; Rotter, Jerome I.; Willer, Cristen; Natarajan, Pradeep; Peloso, Gina M.; Lin, Xihong (2022). "A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies". Nature Methods. 19 (12): 1599–1611. doi:10.1038/s41592-022-01640-x. PMC 10008172. PMID 36303018. S2CID 243873361.
  9. Li, Zilin; Liu, Yaowu; Lin, Xihong (2022). "Simultaneous Detection of Signal Regions Using Quadratic Scan Statistics With Applications to Whole Genome Association Studies". Journal of the American Statistical Association. 117 (538): 823–834. doi:10.1080/01621459.2020.1822849. PMC 9285665. PMID 35845434.
  10. Li, Zilin; Li, Xihao; Liu, Yaowu; Shen, Jincheng; Chen, Han; Zhou, Hufeng; Morrison, Alanna C.; Boerwinkle, Eric; Lin, Xihong (2019). "Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole-Genome Sequencing Studies". American Journal of Human Genetics. 104 (5): 802–814. doi:10.1016/j.ajhg.2019.03.002. PMC 6507043. PMID 30982610.
  11. He, Zihuai; Xu, Bin; Buxbaum, Joseph; Ionita-Laza, Iuliana (2019). "A genome-wide scan statistic framework for whole-genome sequence data analysis". Nature Communications. 10 (1): 3018. doi:10.1038/s41467-019-11023-0. PMC 6616627. PMID 31289270.