Peak calling

Peak calling is a computational method used to identify areas in a genome that have been enriched with aligned reads as a consequence of performing a ChIP-sequencing or MeDIP-seq experiment. In ChIP-seq, these areas are those where a protein interacts with DNA. [1] When the protein is a transcription factor, the enriched area is its transcription factor binding site (TFBS). Popular software programs include MACS. [2] Wilbanks and Facciotti [3] survey ChIP-seq peak callers, and Bailey et al. [4] give practical guidelines for peak calling in ChIP-seq data.
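
As an illustration of the basic idea, the following is a minimal sketch of window-based enrichment detection. It assumes a single chromosome, a uniform Poisson background estimated from the reads themselves, and fixed non-overlapping windows. The function names (`poisson_sf`, `call_peaks`) and the parameters `window` and `alpha` are hypothetical choices for this example. It is not the algorithm of MACS or of any other published tool, which use control samples, local background estimates and multiple-testing correction.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_peaks(read_starts, genome_length, window=200, alpha=1e-5):
    """Return merged (start, end) intervals of windows enriched for reads."""
    n_windows = math.ceil(genome_length / window)
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1

    # Assumption: background reads are spread uniformly over all windows.
    lam = len(read_starts) / n_windows

    peaks = []
    for i, count in enumerate(counts):
        if poisson_sf(count, lam) < alpha:
            start = i * window
            end = min((i + 1) * window, genome_length)
            if peaks and peaks[-1][1] == start:
                # Merge with the adjacent enriched window.
                peaks[-1] = (peaks[-1][0], end)
            else:
                peaks.append((start, end))
    return peaks


# Toy data: one background read per window plus a pile-up near position 4100.
background = [i * 200 + 50 for i in range(50)]
pileup = [4100 + j for j in range(40)]
print(call_peaks(background + pileup, genome_length=10_000))
# [(4000, 4200)]
```

A genome-wide threshold like `alpha` would normally be replaced by a corrected significance measure such as a false discovery rate.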

Peak calling can also be applied at the transcriptome/exome level to RNA epigenome sequencing data from MeRIP-seq [5] or m6A-seq [6] to detect post-transcriptional RNA modification sites, using software such as exomePeak. [7] Many peak-calling tools are optimised for only one kind of assay, such as transcription-factor ChIP-seq or DNase-seq. [8] However, a newer generation of peak callers such as DFilter [9] is based on generalised optimal detection theory and has been shown to work for nearly all kinds of tag profile signals from next-generation sequencing data. Such tools also allow more complex analyses, such as combining multiple ChIP-seq signals to detect regulatory sites. [10]

In the context of ChIP-exo, this process is known as 'peak-pair calling'. [11]

Differential peak calling identifies significant differences between two ChIP-seq signals. One can distinguish between two-stage and one-stage differential peak callers. Two-stage differential peak callers work in two phases: first, they call peaks on the individual ChIP-seq signals; second, they combine the individual signals and apply statistical tests to estimate differential peaks. DBChIP [12] and MAnorm [13] are examples of two-stage differential peak callers.

One-stage differential peak callers segment the two ChIP-seq signals and identify differential peaks in a single step. They take advantage of signal segmentation approaches such as hidden Markov models. Examples of one-stage differential peak callers are ChIPDiff, [14] ODIN [15] and THOR. Differential peak calling can also be applied to the analysis of RNA-binding protein binding sites. [16]
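
The approach that first calls peaks on each signal and then tests the combined signals can be sketched as follows. This is a minimal illustration, assuming that peaks have already been called and merged into shared regions, that each region has a read count per condition, and that scaling by library size is an adequate normalisation. The names `binom_two_sided` and `differential_peaks`, the tuple layout and the `alpha` threshold are hypothetical. The exact binomial test used here is a stand-in for illustration and is not the statistical model of DBChIP, MAnorm or any other tool named above.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_two_sided(k, n, p):
    """Exact two-sided binomial p-value: sum of outcomes no likelier than k."""
    observed = binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= observed * (1 + 1e-9):  # tolerance for float rounding
            total += prob
    return min(1.0, total)


def differential_peaks(regions, lib_a, lib_b, alpha=0.01):
    """Return (name, p_value) for regions whose counts differ between conditions.

    regions: iterable of (name, count_a, count_b) for pre-called peak regions.
    lib_a, lib_b: total mapped reads per condition, used for normalisation.
    """
    # Under the null, a read in a region comes from condition A with this rate.
    p_null = lib_a / (lib_a + lib_b)
    hits = []
    for name, count_a, count_b in regions:
        p_value = binom_two_sided(count_a, count_a + count_b, p_null)
        if p_value < alpha:
            hits.append((name, p_value))
    return hits


regions = [("peak1", 50, 48), ("peak2", 80, 10)]
for name, p_value in differential_peaks(regions, lib_a=1_000_000, lib_b=1_000_000):
    print(name, p_value)
```

With equal library sizes, only `peak2` is reported in this toy example. A real analysis would also handle replicates and correct for testing many regions.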

References

  1. Valouev A, et al. (September 2008). "Genome-wide analysis of transcription factor binding sites based on ChIP-seq data". Nature Methods. 5 (9): 829–834. doi:10.1038/nmeth.1246. PMC 2917543. PMID 19160518.
  2. Feng, Jianxing; Liu, Tao; Qin, Bo; Zhang, Yong; Liu, Xiaole Shirley (29 August 2012). "Identifying ChIP-seq enrichment using MACS". Nature Protocols. 7 (9): 1728–1740. doi:10.1038/nprot.2012.101. PMC 3868217. PMID 22936215.
  3. Wilbanks, Elizabeth G.; Facciotti, Marc T. (7 July 2010). "Evaluation of Algorithm Performance in ChIP-Seq Peak Detection". PLOS ONE. 5 (7): e11471. Bibcode:2010PLoSO...511471W. doi:10.1371/journal.pone.0011471. PMC 2900203. PMID 20628599.
  4. Bailey, TL; Krajewski P; Ladunga I; Lefebvre C; Li Q; Liu T; Madrigal P; Taslim C; Zhang J. (14 November 2013). "Practical guidelines for the comprehensive analysis of ChIP-seq data". PLOS Comput Biol. 9 (11): e1003326. Bibcode:2013PLSCB...9E3326B. doi:10.1371/journal.pcbi.1003326. PMC 3828144. PMID 24244136.
  5. Meyer, Kate D.; Saletore, Yogesh; Zumbo, Paul; Elemento, Olivier; Mason, Christopher E.; Jaffrey, Samie R. (31 May 2012). "Comprehensive Analysis of mRNA Methylation Reveals Enrichment in 3′ UTRs and near Stop Codons". Cell. 149 (7): 1635–1646. doi:10.1016/j.cell.2012.05.003. PMC 3383396. PMID 22608085.
  6. Dominissini, Dan; Moshitch-Moshkovitz, Sharon; Schwartz, Schraga; Salmon-Divon, Mali; Ungar, Lior; Osenberg, Sivan; Cesarkas, Karen; Jacob-Hirsch, Jasmine; Amariglio, Ninette; Kupiec, Martin; Sorek, Rotem; Rechavi, Gideon (28 April 2012). "Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq". Nature. 485 (7397): 201–206. Bibcode:2012Natur.485..201D. doi:10.1038/nature11112. PMID 22575960. S2CID 3517716.
  7. Meng, J.; Cui, X.; Rao, M. K.; Chen, Y.; Huang, Y. (14 April 2013). "Exome-based analysis for RNA epigenome sequencing data". Bioinformatics. 29 (12): 1565–1567. doi:10.1093/bioinformatics/btt171. PMC 3673212. PMID 23589649.
  8. Koohy, Hashem; Down, Thomas A.; Spivakov, Mikhail; Hubbard, Tim; Helmer-Citterich, Manuela (8 May 2014). "A Comparison of Peak Callers Used for DNase-Seq Data". PLOS ONE. 9 (5): e96303. Bibcode:2014PLoSO...996303K. doi:10.1371/journal.pone.0096303. PMC 4014496. PMID 24810143.
  9. Kumar, Vibhor; Masafumi Muratani; Nirmala Arul Rayan; Petra Kraus; Thomas Lufkin; Huck Hui Ng; Shyam Prabhakar (Jul 2013). "Uniform, optimal signal processing of mapped deep-sequencing data". Nature Biotechnology. 31 (7): 615–622. doi:10.1038/nbt.2596. PMID 23770639.
  10. Wong, Ka-Chun; et al. (2014). "SignalSpider: probabilistic pattern discovery on multiple normalized ChIP-Seq signal profiles". Bioinformatics. 31 (1): 17–24. doi:10.1093/bioinformatics/btu604. PMID 25192742.
  11. Madrigal, Pedro (2015). "Identification of Transcription Factor Binding Sites in ChIP-exo using R/Bioconductor". Epigenesys Bioinformatics Protocols. 68.
  12. Keles, Liang (26 October 2011). "Detecting differential binding of transcription factors with ChIP-seq". Bioinformatics. 28 (1): 121–122. doi:10.1093/bioinformatics/btr605. PMC 3244766. PMID 22057161.
  13. Waxman, Shao; Zhang; Yuan; Orkin (16 March 2012). "MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets". Genome Biology. 13 (3): R16. doi:10.1186/gb-2012-13-3-r16. PMC 3439967. PMID 22424423.
  14. Xu, Sung; Wei; Lin (28 July 2008). "An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data". Bioinformatics. 24 (20): 2344–2349. doi:10.1093/bioinformatics/btn402. PMID 18667444.
  15. Allhoff, Costa; Sere; Chauvistre; Lin; Zenke (24 October 2014). "Detecting differential peaks in ChIP-seq signals with ODIN". Bioinformatics. 30 (24): 3467–3475. doi:10.1093/bioinformatics/btu722. PMID 25371479.
  16. Holmqvist E, Wright PR, Li L, Bischler T, Barquist L, Reinhardt R, Backofen R, Vogel J (2016). "Global RNA recognition patterns of post-transcriptional regulators Hfq and CsrA revealed by UV crosslinking in vivo". EMBO J. 35 (9): 991–1011. doi:10.15252/embj.201593360. PMC 5207318. PMID 27044921.