MEME suite

The MEME suite is a collection of tools for the discovery and analysis of sequence motifs.

Motif discovery

MEME

Multiple Expectation maximizations for Motif Elicitation (MEME) is a tool for discovering motifs in a group of related DNA or protein sequences. [1] MEME takes as input a group of DNA or protein sequences and outputs as many motifs as requested up to a user-specified statistical confidence threshold. MEME uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif. [2]

GLAM2

Gapped Local Alignment of Motifs (GLAM2) is a tool for discovering gapped motifs in a group of DNA or protein sequences. Unlike MEME, GLAM2 does not try to find several different motifs in a single run. Instead, it performs replicates: it searches for the best possible motif several times. [3]

DREME

Discriminative Regular Expression Motif Elicitation (DREME) is a tool for discovering motifs in large collections of sequences. DREME is computationally efficient and therefore is suitable for motif search on large data sets derived from ChIP-seq (Chromatin immunoprecipitation followed by sequencing) experiments. In the interest of computational efficiency, DREME finds only motifs that can be expressed in the IUPAC alphabet, which contains the standard DNA alphabet ACGT as well as eleven 'wildcard' characters (for example, R indicates either A or G).
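
The IUPAC alphabet described above can be illustrated with a short sketch that expands each wildcard into a regular-expression character class. This is only an illustration of the alphabet, not DREME's search procedure; the function names `iupac_to_regex` and `find_sites` are hypothetical.

```python
import re

# IUPAC nucleotide codes: the four standard bases plus eleven wildcards.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a compiled regular expression."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]
        # Single bases stay literal; wildcards become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def find_sites(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A zero-width lookahead lets the scan report overlapping sites.
    pattern = re.compile("(?=" + iupac_to_regex(motif).pattern + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

For example, `find_sites("RCGT", "ACGTGCGTTCGT")` matches at the windows ACGT and GCGT but not at TCGT, because R stands for A or G only.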

MEME-ChIP

MEME-ChIP is a tool for discovering motifs in data sets derived from ChIP-seq (Chromatin immunoprecipitation followed by sequencing) experiments. [4]

FIMO

Find Individual Motif Occurrences (FIMO) is a tool for finding instances of motifs in a sequence database. FIMO searches the database for the provided motifs, and reports a q-value for each match. [5]
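
Scanning a sequence for motif occurrences can be sketched as scoring every window against a log-odds matrix. The following is a minimal sketch under stated assumptions: a uniform background, a fixed pseudocount, and a plain score threshold in place of the p-values and q-values that FIMO reports. The names `log_odds_matrix` and `scan` are hypothetical and do not correspond to FIMO's code.

```python
import math

def log_odds_matrix(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts mapping each of "ACGT" to a count, one dict per motif column.
    background: base frequencies; assumed uniform if not given.
    """
    background = background or {b: 0.25 for b in "ACGT"}
    matrix = []
    for column in counts:
        total = sum(column[b] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in "ACGT"
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous characters
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits
```

A real scanner would also search the reverse strand and convert scores to p-values; both steps are omitted here for brevity.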

GLAM2SCAN

GLAM2SCAN is a tool for finding occurrences of a GLAM2 motif in a sequence database. [6]

MAST

Motif Alignment & Search Tool (MAST) is a tool for searching biological sequence databases for sequences that contain an occurrence of each motif in a given set of motifs. MAST scores the matches and reports p-values for four types of events.

Motif enrichment analysis

SpaMo

Spaced Motif Analysis Tool (SpaMo) is a tool for inferring interactions between transcription factors. SpaMo takes a set of sequences (typically sequences surrounding ChIP-seq peaks), a motif represented in these sequences, and a database of known motifs. SpaMo searches the database for instances of database motifs enriched in sites neighboring the given motif. These enrichments suggest physical interaction between the factors that bind each motif. [7]
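
The idea of looking for preferred spacings between a primary and a secondary motif can be sketched by tallying the gap between the two sites in each sequence. This is an illustrative sketch only: the input format and the function name `spacing_counts` are assumptions, strand handling is ignored, and no significance test is performed.

```python
from collections import Counter

def spacing_counts(site_pairs):
    """Tally the gap between primary and secondary motif sites.

    site_pairs: iterable of (primary_start, primary_width,
                secondary_start, secondary_width), one tuple per sequence.
    Returns a Counter keyed by (side, gap), where side is "downstream" or
    "upstream" of the primary site and gap is the number of bases between
    the two sites. Overlapping pairs are skipped.
    """
    counts = Counter()
    for p_start, p_width, s_start, s_width in site_pairs:
        p_end = p_start + p_width          # first position after the primary site
        s_end = s_start + s_width          # first position after the secondary site
        if s_start >= p_end:
            counts[("downstream", s_start - p_end)] += 1
        elif s_end <= p_start:
            counts[("upstream", p_start - s_end)] += 1
    return counts
```

A spacing that recurs far more often than the others, as reported by `counts.most_common(1)`, is the kind of signal that would then need a statistical test before being interpreted as evidence of interaction.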

CentriMo

Central Motif Enrichment Analysis (CentriMo) is a tool for inferring direct DNA binding from ChIP-seq data. CentriMo is based on the observation that the positional distribution of binding sites matching the direct-binding motif tends to be unimodal, well centered and maximal in the precise center of the ChIP-seq peak regions. CentriMo takes a set of sequences and plots the occurrence of motifs relative to the ChIP-seq peak. Motifs that occur exclusively at the peak provide good evidence of direct binding, while motifs that do not occur in a consistent position relative to the peak may not bind directly. [8]
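
Central enrichment can be quantified, for illustration, with a one-sided binomial test on the positions of the best motif site in each sequence. The sketch below assumes that, under the null hypothesis, a site is equally likely to fall at any offset from the peak center; the function names and the fixed window width are hypothetical, and this is not CentriMo's implementation.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def central_enrichment(offsets, half_width, max_offset):
    """Test whether site offsets cluster near the peak center.

    offsets: signed distance of each sequence's best site from the peak center.
    half_width: sites with |offset| <= half_width count as central.
    max_offset: largest possible |offset|; the null model assumes sites are
                uniform on the integers -max_offset..max_offset.
    Returns (number of central sites, binomial p-value).
    """
    n = len(offsets)
    k = sum(1 for d in offsets if abs(d) <= half_width)
    p_null = (2 * half_width + 1) / (2 * max_offset + 1)
    return k, binomial_tail(k, n, p_null)
```

A small p-value indicates that more sites fall in the central window than uniform placement would predict, which is consistent with the unimodal, centered distribution expected of a direct-binding motif.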

MCAST

Motif Cluster Alignment and Search Tool (MCAST) is a tool for searching a sequence database for statistically significant clusters of non-overlapping occurrences of a set of motifs. Such clusters may represent regulatory modules.

Motif comparison

TOMTOM

Tomtom is a tool for comparing a DNA motif to a database of known motifs. It searches the database for motifs that are statistically significantly similar to the query motif. Tomtom is useful for determining whether a discovered motif is novel or is a variation of a known motif.
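
Motif-to-motif comparison can be sketched by sliding one motif along the other and averaging a column-wise distance over the overlapping columns. The sketch below uses Euclidean distance between columns and reports only the best alignment; it does not compute the statistical significance that a real comparison tool reports. The function names and the `min_overlap` parameter are assumptions made for illustration.

```python
import math

def column_distance(p, q):
    """Euclidean distance between two columns of base probabilities (A, C, G, T)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def best_alignment(query, target, min_overlap=3):
    """Slide query along target; return (offset, mean column distance) of the best fit.

    query, target: lists of columns, each column a list of four probabilities.
    offset: index of the target column aligned with query column 0 (may be negative).
    Returns None if the motifs cannot overlap by at least min_overlap columns.
    """
    best = None
    for offset in range(-(len(query) - min_overlap), len(target) - min_overlap + 1):
        pairs = [
            (query[i], target[i + offset])
            for i in range(len(query))
            if 0 <= i + offset < len(target)
        ]
        if len(pairs) < min_overlap:
            continue
        mean = sum(column_distance(p, q) for p, q in pairs) / len(pairs)
        if best is None or mean < best[1]:
            best = (offset, mean)
    return best
```

Running `best_alignment` against every motif in a database and ranking by distance gives a crude analogue of a database search; a query whose best distance is large for all entries is a candidate novel motif.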

Motif function analysis

GOMO

Gene Ontology for MOtifs (GOMO) is a tool for identifying possible roles for DNA-binding motifs. It does so by comparing the genes that have the motif in their upstream regions against a Gene Ontology database. If the motif is found significantly more often upstream of genes related to a particular function (for example, lactose digestion), this suggests that the transcription factor that binds the motif may regulate that function (for example, by promoting transcription of genes whose products digest lactose).

References

  1. Bailey T.L., Elkan C. Unsupervised Learning of Multiple Motifs In Biopolymers Using EM. Mach. Learn. 1995;21:51–80.
  2. Timothy L. Bailey, "DREME: Motif discovery in transcription factor ChIP-seq data", Bioinformatics, 27(12):1653-1659, 2011.
  3. MC Frith, NFW Saunders, B Kobe, TL Bailey, "Discovering sequence motifs with arbitrary insertions and deletions", PLoS Computational Biology, 4(5):e1000071, 2008
  4. Philip Machanick and Timothy L. Bailey, "MEME-ChIP: motif analysis of large DNA datasets", Bioinformatics, 27(12):1696-1697, 2011
  5. Charles E. Grant, Timothy L. Bailey, and William Stafford Noble, "FIMO: Scanning for occurrences of a given motif", Bioinformatics, 27(7):1017-1018, 2011
  6. MC Frith, NFW Saunders, B Kobe, TL Bailey, "Discovering sequence motifs with arbitrary insertions and deletions", PLoS Computational Biology, 4(5):e1000071, 2008
  7. Whitington, T., Frith, M. C., Johnson, J., & Bailey, T. L. (2011). Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Research, 39(15), e98-e98.
  8. Bailey, T. L., & Machanick, P. (2012). Inferring direct DNA binding from ChIP-seq. Nucleic Acids Research, 40(17), e128-e128