Multiple EM for Motif Elicitation

Last updated

Multiple Expectation maximizations for Motif Elicitation (MEME) is a tool for discovering motifs in a group of related DNA or protein sequences. [1]

Contents

A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences and is often associated with some biological function. MEME represents motifs as position-dependent letter-probability matrices which describe the probability of each possible letter at each position in the pattern. Individual MEME motifs do not contain gaps. Patterns with variable-length gaps are split by MEME into two or more separate motifs.

MEME takes as input a group of DNA or protein sequences (the training set) and outputs as many motifs as requested. It uses statistical modeling techniques to automatically choose the best width, number of occurrences, and description for each motif.

MEME is the first of a collection of tools for analyzing motifs called the MEME suite.

Definition

The MEME algorithm could be understood from two different perspectives. From a biological point of view, MEME identifies and characterizes shared motifs in a set of unaligned sequences. From the computer science aspect, MEME finds a set of non-overlapping, approximately matching substrings given a starting set of strings.[ citation needed ]

Use

MEME can be used to find similar biological functions and structures in different sequences. It is necessary to take into account that the sequences variation can be significant and that the motifs are sometimes very small. It is also useful to take into account that the binding sites for proteins are very specific. This makes it easier to reduce wet-lab experiments (saving cost and time). Indeed, to better discover the motifs relevant from a biological point it is necessary to carefully choose: the best width of motifs, the number of occurrences in each sequence, and the composition of each motif.

Algorithm components

The algorithm uses several types of well known functions:

However, one often doesn't know where the starting position is. Several possibilities exist: exactly one motif per sequence, or one or zero motif per sequence, or any number of motifs per sequence.

See also

Related Research Articles

Bioinformatics Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it — with unobservable ("hidden") states. As part of the definition, HMM requires that there be an observable process whose outcomes are "influenced" by the outcomes of in a known way. Since cannot be observed directly, the goal is to learn about by observing HMM has an additional requirement that the outcome of at time may be "influenced" exclusively by the outcome of at and that the outcomes of and at must not affect the outcome of at

Sequence alignment Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences, such as calculating the distance cost between strings in a natural language or in financial data.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an N-glycosylation site motif can be defined as Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro residue.

Expectation–maximization algorithm Iterative method for finding maximum likelihood estimates in statistical models

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

Structural alignment Aligning molecular sequences using sequence and structural information

Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques. Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, caution should be used in using the results as evidence for shared evolutionary ancestry because of the possible confounding effects of convergent evolution by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Haplotype Group of genes from one parent

A haplotype is a group of alleles in an organism that are inherited together from a single parent.

In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the EM algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step.

In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information.

Generative topographic map (GTM) is a machine learning method that is a probabilistic counterpart of the self-organizing map (SOM), is probably convergent and does not require a shrinking neighborhood or a decreasing step size. It is a generative model: the data is assumed to arise by first probabilistically picking a point in a low-dimensional space, mapping the point to the observed high-dimensional input space, then adding noise in that space. The parameters of the low-dimensional probability distribution, the smooth map and the noise are all learned from the training data using the expectation-maximization (EM) algorithm. GTM was introduced in 1996 in a paper by Christopher Bishop, Markus Svensen, and Christopher K. I. Williams.

A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

Multiple sequence alignment Alignment of more than two molecular sequence

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

Network motif Type of sub-graph

Network motifs are recurrent and statistically significant subgraphs or patterns of a larger graph. All networks, including biological networks, social networks, technological networks and more, can be represented as graphs, which include a wide variety of subgraphs.

GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, original GeneMark was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome of Haemophilus influenzae, and in 1996 for the first archaeal genome of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence that became standard in gene prediction as well as Bayesian approach to gene prediction in two DNA strands simultaneously. Species specific parameters of the models were estimated from training sets of sequences of known type. The major step of the algorithm computes for a given DNA fragment posterior probabilities of either being "protein-coding" in each of six possible reading frames or being "non-coding". Original GeneMark is an HMM-like algorithm; it can be viewed as approximation to known in the HMM theory posterior decoding algorithm for appropriately defined HMM.

HMMER

HMMER is a free and commonly used software package for sequence analysis written by Sean Eddy. Its general usage is to identify homologous protein or nucleotide sequences, and to perform sequence alignments. It detects homology by comparing a profile-HMM to either a single sequence or a database of sequences. Sequences that score significantly better to the profile-HMM compared to a null model are considered to be homologous to the sequences that were used to construct the profile-HMM. Profile-HMMs are constructed from a multiple sequence alignment in the HMMER package using the hmmbuild program. The profile-HMM implementation used in the HMMER software was based on the work of Krogh and colleagues. HMMER is a console utility ported to every major operating system, including different versions of Linux, Windows, and Mac OS.

The MEME suite is a collection of tools for the discovery and analysis of sequence motifs.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms. Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

References

  1. Bailey T.L., Elkan C. Unsupervised Learning of Multiple Motifs In Biopolymers Using EM. Mach. Learn. 1995;21:51–80.