Maximal unique match

Last updated

A maximal unique match or MUM, for short, is part of a key step [1] in the multiple sequence alignment of genomes in computational biology. Identification of MUMs and other potential anchors is the first step in larger alignment systems such as MUMmer. Anchors are the areas between two genomes where they are highly similar.  To understand what a MUM is we each word in the acronym can be broken down individually.  Match implies that the substring occurs in both sequences to be aligned.  Unique means that the substring occurs only once in each sequence.  Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.

Contents

Formal definition

"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d = 20) such that

Algorithm

Identifying the set of MUMs in two very long genome sequences is not computationally trivial. [1] There are several algorithmic ways to approach identifying MUMs in multiple sequence alignment. The simplest and slowest method is using brute force where for every index i in genome A and every index j in genome B , you calculate the longest common prefix (P) of A[i...n] and B[j...m]. Next you must guarantee that the length of P is at least d where d is the minimum MUM size specified. Finally you must ensure that P is unique in both genomes. The complexity of the brute force method is thus O(mn). [2]

In actuality though MUMs are identified by building a generalized suffix tree for A and B . A list is then created for all internal nodes with exactly one child from each genome sequence. For each of these nodes(we will identify children from genome A as i and children from genome B as j ) check that A[i-1] ≠ B[j-1] and for those where this conditions holds, we know this is a MUM. In this case the complexity is reduced to O(m+n). [2]

In the illustration below given the initial strings S and T and a d of 1, the MUMs should be G and TA. A red leaf denotes that the leaf came from string S and a blue leaf denotes string T. The internal node at A was discarded because A[i-1] = B[j-1] meaning the character that comes before both A's is identical (T), this is condition where the sequences belongs to a larger unique sequence. The internal node at C is discarded because it has two children from S. This leaves us with the MUMs of G and TA.

MUM identification using a suffix tree MUM suffix tree.png
MUM identification using a suffix tree

Example

To illustrate MUMs further, we take the following example. Let's say that d=3 and our two genomes are as follows:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#

T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Working through the sequence our first MUM that satisfies the conditions occurs here:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#

T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

TGT is at least d in length, occurs only once in both sequences, and the characters to the left and right differ between genomes. To illustrate an example where expansion is needed to ensure that our MUM is not part of a larger sequence and unique, take the following:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#

T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

There are two problems in the case of testing ATG as a MUM because ATG is not unique and it is also part of a larger subsequence as illustrated here:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#

T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Therefore expansion of this sequence is required in an attempt to satisfy the conditions for being a MUM as shown here.

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#

T = CTGGTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Using this method iteratively produces a final product where each MUM is identified for clarity each MUM is colored-coded:

S = CTACTGTCATGCTGAAGTCATGCGGCTGGCTTAT#

T = ACTTTGTGATGCTAAGTCATGCGGTCTGGCTTAT$

Maximal Exact Match (MEM)

MUMs are a subset of a larger set referred to as Maximal Exact Matches or MEMS. In a MEM the uniqueness condition of MUMs is relaxed. MEMs are like local alignment but in this case only identify sequences where there are no gaps.

See also

Related Research Articles

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

In computer science, the Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars published by Itiroo Sakai in 1961. The algorithm is named after some of its rediscoverers: John Cocke, Daniel Younger, Tadao Kasami, and Jacob T. Schwartz. It employs bottom-up parsing and dynamic programming.

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

A phylogenetic tree, phylogeny or evolutionary tree is a graphical representation which shows the evolutionary history between a set of species or taxa during a specific time. In other words, it is a branching diagram or a tree showing the evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics. In evolutionary biology, all life on Earth is theoretically part of a single phylogenetic tree, indicating common ancestry. Phylogenetics is the study of phylogenetic trees. The main challenge is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of species or taxa. Computational phylogenetics focuses on the algorithms involved in finding optimal phylogenetic tree in the phylogenetic landscape.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them. The algorithm for producing the tables was published in a follow-on paper; this paper contained errors which were later corrected by Wojciech Rytter in 1980.

<span class="mw-page-title-main">Suffix tree</span> Tree containing all suffixes of a given text

In computer science, a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

<span class="mw-page-title-main">Comparative genomics</span>

Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural landmarks. In this branch of genomics, whole or large parts of genomes resulting from genome projects are compared to study basic biological similarities and differences as well as evolutionary relationships between organisms. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, comparative genomic approaches start with making some form of alignment of genome sequences and looking for orthologous sequences in the aligned genomes and checking to what extent those sequences are conserved. Based on these, genome and molecular evolution are inferred and this may in turn be put in the context of, for example, phenotypic evolution or population genetics.

In computer science, a longest common substring of two or more strings is a longest string that is a substring of all of them. There may be more than one longest common substring. Applications include data deduplication and plagiarism detection.

<span class="mw-page-title-main">Generalized suffix tree</span>

In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.

In computational phylogenetics, tree alignment is a computational problem concerned with producing multiple sequence alignments, or alignments of three or more sequences of DNA, RNA, or protein. Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa. The edit distances between sequences are calculated for each of the tree's internal vertices, such that the sum of all edit distances within the tree is minimized. Tree alignment can be accomplished using one of several algorithms with various trade-offs between manageable tree size and computational effort.

MaMF, or Mammalian Motif Finder, is an algorithm for identifying motifs to which transcription factors bind.

BLAT is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome. It was designed primarily to decrease the time needed to align millions of mouse genomic reads and expressed sequence tags against the human genome sequence. The alignment tools of the time were not capable of performing these operations in a manner that would allow a regular update of the human genome assembly. Compared to pre-existing tools, BLAT was ~500 times faster with performing mRNA/DNA alignments and ~50 times faster with protein/protein alignments.

MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure. It has been used for comparing different genomes assemblies to one another, which allows scientists to determine how a genome has changed. The acronym "MUMmer" comes from "Maximal Unique Matches", or MUMs.

Velvet is an algorithm package that has been designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions. Velvet has also been implemented in commercial packages, such as Sequencher, Geneious, MacVector and BioNumerics.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

In bioinformatics, a DNA read error occurs when a sequence assembler changes one DNA base for a different base. The reads from the sequence assembler can then be used to create a de Bruijn graph, which can be used in various ways to find errors.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search., alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.

References

  1. 1 2 Delcher, A. L.; Kasif, S.; Fleishmann, R.D.; Peterson, J.; White, O.; Salzberg, S.L. (1999). "Alignment of whole genomes". Nucleic Acids Research. 27 (11): 2369–2376. doi: 10.1093/nar/30.11.2478 . PMC   117189 . PMID   10325427.
  2. 1 2 3 Wing-Kin, Sung (2010). Algorithms in Bioinformatics: A Practical Introduction (First ed.). Boca Raton: Chapman & Hall/CRC Press. ISBN   978-1420070330.