MUMmer

Last updated

MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure. It has been used for comparing different genomes assemblies to one another, which allows scientists to determine how a genome has changed. The acronym "MUMmer" comes from "Maximal Unique Matches", or MUMs.

Contents

The original algorithms in the MUMMER software package were designed by Art Delcher, Simon Kasif and Steven Salzberg. Mummer was the first whole genome comparison system developed in Bioinformatics. It was originally applied to the comparison of two related strains of bacteria.

The MUMmer software is open source. The system is maintained primarily by Steven Salzberg and Arthur Delcher at Center for Computational Biology at Johns Hopkins University.

MUMmer is a highly cited bioinformatics system in the scientific literature. According to Google Scholar, as of early 2013 the original MUMmer paper (Delcher et al., 1999) [1] has been cited 691 times; the MUMmer 2 paper (Delcher et al., 2002) [2] has been cited 455 times; and the MUMmer 3.0 article (Kurtz et al., 2004) [3] has been cited 903 times.

Overview

Mummer is a fast algorithm used for the rapid alignment of entire genomes. The MUMmer algorithm is relatively new and has 4 versions.

Versions of MUMmers

MUMmer1

MUMmer1 or just MUMmer consists of three parts, the first part consists of the creation of suffix trees (to get MUMs), the second part in the longest increasing subsequence or longest common subsequences (to order MUMs), lastly any alignment to close gaps.

Interruptions between MUMs-alignment, are known as gaps. Otherther alignment algorithms fill these gaps. The gaps fall in the following four classes: [4]

MUMmer 2

This algorithm was redesigned to require less memory and increase speed and accuracy. It also allows for bigger genomes alignment.

The improvement was the amount stored in the suffix trees by employing the one created by Kurtz.

MUMmer 3

According to Stefan Kurtz and his teammates, “the most significant technical improvement in MUMmer 3.0, is a complete rewrite of the suffix-tree code, based on the compact suffix- tree representation of” [5] the tree described in the article “Reducing the space requirement of suffix trees”. [6]

MUMmer 4

According to Guillaume and his team, there are some extra improvements in the implementation and also innovation with Query parallelism. “MUMmer4 now includes options to save and load the suffix array for a given reference." [7] This allows the suffix tree can be built once and constructed again after running it from the saved suffix tree.

Software - Open Source

MUMmer has open-source software and can be accessed online.

There are other types of sequence alignments:

Related Research Articles

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

BioJava is an open-source software project dedicated to provide Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a range of data, starting from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks such as to parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.

In bioinformatics, GLIMMER (Gene Locator and Interpolated Markov ModelER) is used to find genes in prokaryotic DNA. "It is effective at finding genes in bacteria, archea, viruses, typically finding 98-99% of all relatively long protein coding genes". GLIMMER was the first system that used the interpolated Markov model to identify coding regions. The GLIMMER software is open source and is maintained by Steven Salzberg, Art Delcher, and their colleagues at the Center for Computational Biology at Johns Hopkins University. The original GLIMMER algorithms and software were designed by Art Delcher, Simon Kasif and Steven Salzberg and applied to bacterial genome annotation in collaboration with Owen White.

<span class="mw-page-title-main">Multiple sequence alignment</span> Alignment of more than two molecular sequences

Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment as in the image at right illustrate mutation events such as point mutations that appear as differing characters in a single alignment column, and insertion or deletion mutations that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

A maximal unique match or MUM, for short, is part of a key step in the multiple sequence alignment of genomes in computational biology. Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as MUMmer. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we each word in the acronym can be broken down individually. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.

<span class="mw-page-title-main">Steven Salzberg</span> American biologist and computer scientist

Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.

In bioinformatics, MAFFT is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version of MAFFT used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform. Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, higher accuracy alignments, alignment of non-coding RNA sequences, and the addition of new sequences to existing alignments.

MUltiple Sequence Comparison by Log-Expectation (MUSCLE) is computer software for multiple sequence alignment of protein and nucleotide sequences. It is licensed as public domain. The method was published by Robert C. Edgar in two papers in 2004. The first paper, published in Nucleic Acids Research, introduced the sequence alignment algorithm. The second paper, published in BMC Bioinformatics, presented more technical details.

Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.

<span class="mw-page-title-main">DNA annotation</span> The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

PhylomeDB is a public biological database for complete catalogs of gene phylogenies (phylomes). It allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments. Moreover, phylomeDB provides genome-wide orthology and paralogy predictions which are based on the analysis of the phylogenetic trees. The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood tree inference, alignment trimming and evolutionary model testing.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

Bowtie is a software package commonly used for sequence alignment and sequence analysis in bioinformatics. The source code for the package is distributed freely and compiled binaries are available for Linux, macOS and Windows platforms. As of 2017, the Genome Biology paper describing the original Bowtie method has been cited more than 11,000 times. Bowtie is open-source software and is currently maintained by Johns Hopkins University.

Owen R. White is a bioinformatician and director of the Institute For Genome Sciences at the University of Maryland School of Medicine. He is known for his work on the bioinformatics tools GLIMMER and MUMmer.

Non-coding RNAs have been discovered using both experimental and bioinformatic approaches. Bioinformatic approaches can be divided into three main categories. The first involves homology search, although these techniques are by definition unable to find new classes of ncRNAs. The second category includes algorithms designed to discover specific types of ncRNAs that have similar properties. Finally, some discovery methods are based on very general properties of RNA, and are thus able to discover entirely new kinds of ncRNAs.

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search., alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.

References

  1. Delcher, A. L.; Kasif, S.; Fleischmann, R. D.; Peterson, J.; White, O.; Salzberg, S. L. (1999). "Alignment of whole genomes". Nucleic Acids Research. 27 (11): 2369–2376. doi:10.1093/nar/27.11.2369. PMC   148804 . PMID   10325427.
  2. Delcher, A. L.; Phillippy, A.; Carlton, J.; Salzberg, S. L. (2002). "Fast algorithms for large-scale genome alignment and comparison". Nucleic Acids Research. 30 (11): 2478–2483. doi:10.1093/nar/30.11.2478. PMC   117189 . PMID   12034836.
  3. Delcher, A.; Harmon, D.; Kasif, S.; White, O.; Salzberg, S. (1999). "Improved microbial gene identification with GLIMMER". Nucleic Acids Research. 27 (23): 4636–4641. doi:10.1093/nar/27.23.4636. PMC   148753 . PMID   10556321.
  4. Delcher, A.; Kasif, S.; Fleischmann, R.; Peterson, J.; White, O.; Salzberg, S. (1999). "Alignment of Whole Genomes". Nucleic Acids Research. 27 (11): 2369–2376. doi: 10.1093/nar/27.23.4636 . PMC   148804 . PMID   10325427.
  5. Kurtz, S.; Phillippy, A.; Delcher, A.; Smoot, M.; Shumway, M.; Antonescu, C.; Salzberg, S. (2004). "Versatile and open software for comparing large genomes" (PDF). Genome Biology. 5 (2): R12. doi: 10.1186/gb-2004-5-2-r12 . PMC   395750 . PMID   14759262. Archived (PDF) from the original on 2019-07-11. Retrieved 2021-05-06.
  6. Kurtz, S. (1999). "Reducing the Space Requirement of Suffix Trees". Software: Practice and Experience. 29 (13): 1149–1171. doi:10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O. Archived from the original on 2021-05-06. Retrieved 2021-05-06.
  7. Marçais, Guillaume.; Pillippy, A.; Delcher, A.; Coston, R.; Salzberg, S.; Zimin, A. (2018). "MUMmer4: A fast and versatile genome alignment system". PLOS Computational Biology. 14 (1): e1005944. Bibcode:2018PLSCB..14E5944M. doi: 10.1371/journal.pcbi.1005944 . PMC   5802927 . PMID   29373581.