MAFFT

Developer(s): Kazutaka Katoh
Initial release: 2002
Stable release: 7.526 / April 2024
Written in: C
Operating system: Unix, Linux, Mac, Windows
Type: Bioinformatics tool
Licence: BSD, GPL, others [1]
Website: mafft.cbrc.jp/alignment/software

In bioinformatics, MAFFT (multiple alignment using fast Fourier transform) is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform. [2] Subsequent versions of MAFFT have added other algorithms and modes of operation, [3] including options for faster alignment of large numbers of sequences, [4] higher accuracy alignments, [5] alignment of non-coding RNA sequences, [6] and the addition of new sequences to existing alignments. [7]

History

There have been many versions of the MAFFT software; the notable generations are summarized in the timeline figure.

Figure: A timeline outlining the different versions of MAFFT since 2002, with brief descriptions of each notable generation of the software.

Algorithm

The MAFFT algorithm proceeds in five steps: pairwise alignment, distance calculation, guide tree construction, progressive alignment, and iterative refinement. [8]
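As a rough illustration of the distance calculation and guide tree steps, the following Python sketch clusters sequences by average linkage (UPGMA-style). It is not MAFFT's implementation: the k-mer Jaccard distance, the function names, and the nested-tuple tree representation are assumptions chosen to keep the example short.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Stand-in distance: 1 minus the Jaccard similarity of the k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def guide_tree(seqs, k=3):
    """Build a guide tree by average-linkage clustering.

    seqs maps sequence names to sequences. The tree is returned as
    nested tuples of names, e.g. ("c", ("a", "b")).
    """
    names = list(seqs)
    if not names:
        raise ValueError("at least one sequence is required")
    # Distance calculation: one value per unordered pair of sequences.
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    # Guide tree construction: repeatedly merge the two closest clusters.
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

In a progressive aligner, the resulting tree fixes the order in which sequences and groups of sequences are aligned: the most similar pair is aligned first, and more distant sequences are added later.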

Input/output

Web form

Input

Figure: Steps of how to use MAFFT with other programs to view an MSA.

This program can take in multiple sequences as input, which can be entered in two ways:

Sequence input window
Figure: An example of FASTA format (FAM149A promoter region). More input formats are shown at https://www.ebi.ac.uk/seqdb/confluence/display/JDSAT/Multiple+Sequence+Alignment+Tool+Input+Examples

The user can directly enter three or more sequences in the input window in any of the following formats: GCG, FASTA, EMBL (nucleotide only), GenBank, PIR, NBRF, PHYLIP, or UniProtKB/Swiss-Prot (protein only). It is important to note that partly formatted sequences are not accepted, and adding a return to the end of the sequence may help certain applications understand the input. It is also advised to avoid using data from word processors as hidden/control characters may be present. [11]
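To make the idea of a fully formatted FASTA input concrete, here is a minimal Python sketch of a reader that rejects sequence data appearing before any header line. The function name parse_fasta and the choice to key records by the first word of the header are assumptions for this example; it is not the parser used by the web form.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence.

    Raises ValueError for partly formatted input, such as sequence
    lines before the first '>' header or a header with no name.
    """
    records = {}
    name = None
    for raw_line in text.splitlines():  # handles both "\n" and "\r\n"
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line has no sequence name")
            name = fields[0]
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before first '>' header")
        else:
            records[name].append(line)
    # Join the wrapped sequence lines of each record.
    return {key: "".join(parts) for key, parts in records.items()}
```

A caller could then check that the result holds three or more records before submitting it for alignment.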

Sequence file upload

The user can upload a file containing three or more valid sequences in any format mentioned above. Word processor files may yield unpredictable results due to the presence of hidden/control characters, so it is best to save files with the Unix format option to avoid hidden Windows characters. Once the file is uploaded, it can be used as input for multiple sequence alignment. [11]

Text files saved in DOS or Windows format have different line endings from those saved on Unix and Linux. DOS and Windows use a carriage return followed by a line feed ("\r\n") to mark the end of a line, while Unix and Linux systems use only a line feed ("\n"). [12]

When transferring files between Windows and Unix-based systems, it's important to be aware of these differences to ensure that the line endings are correctly translated. Otherwise, the hidden carriage return characters in the Windows-formatted files may cause issues when viewed or edited on Unix-based systems, and vice versa. [12]
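One way to avoid stray carriage returns is to normalize line endings before uploading a file. The Python sketch below shows the idea; the function name normalize_line_endings is hypothetical, and the extra handling of lone "\r" characters (old Mac-style endings) is an assumption that goes beyond the two formats described above.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert Windows ("\\r\\n") and lone "\\r" line endings to Unix ("\\n")."""
    # Replace the two-byte sequence first so it is not converted twice.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Working on bytes rather than text keeps the function independent of the file's character encoding. Dedicated utilities such as dos2unix perform the same conversion on the command line.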

Output

The user can request the multiple sequence alignment (MSA) in one of two formats:

Figure: Example of ClustalW output.

Output format    Description                                                Abbreviation
Pearson/FASTA    Pearson or FASTA sequence format                           fasta
ClustalW         ClustalW alignment format without base/residue numbering   clustalw

Default value is: Pearson/FASTA [fasta]

Understanding ClustalW output:

Symbol   Definition   Meaning
*        asterisk     Conserved sequence (identical)
:        colon        Conservative mutation
.        period       Semi-conservative mutation
( )      blank        Non-conservative mutation
-        dash         Gap
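The asterisk rule can be sketched in a few lines of Python. This is a partial illustration only: the function name identity_line is hypothetical, and only fully identical, gap-free columns are marked. The colon and period symbols depend on amino acid similarity groups that are not reproduced here, so those columns are left blank.

```python
def identity_line(rows):
    """Return a consensus line with '*' under fully identical columns.

    rows is a list of aligned sequences of equal length. Columns that
    contain a gap ('-') or more than one distinct residue get a space.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    marks = []
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)
```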

Settings

Many settings affect how the MAFFT algorithm works, and adjusting them to the data at hand is the best way to get accurate and meaningful results. The most important settings to understand are the scoring matrix, the gap open penalty, and the gap extension penalty.
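The interaction between the two gap settings can be shown with a small worked example. The Python sketch below assumes one common affine convention, in which the first gap position costs the gap open penalty and each further position costs the gap extension penalty; other tools define the terms slightly differently. The function name and the penalty values are illustrative and are not MAFFT defaults.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a single gap of the given length under an affine model.

    Assumed convention: the first gap position costs gap_open and each
    additional position costs gap_extend.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


# One gap of length 4 versus two separate gaps of length 2,
# using illustrative penalties of 10.0 (open) and 0.5 (extend).
one_long_gap = affine_gap_cost(4, 10.0, 0.5)        # 10.0 + 3 * 0.5 = 11.5
two_short_gaps = 2 * affine_gap_cost(2, 10.0, 0.5)  # 2 * (10.0 + 0.5) = 21.0
```

With a high open penalty and a low extension penalty, the aligner prefers a few long gaps over many short ones. Raising the extension penalty makes long gaps more expensive.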

Accuracy and results

MAFFT is widely considered one of the most accurate and versatile tools for multiple sequence alignment. Studies have shown that it performs well compared with other popular programs such as ClustalW and T-Coffee, particularly on larger datasets and highly divergent sequences. [16] For example, in a study comparing alignment programs on increasing sequence lengths, MAFFT's FFT-NS-2 method was the fastest program for all tested sequence sizes. This speed comes from its use of the fast Fourier transform (FFT), which enables rapid alignment of even highly divergent sequences. With the FFT, the algorithm runs in either O(n^2) or O(n) time, depending on the data set. MAFFT takes less CPU time than other methods of the same or similar accuracy, notably T-Coffee, ClustalW, and Needleman-Wunsch. [2]
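The role of the FFT can be illustrated with a small sketch. In the 2002 paper, sequences are converted to numeric signals and the correlation between two signals is computed with the FFT, so that peaks in the correlation indicate offsets at which similar segments line up. [2] The Python code below is a simplified stand-in, not MAFFT's code: it encodes nucleotides as per-base indicator vectors (an assumption made for this example), uses a textbook recursive radix-2 FFT, and the names fft and best_offset are hypothetical.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With invert=True the unnormalized inverse transform is returned,
    so the caller must divide by the length.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def best_offset(s1, s2, alphabet="ACGT"):
    """Find the lag k that maximizes the number of positions n with
    s1[n] == s2[n + k], using FFT-based cross-correlation.

    Returns (lag, number_of_matches_at_that_lag).
    """
    # Zero-pad to a power of two so circular correlation does not wrap.
    size = 1
    while size < len(s1) + len(s2):
        size *= 2
    total = [0.0] * size
    for base in alphabet:
        a = [1.0 if c == base else 0.0 for c in s1] + [0.0] * (size - len(s1))
        b = [1.0 if c == base else 0.0 for c in s2] + [0.0] * (size - len(s2))
        fa, fb = fft(a), fft(b)
        # Cross-correlation theorem: corr = IFFT(conj(FFT(a)) * FFT(b)).
        product = [x.conjugate() * y for x, y in zip(fa, fb)]
        corr = fft(product, invert=True)
        for k in range(size):
            total[k] += corr[k].real / size
    best = max(range(size), key=lambda k: total[k])
    # Indices below len(s2) are non-negative lags; the rest wrap to negative lags.
    lag = best if best < len(s2) else best - size
    return lag, round(total[best])
```

Computing every lag directly takes time proportional to the product of the two sequence lengths, whereas the FFT route takes O(n log n) for padded length n. That difference is the motivation for using the transform to locate candidate regions before the more expensive alignment step.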

Later versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, [9] higher accuracy alignments, [17] alignment of non-coding RNA sequences, [18] and the addition of new sequences to existing alignments. [19]

Compared with other popular programs such as ClustalW and T-Coffee, MAFFT is noted for its accuracy, versatility, and range of features. It offers several alignment methods and strategies, including iterative refinement and consistency-based approaches, that further improve the accuracy and robustness of alignments. As a result, MAFFT is widely recognized as a powerful tool for multiple sequence alignment. [20]

References

  1. The base MAFFT software is distributed under one of the BSD licenses, while versions for Microsoft Windows are licensed under a GNU General Public License. Some distributions of MAFFT contain software licensed under other licenses. https://mafft.cbrc.jp/alignment/software/
  2. Katoh, Kazutaka; Misawa, Kazuharu; Kuma, Kei-ichi; Miyata, Takashi (2002). "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform". Nucleic Acids Research. 30 (14): 3059–66. doi:10.1093/nar/gkf436. PMC 135756. PMID 12136088.
  3. "MAFFT ver.7 - a multiple sequence alignment program". mafft.cbrc.jp. Retrieved 28 April 2021.
  4. Katoh, K.; Toh, H. (2006). "PartTree: An algorithm to build an approximate tree from a large number of unaligned sequences". Bioinformatics. 23 (3): 372–4. doi:10.1093/bioinformatics/btl592. PMID 17118958.
  5. Katoh, K.; Kuma, K.; Miyata, T.; Toh, H. (2005). "Improvement in the accuracy of multiple sequence alignment program MAFFT". Genome Informatics. International Conference on Genome Informatics. 16 (1): 22–33. PMID 16362903.
  6. Katoh, Kazutaka; Toh, Hiroyuki (2008). "Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework". BMC Bioinformatics. 9: 212. doi:10.1186/1471-2105-9-212. PMC 2387179. PMID 18439255.
  7. Katoh, Kazutaka; Frith, Martin C (2012). "Adding unaligned sequences into an existing alignment using MAFFT and LAST". Bioinformatics. 28 (23): 3144–6. doi:10.1093/bioinformatics/bts578. PMC 3516148. PMID 23023983.
  8. The base MAFFT software is released under one of the BSD licenses, while versions for Microsoft Windows are released under a GNU General Public License. Some distributions of MAFFT contain software licensed under other licenses. https://mafft.cbrc.jp/alignment/software/
  9. Katoh, K.; Standley, D. M. (April 2013). "MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability". Molecular Biology and Evolution. 30 (4): 772–780. doi:10.1093/molbev/mst010. PMC 3603318. PMID 23329690.
  10. Katoh, Kazutaka; Toh, Hiroyuki (July 2008). "Recent developments in the MAFFT multiple sequence alignment program". Briefings in Bioinformatics. 9 (4): 286–298. doi:10.1093/bib/bbn013. PMID 18372315.
  11. "MAFFT Help and Documentation - Job Dispatcher Sequence Analysis Tools - EMBL-EBI". www.ebi.ac.uk. Retrieved 2023-04-24.
  12. "Windows vs. Unix Line Endings". www.cs.toronto.edu. Retrieved 2023-04-27.
  13. Pearson, William R. (October 2013). "Selecting the Right Similarity‐Scoring Matrix". Current Protocols in Bioinformatics. 43 (1): 3.5.1–3.5.9. doi:10.1002/0471250953.bi0305s43. PMC 3848038. PMID 24509512.
  14. "ROSALIND: Glossary: Gap penalty".
  15. Carroll, Hyrum; Clement, Mark; Ridge, Perry; Snell, Quinn (October 2006). "Effects of Gap Open and Gap Extension Penalties". Faculty Publications.
  16. Edgar, Robert; Batzoglou, Serafim (June 2006). "Multiple sequence alignment". Current Opinion in Structural Biology. 16 (3): 368–373. doi:10.1016/j.sbi.2006.04.004. PMID 16679011.
  17. Katoh, Kazutaka (2010-04-28). "Parallelization of the MAFFT multiple sequence alignment program". Bioinformatics. 26 (15): 1899–1900. doi:10.1093/bioinformatics/btq224. PMC 2905546. PMID 20427515.
  18. Yamada, Kazunori (4 July 2016). "Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees". Bioinformatics. 32 (21): 3246–3251. doi:10.1093/bioinformatics/btw412. PMC 5079479. PMID 27378296.
  19. Katoh, Kazutaka (27 September 2012). "Adding unaligned sequences into an existing alignment using MAFFT and LAST". Bioinformatics. 28 (23): 3144–3146. doi:10.1093/bioinformatics/bts578. PMC 3516148. PMID 23023983.
  20. Edgar, R. C. (8 March 2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research. 32 (5): 1792–1797. doi:10.1093/nar/gkh340. PMC 390337. PMID 15034147.