Original author(s) | Masatoshi Nei, Sudhir Kumar, Koichiro Tamura, Glen Stecher, Daniel Peterson, Nicholas Peterson |
---|---|
Developer(s) | Pennsylvania State University |
Initial release | 1993 |
Stable release | 11.0.13 / June 2022 [1] |
Operating system | Windows, OS X, Linux |
Platform | x86, x86-64 |
Available in | English |
Type | Bioinformatics |
License | Proprietary freeware |
Website | www |
Molecular Evolutionary Genetics Analysis (MEGA) is computer software for conducting statistical analysis of molecular evolution and for constructing phylogenetic trees. It includes many sophisticated methods and tools for phylogenomics and phylomedicine. It is licensed as proprietary freeware. The project to develop the software was initiated under the leadership of Masatoshi Nei in his laboratory at the Pennsylvania State University, in collaboration with his graduate student Sudhir Kumar and postdoctoral fellow Koichiro Tamura. [2] Nei wrote a monograph (130 pp.) outlining the scope of the software and presenting new statistical methods that were included in MEGA. The entire set of computer programs was written by Kumar and Tamura. Personal computers at the time could not send the monograph and software electronically, so both were delivered by postal mail. From the start, MEGA was intended to be easy to use and to include only solid statistical methods.
MEGA version 2 (MEGA2), which was coauthored by an additional investigator Ingrid Jakobson, was released in 2001. [3] All the computer programs and the readme files of this version could be sent electronically due to advances in computer technology. Around this time, the leadership of the MEGA project was taken over by Kumar (now at Temple University) and Tamura (now at Tokyo Metropolitan University). The monograph Molecular Evolutionary Genetics Analysis was often used as a textbook for new ways to study molecular evolution.
MEGA has been updated and expanded several times, and all of these versions are currently available from the MEGA website. MEGA7 was optimized for use on 64-bit computing systems. MEGA is available in two versions. A graphical user interface is available as a native Microsoft Windows program. A command line version, MEGA-Computing Core (MEGA-CC), is available for native cross-platform operation. The software is widely used and cited. With millions of downloads across the releases, MEGA has been cited in more than 85,000 papers. The 5th version was cited over 25,000 times in 4 years. [4]
Alignment Editor ― Within MEGA, the Alignment Editor is a tool for building and editing multiple sequence alignments. It integrates both the ClustalW and MUSCLE programs. All actions take place in the Analysis Explorer, which can be found in the main menu of MEGA. When a new alignment is being created, the user is presented with three options: create a new alignment, open a saved alignment session, or retrieve sequences from a file (importing sequences from NCBI). Once an option is selected, the user can choose either ClustalW or MUSCLE from the Alignment tab located at the top of the page. Parameters for the selected alignment program can then be specified, and a progress bar appears while the alignment is being computed. Aligned sequences replace the unaligned ones in the main section of the Alignment Editor. To perform further analysis in MEGA, it is advisable to save the alignment session in either MEGA or FASTA format. [5]
Trace Data File Viewer/Editor ― The Trace Data File Viewer/Editor has many functionalities in the following three menus. All the commands are used to help specialize searches and alignments in MEGA.
Integrated web browser, sequence fetching ― MEGA comes with a built-in web browser that allows users to access GenBank sequence data from the NCBI website. The integrated web browser can be accessed when creating a new alignment in the Alignment Editor. To successfully use sequences from NCBI, it is advised to change the searches to FASTA format and use the “Add to Alignment” button. Once completed, all the sequences will be imported into the MEGA application. [7]
One of the challenges associated with evolutionary genetic analysis is the presence of ambiguous states such as R, Y, and T. These states often arise from sequence errors or incomplete datasets. However, MEGA offers several resources to handle ambiguous states, including the deletion of sites that have an ambiguity score higher than the Site Coverage Cutoff parameter. [8]
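Coverage-based filtering of sites can be illustrated with a minimal sketch. It keeps an alignment column only when the fraction of sequences carrying an unambiguous base at that site reaches a cutoff. The function name `filter_sites_by_coverage`, the choice of A, C, G, and T as the only resolved states, and the reading of the cutoff as a minimum coverage fraction are assumptions made for illustration and are not taken from MEGA itself.

```python
# Assumption: only these four characters count as resolved states;
# gaps and IUPAC ambiguity codes (R, Y, N, ...) reduce coverage.
UNAMBIGUOUS = set("ACGT")


def filter_sites_by_coverage(alignment, cutoff=0.95):
    """Return the alignment with low-coverage columns removed.

    alignment: list of equal-length strings (aligned sequences).
    cutoff: minimum fraction of sequences that must carry an
            unambiguous base for a column to be kept.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")

    kept_columns = []
    for col in range(length):
        resolved = sum(1 for seq in alignment if seq[col].upper() in UNAMBIGUOUS)
        if resolved / len(alignment) >= cutoff:
            kept_columns.append(col)

    # Rebuild each sequence from the retained columns only.
    return ["".join(seq[c] for c in kept_columns) for seq in alignment]


aln = ["ACGT", "ACRT", "A-GT", "ACGN"]
strict = filter_sites_by_coverage(aln, cutoff=1.0)    # complete deletion
relaxed = filter_sites_by_coverage(aln, cutoff=0.75)  # partial deletion
```

With a cutoff of 1.0 the sketch behaves like complete deletion, while lower cutoffs retain sites where only a few sequences are ambiguous.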
MEGA's extended format allows users to save all data attributes, such as sequence length, nucleotide positions, gaps, and ambiguous states. [9] Additionally, MEGA supports data import from other formats, such as Clustal, which ensures a seamless transition between popular file types. [10]
After importing a dataset, MEGA provides multiple different data viewer options. For example, users can view statistical attributes and select subsets in the Sequence Data Explorer or use the Distance Data Explorer to inspect pairwise distance data. [11] Another feature of MEGA is the visual specification of domain groups. This allows users to group sequences by a specific characteristic and view subsequent phylogenetic trees.
MEGA offers support for modifying the genetic code used for translating DNA sequences. By default, MEGA has 23 built-in genetic code variations including the standard code, vertebrate mitochondrial code, Drosophila mitochondrial code, and yeast mitochondrial code. [12] Users may add, remove, or edit any genetic code table.
Genetic code name |
---|
Standard |
Vertebrate Mitochondrial |
Invertebrate Mitochondrial |
Yeast Mitochondrial |
Mold Mitochondrial |
Protozoan Mitochondrial |
Coelenterate Mitochondrial |
Mycoplasma |
Spiroplasma |
Ciliate Nuclear |
Dasycladacean Nuclear |
Hexamita Nuclear |
Echinoderm Mitochondrial |
Euplotid Nuclear |
Bacterial Plastid |
Plant Plastid |
Alternative Yeast Nuclear |
Ascidian Mitochondrial |
Flatworm Mitochondrial |
Blepharisma Mitochondrial |
Chlorophycean Mitochondrial |
Trematode Mitochondrial |
Scenedesmus obliquus Mitochondrial |
Thraustochytrium Mitochondrial |
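The effect of choosing a different genetic code table can be sketched as follows. The example builds the standard code from its conventional 64-character representation (bases in T, C, A, G order) and derives the vertebrate mitochondrial code by overriding the four codons that differ. The names `build_code` and `translate` are hypothetical helpers for illustration, not part of MEGA.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_code(overrides=None):
    """Return a codon -> amino acid dict, optionally with reassigned codons."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    table = dict(zip(codons, STANDARD_AA))
    table.update(overrides or {})
    return table


STANDARD = build_code()
# Vertebrate mitochondrial code: AGA/AGG are stops, ATA is Met, TGA is Trp.
VERTEBRATE_MITO = build_code({"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"})


def translate(dna, table):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


seq = "ATGATATGAAGA"
standard_protein = translate(seq, STANDARD)
mito_protein = translate(seq, VERTEBRATE_MITO)
```

The same sequence yields different proteins under the two tables, which is why the code table must match the genome being analyzed.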
In addition, MEGA can compute the degeneracy of each codon position in a genetic code table, as well as the number of synonymous and non-synonymous sites, using the Nei-Gojobori method. [13]
The Caption Expert is a part of MEGA that provides publication-like detailed captions based on the properties of analysis results. It can be used with distance matrices, phylogenies, tests, and other analyses within MEGA. [15]
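Codon position degeneracy can be illustrated with a short sketch. For each of the three positions in a codon, it counts how many of the four possible bases at that position leave the encoded amino acid unchanged under the standard code. A count of 1 corresponds to a 0-fold degenerate (fully non-synonymous) position and a count of 4 to a 4-fold degenerate position. The function name `position_degeneracy` is hypothetical, and how MEGA classifies intermediate cases (such as the third position of isoleucine codons) is not assumed here.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                STANDARD_AA))


def position_degeneracy(codon, table=CODE):
    """Return, for each codon position, the number of bases (including the
    original one) that keep the encoded amino acid unchanged."""
    codon = codon.upper()
    original = table[codon]
    counts = []
    for pos in range(3):
        same = 0
        for base in BASES:
            variant = codon[:pos] + base + codon[pos + 1:]
            if table[variant] == original:
                same += 1
        counts.append(same)
    return counts


glycine = position_degeneracy("GGT")     # third position is 4-fold
methionine = position_degeneracy("ATG")  # every position is 0-fold
leucine = position_degeneracy("CTA")     # first position is 2-fold
```

Summing such per-position classes over a sequence is the starting point for counting synonymous and non-synonymous sites.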
MEGA's integrated text file editor enables users to edit text files without the need for another program. Features like columnar block selection-editing aid in the performance of bulk operations, like changing letter case or font size. Additionally, the editor includes line numbers to assist with the navigation of large files and identifying areas of interest. [16]
MEGA also provides several tools to format sequences. For example, the built-in reverse complement utility reverses the order of characters and replaces each with its complement. [17]
The screenshots demonstrate the use of MEGA's reverse complement tool. The original sequence was reversed and each nucleotide was replaced with its complement to produce the reverse complement.
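The reverse complement operation described above can be sketched in a few lines. The function name `reverse_complement` is illustrative, and the sketch handles only the four standard bases, leaving any other character unchanged.

```python
# Translation table mapping each base to its complement (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the order of the characters."""
    return seq.translate(_COMPLEMENT)[::-1]


result = reverse_complement("ATGC")
```

A sequence such as AAGCTT is its own reverse complement, which makes it a convenient sanity check.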
MEGA provides a graphical interface for displaying and manipulating aligned nucleotide and protein sequences. [18] The Sequence Data Explorer has multiple menu functionalities to help with exporting data, searching alignments, changing display features, highlighting sites, and computing statistics:
Substitution models in MEGA offer options covering different attributes of substitution for both DNA and protein sequences. The user may choose the substitution type, model, and related settings that best fit the chosen data. The three main ways of describing a substitution pattern are the 4x4 Rate Matrix, the Transition-Transversion Rate Ratios (k1, k2), and the Transition-Transversion Rate Bias (R).
Transition-Transversion Rate Ratios (k1, k2) – The transition-transversion rate ratio is the ratio of the transition rate (a) to the transversion rate (b), given by k = a/b. [24]
Transition-Transversion Rate Bias (R) – The transition-transversion rate bias R is the ratio of the number of transitions to the number of transversions between a pair of sequences. MEGA allows the user to analyze the data with a specified value of R. When R equals 0.5, there is no bias towards either transition or transversion substitutions, because each nucleotide has two possible transversions but only one possible transition. [24]
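As a simple illustration of the ratio of transitions to transversions between a pair of sequences, the sketch below counts observed differences in two aligned sequences. It is an observed-count ratio only; model-based estimates of R, as used in MEGA, involve the substitution model and are not reproduced here. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences.

    Sites with gaps or ambiguous characters in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions


def observed_ratio(seq1, seq2):
    """Return transitions / transversions, or None if no transversions."""
    ts, tv = count_ts_tv(seq1, seq2)
    return ts / tv if tv else None


ts, tv = count_ts_tv("AACCGGTT", "GATCGTTA")
ratio = observed_ratio("AACCGGTT", "GATCGTTA")
```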
MEGA offers several approaches for testing substitution pattern homogeneity, such as composition distance, disparity index, and Monte Carlo tests. These methods are used to determine if different genetic regions evolved under the same selective pressure.
Composition distance measures the variation in nucleotide composition between two sequences. MEGA computes this figure per site and excludes any gaps or missing data. A larger distance suggests that the regions evolved under different selective pressures. [25]
The disparity index evaluates the difference in substitution patterns for a given pair of sequences. This value is calculated per site and is thought to be more dynamic than the chi-square test. A large difference implies that the pattern of substitution was not the same for the given pair of sequences. [25]
The Monte Carlo test is another approach to test substitution pattern homogeneity that involves running a null distribution simulation. MEGA requires the user to specify the number of replicates and a starting seed. For a significant result, many simulations must be performed. [25] Therefore, it is essential to consider the computational cost of the algorithm.
Monte Carlo method | Computational complexity |
---|---|
MC + exact simulation | |
MC + tau-leaping | |
MC + midpt. or trap. tau-leaping | |
MC + Euler for diff. approx. | |
The table above shows the computational complexity of different Monte Carlo methods as approaches infinity in relation to the parameter . While it's not clear which method MEGA employs, it is likely to be computationally intensive because all the methods listed in the table have a computational order greater than .
MEGA offers a wide variety of options for calculating the evolutionary distance between a pair of nucleotide or amino acid sequences, with or without standard errors. [27] Distance methods are divided into three categories: nucleotide, synonymous-nonsynonymous, and amino acid.
After selecting a distance method, a subset of attributes will become visible when applicable. The attributes are Substitutions to Include, Transition/Transversion Ratio, Pattern among Lineages, and Rates among Sites. For example, if a model has a rate variation, the gamma parameter will become visible. In addition, every distance method provides options for handling gap and missing data, and codon position if applicable. [30]
Every substitution matrix has its own use case. One of the simplest models is the Jukes-Cantor model, which assumes equal mutation rates. The Kimura 2-Parameter model extends it by distinguishing between transition rates (A↔G and C↔T) and transversion rates (A↔C, A↔T, C↔G, and G↔T). The Kimura 3-Parameter model in turn distinguishes between transversions that conserve the nucleotide's weak/strong property (A↔T and C↔G) and transversions that conserve the nucleotide's amino/keto property (A↔C and G↔T). However, each extension adds parameters and increases the risk of overfitting. The best substitution matrix depends on the data used. To help with selection, MEGA provides a Find Best-Fit Substitution Model option in the Model tab that runs each model and assigns it a Bayesian information criterion score.
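The relationship between the two simplest models can be made concrete with their standard closed-form distance formulas. In the sketch below, `p` is the observed proportion of differing sites, and `P` and `Q` are the observed proportions of transitional and transversional differences. The function names are illustrative, and the sketch does not cover rate variation among sites or any other option MEGA provides.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the distance to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance:
    d = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the distance to exist")
    return -0.5 * math.log(a * math.sqrt(b))


d_jc = jukes_cantor(0.10)
# When one third of the differences are transitions (the Jukes-Cantor
# expectation), the two models give the same distance.
d_k2p = kimura_2p(0.10 / 3, 0.20 / 3)
```

When transitions are over-represented relative to that expectation, the Kimura distance exceeds the Jukes-Cantor distance for the same total proportion of differences.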
Large sample Z-test – The Z-test compares the relative abundance of synonymous and nonsynonymous substitutions within a gene sequence, with the main objective of detecting positive selection. The test requires estimates of the number of synonymous substitutions per synonymous site (dS) and of nonsynonymous substitutions per nonsynonymous site (dN), along with their variances, Var(dS) and Var(dN). The formula used for the Z-test is:
Z = (dN - dS) / SQRT(Var(dS) + Var(dN))
If dN is greater than dS, it indicates positive selection, while if dN is less than dS, it indicates purifying selection. The sign of Z therefore shows the direction of selection, and its magnitude, which depends on the variances of the synonymous and nonsynonymous estimates, determines whether the difference is statistically significant. In MEGA, the variances can be estimated with analytical formulas or by bootstrap resampling. [31]
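The Z statistic and a one-tailed probability for it can be computed directly, as in the following sketch. The function names and the example values are illustrative, and the use of a standard normal upper tail for the test of positive selection (dN > dS) is the usual large-sample assumption rather than a statement about MEGA's internals.

```python
import math


def z_statistic(dN, dS, var_dN, var_dS):
    """Z = (dN - dS) / sqrt(Var(dS) + Var(dN))."""
    denom = math.sqrt(var_dS + var_dN)
    if denom == 0:
        raise ValueError("the sum of the variances must be positive")
    return (dN - dS) / denom


def upper_tail_p(z):
    """P(Z' >= z) for a standard normal variable Z'."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical estimates, chosen only to make the arithmetic easy to follow.
z = z_statistic(dN=0.3, dS=0.1, var_dN=0.005, var_dS=0.005)
p = upper_tail_p(z)
```

Here the difference of 0.2 is divided by a standard error of 0.1, so Z is about 2 and the one-tailed probability is about 0.023.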
Fisher's exact test – Fisher's exact test examines synonymous and nonsynonymous substitutions in sequences and is used as a one-tailed test for positive selection when samples are small. The null hypothesis of neutrality is rejected when the P-value is less than 0.05. If the differences per synonymous site exceed those per nonsynonymous site, MEGA assigns a P-value of 1, indicating purifying selection rather than positive selection. [32] The test is based on the hypergeometric distribution, whose probabilities are ratios of factorials of the table counts (Hoffman). [33] The factorial terms grow very quickly with n, but the number of arithmetic operations needed to evaluate them does not grow factorially.
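A one-tailed Fisher's exact test on a 2x2 table can be sketched with binomial coefficients, which makes the hypergeometric structure explicit. The layout of the table (which cell holds nonsynonymous differences, and so on) is an assumption left to the caller, and the function name is hypothetical. The sketch requires Python 3.8 or later for `math.comb`.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """Upper-tail P-value for the 2x2 table [[a, b], [c, d]].

    With all margins fixed, the top-left cell follows a hypergeometric
    distribution; the P-value is the probability of observing a value
    at least as large as a.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    numerator = 0
    for x in range(a, min(row1, col1) + 1):
        # Ways to place x of the col1 items in row 1 and the rest in row 2.
        numerator += comb(row1, x) * comb(n - row1, col1 - x)
    return numerator / comb(n, col1)


p_value = fisher_one_tailed(3, 1, 1, 3)
```

For this table the two qualifying outcomes contribute 16 and 1 of the 70 equally likely arrangements, giving a P-value of 17/70.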
Tajima's Neutrality Test — The purpose of Tajima's Neutrality Test is to assess the relationship between the number of segregating sites per site and nucleotide diversity. When alleles are selectively neutral the product 4Nv can be estimated in two ways. N represents the effective population size and v is the mutation rate per site. By calculating the difference between these estimates, one can determine if there is evidence of non-neutral evolution. [34]
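The two estimates being compared can be sketched as follows: one is based on the number of segregating sites S divided by the harmonic-number term a1 = 1 + 1/2 + ... + 1/(n-1), and the other is the nucleotide diversity, the average number of pairwise differences. Both are reported per site. The function name is hypothetical, and the normalization of the difference by its standard deviation, which gives Tajima's D statistic, is omitted.

```python
from itertools import combinations


def tajima_estimates(seqs):
    """Return (theta_from_S, pi), both per site, for aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    # S: number of alignment columns with more than one state.
    segregating = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    a1 = sum(1.0 / i for i in range(1, n))
    theta_from_s = segregating / a1

    # pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(seqs, 2))
    total = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    pi = total / len(pairs)

    return theta_from_s / length, pi / length


theta, pi = tajima_estimates(["AAAA", "AAAT", "AATT"])
difference = pi - theta
```

In this small example both estimates equal 1/3 per site, so their difference is zero, which is the expectation under neutrality.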
The molecular clock hypothesis suggests that all sequences have evolved at a constant rate over time. Therefore, the molecular clock test evaluates this statement in conjunction with the data provided by the user. In MEGA, this test is performed by applying a maximum likelihood test to a given tree topology and sequence alignment. This produces two log-likelihood values, one with the clock hypothesis and one without. [35]
Another approach offered by MEGA is Tajima's relative rate test. This method compares the number of substitutions per site between different sequences. If the resulting numbers differ by a large factor, the molecular clock hypothesis may not be valid for the given data set. [36]
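A relative rate comparison of this kind can be sketched using the formulation in which two sequences are compared against an outgroup: m1 counts sites where only the first sequence differs from the other two, m2 counts sites where only the second differs, and (m1 - m2)^2 / (m1 + m2) is referred to a chi-square distribution with one degree of freedom. The function name is hypothetical, and whether MEGA's implementation matches this sketch in every detail (for example in its handling of gaps) is an assumption.

```python
import math


def relative_rate_test(seq1, seq2, outgroup):
    """Return (m1, m2, chi_square, p_value) for three aligned sequences."""
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if a != b:
            if b == o:
                m1 += 1  # change unique to the lineage of seq1
            elif a == o:
                m2 += 1  # change unique to the lineage of seq2
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi_square = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return m1, m2, chi_square, p_value


m1, m2, chi_square, p_value = relative_rate_test(
    "TTTTTTAA", "AAAAAATT", "AAAAAAAA")
```

With six unique changes in one lineage and two in the other, the statistic is 2.0 and the P-value is about 0.157, so equal rates would not be rejected at the 5% level.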
MEGA offers five methods for building a phylogenetic tree:
Each method allows for a bootstrap phylogeny test with any number of replications. Neighbor joining and minimum evolution allow for an interior-branch test instead. The substitution models and parameters are the same as for the distance estimation methods.
MEGA provides a graphical interface for displaying a phylogenetic tree based on a variety of options. In the view menu, the tree can be displayed in three different styles: traditional, radiation, or circle. Traditional trees have three different branch styles: rectangular, straight, or curved. The view menu also offers toggling topology scaling, changing font type and size, arranging taxa, showing/hiding various details, and a general option for more control over the tree drawing aspects. [37]
The subtree menu provides options for manipulating the tree, such as swapping branches, flipping the order of lineages, compressing or expanding subtrees, and moving the tree's root. A subtree can also be displayed in its own tree explorer with all the same features and options. [38] The compute menu provides options for computing a condensed tree, a consensus tree, or a timetree with or without a molecular clock. [39] The file menu provides options for saving, exporting, printing, and exiting. The tree topology can be exported to a file in MEGA tree format or, for timetrees, exported in a tabular format with the relevant information used when constructing the timetree. Other export options include the current timetree calibrations, analysis summary, partition list, and pairwise distances. [40] The tree explorer also provides options, under the image menu, to save the current tree display in an image format or to the clipboard. The supported image formats are BMP, PNG, PDF, SVG, TIFF, and EMF. [41] If the user chooses to build the tree with bootstrap replication, the tree explorer has two tabs, one with the original tree and one with the bootstrap consensus tree.