Original author(s) | Previous Programmers: Charles P. Kollar, Nandita Mukhopadhyay, Lee Almasy, Mark Schroeder, William P. Mulvihill. |
---|---|
Developer(s) | Daniel E. Weeks, Robert V. Baron, Justin R. Stickel. |
Initial release | 16 January 2000 |
Stable release | 5.0.1 / 13 December 2018 |
Repository | |
Written in | C++ |
Operating system | Linux, Mac OS X, Microsoft Windows |
Type | Applied statistical genetics, Bioinformatics |
License | GNU General Public License version 3 |
Website | watson |
Mega2 is data manipulation software for applied statistical genetics. Mega is an acronym for Manipulation Environment for Genetic Analysis.
The software allows the applied statistical geneticist to convert data from several input formats to a large number of output formats suitable for analysis by commonly used software packages. [1] [2] [3] [4] In a typical human genetics study, the analyst often needs to use a variety of software programs to analyze the data, and these programs usually require that the data be formatted to their precise input specifications. Converting data into these multiple formats can be tedious, time-consuming, and error-prone. By providing validated conversion pipelines, Mega2 can accelerate analyses while reducing errors.
Mega2 produces a common intermediate data representation using SQLite3, which enables the data to be accessed by other programs and languages. In particular, the Mega2R R package converts the SQLite3 data into R data frames. Several R functions are provided that illustrate how data can be extracted from the data frames for common R analyses, such as those performed with SKAT and pedgene. The key capability is the efficient extraction of genotypes for chosen subsets of markers, which facilitates gene-based association testing by automating the loop over genes in the genome. Other functions convert the data to VCF format and to GenABEL format. More information is available in the Mega2R package documentation.
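The idea of pulling out the markers that fall within each gene from an SQLite database can be sketched in Python with the standard `sqlite3` module. This is only an illustration under stated assumptions: the table `demo_markers`, its columns, the gene names, and the coordinates are all hypothetical and do not reflect the actual Mega2 database schema or the Mega2R API.

```python
import sqlite3

# Hypothetical gene regions: gene name -> (chromosome, start, end).
DEMO_GENES = {"GENE_A": ("1", 50, 300), "GENE_B": ("2", 100, 200)}


def build_demo_db():
    """Create an in-memory database with a made-up marker table.

    The schema is an assumption for illustration only; it is NOT the
    schema that Mega2 writes.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo_markers (name TEXT, chrom TEXT, pos INTEGER)")
    conn.executemany(
        "INSERT INTO demo_markers VALUES (?, ?, ?)",
        [("rs1", "1", 100), ("rs2", "1", 250), ("rs3", "1", 900), ("rs4", "2", 150)],
    )
    return conn


def markers_in_gene(conn, chrom, start, end):
    """Return marker names on `chrom` with start <= pos <= end, ordered by position."""
    rows = conn.execute(
        "SELECT name FROM demo_markers "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    ).fetchall()
    return [name for (name,) in rows]


def markers_by_gene(conn, genes):
    """Loop over genes and collect the markers that lie inside each one."""
    return {gene: markers_in_gene(conn, *region) for gene, region in genes.items()}
```

Calling `markers_by_gene(build_demo_db(), DEMO_GENES)` yields one marker list per gene, which is the kind of per-gene subset that a gene-based association test would then consume.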
Mega2 has been used to facilitate genetic analyses of a wide variety of human traits, including hereditary dystonia, [5] Ehlers-Danlos syndrome, [6] multiple sclerosis, [7] and gliomas. [8] Articles citing Mega2 are listed in PubMed Central.
Mega2, which focuses on data reformatting, should not be confused with MEGA (Molecular Evolutionary Genetics Analysis), a program that focuses on molecular evolution and phylogenetics.
Mega2 accepts input data in a variety of widely used file formats. These contain, at a minimum, data about the phenotypes, the marker genotypes, any family structures, and map positions of the markers.
Input format | Description | Links |
---|---|---|
LINKAGE [9] [10] [11] [12] | pre-Makeped or post-Makeped formats | Linkage User Guide (PDF), LINKAGE format |
Mega2 [1] [2] [3] [4] | simplified/augmented LINKAGE-format | Mega2 format |
PLINK [13] | ped format or binary bed format | PLINK documentation |
VCF or BCF [14] | Variant Call Format or Binary Variant Call Format | Variant Call Format (Wikipedia entry), BCF documentation |
IMPUTE2 [15] [16] | IMPUTE2 GEN and BGEN Formats | IMPUTE2 documentation, GEN format, BGEN format |
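As an illustration of the kind of information these input files carry, the following sketch parses one line of a PLINK text `ped` file, which holds six leading columns (family ID, individual ID, father ID, mother ID, sex, phenotype) followed by two allele columns per marker. The names `PedRecord` and `parse_ped_line` are hypothetical and are not part of Mega2 or PLINK; marker map positions live in a separate map file and are not handled here.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    """One individual from a PLINK-style ped line (illustrative type)."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]


def parse_ped_line(line: str) -> PedRecord:
    """Split a whitespace-delimited ped line into pedigree fields and genotypes."""
    fields = line.split()
    # Six fixed columns, then an even number of allele columns.
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus allele pairs")
    alleles = fields[6:]
    # Pair consecutive alleles: columns 7-8 form marker 1, 9-10 marker 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)
```

For example, `parse_ped_line("FAM1 IND1 0 0 1 2 A G C C")` gives a record for individual `IND1` in family `FAM1` with two marker genotypes.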
Mega2 supports conversion to the following output formats.
Output format | Links |
---|---|
ASPEX format | ASPEX |
Allegro format [17] | |
Beagle format [18] [19] | BEAGLE |
CRANEFOOT format [20] | CRANEFOOT |
Eigenstrat format [21] [22] | EIGENSOFT |
FBAT format [23] | FBAT |
GeneHunter format [24] | GeneHunter |
GeneHunter-Plus format [25] | GeneHunter-Plus |
IQLS/Idcoefs format [26] [27] | IQLS, Idcoefs |
Linkage format [9] [10] [11] [12] | Linkage User Guide (PDF), LINKAGE format |
Loki format [28] | Loki |
MaCH/minimac3 format [29] [30] | MaCH, minimac3 |
MLBQTL format [31] | MLB-QTL |
Mega2 annotated format [1] [2] [3] [4] | Mega2 format |
Mendel format [32] | Mendel |
Merlin format [33] | Merlin |
Merlin/SimWalk2-NPL format [33] [34] | Merlin, SimWalk2 |
PANGAEA MORGAN format [35] [36] | MORGAN |
PAP format [37] | PAP |
PLINK format [13] (bed, lgen, or ped formats) | PLINK |
PREST format [38] [39] | PREST |
PSEQ format | PSEQ |
Pre-makeped LINKAGE format [9] [10] [11] [12] | Linkage User Guide (PDF), LINKAGE format |
ROADTRIPS format [40] | ROADTRIPS |
SAGE format | SAGE, openSAGE |
SHAPEIT format [41] [42] [43] [44] [45] | SHAPEIT |
SIMULATE format [46] | SIMULATE |
SLINK format [47] [48] | FASTSLINK |
SOLAR format [49] [50] | SOLAR |
SPLINK format [51] | SPLINK |
SUP format [48] [52] | SUP |
SimWalk2 format [34] | SimWalk2 |
Structure format [53] [54] [55] | Structure |
VCF format [14] | Variant Call Format (Wikipedia entry) |
Vintage Mendel format [32] [56] | Vintage Mendel |
Vitesse format [57] | Vitesse |
The Mega2 documentation is available in HTML and PDF formats.