Mega2, the Manipulation Environment for Genetic Analysis

Original author(s): Charles P. Kollar, Nandita Mukhopadhyay, Lee Almasy, Mark Schroeder, William P. Mulvihill (previous programmers)
Developer(s): Daniel E. Weeks, Robert V. Baron, Justin R. Stickel
Initial release: 16 January 2000
Stable release: 5.0.1 / 13 December 2018
Written in: C++
Operating system: Linux, Mac OS X, Microsoft Windows
Type: Applied statistical genetics, Bioinformatics
License: GNU General Public License version 3
Website: watson.hgen.pitt.edu/register/

Mega2 is data manipulation software for applied statistical genetics. The name Mega is an acronym for Manipulation Environment for Genetic Analysis.

The software allows applied statistical geneticists to convert their data from several input formats to a large number of output formats suitable for analysis by commonly used software packages. [1] [2] [3] [4] In a typical human genetics study, the analyst often needs several different programs to analyze the data, and each program usually requires that the data be formatted to its precise input specifications. Converting data into these multiple formats can be tedious, time-consuming, and error-prone. By providing validated conversion pipelines, Mega2 can accelerate analyses while reducing errors.

Mega2 produces a common intermediate data representation using SQLite3, which enables the data to be accessed by other programs and languages. In particular, the Mega2R R package converts the SQLite3 data into R data frames. Several R functions are provided that illustrate how data can be extracted from the data frames for common R analyses, such as SKAT and pedgene. The key capability is efficient extraction of the genotypes for chosen subsets of markers, which facilitates gene-based association testing by automating the loop over genes in the genome. Other functions convert the data to VCF format and to GenABEL format. More information is available in the Mega2R package documentation.
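
The idea of looping over genes and pulling out the genotypes of the markers that fall within each gene can be sketched in a few lines of Python. This is an illustration only, not Mega2 or Mega2R code: the table and column names (`markers`, `genes`, `genotypes`, `start_bp`, `end_bp`) and the helper functions are hypothetical and chosen for readability, and the actual layout of a Mega2 SQLite3 database may differ.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a HYPOTHETICAL schema (not Mega2's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE markers (name TEXT PRIMARY KEY, chromosome INTEGER, position INTEGER);
        CREATE TABLE genes (name TEXT PRIMARY KEY, chromosome INTEGER,
                            start_bp INTEGER, end_bp INTEGER);
        CREATE TABLE genotypes (person TEXT, marker TEXT, allele1 TEXT, allele2 TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO markers VALUES (?, ?, ?)",
        [("rs1", 1, 100), ("rs2", 1, 250), ("rs3", 2, 100)],
    )
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?, ?, ?)",
        [("GENE_A", 1, 50, 300), ("GENE_B", 2, 50, 150)],
    )
    conn.executemany(
        "INSERT INTO genotypes VALUES (?, ?, ?, ?)",
        [
            ("p1", "rs1", "A", "G"), ("p1", "rs2", "C", "C"), ("p1", "rs3", "T", "T"),
            ("p2", "rs1", "A", "A"), ("p2", "rs2", "C", "T"), ("p2", "rs3", "T", "G"),
        ],
    )
    return conn


def genotypes_for_gene(conn, gene_name):
    """Return genotype rows for all markers located inside the named gene."""
    query = """
        SELECT g.person, g.marker, g.allele1, g.allele2
        FROM genotypes AS g
        JOIN markers AS m ON m.name = g.marker
        JOIN genes AS ge
          ON ge.chromosome = m.chromosome
         AND m.position BETWEEN ge.start_bp AND ge.end_bp
        WHERE ge.name = ?
        ORDER BY m.position, g.person
    """
    return conn.execute(query, (gene_name,)).fetchall()


def genotypes_by_gene(conn):
    """Loop over every gene and collect its marker genotypes."""
    gene_names = [row[0] for row in conn.execute("SELECT name FROM genes ORDER BY name")]
    return {name: genotypes_for_gene(conn, name) for name in gene_names}


if __name__ == "__main__":
    for gene, rows in genotypes_by_gene(build_demo_db()).items():
        print(gene, rows)
```

Each per-gene result could then be handed to a gene-based association test. Keeping the marker selection in SQL means only the genotypes for the current gene are loaded at each step.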

Mega2 has been used to facilitate genetic analyses of a wide variety of human traits, including hereditary dystonia, [5] Ehlers-Danlos syndrome, [6] multiple sclerosis, [7] and gliomas. [8] A list of articles citing Mega2 is available on PubMed Central.

Mega2, which focuses on data reformatting, should not be confused with MEGA (Molecular Evolutionary Genetics Analysis), a program that focuses on molecular evolution and phylogenetics.

Input file formats

Mega2 accepts input data in a variety of widely used file formats. These contain, at a minimum, data about the phenotypes, the marker genotypes, any family structures, and map positions of the markers.

Input format | Description | Links
LINKAGE [9] [10] [11] [12] | pre-Makeped or post-Makeped formats | Linkage User Guide (PDF), LINKAGE format
Mega2 [1] [2] [3] [4] | simplified/augmented LINKAGE format | Mega2 format
PLINK [13] | ped format or binary bed format | PLINK documentation
VCF or BCF [14] | Variant Call Format or Binary Variant Call Format | Variant Call Format (Wikipedia entry), BCF documentation
IMPUTE2 [15] [16] | IMPUTE2 GEN and BGEN formats | IMPUTE2 documentation, GEN format, BGEN format
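
As a rough illustration of the minimum information these formats carry (family structure, phenotype, marker genotypes, and map positions), the following Python sketch parses one line of a PLINK-style text pedigree file and one line of a PLINK-style map file. It is not Mega2's parser: the `Person` class and both function names are hypothetical, and it assumes whitespace-separated fields with six leading pedigree columns followed by two alleles per marker.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    family_id: str
    person_id: str
    father_id: str   # "0" conventionally marks an unknown or founder parent
    mother_id: str
    sex: str         # PLINK convention: "1" male, "2" female
    phenotype: str   # PLINK convention: "1" unaffected, "2" affected
    genotypes: list = field(default_factory=list)  # one (allele1, allele2) pair per marker


def parse_ped_line(line):
    """Split one pedigree line into its six ID/phenotype columns and allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    alleles = fields[6:]
    pairs = list(zip(alleles[0::2], alleles[1::2]))
    return Person(*fields[:6], genotypes=pairs)


def parse_map_line(line):
    """Parse 'chromosome marker genetic_distance bp_position' into (marker, location)."""
    chromosome, marker, genetic_distance, bp_position = line.split()
    return marker, (chromosome, float(genetic_distance), int(bp_position))


if __name__ == "__main__":
    print(parse_ped_line("FAM1 3 1 2 2 2 A G C C"))
    print(parse_map_line("1 rs1 0.5 100"))
```

A converter would pair the i-th genotype of each person with the i-th marker in the map file, which is why the two files must list the markers in the same order.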

Output file formats

Mega2 supports conversion to the following output formats.

Output format | Links
ASPEX format | ASPEX
Allegro format [17] |
Beagle format [18] [19] | BEAGLE
CRANEFOOT format [20] | CRANEFOOT
Eigenstrat format [21] [22] | EIGENSOFT
FBAT format [23] | FBAT
GeneHunter format [24] | GeneHunter
GeneHunter-Plus format [25] | GeneHunter-Plus
IQLS/Idcoefs format [26] [27] | IQLS, Idcoefs
Linkage format [9] [10] [11] [12] | Linkage User Guide (PDF), LINKAGE format
Loki format [28] | Loki
MaCH/minimac3 format [29] [30] | MaCH, minimac3
MLBQTL format [31] | MLB-QTL
Mega2 annotated format [1] [2] [3] [4] | Mega2 format
Mendel format [32] | Mendel
Merlin format [33] | Merlin
Merlin/SimWalk2-NPL format [33] [34] | Merlin, SimWalk2
PANGAEA MORGAN format [35] [36] | MORGAN
PAP format [37] | PAP
PLINK format (bed, lgen, or ped formats) [13] | PLINK
PREST format [38] [39] | PREST
PSEQ format | PSEQ
Pre-makeped LINKAGE format [9] [10] [11] [12] | Linkage User Guide (PDF), LINKAGE format
ROADTRIPS format [40] | ROADTRIPS
SAGE format | SAGE, openSAGE
SHAPEIT format [41] [42] [43] [44] [45] | SHAPEIT
SIMULATE format [46] | SIMULATE
SLINK format [47] [48] | FASTSLINK
SOLAR format [49] [50] | SOLAR
SPLINK format [51] | SPLINK
SUP format [48] [52] | SUP
SimWalk2 format [34] | SimWalk2
Structure format [53] [54] [55] | Structure
VCF format [14] | Variant Call Format (Wikipedia entry)
Vintage Mendel format [32] [56] | Vintage Mendel
Vitesse format [57] | Vitesse

Documentation

The Mega2 documentation is available in HTML and PDF formats.

References

  1. Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE (1999). "Mega2, a data-handling program for facilitating genetic linkage and association analyses". Am J Hum Genet. 65: A436.
  2. Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE (2005). "Mega2: data-handling for facilitating genetic linkage and association analyses". Bioinformatics. 21 (10): 2556–2557. doi:10.1093/bioinformatics/bti364. PMID 15746282.
  3. Kollar CP, Baron RV, Mukhopadhyay N, Weeks DE (October 2013). "Mega2: enhanced data-handling for facilitating genetic linkage and association analyses". Presented at the 63rd Annual Meeting of the American Society of Human Genetics, Boston: Abstract 1831.
  4. Baron RV, Kollar C, Mukhopadhyay N, Weeks DE (2014). "Mega2: validated data-reformatting for linkage and association analyses". Source Code Biol Med. 9 (1): 26. doi:10.1186/s13029-014-0026-y. PMC 4269913. PMID 25687422.
  5. Hersheson J, Mencacci NE, Davis M, Macdonald N, Trabzuni D, Ryten M, Pittman A, Paudel R, Kara E, Fawcett K, Plagnol V, Bhatia KP, Medlar AJ, Stanescu HC, Hardy J, Kleta R, Wood NW, Houlden H (2013). "Mutations in the autoregulatory domain of beta-tubulin 4a cause hereditary dystonia". Ann Neurol. 73 (4): 546–553. doi:10.1002/ana.23832. PMC 3698699. PMID 23424103.
  6. Baumann M, Giunta C, Krabichler B, Ruschendorf F, Zoppi N, Colombi M, Bittner RE, Quijano-Roy S, Muntoni F, Cirak S, Schreiber G, Zou Y, Hu Y, Romero NB, Carlier RY, Amberger A, Deutschmann A, Straub V, Rohrbach M, Steinmann B, Rostasy K, Karall D, Bonnemann CG, Zschocke J, Fauth C (2012). "Mutations in FKBP14 cause a variant of Ehlers-Danlos syndrome with progressive kyphoscoliosis, myopathy, and hearing loss". Am J Hum Genet. 90 (2): 201–216. doi:10.1016/j.ajhg.2011.12.004. PMC 3276673. PMID 22265013.
  7. Dyment DA, Cader MZ, Chao MJ, Lincoln MR, Morrison KM, Disanto G, Morahan JM, De Luca GC, Sadovnick AD, Lepage P, Montpetit A, Ebers GC, Ramagopalan SV (2012). "Exome sequencing identifies a novel multiple sclerosis susceptibility variant in the TYK2 gene". Neurology. 79 (5): 406–411. doi:10.1212/wnl.0b013e3182616fc4. PMC 3405256. PMID 22744673.
  8. Shete S, Lau CC, Houlston RS, Claus EB, Barnholtz-Sloan J, Lai R, Il'yasova D, Schildkraut J, Sadetzki S, Johansen C, Bernstein JL, Olson SH, Jenkins RB, Yang P, Vick NA, Wrensch M, Davis FG, McCarthy BJ, Leung EH, Davis C, Cheng R, Hosking FJ, Armstrong GN, Liu Y, Yu RK, Henriksson R, Gliogene C, Melin BS, Bondy ML (2011). "Genome-wide high-density SNP linkage search for glioma susceptibility loci: results from the Gliogene Consortium". Cancer Res. 71 (24): 7568–7575. doi:10.1158/0008-5472.can-11-0013. PMC 3242820. PMID 22037877.
  9. Lathrop GM, Lalouel JM (1984). "Easy calculations of lod scores and genetic risks on small computers". Am J Hum Genet. 36 (2): 460–465. PMC 1684427. PMID 6585139.
  10. Lathrop GM, Lalouel JM, Julier C, Ott J (1985). "Multilocus linkage analysis in humans: detection of linkage and estimation of recombination". Am J Hum Genet. 37 (3): 482–498. PMC 1684598. PMID 3859205.
  11. Lathrop GM, Lalouel JM, White RL (1986). "Construction of human linkage maps: likelihood calculations for multilocus analysis". Genet Epidemiol. 3 (1): 39–52. doi:10.1002/gepi.1370030105. PMID 3957003. S2CID 29289413.
  12. Lathrop GM, Lalouel JM (1988). "Efficient computations in multilocus linkage analysis". Am J Hum Genet. 42 (3): 498–505. PMC 1715153. PMID 3162348.
  13. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. (2011). "The variant call format and VCFtools". Bioinformatics. 27 (15): 2156–8. doi:10.1093/bioinformatics/btr330. PMC 3137218. PMID 21653522.
  14. Howie BN, Donnelly P, Marchini J (2009). "A flexible and accurate genotype imputation method for the next generation of genome-wide association studies". PLOS Genet. 5 (6): e1000529. doi:10.1371/journal.pgen.1000529. PMC 2689936. PMID 19543373.
  15. Marchini J, Howie B (2010). "Genotype imputation for genome-wide association studies". Nat Rev Genet. 11 (7): 499–511. doi:10.1038/nrg2796. PMID 20517342. S2CID 1465707.
  16. Gudbjartsson DF, Jonasson K, Frigge ML, Kong A (2000). "Allegro, a new computer program for multipoint linkage analysis". Nat Genet. 25 (1): 12–13. doi:10.1038/75514. PMID 10802644. S2CID 27362146.
  17. Browning SR, Browning BL (2007). "Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering". Am J Hum Genet. 81 (5): 1084–1097. doi:10.1086/521987. PMC 2265661. PMID 17924348.
  18. Browning BL, Browning SR (2009). "A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals". Am J Hum Genet. 84 (2): 210–223. doi:10.1016/j.ajhg.2009.01.005. PMC 2668004. PMID 19200528.
  19. Makinen VP, Parkkonen M, Wessman M, Groop PH, Kanninen T, Kaski K (2005). "High-throughput pedigree drawing". Eur J Hum Genet. 13 (8): 987–989. doi:10.1038/sj.ejhg.5201430. PMID 15870825.
  20. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). "Principal components analysis corrects for stratification in genome-wide association studies". Nat Genet. 38 (8): 904–909. doi:10.1038/ng1847. PMID 16862161. S2CID 8127858.
  21. Patterson N, Price AL, Reich D (2006). "Population structure and eigenanalysis". PLOS Genet. 2 (12): e190. doi:10.1371/journal.pgen.0020190. PMC 1713260. PMID 17194218.
  22. Laird NM, Horvath S, Xu X (2000). "Implementing a unified approach to family-based tests of association". Genet Epidemiol. 19 (Suppl 1): S36–42. doi:10.1002/1098-2272(2000)19:1+<::aid-gepi6>3.3.co;2-d. PMID 11055368.
  23. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996). "Parametric and nonparametric linkage analysis: a unified multipoint approach". Am J Hum Genet. 58 (6): 1347–1363. PMC 1915045. PMID 8651312.
  24. Kong A, Cox NJ (1997). "Allele-sharing models: LOD scores and accurate linkage tests". Am J Hum Genet. 61 (5): 1179–1188. doi:10.1086/301592. PMC 1716027. PMID 9345087.
  25. Wang Z, McPeek MS (2009). "An Incomplete-Data Quasi-likelihood Approach to Haplotype-Based Genetic Association Studies on Related Individuals". J Am Stat Assoc. 104 (487): 1251–1260. doi:10.1198/jasa.2009.tm08507. PMC 2860453. PMID 20428335.
  26. Abney M (2009). "A graphical algorithm for fast computation of identity coefficients and generalized kinship coefficients". Bioinformatics. 25 (12): 1561–1563. doi:10.1093/bioinformatics/btp185. PMC 2687941. PMID 19359355.
  27. Heath SC (1997). "Markov chain Monte Carlo segregation and linkage analysis for oligogenic models". Am J Hum Genet. 61 (3): 748–760. doi:10.1086/515506. PMC 1715966. PMID 9326339.
  28. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012). "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing". Nat Genet. 44 (8): 955–959. doi:10.1038/ng.2354. PMC 3696580. PMID 22820512.
  29. Fuchsberger C, Abecasis GR, Hinds DA (2015). "minimac2: faster genotype imputation". Bioinformatics. 31 (5): 782–784. doi:10.1093/bioinformatics/btu704. PMC 4341061. PMID 25338720.
  30. Alcais A, Philippi A, Abel L (1999). "Genetic model-free linkage analysis using the maximum-likelihood-binomial method for categorical traits". Genet Epidemiol. 17 (Suppl 1): S467–472. doi:10.1002/gepi.1370170775. PMID 10597477.
  31. Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013). "Mendel: the Swiss army knife of genetic analysis programs". Bioinformatics. 29 (12): 1568–1570. doi:10.1093/bioinformatics/btt187. PMC 3673222. PMID 23610370.
  32. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002). "Merlin--rapid analysis of dense genetic maps using sparse gene flow trees". Nat Genet. 30 (1): 97–101. doi:10.1038/ng786. PMID 11731797. S2CID 12226524.
  33. Sobel E, Lange K (1996). "Descent graphs in pedigree analysis: Applications to haplotyping, location scores, and marker-sharing statistics". Am J Hum Genet. 58 (6): 1323–1337. PMC 1915074. PMID 8651310.
  34. Thompson EA (1994). "Monte Carlo likelihood in the genetic mapping of complex traits". Philos Trans R Soc Lond B Biol Sci. 344 (1310): 345–350, discussion 350–351. doi:10.1098/rstb.1994.0073. PMID 7800704.
  35. Thompson EA (1994). "Monte Carlo likelihood in genetic mapping". Statistical Science. 9 (3): 355–366. doi:10.1214/ss/1177010381.
  36. Hasstedt SJ (2005). "jPAP: Document-driven software for genetic analysis". Genet Epidemiol. 29: 255.
  37. McPeek MS, Sun L (2000). "Statistical tests for detection of misspecified relationships by use of genome-screen data". Am J Hum Genet. 66 (3): 1076–1094. doi:10.1086/302800. PMC 1288143. PMID 10712219.
  38. Sun L, Wilder K, McPeek MS (2002). "Enhanced pedigree error detection". Hum Hered. 54 (2): 99–110. doi:10.1159/000067666. PMID 12566741. S2CID 26992288.
  39. Thornton T, McPeek MS (2010). "ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure". Am J Hum Genet. 86 (2): 172–184. doi:10.1016/j.ajhg.2010.01.001. PMC 2820184. PMID 20137780.
  40. Delaneau O, Marchini J, Zagury JF (2012). "A linear complexity phasing method for thousands of genomes". Nat Methods. 9 (2): 179–81. doi:10.1038/nmeth.1785. PMID 22138821. S2CID 13765612.
  41. Delaneau O, Zagury JF, Marchini J (2013). "Improved whole-chromosome phasing for disease and population genetic studies". Nat Methods. 10 (1): 5–6. doi:10.1038/nmeth.2307. PMID 23269371. S2CID 205421216.
  42. Delaneau O, Howie B, Cox AJ, Zagury JF, Marchini J (2013). "Haplotype estimation using sequencing reads". Am J Hum Genet. 93 (4): 687–96. doi:10.1016/j.ajhg.2013.09.002. PMC 3791270. PMID 24094745.
  43. O'Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, et al. (2014). "A general approach for haplotype phasing across the full spectrum of relatedness". PLOS Genet. 10 (4): e1004234. doi:10.1371/journal.pgen.1004234. PMC 3990520. PMID 24743097.
  44. Delaneau O, Marchini J, The 1000 Genomes Project Consortium (2014). "Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel". Nat Commun. 5: 3934. Bibcode:2014NatCo...5.3934.. doi:10.1038/ncomms4934. PMC 4338501. PMID 25653097.
  45. Speer M, Terwilliger JD, Ott J (1992). "A chromosome-based method for rapid computer simulation". Am J Hum Genet. 51: A202.
  46. Weeks DE, Ott J, Lathrop GM (1990). "SLINK: a general simulation program for linkage analysis". Am J Hum Genet. 47 (3): A204.
  47. Blangero J, Almasy L (1997). "Multipoint oligogenic linkage analysis of quantitative traits". Genet Epidemiol. 14 (6): 959–964. doi:10.1002/(sici)1098-2272(1997)14:6<959::aid-gepi66>3.0.co;2-k. PMID 9433607. S2CID 11630296.
  48. Almasy L, Blangero J (1998). "Multipoint quantitative-trait linkage analysis in general pedigrees". Am J Hum Genet. 62 (5): 1198–1211. doi:10.1086/301844. PMC 1377101. PMID 9545414.
  49. Holmans P (1993). "Asymptotic properties of affected-sib-pair linkage analysis". Am J Hum Genet. 52 (2): 362–374. PMC 1682211. PMID 8430697.
  50. Lemire M (2006). "SUP: an extension to SLINK to allow a larger number of marker loci to be simulated in pedigrees conditional on trait values". BMC Genet. 7: 40. doi:10.1186/1471-2156-7-40. PMC 1524809. PMID 16803631.
  51. Pritchard JK, Stephens M, Donnelly P (2000). "Inference of population structure using multilocus genotype data". Genetics. 155 (2): 945–959. doi:10.1093/genetics/155.2.945. PMC 1461096. PMID 10835412.
  52. Falush D, Stephens M, Pritchard JK (2003). "Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies". Genetics. 164 (4): 1567–1587. doi:10.1093/genetics/164.4.1567. PMC 1462648. PMID 12930761.
  53. Falush D, Stephens M, Pritchard JK (2007). "Inference of population structure using multilocus genotype data: dominant markers and null alleles". Mol Ecol Notes. 7 (4): 574–578. doi:10.1111/j.1471-8286.2007.01758.x. PMC 1974779. PMID 18784791.
  54. Lange K, Weeks D, Boehnke M (1988). "Programs for pedigree analysis: MENDEL, FISHER, and dGENE" (PDF). Genet Epidemiol. 5 (6): 471–472. doi:10.1002/gepi.1370050611. hdl:2027.42/101847. PMID 3061869. S2CID 44260724.
  55. O'Connell JR, Weeks DE (1995). "The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance". Nat Genet. 11 (4): 402–408. doi:10.1038/ng1295-402. PMID 7493020. S2CID 12496754.