Mega2, the Manipulation Environment for Genetic Analysis

Mega2
Original author(s): Charles P. Kollar, Nandita Mukhopadhyay, Lee Almasy, Mark Schroeder, William P. Mulvihill (previous programmers)
Developer(s): Daniel E. Weeks, Robert V. Baron, Justin R. Stickel
Initial release: 16 January 2000
Stable release: 5.0.1 / 13 December 2018
Written in: C++
Operating system: Linux, Mac OS X, Microsoft Windows
Type: Applied statistical genetics, Bioinformatics
License: GNU General Public License version 3
Website: watson.hgen.pitt.edu/register/

Mega2 (short for Manipulation Environment for Genetic Analysis) allows the applied statistical geneticist to convert data from several input formats to a large number of output formats suitable for analysis by commonly used software packages.[1][2][3][4] In a typical human genetics study, the analyst often needs several different software programs to analyze the data, and these programs usually require that the data be formatted to their precise input specifications. Converting data into these multiple formats can be tedious, time-consuming, and error-prone. By providing validated conversion pipelines, Mega2 can accelerate analyses while reducing errors.

Mega2 produces a common intermediate data representation using SQLite3, which enables the data to be accessed by other programs and languages. In particular, the Mega2R R package converts the SQLite3 data into R data frames. Several R functions are provided that illustrate how data can be extracted from the data frames for common R analyses, such as those performed with SKAT and pedgene. The key capability is the efficient extraction of genotypes for chosen subsets of markers, which facilitates gene-based association testing by automating the loop over genes in the genome. One further function converts the data to VCF format and another converts the data to GenABEL format.
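Because the intermediate representation is an ordinary SQLite3 database, any language with SQLite bindings can read it. The following is a minimal Python sketch of the gene-by-gene marker lookup described above, not Mega2's or Mega2R's actual code. The `markers(name, chrom, pos)` table, the gene coordinates, and all function names are hypothetical and chosen only for illustration; the real Mega2 schema is not described here.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def markers_in_range(conn, chrom, start, end):
    """Return marker names on `chrom` with start <= pos <= end, in map order.

    ASSUMPTION: a table markers(name, chrom, pos) exists. This schema is
    hypothetical and is not Mega2's actual database layout.
    """
    rows = conn.execute(
        "SELECT name FROM markers "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    ).fetchall()
    return [name for (name,) in rows]


def markers_by_gene(conn, genes):
    """Loop over genes and collect the markers that fall inside each one."""
    return {gene: markers_in_range(conn, *coords) for gene, coords in genes.items()}


def build_demo_db():
    """Build a tiny in-memory database that uses the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE markers (name TEXT, chrom TEXT, pos INTEGER)")
    conn.executemany(
        "INSERT INTO markers VALUES (?, ?, ?)",
        [("rs1", "1", 100), ("rs2", "1", 250), ("rs3", "1", 900), ("rs4", "2", 150)],
    )
    return conn


# Hypothetical gene coordinates: name -> (chromosome, start, end)
GENES = {"GENE_A": ("1", 50, 300), "GENE_B": ("2", 100, 200)}

if __name__ == "__main__":
    demo = build_demo_db()
    print(list_tables(demo))
    print(markers_by_gene(demo, GENES))
```

Pushing the range filter into the SQL query means only the markers for the current gene are loaded, which is the property that makes looping over many genes practical.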

Mega2 has been used to facilitate genetic analyses of a wide variety of human traits, including hereditary dystonia,[5] Ehlers-Danlos syndrome,[6] multiple sclerosis,[7] and gliomas.[8] Further articles citing Mega2 are indexed in PubMed Central.

Mega2, which focuses on data reformatting, should not be confused with MEGA (Molecular Evolutionary Genetics Analysis), a program that focuses on molecular evolution and phylogenetics.

Input file formats

Mega2 accepts input data in a variety of widely used file formats. These contain, at a minimum, data about the phenotypes, the marker genotypes, any family structures, and map positions of the markers.

| Input format | Description | Links |
| --- | --- | --- |
| LINKAGE[9][10][11][12] | pre-Makeped or post-Makeped formats | Linkage User Guide (PDF), LINKAGE format |
| Mega2[1][2][3][4] | simplified/augmented LINKAGE format | Mega2 format |
| PLINK[13] | ped format or binary bed format | PLINK documentation |
| VCF or BCF[14] | Variant Call Format or Binary Variant Call Format | Variant Call Format (Wikipedia entry), BCF documentation |
| IMPUTE2[15][16] | IMPUTE2 GEN and BGEN formats | IMPUTE2 documentation, GEN format, BGEN format |
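
To make the minimum content of an input file concrete, here is a short Python sketch that parses one line of a PLINK ped file. It relies on the standard ped layout of six leading columns (family ID, individual ID, father ID, mother ID, sex, phenotype) followed by two alleles per marker, with `0` as the missing-allele code. The `PedRecord` class and `parse_ped_line` function are illustrative names, not part of Mega2 or PLINK.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PedRecord:
    """One individual from a PLINK ped file (illustrative structure)."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str        # PLINK coding: 1 = male, 2 = female, other = unknown
    phenotype: str
    genotypes: List[Optional[Tuple[str, str]]]  # None marks a missing genotype


def parse_ped_line(line: str) -> PedRecord:
    """Split a whitespace-delimited ped line into pedigree fields and genotypes."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a ped line needs at least six leading columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("alleles must come in pairs, two per marker")
    genotypes: List[Optional[Tuple[str, str]]] = []
    for i in range(0, len(alleles), 2):
        first, second = alleles[i], alleles[i + 1]
        # '0' is the missing-allele code; treat the whole genotype as missing.
        genotypes.append(None if "0" in (first, second) else (first, second))
    return PedRecord(*fields[:6], genotypes)


if __name__ == "__main__":
    print(parse_ped_line("FAM1 IND3 IND1 IND2 1 2 A G 0 0 C C"))
```

The marker names and map positions are not in the ped file itself; they come from a separate map file, which is why a complete input set needs both.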

Output file formats

Mega2 supports conversion to the following output formats.

| Output format | Links |
| --- | --- |
| ASPEX format | ASPEX |
| Allegro format[17] | |
| Beagle format[18][19] | BEAGLE |
| CRANEFOOT format[20] | CRANEFOOT |
| Eigenstrat format[21][22] | EIGENSOFT |
| FBAT format[23] | FBAT |
| GeneHunter format[24] | GeneHunter |
| GeneHunter-Plus format[25] | GeneHunter-Plus |
| IQLS/Idcoefs format[26][27] | IQLS, Idcoefs |
| Linkage format[9][10][11][12] | Linkage User Guide (PDF), LINKAGE format |
| Loki format[28] | Loki |
| MaCH/minimac3 format[29][30] | MaCH, minimac3 |
| MLBQTL format[31] | MLB-QTL |
| Mega2 annotated format[1][2][3][4] | Mega2 format |
| Mendel format[32] | Mendel |
| Merlin format[33] | Merlin |
| Merlin/SimWalk2-NPL format[33][34] | Merlin, SimWalk2 |
| PANGAEA MORGAN format[35][36] | MORGAN |
| PAP format[37] | PAP |
| PLINK format[13] (bed, lgen, or ped formats) | PLINK |
| PREST format[38][39] | PREST |
| PSEQ format | PSEQ |
| Pre-makeped LINKAGE format[9][10][11][12] | Linkage User Guide (PDF), LINKAGE format |
| ROADTRIPS format[40] | ROADTRIPS |
| SAGE format | SAGE, openSAGE |
| SHAPEIT format[41][42][43][44][45] | SHAPEIT |
| SIMULATE format[46] | SIMULATE |
| SLINK format[47][48] | FASTSLINK |
| SOLAR format[49][50] | SOLAR |
| SPLINK format[51] | SPLINK |
| SUP format[48][52] | SUP |
| SimWalk2 format[34] | SimWalk2 |
| Structure format[53][54][55] | Structure |
| VCF format[14] | Variant Call Format (Wikipedia entry) |
| Vintage Mendel format[32][56] | Vintage Mendel |
| Vitesse format[57] | Vitesse |
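
As an illustration of the kind of recoding an output conversion involves, the Python sketch below turns an allele-pair genotype into the `GT` string used by VCF, where `0` denotes the reference allele, `1`, `2`, ... denote the alternate alleles in order, `/` separates unphased alleles, and `./.` marks a missing call. This is not Mega2's implementation; the function name `genotype_to_vcf_gt` and its arguments are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def genotype_to_vcf_gt(
    genotype: Optional[Tuple[str, str]], ref: str, alts: Sequence[str]
) -> str:
    """Encode an unphased diploid genotype as a VCF GT string.

    `genotype` is a pair of allele strings, or None for a missing call.
    Raises ValueError if an allele is neither `ref` nor one of `alts`.
    """
    if genotype is None:
        return "./."
    alleles = [ref] + list(alts)  # position in this list is the VCF allele index
    # Sort the indices so that ('G', 'A') and ('A', 'G') both give "0/1".
    indices = sorted(alleles.index(allele) for allele in genotype)
    return "/".join(str(i) for i in indices)


if __name__ == "__main__":
    print(genotype_to_vcf_gt(("A", "G"), ref="A", alts=["G"]))
    print(genotype_to_vcf_gt(None, ref="A", alts=["G"]))
```

A pedigree-oriented format stores the alleles themselves, whereas VCF stores indices into a per-site allele list, so a converter has to know the reference and alternate alleles for each marker before it can write a line.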

Documentation

The Mega2 documentation is available in HTML and PDF formats.

References

  1. Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE (1999). "Mega2, a data-handling program for facilitating genetic linkage and association analyses". Am J Hum Genet. 65: A436.
  2. Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE (2005). "Mega2: data-handling for facilitating genetic linkage and association analyses". Bioinformatics. 21 (10): 2556–2557. doi:10.1093/bioinformatics/bti364. PMID 15746282.
  3. Kollar CP, Baron RV, Mukhopadhyay N, Weeks DE (October 2013). "Mega2: enhanced data-handling for facilitating genetic linkage and association analyses". Presented at the 63rd Annual Meeting of the American Society of Human Genetics, Boston: Abstract 1831.
  4. Baron RV, Kollar C, Mukhopadhyay N, Weeks DE (2014). "Mega2: validated data-reformatting for linkage and association analyses". Source Code Biol Med. 9 (1): 26. doi:10.1186/s13029-014-0026-y. PMC 4269913. PMID 25687422.
  5. Hersheson J, Mencacci NE, Davis M, Macdonald N, Trabzuni D, Ryten M, Pittman A, Paudel R, Kara E, Fawcett K, Plagnol V, Bhatia KP, Medlar AJ, Stanescu HC, Hardy J, Kleta R, Wood NW, Houlden H (2013). "Mutations in the autoregulatory domain of beta-tubulin 4a cause hereditary dystonia". Ann Neurol. 73 (4): 546–553. doi:10.1002/ana.23832. PMC 3698699. PMID 23424103.
  6. Baumann M, Giunta C, Krabichler B, Ruschendorf F, Zoppi N, Colombi M, Bittner RE, Quijano-Roy S, Muntoni F, Cirak S, Schreiber G, Zou Y, Hu Y, Romero NB, Carlier RY, Amberger A, Deutschmann A, Straub V, Rohrbach M, Steinmann B, Rostasy K, Karall D, Bonnemann CG, Zschocke J, Fauth C (2012). "Mutations in FKBP14 cause a variant of Ehlers-Danlos syndrome with progressive kyphoscoliosis, myopathy, and hearing loss". Am J Hum Genet. 90 (2): 201–216. doi:10.1016/j.ajhg.2011.12.004. PMC 3276673. PMID 22265013.
  7. Dyment DA, Cader MZ, Chao MJ, Lincoln MR, Morrison KM, Disanto G, Morahan JM, De Luca GC, Sadovnick AD, Lepage P, Montpetit A, Ebers GC, Ramagopalan SV (2012). "Exome sequencing identifies a novel multiple sclerosis susceptibility variant in the TYK2 gene". Neurology. 79 (5): 406–411. doi:10.1212/wnl.0b013e3182616fc4. PMC 3405256. PMID 22744673.
  8. Shete S, Lau CC, Houlston RS, Claus EB, Barnholtz-Sloan J, Lai R, Il'yasova D, Schildkraut J, Sadetzki S, Johansen C, Bernstein JL, Olson SH, Jenkins RB, Yang P, Vick NA, Wrensch M, Davis FG, McCarthy BJ, Leung EH, Davis C, Cheng R, Hosking FJ, Armstrong GN, Liu Y, Yu RK, Henriksson R, Gliogene C, Melin BS, Bondy ML (2011). "Genome-wide high-density SNP linkage search for glioma susceptibility loci: results from the Gliogene Consortium". Cancer Res. 71 (24): 7568–7575. doi:10.1158/0008-5472.can-11-0013. PMC 3242820. PMID 22037877.
  9. Lathrop GM, Lalouel JM (1984). "Easy calculations of lod scores and genetic risks on small computers". Am J Hum Genet. 36 (2): 460–465. PMC 1684427. PMID 6585139.
  10. Lathrop GM, Lalouel JM, Julier C, Ott J (1985). "Multilocus linkage analysis in humans: detection of linkage and estimation of recombination". Am J Hum Genet. 37 (3): 482–498. PMC 1684598. PMID 3859205.
  11. Lathrop GM, Lalouel JM, White RL (1986). "Construction of human linkage maps: likelihood calculations for multilocus analysis". Genet Epidemiol. 3 (1): 39–52. doi:10.1002/gepi.1370030105. PMID 3957003. S2CID 29289413.
  12. Lathrop GM, Lalouel JM (1988). "Efficient computations in multilocus linkage analysis". Am J Hum Genet. 42 (3): 498–505. PMC 1715153. PMID 3162348.
  14. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. (2011). "The variant call format and VCFtools". Bioinformatics. 27 (15): 2156–8. doi:10.1093/bioinformatics/btr330. PMC 3137218. PMID 21653522.
  15. Howie BN, Donnelly P, Marchini J (2009). "A flexible and accurate genotype imputation method for the next generation of genome-wide association studies". PLOS Genet. 5 (6): e1000529. doi:10.1371/journal.pgen.1000529. PMC 2689936. PMID 19543373.
  16. Marchini J, Howie B (2010). "Genotype imputation for genome-wide association studies". Nat Rev Genet. 11 (7): 499–511. doi:10.1038/nrg2796. PMID 20517342. S2CID 1465707.
  17. Gudbjartsson DF, Jonasson K, Frigge ML, Kong A (2000). "Allegro, a new computer program for multipoint linkage analysis". Nat Genet. 25 (1): 12–13. doi:10.1038/75514. PMID 10802644. S2CID 27362146.
  18. Browning SR, Browning BL (2007). "Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering". Am J Hum Genet. 81 (5): 1084–1097. doi:10.1086/521987. PMC 2265661. PMID 17924348.
  19. Browning BL, Browning SR (2009). "A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals". Am J Hum Genet. 84 (2): 210–223. doi:10.1016/j.ajhg.2009.01.005. PMC 2668004. PMID 19200528.
  20. Makinen VP, Parkkonen M, Wessman M, Groop PH, Kanninen T, Kaski K (2005). "High-throughput pedigree drawing". Eur J Hum Genet. 13 (8): 987–989. doi:10.1038/sj.ejhg.5201430. PMID 15870825.
  21. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). "Principal components analysis corrects for stratification in genome-wide association studies". Nat Genet. 38 (8): 904–909. doi:10.1038/ng1847. PMID 16862161. S2CID 8127858.
  22. Patterson N, Price AL, Reich D (2006). "Population structure and eigenanalysis". PLOS Genet. 2 (12): e190. doi:10.1371/journal.pgen.0020190. PMC 1713260. PMID 17194218.
  23. Laird NM, Horvath S, Xu X (2000). "Implementing a unified approach to family-based tests of association". Genet Epidemiol. 19 (Suppl 1): S36–42. doi:10.1002/1098-2272(2000)19:1+<::aid-gepi6>3.3.co;2-d. PMID 11055368.
  24. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996). "Parametric and nonparametric linkage analysis: a unified multipoint approach". Am J Hum Genet. 58 (6): 1347–1363. PMC 1915045. PMID 8651312.
  25. Kong A, Cox NJ (1997). "Allele-sharing models: LOD scores and accurate linkage tests". Am J Hum Genet. 61 (5): 1179–1188. doi:10.1086/301592. PMC 1716027. PMID 9345087.
  26. Wang Z, McPeek MS (2009). "An Incomplete-Data Quasi-likelihood Approach to Haplotype-Based Genetic Association Studies on Related Individuals". J Am Stat Assoc. 104 (487): 1251–1260. doi:10.1198/jasa.2009.tm08507. PMC 2860453. PMID 20428335.
  27. Abney M (2009). "A graphical algorithm for fast computation of identity coefficients and generalized kinship coefficients". Bioinformatics. 25 (12): 1561–1563. doi:10.1093/bioinformatics/btp185. PMC 2687941. PMID 19359355.
  28. Heath SC (1997). "Markov chain Monte Carlo segregation and linkage analysis for oligogenic models". Am J Hum Genet. 61 (3): 748–760. doi:10.1086/515506. PMC 1715966. PMID 9326339.
  29. Howie B, Fuchsberger C, Stephens M, Marchini J, Abecasis GR (2012). "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing". Nat Genet. 44 (8): 955–959. doi:10.1038/ng.2354. PMC 3696580. PMID 22820512.
  30. Fuchsberger C, Abecasis GR, Hinds DA (2015). "minimac2: faster genotype imputation". Bioinformatics. 31 (5): 782–784. doi:10.1093/bioinformatics/btu704. PMC 4341061. PMID 25338720.
  31. Alcais A, Philippi A, Abel L (1999). "Genetic model-free linkage analysis using the maximum-likelihood- binomial method for categorical traits". Genet Epidemiol. 17 (Suppl 1): S467–472. doi:10.1002/gepi.1370170775. PMID 10597477.
  32. Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013). "Mendel: the Swiss army knife of genetic analysis programs". Bioinformatics. 29 (12): 1568–1570. doi:10.1093/bioinformatics/btt187. PMC 3673222. PMID 23610370.
  33. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002). "Merlin--rapid analysis of dense genetic maps using sparse gene flow trees". Nat Genet. 30 (1): 97–101. doi:10.1038/ng786. PMID 11731797. S2CID 12226524.
  34. Sobel E, Lange K (1996). "Descent graphs in pedigree analysis: Applications to haplotyping, location scores, and marker-sharing statistics". Am J Hum Genet. 58 (6): 1323–1337. PMC 1915074. PMID 8651310.
  35. Thompson EA (1994). "Monte Carlo likelihood in the genetic mapping of complex traits". Philos Trans R Soc Lond B Biol Sci. 344 (1310): 345–350, discussion 350–341. doi:10.1098/rstb.1994.0073. PMID 7800704.
  36. Thompson EA (1994). "Monte Carlo likelihood in genetic mapping". Statistical Science. 9 (3): 355–366. doi:10.1214/ss/1177010381.
  37. Hasstedt SJ (2005). "jPAP: Document-driven software for genetic analysis". Genet Epidemiol. 29: 255.
  38. McPeek MS, Sun L (2000). "Statistical tests for detection of misspecified relationships by use of genome-screen data". Am J Hum Genet. 66 (3): 1076–1094. doi:10.1086/302800. PMC 1288143. PMID 10712219.
  39. Sun L, Wilder K, McPeek MS (2002). "Enhanced pedigree error detection". Hum Hered. 54 (2): 99–110. doi:10.1159/000067666. PMID 12566741. S2CID 26992288.
  40. Thornton T, McPeek MS (2010). "ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure". Am J Hum Genet. 86 (2): 172–184. doi:10.1016/j.ajhg.2010.01.001. PMC 2820184. PMID 20137780.
  41. Delaneau O, Marchini J, Zagury JF (2012). "A linear complexity phasing method for thousands of genomes". Nat Methods. 9 (2): 179–81. doi:10.1038/nmeth.1785. PMID 22138821. S2CID 13765612.
  42. Delaneau O, Zagury JF, Marchini J (2013). "Improved whole-chromosome phasing for disease and population genetic studies". Nat Methods. 10 (1): 5–6. doi:10.1038/nmeth.2307. PMID 23269371. S2CID 205421216.
  43. Delaneau O, Howie B, Cox AJ, Zagury JF, Marchini J (2013). "Haplotype estimation using sequencing reads". Am J Hum Genet. 93 (4): 687–96. doi:10.1016/j.ajhg.2013.09.002. PMC 3791270. PMID 24094745.
  44. O'Connell J, Gurdasani D, Delaneau O, Pirastu N, Ulivi S, Cocca M, et al. (2014). "A general approach for haplotype phasing across the full spectrum of relatedness". PLOS Genet. 10 (4): e1004234. doi:10.1371/journal.pgen.1004234. PMC 3990520. PMID 24743097.
  45. Delaneau O, Marchini J, The 1000 Genomes Project Consortium (2014). "Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel". Nat Commun. 5: 3934. Bibcode:2014NatCo...5.3934.. doi:10.1038/ncomms4934. PMC 4338501. PMID 25653097.
  46. Speer M, Terwilliger JD, Ott J (1992). "A chromosome-based method for rapid computer simulation". Am J Hum Genet. 51: A202.
  47. Weeks DE, Ott J, Lathrop GM (1990). "SLINK: a general simulation program for linkage analysis". Am J Hum Genet. 47 (3): A204.
  49. Blangero J, Almasy L (1997). "Multipoint oligogenic linkage analysis of quantitative traits". Genet Epidemiol. 14 (6): 959–964. doi:10.1002/(sici)1098-2272(1997)14:6<959::aid-gepi66>3.0.co;2-k. PMID 9433607. S2CID 11630296.
  50. Almasy L, Blangero J (1998). "Multipoint quantitative-trait linkage analysis in general pedigrees". Am J Hum Genet. 62 (5): 1198–1211. doi:10.1086/301844. PMC 1377101. PMID 9545414.
  51. Holmans P (1993). "Asymptotic properties of affected-sib-pair linkage analysis". Am J Hum Genet. 52 (2): 362–374. PMC 1682211. PMID 8430697.
  52. Lemire M (2006). "SUP: an extension to SLINK to allow a larger number of marker loci to be simulated in pedigrees conditional on trait values". BMC Genet. 7: 40. doi:10.1186/1471-2156-7-40. PMC 1524809. PMID 16803631.
  53. Pritchard JK, Stephens M, Donnelly P (2000). "Inference of population structure using multilocus genotype data". Genetics. 155 (2): 945–959. doi:10.1093/genetics/155.2.945. PMC 1461096. PMID 10835412.
  54. Falush D, Stephens M, Pritchard JK (2003). "Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies". Genetics. 164 (4): 1567–1587. doi:10.1093/genetics/164.4.1567. PMC 1462648. PMID 12930761.
  55. Falush D, Stephens M, Pritchard JK (2007). "Inference of population structure using multilocus genotype data: dominant markers and null alleles". Mol Ecol Notes. 7 (4): 574–578. doi:10.1111/j.1471-8286.2007.01758.x. PMC 1974779. PMID 18784791.
  56. Lange K, Weeks D, Boehnke M (1988). "Programs for pedigree analysis: MENDEL, FISHER, and dGENE" (PDF). Genet Epidemiol. 5 (6): 471–472. doi:10.1002/gepi.1370050611. hdl:2027.42/101847. PMID 3061869. S2CID 44260724.
  57. O'Connell JR, Weeks DE (1995). "The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance". Nat Genet. 11 (4): 402–408. doi:10.1038/ng1295-402. PMID 7493020. S2CID 12496754.

External links