PLINK (genetic tool-set)

PLINK [1] is a free, widely used, open-source toolset for whole-genome association analysis, designed by Shaun Purcell. The software is built to perform a wide range of basic, large-scale genetic analyses.

PLINK currently supports the following functionalities:

Input and output files

PLINK has its own text format (.ped) and binary format (.bed), which serve as input files for most analyses. [2] A .map file accompanies a .ped file and provides information about variants, while .bim and .fam files accompany a .bed file as part of the binary dataset. PLINK also accepts VCF, BCF, Oxford, and 23andMe files, which are typically converted into the binary .bed format before the desired analyses are performed. When importing certain formats such as VCF, some information, such as phase and dosage, is discarded.
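As a rough illustration of the text formats described above, the following Python sketch parses one line of a .map file and one line of a .ped file. It assumes the standard four-column .map layout (chromosome, variant identifier, genetic distance, base-pair position) and the standard .ped layout of six leading columns followed by two allele columns per variant. The class names, function names, and sample values are hypothetical and are not part of PLINK itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MapRecord:
    """One variant from a .map file (assumed four-column layout)."""
    chrom: str
    variant_id: str
    genetic_distance: float
    bp_position: int


@dataclass
class PedRecord:
    """One sample from a .ped file: six leading columns plus allele pairs."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]


def parse_map_line(line: str) -> MapRecord:
    """Split a whitespace-delimited .map line into its four fields."""
    chrom, variant_id, cm, bp = line.split()
    return MapRecord(chrom, variant_id, float(cm), int(bp))


def parse_ped_line(line: str) -> PedRecord:
    """Split a .ped line into sample metadata and per-variant allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus an even number of alleles")
    alleles = fields[6:]
    # Pair consecutive allele columns: (a1, a2) for each variant, in .map order.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)


if __name__ == "__main__":
    # Made-up example lines, for illustration only.
    print(parse_map_line("1 rs123 0 1000"))
    print(parse_ped_line("FAM1 IND1 0 0 1 2 A A G T"))
```

The variants in a .ped line appear in the same order as the rows of the accompanying .map file, so the n-th allele pair corresponds to the n-th .map record.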

PLINK produces a variety of output files depending on the analysis. It can write files for BEAGLE and can recode a .bed dataset into VCF for analysis in other programs. PLINK is also designed to work in conjunction with R and can output files for processing by certain R packages.
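The recoding step mentioned above can be sketched as a small Python helper that assembles a PLINK 1.9 command line for converting a binary dataset to VCF. The helper name, the dataset prefixes, and the assumption that the executable is called `plink` and is on the PATH are all hypothetical. The flags `--bfile`, `--recode vcf`, and `--out` are those of PLINK 1.9, and other versions may differ.

```python
import shlex
from typing import List


def build_recode_vcf_command(
    bfile_prefix: str, out_prefix: str, plink_exe: str = "plink"
) -> List[str]:
    """Build an argument list that recodes a .bed/.bim/.fam dataset into VCF.

    bfile_prefix is the shared prefix of the binary dataset, and out_prefix
    is the prefix PLINK uses for its output files. Both are placeholders.
    """
    return [
        plink_exe,
        "--bfile", bfile_prefix,   # read prefix.bed, prefix.bim, prefix.fam
        "--recode", "vcf",         # write the genotypes as VCF
        "--out", out_prefix,       # prefix for the output files
    ]


if __name__ == "__main__":
    cmd = build_recode_vcf_command("mydata", "mydata_vcf")
    # Print the command rather than run it, since PLINK may not be installed.
    print(shlex.join(cmd))
```

To run the command, the list can be passed to `subprocess.run(cmd, check=True)` on a system where PLINK is installed.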

Extensions and current developments

References

  1. Purcell S; Neale B; Todd-Brown K; Thomas L; Ferreira MAR; Bender D; Maller J; Sklar P; de Bakker PIW; Daly MJ; Sham PC (2007). "PLINK: a toolset for whole-genome association and population-based linkage analysis". American Journal of Human Genetics. 81 (3): 559–75. doi:10.1086/519795. PMC 1950838. PMID 17701901.
  2. Christopher Chang (2017). "PLINK 1.9 File format reference" (PDF). Biobank UK at University of Oxford. Retrieved 2022-08-05. PLINK input and output file formats which are identifiable by file extension.
  3. Lee, James J.; Purcell, Shaun M.; Vattikuti, Shashaank; Tellier, Laurent CAM; Chow, Carson C.; Chang, Christopher C. (2015-12-01). "Second-generation PLINK: rising to the challenge of larger and richer datasets". GigaScience. 4 (1): 7. doi:10.1186/s13742-015-0047-8. PMC 4342193. PMID 25722852.