PLINK (genetic tool-set)

Last updated

PLINK [1] is a free, commonly used, open-source whole-genome association analysis toolset designed by Shaun Purcell. The software is designed flexibly to perform a wide range of basic, large-scale genetic analyses.

Contents

PLINK currently supports following functionalities:

Input and output files

PLINK has its own format of text files (.ped) and binary text files (.bed) that serve as input files for most analyses. [2] A .map accompanies a .ped file and provides information about variants, while .bim and .fam files accompany .bed files as part of the binary dataset. Additionally, PLINK accepts inputs of VCF, BCF, Oxford, and 23andMe files, which are typically extracted into the binary .bed format prior to performing desired analyses. With certain formats such as VCF, some information such as phase and dosage will be discarded.

PLINK has a variety of output files depending on the analysis. PLINK has the ability to output files for BEAGLE and can recode a .bed file into a VCF for analyses in other programs. Additionally, PLINK is designed to work in conjunction with R, and can output files to be processed by certain R packages.

Extensions and current developments

Related Research Articles

<span class="mw-page-title-main">Linker (computing)</span> Computer program which combines multiple object files into a single file

In computing, a linker or link editor is a computer system program that takes one or more object files and combines them into a single executable file, library file, or another "object" file.

<span class="mw-page-title-main">SPSS</span> Statistical analysis software

SPSS Statistics is a statistical software suite developed by IBM for data management, advanced analytics, multivariate analysis, business intelligence, and criminal investigation. Long produced by SPSS Inc., it was acquired by IBM in 2009. Current versions have the brand name: IBM SPSS Statistics.

<span class="mw-page-title-main">Biopython</span> Collection of open-source Python software tools for computational biology

The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers. It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology.

In population genetics, linkage disequilibrium (LD) is the non-random association of alleles at different loci in a given population. Loci are said to be in linkage disequilibrium when the frequency of association of their different alleles is higher or lower than what would be expected if the loci were independent and associated randomly.

<span class="mw-page-title-main">Identity by descent</span> Identical nucleotide sequence due to inheritance without recombination from a common ancestor

A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. DNA segments that are IBD are IBS per definition, but segments that are not IBD can still be IBS due to the same mutations in different individuals or recombinations that do not alter the segment.

Genetic association is when one or more genotypes within a population co-occur with a phenotypic trait more often than would be expected by chance occurrence.

Haploview is a commonly used bioinformatics software which is designed to analyze and visualize patterns of linkage disequilibrium (LD) in genetic data. Haploview can also perform association studies, choosing tagSNPs and estimating haplotype frequencies. Haploview is developed and maintained by Dr. Mark Daly's lab at the MIT/Harvard Broad Institute.

<span class="mw-page-title-main">Genome-wide association study</span> Study of genetic variants in different individuals

In genomics, a genome-wide association study, also known as whole genome association study, is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms.

In IBM mainframe operating systems, basic partitioned access method (BPAM) is an access method for libraries, called partitioned datasets (PDSes) in IBM terminology. BPAM is used in OS/360, OS/VS2, MVS, z/OS, and others.

<span class="mw-page-title-main">Galaxy (computational biology)</span>

Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists that do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.

Quantitative trait loci mapping or QTL mapping is the process of identifying genomic regions that potentially contain genes responsible for important economic, health or environmental characters. Mapping QTLs is an important activity that plant breeders and geneticists routinely use to associate potential causal genes with phenotypes of interest. Family-based QTL mapping is a variant of QTL mapping where multiple-families are used.

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. SAM files can be very large, so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

Ensembl Genomes is a scientific project to provide genome-scale data from non-vertebrate species.

Mega2 allows the applied statistical geneticist to convert one's data from several input formats to a large number output formats suitable for analysis by commonly used software packages. In a typical human genetics study, the analyst often needs to use a variety of different software programs to analyze the data, and these programs usually require that the data be formatted to their precise input specifications. Conversion of one's data into these multiple different formats can be tedious, time-consuming, and error-prone. Mega2, by providing validated conversion pipelines, can accelerate the analyses while reducing errors.

<span class="mw-page-title-main">Gene set enrichment analysis</span>

Gene set enrichment analysis (GSEA) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with disease phenotypes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes which are used for the analysis.

Genome-wide complex trait analysis (GCTA) Genome-based restricted maximum likelihood (GREML) is a statistical method for variance component estimation in genetics which quantifies the total narrow-sense (additive) contribution to a trait's heritability of a particular subset of genetic variants. This is done by directly quantifying the chance genetic similarity of unrelated individuals and comparing it to their measured similarity on a trait; if two unrelated individuals are relatively similar genetically and also have similar trait measurements, then the measured genetics are likely to causally influence that trait, and the correlation can to some degree tell how much. This can be illustrated by plotting the squared pairwise trait differences between individuals against their estimated degree of relatedness. The GCTA framework can be applied in a variety of settings. For example, it can be used to examine changes in heritability over aging and development. It can also be extended to analyse bivariate genetic correlations between traits. There is an ongoing debate about whether GCTA generates reliable or stable estimates of heritability when used on current SNP data. The method is based on the outdated and false dichotomy of genes versus the environment. It also suffers from serious methodological weaknesses, such as susceptibility to population stratification.

Benjamin Michael Neale is a statistical geneticist with a specialty in psychiatric genetics. He is an institute member at the Broad Institute as well as an associate professor at both Harvard Medical School and the Analytic and Translational Genetics Unit at Massachusetts General Hospital. Neale specializes in genome-wide association studies (GWAS). He was responsible for the data analysis of the first GWAS on attention-deficit/hyperactivity-disorder, and he developed new analysis software such as PLINK, which allows for whole-genome data to be analyzed for specific gene markers. Related to his work on GWAS, Neale is the lead of the ADHD psychiatric genetics and also a member of the Psychiatric GWAS Consortium analysis committee.

<span class="mw-page-title-main">ANNOVAR</span> Bioinformatics software

ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome. It has the ability to annotate human genomes hg18, hg19, hg38, and model organisms genomes such as: mouse, zebrafish, fruit fly, roundworm, yeast and many others. The annotations could be used to determine the functional consequences of the mutations on the genes and organisms, infer cytogenetic bands, report functional importance scores, and/or find variants in conserved regions. ANNOVAR along with SNP effect (SnpEFF) and Variant Effect Predictor (VEP) are three of the most commonly used variant annotation tools.

In statistical genetics, Haseman–Elston (HE) regression is a form of statistical regression originally proposed for linkage analysis of quantitative traits for sibling pairs. It was first developed by Joseph K. Haseman and Robert C. Elston in 1972. A much earlier source of sib-pair linkage implementation was, in 1935 and 1938, proposed by Lionel S. Penrose, who is father of Nobel laureate theoretical physicist Roger Penrose. In 2000, Elston et al. proposed a "revisited", extended form of Haseman–Elston regression. Since then, further extensions to the "revisited" form of HE regression have been proposed. Although HE regression "...seems a rusty weapon in the genomics analysis armory of the GWAS era. This is because the HE regression relies on relatedness measured on IBD but not identity by state (IBS)...", HE has been adapted for association analysis in unrelated samples, whose relatedness is measured in IBS.

References

  1. Purcell S; Neale B; Todd-Brown K; Thomas L; Ferreira MAR; Bender D; Maller J; Sklar P; de Bakker PIW; Daly MJ; Sham PC (2007). "PLINK: a toolset for whole-genome association and population-based linkage analysis". American Journal of Human Genetics. 81 (3): 559–75. doi:10.1086/519795. PMC   1950838 . PMID   17701901.
  2. Christopher Chang (2017). "PLINK 1.9 File format reference" (PDF). Biobank UK at University of Oxford. Retrieved 2022-08-05. PLINK input and output file formats which are identifiable by file extension
  3. Lee, James J.; Purcell, Shaun M.; Vattikuti, Shashaank; Tellier, Laurent CAM; Chow, Carson C.; Chang, Christopher C. (2015-12-01). "Second-generation PLINK: rising to the challenge of larger and richer datasets". GigaScience. 4 (1): 7. doi:10.1186/s13742-015-0047-8. PMC   4342193 . PMID   25722852.