ANNOVAR (ANNOtate VARiation) is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome. [1] It can annotate the human genome builds hg18, hg19, and hg38, as well as the genomes of model organisms such as mouse (Mus musculus), zebrafish (Danio rerio), fruit fly (Drosophila melanogaster), roundworm (Caenorhabditis elegans), yeast (Saccharomyces cerevisiae), and many others. [2] The annotations can be used to determine the functional consequences of mutations on genes and organisms, infer cytogenetic bands, report functional importance scores, and/or find variants in conserved regions. [2] ANNOVAR, SNP effect (SnpEFF), and Variant Effect Predictor (VEP) are three of the most commonly used variant annotation tools.
The cost of high-throughput DNA sequencing has fallen drastically, from around $100 million per human genome in 2001 to around $1000 per human genome in 2017. [3] Due to this increase in accessibility, high-throughput DNA sequencing has become more widely used in research and clinical settings. [4] [5] Common applications that rely heavily on high-throughput DNA sequencing include whole exome sequencing, whole genome sequencing (WGS), and genome-wide association studies (GWAS). [6] [7]
A growing number of tools seek to comprehensively manage, analyze, and interpret the enormous amount of data generated by high-throughput DNA sequencing. These tools must be efficient and robust enough to analyze a large number of variants (more than 3 million in a human genome), yet sensitive enough to identify rare and clinically relevant variants that are likely to be harmful or deleterious. [8] ANNOVAR was developed by Dr. Kai Wang in 2010 at the Center for Applied Genomics at the University of Pennsylvania. [1] It is a variant annotation tool that compiles deleterious genetic variant prediction scores from resources such as PolyPhen, ClinVar, and CADD and annotates the SNVs, insertions, deletions, and CNVs of the provided genome. ANNOVAR was one of the first efficient, configurable, extensible, and cross-platform compatible variant annotation tools.
In terms of the larger bioinformatics workflow, ANNOVAR fits in near the end, after the DNA sequencing reads have been mapped and aligned and variants have been predicted from the alignment file (BAM), a step known as variant calling. This process produces a VCF file, a tab-separated text file in which each row describes a genetic variant. This file can then be used as input to ANNOVAR for variant annotation, which outputs interpretations of the variants identified by the upstream bioinformatics pipeline.
Gene-based annotation identifies whether the input variants cause protein-coding changes and which amino acids are affected by the mutations. [9] The input file can contain variants in exons, introns, intergenic regions, splice acceptor/donor sites, and 5′/3′ untranslated regions. The focus is to explore the relationship between non-synonymous mutations (SNPs, indels, or CNVs) and their functional impact on known genes. [10] In particular, if the mutation is in an exonic region, gene-based annotation reports the exact amino acid change and the predicted effect on the function of the known gene. This approach is useful for identifying variants in known genes from whole exome sequencing data.
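The core idea of reporting an amino acid change for an exonic variant can be illustrated with a minimal Python sketch. It is not ANNOVAR's implementation: it assumes the reference and altered codons are already known, uses the standard genetic code, and the function name `classify_codon_change` is hypothetical. The returned labels mirror the kind of categories ANNOVAR reports.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a single-codon substitution (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous SNV"
    if alt_aa == "*":
        return "stopgain"   # a stop codon is introduced
    if ref_aa == "*":
        return "stoploss"   # the original stop codon is lost
    return "nonsynonymous SNV"


# A glycine-to-arginine change, the same kind of substitution as G152R.
print(classify_codon_change("GGA", "AGA"))
```

A real annotator additionally has to map genomic coordinates onto transcripts, handle strand and reading frame, and deal with indels, none of which this sketch attempts.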
Region-based annotation identifies deleterious variants in specific genomic regions based on the genomic elements around the gene. [11] Some of the questions that region-based annotation takes into account are:
1) Is the variant in a known conserved genomic region?
Mutations occur during mitosis and meiosis. If there were no selective pressure on specific nucleotide sequences, all areas of a genome would mutate at equal rates. Highly conserved genomic regions indicate sequences that are essential to the organism's survival and/or reproductive success. Thus, if a variant disrupts a highly conserved region, it is likely to be highly deleterious. [12]
2) Is the variant in a predicted transcription factor binding site?
DNA is transcribed into messenger RNA (mRNA) by RNA polymerase II. This process can be modulated by transcription factors, which can enhance or inhibit the binding of RNA polymerase II. If a variant disrupts a transcription factor binding site, transcription of the gene could be altered, changing the gene expression level and/or the amount of protein produced. These changes could cause phenotypic variation.
3) Is the variant in a predicted miRNA target site?
MicroRNA (miRNA) is a type of RNA that binds to a complementary sequence in a target mRNA to suppress or silence translation of that mRNA. If a variant disrupts the miRNA target site, the miRNA could bind the corresponding gene transcript with altered affinity, changing the expression level of the transcript. This could in turn affect protein production levels, which could cause phenotypic variation.
4) Is the variant predicted to interrupt a stable RNA secondary structure?
RNA can function at the RNA level as non-coding RNA or be translated into protein for downstream processes. RNA secondary structures are extremely important in determining the correct half-life and function of those RNAs. Two RNA species with tightly regulated secondary structures are ribosomal RNA (rRNA) and transfer RNA (tRNA), which are essential for the translation of mRNA into protein. If a variant disrupts the stability of an RNA secondary structure, the half-life of the RNA could be shortened, lowering its concentration in the cell.
Non-coding regions encompass 99% of the human genome [13], and region-based annotation is extremely useful for identifying variants in those regions. This approach can be used on WGS data.
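At its simplest, region-based annotation asks whether a variant position falls inside an annotated interval. The following Python sketch shows one way such a lookup could work, assuming 1-based inclusive, non-overlapping intervals per chromosome. The function names and the example region labels are hypothetical, and ANNOVAR's own lookup strategy may differ.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end, label) tuples by chromosome, sorted by start."""
    index = {}
    for chrom, start, end, label in regions:
        index.setdefault(chrom, []).append((start, end, label))
    for intervals in index.values():
        intervals.sort()
    return index


def find_region(index, chrom, pos):
    """Return the label of the interval containing pos, or None."""
    intervals = index.get(chrom, [])
    starts = [start for start, _, _ in intervals]
    # Locate the last interval whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
        return intervals[i][2]
    return None


# Made-up regions for illustration only.
index = build_index([
    ("chr1", 100, 200, "conserved_element"),
    ("chr1", 500, 600, "tf_binding_site"),
])
print(find_region(index, "chr1", 150))
```

Overlapping annotations, which are common in practice, would need an interval tree or a binned index rather than this simple binary search.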
Filter-based annotation identifies variants that are documented in specific databases. [14] The variants can be obtained from dbSNP, the 1000 Genomes Project, or a user-supplied list. Additional information can be drawn from the frequency of the variants in these databases or from the deleteriousness scores provided by PolyPhen, CADD, ClinVar, and many others. [1] The less frequently a variant appears in public databases, the more likely it is to be deleterious. Results from different deleteriousness prediction tools can be combined by the researcher to make a more accurate call on the variant.
Taken together, these approaches complement one another to filter through over 4 million variants in a human genome. Common, low-deleterious score variants are eliminated to reveal the rare, high-deleterious score variants which could be causal for congenital diseases.
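The filtering logic described above, discarding common and low-scoring variants to leave rare, high-scoring ones, can be sketched in Python as follows. The field names `pop_freq` and `del_score` and both thresholds are assumptions chosen for illustration; they are not ANNOVAR defaults.

```python
def prioritize(variants, max_freq=0.01, min_score=20.0):
    """Keep rare variants with a high deleteriousness score.

    Each variant is a dict. 'pop_freq' is a population allele frequency
    (None if the variant is absent from the frequency database) and
    'del_score' is a deleteriousness score (None if unscored).
    """
    kept = []
    for variant in variants:
        freq = variant.get("pop_freq")
        score = variant.get("del_score")
        is_rare = freq is None or freq <= max_freq
        is_damaging = score is not None and score >= min_score
        if is_rare and is_damaging:
            kept.append(variant)
    # Highest-scoring candidates first.
    return sorted(kept, key=lambda v: v["del_score"], reverse=True)


# Made-up variants for illustration only.
candidates = prioritize([
    {"id": "var_a", "pop_freq": 0.25, "del_score": 30.0},
    {"id": "var_b", "pop_freq": None, "del_score": 25.0},
    {"id": "var_c", "pop_freq": 0.001, "del_score": 5.0},
])
print([v["id"] for v in candidates])
```

Treating variants that are missing from the frequency database as rare is a design choice. It keeps novel variants in the candidate list, at the cost of also retaining possible sequencing artifacts.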
ANNOVAR is a command-line tool written in the Perl programming language and can be run on any operating system that has a Perl interpreter installed. [1] For non-commercial use, it is available free of charge as an open-source package that can be downloaded from the ANNOVAR website. ANNOVAR can process most next-generation sequencing data that have been run through variant calling software.
Script | Purpose | Description | Input | Output | Requirements |
---|---|---|---|---|---|
annotate_variation.pl | variant annotator | The core script, which functionally annotates the genetic variants via (1) gene-based, (2) region-based, and/or (3) filter-based annotation. | .avinput | .avinput | Data sources are downloaded for annotation, e.g. hg38, UCSC, 1000 Genomes Project. |
convert2annovar.pl | file converter | Converts various file formats to the custom ANNOVAR input file format. | See "Conversion to the ANNOVAR input file format" section. | .avinput | |
table_annovar.pl | automated variant annotator | A wrapper around annotate_variation.pl that can take VCF format along with the ANNOVAR format, performs annotation and outputs an Excel-compatible file. Ideal for beginners. | .avinput, CSV, TSV, VCF | CSV, TSV, VCF, TXT | Data sources are downloaded for annotation, e.g. hg38, UCSC, 1000 Genomes Project. |
variants_reduction.pl | variant reducer | Performs stepwise variant reduction on a large set of input variants, applying successive filters to narrow them down to a subset of functionally important variants that are likely to be related to a disease. [2] | .avinput | .avinput | Gene-based annotation data sources and various filter-based annotation data sources are downloaded. |
The ANNOVAR software accepts text-based input files, including VCF (Variant Call Format), the gold standard for describing genetic loci.
The program's main annotation script, annotate_variation.pl, requires a custom input file format, the ANNOVAR input format (.avinput). Common file types can be converted to the ANNOVAR input format for annotation using a provided script (see below). It is a simple text file in which each line corresponds to a variant. Each line consists of tab-delimited columns holding the basic genomic coordinate fields (chromosome, start position, end position, reference nucleotides, and observed nucleotides), followed by optional columns. [2]
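As an illustration of this layout, the Python sketch below reads one tab-delimited line into the five basic fields and keeps any remaining columns as optional extras. The `AvVariant` type, the function name, and the example line are hypothetical.

```python
from typing import NamedTuple


class AvVariant(NamedTuple):
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    extra: tuple  # optional columns, kept verbatim


def parse_avinput_line(line):
    """Parse one tab-delimited line with at least five columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-delimited columns")
    chrom, start, end, ref, alt = fields[:5]
    return AvVariant(chrom, int(start), int(end), ref, alt, tuple(fields[5:]))


# Made-up example line: an A-to-G substitution with one optional column.
variant = parse_avinput_line("chr1\t12345\t12345\tA\tG\thet\n")
print(variant.chrom, variant.start, variant.ref, variant.alt, variant.extra)
```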
For basic "out-of-the-box" usage:
A popular function of the ANNOVAR tool is the table_annovar.pl script, which simplifies the workflow into a single command-line call, provided that the data sources for annotation have already been downloaded. File conversion from VCF is handled within the call, followed by annotation and output to an Excel-compatible file. The script takes a number of parameters for annotation and, for VCF input, outputs a VCF file with the annotations as key-value pairs inside the INFO column for each genetic variant, e.g. "genomic_function=exonic".
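A single-call invocation of this kind might be assembled as in the Python sketch below, which only builds the argument list and does not run anything. The option names follow ANNOVAR's published usage examples and should be verified against the installed version. The file paths, the output prefix, and the choice of databases (refGene, cytoBand, avsnp147) are placeholders.

```python
def build_table_annovar_command(vcf_path, db_dir, out_prefix, build="hg19"):
    """Assemble an argument list for table_annovar.pl (illustrative only)."""
    # One operation code per protocol: g = gene-based, r = region-based,
    # f = filter-based.
    protocols = ["refGene", "cytoBand", "avsnp147"]
    operations = ["g", "r", "f"]
    return [
        "perl", "table_annovar.pl", vcf_path, db_dir,
        "-buildver", build,
        "-out", out_prefix,
        "-remove",                       # delete intermediate files
        "-protocol", ",".join(protocols),
        "-operation", ",".join(operations),
        "-nastring", ".",                # placeholder for missing values
        "-vcfinput",                     # input is VCF; output is VCF too
    ]


command = build_table_annovar_command("sample.vcf", "humandb/", "sample_anno")
print(" ".join(command))
```

The list could be passed to `subprocess.run(command, check=True)` on a system where ANNOVAR and the named databases are installed.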
File conversion to the ANNOVAR input format is possible using the provided file format conversion script, convert2annovar.pl. The program accepts common file formats output by upstream variant calling tools. Subsequent functional annotation with annotate_variation.pl uses the resulting ANNOVAR input file. File formats that are accepted by convert2annovar.pl include the following: [2]
Generating input files based on specific variants, transcripts, or genomic regions:
When investigating candidate loci that are linked to diseases, using the above variant calling file formats as input to ANNOVAR is a standard workflow for functional annotation of genetic variants output by an upstream bioinformatics pipeline. ANNOVAR can also be used in other scenarios, such as interrogating a set of genetic variants of interest based on a list of dbSNP identifiers, or variants within specific genomic or exomic regions. [2]
In the case of dbSNP identifiers, when the convert2annovar.pl script is given a text file containing a list of identifiers (e.g. rs41534544, rs4308095, rs12345678) along with the reference genome of interest as a parameter, it outputs an ANNOVAR input file with the genomic coordinate fields for those variants, which can then be used for functional annotation. [2]
In the case of genomic regions, one can provide a genomic range of interest (e.g. chr1:2000001-2000003) along with the reference genome of interest, and ANNOVAR will generate an ANNOVAR input file of all the genetic loci spanning that range. In addition, an insertion or deletion size can be specified, in which case the script selects all the genetic loci where an insertion or deletion of that size is found. [2]
Lastly, when looking at variants within specific exonic regions, users can generate ANNOVAR input files for all possible variants in exons (including splicing variants) by providing the convert2annovar.pl script with an RNA transcript identifier (e.g. NM_022162) based on the standard HGVS (Human Genome Variation Society) nomenclature. [2]
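The kind of coordinate conversion involved in turning a VCF record into the five basic ANNOVAR input fields can be sketched in Python as follows. This is a simplified illustration, not the logic of convert2annovar.pl itself. It handles one alternate allele at a time and assumes, based on ANNOVAR's documented input format, that insertions and deletions are written with "-" for the empty allele. The function name is hypothetical.

```python
def vcf_to_avinput(chrom, pos, ref, alt):
    """Convert one VCF allele pair to (chrom, start, end, ref, alt)."""
    # VCF pads indels with shared leading bases; count and strip them.
    prefix = 0
    while prefix < min(len(ref), len(alt)) and ref[prefix] == alt[prefix]:
        prefix += 1
    ref_rest, alt_rest = ref[prefix:], alt[prefix:]

    if ref_rest and alt_rest:
        # Substitution (SNV or block substitution).
        start = pos + prefix
        return (chrom, start, start + len(ref_rest) - 1, ref_rest, alt_rest)
    if ref_rest:
        # Deletion: coordinates span the deleted bases.
        start = pos + prefix
        return (chrom, start, start + len(ref_rest) - 1, ref_rest, "-")
    if alt_rest:
        # Insertion: start and end both point at the base before the insert.
        anchor = pos + prefix - 1
        return (chrom, anchor, anchor, "-", alt_rest)
    raise ValueError("REF and ALT are identical")


print(vcf_to_avinput("chr1", 100, "AT", "A"))
```

Multi-allelic records, symbolic alleles such as structural variant tags, and genotype columns are outside the scope of this sketch.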
The possible output files are an annotated .avinput file, CSV, TSV, or VCF. Depending on the annotation strategy taken (see Figure below), the input and output files will differ. It is possible to configure the output file types given a specific input file, by providing the program the appropriate parameter.
For example, for the table_annovar.pl program, if the input file is VCF, then the output will also be a VCF file. If the input file is in the ANNOVAR input format, then the output will be a TSV by default, with the option to output CSV if the -csvout parameter is specified. Choosing CSV or TSV as the output file type lets a user open the files and view the annotations in Excel or another spreadsheet application. This is a popular feature among users.
The output file will contain all the data from the original input file with additional columns for the desired annotations. For example, when annotating variants with characteristics such as (1) genomic function and (2) the functional role of the coding variant, the output file will contain all the columns from the input file, followed by the additional columns "genomic_function" (e.g. with values "exonic" or "intronic") and "coding_variant_function" (e.g. with values "synonymous SNV" or "non-synonymous SNV").
Benchmarked on a desktop computer (3 GHz Intel Xeon CPU, 8 GB memory) with 4.7 million variants, ANNOVAR requires ~4 minutes to perform gene-based functional annotation, or ~15 minutes to perform stepwise "variants reduction". It is reported to be practical for performing variant annotation and variant prioritization on hundreds of human genomes in a day. [2]
ANNOVAR can be sped up by using the -thread argument, which enables multi-threading so that input files can be processed in parallel.
To use ANNOVAR for functional annotation of variants, annotation datasets can be downloaded using the annotate_variation.pl script, which saves them to local disk. [1] Different annotation data sources are used for the three major types of annotation (gene-based, region-based, and filter-based).
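A download call of this kind might be assembled as in the Python sketch below, which builds the argument list without executing it. The -downdb, -buildver, and -webfrom options follow ANNOVAR's published usage examples and should be checked against the installed version. The database name and target directory are placeholders, and the function name is hypothetical.

```python
def build_download_command(db_name, db_dir, build="hg19", from_annovar_mirror=True):
    """Assemble an argument list for downloading one annotation dataset."""
    cmd = ["perl", "annotate_variation.pl", "-buildver", build, "-downdb"]
    if from_annovar_mirror:
        # Fetch from the ANNOVAR-hosted mirror rather than the default source.
        cmd += ["-webfrom", "annovar"]
    cmd += [db_name, db_dir]
    return cmd


print(" ".join(build_download_command("refGene", "humandb/")))
```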
These are some of the data sources for each annotation type:
1000 Genomes Project | LRT | ClinVar |
dbSNP | MutationTaster | CADD |
avSNP | GERP++ | DANN |
dbNSFP | ExAC | COSMIC |
SIFT | ESP (Exome Sequencing Project) | ICGC |
PolyPhen 2 | gnomAD allele frequency | NCI60 |
PhyloP | Complete Genomics allele frequency |
Given the large number of data sources for filter-based annotation, here are examples of which subsets of the datasets to use for a few of the most common use cases. [14]
ANNOVAR is one of the common annotation tools for identifying candidate and causal mutations and genes for rare genetic diseases.
Using a combination of gene-based and filter-based annotation followed by variant reduction based on the annotation values of the variants, the causal gene in a rare recessive Mendelian disease called Miller syndrome can be identified. [1]
This involves synthesizing a genome-wide data set of ~4.2 million single nucleotide variants (SNVs) and ~0.5 million insertions and deletions (indels). [1] Two known causal mutations for Miller syndrome (G152R and G202A in the DHODH gene) are also included. [1]
Steps in identifying the causal variants for the disease using ANNOVAR: [1]
Two limitations of ANNOVAR relate to detection of common diseases and larger structural variant annotations. These problems are present in all current variant annotation tools.
Most common diseases, such as diabetes and Alzheimer's disease, involve multiple variants throughout the genome that are common in the population. [15] [16] These variants are expected to have low individual deleteriousness scores and to cause disease through the accumulation of multiple variants. However, ANNOVAR has default "variant-reduction" schemes that produce a small list of rare variants predicted to be highly deleterious. [10] These default settings can be adjusted so that the output includes additional variants with lower predicted deleteriousness scores. [2] ANNOVAR is primarily used for identifying variants involved in rare diseases, where the causal mutation is expected to be rare and highly deleterious.
Larger structural variants (SVs) such as chromosomal inversions, translocations, and complex SVs have been shown to cause diseases such as haemophilia A and Alzheimer's disease. [17] [18] However, SVs are often difficult to annotate because it is difficult to assign specific deleteriousness scores to large mutated genomic regions. Currently, ANNOVAR can only annotate genes contained within deletions or duplications, or small indels of <50 bp. ANNOVAR cannot infer complex SVs and translocations. [10]
Two other SNP annotation tools are similar to ANNOVAR: SNP effect (SnpEFF) and Variant Effect Predictor (VEP). ANNOVAR, SnpEFF, and VEP share many features, including input and output file formats, regulatory region annotations, and known variant annotations. The main differences are that ANNOVAR cannot annotate loss-of-function predictions, whereas both SnpEFF and VEP can, and that ANNOVAR cannot annotate microRNA structural binding locations, whereas VEP can. [19] MicroRNA structural binding location predictions can be informative in revealing the role of post-transcriptional mutations in disease pathogenesis. [20] Loss-of-function mutations are changes in the genome that result in the total dysfunction of the gene product. Thus, these predictions could be extremely informative for disease diagnosis, especially in rare monogenic diseases. [citation needed]
Class | Feature | VEP | Annovar | SnpEff |
---|---|---|---|---|
General | Availability | Free | Free (academic use only) | Free |
Input | VCF | Yes | Yes | Yes |
Input | Sequence variants | Yes | Yes | Yes |
Input | Structural variants | Yes | Yes | Yes |
Output | VCF | Yes | Yes | Yes |
Transcript sets | Ensembl | Yes | Yes | Yes |
Transcript sets | RefSeq | Yes | Yes | Yes |
Transcript sets | User-created databases | Yes | Yes | Yes |
Interfaces | Local package | Yes | Yes | Yes |
Interfaces | Instant prediction web interface | Yes | Yes | No |
Consequence types | Splicing predictions | Yes (via plugins) | Yes (via external data) | Yes (experimental) |
Consequence types | Loss of function prediction | Yes (via plugins) | No | Yes |
Non-coding | Regulatory features | Yes | Yes | Yes |
Non-coding | Support multiple cell lines | Yes | No | Yes |
Non-coding | miRNA structure location | Yes (via plugins) | No | No |
Known variants | Report known variants | Yes | Yes | Yes |
Known variants | Filter by frequency | Yes | Yes | Yes |
Known variants | Clinical significance | Yes | Yes | Yes |
Other filters | Pre-set filters | Yes | Yes | Yes |
*Table adapted from McLaren et al. (2016).