ANNOVAR

Last updated April 10, 2023

ANNOVAR (ANNOtate VARiation) is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.^[1] It has the ability to annotate human genomes hg18, hg19, hg38, and model organisms genomes such as: mouse ( Mus musculus ), zebrafish ( Danio rerio ), fruit fly ( Drosophila melanogaster), roundworm ( Caenorhabditis elegans ), yeast ( Saccharomyces cerevisiae ) and many others.^[2] The annotations could be used to determine the functional consequences of the mutations on the genes and organisms, infer cytogenetic bands, report functional importance scores, and/or find variants in conserved regions.^[2] ANNOVAR along with SNP effect (SnpEFF) and Variant Effect Predictor (VEP) are three of the most commonly used variant annotation tools.

Background
Types of functional annotation of genetic variants
Gene-based annotation
Region-based annotation
Filter-based annotation
Technical information
File formats
System efficiency
Data resources
Gene-based annotation 2
Region-based annotation 2
Filter-based annotation 2
Example application
Using ANNOVAR for prioritization of genetic variants to identify mutations in a rare genetic disease
Limitations of ANNOVAR
Alternate variant annotation tools
References

Background

The cost of high throughput DNA sequencing has reduced drastically from around $100 million/human genome in 2001 to around $1000/human genome in 2017.^[3] Due to this increase in accessibility, high throughput DNA sequencing has become more widely used in research and clinical settings.^[4]^[5] Some common areas that utilize high throughput DNA sequencing extensively are: Whole Exome Sequencing, Whole Genome Sequencing (WGS), and genome wide association studies (GWAS).^[6]^[7]

There are a growing number of tools available seeking to comprehensively manage, analyze and interpret the enormous amount of data generated from high-throughput DNA sequencing. The tools are required to be efficient and robust enough to analyze a large number of variants (more than 3 million in human genome) though sensitive enough to identify rare and clinically relevant variants that are likely harmful/deleterious.^[8] ANNOVAR was developed by Dr. Kai Wang in 2010 at the Center for Applied Genomics in the University of Pennsylvania.^[1] It is a type of variant annotation tool that compiles deleterious genetic variant prediction scores from programs such as PolyPhen, ClinVar, and CADD and annotates the SNVs, insertions, deletions, and CNVs of the provided genome. ANNOVAR is one of the first efficient, configurable, extensible and cross-platform compatible variant annotation tools created.

In terms of the larger bioinformatics workflow, ANNOVAR fits in near the end, after DNA sequencing reads having between mapped, aligned, and variants have been predicted from an alignment file (BAM), also known as variant calling. This process will produce a resultant VCF file, a tab-separated text file in a tabular like structure, containing genetic variants as rows. This file can then be used as input into the ANNOVAR software program for the variant annotation process, outputting interpretations of the variants identified from the upstream bioinformatics pipeline.

Types of functional annotation of genetic variants

Gene-based annotation

This approach identifies whether the input variants cause protein coding changes and the amino acids that are affected by the mutations.^[9] The input file can be composed of exons, introns, intergenic regions, splice acceptor/donor sites, and 5′/3′ untranslated regions. The focus is to explore the relationship between non-synonymous mutations (SNPs, indels, or CNVs) and their functional impact on known genes.^[10] Especially, gene-based annotation will highlight the exact amino acid change if the mutation is in the exonic region and the predicted effect on the function of the known gene. This approach is useful for identifying variants in known genes from Whole Exome Sequencing data.

Region-based annotation

This approach identifies deleterious variants in specific genomic regions based on the genomic elements around the gene.^[11] Some categories region-based annotation will take into account are:

1) Is the variant in a known conserved genomic region?

Mutations occur during mitosis and meiosis. If there is no selective pressure for specific nucleotide sequences, then all areas of a genome would be mutated are equal rates. The genomic regions that are highly conserved indicate genomic sequences that are essential to the organism's survival and/or reproductive success. Thus, if the variant disrupts a highly conserved region, the variant is likely highly deleterious.^[12]

2) Is the variant in a predicted transcription factor binding site?

DNA is transcribed into messenger RNA (mRNA) by RNA polymerase II. This process can be modulated transcription factors which can enhance or inhibit binding of RNApol II. If the variant disrupts a transcription factor binding site then transcription of the gene could be altered causing changes in gene expression level and/or protein production amount. This changes could cause phenotypic variations.

3) Is the variant in a predicted miRNA target site?

MicroRNA (miRNA) is a type of RNA that complementary binds to targeted mRNA sequence to suppress or silence the translation of the mRNA. If the variant disrupts the miRNA target location, the miRNA could have altered binding affinity to the corresponding gene transcript thus changing the mRNA expression level of the transcript. This could further impact protein production levels which could cause phenotypic variations.

4) Is the variant predicted to interrupt a stable RNA secondary structure?

RNA can function at the RNA level as non-coding RNA or be translated into proteins for downstream processes. RNA secondary structures are extremely important in determining the correct half-life and function of those RNA. Two RNA species with tightly regulated secondary structures are ribosomal RNA (rRNA) and transfer RNA (tRNA) which are essential in translation of mRNA to protein. If the variant disrupts the stability of the RNA secondary structure, the half-life of the RNA could be shortened thus lowering the concentration of RNA in the cell.

Non-coding regions encompasses 99% of the human genome^[13] and region-based annotation is extremely useful in identifying variants in those regions. This approach can be used on WGS data.

Filter-based annotation

This approach identifies variants that are documented in specific databases.^[14] The variants could be obtained from dbSNP, 1000 Genomes Project, or user-supplied list. Additional information could be obtained from the frequency of the variants from the above databases or the predicted deleterious scores created by PolyPhen, CADD, ClinVar or many others.^[1] The more infrequent a variant appears in the public database, the more deleterious it is likely to be. Results from different deleterious score prediction tools can combined together by the researcher to make a more accurate call on the variant.

Taken together, these approaches complement one another to filter through over 4 million variants in a human genome. Common, low-deleterious score variants are eliminated to reveal the rare, high-deleterious score variants which could be causal for congenital diseases.

Technical information

ANNOVAR is a command-line tool written in the Perl programming language and can be run on any operating system that has a Perl interpreter installed.^[1] If used for non-commercial purposes, it is available free as an open-source package that is downloadable through the ANNOVAR website. ANNOVAR can process most next-generation sequencing data which has been run through a variant calling software.

**Overview of the main scripts in the program**
Script	Purpose	Description	Input	Output	Requirements
`annotate_variation.pl`	variant annotator	The core script, which functionally annotates the genetic variants via (1) gene-based, (2) region-based, and/or (3) filter-based annotation.	.avinput	.avinput	Data sources are downloaded for annotation, e.g. hg38, UCSC, 1000 Genomes Project.
`convert2annovar.pl`	file converter	Converts various file formats to the custom ANNOVAR input file format.	See "Conversion to the ANNOVAR input file format" section.	.avinput
`table_annovar.pl`	automated variant annotator	A wrapper around `annotate_variation.pl` that can take VCF format along with the ANNOVAR format, performs annotation and outputs an Excel-compatible file. Ideal for beginners.	.avinput, CSV, TSV, VCF	CSV, TSV, VCF, TXT	Data sources are downloaded for annotation, e.g. hg38, UCSC, 1000 Genomes Project.
`variants_reduction.pl`	variant reducer	Performs stepwise variant reduction on a large set of input variants to narrow down to a subset of functionally important variants. Filtering procedures include: Applies a stepwise procedure of filtering to identify subsets of variants that are likely to be related to a disease.^[2] Such filtering procedures include:^[2] identifying non-synonymous and splicing variants removing variants in segmental duplication regions identifying conserved genomic regions removing variants from 1000 Genomes Project, ESP6500 and dbSNP	.avinput	.avinput	Gene-based annotation data sources and various filter-based annotation data sources are downloaded.

File formats

The ANNOVAR software accepts text-based input files, including VCF (Variant Call Format), the gold standard for describing genetic loci.

The program's main annotation script, annotate_variation.pl requires a custom input file format, the ANNOVAR input format (.avinput). Common file types can be converted to ANNOVAR input format for annotation using a provided script (see below). It is a simple text file where each line in the file corresponds to a variant and within each line are tab-delimited columns representing the basic genomic coordinate fields (chromosome, start position, end position, reference nucleotides, and observed nucleotides), followed by optional columns^[2]

The ANNOVAR file input contains the following basic fields:

Chr
Start
End
Ref
Alt

For basic "out-of-the-box" usage:

A popular function of the ANNOVAR tool is the use of the table_annovar.pl script which simplifies the workflow into one single command-line call, given that the data sources for annotation have already been downloaded. File conversion from VCF file is handled within the function call, followed by annotation and output to an Excel-compatible file. The script takes a number of parameters for annotation and outputs a VCF file with the annotations as key-value pairs inside of the INFO column of the VCF file for each genetic variant, e.g. "genomic_function=exonic".

Conversion to the ANNOVAR input file format

File conversion to the ANNOVAR input format is possible using the provided file format conversion script convert2annovar.pl. The program accepts common file formats outputted by upstream variant calling tools. Subsequent functional annotation scripts annotate_variation.pl use the ANNOVAR input file. File formats that are accepted by the convert2annovar.pl include the following:^[2]

Variant Call Format
Samtools genotype-calling pileup format
Illumina export format from GenomeStudio
SOLiD GFF genotype-calling format
Complete Genomics variant format

Generating input files based on specific variants, transcripts, or genomic regions:

When investigating candidate loci that are linked to diseases, using the above variant calling file formats as input to ANNOVAR is a standard workflow for functional annotation of genetic variants outputted from an upstream bioinformatics pipeline. ANNOVAR can also be used to in other scenarios, such as interrogating a set of genetic variants of interest based on a list of dbSNP identifiers as well as variants within specific genomic or exomic regions.^[2]

In the case of dbSNP identifiers, providing to the convert2annovar.pl script a list of identifiers (e.g. rs41534544, rs4308095, rs12345678) in a text file along with the reference genome of interest as a parameter, ANNOVAR will output an ANNOVAR input file with the genomic coordinate fields for those variants which can then be used for functional annotation.^[2]

In the case of genomic regions, one can provide a genomic range of interest (e.g. chr1:2000001-2000003) along with the reference genome of interest and ANNOVAR will generate an ANNOVAR input file of all the genetic loci spanning that range. In addition, insertion and deletion size could also be specified in which the script will select all the genetic loci where a specific size of interest insertion or deletion is found.^[2]

Last, if looking at variants within specific exonic regions, users can generate ANNOVAR input files for all possible variants in exons (including splicing variants) when theconvert2annovar.pl script is provided an RNA transcript identifier (e.g. NM_022162) based on the standard HGVS (Human Genome Variation Society) nomenclature.^[2]

Output file

The possible output files are an annotated .avinput file, CSV, TSV, or VCF. Depending on the annotation strategy taken (see Figure below), the input and output files will differ. It is possible to configure the output file types given a specific input file, by providing the program the appropriate parameter.

For example, for the table_annovar.pl program, if the input file is VCF, then the output will also be a VCF file. If the input file is of the ANNOVAR input format type, then the output will be a TSV by default, with the option to output to CSV if the -csvout parameter is specified. By choosing CSV or TSV as the output file type, a user could open the files to view the annotations in Excel or a different spreadsheet software application. This is a popular feature among users.

The output file will contain all the data from the original input file with additional columns for the desired annotations. For example, when annotating variants with characteristics such as (1) genomic function and (2) the functional role of the coding variant, the output file will contain all the columns from the input file, followed additional columns "genomic_function" (e.g. with values "exonic" or "intronic") and "coding_variant_function" (e.g. with values "synonymous SNV" or "non-synonymous SNV").

System efficiency

Benchmarked on a modern desktop computer (3 GHz Intel Xeon CPU, 8GB memory), for 4.7 million variants, ANNOVAR requires ~4 minutes to perform gene-based functional annotation, or ~15 minutes to perform stepwise "variants reduction". It is said to be practical for performing variant annotation and variant prioritization on hundreds of human genomes in a day.^[2]

ANNOVAR could be sped up by using the -thread argument which enables multi-threading so that input files could be processed in parallel.

Data resources

To use ANNOVAR for functional annotation of variants, annotation datasets can be downloaded using the annotate_variation.pl script, which saves them to local disk.^[1] Different annotation data sources are used for the three major types of annotation (gene-based, region-based, and filter-based).

These are some of the data sources for each annotation type:

Gene-based annotation

UCSC/Ensembl genes
hg38
GENCODE/CCDS

^[9]

Region-based annotation

ENCODE
Custom-made databases conforming to GFF3 (Generic Feature Format version 3)

^[11]

Filter-based annotation

1000 Genomes Project	LRT	ClinVar
dbSNP	MutationTaster	CADD
avSNP	GERP++	DANN
dbNSFP	ExAC	COSMIC
SIFT	ESP (Exome Sequencing Project)	ICGC
PolyPhen 2	gnomAD allele frequency	NCI60
PhyloP	Complete Genomics allele frequency

Given the large number of data sources for filter-based annotation, here are examples of which subsets of the datasets to use for a few of the most common use cases.^[14]

For frequency of variants in whole-exome data:^[14]
1. ExAC: with allele frequencies for all ethnic groups
2. NHLBI-ESP: from 6500 exomes, use three population groupings
3. gnomAD allele frequency: with allele frequencies for multiple populations
For disease-specific variants:^[14]
1. ClinVar: with individual columns for each ClinVar field for each variant
2. COSMIC: somatic mutations from cancer and the frequency of occurrence in each subtype of cancer
3. ICGC: mutations from the International Cancer Genome Consortium
4. NCI-60: human tumor cell panel exome sequencing allele frequency data

^[14]

Example application

Using ANNOVAR for prioritization of genetic variants to identify mutations in a rare genetic disease

ANNOVAR is one of the common annotation tools for identifying candidate and causal mutations and genes for rare genetic diseases.

Using a combination of gene-based and filter-based annotation followed by variant reduction based on the annotation values of the variants, the causal gene in a rare recessive Mendelian disease called Miller syndrome can be identified.^[1]

This will involve synthesizing a genome-wide data set of ~4.2 million single nucleotide variants (SNVs) and ~0.5 million insertions and deletions (indels).^[1] Two known causal mutations for Miller syndrome (G152R and G202A in the DHODH gene) are also included^[1]

Steps in identifying the causal variants for the disease using ANNOVAR:^[1]

Gene-based annotation to identify exonic/splicing variants of the combination of SNVs and indels (~4.7 million variants) where a total of 24,617 exonic variants are identified.^[1]
Since Miller syndrome is a rare Mendelian disease, exonic protein-changing variants are of interest only, which makes up 11,166.^[1] From that, 4860 variants are identified that falls in highly conserved genomic regions^[1]
As public databases such as dbSNP and 1000 Genomes Project archive previously reported variants which are often common, it is less likely that they will contain the Miller syndrome causal variants which are rare.^[1] Hence, variants found in those data sources are filtered out and 413 variants remain.
Then, genes are assessed for whether multiple variants exist in the same gene as compound heterozygotes and 23 genes are left.^[1]
Finally, ‘dispensable’ genes are removed, those which have high-frequency non-sense mutations (in greater than 1% of subjects in the 1000 Genomes Project) which are susceptible to sequencing and alignment errors in short-read sequencing platform.^[1] These genes are considered less likely to be causal of a rare Mendelian disease. Three genes as result are filtered out, and 20 candidate genes are leftover, including the causal gene DHODH ^[1]

Limitations of ANNOVAR

Two limitations of ANNOVAR relate to detection of common diseases and larger structural variant annotations. These problems are present in all current variant annotation tools.

Most common diseases such as diabetes and Alzheimer have multiple variants throughout the genome which are common in the population.^[15]^[16] These variants are expected to have low individual deleterious scores and cause disease though the accumulation of multiple variants. However ANNOVAR has default "variant-reduction" schemes that provides a small list of rare and highly predicted deleterious variants.^[10] These default settings could be optimized so the output data would display additional variants with decreasing predicted deleterious scores.^[2] ANNOVAR is primarily used for identifying variants involved rare diseases where the causal mutation is expected to be rare and highly deleterious.

Larger structural variants (SVs) such as chromosomal inversions, translocations, and complex SVs have been shown to cause diseases such as haemophilia A and Alzheimer's.^[17]^[18] However, SVs are often difficult to annotate because it is difficult to assign specific deleterious scores to large mutated genomic regions. Currently, ANNOVAR can only annotate genes contained within deletions or duplications, or small indels of <50bp. ANNOVAR cannot infer complex SVs and translocations^[10]

Alternate variant annotation tools

There are also two other types of SNP annotation tools that are similar to ANNOVAR: SNP effect (SnpEFF) and Variant Effect Predictor (VEP). Many of the features between ANNOVAR, SnpEFF, and VEP are the same including the input and output file format, regulatory region annotations, and know variant annotations. However, the main differences are that ANNOVAR cannot annotate for loss of function predictions whereas both SnpEFF and VEP can. Also, ANNOVAR cannot annotate microRNA structural binding locations whereas VEP can.^[19] MicroRNA structural binding location predictions can be informative in revealing post-transcriptional mutations’ role in disease pathogenesis.^[20] Loss of function mutations are changes in the genome that results in the total dysfunction of the gene product. Thus, these predictions could be extremely informative in regards to disease diagnosis, especially in rare monogenic diseases.^{[ citation needed ]}

Comparison of Three Variant Annotation Tools
Class	Feature	VEP	Annovar	SnpEff
General	Availability	Free	Free (academic use only)	Free
Input	VCF	Yes	Yes	Yes
	Sequence variants	Yes	Yes	Yes
	Structural variants	Yes	Yes	Yes
Output	VCF	Yes	Yes	Yes
Transcript sets	Ensembl	Yes	Yes	Yes
	RefSeq	Yes	Yes	Yes
	User-created databases	Yes	Yes	Yes
Interfaces	Local package	Yes	Yes	Yes
Interfaces	Instant prediction web interface	Yes	Yes	No
Consequence types	Splicing predictions	Yes (via plugins)	Yes (via external data)	Yes (experimental)
Consequence types	Loss of function prediction	Yes (via plugins)	No	Yes
Non-coding	Regulatory features	Yes	Yes	Yes
	Support multiple cell lines	Yes	No	Yes
	miRNA structure location	Yes (via plugins)	No	No
Known variants	Report known variants	Yes	Yes	Yes
	Filter by frequency	Yes	Yes	Yes
	Clinical significance	Yes	Yes	Yes
Other filters	Pre-set filters	Yes	Yes	Yes

*Table adapted from McLaren et al. (2016).

Related Research Articles

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, chemistry, physics, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using computational and statistical techniques.

The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly-repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.

<span class="mw-page-title-main">Single-nucleotide polymorphism</span> Single nucleotide in genomic DNA at which different sequence alternatives exist

In genetics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome and is present in a sufficiently large fraction of the population. Single nucleotide substitutions with an allele frequency of less than 1% are called "single-nucleotide variants", not SNPs.

Functional genomics is a field of molecular biology that attempts to describe gene functions and interactions. Functional genomics make use of the vast data generated by genomic and transcriptomic projects. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.

The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.

The 1000 Genomes Project, launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using newly developed technologies which were faster and less expensive. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature. In 2012, the sequencing of 1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported results and the completion of the project and opportunities for future research.

Genomatix GmbH is a computational biology company headquartered in Munich, Germany, with a seat of business in Ann Arbor, Michigan, U.S.A.

GeneCards is a database of human genes that provides genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes. It is being developed and maintained by the Crown Human Genome Center at the Weizmann Institute of Science, in collaboration with LifeMap Sciences.

Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps: the first step is to select only the subset of DNA that encodes proteins. These regions are known as exons—humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.

Proteogenomics is a field of biological research that utilizes a combination of proteomics, genomics, and transcriptomics to aid in the discovery and identification of peptides. Proteogenomics is used to identify new peptides by comparing MS/MS spectra against a protein database that has been derived from genomic and transcriptomic information. Proteogenomics often refers to studies that use proteomic information, often derived from mass spectrometry, to improve gene annotations. The utilization of both proteomics and genomics data alongside advances in the availability and power of spectrographic and chromatographic technology led to the emergence of proteogenomics as its own field in 2004.

Disease gene identification is a process by which scientists identify the mutant genotypes responsible for an inherited genetic disorder. Mutations in these genes can include single nucleotide substitutions, single nucleotide additions/deletions, deletion of the entire gene, and other genetic abnormalities.

Ensembl Genomes is a scientific project to provide genome-scale data from non-vertebrate species.

GeneTalk is a web-based platform, tool, and database for filtering, reduction and prioritization of human sequence variants from next-generation sequencing (NGS) data. GeneTalk allows editing annotation about sequence variants and build up a crowd sourced database with clinically relevant information for diagnostics of genetic disorders. GeneTalk allows searching for information about specific sequence variants and connects to experts on variants that are potentially disease-relevant.

SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms. Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

The Cancer Genome Anatomy Project (CGAP), created by the National Cancer Institute (NCI) in 1997 and introduced by Al Gore, is an online database on normal, pre-cancerous and cancerous genomes. It also provides tools for viewing and analysis of the data, allowing for identification of genes involved in various aspects of tumor progression. The goal of CGAP is to characterize cancer at a molecular level by providing a platform with readily accessible updated data and a set of tools such that researchers can easily relate their findings to existing knowledge. There is also a focus on development of software tools that improve the usage of large and complex datasets. The project is directed by Daniela S. Gerhard, and includes sub-projects or initiatives, with notable ones including the Cancer Chromosome Aberration Project (CCAP) and the Genetic Annotation Initiative (GAI). CGAP contributes to many databases and organisations such as the NCBI contribute to CGAP's databases.

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.

Spliceman is an online genomic identification tool used to predict the likelihood that a mutation within a DNA sequence is linked with genetic disease. It was created in 2011 by a Brown University lab, and has been used in several studies to identify disease-causing mutant alleles.

SnpEff is an open source tool that annotates variants and predicts their effects on genes by using an interval forest approach. This program takes pre-determined variants listed in a data file that contains the nucleotide change and its position and predicts if the variants are deleterious. This program was first created by Pablo Cingolani to predict effects of single nucleotide polymorphisms (SNPs) in Drosophila, and is now widely used at many universities such as Harvard University, UC Berkeley, Stanford University etc. SnpEff has been used for various applications – from personalized medicine at Stanford University, to profiling bacteria. This annotation and prediction software can be compared to ANNOVAR and Variant Effect Predictor, but each use different nomenclatures

Personalized onco-genomics (POG) is the field of oncology and genomics that is focused on using whole genome analysis to make personalized clinical treatment decisions. The program was devised at British Columbia's BC Cancer Agency and is currently being led by Marco Marra and Janessa Laskin. Genome instability has been identified as one of the underlying hallmarks of cancer. The genetic diversity of cancer cells promotes multiple other cancer hallmark functions that help them survive in their microenvironment and eventually metastasise. The pronounced genomic heterogeneity of tumours has led researchers to develop an approach that assesses each individual's cancer to identify targeted therapies that can halt cancer growth. Identification of these "drivers" and corresponding medications used to possibly halt these pathways are important in cancer treatment.

References

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Hakonarson, Hakon; Li, Mingyao; Wang, Kai (2010-09-01). "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data". Nucleic Acids Research . 38 (16): e164. doi:10.1093/nar/gkq603. ISSN 0305-1048. PMC 2938201 . PMID 20601685.
1 2 3 4 5 6 7 8 9 10 11 12 "ANNOVAR website". www.openbioinformatics.org. Retrieved 2019-02-28.
↑ "DNA Sequencing Costs: Data". National Human Genome Research Institute (NHGRI). Retrieved 2019-04-04.
↑ Emerson, Ryan O.; Sherwood, Anna M.; Rieder, Mark J.; Guenthoer, Jamie; Williamson, David W.; Carlson, Christopher S.; Drescher, Charles W.; Tewari, Muneesh; Bielas, Jason H. (December 2013). "High-throughput sequencing of T cell receptors reveals a homogeneous repertoire of tumor-infiltrating lymphocytes in ovarian cancer". The Journal of Pathology . 231 (4): 433–440. doi:10.1002/path.4260. ISSN 0022-3417. PMC 5012191 . PMID 24027095.
↑ Blayney, Jaine K.; Parkes, Eileen; Zheng, Huiru; Taggart, Laura; Browne, Fiona; Haberland, Valeriia; Lightbody, Gaye (2018). "Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application". Briefings in Bioinformatics. 20 (5): 1795–1811. doi: 10.1093/bib/bby051 . PMID 30084865.
↑ Reference, Genetics Home. "What are whole exome sequencing and whole genome sequencing?". Genetics Home Reference. Retrieved 2019-04-04.
↑ Reference, Genetics Home. "What are genome-wide association studies?". Genetics Home Reference. Retrieved 2019-04-04.
↑ The 1000 Genomes Project Consortium (October 2015). "A global reference for human genetic variation". Nature. 526 (7571): 68–74. Bibcode:2015Natur.526...68T. doi:10.1038/nature15393. ISSN 1476-4687. PMC 4750478 . PMID 26432245.
1 2 "Gene-based Annotation - ANNOVAR Documentation". annovar.openbioinformatics.org. Retrieved 2019-02-28.
1 2 3 Yang, Hui; Wang, Kai (October 2015). "Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR". Nature Protocols. 10 (10): 1556–1566. doi:10.1038/nprot.2015.105. ISSN 1754-2189. PMC 4718734 . PMID 26379229.
1 2 "Region-based Annotation - ANNOVAR Documentation". annovar.openbioinformatics.org. Retrieved 2019-02-28.
↑ Jordan, I. King; Rogozin, Igor B.; Wolf, Yuri I.; Koonin, Eugene V. (June 2002). "Essential Genes Are More Evolutionarily Conserved Than Are Nonessential Genes in Bacteria". Genome Research . 12 (6): 962–968. doi:10.1101/gr.87702. ISSN 1088-9051. PMC 1383730 . PMID 12045149.
↑ Reference, Genetics Home. "What is noncoding DNA?". Genetics Home Reference. Retrieved 2019-03-01.
1 2 3 4 5 "Filter-based Annotation - ANNOVAR Documentation". annovar.openbioinformatics.org. Retrieved 2019-02-28.
↑ Wu, Yiming; Jing, Runyu; Dong, Yongcheng; Kuang, Qifan; Li, Yan; Huang, Ziyan; Gan, Wei; Xue, Yue; Li, Yizhou (2017-03-06). "Functional annotation of sixty-five type-2 diabetes risk SNPs and its application in risk prediction". Scientific Reports. 7: 43709. Bibcode:2017NatSR...743709W. doi:10.1038/srep43709. ISSN 2045-2322. PMC 5337961 . PMID 28262806.
↑ Emahazion, T.; Feuk, L.; Jobs, M.; Sawyer, S. L.; Fredman, D.; St Clair, D.; Prince, J. A.; Brookes, A. J. (July 2001). "SNP association studies in Alzheimer's disease highlight problems for complex disease analysis". Trends in Genetics. 17 (7): 407–413. doi:10.1016/S0168-9525(01)02342-3. ISSN 0168-9525. PMID 11418222.
↑ Lakich, Delia; Kazazian, Haig H.; Antonarakis, Stylianos E.; Gitschier, Jane (November 1993). "Inversions disrupting the factor VIII gene are a common cause of severe haemophilia A". Nature Genetics. 5 (3): 236–241. doi:10.1038/ng1193-236. ISSN 1061-4036. PMID 8275087. S2CID 25636383.
↑ Lupski, James R. (June 2015). "Structural Variation Mutagenesis of the Human Genome: Impact on Disease and Evolution". Environmental and Molecular Mutagenesis. 56 (5): 419–436. doi:10.1002/em.21943. ISSN 0893-6692. PMC 4609214 . PMID 25892534.
↑ McLaren, William; Gil, Laurent; Hunt, Sarah E.; Riat, Harpreet Singh; Ritchie, Graham R. S.; Thormann, Anja; Flicek, Paul; Cunningham, Fiona (2016-06-06). "The Ensembl Variant Effect Predictor". Genome Biology. 17 (1): 122. doi:10.1186/s13059-016-0974-4. ISSN 1474-760X. PMC 4893825 . PMID 27268795.
↑ Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y (January 2009). "miR2Disease: a manually curated database for microRNA deregulation in human disease". Nucleic Acids Research. 37. 37 (Database issue): D98–104. doi:10.1093/nar/gkn714. PMC 2686559 . PMID 18927107.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:0-1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Hakonarson, Hakon; Li, Mingyao; Wang, Kai (2010-09-01). "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data". Nucleic Acids Research . 38 (16): e164. doi:10.1093/nar/gkq603. ISSN 0305-1048. PMC 2938201 . PMID 20601685.

[:2-2] 1 2 3 4 5 6 7 8 9 10 11 12 "ANNOVAR website". www.openbioinformatics.org. Retrieved 2019-02-28.

[3] "DNA Sequencing Costs: Data". National Human Genome Research Institute (NHGRI). Retrieved 2019-04-04.

[4] Emerson, Ryan O.; Sherwood, Anna M.; Rieder, Mark J.; Guenthoer, Jamie; Williamson, David W.; Carlson, Christopher S.; Drescher, Charles W.; Tewari, Muneesh; Bielas, Jason H. (December 2013). "High-throughput sequencing of T cell receptors reveals a homogeneous repertoire of tumor-infiltrating lymphocytes in ovarian cancer". The Journal of Pathology . 231 (4): 433–440. doi:10.1002/path.4260. ISSN 0022-3417. PMC 5012191 . PMID 24027095.

[5] Blayney, Jaine K.; Parkes, Eileen; Zheng, Huiru; Taggart, Laura; Browne, Fiona; Haberland, Valeriia; Lightbody, Gaye (2018). "Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application". Briefings in Bioinformatics. 20 (5): 1795–1811. doi: 10.1093/bib/bby051 . PMID 30084865.

[6] Reference, Genetics Home. "What are whole exome sequencing and whole genome sequencing?". Genetics Home Reference. Retrieved 2019-04-04.

[7] Reference, Genetics Home. "What are genome-wide association studies?". Genetics Home Reference. Retrieved 2019-04-04.

[8] The 1000 Genomes Project Consortium (October 2015). "A global reference for human genetic variation". Nature. 526 (7571): 68–74. Bibcode:2015Natur.526...68T. doi:10.1038/nature15393. ISSN 1476-4687. PMC 4750478 . PMID 26432245.

[:3-9] 1 2 "Gene-based Annotation - ANNOVAR Documentation". annovar.openbioinformatics.org. Retrieved 2019-02-28.

[:5-10] 1 2 3 Yang, Hui; Wang, Kai (October 2015). "Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR". Nature Protocols. 10 (10): 1556–1566. doi:10.1038/nprot.2015.105. ISSN 1754-2189. PMC 4718734 . PMID 26379229.

[:4-11] 1 2 "Region-based Annotation - ANNOVAR Documentation". annovar.openbioinformatics.org. Retrieved 2019-02-28.

[12] Jordan, I. King; Rogozin, Igor B.; Wolf, Yuri I.; Koonin, Eugene V. (June 2002). "Essential Genes Are More Evolutionarily Conserved Than Are Nonessential Genes in Bacteria". Genome Research . 12 (6): 962–968. doi:10.1101/gr.87702. ISSN 1088-9051. PMC 1383730 . PMID 12045149.

[13] Reference, Genetics Home. "What is noncoding DNA?". Genetics Home Reference. Retrieved 2019-03-01.

[:1-14] 1 2 3 4 5 "Filter-based Annotation - ANNOVAR Documentation". annovar.openbioinformatics.org. Retrieved 2019-02-28.

[15] Wu, Yiming; Jing, Runyu; Dong, Yongcheng; Kuang, Qifan; Li, Yan; Huang, Ziyan; Gan, Wei; Xue, Yue; Li, Yizhou (2017-03-06). "Functional annotation of sixty-five type-2 diabetes risk SNPs and its application in risk prediction". Scientific Reports. 7: 43709. Bibcode:2017NatSR...743709W. doi:10.1038/srep43709. ISSN 2045-2322. PMC 5337961 . PMID 28262806.

[16] Emahazion, T.; Feuk, L.; Jobs, M.; Sawyer, S. L.; Fredman, D.; St Clair, D.; Prince, J. A.; Brookes, A. J. (July 2001). "SNP association studies in Alzheimer's disease highlight problems for complex disease analysis". Trends in Genetics. 17 (7): 407–413. doi:10.1016/S0168-9525(01)02342-3. ISSN 0168-9525. PMID 11418222.

[17] Lakich, Delia; Kazazian, Haig H.; Antonarakis, Stylianos E.; Gitschier, Jane (November 1993). "Inversions disrupting the factor VIII gene are a common cause of severe haemophilia A". Nature Genetics. 5 (3): 236–241. doi:10.1038/ng1193-236. ISSN 1061-4036. PMID 8275087. S2CID 25636383.

[18] Lupski, James R. (June 2015). "Structural Variation Mutagenesis of the Human Genome: Impact on Disease and Evolution". Environmental and Molecular Mutagenesis. 56 (5): 419–436. doi:10.1002/em.21943. ISSN 0893-6692. PMC 4609214 . PMID 25892534.

[19] McLaren, William; Gil, Laurent; Hunt, Sarah E.; Riat, Harpreet Singh; Ritchie, Graham R. S.; Thormann, Anja; Flicek, Paul; Cunningham, Fiona (2016-06-06). "The Ensembl Variant Effect Predictor". Genome Biology. 17 (1): 122. doi:10.1186/s13059-016-0974-4. ISSN 1474-760X. PMC 4893825 . PMID 27268795.

[20] Jiang Q, Wang Y, Hao Y, Juan L, Teng M, Zhang X, Li M, Wang G, Liu Y (January 2009). "miR2Disease: a manually curated database for microRNA deregulation in human disease". Nucleic Acids Research. 37. 37 (Database issue): D98–104. doi:10.1093/nar/gkn714. PMC 2686559 . PMID 18927107.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

ANNOVAR

Contents

Background

Types of functional annotation of genetic variants

Gene-based annotation

Region-based annotation

Filter-based annotation

Technical information

File formats

Conversion to the ANNOVAR input file format

Output file

System efficiency

Data resources

Gene-based annotation

Region-based annotation

Filter-based annotation

Example application

Using ANNOVAR for prioritization of genetic variants to identify mutations in a rare genetic disease

Limitations of ANNOVAR

Alternate variant annotation tools

Related Research Articles

References