SnpEff

Original author(s): Pablo Cingolani
Initial release: 2012
Stable release: 5.2c (April 9, 2024)
Repository: github.com/pcingola/SnpEff
Written in: Java
License: MIT
Website: pcingola.github.io/SnpEff/

SnpEff is an open-source tool that annotates genetic variants and predicts their effects on genes using an interval forest approach. The program takes a data file of predetermined variants, each listing a nucleotide change and its position, and predicts whether each variant is deleterious. It was first developed to predict the effects of single nucleotide polymorphisms (SNPs) in Drosophila. [1] As of July 2024, the SnpEff paper had been cited 10,076 times. SnpEff has been used for various applications, [2] [3] [4] from personalized medicine [5] to profiling bacteria. [6] It is comparable to ANNOVAR and Variant Effect Predictor, although each tool uses a different nomenclature. [7] [8]
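The interval forest idea mentioned above can be illustrated with a minimal sketch: one interval tree per chromosome, queried by variant position. This is not SnpEff's own code; the class name `IntervalNode`, the helpers `build_forest` and `query_forest`, and the gene coordinates are all hypothetical, and SnpEff's actual data structure may differ in detail.

```python
class IntervalNode:
    """Centered interval tree over (start, end, label) tuples, ends inclusive."""

    def __init__(self, intervals):
        # Use the median endpoint as the center, so at least one
        # interval always contains it and the recursion terminates.
        points = sorted(p for start, end, _ in intervals for p in (start, end))
        self.center = points[len(points) // 2]
        self.overlapping = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, pos):
        """Return every stored interval that contains position pos."""
        hits = [iv for iv in self.overlapping if iv[0] <= pos <= iv[1]]
        if pos < self.center and self.left is not None:
            hits += self.left.query(pos)
        elif pos > self.center and self.right is not None:
            hits += self.right.query(pos)
        return hits


def build_forest(features_by_chrom):
    """Build one interval tree per chromosome (the 'forest')."""
    return {chrom: IntervalNode(ivs) for chrom, ivs in features_by_chrom.items() if ivs}


def query_forest(forest, chrom, pos):
    """Return labels of features overlapping chrom:pos, or [] if none."""
    tree = forest.get(chrom)
    if tree is None:
        return []
    return sorted(label for _, _, label in tree.query(pos))


# Hypothetical gene coordinates, for illustration only.
forest = build_forest({
    "chr1": [(100, 200, "geneA"), (150, 300, "geneB")],
    "chr2": [(10, 20, "geneC")],
})
print(query_forest(forest, "chr1", 160))
```

Keying the trees by chromosome means a lookup only searches features on the variant's own chromosome, which is the property that makes this kind of structure attractive for annotating large variant sets.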


Figure: Usage pathway for SnpEff

Usage

SnpEff runs on Windows, Unix, and Mac systems, although the installation steps differ. On all systems, SnpEff is first downloaded as a ZIP file and decompressed. [9] On Windows, the decompressed files are then copied into the desired directory; on Unix and Mac, an additional command-line step is required. Once the software is installed, the user supplies a VCF or TXT input file containing the following tab-separated columns: chromosome name, position, variant ID, reference allele, alternative allele, quality score, quality filter, and information (INFO).
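Once installed, SnpEff is typically invoked from the command line through its JAR file. The sketch below only assembles such a command; it does not run it. The helper name `build_snpeff_command`, the file paths, the heap size, and the genome database name `GRCh37.75` are illustrative assumptions, and the exact arguments should be checked against the SnpEff manual for the installed version.

```python
import shlex


def build_snpeff_command(jar_path, genome, vcf_path, max_heap="8g"):
    """Assemble a 'java -jar snpEff.jar <genome> <input.vcf>' command.

    genome names an installed SnpEff database (assumed here), and
    max_heap is passed to the JVM as -Xmx.
    """
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path, genome, vcf_path]


command = build_snpeff_command("snpEff.jar", "GRCh37.75", "input.vcf")
print(shlex.join(command))
# prints: java -Xmx8g -jar snpEff.jar GRCh37.75 input.vcf
```

SnpEff writes the annotated VCF to standard output, so in practice the command is usually run with its output redirected to a file, for example by passing `stdout=` an open file handle to `subprocess.run`.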

Figure: SnpEff input file example

The chromosome name and position columns give the variant's location: the chromosome and the nucleotide position on it. If the variant has a previously assigned identifier (for example, rs34567), it goes in the ID column. The reference column gives the nucleotide found in the reference genome at that position, and any differences from the reference are recorded in the alternative column. The quality column holds a score for how reliable the variant call is, and the filter column records the result of any quality filters applied. Any other genomic information goes in the INFO column, which SnpEff modifies to carry its output.
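The column layout described above can be made concrete with a small parser for a single VCF data line. This is a minimal sketch rather than a full VCF reader: the names `VcfRecord` and `parse_vcf_line` are hypothetical, the sample line is invented for illustration, and header lines and per-sample genotype columns are ignored.

```python
from typing import List, NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str                  # chromosome name
    pos: int                    # 1-based nucleotide position
    variant_id: Optional[str]   # e.g. an rs number, or None if "."
    ref: str                    # reference allele
    alt: List[str]              # one or more alternative alleles
    qual: Optional[float]       # quality score, or None if "."
    filter_status: str          # e.g. "PASS"
    info: str                   # raw INFO string


def parse_vcf_line(line):
    """Split one tab-separated VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        variant_id=None if vid == "." else vid,
        ref=ref,
        alt=alt.split(","),
        qual=None if qual == "." else float(qual),
        filter_status=flt,
        info=info,
    )


# Invented example line, for illustration only.
record = parse_vcf_line("1\t889455\trs34567\tG\tA\t100.0\tPASS\tAF=0.0005")
print(record.chrom, record.pos, record.ref, record.alt)
```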

Figure: SnpEff output example

The output written to the INFO column includes: the effect of the variant (stop loss, stop gain, etc.), the effect impact on the gene (High, Moderate, Low, or Modifier), the functional class of the variant (nonsense, missense, frameshift, etc.), codon change, amino acid change, amino acid length, gene name, gene biotype (protein coding, pseudogene, rRNA, etc. [10]), coding information, transcript information, exon information, and any errors or warnings detected. SnpEff uses the effect impact to indicate how deleterious the variant is predicted to be. For example, a HIGH impact means that SnpEff predicts the variant to have a deleterious effect on the gene.
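The field list above matches the pipe-delimited `EFF=` layout used by older SnpEff releases, in which each effect is written as `Effect(Impact|Functional_Class|Codon_Change|...)`. The sketch below parses that layout and keeps only HIGH-impact effects. It is an illustration under that assumption: newer releases write an `ANN=` field with a different layout, the function and key names are hypothetical, and the sample INFO string is invented.

```python
import re

# Field order assumed from the legacy EFF layout; optional error and
# warning fields, if present, follow these eleven.
EFF_FIELDS = [
    "impact", "functional_class", "codon_change", "amino_acid_change",
    "amino_acid_length", "gene_name", "biotype", "coding",
    "transcript_id", "exon_rank", "genotype_number",
]


def parse_eff(info):
    """Return one dict per effect found in the EFF= entry of an INFO string."""
    raw = None
    for entry in info.split(";"):
        if entry.startswith("EFF="):
            raw = entry[len("EFF="):]
            break
    if raw is None:
        return []
    effects = []
    for item in raw.split(","):
        match = re.fullmatch(r"([^()]+)\((.*)\)", item)
        if match is None:
            continue  # skip anything that is not Effect(...)
        values = match.group(2).split("|")
        record = {"effect": match.group(1)}
        record.update(zip(EFF_FIELDS, values))
        record["extra"] = values[len(EFF_FIELDS):]  # errors/warnings, if any
        effects.append(record)
    return effects


def high_impact(effects):
    """Keep only effects whose impact is HIGH."""
    return [e for e in effects if e.get("impact") == "HIGH"]


# Invented INFO string, for illustration only.
info = ("DP=20;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|NOC2L||CODING|NM_015658|7|1),"
        "DOWNSTREAM(MODIFIER||||611|SAMD11||CODING|NM_152486||1)")
for effect in high_impact(parse_eff(info)):
    print(effect["effect"], effect["gene_name"], effect["amino_acid_change"])
```

Filtering on the impact field alone is a coarse first pass; in practice it is usually combined with other criteria such as gene lists or allele frequency.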

SnpEff is typically used for research and academic purposes at institutions and companies and, in some instances, for personalized medicine. [citation needed] However, Pablo Cingolani now recommends that ClinEff (a combination of SnpEff and SnpSift) be used for medical purposes. [citation needed]

Advantages and limitations

SnpEff has both advantages and limitations. It can analyze all variants from the 1000 Genomes Project in less than 15 minutes and can be integrated into other tools such as Galaxy, GATK, and GKNO. It can also be combined with other toolkits to narrow variant prediction parameters (for example, a whitelist [11]).

SnpEff Limitations:

References

  1. Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly (Austin) 6.2 (2012): 80-92. PMID 22728672.
  2. Medina, Ignacio, et al. "VARIANT: Command Line, Web service and Web interface for fast and accurate functional characterization of variants found by Next-Generation Sequencing." Nucleic Acids Research 40.W1 (2012): W54-W58.
  3. Kim, Yun Joong, et al. "Neuroimaging studies and whole exome sequencing of PLA2G6-associated neurodegeneration in a family with intrafamilial phenotypic heterogeneity." Parkinsonism & related disorders 21.4 (2015): 402-406.
  4. Reddy, Mettu M., and Kandasamy Ulaganathan. "Draft genome sequence of Oryza sativa elite indica cultivar RP Bio-226." Frontiers in plant science 6 (2015).
  5. Dewey, Frederick E., et al. "Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study." Science 354.6319 (2016): aaf6814.
  6. Medvedeva, E. S., et al. "Genomic and proteomic profiles of Acholeplasma laidlawii strains differing in sensitivity to ciprofloxacin." Doklady Biochemistry and Biophysics. Vol. 466. No. 1. Pleiades Publishing, 2016.
  7. Wang K, Li M, Hakonarson H. "ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data." Nucleic Acids Research 38 (2010): e164.
  8. "Variant Effect Predictor." EMBL-EBI, Dec. 2016. Web. 28 Feb. 2017. <http://uswest.ensembl.org/info/docs/tools/vep/index.html>.
  9. "SnpEff." N.p., n.d. Web. 28 Feb. 2017. <http://snpeff.sourceforge.net/SnpEff_manual.html>.
  10. "Help - Frequently Asked Questions - Homo sapiens - Ensembl genome browser 87." N.p., n.d. Web. 28 Feb. 2017.
  11. Dewey, Frederick E., et al. "Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study." Science 354.6319 (2016): aaf6814.