Original author(s) | Pablo Cingolani |
---|---|
Initial release | 2012 |
Stable release | 5.2c / April 9, 2024 |
Repository | github |
Written in | Java |
License | MIT |
Website | pcingola |
SnpEff is an open-source tool that annotates genetic variants and predicts their effects on genes using an interval forest approach. The program takes predetermined variants listed in a data file, which gives each nucleotide change and its position, and predicts whether the variants are deleterious. It was first developed to predict the effects of single nucleotide polymorphisms (SNPs) in Drosophila. [1] As of July 2024, the SnpEff paper had been cited 10,076 times. SnpEff has been used for various applications, [2] [3] [4] ranging from personalized medicine [5] to profiling bacteria. [6] This annotation and prediction software is comparable to ANNOVAR and Variant Effect Predictor, but each tool uses a different nomenclature. [7] [8]
SnpEff runs on Windows, Unix and Mac systems, although the installation steps differ. On all systems, SnpEff is first downloaded as a ZIP file and decompressed; [9] on Windows the extracted files are then copied to the desired location, whereas Unix and Mac require an additional command-line step. Once the software is installed, the user provides a VCF or TXT input file containing the following tab-separated columns: chromosome name, position, variant ID, reference allele, alternative allele, quality score, quality filter and information (INFO).
The chromosome name and position columns describe where the variant is located: the chromosome and the nucleotide position on it. If the variant has a previously assigned name (for example, rs34567), it goes in the ID column. The reference column gives the nucleotide found in the reference genome at that position, and the alternative column records how the variant differs from the reference. The quality column holds a score indicating how reliable the variant call is, and the filter column records the outcome of any quality filters applied to it. Any other genomic information is placed in the INFO column, which SnpEff modifies to hold its output after it has run.
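The eight columns described above can be read with a few lines of code. The following is a minimal sketch in Python, not part of SnpEff itself; the names `VcfRecord` and `parse_vcf_line` are hypothetical, and the sample line is invented for illustration.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    """The eight fixed, tab-separated columns of a variant line."""
    chrom: str    # chromosome name
    pos: int      # nucleotide position on the chromosome
    id: str       # variant ID such as rs34567, or "." if unnamed
    ref: str      # allele in the reference genome
    alt: str      # alternative allele(s)
    qual: str     # quality score, kept as text because it may be "."
    filter: str   # outcome of quality filters, e.g. "PASS"
    info: str     # free-form INFO column; SnpEff writes its output here


def parse_vcf_line(line: str) -> VcfRecord:
    """Split one data line (not a '#' header line) into its columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(fields)}")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return VcfRecord(chrom, int(pos), vid, ref, alt, qual, flt, info)


# Invented example line, for illustration only.
record = parse_vcf_line("1\t889455\trs34567\tG\tA\t100.0\tPASS\tDP=32\n")
print(record.chrom, record.pos, record.ref, record.alt)
```

Keeping the quality score as text avoids failing on files that use `.` for a missing value; a caller that needs a number can convert it after checking for that case.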
The output in the INFO column includes: the effect of the variant (stop loss, stop gain, etc.), the impact of the effect on the gene (High, Moderate, Low or Modifier), the functional class of the variant (nonsense, missense, frameshift, etc.), codon change, amino acid change, amino acid length, gene name, gene biotype (protein coding, pseudogene, rRNA, etc. [10] ), coding information, transcript information, exon information and any errors or warnings detected. SnpEff uses the effect impact to indicate how deleterious the variant is predicted to be. For example, a HIGH impact means that SnpEff predicts the variant to have a deleterious effect on the gene.
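The fields listed above can be pulled out of the INFO column programmatically. The sketch below assumes the legacy `EFF=Effect(Impact|Functional_Class|...)` layout, in which annotations are comma-separated and sub-fields are separated by `|`; newer SnpEff versions write an `ANN` field with a different layout that this sketch does not handle. The function names, the sub-field labels and the sample string are illustrative assumptions, not SnpEff's own API.

```python
import re
from typing import Dict, List, Optional

# Assumed sub-field order of the legacy EFF annotation (labels are ours).
EFF_SUBFIELDS = [
    "impact", "functional_class", "codon_change", "aa_change", "aa_length",
    "gene_name", "biotype", "coding", "transcript_id", "exon_rank",
    "genotype_number",
]
IMPACT_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]  # most to least severe

_EFF_ENTRY = re.compile(r"^(?P<effect>[^()]+)\((?P<body>.*)\)$")


def parse_eff(info: str) -> List[Dict[str, object]]:
    """Return one dict per annotation found in an 'EFF=' INFO entry."""
    for item in info.split(";"):
        if item.startswith("EFF="):
            entries = item[len("EFF="):].split(",")
            break
    else:
        return []  # no EFF annotation present
    parsed = []
    for entry in entries:
        match = _EFF_ENTRY.match(entry)
        if match is None:
            continue  # skip anything that is not Effect(...)
        values = match.group("body").split("|")
        record: Dict[str, object] = {"effect": match.group("effect")}
        record.update(zip(EFF_SUBFIELDS, values))
        # Trailing sub-fields, if any, are errors or warnings.
        record["extra"] = values[len(EFF_SUBFIELDS):]
        parsed.append(record)
    return parsed


def most_severe_impact(info: str) -> Optional[str]:
    """Return the most severe impact among all annotations, or None."""
    impacts = {rec.get("impact") for rec in parse_eff(info)}
    for level in IMPACT_ORDER:
        if level in impacts:
            return level
    return None


# Invented INFO string, for illustration only.
info = ("DP=32;EFF=STOP_GAINED(HIGH|NONSENSE|Cag/Tag|Q236*|749|GENE1|"
        "protein_coding|CODING|TR0001|7|1),"
        "DOWNSTREAM(MODIFIER||||749|GENE1|protein_coding|CODING|TR0002||1)")
print(most_severe_impact(info))
```

A variant can carry several annotations, one per affected transcript, so a filter that keeps predicted-deleterious variants would typically look at the most severe impact rather than the first one listed.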
SnpEff is typically used for research and academic purposes at institutions and companies and, in some instances, for personalized medicine.[citation needed] However, Pablo Cingolani now recommends that ClinEff (a combination of SnpEff and SnpSift) be used for medical purposes.[citation needed]
SnpEff has both advantages and limitations. It can analyze all variants from the 1000 Genomes Project in less than 15 minutes and can be integrated into other tools such as Galaxy, GATK and GKNO. It can also be combined with other toolkits to narrow variant prediction parameters (for example, whitelist [11] ).
SnpEff Limitations: