SNP annotation

Classification: Bioinformatics
Subclassification: Single-nucleotide polymorphism
Type of tools used: Functional annotation tools
Other subjects related: Genome project, Genomics

Single nucleotide polymorphism annotation (SNP annotation) is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences. [1]

Introduction

Figure: Directed graph of relationships among SNP prediction webservers and their bioinformatics sources.

Single nucleotide polymorphisms (SNPs) play an important role in genome-wide association studies because they act as primary biomarkers. SNPs are currently the marker of choice due to their large numbers in virtually all populations of individuals. The location of these biomarkers can be tremendously important in terms of predicting functional significance, genetic mapping and population genetics. [3] Each SNP represents a nucleotide change between two individuals at a defined location. SNPs are the most common genetic variant found in all individuals, with one SNP every 100–300 bp in some species. [4] Since there is a massive number of SNPs in the genome, there is a clear need to prioritize SNPs according to their potential effect in order to expedite genotyping and analysis. [5]

Annotating large numbers of SNPs is a difficult and complex process that needs computational methods to handle such large datasets. Many tools have been developed for SNP annotation in different organisms: some are optimized for organisms densely sampled for SNPs (such as humans), but few currently available tools are species non-specific or support non-model organism data. The majority of SNP annotation tools provide computationally predicted putative deleterious effects of SNPs. These tools examine whether a SNP resides in functional genomic regions such as exons, splice sites, or transcription regulatory sites, and predict the potential functional effects that the SNP may have using a variety of machine-learning approaches. However, the tools and systems that prioritize functionally significant SNPs suffer from a few limitations. First, they examine the putative deleterious effects of SNPs with respect to a single biological function, which provides only partial information about the functional significance of SNPs. Second, current systems classify SNPs only into a deleterious or a neutral group. [6]

Many annotation algorithms focus on single nucleotide variants (SNVs), which are considered rarer than SNPs as defined by their minor allele frequency (MAF). [7] [8] As a consequence, the training data for the corresponding prediction methods may differ, and one should take care to select the appropriate tool for a specific purpose. For the purposes of this article, "SNP" will be used to mean both SNP and SNV, but readers should bear in mind the differences.
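The SNP/SNV distinction by minor allele frequency can be illustrated with a minimal Python sketch. The function names and the 1% cutoff are assumptions made for illustration; the text above does not fix a threshold, and individual tools may use different ones.

```python
from collections import Counter

def minor_allele_frequency(alleles):
    """Return the frequency of the second most common allele at one site.

    `alleles` is a flat list of observed alleles, one per chromosome copy.
    """
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0  # monomorphic site: there is no minor allele
    # most_common() sorts by count, so index 1 is the second most common allele
    minor_count = counts.most_common(2)[1][1]
    return minor_count / len(alleles)

def classify_by_maf(maf, threshold=0.01):
    """Label a variant "SNP" (common) or "SNV" (rare) using an assumed cutoff."""
    return "SNP" if maf >= threshold else "SNV"

# Ten diploid individuals give 20 chromosome copies; 3 carry the alternate allele.
observed = ["A"] * 17 + ["G"] * 3
maf = minor_allele_frequency(observed)
print(maf, classify_by_maf(maf))
```

A predictor trained mostly on variants from one side of the cutoff may behave differently on the other side, which is why the label is worth tracking alongside the annotation.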

SNP annotation

Figure: Different types of annotations in genomics.

For SNP annotation, many kinds of genetic and genomic information are used. Based on the different features used by each annotation tool, SNP annotation methods may be split roughly into the following categories:

Gene based annotation

Genomic information from surrounding genomic elements is among the most useful information for interpreting the biological function of an observed variant. Information from a known gene is used as a reference to indicate whether the observed variant resides in or near a gene and whether it has the potential to disrupt the protein sequence and its function. Gene based annotation rests on the fact that non-synonymous mutations can alter the protein sequence and that splice-site mutations may disrupt the transcript splicing pattern. [9]
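As a rough illustration of the coding part of gene based annotation, the sketch below classifies a single-base change within one codon using the standard genetic code. The function name and the category labels are illustrative choices, not the output vocabulary of any particular tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_coding_snp(codon, offset, alt_base):
    """Classify a single-base substitution at `offset` (0, 1 or 2) in `codon`."""
    ref_aa = CODON_TABLE[codon]
    alt_codon = codon[:offset] + alt_base + codon[offset + 1:]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # introduces a premature stop codon
    if ref_aa == "*":
        return "stop-loss"  # removes the original stop codon
    return "missense"       # changes one amino acid into another

print(classify_coding_snp("GAG", 1, "T"))  # Glu -> Val
print(classify_coding_snp("CTT", 2, "C"))  # Leu -> Leu
```

A real annotator would first map the genomic position onto a transcript and reading frame; here the codon and offset are assumed to be known already.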

Knowledge based annotation

Knowledge based annotation draws on information about gene attributes, protein function and metabolism. In this type of annotation, more emphasis is given to genetic variation that disrupts protein functional domains, protein-protein interactions and biological pathways. The non-coding region of the genome contains many important regulatory elements, including promoters, enhancers and insulators; a change in these regulatory regions can alter the function of the genes they control. [10] A mutation in DNA can change the RNA sequence and thereby influence RNA secondary structure, RNA-binding protein recognition and miRNA binding activity. [11] [12]

Functional annotation

This method mainly identifies variant function based on whether the variant loci lie in known functional regions that harbor genomic or epigenomic signals. The functions of non-coding variants are extensive in terms of the affected genomic regions, and they are involved in almost all processes of gene regulation, from the transcriptional to the post-translational level. [13]
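Checking whether a variant falls inside a known functional region is, at its simplest, an interval lookup. The sketch below uses a binary search over sorted, non-overlapping regions; the coordinates and labels in `REGIONS` are hypothetical, and production tools typically use more general interval data structures that also handle overlapping features.

```python
from bisect import bisect_right

# Hypothetical, non-overlapping annotated regions on one chromosome, given as
# (start, end, label) with 1-based inclusive coordinates, sorted by start.
REGIONS = [
    (1000, 1499, "promoter"),
    (1500, 1800, "exon"),
    (1801, 2600, "intron"),
    (2601, 2900, "exon"),
]
STARTS = [region[0] for region in REGIONS]

def annotate_position(pos):
    """Return the label of the region containing `pos`, or "intergenic"."""
    # Index of the last region whose start is <= pos (or -1 if there is none).
    i = bisect_right(STARTS, pos) - 1
    if i >= 0:
        start, end, label = REGIONS[i]
        if start <= pos <= end:
            return label
    return "intergenic"

for position in (999, 1500, 2000, 2900, 3000):
    print(position, annotate_position(position))
```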

Transcriptional gene regulation

Transcriptional gene regulation depends on many spatial and temporal factors in the nucleus, such as global or local chromatin states, nucleosome positioning, transcription factor (TF) binding, and enhancer/promoter activities. Variants that alter any of these biological processes may alter gene regulation and cause phenotypic abnormality. [14] Genetic variants located in distal regulatory regions can affect the binding motifs of TFs, chromatin regulators and other distal transcriptional factors, which disturbs the interaction between an enhancer/silencer and its target gene. [15]
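One common way to estimate whether a variant disturbs a TF binding motif is to score the reference and alternate sequences against a position weight matrix and compare the two. The following sketch assumes a made-up 4-bp motif (`PFM`) and a uniform background; neither comes from a real TF database, and the function names are illustrative.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (consensus AGCT).
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base composition

def motif_score(seq):
    """Log-odds score (in bits) of `seq` against the motif."""
    if len(seq) != len(PFM):
        raise ValueError("sequence length must match motif length")
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PFM, seq))

def binding_score_change(ref_seq, offset, alt_base):
    """Score difference (alt minus ref) caused by one substitution."""
    alt_seq = ref_seq[:offset] + alt_base + ref_seq[offset + 1:]
    return motif_score(alt_seq) - motif_score(ref_seq)

# A G->T change at the second motif position lowers the score.
print(round(binding_score_change("AGCT", 1, "T"), 3))
```

A strongly negative change suggests the variant weakens the predicted binding site; a positive change suggests it may create or strengthen one.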

Alternative splicing

Alternative splicing is one of the most important contributors to the functional complexity of the genome. Modified splicing has a significant effect on phenotypes relevant to disease or drug metabolism. A change in splicing can be caused by modifying any of the components of the splicing machinery, such as splice sites, splice enhancers or silencers. [16] Modification of an alternative splicing site can lead to a different protein form with a different function. Humans use an estimated 100,000 different proteins or more, so some genes must be capable of coding for more than one protein. Alternative splicing occurs more frequently than was previously thought and can be hard to control; genes may produce tens of thousands of different transcripts, necessitating a new gene model for each alternative splice.
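A simple first check for splicing effects is whether a variant hits the canonical splice sites flanking an exon. The sketch below assumes an internal exon with 1-based inclusive coordinates and treats the two intronic bases on each side as the canonical sites, which is a widespread convention rather than something stated in the text; splice enhancers and silencers are not covered.

```python
def splice_site_class(pos, exon_start, exon_end, strand="+"):
    """Classify `pos` relative to the canonical splice sites of one internal exon.

    On the + strand the acceptor site lies just before the exon start and the
    donor site just after the exon end; on the - strand the roles are swapped.
    """
    upstream = exon_start - 2 <= pos <= exon_start - 1   # lower-coordinate flank
    downstream = exon_end + 1 <= pos <= exon_end + 2     # higher-coordinate flank
    if not (upstream or downstream):
        return "exonic" if exon_start <= pos <= exon_end else "other"
    if strand == "+":
        return "splice_acceptor" if upstream else "splice_donor"
    return "splice_donor" if upstream else "splice_acceptor"

# Hypothetical exon spanning positions 100-200.
for position in (98, 150, 201, 203):
    print(position, splice_site_class(position, 100, 200))
```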

RNA processing and post transcriptional regulation

Mutations in the untranslated regions (UTRs) affect many aspects of post-transcriptional regulation. Distinctive structural features are required for many RNA molecules and cis-acting regulatory elements to function effectively during gene regulation. SNVs can alter the secondary structure of RNA molecules and thereby disrupt the proper folding of RNAs, such as tRNA/mRNA/lncRNA folding and miRNA binding recognition regions. [17]

Translation and post translational modifications

Single nucleotide variants can also affect the cis-acting regulatory elements in mRNAs to inhibit or promote translation initiation. Changes among synonymous codons may affect translation efficiency because of codon usage biases. Translation elongation can also be retarded by mutations along the ramp of ribosomal movement. At the post-translational level, genetic variants can contribute to proteostasis and amino acid modifications. However, the mechanisms of variant effects in this field are complicated, and only a few tools are available to predict a variant's effect on translation-related modifications. [18]

Protein function

Non-synonymous variants are variants in exons that change the amino acid sequence encoded by the gene, including single base changes and non-frameshift indels. The effect of non-synonymous variants on proteins has been investigated extensively, and many algorithms have been developed to predict the deleteriousness and pathogenicity of single nucleotide variants (SNVs). Classical bioinformatics tools, such as SIFT, PolyPhen and MutationTaster, successfully predict the functional consequence of non-synonymous substitutions. [19] [20] [21] [22] The PopViz webserver provides a gene-centric approach to visualize mutation damage prediction scores (CADD, SIFT, PolyPhen-2) or population genetics (minor allele frequency) versus the amino acid positions of all coding variants of a given human gene. [23] PopViz is also cross-linked with the UniProt database, where protein domain information can be found, making it possible to identify the predicted deleterious variants that fall into these protein domains on the PopViz plot. [23]

Evolutionary conservation and natural selection

Comparative genomics approaches are used to predict function-relevant variants under the assumption that a functional genetic locus should be conserved across species separated by large phylogenetic distances. On the other hand, some adaptive traits and population differences are driven by positive selection of advantageous variants, and these genetic mutations are functionally relevant to population-specific phenotypes. Functional prediction of variants' effects in different biological processes is pivotal for pinpointing the molecular mechanisms of diseases and traits and for directing experimental validation. [24]
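The conservation assumption can be made concrete with a per-column score over a multiple sequence alignment. The sketch below uses one minus the normalized Shannon entropy of each column, which is one simple choice among many and not the scoring scheme of any tool named in this article. It assumes a gap-free DNA alignment, and the example sequences are invented.

```python
import math
from collections import Counter

def column_conservation(column):
    """Return 1 - normalized Shannon entropy of one column (1.0 = invariant)."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / 2.0  # a DNA column carries at most log2(4) = 2 bits

def alignment_conservation(sequences):
    """Score every column of equal-length aligned sequences."""
    return [column_conservation(col) for col in zip(*sequences)]

# Hypothetical alignment of one short region across four species.
alignment = ["ACGT", "ACGA", "ACTC", "ACAG"]
print(alignment_conservation(alignment))
```

Positions with scores near 1.0 are candidates for functional constraint, so a variant landing there would be ranked higher than one in a freely varying column.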

List of available SNP annotation tools

A large number of SNP annotation tools are currently available to annotate the vast amounts of available NGS data. Some of them target particular kinds of SNPs, while others are more general. Available SNP annotation tools include SNPeff, Ensembl Variant Effect Predictor (VEP), ANNOVAR, FATHMM, PhD-SNP, PolyPhen-2, SuSPect, F-SNP, AnnTools, SeattleSeq, SNPit, SCAN, Snap, SNPs&GO, LS-SNP, Snat, TREAT, TRAMS, Maviant, MutationTaster, SNPdat, Snpranker, NGS – SNP, SVA, VARIANT, SIFT, LIST-S2 and FAST-SNP. The functions and approaches used in SNP annotation tools are listed below.

| Tools | Description | External resources used | Website URL | References |
| --- | --- | --- | --- | --- |
| PhyreRisk | Maps genetics variants onto experimental and predicted protein structures | Variant effect predictor, UniProt, Protein Data Bank, SIFTS, Phyre2 for predicted structures | http://phyrerisk.bc.ic.ac.uk/home | [25] |
| Missense3D | Reports structural impact of a missense variant onto PDB and user-supplied protein coordinates. Developed to be applicable to experimental and predicted protein structures | Protein Data Bank, Phyre2 for predicted structures | http://www.sbg.bio.ic.ac.uk/~missense3d/ | [26] |
| SNPeff | SnpEff annotates variants based on their genomic locations and predicts coding effects. Uses an interval forest approach | ENSEMBL, UCSC and organism based e.g. FlyBase, WormBase and TAIR | http://snpeff.sourceforge.net/SnpEff_manual.html | [27] |
| Ensembl VEP | Determines effects of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, proteins and regulatory regions | dbSNP, RefSeq, UniProt, COSMIC, PDBe, 1000 Genomes, gnomAD, PubMed | https://www.ensembl.org/info/docs/tools/vep/index.html | [28] |
| ANNOVAR | This tool is suitable for pinpointing a small subset of functionally important variants. Uses mutation prediction approach for annotation | UCSC, RefSeq and Ensembl | http://annovar.openbioinformatics.org/ | [29] |
| Jannovar | This is a tool and library for genome annotation | RefSeq, Ensembl, UCSC, etc. | https://github.com/charite/jannovar | [30] |
| PhD-SNP | SVM-based method using sequence information retrieved by BLAST algorithm. | UniRef90 | http://snps.biofold.org/phd-snp/ | [31] |
| PolyPhen-2 | Suitable for predicting damaging effects of missense mutations. Uses sequence conservation, structure to model position of amino acid substitution, and SWISS-PROT annotation | UniProt | http://genetics.bwh.harvard.edu/pph2/ | [32] |
| MutationTaster | Suitable for predicting damaging effects of all intragenic mutations (DNA and protein level), including InDels. | Ensembl, 1000 Genomes Project, ExAC, UniProt, ClinVar, phyloP, phastCons, nnsplice, polyadq (...) | http://www.mutationtaster.org/ | [33] |
| SuSPect | An SVM-trained predictor of the damaging effects of missense mutations. Uses sequence conservation, structure and network (interactome) information to model phenotypic effect of amino acid substitution. Accepts VCF file | UniProt, PDB, Phyre2 for predicted structures, DOMINE and STRING for interactome | http://www.sbg.bio.ic.ac.uk/suspect/index.html | [34] |
| F-SNP | Computationally predicts functional SNPs for disease association studies. | PolyPhen, SIFT, SNPeffect, SNPs3D, LS-SNP, ESEfinder, RescueESE, ESRSearch, PESX, Ensembl, TFSearch, Consite, GoldenPath, Ensembl, KinasePhos, OGPET, Sulfinator, GoldenPath | http://compbio.cs.queensu.ca/F-SNP/ | [35] |
| AnnTools | Design to Identify novel and SNP/SNV, INDEL and SV/CNV. AnnTools searches for overlaps with regulatory elements, disease/trait associated loci, known segmental duplications and artifact prone regions | dbSNP, UCSC, GATK refGene, GAD, published lists of common structural genomic variation, Database of Genomic Variants, lists of conserved TFBs, miRNA | http://anntools.sourceforge.net/ | [36] |
| SNPit | Analyses the potential functional significance of SNPs derived from genome wide association studies | dbSNP, EntrezGene, UCSC Browser, HGMD, ECR Browser, Haplotter, SIFT | -/- | [37] |
| SCAN | Uses physical and functional based annotation to categorize according to their position relative to genes and according to linkage disequilibrium (LD) patterns and effects on expression levels | -/- | http://www.scandb.org/newinterface/about.html | [38] |
| SNAP | A neural network-based method for the prediction of the functional effects of non-synonymous SNPs | Ensembl, UCSC, Uniprot, UniProt, Pfam, DAS-CBS, MINT, BIND, KEGG, TreeFam | http://www.rostlab.org/services/SNAP | [39] |
| SNPs&GO | SVM-based method using sequence information, Gene Ontology annotation and when available protein structure. | UniRef90, GO, PANTHER, PDB | http://snps.biofold.org/snps-and-go/ | [40] |
| LS-SNP | Maps nsSNPs onto protein sequences, functional pathways and comparative protein structure models | UniProtKB, Genome Browser, dbSNP, PD | http://www.salilab.org/LS-SNP | [41] |
| TREAT | TREAT is a tool for facile navigation and mining of the variants from both targeted resequencing and whole exome sequencing | -/- | http://ndc.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm | [42] |
| SNPdat | Suitable for species non-specific or support non-model organism data. SNPdat does not require the creation of any local relational databases or pre-processing of any mandatory input files | -/- | https://code.google.com/p/snpdat/downloads/ | [43] |
| NGS – SNP | Annotate SNPs comparing the reference amino acid and the non-reference amino acid to each orthologue | Ensembl, NCBI and UniProt | http://stothard.afns.ualberta.ca/downloads/NGS-SNP/ | [44] |
| SVA | Predicted biological function to variants identified | NCBI RefSeq, Ensembl, variation databases, UCSC, HGNC, GO, KEGG, HapMap, 1000 Genomes Project and DG | http://www.svaproject.org/ | [45] |
| VARIANT | VARIANT increases the information scope outside the coding regions by including all the available information on regulation, DNA structure, conservation, evolutionary pressures, etc. Regulatory variants constitute a recognized, but still unexplored, cause of pathologies | dbSNP, 1000 genomes, disease-related variants from GWAS, OMIM, COSMIC | http://variant.bioinfo.cipf.es/ | [46] |
| SIFT | SIFT is a program that predicts whether an amino acid substitution affects protein function. SIFT uses sequence homology to predict whether an amino acid substitution will affect protein function | PROT/TrEMBL, or NCBI's | http://blocks.fhcrc.org/sift/SIFT.html | [47] |
| LIST-S2 | LIST-S2 (Local Identity and Shared Taxa, Species-specific) is based on the assumption that variations observed in closely related species are more significant when assessing conservation compared to those in distantly related species | UniProt SwissProt/TrEMBL and NCBI Taxonomy | https://gsponerlab.msl.ubc.ca/software/list/ | [48] [49] |
| FAST-SNP | A web server that allows users to efficiently identify and prioritize high-risk SNPs according to their phenotypic risks and putative functional effects | NCBI dbSNP, Ensembl, TFSearch, PolyPhen, ESEfinder, RescueESE, FAS-ESS, SwissProt, UCSC Golden Path, NCBI Blast and HapMap | http://fastsnp.ibms.sinica.edu.tw/ | [50] |
| PANTHER | PANTHER relate protein sequence evolution to the evolution of specific protein functions and biological roles. The source of protein sequences used to build the protein family trees and used a computer-assisted manual curation step to better define the protein family clusters | STKE, KEGG, MetaCyc, FREX and Reactome | http://www.pantherdb.org/ | [51] |
| Meta-SNP | SVM-based meta predictor including 4 different methods. | PhD-SNP, PANTHER, SIFT, SNAP | http://snps.biofold.org/meta-snp | [52] |
| PopViz | Integrative and interactive gene-centric visualization of population genetics and mutation damage prediction scores of human gene variants | gnomAD, Ensembl, UniProt, OMIM, UCSC, CADD, EIGEN, LINSIGHT, SIFT, PolyPhen-2 | http://shiva.rockefeller.edu/PopViz/ | [23] |

Algorithms used in annotation tools

Variant annotation tools use machine learning algorithms to predict variant annotations. Different annotation tools use different algorithms. Common algorithms include:

Comparison of variant annotation tools

A large number of tools are available for variant annotation. The annotations produced by different tools do not always agree with each other, as the defined rules for data handling differ between applications. A perfect comparison of the available tools is not possible, because not all tools have the same input and output or the same functionality. Below is a table of major annotation tools and their functional area.
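Disagreement between tools can at least be quantified on the variants they both annotate. The following is a minimal sketch; the tool names, variant identifiers and labels in `example_calls` are invented for illustration and do not represent real tool output.

```python
from itertools import combinations

def pairwise_agreement(calls):
    """Return {(tool_a, tool_b): fraction of shared variants with equal labels}.

    `calls` maps a tool name to a dict of variant ID -> predicted label.
    Pairs with no variants in common map to None.
    """
    result = {}
    for a, b in combinations(sorted(calls), 2):
        shared = calls[a].keys() & calls[b].keys()
        if not shared:
            result[(a, b)] = None
            continue
        same = sum(calls[a][v] == calls[b][v] for v in shared)
        result[(a, b)] = same / len(shared)
    return result

# Hypothetical predictions from three tools on up to three variants.
example_calls = {
    "tool_A": {"v1": "deleterious", "v2": "neutral", "v3": "deleterious"},
    "tool_B": {"v1": "deleterious", "v2": "deleterious", "v3": "deleterious"},
    "tool_C": {"v1": "neutral", "v2": "neutral"},
}
for pair, fraction in pairwise_agreement(example_calls).items():
    print(pair, fraction)
```

Restricting the comparison to shared variants sidesteps, but does not solve, the problem that tools differ in which inputs they accept.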

| Tools | Input file | Output file | SNP | INDEL | CNV | Web or program | Source |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AnnoVar | VCF, pileup, CompleteGenomics, GFF3-SOLiD, SOAPsnp, MAQ, CASAVA | TXT | Yes | Yes | Yes | Program | [53] |
| Jannovar | VCF | VCF | Yes | Yes | Yes | Java Program | [54] |
| SNPeff | VCF, pileup/TXT | VCF, TXT, HTML | Yes | Yes | No | Program | [27] |
| Ensembl VEP | Ensembl default (coordinates), VCF, variant identifiers, HGVS, SPDI, REST-style regions | VCF, VEP, TXT, JSON | Yes | Yes | Yes | Web, Perl script, REST API | [55] |
| AnnTools | VCF, pileup, TXT | VCF | Yes | Yes | No | No | [56] |
| SeattleSeq | VCF, MAQ, CASAVA, GATK BED | VCF, SeattleSeq | Yes | Yes | No | Web | [57] |
| VARIANT | VCF, GFF2, BED | web report, TXT | Yes | Yes | Yes | Web | [58] |

[59]
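VCF is the input format shared by every tool in the table above. As a rough illustration of what an annotator reads from it, the sketch below parses the eight fixed, tab-separated columns of one VCF data line. The record type, function names and example line are illustrative assumptions, and the parser ignores header lines, genotype columns and value escaping.

```python
from typing import NamedTuple

class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alts: list
    info: dict

def parse_vcf_line(line):
    """Parse the fixed columns (CHROM..INFO) of one VCF data line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, variant_id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True  # flags carry no value
    return VcfRecord(chrom, int(pos), variant_id, ref, alt.split(","), info_dict)

def variant_type(ref, alt):
    """Rough type of one REF/ALT pair; symbolic alleles are only flagged."""
    if alt.startswith("<"):
        return "symbolic"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    return "INDEL"

record = parse_vcf_line("1\t12345\t.\tA\tG,AT\t.\tPASS\tDP=10;DB")
print(record.pos, [variant_type(record.ref, alt) for alt in record.alts])
```

For real data, an established VCF library is a safer choice than hand-rolled parsing.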

Application

Different annotations capture diverse aspects of variant function. [60] Simultaneous use of multiple, varied functional annotations could improve the power of rare-variant association analysis in whole exome and whole genome sequencing studies. [61] Some tools have been developed to enable functionally-informed phenotype-genotype association analysis for common and rare variants by incorporating functional annotations in biobank-scale cohorts. [62] [63] [64] [65]

Conclusions

The next generation of SNP annotation webservers can take advantage of the growing amount of data in core bioinformatics resources and use intelligent agents to fetch data from different sources as needed. From a user's point of view, it is more efficient to submit a set of SNPs and receive results in a single step, which makes meta-servers the most attractive choice. However, if SNP annotation tools deliver heterogeneous data covering sequence, structure, regulation, pathways, etc., they must also provide frameworks for integrating the data into decision algorithms, as well as quantitative confidence measures so users can assess which data are relevant and which are not.

References

  1. Aubourg S, Rouzé P (2001). "Genome annotation". Plant Physiol. Biochem. 29 (3–4): 181–193. doi:10.1016/S0981-9428(01)01242-6.
  2. Karchin R (January 2009). "Next generation tools for the annotation of human SNPs". Briefings in Bioinformatics. 10 (1): 35–52. doi:10.1093/bib/bbn047. PMC   2638621 . PMID   19181721.
  3. Shen TH, Carlson CS, Tarczy-Hornoch P (August 2009). "SNPit: a federated data integration system for the purpose of functional SNP annotation". Computer Methods and Programs in Biomedicine. 95 (2): 181–189. doi:10.1016/j.cmpb.2009.02.010. PMC   2680224 . PMID   19327864.
  4. N. C. Oraguzie, E.H.A. Rikkerink, S.E. Gardiner, H.N. de Silva (eds.), "Association Mapping in Plants", Springer, 2007
  5. Capriotti E, Nehrt NL, Kann MG, Bromberg Y (July 2012). "Bioinformatics for personal genome interpretation". Briefings in Bioinformatics. 13 (4): 495–512. doi:10.1093/bib/bbr070. PMC 3404395. PMID 22247263.
  6. P. H. Lee, H. Shatkay, “Ranking single nucleotide polymorphisms by potential deleterious effects”, Computational Biology and Machine Learning Lab, School of Computing, Queen’s University, Kingston, ON, Canada
  7. "Single-nucleotide polymorphism", Wikipedia, 2019-08-12, retrieved 2019-09-03
  8. "Minor allele frequency", Wikipedia, 2019-08-12, retrieved 2019-09-03
  9. M. J. Li, J. Wang, "Current trend of annotating single nucleotide variation in humans – A case study on SNVrap", Elsevier, 2014, pp. 1–9
  10. Wang Z, Gerstein M, Snyder M (January 2009). "RNA-Seq: a revolutionary tool for transcriptomics". Nature Reviews. Genetics. 10 (1): 57–63. doi:10.1038/nrg2484. PMC 2949280. PMID 19015660.
  11. Halvorsen M, Martin JS, Broadaway S, Laederach A (August 2010). "Disease-associated mutations that alter the RNA structural ensemble". PLOS Genetics. 6 (8): e1001074. doi:10.1371/journal.pgen.1001074. PMC 2924325. PMID 20808897.
  12. Wan Y, Qu K, Zhang QC, Flynn RA, Manor O, Ouyang Z, et al. (January 2014). "Landscape and variation of RNA secondary structure across the human transcriptome". Nature. 505 (7485): 706–709. Bibcode:2014Natur.505..706W. doi:10.1038/nature12946. PMC 3973747. PMID 24476892.
  13. Sauna ZE, Kimchi-Sarfaty C (August 2011). "Understanding the contribution of synonymous mutations to human disease". Nature Reviews. Genetics. 12 (10): 683–691. doi:10.1038/nrg3051. PMID 21878961. S2CID 8358824.
  14. Li MJ, Yan B, Sham PC, Wang J (May 2015). "Exploring the function of genetic variants in the non-coding genomic regions: approaches for identifying human regulatory variants affecting gene expression". Briefings in Bioinformatics. 16 (3): 393–412. doi:10.1093/bib/bbu018. PMID 24916300.
  15. French JD, Ghoussaini M, Edwards SL, Meyer KB, Michailidou K, Ahmed S, et al. (April 2013). "Functional variants at the 11q13 risk locus for breast cancer regulate cyclin D1 expression through long-range enhancers". American Journal of Human Genetics. 92 (4): 489–503. doi:10.1016/j.ajhg.2013.01.002. PMC 3617380. PMID 23540573.
  16. Faber K, Glatting KH, Mueller PJ, Risch A, Hotz-Wagenblatt A (2011). "Genome-wide prediction of splice-modifying SNPs in human genes using a new analysis pipeline called AASsites". BMC Bioinformatics. 12 (Suppl 4): S2. doi:10.1186/1471-2105-12-s4-s2. PMC 3194194. PMID 21992029.
  17. Kumar V, Westra HJ, Karjalainen J, Zhernakova DV, Esko T, Hrdlickova B, et al. (2013). "Human disease-associated genetic variation impacts large intergenic non-coding RNA expression". PLOS Genetics. 9 (1): e1003201. doi:10.1371/journal.pgen.1003201. PMC 3547830. PMID 23341781.
  18. M. J. Li, J. Wang, "Current trend of annotating single nucleotide variation in humans – A case study on SNVrap", Elsevier, 2014, pp. 1–9
  19. J. Wu, R. Jiang, "Prediction of Deleterious Nonsynonymous Single-Nucleotide Polymorphism for Human Diseases", The Scientific World Journal, 2013, 10 pages
  20. Sim NL, Kumar P, Hu J, Henikoff S, Schneider G, Ng PC (July 2012). "SIFT web server: predicting effects of amino acid substitutions on proteins". Nucleic Acids Research. 40 (Web Server issue): W452–W457. doi:10.1093/nar/gks539. PMC 3394338. PMID 22689647.
  21. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. (April 2010). "A method and server for predicting damaging missense mutations". Nature Methods. 7 (4): 248–249. doi:10.1038/nmeth0410-248. PMC 2855889. PMID 20354512.
  22. Schwarz JM, Rödelsperger C, Schuelke M, Seelow D (August 2010). "MutationTaster evaluates disease-causing potential of sequence alterations". Nature Methods. 7 (8): 575–576. doi:10.1038/nmeth0810-575. PMID 20676075. S2CID 26892938.
  23. Zhang P, Bigio B, Rapaport F, Zhang SY, Casanova JL, Abel L, et al. (December 2018). "PopViz: a webserver for visualizing minor allele frequencies and damage prediction scores of human genetic variations". Bioinformatics. 34 (24): 4307–4309. doi:10.1093/bioinformatics/bty536. PMC 6289133. PMID 30535305.
  24. M. J. Li, J. Wang, "Current trend of annotating single nucleotide variation in humans – A case study on SNVrap", Elsevier, 2014, pp. 1–9
  25. Ofoegbu TC, David A, Kelley LA, Mezulis S, Islam SA, Mersmann SF, et al. (June 2019). "PhyreRisk: A Dynamic Web Application to Bridge Genomics, Proteomics and 3D Structural Data to Guide Interpretation of Human Genetic Variants". Journal of Molecular Biology. 431 (13): 2460–2466. doi:10.1016/j.jmb.2019.04.043. PMC 6597944. PMID 31075275.
  26. Ittisoponpisan S, Islam SA, Khanna T, Alhuzimi E, David A, Sternberg MJ (May 2019). "Can Predicted Protein 3D Structures Provide Reliable Insights into whether Missense Variants Are Disease Associated?". Journal of Molecular Biology. 431 (11): 2197–2212. doi:10.1016/j.jmb.2019.04.009. PMC 6544567. PMID 30995449.
  27. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. (2012). "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3". Fly. 6 (2): 80–92. doi:10.4161/fly.19695. PMC 3679285. PMID 22728672.
  28. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al. (June 2016). "The Ensembl Variant Effect Predictor". Genome Biology. 17 (1): 122. doi:10.1186/s13059-016-0974-4. PMC 4893825. PMID 27268795.
  29. Wang K, Li M, Hakonarson H (September 2010). "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data". Nucleic Acids Research. 38 (16): e164. doi:10.1093/nar/gkq603. PMC 2938201. PMID 20601685.
  30. Jäger M, Wang K, Bauer S, Smedley D, Krawitz P, Robinson PN (May 2014). "Jannovar: a java library for exome annotation". Human Mutation. 35 (5): 548–555. doi:10.1002/humu.22531. PMID 24677618. S2CID 10822001.
  31. Capriotti E, Calabrese R, Casadio R (November 2006). "Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information". Bioinformatics. 22 (22): 2729–2734. doi:10.1093/bioinformatics/btl423. PMID 16895930.
  32. Adzhubei I, Jordan DM, Sunyaev SR (January 2013). "Predicting functional effect of human missense mutations using PolyPhen-2". Current Protocols in Human Genetics. Chapter 7: Unit7.20. doi:10.1002/0471142905.hg0720s76. PMC 4480630. PMID 23315928.
  33. Schwarz JM, Rödelsperger C, Schuelke M, Seelow D (August 2010). "MutationTaster evaluates disease-causing potential of sequence alterations". Nature Methods. 7 (8): 575–576. doi:10.1038/nmeth0810-575. PMID 20676075. S2CID 26892938.
  34. Yates CM, Filippis I, Kelley LA, Sternberg MJ (July 2014). "SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features". Journal of Molecular Biology. 426 (14): 2692–2701. doi:10.1016/j.jmb.2014.04.026. PMC 4087249. PMID 24810707.
  35. Lee PH, Shatkay H (January 2008). "F-SNP: computationally predicted functional SNPs for disease association studies". Nucleic Acids Research. 36 (Database issue): D820–D824. doi:10.1093/nar/gkm904. PMC 2238878. PMID 17986460.
  36. Makarov V, O'Grady T, Cai G, Lihm J, Buxbaum JD, Yoon S (March 2012). "AnnTools: a comprehensive and versatile annotation toolkit for genomic variants". Bioinformatics. 28 (5): 724–725. doi:10.1093/bioinformatics/bts032. PMC 3289923. PMID 22257670.
  37. Shen TH, Carlson CS, Tarczy-Hornoch P (August 2009). "SNPit: a federated data integration system for the purpose of functional SNP annotation". Computer Methods and Programs in Biomedicine. 95 (2): 181–189. doi:10.1016/j.cmpb.2009.02.010. PMC 2680224. PMID 19327864.
  38. Gamazon ER, Zhang W, Konkashbaev A, Duan S, Kistner EO, Nicolae DL, et al. (January 2010). "SCAN: SNP and copy number annotation". Bioinformatics. 26 (2): 259–262. doi:10.1093/bioinformatics/btp644. PMC 2852202. PMID 19933162.
  39. Bromberg Y, Rost B (2007). "SNAP: predict effect of non-synonymous polymorphisms on function". Nucleic Acids Research. 35 (11): 3823–3835. doi:10.1093/nar/gkm238. PMC 1920242. PMID 17526529.
  40. Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R (August 2009). "Functional annotations improve the predictive score of human disease-related mutations in proteins". Human Mutation. 30 (8): 1237–1244. doi:10.1002/humu.21047. PMID 19514061. S2CID 33900765.
  41. Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, et al. (June 2005). "LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources". Bioinformatics. 21 (12): 2814–2820. doi:10.1093/bioinformatics/bti442. PMID 15827081.
  42. Asmann YW, Middha S, Hossain A, Baheti S, Li Y, Chai HS, et al. (January 2012). "TREAT: a bioinformatics tool for variant annotations and visualizations in targeted and exome sequencing data". Bioinformatics. 28 (2): 277–278. doi:10.1093/bioinformatics/btr612. PMC 3259432. PMID 22088845.
  43. Doran AG, Creevey CJ (February 2013). "Snpdat: easy and rapid annotation of results from de novo snp discovery projects for model and non-model organisms". BMC Bioinformatics. 14: 45. doi:10.1186/1471-2105-14-45. PMC 3574845. PMID 23390980.
  44. Grant JR, Arantes AS, Liao X, Stothard P (August 2011). "In-depth annotation of SNPs arising from resequencing projects using NGS-SNP". Bioinformatics. 27 (16): 2300–2301. doi:10.1093/bioinformatics/btr372. PMC 3150039. PMID 21697123.
  45. Ge D, Ruzzo EK, Shianna KV, He M, Pelak K, Heinzen EL, et al. (July 2011). "SVA: software for annotating and visualizing sequenced human genomes". Bioinformatics. 27 (14): 1998–2000. doi:10.1093/bioinformatics/btr317. PMC 3129530. PMID 21624899.
  46. Medina I, De Maria A, Bleda M, Salavert F, Alonso R, Gonzalez CY, Dopazo J (July 2012). "VARIANT: Command Line, Web service and Web interface for fast and accurate functional characterization of variants found by Next-Generation Sequencing". Nucleic Acids Research. 40 (Web Server issue): W54–W58. doi:10.1093/nar/gks572. PMC 3394276. PMID 22693211.
  47. Ng PC, Henikoff S (July 2003). "SIFT: Predicting amino acid changes that affect protein function". Nucleic Acids Research. 31 (13): 3812–3814. doi:10.1093/nar/gkg509. PMC 168916. PMID 12824425.
  48. Malhis N, Jones SJ, Gsponer J (April 2019). "Improved measures for evolutionary conservation that exploit taxonomy distances". Nature Communications. 10 (1): 1556. Bibcode:2019NatCo..10.1556M. doi:10.1038/s41467-019-09583-2. PMC 6450959. PMID 30952844.
  49. Malhis N, Jacobson M, Jones SJ, Gsponer J (July 2020). "LIST-S2: taxonomy based sorting of deleterious missense mutations across species". Nucleic Acids Research. 48 (W1): W154–W161. doi:10.1093/nar/gkaa288. PMC 7319545. PMID 32352516.
  50. Yuan HY, Chiou JJ, Tseng WH, Liu CH, Liu CK, Lin YJ, et al. (July 2006). "FASTSNP: an always up-to-date and extendable service for SNP function analysis and prioritization". Nucleic Acids Research. 34 (Web Server issue): W635–W641. doi:10.1093/nar/gkl236. PMC 1538865. PMID 16845089.
  51. Mi H, Guo N, Kejariwal A, Thomas PD (January 2007). "PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways". Nucleic Acids Research. 35 (Database issue): D247–D252. doi:10.1093/nar/gkl869. PMC 1716723. PMID 17130144.
  52. Capriotti E, Altman RB, Bromberg Y (2013). "Collective judgment predicts disease-associated single nucleotide variants". BMC Genomics. 14 (Suppl 3): S2. doi:10.1186/1471-2164-14-S3-S2. PMC 3839641. PMID 23819846.
  53. Wang K, Li M, Hakonarson H (September 2010). "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data". Nucleic Acids Research. 38 (16): e164. doi:10.1093/nar/gkq603. PMC 2938201. PMID 20601685.
  54. "charite/jannovar". GitHub. Retrieved 2016-09-25.
  55. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, et al. (June 2016). "The Ensembl Variant Effect Predictor". Genome Biology. 17 (1): 122. doi:10.1186/s13059-016-0974-4. PMC 4893825. PMID 27268795.
  56. Makarov V, O'Grady T, Cai G, Lihm J, Buxbaum JD, Yoon S (March 2012). "AnnTools: a comprehensive and versatile annotation toolkit for genomic variants". Bioinformatics. 28 (5): 724–725. doi:10.1093/bioinformatics/bts032. PMC 3289923. PMID 22257670.
  57. "Input Variation List File for Annotation". SeattleSeq Annotation 151.
  58. Medina I, De Maria A, Bleda M, Salavert F, Alonso R, Gonzalez CY, Dopazo J (July 2012). "VARIANT: Command Line, Web service and Web interface for fast and accurate functional characterization of variants found by Next-Generation Sequencing". Nucleic Acids Research. 40 (Web Server issue): W54–W58. doi:10.1093/nar/gks572. PMC 3394276. PMID 22693211.
  59. Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, et al. (March 2014). "A survey of tools for variant analysis of next-generation genome sequencing data". Briefings in Bioinformatics. 15 (2): 256–278. doi:10.1093/bib/bbs086. PMC 3956068. PMID 23341494.
  60. Lee PH, Lee C, Li X, Wee B, Dwivedi T, Daly M (January 2018). "Principles and methods of in-silico prioritization of non-coding regulatory variants". Human Genetics. 137 (1): 15–30. doi:10.1007/s00439-017-1861-0. PMC 5892192. PMID 29288389.
  61. Li X, Li Z, Zhou H, Gaynor SM, Liu Y, Chen H, et al. (September 2020). "Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale". Nature Genetics. 52 (9): 969–983. doi:10.1038/s41588-020-0676-4. PMC 7483769. PMID 32839606.
  62. Watanabe K, Taskesen E, van Bochoven A, Posthuma D (November 2017). "Functional mapping and annotation of genetic associations with FUMA". Nature Communications. 8 (1): 1826. doi:10.1038/s41467-017-01261-5. PMC 5705698. PMID 29184056.
  63. Li Z, Li X, Zhou H, Gaynor SM, Selvaraj MS, Arapoglou T, et al. (December 2022). "A framework for detecting noncoding rare-variant associations of large-scale whole-genome sequencing studies". Nature Methods. 19 (12): 1599–1611. doi:10.1038/s41592-022-01640-x. PMC 10008172. PMID 36303018. S2CID 243873361.
  64. "STAARpipeline: an all-in-one rare-variant tool for biobank-scale whole-genome sequencing data". Nature Methods. 19 (12): 1532–1533. December 2022. doi:10.1038/s41592-022-01641-w. PMID 36316564. S2CID 253246835.
  65. Li X, Quick C, Zhou H, Gaynor SM, Liu Y, Chen H, et al. (January 2023). "Powerful, scalable and resource-efficient meta-analysis of rare variant associations in large whole genome sequencing studies". Nature Genetics. 55 (1): 154–164. doi:10.1038/s41588-022-01225-6. PMC 10084891. PMID 36564505. S2CID 255084231.