Content | |
---|---|
Description | Single Nucleotide Polymorphism Database |
Organisms | Homo sapiens |
Contact | |
Research center | National Center for Biotechnology Information |
Primary citation | PMID 21097890 |
Release date | 1998 |
Access | |
Data format | ASN.1, Fasta, XML |
Website | ncbi |
Download URL | ftp://ftp.ncbi.nih.gov/snp/ |
Web service URL | EUtils SOAP |
The Single Nucleotide Polymorphism Database [1] (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. [2] The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences. [2]
In 2017, NCBI stopped support for all non-human organisms in dbSNP. [3] As of build 153 (released in August 2019), dbSNP had amassed nearly 2 billion submissions representing more than 675 million distinct variants for Homo sapiens.
dbSNP is an online resource implemented to aid biology researchers. Its goal is to act as a single database that contains all identified genetic variation, which can be used to investigate a wide variety of genetically based natural phenomena. Specifically, access to the molecular variation cataloged within dbSNP aids basic research such as physical mapping, population genetics, and investigations into evolutionary relationships, and makes it possible to quickly quantify the amount of variation at a given site of interest. In addition, dbSNP guides applied research in pharmacogenomics and the association of genetic variation with phenotypic traits. [4] According to the NCBI website, “The long-term investment in such novel and exciting research [dbSNP] promises not only to advance human biology but to revolutionise the practice of modern medicine.”
Originally, dbSNP accepted submissions for any organism from a wide variety of sources including individual research laboratories, collaborative polymorphism discovery efforts, large-scale genome sequencing centers, other SNP databases (e.g. the SNP Consortium, HapMap, etc.), and private businesses. [5] On September 1, 2017, dbSNP stopped accepting non-human variant data submissions, and two months later its interactive websites and related NCBI services stopped presenting non-human variant data. dbSNP now accepts and presents only human variant data.
Every submitted variation receives a submitted SNP ID number (“ss#”). [5] This accession number is a stable and unique identifier for that submission. Unique submitted SNP records also receive a reference SNP ID number (“rs#”; "refSNP cluster"). However, more than one record of a variation will likely be submitted to dbSNP, especially for clinically relevant variations. To accommodate this, dbSNP routinely assembles identical submitted SNP records into a single reference SNP record, which is also a unique and stable identifier (see below). [4]
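The two identifier types described above can be told apart by their prefix. The following is a minimal sketch, assuming only that an identifier is the prefix `ss` or `rs` followed by digits; the names `SnpId` and `parse_snp_id` are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class SnpId(NamedTuple):
    """A dbSNP-style identifier split into its prefix and number (hypothetical type)."""
    kind: str    # "ss" = submitted SNP, "rs" = reference SNP cluster
    number: int


# Assumption: an identifier is "ss" or "rs" followed by one or more digits.
_ID_PATTERN = re.compile(r"^(ss|rs)(\d+)$", re.IGNORECASE)


def parse_snp_id(text: str) -> SnpId:
    """Parse strings such as 'rs206437' into a SnpId, or raise ValueError."""
    match = _ID_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not an ss#/rs# identifier: {text!r}")
    return SnpId(match.group(1).lower(), int(match.group(2)))
```

For example, `parse_snp_id("rs206437")` yields a record with `kind == "rs"` and `number == 206437`.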
To submit variations to dbSNP, one must first acquire a submitter handle, which identifies the laboratory responsible for the submission. [4] Next, the author is required to complete a submission file containing the relevant information and data. Submitted records must contain the ten essential pieces of information listed in the following table. [4] Other information required for submissions includes contact information, publication information (title, journal, authors, year), molecule type (genomic DNA, cDNA, mitochondrial DNA, chloroplast DNA), and organism. [4]
Element | Explanation |
---|---|
Sequence Context (Required) | An essential component of a submission to dbSNP is an unambiguous location for the variation being submitted. dbSNP now minimally requires that you submit variant location as an asserted position on RefSeq or INSDC sequences. |
Alleles (Required) | Alleles define each variation class. dbSNP defines single nucleotide variants in its submission scheme as G, A, T, or C, and does not permit ambiguous IUPAC codes, such as N, in the allele definition of a variation. |
Method (Required) | Each submitter defines the methods in their submission as either the techniques used to assay variation or the techniques used to estimate allele frequencies. dbSNP groups methods by method class to facilitate queries using general experimental technique as a query field. The submitter provides all other details of the techniques in a free-text description of the method. |
Asserted Allele Origin (Required) | A submitter can provide a statement (assertion) with supporting experimental evidence that a variant has a particular allelic origin. Assertions for a single refSNP are summarized and given an attribute value of germline or unknown. |
Population (Required) | Each submitter defines population samples either as the group used to initially identify variations or as the group used to identify population-specific measures of allele frequencies. These populations may be one and the same in some experimental designs. |
Sample Size (Optional) | There are two sample-size fields in dbSNP. One field, SNPASSAY SAMPLE SIZE, reports the number of chromosomes in the sample used to initially ascertain or discover the variation. The other sample size field, SNPPOPUSE SAMPLE SIZE, reports the number of chromosomes used as the denominator in computing estimates of allele frequencies. |
Population-specific Allele Frequencies (Optional) | Frequency data are submitted to dbSNP as allele counts or binned frequency intervals, depending on the precision of the experimental method used to make the measurement. dbSNP contains records of allele frequencies for specific population samples that are defined by each submitter and used in validating submitted variations. |
Population-specific Genotype Frequencies (Optional) | Similar to alleles, genotypes have frequencies in populations that can be submitted to dbSNP, and are used in validating submitted variations. |
Individual genotypes | dbSNP accepts individual genotypes from samples provided by donors that have consented to having their DNA sequence housed in a public database (e.g. HapMap or the 1000 Genomes project). |
Validation Information (Optional) | Assays validated directly by the submitter through the VALIDATION section show the type of evidence used to confirm the variation. |
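Two of the rules in the table lend themselves to a short illustration: single nucleotide alleles are restricted to G, A, T, or C, and allele frequencies are estimated from allele counts with a chromosome count as the denominator. The sketch below is illustrative only; the function names are hypothetical and it does not reproduce dbSNP's actual submission checks.

```python
# The four bases the table lists as permitted for single nucleotide variants.
ALLOWED_BASES = frozenset("GATC")


def validate_snv_alleles(alleles):
    """Return the alleles upper-cased; reject ambiguous IUPAC codes such as N."""
    cleaned = [allele.strip().upper() for allele in alleles]
    for allele in cleaned:
        if allele not in ALLOWED_BASES:
            raise ValueError(f"allele {allele!r} is not one of G, A, T, C")
    return cleaned


def allele_frequencies(allele_counts, sample_size):
    """Estimate frequencies from allele counts.

    allele_counts maps each allele to the number of chromosomes carrying it;
    sample_size is the number of chromosomes used as the denominator.
    """
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    if sum(allele_counts.values()) > sample_size:
        raise ValueError("allele counts exceed the sample size")
    return {allele: count / sample_size for allele, count in allele_counts.items()}
```

With 30 chromosomes carrying A and 70 carrying G in a sample of 100 chromosomes, the estimated frequencies are 0.3 and 0.7.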
New information obtained by dbSNP becomes available to the public periodically in a series of “builds” (i.e. revisions and releases of data). [4] There is no schedule for releasing new builds; instead, builds are usually released when a new genome build becomes available, assuming that the genome has some cataloged variation associated with it. [6] This occurs approximately every 3–4 months. Genome sequences may be improved over time, so reference SNPs (“refSNPs”) from previous builds, as well as newly submitted SNPs, are re-mapped to the newly available genome sequence. Multiple submitted SNPs that map to the same location are clustered into one refSNP cluster and assigned a reference SNP ID number. If two refSNP cluster records are found to map to the same location (i.e. are identical), dbSNP also merges those records. In this case, the smaller refSNP ID number (i.e. the earliest record) represents both records, and the larger refSNP ID number becomes obsolete. Obsolete refSNP ID numbers are not used again for new records. When a merger of two refSNP records occurs, the change is tracked, and the former refSNP ID numbers can still be used as a search query. This process of merging identical records reduces redundancy within dbSNP. [6]
There are two exceptions to the above merging criteria. First, variations of different classes (e.g. a SNP and a DIP) are not merged. Second, clinically important refSNPs that have been cited in the literature are termed “precious”; a merger that would eliminate such a refSNP is never performed, since it could later cause confusion. [6]
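The merging rule and its two exceptions can be sketched as follows. This is a simplified illustration, not dbSNP's actual pipeline: the `RefSnp` record, its fields, and the function names are hypothetical, and a location is assumed to be a plain (chromosome, position) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefSnp:
    """Hypothetical, simplified refSNP record."""
    rs_number: int
    location: tuple        # assumed form: (chromosome, position)
    var_class: str         # e.g. "snp" or "dip"
    precious: bool = False


def merge_refsnps(records):
    """Merge records sharing a location and class into the smallest rs number.

    Returns (survivors, history), where history maps each obsolete rs number
    to the rs number that now represents it.
    """
    groups = {}
    for record in records:
        # Exception 1: different variation classes are never grouped together.
        groups.setdefault((record.location, record.var_class), []).append(record)

    survivors, history = [], {}
    for group in groups.values():
        group.sort(key=lambda r: r.rs_number)
        keeper = group[0]                      # smallest (earliest) rs number
        survivors.append(keeper)
        for other in group[1:]:
            if other.precious:
                survivors.append(other)        # Exception 2: never eliminate
            else:
                history[other.rs_number] = keeper.rs_number
    survivors.sort(key=lambda r: r.rs_number)
    return survivors, history


def resolve(rs_number, history):
    """Follow the merge history so an obsolete rs number still finds its record."""
    while rs_number in history:
        rs_number = history[rs_number]
    return rs_number
```

Keeping the merge history as a mapping from obsolete to surviving numbers is one simple way to let a former refSNP ID number continue to work as a search query.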
The dbSNP can be searched using the Entrez SNP search tool. A variety of queries can be used for searching: an ss number ID, a refSNP number ID, a gene name, an experimental method, a population class, a population detail, a publication, a marker, an allele, a chromosome, a base position, a heterozygosity range, or a build number. [6] [7] In addition, many results can be retrieved simultaneously using batch queries. [6] Searches return refSNP number IDs that match the query term and a summary of the available information for that refSNP cluster.
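Searches like those described above can also be issued programmatically. The sketch below only builds a request URL for NCBI's E-utilities `esearch` endpoint with `db=snp`; it assumes the REST form of E-utilities, sends no request, and leaves the query term to the caller, since the exact field tags accepted by the SNP database are not given here. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public REST endpoint of NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term, retmax=20):
    """Build an esearch URL that queries the Entrez 'snp' database for a term."""
    params = {"db": "snp", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def build_batch_search_url(terms, retmax=20):
    """Combine several terms with OR so that one request covers all of them."""
    return build_snp_search_url(" OR ".join(terms), retmax=retmax)
```

Combining identifiers with `OR` is one way to approximate a batch query in a single request; dedicated batch facilities may behave differently.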
The information available for a refSNP cluster includes the basic information from each of the individual submissions (see “Submission”) as well as information available from combining the data from multiple submissions (e.g. heterozygosity, genotype frequencies). Many tools are available to examine a refSNP cluster in greater depth. Map view shows the position of the variation in the genome and other nearby variations. Another tool, gene view, reports the location of the variation within a gene (if it is in a gene), the old and new codon, the amino acids encoded by both, and whether the change is synonymous or non-synonymous. Sequence viewer shows the position of the variant in relation to introns, exons, and other distant and close variants. 3D structure mapping, which shows 3D images of the encoded protein, is also available.
The dbSNP is also linked to many other NCBI resources including the nucleotide, protein, gene, taxonomy and structure databases, as well as PubMed, UniSTS, PMC, OMIM, and UniGene.
The validation status lists the categories of evidence that support a variant. These include: (1) multiple independent submissions; (2) frequency or genotype data; (3) submitter confirmation; (4) observation of all alleles in at least two chromosomes; (5) genotyped by HapMap; and (6) sequenced in the 1000 Genomes Project. [6]
The quality of the data found on dbSNP has been questioned by many research groups, [8] [9] [10] [11] [12] [13] which suspect high false positive rates due to genotyping and base-calling errors. These mistakes can easily be entered into dbSNP if the submitter uses (1) uncritical bioinformatic alignments of highly similar but distinct DNA sequences, and/or (2) PCRs with primers that cannot discriminate between similar but distinct DNA sequences. [8] Mitchell et al. (2004) [9] reviewed four studies [10] [11] [12] [13] and concluded that dbSNP has a false positive rate of 15–17% for SNPs, and that the minor allele frequency is greater than 10% for approximately 80% of the SNPs that are not false positives. Similarly, Musumeci et al. (2010) [8] state that as many as 8.32% of the biallelic coding SNPs in dbSNP are artifacts of highly similar DNA sequences (i.e. paralogous genes), and refer to these entries as single nucleotide differences (SNDs). The high error rates in dbSNP may not be surprising: of the 23.7 million refSNP entries for humans, only 14.5 million have been validated, leaving the remaining 9.2 million as candidate SNPs. However, according to Musumeci et al. (2010), [8] even the validation code provided in the refSNP record is only partially useful: only HapMap validation reduced the proportion of SNDs (3% vs. 8%), but relying on this method alone removes more than half of the real SNPs in dbSNP. These authors also note that one source of submissions, the Lee group, is plagued with errors: 20% of its submissions are SNDs (vs. 8% for submissions overall). However, as the authors note, ignoring all of these submissions would remove many real SNPs.
Errors in dbSNP can hamper candidate gene association studies [14] and haplotype-based investigations. [15] Errors may also increase false conclusions in association studies: [8] testing false SNPs alongside true ones increases the number of hypothesis tests. Because these false SNPs cannot actually be associated with traits, the multiple-testing correction lowers the alpha level further than a rigorous test of the true SNPs alone would require, and the false negative rate increases. Musumeci et al. (2010) [8] suggested that authors of negative association studies inspect their previous studies for false SNPs (SNDs), which could be removed from analysis.
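The effect of false SNPs on the significance threshold can be illustrated with a small calculation. The sketch below assumes a Bonferroni correction, which the text does not name; the 16% false fraction in the usage note is only an example value inside the 15–17% range cited above, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction (assumed)."""
    if n_tests <= 0:
        raise ValueError("number of tests must be positive")
    return alpha / n_tests


def thresholds_with_false_snps(alpha, n_snps, false_fraction):
    """Compare thresholds when all SNPs are tested versus only the true ones.

    Returns (threshold_all, threshold_true_only).
    """
    n_true = round(n_snps * (1 - false_fraction))
    return bonferroni_threshold(alpha, n_snps), bonferroni_threshold(alpha, n_true)
```

For instance, `thresholds_with_false_snps(0.05, 1000, 0.16)` compares 0.05/1000 with 0.05/840. The first threshold is stricter, which is the sense in which testing false SNPs lowers the alpha level more than necessary.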
Individual sequences can be referred to by their refSNP cluster ID numbers (e.g. rs206437). dbSNP should be referenced using the 2001 Sherry et al. paper: Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29: 308-311. [5]