Accession number (bioinformatics)


An accession number, in bioinformatics, is a unique identifier given to a DNA or protein sequence record so that different versions of that record, and of the associated sequence, can be tracked over time within a single data repository. Because accession numbers are relatively stable, they can be used as foreign keys that refer to a sequence object, though not necessarily to a unique sequence. All sequence information repositories implement the concept of an "accession number", but they may do so with subtle variations.
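The distinction between a stable record and a changing sequence can be illustrated with a minimal sketch. Everything here is hypothetical: the `SequenceRecord` class, the `submit` helper, and the accession `DEMO0001` are invented for illustration and do not reflect any particular repository's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SequenceRecord:
    """A hypothetical record: one accession, many sequence versions."""
    accession: str
    versions: List[str] = field(default_factory=list)  # oldest first

    def latest(self) -> str:
        """Return the most recently submitted sequence."""
        return self.versions[-1]


# Hypothetical in-memory repository keyed by accession number.
records: Dict[str, SequenceRecord] = {}


def submit(accession: str, sequence: str) -> int:
    """Store a new version under an accession and return its version count."""
    record = records.setdefault(accession, SequenceRecord(accession))
    record.versions.append(sequence)
    return len(record.versions)


# "DEMO0001" is a made-up accession, not a real database identifier.
submit("DEMO0001", "ATGGCC")
submit("DEMO0001", "ATGGCA")  # a revised sequence for the same record

# An external annotation uses the accession as a foreign key. It points
# at the record, which may hold more than one sequence over time.
annotation = {"note": "example annotation", "accession": "DEMO0001"}
print(records[annotation["accession"]].latest())
```

In this sketch the annotation stays valid when the sequence is revised, which is why the accession alone identifies a record rather than one specific sequence.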

LRG

Locus Reference Genomic (LRG) records have unique accession numbers that start with LRG_ followed by a number. The Human Genome Variation Society Nomenclature guidelines recommend them as stable genomic reference sequences for reporting sequence variants in locus-specific databases (LSDBs) and in the literature.
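As a rough sketch, the LRG accession pattern described above can be checked with a regular expression. The assumption that the number is a plain run of decimal digits is ours, and the function names `is_lrg_accession` and `lrg_number` are hypothetical helpers, not part of any LRG tooling.

```python
import re

# Assumption: "a number" means one or more decimal digits after "LRG_".
LRG_PATTERN = re.compile(r"LRG_([0-9]+)")


def is_lrg_accession(text: str) -> bool:
    """Return True if the whole string looks like an LRG record accession."""
    return LRG_PATTERN.fullmatch(text) is not None


def lrg_number(text: str) -> int:
    """Extract the numeric part of an LRG accession, or raise ValueError."""
    match = LRG_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not an LRG accession: {text!r}")
    return int(match.group(1))


print(is_lrg_accession("LRG_1"))   # True
print(is_lrg_accession("lrg_1"))   # False: the prefix is case-sensitive here
print(lrg_number("LRG_292"))       # 292
```

Using `fullmatch` rather than `search` rejects strings that merely contain an LRG-like substring, which is the safer choice when the identifier is used as a key.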

Notes and references

  1. Amos Bairoch; Rolf Apweiler; Cathy H. Wu. "User Manual". UniProt Knowledgebase. Archived from the original on 21 September 2005. Retrieved 20 October 2005.
  2. This article incorporates public domain material from NCBI Handbook. National Center for Biotechnology Information.

Related Research Articles

Human genome: Complete set of nucleic acid sequences for humans

The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.

National Center for Biotechnology Information: Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.


In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query such as a DNA, RNA, or protein sequence or ‘keyword’ and search one or more databases for information related to that sequence. Summaries and aggregate results are provided in a standardized format, describing information that would otherwise have required visits to many smaller sites or direct literature searches to compile. Many sequence profiling tools are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases. These tools are available either as web-based services or as locally downloadable executables.

UniProt: Database of protein sequences and functional information

UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.


Accession number (cultural property): Object identifiers used in galleries, libraries, archives, and museums

In libraries, art galleries, museums and archives, an accession number is a unique identifier assigned to, and achieving initial control of, each acquisition. Assignment of accession numbers typically occurs at the point of accessioning or cataloging. The term is something of a misnomer, because the form accession numbers take is often alpha-numeric.

Research data archiving is the long-term storage of scholarly research data, including the natural sciences, social sciences, and life sciences. The various academic journals have differing policies regarding how much of their data and methods researchers are required to store in a public archive, and what is actually archived varies widely between different disciplines. Similarly, the major grant-giving institutions have varying attitudes towards public archival of data. In general, the tradition of science has been for publications to contain sufficient information to allow fellow researchers to replicate and therefore test the research. In recent years this approach has become increasingly strained as research in some areas depends on large datasets which cannot easily be replicated independently.

UniGene was an NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus. Information on protein similarities, gene expression, cDNA clones, and genomic location is included with each entry.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was first introduced in 2000. This database is built by the National Center for Biotechnology Information (NCBI), and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

dbSNP

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

A biorepository is a facility that collects, catalogs, and stores samples of biological material for laboratory research. Biorepositories collect and manage specimens from animals, plants, and other living organisms. Biorepositories store many different types of specimens, including samples of blood, urine, tissue, cells, DNA, RNA, and proteins. If the samples are from people, they may be stored with medical information along with written consent to use the samples in laboratory studies.

GeneCards is a database of human genes that provides genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes. It is being developed and maintained by the Crown Human Genome Center at the Weizmann Institute of Science, in collaboration with LifeMap Sciences.

Reference genome

A reference genome is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, the most recent human reference genome is derived from more than 60 genomic clone libraries. There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or UCSC Genome Browser.

Locus Reference Genomic (LRG) is a DNA sequence format that was developed to aid in curating locus-specific databases (LSDBs), which record DNA sequence variation that can result in inherited diseases. LRGs have fixed sequences that are independent of the genome so that they provide a stable framework for reporting variants. The LRG format uses Extensible Markup Language (XML) to provide highly structured single records containing the genomic DNA sequence for individual genes along with the mRNAs and proteins encoded by these genes. LRG records are recommended in the Human Genome Variation Society Nomenclature guidelines as reference sequences to report sequence variants in LSDBs and the literature.

Mutalyzer is a web-based software tool which was primarily developed to check the description of sequence variants identified in a gene during genetic testing. Mutalyzer applies the rules of the standard human sequence variant nomenclature and can correct descriptions accordingly. Apart from the sequence variant description, Mutalyzer requires a DNA sequence record containing the transcript and protein feature annotation as a reference. Mutalyzer 2 accepts GenBank and Locus Reference Genomic (LRG) records. The annotation is also used to apply the correct codon translation tables and generate DNA and protein variant descriptions for any organism. The Mutalyzer server supports programmatic access via a SOAP Web service described in the Web Services Description Language (WSDL) and an HTTP/RPC+JSON web service.

Variant Call Format: Text file format for genomic data

The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. Existing formats for genetic data, such as General feature format (GFF), stored all of the genetic data, much of which is redundant because it is shared across genomes. With the Variant Call Format, only the variations need to be stored, along with a reference genome.

Domesticated species and the human populations that domesticate them are typified by a mutualistic relationship of interdependence, in which humans have over thousands of years modified the genomics of domesticated species. Genomics is the study of the structure, content, and evolution of genomes, or the entire genetic information of organisms. Domestication is the process by which humans alter the morphology and genes of targeted organisms by selecting for desirable traits. These genomic changes produce the domestication syndromes.