RefSeq

Content
Description: Curated non-redundant sequence database of genomes
Contact
Research center: National Center for Biotechnology Information
Primary citation: Pruitt KD et al. (2005) [1]
Access
Website: https://www.ncbi.nlm.nih.gov/refseq/

The Reference Sequence (RefSeq) database [1] is an open-access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. RefSeq was introduced in 2000. [2] [3] The database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes.

Contents

For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data are available (121,461 distinct "named" organisms as of July 2022), [4] while GenBank includes sequences for any organism submitted (approximately 504,000 formally described species). [5]

RefSeq categories

The RefSeq collection comprises different data types with different origins, so standard categories and identifiers are needed to store each data type. The most important categories are:

RefSeq accession categories and molecule types

Category  Description
NC        Complete genomic molecules
NG        Incomplete genomic region
NM        mRNA
NR        ncRNA
NP        Protein
XM        Predicted mRNA model
XR        Predicted ncRNA model
XP        Predicted protein model (eukaryotic sequences)
WP        Predicted protein model (prokaryotic sequences)

For more details and more categories, see Table 1 in Chapter 18 of the book The Reference Sequence (RefSeq) Database.
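
As a minimal sketch, the category table above can be turned into a lookup that classifies an accession string by its prefix. The accession layout assumed here (a two-letter prefix, an underscore, digits, and an optional ".version" suffix) is not spelled out in this article, and the function name `describe_accession` and the example accession strings are purely illustrative.

```python
import re

# Prefix-to-description map copied from the category table above.
REFSEQ_PREFIXES = {
    "NC": "Complete genomic molecule",
    "NG": "Incomplete genomic region",
    "NM": "mRNA",
    "NR": "ncRNA",
    "NP": "Protein",
    "XM": "Predicted mRNA model",
    "XR": "Predicted ncRNA model",
    "XP": "Predicted protein model (eukaryotic sequences)",
    "WP": "Predicted protein model (prokaryotic sequences)",
}

# Assumed layout: two uppercase letters, "_", digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_accession(accession):
    """Return prefix, description and version for a RefSeq-style accession.

    Returns None if the string does not match the assumed layout or the
    prefix is not one of the categories listed in the table.
    """
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, _number, version = match.groups()
    description = REFSEQ_PREFIXES.get(prefix)
    if description is None:
        return None
    return {
        "prefix": prefix,
        "description": description,
        "version": int(version) if version is not None else None,
    }


if __name__ == "__main__":
    # Illustrative accession strings only.
    for acc in ("NM_000546.6", "WP_000000001", "AB_123"):
        print(acc, "->", describe_accession(acc))
```

Only the nine prefixes from the table are covered, so an accession from one of the additional categories mentioned above would return `None` rather than a description.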

RefSeq Projects

Several projects to improve RefSeq services are currently in development by the NCBI, often in collaboration with research centers such as EMBL-EBI.

Statistics

According to RefSeq release 213 (July 2022), the number of species represented in the database, counted as distinct taxonomic IDs, is as follows: [4]

Category                Species (distinct taxonomic IDs)
Archaea                 1443
Bacteria                69122
Fungi                   16869
Invertebrate            5715
Mitochondrion           13648
Plant                   9177
Plasmid                 6073
Plastid                 9430
Protozoa                746
Vertebrate (mammalian)  1509
Viral                   11620
Vertebrate (other)      5237
Other                   4
Complete                121461

The counts of accessions and base pairs (or residues, for proteins) per molecule type are: [4]

Molecule type  Accessions   Base pairs/residues
Genomic        40,758,769   2.923212393984×10^12
RNA            45,781,716   1.22253022047×10^11
Protein        234,520,053  9.129062394×10^10
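
To give a feel for the scale of these totals, the short sketch below divides the total length by the number of accessions for each molecule type. The figures are copied from the release 213 table above and written out as integers; the helper name `mean_length` is an assumption made for illustration, and the result is only an average over all records of a type, not a statistic reported by RefSeq itself.

```python
# (accessions, total base pairs or residues) per molecule type,
# copied from the release 213 table above.
RELEASE_213 = {
    "Genomic": (40_758_769, 2_923_212_393_984),
    "RNA": (45_781_716, 122_253_022_047),
    "Protein": (234_520_053, 91_290_623_940),
}


def mean_length(accessions, total_length):
    """Average number of base pairs (or residues) per accession."""
    return total_length / accessions


for molecule, (accessions, total_length) in RELEASE_213.items():
    average = mean_length(accessions, total_length)
    print(f"{molecule}: about {average:,.0f} per accession")
```

Because genomic records range from short regions to whole chromosomes, a plain mean like this hides a very wide spread and should be read as an order-of-magnitude indication only.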

References

  1. Pruitt KD, Tatusova T, Maglott DR (January 2005). "NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins". Nucleic Acids Research. 33 (Database issue): D501–D504. doi:10.1093/nar/gki025. PMC 539979. PMID 15608248.
  2. Maglott DR, Katz KS, Sicotte H, Pruitt KD (January 2000). "NCBI's LocusLink and RefSeq". Nucleic Acids Research. 28 (1): 126–128. doi:10.1093/nar/28.1.126. PMC 102393. PMID 10592200.
  3. Pruitt KD, Katz KS, Sicotte H, Maglott DR (January 2000). "Introducing RefSeq and LocusLink: curated human genome resources at the NCBI". Trends in Genetics. 16 (1): 44–47. doi:10.1016/s0168-9525(99)01882-x. PMID 10637631.
  4. RefSeq Release 213 Statistics (Report). National Library of Medicine. 11 July 2022. Retrieved 20 July 2022.
  5. Sayers EW, Cavanaugh M, Clark K, Pruitt KD, Schoch CL, Sherry ST, Karsch-Mizrachi I (January 2022). "GenBank". Nucleic Acids Research. 50 (D1): D161–D164. doi:10.1093/nar/gkab1135. PMC 8690257. PMID 34850943.
  6. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, et al. (July 2009). "The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes". Genome Research. 19 (7): 1316–1323. doi:10.1101/gr.080531.108. PMC 2704439. PMID 19498102.
  7. Pujar S, O'Leary NA, Farrell CM, Loveland JE, Mudge JM, Wallin C, et al. (January 2018). "Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation". Nucleic Acids Research. 46 (D1): D221–D228. doi:10.1093/nar/gkx1031. PMC 5753299. PMID 29126148.
  8. Farrell CM, Goldfarb T, Rangwala SH, Astashyn A, Ermolaeva OD, Hem V, et al. (January 2022). "RefSeq Functional Elements as experimentally assayed nongenic reference standards and functional interactions in human and mouse". Genome Research. 32 (1): 175–188. doi:10.1101/gr.275819.121. PMC 8744684. PMID 34876495.
  9. Gulley ML, Braziel RM, Halling KC, Hsi ED, Kant JA, Nikiforova MN, et al. (June 2007). "Clinical laboratory reports in molecular pathology". Archives of Pathology & Laboratory Medicine. 131 (6): 852–863. doi:10.5858/2007-131-852-CLRIMP. PMID 17550311.
  10. "NCBI RefSeq Targeted Loci Project". www.ncbi.nlm.nih.gov. Retrieved 2022-07-27.
  11. Hatcher EL, Zhdanov SA, Bao Y, Blinkova O, Nawrocki EP, Ostapchuck Y, et al. (January 2017). "Virus Variation Resource - improved response to emergent viral outbreaks". Nucleic Acids Research. 45 (D1): D482–D490. doi:10.1093/nar/gkw1065. PMC 5210549. PMID 27899678.
  12. "NCBI RefSeq Select". www.ncbi.nlm.nih.gov. Retrieved 2022-07-27.
  13. Morales J, Pujar S, Loveland JE, Astashyn A, Bennett R, Berry A, et al. (April 2022). "A joint NCBI and EMBL-EBI transcript set for clinical genomics and research". Nature. 604 (7905): 310–315. Bibcode:2022Natur.604..310M. doi:10.1038/s41586-022-04558-8. PMC 9007741. PMID 35388217.