Sequence Read Archive

Content
  Description: FASTQ sequences; BAM data
  Organisms: all
Contact
  Research center: National Center for Biotechnology Information; European Bioinformatics Institute; DNA Data Bank of Japan
Access
  Website: www.ncbi.nlm.nih.gov/sra/; www.ebi.ac.uk/ena/; trace.ddbj.nig.ac.jp/dra/index_e.html

The Sequence Read Archive (SRA, previously known as the Short Read Archive) is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. [1] The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

The archive was established by the National Center for Biotechnology Information (NCBI) in 2007 to provide a repository for data produced by RNA-Seq and ChIP-Seq studies, as well as by large-scale studies including the Human Microbiome Project and the 1000 Genomes Project. [1] [2] Originally called the Short Read Archive, it was renamed in anticipation of future sequencing technologies being able to produce longer sequence reads. [3]

Figure: History (and predicted future) size of the Sequence Read Archive. The SRA has grown rapidly since 2008. As of 2011, most SRA sequence data was produced by Illumina's Genome Analyzer.

The volume of data deposited in the Sequence Read Archive has grown rapidly. As of September 2010, 65% of the SRA was human genomic sequence, with another 16% relating to human metagenome sequence reads. [6] Much of this data was deposited through the 1000 Genomes Project. In June 2011, the data held in the SRA passed 100 terabases of DNA sequence. [2]

The preferred format for files submitted to the SRA is BAM, which can store both aligned and unaligned reads. [6] Internally, the SRA relies on the NCBI SRA Toolkit, which is used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ. [5]
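
As an illustration of the FASTQ format mentioned above, the following is a minimal Python sketch of a reader for plain four-line FASTQ records. It is not part of the SRA Toolkit and does not use its API. The names `FastqRecord`, `parse_fastq` and `phred_scores`, the example reads, and the Phred+33 quality offset are assumptions made for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, bases, and per-base quality string."""
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    Assumes each record occupies exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical two-read example, not real SRA data.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n")
records = list(parse_fastq(example))
for rec in records:
    print(rec.read_id, len(rec.sequence), phred_scores(rec.quality))
```

Files downloaded from the archive can be very large, so a generator that yields one record at a time, as here, avoids loading an entire file into memory.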

In February 2011, NCBI announced plans to close its SRA because of reduced funding. [2] [7] However, EBI and DDBJ announced that they would continue to support the SRA. [8] In October 2011, NCBI announced that funding for the SRA would continue. [2]

Deposition of data in the SRA is mandated by most funding agencies and open access journals. Nature Publishing Group journals require that DNA and RNA sequencing data be made available through the SRA. [9]

References

  1. Wheeler, DL; Barrett, T; Benson, DA; Bryant, SH; Canese, K; Chetvernin, V; Church, DM; Dicuccio, M; Edgar, R; Federhen, S; Feolo, M; Geer, LY; Helmberg, W; Kapustin, Y; Khovayko, O; Landsman, D; Lipman, DJ; Madden, TL; Maglott, DR; Miller, V; Ostell, J; Pruitt, KD; Schuler, GD; Shumway, M; Sequeira, E; Sherry, ST; Sirotkin, K; Souvorov, A; Starchenko, G; Tatusov, RL; Tatusova, TA; Wagner, L; Yaschenko, E (Jan 2008). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research. 36 (Database issue): D13–21. doi:10.1093/nar/gkm1000. PMC 2238880. PMID 18045790.
  2. Galperin, M. Y.; Fernandez-Suarez, X. M. (5 December 2011). "The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection". Nucleic Acids Research. 40 (D1): D1–D8. doi:10.1093/nar/gkr1196. PMC 3245068. PMID 22144685.
  3. Ostell, Jim (2009). "NCBI's Sequence Read Archive: A Core Enabling Infrastructure". Bio IT World. Retrieved 2013-01-08.
  4. "NCBI SRA Overview". NCBI. 1 Jan 2013. Retrieved 2013-01-08.
  5. Kodama, Y.; Shumway, M.; Leinonen, R. (2011). "The sequence read archive: explosive growth of sequencing data". Nucleic Acids Research. 40 (D1): D54–D56. doi:10.1093/nar/gkr854. ISSN 0305-1048. PMC 3245110. PMID 22009675.
  6. Leinonen R; Sugawara H; Shumway M (January 2011). "The sequence read archive". Nucleic Acids Research. 39 (Database issue): D19–21. doi:10.1093/nar/gkq1019. PMC 3013647. PMID 21062823.
  7. GB Editorial Team (Mar 22, 2011). "Closure of the NCBI SRA and implications for the long-term future of genomics data storage". Genome Biology. 12 (3): 402. doi:10.1186/gb-2011-12-3-402. PMC 3129670. PMID 21418618.
  8. "DDBJ will continue Sequence Raw Data Archiving". www.ddbj.nig.ac.jp. Retrieved 2 September 2014.
  9. "Availability of data and materials : authors and referees @ npg". www.nature.com. Retrieved 2 September 2014.