European Nucleotide Archive

European Nucleotide Archive (ENA)
Content
Description: Comprehensive archive of nucleotide sequences, annotations and associated data.
Data types captured: Nucleotide sequence, functional annotation, sequencing reads and sequencer information, sample details, other related records.
Organisms: All
Contact
Research center: European Bioinformatics Institute
Laboratory: PANDA Group
Primary citation: PMID 20972220
Release date: April 1982
Access
Data format: XML, FASTQ, EMBL-Bank format
Website: ENA
Download URL: ENA download
Web service URL: ENA browser
Tools
Standalone: CRAM toolkit
Miscellaneous
License: Unrestricted

The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. [1] The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database (also known as EMBL-bank). [2] The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.

The ENA grew out of the EMBL Data Library, which was released in 1982 as the first internationally supported resource for nucleotide sequence data. [3] As of early 2012, the ENA and the other INSDC member databases each contained complete genomes of 5,682 organisms and sequence data for almost 700,000 organisms. [4] Moreover, the volume of data is increasing exponentially, with a doubling time of approximately 10 months. [5]

History

The European Nucleotide Archive originated from separate databases, the earliest of which was the EMBL Data Library, established in October 1980 at the European Molecular Biology Laboratory (EMBL), Heidelberg. [3] The first release of this database was made in April 1982 and contained a total of 568 separate entries consisting of around 500,000 base pairs. [6] In 1984, referring to the EMBL Data Library, Kneale and Kennard remarked that "it was clear some years ago that a large computerized database of sequences would be essential for research in Molecular Biology". [6]

[Figure: Nucleotide sequence data in book form.]

Although magnetic tape was the primary distribution method at the time, by 1987 the EMBL Data Library was being used by an estimated 10,000 scientists internationally. [7] The same year, the EMBL File Server was introduced to serve database records over BITNET, EARN and the early Internet. [8] In May 1988 the journal Nucleic Acids Research introduced a policy stating that "manuscripts submitted to [Nucleic Acids Research] and containing or discussing sequence data must be accompanied by evidence that the data have been deposited with the EMBL Data Library." [9]

[Figure: The EBI at the Wellcome Trust Genome Campus in Hinxton, UK, which hosts the European Nucleotide Archive.]

During the 1990s the EMBL Data Library was renamed the EMBL Nucleotide Sequence Database [10] and was formally relocated to the European Bioinformatics Institute (EBI) from Heidelberg. [11] In 2003, the Nucleotide Sequence Database was extended with the addition of the Sequence Version Archive (SVA), which maintains records of all current and previous entries in the database. [1] A year later in June 2004, limits on the maximum sequence length for each record (then 350 kilobases) were removed, allowing entire genome sequences to be stored as a single database entry. [12]

Following the uptake of Sanger sequencing, the Wellcome Trust Sanger Institute (then known as The Sanger Centre) had begun cataloguing sequence reads along with quality information in a database called The Trace Archive. [13] The Trace Archive grew substantially with the commercialisation of high-throughput parallel sequencing technologies by companies such as Roche and Illumina. [14] In 2008, the EBI combined the Trace Archive, EMBL Nucleotide Sequence Database (now also known as EMBL-Bank) [2] and a newly developed Sequence (or Short) Read Archive (SRA) to make up the ENA, aimed at providing a comprehensive nucleotide sequence archive. [13] As a member of the International Nucleotide Sequence Database Collaboration, the ENA exchanges data submissions each day with both the DNA Data Bank of Japan and GenBank. [15]

EMBL Nucleotide Sequence Database

[Figure: The EMBL Nucleotide Sequence Database (EMBL-Bank) has increased in size from around 600 entries in 1982 to over 2.5×10⁸ by December 2012.]

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is the section of the ENA which contains high-level genome assembly details, as well as assembled sequences and their functional annotation. [12] [17] EMBL-Bank receives data through direct submission from genome consortia and smaller research groups, and through the retrieval of sequence data associated with patent applications. [2] [18]

As of release 114 (December 2012), the EMBL Nucleotide Sequence Database contains approximately 5×10¹¹ nucleotides with an uncompressed file size of 1.6 terabytes. [16]

Data classes

The EMBL Nucleotide Sequence Database supports a variety of data classes derived from different sources. [19]

EMBL-Bank format

The EMBL Nucleotide Sequence Database represents and stores data in a plain-text flat-file format, typically referred to as EMBL-Bank format. [20] EMBL-Bank format uses a different syntax from the records in DDBJ and GenBank, though each format uses certain standardised nomenclature, such as taxonomies as defined by the NCBI Taxonomy database. Each line of an EMBL-format file begins with a two-letter code, such as AC to label the accession number and KW for a list of keywords relevant to the record; each record ends with //. [20]
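
The line-code layout described above can be illustrated with a minimal Python sketch. It is not the ENA's own parser: the function names parse_embl and split_items are hypothetical, the sample record and its accession ZZ999999 are invented for illustration, and the sketch assumes that the code occupies the first two columns of each line, that XX marks a spacer line, and that sequence data lines start with blanks.

```python
# Hypothetical sample record; the accession and keywords are made up.
SAMPLE = """\
ID   ZZ999999; SV 1; linear; mRNA; STD; PLN; 12 BP.
XX
AC   ZZ999999;
XX
KW   example keyword; another keyword.
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                    12
//
"""


def parse_embl(text):
    """Group the lines of each record by their two-letter line code."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.startswith("//"):
            # "//" terminates the current record.
            records.append(current)
            current = {}
            continue
        code = line[:2]
        if not code.strip() or code == "XX":
            # Skip blank lines, sequence data lines and XX spacer lines.
            continue
        current.setdefault(code, []).append(line[2:].strip())
    return records


def split_items(values):
    """Split semicolon-separated values (as on AC or KW lines) into a list."""
    joined = " ".join(values).rstrip(".")
    return [item.strip() for item in joined.split(";") if item.strip()]


records = parse_embl(SAMPLE)
print(split_items(records[0]["AC"]))
print(split_items(records[0]["KW"]))
```

Grouping lines by code keeps the sketch independent of which codes a given record actually contains; a production parser would also need to handle feature tables and the sequence block, which are skipped here.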

Sequence Read Archive

[Figure: The SRA has grown rapidly since 2008. As of 2011, most SRA sequence data was produced by Illumina's Genome Analyzer.]

The ENA operates an instance of the Sequence Read Archive (SRA), an archival repository of sequence reads and analyses which are intended for public release. [23] Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads. [24] Currently, the archive accepts sequence reads generated by next-generation sequencing platforms such as the Illumina Genome Analyzer and ABI SOLiD as well as some corresponding analyses and alignments. [25] The SRA operates under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC) [23] and is the fastest-growing repository in the ENA. [14]

In 2010 the Sequence Read Archive made up approximately 95% of the base pair data available through the ENA, [13] encompassing over 500,000,000,000 sequence reads made up of over 60 trillion (6×10¹³) base pairs. [23] Almost half of this data was deposited in relation to the 1000 Genomes Project [23] wherein the researchers published their sequence data to the SRA in real-time. [26] In total, as of September 2010, 65% of the Sequence Read Archive was human genomic sequence, with another 16% relating to human metagenome sequence reads. [23]

The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. [23] Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ. [22]
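
For readers unfamiliar with FASTQ, the following Python sketch shows what a consumer of converted reads might do with such a file. It does not use the SRA Toolkit; the function name read_fastq and the sample read are hypothetical, and the sketch assumes the common four-line record layout (an "@" header, the sequence, a "+" separator and a quality string) with Phred+33 quality encoding.

```python
# Hypothetical single-read FASTQ sample.
SAMPLE_FASTQ = "@read1\nACGT\n+\nII!I\n"


def read_fastq(lines):
    """Yield (read_id, sequence, qualities) from four-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumed Phred+33 encoding: quality score = ASCII code minus 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


for read_id, sequence, qualities in read_fastq(SAMPLE_FASTQ.splitlines()):
    print(read_id, sequence, qualities)
```

The sketch loads all lines into memory for simplicity; real read files are large enough that a streaming reader would be preferable.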

Data access

[Figure: Screenshot of the ENA browser web interface, showing an HTML record.]

The data contained in the ENA can be accessed manually or programmatically via REST URL through the ENA browser. Initially limited to the Sequence Read Archive, [14] the ENA browser now also provides access to the Trace Archive and EMBL-Bank, allowing file retrieval in a range of formats including XML, HTML, FASTA and FASTQ. [13] Individual records can be accessed using their accession numbers and other text queries are enabled through the EB-eye search engine. [13] Additionally, sequence similarity-based searches implemented using De Bruijn graphs offer another method of retrieving records from the ENA. [14]
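
Programmatic access by accession number can be sketched in Python as follows. The base URL, the path layout and the set of format names below are assumptions made for illustration and are not taken from the sources cited here; the function name record_url and the accession ZZ999999 are likewise hypothetical. The current ENA documentation should be consulted before relying on any of them.

```python
from urllib.parse import quote

# Assumed base URL and path layout; verify against the ENA documentation.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"
# Assumed subset of the formats the browser can return.
SUPPORTED_FORMATS = {"xml", "fasta", "embl"}


def record_url(accession, fmt="fasta"):
    """Build a REST-style URL for one record, identified by its accession."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    # Percent-encode the accession so it is safe to embed in a URL path.
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession, safe='')}"


print(record_url("ZZ999999", "xml"))
```

The sketch only builds the URL and makes no network request; the resulting string could then be passed to an HTTP client such as urllib.request.urlopen.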

The ENA is accessible via the EBI SOAP and REST APIs, which also offer access to other databases hosted at the EBI, such as Ensembl and InterPro. [27]

Storage

The European Nucleotide Archive handles large volumes of data which pose a significant storage challenge. [5] [28] As of 2012, the ENA's storage requirements continue to grow exponentially, with a doubling time of approximately 10 months. [5] To manage this increase, the ENA selectively discards less-valuable sequencing platform data and implements advanced compression strategies. [23] [29] The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements. [5] [30]
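
The effect of a 10-month doubling time can be made concrete with a short Python sketch. It assumes idealised exponential growth, size(t) = size(0) × 2^(t / 10) with t in months; the function name projected_size is hypothetical and the starting size of 1.0 is an arbitrary unit, not a measured ENA figure.

```python
def projected_size(initial, months, doubling_time=10.0):
    """Project storage size under exponential growth with a fixed doubling time."""
    return initial * 2 ** (months / doubling_time)


# Relative size after 10, 12 and 30 months, starting from 1.0 (arbitrary unit).
for months in (10, 12, 30):
    print(months, round(projected_size(1.0, months), 2))
```

Under this assumption, storage requirements grow by a factor of roughly 2.3 each year and eightfold over 30 months.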

Funding

Currently the ENA is funded jointly by the European Molecular Biology Laboratory, the European Commission and the Wellcome Trust. [13] The emerging ELIXIR framework, coordinated by EBI director Janet Thornton, aims to secure a sustainable European funding infrastructure to support the continued availability of life science databases such as the ENA. [29] [31] [32]

References

  1. Cochrane, G.; Akhtar, R.; Aldebert, P.; Althorpe, N.; Baldwin, A.; Bates, K.; Bhattacharyya, S.; Bonfield, J.; Bower, L. (2007). "Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database". Nucleic Acids Research. 36 (Database): D5–D12. doi:10.1093/nar/gkm1018. ISSN 0305-1048. PMC 2238915. PMID 18039715.
  2. EMBL-EBI. "EMBL Nucleotide Sequence Database". Retrieved 2013-01-08.
  3. Hamm, G. H.; Cameron, G. N. (1986). "The EMBL data library". Nucleic Acids Research. 14 (1): 5–9. doi:10.1093/nar/14.1.5. PMC 339348. PMID 3945550.
  4. Cochrane, Guy; Cook, Charles E; Birney, Ewan (2012). "The future of DNA sequence archiving". GigaScience. 1 (1): 2. doi:10.1186/2047-217X-1-2. ISSN 2047-217X. PMC 3617450. PMID 23587147.
  5. Cochrane, G.; Alako, B.; Amid, C.; Bower, L.; Cerdeno-Tarraga, A.; Cleland, I.; Gibson, R.; Goodgame, N.; Jang, M. (2012). "Facing growth in the European Nucleotide Archive". Nucleic Acids Research. 41 (D1): D30–D35. doi:10.1093/nar/gks1175. ISSN 0305-1048. PMC 3531187. PMID 23203883.
  6. Kneale, G.; Kennard, O. (1984). "The EMBL nucleotide sequence data library". Biochemical Society Transactions. 12 (6): 1011–1014. doi:10.1042/bst0121011. PMID 6530028.
  7. Cameron, G. N. (1988). "The EMBL data library". Nucleic Acids Research. 16 (5): 1865–1867. doi:10.1093/nar/16.5.1865. PMC 338182. PMID 3353226.
  8. Fuchs, R.; Stoehr, P.; Rice, P.; Omond, R.; Cameron, G. (1990). "New services of the EMBL Data Library". Nucleic Acids Research. 18 (15): 4319–4323. doi:10.1093/nar/18.15.4319. PMC 331247. PMID 2388823.
  9. Kahn, P.; Hazledine, D. (1988). "NAR's new requirement for data submission to the EMBL data library: Information for authors". Nucleic Acids Research. 16 (10): I–IV. PMC 336623. PMID 16617480.
  10. "What is the European Nucleotide Archive?". EMBL-EBI. Retrieved 2013-01-06.
  11. Rodriguez-Tomé, P.; Stoehr, P. J.; Cameron, G. N.; Flores, T. P. (1996). "The European Bioinformatics Institute (EBI) databases". Nucleic Acids Research. 24 (1): 6–12. doi:10.1093/nar/24.1.6. PMC 145572. PMID 8594602.
  12. Stoesser, G.; Baker, W.; Van Den Broek, A.; Garcia-Pastor, M.; Kanz, C.; Kulikova, T.; Leinonen, R.; Lin, Q.; Lombard, V. (2003). "The EMBL Nucleotide Sequence Database: major new developments". Nucleic Acids Research. 31 (1): 17–22. doi:10.1093/nar/gkg021. ISSN 1362-4962. PMC 165468. PMID 12519939.
  13. Leinonen R, Akhtar R, Birney E, et al. (January 2011). "The European Nucleotide Archive". Nucleic Acids Res. 39 (Database issue): D28–31. doi:10.1093/nar/gkq967. PMC 3013801. PMID 20972220.
  14. Leinonen, R.; Akhtar, R.; Birney, E.; Bonfield, J.; Bower, L.; Corbett, M.; Cheng, Y.; Demiralp, F.; Faruque, N. (2009). "Improvements to services at the European Nucleotide Archive". Nucleic Acids Research. 38 (Database): D39–D45. doi:10.1093/nar/gkp998. ISSN 0305-1048. PMC 2808951. PMID 19906712.
  15. EMBL-EBI. "About the European Nucleotide Archive". Retrieved 2013-01-07.
  16. "EMBL Nucleotide Sequence Database: Release Notes". EMBL-Bank Release Notes 114. EMBL-EBI. Dec 2012. Archived from the original on 2013-01-02. Retrieved 2013-01-07.
  17. Amid, C.; Birney, E.; Bower, L.; Cerdeno-Tarraga, A.; Cheng, Y.; Cleland, I.; Faruque, N.; Gibson, R.; Goodgame, N. (2011). "Major submissions tool developments at the European nucleotide archive". Nucleic Acids Research. 40 (D1): D43–D47. doi:10.1093/nar/gkr946. ISSN 0305-1048. PMC 3245037. PMID 22080548.
  18. Stoesser, G.; Baker, W.; Van Den Broek, A.; Camon, E.; Garcia-Pastor, M.; Kanz, C.; Kulikova, T.; Leinonen, R.; Lin, Q. (2002). "The EMBL Nucleotide Sequence Database". Nucleic Acids Research. 30 (1): 21–26. doi:10.1093/nar/30.1.21. ISSN 1362-4962. PMC 99098. PMID 11752244.
  19. "EMBL-Bank data classes". EMBL-EBI. 2012. Retrieved 2013-01-08.
  20. "EMBL-Bank User Manual (Release 129)" (Plaintext). EMBL-EBI. Sep 2016. Retrieved 2016-11-03.
  21. "NCBI SRA Overview". NCBI. 1 Jan 2013. Archived from the original on February 8, 2013. Retrieved 2013-01-08.
  22. Kodama, Y.; Shumway, M.; Leinonen, R. (2011). "The sequence read archive: explosive growth of sequencing data". Nucleic Acids Research. 40 (D1): D54–D56. doi:10.1093/nar/gkr854. ISSN 0305-1048. PMC 3245110. PMID 22009675.
  23. Leinonen R, Sugawara H, Shumway M (January 2011). "The sequence read archive". Nucleic Acids Res. 39 (Database issue): D19–21. doi:10.1093/nar/gkq1019. PMC 3013647. PMID 21062823.
  24. Ostell, Jim (2009). "NCBI's Sequence Read Archive: A Core Enabling Infrastructure". Bio IT World. Retrieved 2013-01-08.
  25. "About the NCBI Sequence Read Archive". NCBI. 8 Jan 2013. Archived from the original on 19 April 2013. Retrieved 2013-01-10.
  26. Shumway, M.; Cochrane, G.; Sugawara, H. (2009). "Archiving next generation sequencing data". Nucleic Acids Research. 38 (Database): D870–D871. doi:10.1093/nar/gkp1078. ISSN 0305-1048. PMC 2808927. PMID 19965774.
  27. Mcwilliam, H.; Valentin, F.; Goujon, M.; Li, W.; Narayanasamy, M.; Martin, J.; Miyar, T.; Lopez, R. (2009). "Web services at the European Bioinformatics Institute-2009". Nucleic Acids Research. 37 (Web Server): W6–W10. doi:10.1093/nar/gkp302. ISSN 0305-1048. PMC 2703973. PMID 19435877.
  28. Cochrane, G.; Akhtar, R.; Bonfield, J.; Bower, L.; Demiralp, F.; Faruque, N.; Gibson, R.; Hoad, G.; Hubbard, T. (2009). "Petabyte-scale innovations at the European Nucleotide Archive". Nucleic Acids Research. 37 (Database): D19–D25. doi:10.1093/nar/gkn765. ISSN 0305-1048. PMC 2686451. PMID 18978013.
  29. "EMBL-EBI will continue to support the Sequence Read Archive for raw data" (PDF). Press Release. EMBL-EBI. 16 Feb 2011. Archived from the original (PDF) on 15 May 2011. Retrieved 2013-01-07.
  30. Hsi-Yang Fritz, M.; Leinonen, R.; Cochrane, G.; Birney, E. (2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research. 21 (5): 734–740. doi:10.1101/gr.114819.110. ISSN 1088-9051. PMC 3083090. PMID 21245279.
  31. "About ELIXIR". ELIXIR. Retrieved 2013-01-09.
  32. Crosswell, Lindsey C.; Thornton, Janet M. (2012). "ELIXIR: a distributed infrastructure for European biological data". Trends in Biotechnology. 30 (5): 241–242. doi:10.1016/j.tibtech.2012.02.002. ISSN 0167-7799. PMID 22417641.