Content | |
---|---|
Description | A database containing aggregated information pertaining to reference samples and samples stored in the European Bioinformatics Institute assay databases. |
Data types captured | Biological sample metadata |
Organisms | All |
Contact | |
Research center | European Bioinformatics Institute. |
Authors | Mikhail Gostev |
Primary citation | Gostev & al. (2012) [1] |
Release date | 2011 |
Access | |
Data format | XML, RDF |
Website | EBI page, NCBI page |
Download URL | EBI FTP |
Web service URL | REST |
Sparql endpoint | BioSD Sparql |
Tools | |
Web | Sample display, advanced search by samples and groups, sorting by columns, links to assay database record |
Miscellaneous | |
License | Unrestricted |
Versioning | Yes |
Data release frequency | Daily |
Curation policy | Yes (manual) |
Bookmarkable entities | Yes - samples and sample groups |
BioSamples (BioSD) is a database at the European Bioinformatics Institute that stores information about the biological samples used in sequencing. [1]
It stores submitter-supplied metadata about the biological materials from which data stored in the National Center for Biotechnology Information’s (NCBI) primary data archives are derived. NCBI’s archives host data pertaining to diverse types of samples from many species, and as such the BioSample database is similarly diverse. Examples of a BioSample include a primary tissue biopsy, an individual organism or an environmental isolate.
The BioSamples database captures sample metadata in a structured way by encouraging the use of controlled vocabularies for sample attribute field names. This metadata is key to giving the sample data context: it allows the data to be more fully understood and reused, and it enables aggregation of disparate data sets.
Sample metadata is linked to relevant experimental data across many archival databases, which relieves submitter burden by enabling one-time submission of a sample description. Submitters can then reference that sample, when necessary, when making data deposits to other archives.
BioSample records are indexed and searchable, supporting cross-database queries by sample description.
The BioSamples database was launched in 2011 to help aggregate and standardise sample metadata. Historically, each archive had created its own convention for sample metadata collection. These conventions were usually limited in their standardisation and had no way to indicate when a sample was used across multiple data sets. In addition, there is a growing awareness amongst the research community that sample metadata is vital for understanding the underlying data. Further, the chances for re-use, aggregation and integration of data increase with improved metadata. The database was initially populated with existing descriptions extracted from SRA, EST, GSS and dbGaP. [2] As of May 2013, the database hosted almost 2 million BioSample records encompassing 18,000 species. [3]
The BioSamples database has more than doubled in size since January 2012, when it described 1 million samples. As of October 2013, 2,846,137 samples are available in 80,232 groups. [4] The rapid growth is predominantly due to new data sources and an increased volume of data from existing sources. New data sources include 22,288 samples from The Cancer Genome Atlas and 920,441 samples from the Catalogue of Somatic Mutations in Cancer (COSMIC). [5]
Attributes define the material under investigation using structured name: value pairs, for example:
tissue: liver
collection date: 31-Jan-2013
After specifying the sample type, the user is presented with a list of required and optional attribute fields to fill in, as well as the opportunity to supply any number of custom descriptive attributes. The BioSample database is extendible in that new types and attributes can be added as new standards develop. In addition to BioSample type and attributes, each BioSample record also contains:
Field | Description |
---|---|
IDs | An identifier block that lists not only the BioSample accession assigned to that record, but also any other external sample identifier, such as that issued by the source database or repository. |
Organism | The organism name and taxonomy identifier. The full taxonomic tree is displayed and searchable. |
Title | BioSample title. A title is auto-generated if one is not supplied by the submitter. |
Description | [optional] A free text field in which to store non-structured information about the sample. |
Links | [optional] URL to link to relevant information on external sites. |
Owner | Submitter information, including name and affiliation where available. |
Dates | Information about when the record was submitted, released, and last updated. |
Access | Statement about whether the record is fully public or controlled access |
The full list and definitions of BioSample types and attributes are available for preview and download. [6]
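The record layout described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name, field names, fallback title format and the `SAMPLE0001` accession are hypothetical and are not taken from the BioSD schema, and the owner and date fields are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BioSampleRecord:
    """Illustrative container mirroring some of the record fields listed above."""
    accession: str                      # BioSample accession (IDs block)
    organism_name: str                  # organism name
    taxonomy_id: int                    # organism taxonomy identifier
    title: str = ""                     # filled in below when left empty
    external_ids: Dict[str, str] = field(default_factory=dict)  # source database -> identifier
    attributes: Dict[str, str] = field(default_factory=dict)    # name: value pairs
    description: Optional[str] = None   # optional free text
    links: List[str] = field(default_factory=list)              # optional URLs
    access: str = "public"              # "public" or "controlled"

    def __post_init__(self) -> None:
        # The article says a title is auto-generated when none is supplied;
        # the exact rule is not stated, so this format is an assumption.
        if not self.title:
            self.title = f"{self.organism_name} sample {self.accession}"


def parse_attributes(lines: List[str]) -> Dict[str, str]:
    """Parse 'name: value' lines into a dictionary, splitting on the first colon."""
    attributes = {}
    for line in lines:
        name, _, value = line.partition(":")
        attributes[name.strip()] = value.strip()
    return attributes


record = BioSampleRecord(
    accession="SAMPLE0001",  # placeholder, not a real accession
    organism_name="Homo sapiens",
    taxonomy_id=9606,
    attributes=parse_attributes(["tissue: liver", "collection date: 31-Jan-2013"]),
)
```

Keeping the attributes in a plain dictionary reflects the open-ended nature of the name: value pairs, since submitters may add any number of custom attributes.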
The database can be accessed in several ways. The initial public release of BioSD provided access only through a web interface. This web interface was updated in November 2012 and again in March 2013 following the EBI site-wide re-launch. In February 2013, a public application programming interface (API) based on representational state transfer (REST) was released. In October 2013, as part of the EBI's new RDF platform, a SPARQL endpoint was released, providing access to the data in the RDF format. Additionally, the database can be downloaded through EBI's FTP service. [7]
The web interface allows users to access the BioSD database through a web browser. It supports searching both by sample group and by individual sample. The search box offers incremental search, suggesting possible search terms as users type. An advanced search allows users to combine search terms with the Boolean operators AND, OR and NOT. Additionally, an asterisk (*) wildcard matches any sequence of characters, including none, and a question mark (?) matches any single character. [8] Examples of these can be seen in the following table:
Search query | Example results |
---|---|
mo*se | "mouse", "moose", "mose", "mofoobarse" |
mo?se | "mouse", "moose", "motse" |
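The wildcard semantics shown in the table can be reproduced with Python's standard `fnmatch` module, which uses the same meaning for `*` and `?`. This is a minimal sketch of the matching rules only; it is not how the BioSD search is implemented, and the candidate list and the `matching_terms` helper are made up for illustration. Note that `fnmatch` also gives special meaning to square brackets, which the article does not mention for BioSD.

```python
from fnmatch import fnmatchcase

# Made-up candidate terms, including the examples from the table above.
candidates = ["mouse", "moose", "mose", "mofoobarse", "motse", "horse"]


def matching_terms(pattern, terms):
    """Return the terms that match a pattern containing * and ? wildcards."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# * matches any run of characters, including an empty one.
star_matches = matching_terms("mo*se", candidates)

# ? matches exactly one character, so "mose" and "mofoobarse" are excluded.
question_matches = matching_terms("mo?se", candidates)
```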
The web interface also allows users to select a search result and view it in more detail. The detailed view provides further information and links to the assay database(s) from which the data was sourced. Results can also be sorted by column.
The API allows data to be retrieved programmatically. It is a RESTful service in which users query URI endpoints and receive XML in response. Separate URI endpoints exist for different types of request: finding a specific sample, finding a specific group, searching for groups, searching for samples, and searching for samples within a group. [9]
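The request pattern described above might look as follows from a client's point of view. Everything specific here is an assumption made for the sketch: the base URL, the path layout, the `query` parameter and the XML element names are hypothetical placeholders and should be replaced with the ones given in the API documentation. [9]

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

# Hypothetical base URL and path layout, used only for illustration.
BASE_URL = "https://example.org/biosd-api"


def sample_url(accession: str) -> str:
    """Build the URI for fetching one sample by accession."""
    return f"{BASE_URL}/sample/{quote(accession, safe='')}"


def group_samples_search_url(group_accession: str, query: str) -> str:
    """Build the URI for searching samples within one group."""
    params = urlencode({"query": query})
    return f"{BASE_URL}/group/{quote(group_accession, safe='')}/samples?{params}"


def parse_sample(xml_text: str) -> dict:
    """Extract the id and the name: value attributes from a sample document.

    The element and attribute names are assumptions made for this sketch.
    """
    root = ET.fromstring(xml_text.strip())
    attributes = {
        prop.get("name"): prop.findtext("Value")
        for prop in root.findall("Property")
    }
    return {"id": root.get("id"), "attributes": attributes}


# Made-up response body standing in for what the service would return.
EXAMPLE_RESPONSE = """
<Sample id="SAMPLE0001">
  <Property name="tissue"><Value>liver</Value></Property>
  <Property name="collection date"><Value>31-Jan-2013</Value></Property>
</Sample>
"""

parsed = parse_sample(EXAMPLE_RESPONSE)
```

In real use, the response body would come from an HTTP GET on the built URI, for example with `urllib.request.urlopen`, instead of the inline example string.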
The SPARQL endpoint allows users to search the database in a more comprehensive way than the standard web interface while still being usable from a web browser. [10] Queries written in SPARQL can combine conditions that the web search cannot express, at the cost of a steeper learning curve. The SPARQL endpoint returns results in the RDF format, which was originally designed with metadata in mind and is thus suited to the needs of BioSD. [11]
The development team forms part of Helen Parkinson's team at EMBL-EBI. It consists of software engineers and web developers, who are supported with domain-specific knowledge by ontologists and bioinformaticians.
The primary programming language used on the project is Java. The development team uses IntelliJ IDEA, an integrated development environment provided by JetBrains. Other tools used in the project include Bamboo, for continuous integration and the management of software releases, and YourKit, a Java profiler used to optimise performance and track down bugs. [12]
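A query against a SPARQL endpoint can be prepared as sketched below, following the standard SPARQL Protocol convention of passing the query text in a `query` parameter. The endpoint URL is a hypothetical placeholder, and the query uses only the generic `rdfs:label` property because the vocabulary of the BioSD RDF data is not described here. The `Accept` header asks for JSON-formatted results, which is an assumption about what a given endpoint supports.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "https://example.org/biosd/sparql"

# Generic query: list ten resources together with their rdfs:label.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource ?label
WHERE { ?resource rdfs:label ?label }
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query.strip()})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# The request is only built here; nothing is sent over the network.
request = build_sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would send the query; the sketch stops short of that so it can be read and run without network access.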
The project is developed as an open-source project with all source code being freely available on GitHub. [13]
Currently, the primary funding for BioSD development and maintenance is provided by the European Molecular Biology Laboratory (EMBL) core budget, which is in turn funded by its 20 member countries. [1] There have also been additional contributions from the European Commission in the form of a number of grants. [14] Further funding has come from the Human Induced Pluripotent Stem Cells Initiative, provided by the Wellcome Trust and the Medical Research Council, and from the EBiSC Innovative Medicines Initiative. [15]