BioSamples

Content
Description: A database containing aggregated information pertaining to reference samples and samples stored in the European Bioinformatics Institute assay databases.
Data types captured: Biological sample metadata
Organisms: All
Contact
Research center: European Bioinformatics Institute
Authors: Mikhail Gostev
Primary citation: Gostev et al. (2012) [1]
Release date: 2011
Access
Data format: XML, RDF
Website: EBI page, NCBI page
Download URL: EBI FTP
Web service URL: REST
SPARQL endpoint: BioSD SPARQL
Tools
Web: Sample display, advanced search by samples and groups, sorting by columns, links to assay database records
Miscellaneous
License: Unrestricted
Versioning: Yes
Data release frequency: Daily
Curation policy: Yes (manual)
Bookmarkable entities: Yes (samples and sample groups)

BioSamples (BioSD) is a database at the European Bioinformatics Institute that stores information about the biological samples used in sequencing. [1]

The database stores submitter-supplied metadata about the biological materials from which data stored in the National Center for Biotechnology Information’s (NCBI) primary data archives are derived. NCBI’s archives host data pertaining to diverse types of samples from many species, and the BioSample database is correspondingly diverse. Examples of a BioSample include a primary tissue biopsy, an individual organism, or an environmental isolate.

The BioSamples database captures sample metadata in a structured way by encouraging the use of controlled vocabularies for sample attribute field names. This metadata is key to giving the sample data context: it allows the data to be more fully understood and reused, and it enables aggregation of disparate data sets.

Sample metadata is linked to relevant experimental data across many archival databases. This relieves the burden on submitters, who need to submit a sample description only once and can then reference that sample whenever they deposit data in other archives.

BioSample records are indexed and searchable, supporting cross-database queries by sample description.

History

The BioSamples database was launched in 2011 to help aggregate and standardise sample metadata. Historically, each archive had created its own convention for collecting sample metadata. These conventions were usually limited in their standardisation and had no method of indicating when a sample was used across multiple data sets. In addition, there is a growing awareness in the research community that sample metadata is vital for understanding the underlying data, and that improved metadata increases the chances for re-use, aggregation and integration of data. The database was initially populated with existing descriptions extracted from SRA, EST, GSS and dbGaP. [2] As of May 2013, the database hosted almost 2 million BioSample records encompassing 18,000 species. [3]

Content

The BioSamples database has more than doubled in size since January 2012, when it described 1 million samples; as of October 2013, 2,846,137 samples were available in 80,232 groups. [4] The rapid growth is predominantly due to new data sources and an increased volume of data from existing sources. New data sources include The Cancer Genome Atlas (22,288 samples) and the Catalogue of Somatic Mutations in Cancer (COSMIC; 920,441 samples). [5]

Attributes define the material under investigation using structured name: value pairs, for example:

tissue: liver
collection date: 31-Jan-2013

After specifying the sample type, the user is presented with a list of required and optional attribute fields to fill in, as well as the opportunity to supply any number of custom descriptive attributes. The BioSample database is extendible in that new types and attributes can be added as new standards develop. In addition to BioSample type and attributes, each BioSample record also contains:

IDs: An identifier block that lists not only the BioSample accession assigned to that record, but also any other external sample identifier, such as that issued by the source database or repository.
Organism: The organism name and taxonomy identifier. The full taxonomic tree is displayed and searchable.
Title: BioSample title. A title is auto-generated if one is not supplied by the submitter.
Description (optional): A free-text field in which to store non-structured information about the sample.
Links (optional): URLs linking to relevant information on external sites.
Owner: Submitter information, including name and affiliation where available.
Dates: Information about when the record was submitted, released, and last updated.
Access: A statement of whether the record is fully public or controlled access.

The full list and definitions of BioSample types and attributes are available for preview and download. [6]
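The record layout described above can be sketched as a small data structure. This is only an illustrative model: the class name `SampleRecord`, its field names, and the accession `EXAMPLE0001` are assumptions made for the example and are not taken from the BioSamples schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def parse_attributes(lines: List[str]) -> Dict[str, str]:
    """Parse 'name: value' lines into a dictionary of attributes."""
    attributes: Dict[str, str] = {}
    for line in lines:
        # Split on the first colon only, so values may themselves contain colons.
        name, _, value = line.partition(":")
        if name.strip() and value.strip():
            attributes[name.strip()] = value.strip()
    return attributes


@dataclass
class SampleRecord:
    """Hypothetical in-memory model of one BioSample record."""
    accession: str                      # accession from the IDs block
    organism: str                       # organism name
    taxonomy_id: int                    # taxonomy identifier
    title: str
    owner: str                          # submitter name or affiliation
    access: str = "public"              # "public" or "controlled"
    description: Optional[str] = None   # optional free text
    links: List[str] = field(default_factory=list)               # optional URLs
    external_ids: Dict[str, str] = field(default_factory=dict)   # other identifiers
    dates: Dict[str, str] = field(default_factory=dict)          # submitted, released, updated
    attributes: Dict[str, str] = field(default_factory=dict)     # name: value pairs


record = SampleRecord(
    accession="EXAMPLE0001",  # made-up accession, for illustration only
    organism="Homo sapiens",
    taxonomy_id=9606,
    title="Liver biopsy",
    owner="Example Lab",
    attributes=parse_attributes(["tissue: liver", "collection date: 31-Jan-2013"]),
)
print(record.attributes)
```

Keeping the attributes in a free-form dictionary, separate from the fixed fields, mirrors the distinction drawn above between the fixed parts of a record and the required, optional, or custom attributes of a given sample type.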

Data Access

The database can be accessed in several ways. The initial public release of BioSD provided access only through a web interface. This web interface was updated in November 2012 and again in March 2013, following the EBI site-wide re-launch. In February 2013, a public application programming interface (API) based on representational state transfer (REST) was released. In October 2013, as part of the EBI's new RDF platform, a SPARQL endpoint was released, providing access to the data in RDF format. Additionally, the database can be downloaded through EBI's FTP service. [7]

Web Interface

The web interface allows users to access the BioSD database through a web browser. It supports searching both by sample groups and by individual samples. Incremental search assists users by suggesting possible search terms as they type. Advanced search allows users to combine search terms with the Boolean operators AND, OR and NOT. Additionally, an asterisk (*) can be used as a wildcard that matches any sequence of characters, including none, and a question mark (?) can be used to match any single character. [8] Examples can be seen in the following table:

Search query | Example results
mo*se | "mouse", "moose", "mose", "mofoobarse"
mo?se | "mouse", "moose", "motse"
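The wildcard rules described above can be reproduced locally with Python's standard `fnmatch` module, whose `*` and `?` have the same meaning. This is only an illustration of the matching semantics, not the search implementation used by BioSamples; note also that `fnmatch` gives special meaning to square brackets, which the description above does not mention.

```python
from fnmatch import fnmatchcase
from typing import List


def matching_terms(pattern: str, terms: List[str]) -> List[str]:
    """Return the terms that match a pattern containing * and ? wildcards."""
    # fnmatchcase is case-sensitive and must match the whole term.
    return [term for term in terms if fnmatchcase(term, pattern)]


terms = ["mouse", "moose", "mose", "motse", "mofoobarse"]

# * matches any run of characters, including an empty one.
print(matching_terms("mo*se", terms))
# → ['mouse', 'moose', 'mose', 'motse', 'mofoobarse']

# ? matches exactly one character, so "mose" and "mofoobarse" are excluded.
print(matching_terms("mo?se", terms))
# → ['mouse', 'moose', 'motse']
```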

The web interface also allows users to select a search result and view it in more detail. The detailed view provides further information and links to the assay database(s) from which the data was sourced. Results can also be sorted by column.

Application Programming Interface

The API provides a way to retrieve data programmatically. It is RESTful: users query URI endpoints and receive XML in response. The API has URI endpoints for several types of request: finding a specific sample, finding a specific group, searching for groups, searching for samples, and searching for samples within a group. [9]
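A client for an API of this kind might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the base URL uses the reserved `example.org` domain rather than the real service, the path layout and the `query` and `pagesize` parameters are hypothetical, and the XML document is an invented stand-in for a real response. Only the general pattern (build a URI, receive XML, parse it) comes from the description above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple
from urllib.parse import quote, urlencode

# Placeholder base URL; NOT the real BioSamples service address.
BASE_URL = "https://example.org/biosamples/xml"


def sample_url(accession: str) -> str:
    """Build a (hypothetical) URI for fetching one sample by accession."""
    return f"{BASE_URL}/sample/{quote(accession, safe='')}"


def sample_search_url(query: str, page_size: int = 25) -> str:
    """Build a (hypothetical) URI for searching samples."""
    return f"{BASE_URL}/sample?{urlencode({'query': query, 'pagesize': page_size})}"


# Invented XML payload standing in for a response; the real schema may differ.
SAMPLE_XML = """<Sample id="EXAMPLE0001">
  <Property name="tissue"><Value>liver</Value></Property>
  <Property name="organism"><Value>Homo sapiens</Value></Property>
</Sample>"""


def parse_sample(xml_text: str) -> Tuple[str, Dict[str, str]]:
    """Extract the sample id and its name/value properties from the XML."""
    root = ET.fromstring(xml_text)
    properties = {
        prop.get("name"): prop.findtext("Value")
        for prop in root.findall("Property")
    }
    return root.get("id"), properties


print(sample_url("EXAMPLE0001"))
print(sample_search_url("liver"))
print(parse_sample(SAMPLE_XML))
```

In a real client, the XML text would come from an HTTP GET on the constructed URI (for example with `urllib.request.urlopen`) instead of a string constant.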


SPARQL Endpoint

The SPARQL endpoint allows users to search the database in a more comprehensive way than the standard web interface, while still being usable from a web browser. [10] Far more complex queries can be expressed through this interface, although it has a steeper learning curve than the other access methods. The SPARQL endpoint returns results in RDF, a format that was originally designed for describing metadata and is therefore well suited to the needs of BioSD. [11]
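As a rough illustration of how a SPARQL endpoint is queried over HTTP, the sketch below builds a GET request that carries the query text in the `query` parameter, as defined by the SPARQL protocol. The endpoint address is a placeholder on the reserved `example.org` domain, not the BioSD endpoint, and the query deliberately uses only generic triple patterns so that it assumes nothing about the BioSD RDF vocabulary.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; NOT the real BioSD SPARQL address.
ENDPOINT = "https://example.org/sparql"

# A vocabulary-independent query: list any ten triples in the store.
QUERY = """SELECT ?subject ?predicate ?object
WHERE { ?subject ?predicate ?object }
LIMIT 10"""


def build_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) an HTTP GET request for a SPARQL query."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for SELECT results as JSON; other result formats can be negotiated.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because the endpoint above is only a placeholder.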

Development

The development team forms a part of Helen Parkinson's team at EMBL-EBI and contains software engineers and web developers who are assisted with domain specific knowledge by ontologists and bioinformaticians.

The primary programming language used on the project is Java. The development team uses IntelliJ IDEA, an integrated development environment provided by JetBrains. Other tools used in the project include Bamboo, for continuous integration and the management of software releases, and YourKit, a Java profiler that helps to optimise the BioSD code and eliminate bugs. [12]

The project is developed as an open-source project with all source code being freely available on GitHub. [13]

Funding

The primary funding for BioSD development and maintenance is provided by the European Molecular Biology Laboratory (EMBL) core budget, which is in turn funded by EMBL's 20 member countries. [1] There have also been additional contributions from the European Commission in the form of a number of grants. [14] Further funding has come from the Human Induced Pluripotent Stem Cells Initiative, provided by the Wellcome Trust and the Medical Research Council, and from the EBiSC Innovative Medicines Initiative. [15]

References

  1. Gostev, Mikhail; Faulconbridge, Adam; Brandizi, Marco; Fernandez-Banet, Julio; Sarkans, Ugis; Brazma, Alvis; Parkinson, Helen (Jan 2012). "The BioSample Database (BioSD) at the European Bioinformatics Institute". Nucleic Acids Res. England. 40 (1): D64-70. doi:10.1093/nar/gkr937. PMC 3245134. PMID 22096232.
  2. "About biosharing database of Genotypes and Phenotypes (dbGaP)" (HTML). Retrieved 11 September 2014.
  3. Barrett, Tanya (14 November 2013). "The NCBI Handbook [Internet] 2nd edition". Retrieved 11 September 2014.
  4. Faulconbridge, Adam; Tony Burdett; Marco Brandizi; Mikhail Gostev; Rui Pereira; Drashtti Vasant; Ugis Sarkans; Alvis Brazma; Helen Parkinson (20 November 2013). "Updates to BioSamples database at European Bioinformatics Institute". Nucleic Acids Research. England. 42 (Database issue): D50-2. doi:10.1093/nar/gkt1081. PMC 3965081. PMID 24265224.
  5. Shepherd, R; Beare D; Bamford S; Cole CG; Ward S; Bindal N; Gunasekaran P; Jia M; Kok CY; et al. (23 May 2011). "Data mining using the Catalogue of Somatic Mutations in Cancer BioMart". Database (Oxford). England. 2011: bar018. doi:10.1093/database/bar018. PMC 3263736. PMID 21609966.
  6. "BioSample Template Generator". EMBL-EBI (HTML). Retrieved 11 September 2014.
  7. "BioSamples News". EMBL-EBI (HTML). Archived from the original on 10 September 2014. Retrieved 11 September 2014.
  8. "How to search BioSamples Database". EMBL-EBI (HTML). Archived from the original on 11 September 2014. Retrieved 11 September 2014.
  9. "BioSamples API Overview". EMBL-EBI (HTML). Retrieved 29 September 2018.
  10. "BioSamples Database SPARQL Endpoint". EMBL-EBI (HTML). Retrieved 11 September 2014.
  11. "Biosamples Database RDF". EMBL-EBI (HTML). Retrieved 11 September 2014.
  12. "About BioSamples". EMBL-EBI (HTML). Retrieved 10 September 2014.
  13. "EBI BioSamples Database GitHub Project". GitHub (HTML). Retrieved 10 September 2014.
  14. Faulconbridge, A.; Burdett, T.; Brandizi, M.; Gostev, M.; Pereira, R.; Vasant, D.; Sarkans, U.; Brazma, A.; Parkinson, H. (2013). "Updates to BioSamples database at European Bioinformatics Institute". Nucleic Acids Research. 42 (D1): D50–D52. doi:10.1093/nar/gkt1081. ISSN 0305-1048. PMC 3965081. PMID 24265224.
  15. "BioSamples: Quick tour". EMBL-EBI (HTML). Archived from the original on 10 September 2014. Retrieved 10 September 2014.