The Genomic Standards Consortium (GSC) is an initiative working towards richer descriptions of our collection of genomes, metagenomes and marker genes. Established in September 2005, [1] this international community includes representatives from a range of major sequencing and bioinformatics centres (including NCBI, EMBL, DDBJ, JCVI, JGI, EBI, Sanger, FIG) and research institutions. The goal of the GSC is to promote mechanisms for standardizing the description of (meta)genomes, including the exchange and integration of (meta)genomic data. The number and pace of genomic and metagenomic sequencing projects will only increase as the use of ultra-high-throughput methods becomes commonplace, and standards are vital to scientific progress and data sharing.
Community-driven standards have the best chance of success if developed under the auspices of international working groups. Participants in the GSC include biologists, computer scientists, those building genomic databases and conducting large-scale comparative genomic analyses, and those with experience of building community-based standards. The GSC pursues its mission of working with the wider community by holding face-to-face meetings, forming working groups, and building consensus products that can be widely used in this community, and by bringing together investigators working in different systems to work on a common problem. [2]
The GSC has published a "Minimum Information about a (Meta)Genome Sequence" (MIGS/MIMS) specification and has since completed a "Minimum Information about an ENvironmental Sequence" specification. MIGS/MIMS/MIMARKS provides an extension of the minimum information already captured by the primary nucleotide sequence archives (INSDC or DDBJ/ENA/GenBank). The development of any checklist must be an open and iterative process that involves a balanced group of participants. Further, if a checklist is to be adopted as a tool for standardizing a particular area of knowledge, its development must be supported by mechanisms for achieving compliance. Work towards this goal has spawned a set of interlocking GSC projects. These include the Genomic Contextual Data Markup Language (GCDML), the Genomic Rosetta Stone (GRS), and Habitat-Lite. Newer projects include the M5 project.
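As a rough illustration of what a compliance mechanism for a minimum-information checklist might look like, the following Python sketch reports which required fields are absent from a metadata record. The field names in `REQUIRED_FIELDS` are hypothetical placeholders chosen for illustration; they are not the actual required set of MIGS/MIMS/MIMARKS, which is defined by the published specifications.

```python
# Hypothetical checklist: these field names are illustrative assumptions,
# not the real MIGS/MIMS/MIMARKS required fields.
REQUIRED_FIELDS = (
    "investigation_type",
    "project_name",
    "collection_date",
    "lat_lon",
    "geo_loc_name",
)


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for name in required:
        value = record.get(name)
        # Treat None, empty strings and whitespace-only strings as missing.
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def is_compliant(record, required=REQUIRED_FIELDS):
    """A record is compliant when no required field is missing."""
    return not missing_fields(record, required)


# Illustrative record with made-up values; two fields are blank or absent.
example_record = {
    "investigation_type": "metagenome",
    "project_name": "example soil survey",
    "collection_date": "2011-05-01",
    "lat_lon": "   ",
}

report = missing_fields(example_record)
```

In this sketch `report` lists the fields a submitter would still need to supply, which is the kind of feedback a submission tool could give before accepting a record.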
The GSC is interested in making and building links with other communities. The GSC is engaged in ontology development within the OBO Foundry. It is also a founding member community of the Minimum Information about a Biomedical or Biological Investigation (MIBBI), an umbrella community for supporting and co-ordinating the development of checklists describing Minimum Information Standards.
The GSC and the Earth Microbiome Project maintain the Biological Observation Matrix (BIOM) file format, an open JSON-based file format for representing arbitrary observation-by-sample contingency tables with associated sample and observation metadata. [3]
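To make the idea of a JSON-based observation-by-sample table concrete, here is a minimal Python sketch that turns a small dense count matrix into a sparse, BIOM-style JSON document. The key names (`rows`, `columns`, `matrix_type`, `shape`, `data` and so on) are assumed from the BIOM 1.0 JSON layout and should be checked against the format specification; some header fields (such as the generation date and generating software) are omitted, and the identifiers and counts are made up.

```python
import json


def make_biom_like_table(observation_ids, sample_ids, counts):
    """Build a sparse, BIOM-1.0-style dict from a dense list-of-lists matrix.

    counts[i][j] is the count of observation i in sample j.
    """
    if len(counts) != len(observation_ids):
        raise ValueError("one row of counts is needed per observation")

    sparse = []
    for i, row in enumerate(counts):
        if len(row) != len(sample_ids):
            raise ValueError("each row needs one count per sample")
        for j, value in enumerate(row):
            if value != 0:
                # Sparse entries are stored as [row index, column index, value].
                sparse.append([i, j, value])

    return {
        "id": None,
        "format": "Biological Observation Matrix 1.0.0",
        "type": "OTU table",
        "matrix_type": "sparse",
        "matrix_element_type": "int",
        "shape": [len(observation_ids), len(sample_ids)],
        # Per-observation and per-sample metadata would go in "metadata".
        "rows": [{"id": obs, "metadata": None} for obs in observation_ids],
        "columns": [{"id": smp, "metadata": None} for smp in sample_ids],
        "data": sparse,
    }


# Illustrative identifiers and counts (not real data).
table = make_biom_like_table(
    ["OTU_1", "OTU_2"],
    ["sample_A", "sample_B", "sample_C"],
    [[5, 0, 2], [0, 0, 7]],
)
biom_json = json.dumps(table, indent=2)
```

Storing only the non-zero cells keeps the file small when most observations are absent from most samples, which is typical of such contingency tables.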
The GSC maintains a list of publications on its wiki. This list includes reports from all workshops, articles from the special issue of the journal OMICS on data standards, and the publications describing the MIGS/MIMS and MIMARKS specifications in the journal Nature Biotechnology (May 2008 and May 2011 respectively). The GSC has also published a series of papers "Genomic Standards Consortium and Beyond" in the journal GigaScience. [4] [2]
Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through the use of high-throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.
The branches of science known informally as omics are various disciplines in biology whose names end in the suffix -omics, such as genomics, proteomics, metabolomics, metagenomics, phenomics and transcriptomics. Omics aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms.
The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.
The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. RGD is also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.
The International Neuroinformatics Coordinating Facility (INCF) is an international non-profit organization with the mission to develop, evaluate, and endorse standards and best practices that embrace the principles of Open, FAIR, and Citable neuroscience. INCF also provides training on how standards and best practices facilitate reproducibility and enable the publishing of the entirety of research output, including data and code. INCF was established in 2005 by recommendations of the Global Science Forum working group of the OECD. The INCF is hosted by the Karolinska Institutet in Stockholm, Sweden. The INCF network comprises institutions, organizations, companies, and individuals active in neuroinformatics, neuroscience, data science, technology, and science policy and publishing. The Network is organized in governing bodies and working groups which coordinate various categories of global neuroinformatics activities and which guide and oversee the development and endorsement of standards and best practices. The current Directors are Mathew Abrams and Helena Ledmyr, and the Governing Board Chair is Maryann Martone.
Takashi Gojobori is a Japanese molecular biologist, Vice-Director of the National Institute of Genetics (NIG) and the DNA Data Bank of Japan (DDBJ) at NIG, in Mishima, Japan. Gojobori is a Distinguished Professor at King Abdullah University of Science and Technology (KAUST) in Thuwal, Saudi Arabia. He is a Professor of Bioscience and Acting Director at the Computational Bioscience Research Center at KAUST.
The 1000 Genomes Project (1KGP), which took place from January 2008 to 2015, was an international research effort to establish the most detailed catalogue of human genetic variation at the time. Scientists planned to sequence the genomes of at least one thousand anonymous healthy participants from a number of different ethnic groups within the following three years, using advancements in newly developed technologies. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature. In 2012, the sequencing of 1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported results and the completion of the project and opportunities for future research.
SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next-generation DNA sequencing data. It is particularly suited to short-read sequencing data.
The Variant Call Format (VCF) is a standard text file format used in bioinformatics for storing gene sequence variations. The format was developed in 2010 for the 1000 Genomes Project and has since been used by other large-scale genotyping and DNA sequencing projects. VCF is a common output format for variant calling programs due to its relative simplicity and scalability. Many tools have been developed for editing and manipulating VCF files, including VCFtools, which was released in conjunction with the VCF format in 2011, and BCFtools, which was included as part of SAMtools until being split into an independent package in 2014.
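The text-based layout of VCF can be illustrated with a small parser. The following Python sketch reads the eight fixed, tab-separated columns of a data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and skips header lines beginning with `#`. It is a simplified sketch rather than a full implementation: it ignores per-sample genotype columns and does not type the INFO values, and the example lines are made up. The function and class names are illustrative and do not correspond to VCFtools or BCFtools.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alt: list
    qual: str
    filter: str
    info: dict


def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 tab-separated columns")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            # INFO entries are either key=value pairs or bare flags.
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    # Multiple alternate alleles are comma-separated.
    return VcfRecord(chrom, int(pos), vid, ref, alt.split(","), qual, flt, info_dict)


def read_vcf(lines):
    """Yield records from an iterable of lines, skipping headers and blanks."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Illustrative input (made-up variant, not real data).
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t12345\t.\tG\tA,T\t50\tPASS\tDP=14;DB\n",
]
records = list(read_vcf(example_lines))
```

For real analyses an established tool or library is preferable, since the full format includes typed INFO fields, sample columns and many edge cases this sketch leaves out.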
The Genomes OnLine Database (GOLD) is a web-based resource for comprehensive information regarding genome and metagenome sequencing projects, and their associated metadata, around the world. Since 2011, the GOLD database has been run by the DOE Joint Genome Institute.
The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.
MetaboLights is a data repository founded in 2012 for cross-species and cross-platform metabolomic studies that provides primary research data and metadata for metabolomic studies as well as a knowledge base for properties of individual metabolites. The database is maintained by the European Bioinformatics Institute (EMBL-EBI) and the development is funded by the Biotechnology and Biological Sciences Research Council (BBSRC). As of July 2018, the MetaboLights browse functionality covers 383 studies and two analytical platforms: NMR spectroscopy and mass spectrometry.
The Earth BioGenome Project (EBP) is an initiative that aims to sequence and catalog the genomes of all of Earth's currently described eukaryotic species over a period of ten years. The initiative would produce an open DNA database of biological information that provides a platform for scientific research and supports environmental and conservation initiatives. A scientific paper presenting the vision for the project was published in PNAS in April 2018, and the project officially launched November 1, 2018.
Nikos Kyrpides is a Greek-American bioscientist who has worked on the origins of life, information processing, bioinformatics, microbiology, metagenomics and microbiome data science. He is a senior staff scientist at the Berkeley National Laboratory, head of the Prokaryote Super Program, and leader of the Microbiome Data Science program at the US Department of Energy Joint Genome Institute.
GigaDB is a disciplinary repository launched in 2011 with the aim of ensuring long-term access to massive multidimensional datasets from life science and biomedical science studies. The datasets are diverse and include genomic, transcriptomic, and imaging data. The datasets are curated by GigaDB biocurators who are employed by BGI and China National GeneBank.
Manuel Corpas is an Anglo-Spanish biologist and entrepreneur known primarily for his contributions to the fields of bioinformatics and genomics. Corpas is currently Chief Scientist of the Cambridge startup Cambridge Precision Medicine, a tutor at the Institute for Continuing Education at the University of Cambridge and a lecturer at the Universidad Internacional de La Rioja. Corpas has worked on the human genome since the beginning of his career, being one of the first consumers to sequence his own genome and that of close relatives, which he published as the Corpasome. He has held positions at the Earlham Institute as Project Leader, and at the Wellcome Sanger Institute, developing the DECIPHER database, a database that aids in the diagnosis of patients with rare genomic disorders.
Minimum information standards are sets of guidelines and formats for reporting data derived by specific high-throughput methods. Their purpose is to ensure the data generated by these methods can be easily verified, analysed and interpreted by the wider scientific community. Ultimately, they facilitate the transfer of data from journal articles into databases in a form that enables data to be mined across multiple data sets. Minimum information standards are available for a wide variety of experiment types including microarray (MIAME), RNAseq (MINSEQE), metabolomics (MSI) and proteomics (MIAPE).
The Korean Genome Project (KGP) is the largest Korean genome project; as of April 2021, it included over 10,000 human genomes sequenced in Korea.
Susanna-Assunta Sansone is a British-Italian data scientist who is professor of data readiness at the University of Oxford where she leads the data readiness group and serves as associate director of the Oxford e-Research Centre. Her research investigates techniques for improving the interoperability, reproducibility and integrity of data.