The Genome Reference Consortium (GRC) is an international collective of academic and research institutes with expertise in genome mapping, sequencing, and informatics, formed to improve the representation of reference genomes. At the time the human reference was initially described, it was clear that some regions were recalcitrant to analysis with existing technology, leaving gaps in the known sequence. The main reason for improving the reference assemblies is that they are the cornerstones upon which all whole-genome studies are based (e.g. the 1000 Genomes Project).
The GRC is a collaborative effort which interacts with various groups in the scientific community. [1] The primary member institutes are:
The goal of the Consortium is to correct the small number of regions in the reference that are currently misrepresented, to close as many remaining gaps as possible, and to produce alternative assemblies of structurally variant loci when necessary. Initially the focus was on the human and mouse reference genomes, but the GRC has since expanded to cover further organisms. In October 2010 full maintenance and improvement of the zebrafish genome sequence was added to the GRC; [2] in 2015, after the release of the chicken genome assembly Gallus_gallus-5.0, the GRC added the chicken reference genome; [3] and in November 2020 the rat genome assembly was added. [4]
As of September 2019, the major assembly releases for human, mouse, zebrafish, and chicken are GRCh38, GRCm38, GRCz11, and GRCg6a, respectively. Major assembly releases do not follow a fixed cycle; however, there are minor assembly updates in the form of genome patches, which either correct errors in the assembly or add alternate loci. [5] These assemblies are represented in various genome browsers and databases, including Ensembl, the NCBI resources, and the UCSC Genome Browser.
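The assembly names above follow a visible pattern: the prefix "GRC", a lowercase species letter, and a major version number. As a minimal sketch of how such names might be split into their parts, the Python function below parses them with a regular expression. The function name `parse_assembly`, the `Assembly` tuple, and the regular expression are illustrative assumptions, not a GRC tool or official grammar. The optional `.pN` suffix for patch releases (as in GRCh38.p13) is a convention not described in the text above, and the species table covers only the four organisms named in it.

```python
import re
from typing import NamedTuple, Optional

# Species letters inferred from the assembly names listed in the text
# (GRCh38, GRCm38, GRCz11, GRCg6a). Other organisms are not covered.
SPECIES = {"h": "human", "m": "mouse", "z": "zebrafish", "g": "chicken"}

# Assumed pattern: "GRC" + species letter + major version, then an optional
# letter suffix (as in GRCg6a) and an optional ".pN" patch suffix.
_NAME = re.compile(r"^GRC([a-z])(\d+)([a-z]?)(?:\.p(\d+))?$")


class Assembly(NamedTuple):
    species: str          # common name of the organism
    major: int            # major release number
    suffix: str           # letter suffix, "" if absent
    patch: Optional[int]  # patch number, None for an unpatched release


def parse_assembly(name: str) -> Assembly:
    """Split a GRC-style assembly name into its components."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised GRC assembly name: {name!r}")
    letter, major, suffix, patch = match.groups()
    if letter not in SPECIES:
        raise ValueError(f"unknown species letter {letter!r} in {name!r}")
    return Assembly(
        species=SPECIES[letter],
        major=int(major),
        suffix=suffix,
        patch=int(patch) if patch is not None else None,
    )
```

For example, `parse_assembly("GRCh38.p13")` yields a human assembly with major version 38 and patch 13, while names from other naming schemes (such as `hg19`) raise `ValueError`.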
Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism and to annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome sequences.
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, United States.
The Wellcome Sanger Institute, previously known as The Sanger Centre and Wellcome Trust Sanger Institute, is a non-profit British genomics and genetics research institute, primarily funded by the Wellcome Trust.
The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.
The Bioinformatic Harvester was a bioinformatic meta search engine for gene and protein-associated information, created by the European Molecular Biology Laboratory and subsequently hosted and further developed by the Karlsruhe Institute of Technology (KIT). Harvester worked with human, mouse, rat, zebrafish, Drosophila and Arabidopsis thaliana information. It cross-linked more than 50 popular bioinformatic resources and allowed cross searches, serving tens of thousands of pages every day to scientists and physicians. The service has been down since 2014.
HomoloGene, a tool of the United States National Center for Biotechnology Information (NCBI), is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes.
John Frederick William Birney is joint director of EMBL's European Bioinformatics Institute (EMBL-EBI), in Hinxton, Cambridgeshire and deputy director general of the European Molecular Biology Laboratory (EMBL). He also serves as non-executive director of Genomics England, chair of the Global Alliance for Genomics and Health (GA4GH) and honorary professor of bioinformatics at the University of Cambridge. Birney has made significant contributions to genomics, through his development of innovative bioinformatics and computational biology tools. He previously served as an associate faculty member at the Wellcome Trust Sanger Institute.
Mouse Genome Informatics (MGI) is a free, online database and bioinformatics resource hosted by The Jackson Laboratory, with funding by the National Human Genome Research Institute (NHGRI), the National Cancer Institute (NCI), and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). MGI provides access to data on the genetics, genomics and biology of the laboratory mouse to facilitate the study of human health and disease. The database integrates multiple projects, with the two largest contributions coming from the Mouse Genome Database and Mouse Gene Expression Database (GXD). As of 2018, MGI contains data curated from over 230,000 publications.
UniGene was an NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry was a set of transcripts that appeared to stem from the same transcription locus. Information on protein similarities, gene expression, cDNA clones, and genomic location was included with each entry.
The Vertebrate Genome Annotation (VEGA) database is a biological database dedicated to assisting researchers in locating specific areas of the genome and annotating genes or regions of vertebrate genomes. The VEGA browser is based on Ensembl web code and infrastructure and provides a public curation of known vertebrate genes for the scientific community. The VEGA website is updated frequently to maintain the most current information about vertebrate genomes and attempts to present consistently high-quality annotation of all its published vertebrate genomes or genome regions. VEGA was developed by the Wellcome Trust Sanger Institute and is in close association with other annotation databases, such as ZFIN, the Havana Group and GenBank. Manual annotation is currently more accurate at identifying splice variants, pseudogenes, polyadenylation features, non-coding regions and complex gene arrangements than automated methods.
A reference genome is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. Because they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, the most recent human reference genome is derived from more than 60 genomic clone libraries. There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or the UCSC Genome Browser.
The Consensus Coding Sequence (CCDS) Project is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies. The CCDS project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier, and ensures that they are consistently represented by the National Center for Biotechnology Information (NCBI), Ensembl, and UCSC Genome Browser. The integrity of the CCDS dataset is maintained through stringent quality assurance testing and on-going manual curation.
The International Protein Index (IPI) is a defunct protein database launched in 2001 by the European Bioinformatics Institute (EBI), and closed in 2011. Its purpose was to provide the proteomics community with a resource that enables
CARMIL1 is a protein that in humans is encoded by the CARMIL1 gene. The gene is also known as LRRC16, LRRC16A, CARMIL, or CARMIL1a.
Spinster homolog 2 (Drosophila) is a protein that in humans is encoded by the SPNS2 gene.
Timothy John Phillip Hubbard is a Professor of Bioinformatics at King's College London, Head of Genome Analysis at Genomics England and Honorary Faculty at the Wellcome Trust Sanger Institute in Cambridge, UK.
Deanna Church is a scientist working in the areas of bioinformatics and genomics. She is known for her work on the human genome, "making the genome a friendlier place".
Donna R. Maglott is a staff scientist at the National Center for Biotechnology Information known for her research on large-scale genomics projects, including the mouse genome and development of databases required for genomics research.