The Open Regulatory Annotation Database (also known as ORegAnno) is designed to promote community-based curation of regulatory information. Specifically, the database contains information about regulatory regions, transcription factor binding sites, regulatory variants, and haplotypes.
For each entry, cross-references are maintained to EnsEMBL, dbSNP, Entrez Gene, the NCBI Taxonomy database and PubMed. The information within ORegAnno is regularly mapped and provided as a UCSC Genome Browser track. Furthermore, each entry is associated with its experimental evidence, embedded as an Evidence Ontology within ORegAnno. This allows researchers to analyze regulatory data while applying their own criteria for the suitability of the supporting evidence.
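The idea of selecting records by their supporting evidence can be illustrated with a minimal Python sketch. The record fields, identifiers, and evidence labels below are hypothetical placeholders chosen for illustration; they are not the actual ORegAnno schema or Evidence Ontology terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical regulatory record; field names are assumptions."""
    record_id: str
    record_type: str
    evidence_types: frozenset


def filter_by_evidence(records, accepted):
    """Keep records supported by at least one accepted evidence type."""
    accepted = frozenset(accepted)
    return [r for r in records if r.evidence_types & accepted]


# Invented example records, not real database entries.
records = [
    Record("rec-1", "transcription factor binding site",
           frozenset({"reporter gene assay", "gel shift assay"})),
    Record("rec-2", "regulatory region",
           frozenset({"computational prediction"})),
    Record("rec-3", "regulatory polymorphism",
           frozenset({"reporter gene assay"})),
]

# A user who trusts only direct experimental assays might choose:
trusted = {"reporter gene assay", "gel shift assay"}
selected = filter_by_evidence(records, trusted)
print([r.record_id for r in selected])
```

Keeping the evidence types as a set on each record lets each user supply a different list of accepted evidence without changing the records themselves.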
The project is open source: all data and software produced in the project can be freely accessed and used.
As of December 20, 2006, ORegAnno contained 4220 regulatory sequences (excluding deprecated records), comprising 2190 transcription factor binding sites, 1853 regulatory regions (enhancers, promoters, etc.), 170 regulatory polymorphisms, and 7 regulatory haplotypes, for 17 different organisms (predominantly Drosophila melanogaster, Homo sapiens, Mus musculus, Caenorhabditis elegans, and Rattus norvegicus, in that order). These records were obtained by manual curation of 828 publications by 45 ORegAnno users from the gene regulation community. The ORegAnno publication queue contained 4215 publications of which 858 were closed, 34 were in progress (open status), and 3321 were awaiting annotation (pending status). ORegAnno is continually updated, so current database contents should be obtained from www.oreganno.org.
The RegCreative jamboree was stimulated by a community initiative to curate in perpetuity the genomic sequences which have been experimentally determined to control gene expression. This objective is of fundamental importance to evolutionary analysis and translational research, as regulatory mechanisms are widely implicated in species-specific adaptation and the etiology of disease. The initiative culminated in the formation of an international consortium of like-minded scientists dedicated to accomplishing this task. The RegCreative jamboree gave these groups their first opportunity to meet, to assess accurately the current state of knowledge in gene regulation, and to begin developing standards by which to curate regulatory information.
In total, 44 researchers attended the workshop from 9 different countries and 23 institutions. Funding was also obtained from ENFIN, the BioSapiens Network, FWO Research Foundation, Genome Canada and Genome British Columbia.
The specific outcomes of the RegCreative meeting to date are:
A regulatory sequence is a segment of a nucleic acid molecule which is capable of increasing or decreasing the expression of specific genes within an organism. Regulation of gene expression is an essential feature of all living organisms and viruses.
The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry.
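Enrichment analysis, mentioned above, is often framed as a one-sided hypergeometric test: given a gene list, is a GO term annotated to more of its genes than chance would suggest? The following is a minimal sketch under that assumption; the function name and the example numbers are illustrative only, and real tools additionally correct for multiple testing and for the structure of the ontology.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     size of the gene list under study
    hits:       genes in the list annotated with the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Invented numbers: 20000 background genes, 100 carry the term,
# and 5 of the 50 genes in the study list carry it.
p = enrichment_p_value(20000, 100, 50, 5)
print(f"p-value: {p:.3g}")
```

Exact integer binomial coefficients from `math.comb` are used so that the sketch needs no third-party libraries; for large-scale use, a statistics library would normally be preferred.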
The Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which was launched in 1999 in response to the imminent completion of the Human Genome Project. Ensembl aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information.
The Encyclopedia of DNA Elements (ENCODE) is a public research project which aims to identify functional elements in the human genome.
The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. RGD is also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 34.0, was released in March 2021 and contains 19,179 families.
The Saccharomyces Genome Database (SGD) is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast.
Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
Interferome is an online bioinformatics database of interferon-regulated genes (IRGs), which are also known as interferon-stimulated genes (ISGs). The database contains information on genes regulated by type I, type II and type III interferons and is regularly updated. It is used by the interferon and cytokine research community both as an analysis tool and an information resource. Interferons were identified as antiviral proteins more than 50 years ago. However, their involvement in immunomodulation, cell proliferation, inflammation and other homeostatic processes has since been identified. These cytokines are used as therapeutics in many diseases such as chronic viral infections, cancer and multiple sclerosis. Interferons regulate the transcription of approximately 2000 genes in a manner that depends on interferon subtype, dose, cell type and stimulus. The database is an attempt at integrating information from high-throughput experiments and molecular biology databases to gain a detailed understanding of interferon biology.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. Genes in a eukaryotic genome can be annotated using various annotation tools such as FINDER. A modern annotation pipeline, such as MOSGA, can offer a user-friendly web interface and software containerization.
The Mammalian Promoter Database (MPromDb) is a curated database of gene promoters identified from ChIP-seq. The proximal promoter region contains the cis-regulatory elements of most of the transcription factors (TFs).
TRANSFAC is a manually curated database of eukaryotic transcription factors, their genomic binding sites and DNA binding profiles. The contents of the database can be used to predict potential transcription factor binding sites.
DisProt is a manually curated biological database of intrinsically disordered proteins (IDPs) and regions (IDRs). DisProt annotations cover state information on the protein but also, when available, its state transitions, interactions and functional aspects of disorder detected by specific experimental methods. DisProt is hosted and maintained in the BioComputing UP laboratory.
Blast2GO, first published in 2005, is a bioinformatics software tool for the automatic, high-throughput functional annotation of novel sequence data. It uses the BLAST algorithm to identify similar sequences and then transfers existing functional annotation from already characterised sequences to the novel one. The functional information is represented via the Gene Ontology (GO), a controlled vocabulary of functional attributes. The Gene Ontology, or GO, is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species.
In bioinformatics, the PANTHER classification system is a large curated biological database of gene/protein families and their functionally related subfamilies that can be used to classify and identify the function of gene products. PANTHER is part of the Gene Ontology Reference Genome Project designed to classify proteins and their genes for high-throughput analysis.
PomBase is a model organism database that provides online access to the fission yeast Schizosaccharomyces pombe genome sequence and annotated features, together with a wide range of manually curated functional gene-specific data. The PomBase website was redeveloped in 2016 to provide users with a more fully integrated, better-performing service.
JASPAR is an open access and widely used database of manually curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFM) and transcription factor flexible models (TFFM) for TFs from species in six taxonomic groups. From the supplied PFMs, users may generate position-specific weight matrices (PWM). The JASPAR database was introduced in 2004. Major updates and new releases followed in 2006, 2008, 2010, 2014, 2016, 2018, 2020 and 2022, the last of which is the latest release of JASPAR.
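The step from a PFM to a PWM can be sketched in a few lines of Python. This is a minimal illustration of one common log-odds convention, not JASPAR's own tooling: the pseudocount of 0.8, the uniform background of 0.25 per base, and the toy count matrix are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=0.8, background=None):
    """Convert per-position base counts into log2-odds weights.

    pfm maps each base to a list of counts, one per motif position.
    The pseudocount is split across bases in proportion to the
    background, which avoids taking the log of zero.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES)
        for b in BASES:
            freq = (pfm[b][i] + pseudocount * background[b]) / (total + pseudocount)
            pwm[b].append(math.log2(freq / background[b]))
    return pwm


def score(pwm, seq):
    """Sum the weights of each base of seq at its position."""
    return sum(pwm[base][i] for i, base in enumerate(seq))


# Toy counts for a three-position motif (invented, not a JASPAR profile).
toy_pfm = {
    "A": [8, 0, 1],
    "C": [0, 9, 1],
    "G": [1, 0, 7],
    "T": [1, 1, 1],
}
toy_pwm = pfm_to_pwm(toy_pfm)
print(score(toy_pwm, "ACG"), score(toy_pwm, "TTT"))
```

A sequence matching the most frequent base at every position receives a positive score, while a sequence of rarely observed bases scores below zero, which is what makes the PWM usable for scanning candidate binding sites.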
Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.