Content | |
---|---|
Description | Curated collection of binding models for human and mouse transcription factors |
Data types captured | Transcription factor binding profiles |
Organisms | Homo sapiens, Mus musculus |
Contact | |
Laboratory | autosome.org |
Authors | Vorontsov, Makeev, Kulakovskiy |
Primary citation | Vorontsov et al [1] |
Access | |
Website | HOCOMOCO |
HOCOMOCO [1] [2] [3] [4] is an open-access database providing curated and benchmarked binding motifs of human and mouse transcription factors. It captures the following data types: Homo sapiens (human) and Mus musculus (mouse) transcription factors, their DNA binding site motifs, and motif subtypes.
Transcription factors (TFs) are proteins that bind DNA and thus regulate the transcription process. The binding is sequence-specific. A sequence motif [5] is a model that describes the common pattern of the DNA binding sites [6] that a particular TF prefers to bind. One possible representation of the model is the Position-Weight Matrix (PWM) [7].
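As a rough illustration of how a PWM is used, the sketch below scores every window of a DNA sequence against a matrix and reports the best-scoring position. The matrix values, the function names `score_window` and `best_hit`, and the list-of-dictionaries layout are assumptions made for this example; they are not taken from HOCOMOCO.

```python
# Toy log-odds PWM for a 4-bp motif "TGAC".
# The weights are illustrative only and do not come from any real database.
PWM = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]


def score_window(pwm, window):
    """Sum the per-position weights of the bases in one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window on the given strand."""
    width = len(pwm)
    hits = [
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)


print(best_hit(PWM, "AATGACGG"))
```

In practice a score threshold would decide which windows count as predicted binding sites, and both DNA strands would normally be scanned.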
According to the Web of Science, the 2018 publication of HOCOMOCO [2] has been cited 396 times (as of January 2024). The publications [3] [4] have been cited 144 and 151 times, respectively.
In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an N-glycosylation site motif can be defined as Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but Pro residue.
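The N-glycosylation pattern described above maps naturally onto a regular expression. The following is a minimal sketch, assuming one-letter amino-acid codes; the function name `find_n_glyc_sites` and the test sequence are made up for illustration.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr, then any residue except Pro.
# The lookahead wrapper lets overlapping sites be reported as well.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glyc_sites(protein):
    """Return (0-based start, matched residues) for each candidate site."""
    return [(m.start(), m.group(1)) for m in N_GLYC.finditer(protein)]


print(find_n_glyc_sites("MNGTANPSANKSP"))
```

A match only indicates that the sequence pattern is present; whether a site is actually glycosylated depends on factors the pattern does not capture.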
DNA-binding proteins are proteins that have DNA-binding domains and thus have a specific or general affinity for single- or double-stranded DNA. Sequence-specific DNA-binding proteins generally interact with the major groove of B-DNA, because it exposes more functional groups that identify a base pair.
In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides or amino acids. A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and diversity of the sequences. Sequence logos are frequently used to depict sequence characteristics such as protein-binding sites in DNA or functional units in proteins.
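The letter heights in a DNA sequence logo are commonly derived from the information content of each alignment column. The sketch below assumes gap-free aligned DNA sites of equal length and omits the small-sample correction that some logo tools apply; the function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet="ACGT"):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = 0.0
    for letter in alphabet:
        p = counts[letter] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return math.log2(len(alphabet)) - entropy


def logo_heights(aligned_sites, alphabet="ACGT"):
    """Per-column letter heights: relative frequency times column information."""
    heights = []
    for column in zip(*aligned_sites):
        info = column_information(column, alphabet)
        total = len(column)
        heights.append(
            {letter: column.count(letter) / total * info for letter in alphabet}
        )
    return heights


sites = ["ACGT", "ACGA", "ACCA", "ATCA"]
for position, column_heights in enumerate(logo_heights(sites)):
    print(position, column_heights)
```

A fully conserved column reaches the maximum of 2 bits for DNA, while a column with all four bases equally frequent has zero height.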
SOX genes encode a family of transcription factors that bind to the minor groove in DNA, and belong to a super-family of genes characterized by a homologous sequence called the HMG-box. This HMG box is a DNA binding domain that is highly conserved throughout eukaryotic species. Homologues have been identified in insects, nematodes, amphibians, reptiles, birds and a range of mammals. However, HMG boxes can be very diverse in nature, with only a few amino acids being conserved between species.
Cis-regulatory elements (CREs) or cis-regulatory modules (CRMs) are regions of non-coding DNA which regulate the transcription of neighboring genes. CREs are vital components of genetic regulatory networks, which in turn control morphogenesis, the development of anatomy, and other aspects of embryonic development, studied in evolutionary developmental biology.
The initiator element (Inr), sometimes referred to as the initiator motif, is a core promoter element that is similar in function to the Pribnow box or the TATA box. The Inr is the simplest functional promoter that is able to direct transcription initiation without a functional TATA box. It has the consensus sequence YYANWYY in humans. Like the TATA box, the Inr element facilitates the binding of transcription factor II D (TFIID). The Inr works by enhancing binding affinity and strengthening the promoter.
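The consensus YYANWYY is written in IUPAC nucleotide codes, where Y stands for C or T, W for A or T, and N for any base. A minimal sketch of expanding such a consensus into a regular expression and scanning one strand follows; the helper names and the example sequence are assumptions made for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that reports overlapping matches."""
    body = "".join(IUPAC[symbol] for symbol in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_matches(consensus, dna):
    """Return (0-based start, matched sequence) for each match on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(dna.upper())]


print(find_matches("YYANWYY", "GGCCATTCCGG"))
```

Because the consensus is short and degenerate, exact matches are expected to occur frequently by chance, so a match alone is weak evidence of a functional initiator.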
High-mobility group protein HMG-I/HMG-Y is a protein that in humans is encoded by the HMGA1 gene.
CCAAT/enhancer-binding protein gamma (C/EBPγ) is a protein that in humans is encoded by the CEBPG gene. This gene has no introns.
ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.
Anders Krogh is a bioinformatician at the University of Copenhagen, where he leads the university's bioinformatics center. He is known for his pioneering work on the use of hidden Markov models in bioinformatics, and is co-author of a widely used textbook in bioinformatics. He also co-authored one of the early textbooks on neural networks. His current research interests include promoter analysis, non-coding RNA, gene prediction and protein structure prediction.
BIOBASE is an international bioinformatics company headquartered in Wolfenbüttel, Germany. The company focuses on the generation, maintenance, and licensing of databases in the field of molecular biology, and their related software platforms.
DNA binding sites are a type of binding site found in DNA where other molecules may bind. DNA binding sites are distinct from other binding sites in that (1) they are part of a DNA sequence and (2) they are bound by DNA-binding proteins. DNA binding sites are often associated with specialized proteins known as transcription factors, and are thus linked to transcriptional regulation. The sum of DNA binding sites of a specific transcription factor is referred to as its cistrome. DNA binding sites also encompass the targets of other proteins, like restriction enzymes, site-specific recombinases and methyltransferases.
Phyloscan is a web service for DNA sequence analysis that is free and open to all users. For locating matches to a user-specified sequence motif for a regulatory binding site, Phyloscan provides a statistically sensitive scan of user-supplied mixed aligned and unaligned DNA sequence data. Phyloscan's strength is that it brings together
In molecular biology, the BEN domain is a protein domain which is found in diverse proteins including:
TRANSFAC is a manually curated database of eukaryotic transcription factors, their genomic binding sites and DNA binding profiles. The contents of the database can be used to predict potential transcription factor binding sites.
The WRKY domain is found in the WRKY transcription factor family, a class of transcription factors. The WRKY domain is found almost exclusively in plants although WRKY genes appear present in some diplomonads, social amoebae and other amoebozoa, and fungi incertae sedis. They appear absent in other non-plant species. WRKY transcription factors have been a significant area of plant research for the past 20 years. The WRKY DNA-binding domain recognizes the W-box (T)TGAC(C/T) cis-regulatory element.
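Since a binding site can lie on either DNA strand, a search for the W-box usually scans the sequence and its reverse complement. The sketch below treats TGAC(C/T) as the core and records whether the optional 5' T is present; this reading of the notation, the function names, and the example sequence are assumptions made for illustration.

```python
import re

# Core W-box TGAC followed by C or T; the lookahead reports overlapping cores.
W_BOX_CORE = re.compile(r"(?=(TGAC[CT]))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def _scan(strand_seq):
    """Find core matches on one strand and note the optional leading T."""
    hits = []
    for m in W_BOX_CORE.finditer(strand_seq):
        start = m.start()
        has_flank_t = start > 0 and strand_seq[start - 1] == "T"
        hits.append((start, m.group(1), has_flank_t))
    return hits


def find_w_boxes(dna):
    """Return (strand, forward-strand start, core site, has leading T) tuples."""
    dna = dna.upper()
    hits = [("+", start, site, flank) for start, site, flank in _scan(dna)]
    for start, site, flank in _scan(reverse_complement(dna)):
        # Convert the reverse-strand offset back to forward-strand coordinates.
        hits.append(("-", len(dna) - (start + len(site)), site, flank))
    return hits


print(find_w_boxes("AATTGACCGGGTCAAT"))
```

Reported start positions are 0-based and always refer to the forward strand, so hits from both strands can be compared directly.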
Transcription factors are proteins that bind genomic regulatory sites. Identification of genomic regulatory elements is essential for understanding the dynamics of developmental, physiological and pathological processes. Recent advances in chromatin immunoprecipitation followed by sequencing (ChIP-seq) have provided powerful ways to profile DNA-binding proteins and histone modifications genome-wide. ChIP-seq methods have been used to reliably identify transcription factor binding sites and histone modification sites.
JASPAR is an open-access and widely used database of manually curated, non-redundant transcription factor (TF) binding profiles stored as position frequency matrices (PFM) and transcription factor flexible models (TFFM) for TFs from species in six taxonomic groups. From the supplied PFMs, users may generate position-specific weight matrices (PWM). The JASPAR database was introduced in 2004. Eight major updates and new releases followed, in 2006, 2008, 2010, 2014, 2016, 2018, 2020 and 2022, the most recent of these.
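One common way to turn a PFM into a log-odds PWM is sketched below. The pseudocount of 1.0, the uniform background, and the function name `pfm_to_pwm` are assumptions chosen for this example and are not prescribed by JASPAR; other conventions exist.

```python
import math


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Convert per-base count lists into a list of log2-odds columns.

    pfm maps each base to a list of counts, one entry per motif position.
    The pseudocount is spread across bases in proportion to the background.
    """
    bases = "ACGT"
    background = background or {base: 0.25 for base in bases}
    width = len(pfm["A"])
    pwm = []
    for i in range(width):
        total = sum(pfm[base][i] for base in bases)
        column = {}
        for base in bases:
            freq = (pfm[base][i] + pseudocount * background[base]) / (
                total + pseudocount
            )
            column[base] = math.log2(freq / background[base])
        pwm.append(column)
    return pwm


# Made-up counts for a 2-bp motif, not taken from any database.
toy_pfm = {"A": [10, 0], "C": [0, 5], "G": [0, 5], "T": [0, 0]}
for column in pfm_to_pwm(toy_pfm):
    print(column)
```

The pseudocount keeps zero counts from producing a weight of negative infinity; bases seen more often than the background expects get positive weights, and rarer ones get negative weights.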
Ivan Erill is a Spanish computational biologist known for his research in comparative genomics and molecular microbiology. His work focuses primarily on bacterial comparative genomics, through the development of computational methods for analyzing regulatory networks and their evolution.