DNA binding site

DNA contacts of different types of DNA-binding domains

DNA binding sites are a type of binding site found in DNA where other molecules may bind. DNA binding sites are distinct from other binding sites in that (1) they are part of a DNA sequence (e.g. a genome) and (2) they are bound by DNA-binding proteins. DNA binding sites are often associated with specialized proteins known as transcription factors, and are thus linked to transcriptional regulation. The sum of the DNA binding sites of a specific transcription factor is referred to as its cistrome. DNA binding sites also encompass the targets of other proteins, such as restriction enzymes, site-specific recombinases (see site-specific recombination) and methyltransferases. [1]

DNA binding sites can thus be defined as short DNA sequences (typically 4 to 30 base pairs long, but up to 200 bp for recombination sites) that are specifically bound by one or more DNA-binding proteins or protein complexes. Some binding sites have been reported to have the potential to undergo fast evolutionary change. [2]

Types of DNA binding sites

DNA binding sites can be categorized according to their biological function into transcription factor-binding sites, restriction sites and recombination sites. Some authors have proposed that binding sites could also be classified according to their most convenient mode of representation. [3] On the one hand, restriction sites can generally be represented by consensus sequences, because restriction enzymes target mostly identical sequences and restriction efficiency decreases abruptly for less similar sequences. On the other hand, the binding sites for a given transcription factor usually all differ, and the transcription factor binds them with varying degrees of affinity. This makes it difficult to represent transcription factor binding sites accurately with consensus sequences, so they are typically represented using position-specific frequency matrices (PSFM), which are often depicted graphically as sequence logos. This distinction, however, is partly arbitrary. Restriction enzymes, like transcription factors, show a graded, though steep, range of affinities for different sites [4] and are thus also best represented by PSFM. Likewise, site-specific recombinases show a varied range of affinities for different target sites. [5] [6]
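The consensus-sequence representation described above can be sketched in a few lines of Python. This is only an illustration: the two recognition sequences (GAATTC for EcoRI and the degenerate GTYRAC for HincII) and the test sequences are examples chosen here, not taken from the cited sources, and the helper names are hypothetical. The sketch translates an IUPAC consensus into a regular expression and reports every exact match.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead lets the regex engine report overlapping matches.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Illustrative sequences made up for this example.
ecori_hits = find_sites("GAATTC", "TTGAATTCAAGAATTGGAATTC")
hincii_hits = find_sites("GTYRAC", "AAGTTAACGGGTCGACTTGTCAACGTTGAC")
print(ecori_hits)   # [2, 16]
print(hincii_hits)  # [2, 10, 18, 24]
```

A consensus match is all-or-nothing: the near-site GAATTG at position 10 of the first sequence is simply not reported. That is the information a frequency matrix retains and a consensus discards.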

History and main experimental techniques

The existence of something akin to DNA binding sites was first suspected from experiments on the biology of bacteriophage lambda [7] and on the regulation of the Escherichia coli lac operon. [8] DNA binding sites were finally confirmed in both systems [9] [10] [11] with the advent of DNA sequencing techniques. Since then, DNA binding sites for many transcription factors, restriction enzymes and site-specific recombinases have been discovered using a wide range of experimental methods. Historically, the techniques of choice for discovering and analyzing DNA binding sites have been the DNase footprinting assay and the electrophoretic mobility shift assay (EMSA). However, the development of DNA microarrays and fast sequencing techniques has led to new, massively parallel methods for in vivo identification of binding sites, such as ChIP-chip and ChIP-Seq. [12] The binding affinity [13] of proteins and other molecules for specific DNA binding sites can be quantified with the biophysical method microscale thermophoresis. [14]

Databases

Because the experimental techniques used to determine binding sites are so diverse, and because coverage of most organisms and transcription factors is patchy, there is no central database (akin to GenBank at the National Center for Biotechnology Information) for DNA binding sites. Even though NCBI provides for DNA binding site annotation in its reference sequences (RefSeq), most submissions omit this information. Moreover, because bioinformatics has had limited success in producing efficient DNA binding site prediction tools (in silico motif discovery and site search methods often have large false positive rates), there has been no systematic effort to annotate these features computationally in sequenced genomes.
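One way to see why false positives are so common is a back-of-the-envelope calculation. The sketch below is not from the cited sources and rests on loud simplifying assumptions: a random genome in which the four bases are equally likely and independent, an exact match to a single fixed site, and a genome length of 4.6 million bp chosen here as a round figure for a bacterial genome. The function name is hypothetical.

```python
def expected_random_hits(genome_length, site_length, both_strands=True):
    """Expected number of exact matches to one fixed site by chance alone.

    Assumes a random sequence in which A, C, G and T are equally likely
    and independent, which real genomes only approximate.
    """
    positions = genome_length - site_length + 1
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    # Each position matches a fixed site with probability (1/4)^site_length.
    return strands * positions * 0.25 ** site_length


GENOME_LENGTH = 4_600_000  # assumed round figure for a bacterial genome

for site_length in (6, 10, 16):
    hits = expected_random_hits(GENOME_LENGTH, site_length)
    print(f"{site_length:>2} bp site: about {hits:.3g} chance matches")
```

Under these assumptions a short site is expected to occur many times by chance. Tolerating mismatches, as any realistic motif search must, raises the count further.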

There are, however, several private and public databases devoted to the compilation of experimentally reported, and sometimes computationally predicted, binding sites for different transcription factors in different organisms. Below is a non-exhaustive table of available databases:

Name | Organisms | Source | Access
PlantRegMap | 165 plant species (e.g., Arabidopsis thaliana, Oryza sativa, Zea mays) | Expert curation and projection | Public
JASPAR | Vertebrates, Plants, Fungi, Flies, and Worms | Expert curation with literature support | Public
CIS-BP | All Eukaryotes | Experimentally derived motifs and predictions | Public
CollecTF | Prokaryotes | Literature curation | Public
RegPrecise | Prokaryotes | Expert curation | Public
RegTransBase | Prokaryotes | Expert/literature curation | Public
RegulonDB | Escherichia coli | Expert curation | Public
PRODORIC | Prokaryotes | Expert curation | Public
TRANSFAC | Mammals | Expert/literature curation | Public/Private
TRED | Human, Mouse, Rat | Computer predictions, manual curation | Public
DBSD | Drosophila species | Literature/Expert curation | Public
HOCOMOCO | Human, Mouse | Literature/Expert curation | Public
MethMotif | Human, Mouse | Expert curation | Public

Representation of DNA binding sites

A collection of DNA binding sites, typically referred to as a DNA binding motif, can be represented by a consensus sequence. This representation has the advantage of being compact, but at the expense of disregarding a substantial amount of information. [15] A more accurate way of representing binding sites is through position-specific frequency matrices (PSFM). These matrices give the frequency of each base at each position of the DNA binding motif. [3] PSFM are usually built under the implicit assumption of positional independence (different positions in the DNA binding site contribute independently to the site's function), although this assumption has been disputed for some DNA binding sites. [16] The frequency information in a PSFM can be formally interpreted within the framework of information theory, [17] which leads to its graphical representation as a sequence logo.

Position   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
A          1   0   1   5  32   5  35  23  34  14  43  13  34   4  52   3
C         50   1   0   1   5   6   0   4   4  13   3   8  17  51   2   0
G          0   0  54  15   5   5  12   2   7   1   1   3   1   0   1  52
T          5  55   1  35  14  40   9  27  11  28   9  32   4   1   1   1
Sum       56  56  56  56  56  56  56  56  56  56  56  56  56  56  56  56

PSFM for the transcriptional repressor LexA as derived from 56 LexA-binding sites stored in Prodoric. Relative frequencies are obtained by dividing the counts in each cell by the total count (56).
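The calculations described above can be sketched in Python using the LexA counts. The counts below are read from the table, split into 16 columns so that every column sums to 56. The function names are hypothetical, and the per-position information content is computed against an assumed uniform background (0.25 per base) without any small-sample correction, so the values are a rough illustration rather than what a particular logo tool would report.

```python
import math

BASES = "ACGT"

# Counts for the 16 positions of the LexA motif (each column sums to 56).
COUNTS = {
    "A": [1, 0, 1, 5, 32, 5, 35, 23, 34, 14, 43, 13, 34, 4, 52, 3],
    "C": [50, 1, 0, 1, 5, 6, 0, 4, 4, 13, 3, 8, 17, 51, 2, 0],
    "G": [0, 0, 54, 15, 5, 5, 12, 2, 7, 1, 1, 3, 1, 0, 1, 52],
    "T": [5, 55, 1, 35, 14, 40, 9, 27, 11, 28, 9, 32, 4, 1, 1, 1],
}


def relative_frequencies(counts):
    """Divide each count by its column total."""
    length = len(counts["A"])
    totals = [sum(counts[b][i] for b in BASES) for i in range(length)]
    return {b: [counts[b][i] / totals[i] for i in range(length)] for b in BASES}


def information_content(freqs, background=0.25):
    """Per-position information in bits relative to a uniform background."""
    length = len(freqs["A"])
    bits = []
    for i in range(length):
        # Terms with zero frequency contribute nothing (0 * log 0 = 0).
        bits.append(sum(freqs[b][i] * math.log2(freqs[b][i] / background)
                        for b in BASES if freqs[b][i] > 0))
    return bits


def consensus(counts):
    """Most frequent base at each position."""
    length = len(counts["A"])
    return "".join(max(BASES, key=lambda b: counts[b][i]) for i in range(length))


freqs = relative_frequencies(COUNTS)
bits = information_content(freqs)
print(consensus(COUNTS))  # CTGTATATATATACAG
for position, value in enumerate(bits, start=1):
    print(f"position {position:>2}: {value:.2f} bits")
```

The consensus collapses each column to a single letter, whereas the per-position bit values (the stack heights in a sequence logo) show that the outer CTG and CAG positions are far more conserved than the central positions.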

Computational search and discovery of binding sites

In bioinformatics, one can distinguish between two separate problems regarding DNA binding sites: searching for additional members of a known DNA binding motif (the site search problem) and discovering novel DNA binding motifs in collections of functionally related sequences (the sequence motif discovery problem). [18] Many different methods have been proposed to search for binding sites. Most of them rely on the principles of information theory and have web servers available (Yellaboina) (Munch), while other authors have resorted to machine learning methods, such as artificial neural networks. [3] [19] [20]

A plethora of algorithms is also available for sequence motif discovery. These methods rely on the hypothesis that a set of sequences share a binding motif for functional reasons. Binding motif discovery methods can be divided roughly into enumerative, deterministic and stochastic. [21] MEME [22] and Consensus [23] are classical examples of deterministic optimization, while the Gibbs sampler [24] is the conventional implementation of a purely stochastic method for DNA binding motif discovery. Another instance of this class of methods is SeSiMCMC, [25] which focuses on weak transcription factor binding sites with symmetry. While enumerative methods often resort to a regular expression representation of binding sites, PSFM and their formal treatment under information theory are the representation of choice for both deterministic and stochastic methods. Hybrid methods, e.g. ChIPMunk, [26] which combines greedy optimization with subsampling, also use PSFM. Recent advances in sequencing have led to the introduction of comparative genomics approaches to DNA binding motif discovery, as exemplified by PhyloGibbs. [27] [28]
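The site search problem can be illustrated with a minimal log-odds scoring sketch. This is a generic textbook-style approach written for illustration and not the algorithm of any tool cited above. The aligned sites, the pseudocount of 1, the uniform background and the score threshold are all assumptions made up for the example, and the function names are hypothetical.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds scoring matrix from aligned sites of equal length."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def score(pssm, window):
    """Sum the per-position log-odds scores; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pssm, window))


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def scan(pssm, sequence, threshold):
    """Score every window on both strands; return (start, strand, score) hits."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, text in (("+", window), ("-", reverse_complement(window))):
            value = score(pssm, text)
            if value >= threshold:
                hits.append((start, strand, round(value, 2)))
    return hits


# Hypothetical aligned sites, invented for this example.
toy_sites = ["ACGTAC", "ACGTAC", "ACGTTC", "ACCTAC"]
toy_pssm = build_pssm(toy_sites)
print(scan(toy_pssm, "GGGACGTACGGG", threshold=6.0))  # [(3, '+', 7.29)]
print(scan(toy_pssm, "TTGTACGTTT", threshold=6.0))    # [(2, '-', 7.29)]
```

The choice of threshold governs the trade-off between missed sites and false positives. Here it was picked by hand for the toy data.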

More complex methods for binding site search and motif discovery take into account base stacking and other interactions between DNA bases, but because the samples of binding sites available are typically small, their potential has not yet been fully exploited. An example of such a tool is ULPB. [29]

References

  1. Halford S.E.; Marko J.F. (2004). "How do site-specific DNA-binding proteins find their targets?". Nucleic Acids Research. 32 (10): 3040–3052. doi:10.1093/nar/gkh624. PMC 434431. PMID 15178741.
  2. Borneman, A.R.; Gianoulis, T.A.; Zhang, Z.D.; Yu, H.; Rozowsky, J.; Seringhaus, M.R.; Wang, L.Y.; Gerstein, M. & Snyder, M. (2007). "Divergence of transcription factor binding sites across related yeast species". Science. 317 (5839): 815–819. Bibcode:2007Sci...317..815B. doi:10.1126/science.1140748. PMID 17690298. S2CID 21535866.
  3. Stormo GD (2000). "DNA binding sites: representation and discovery". Bioinformatics. 16 (1): 16–23. doi:10.1093/bioinformatics/16.1.16. PMID 10812473.
  4. Pingoud A, Jeltsch A (1997). "Recognition and Cleavage of DNA by Type-II Restriction Endonucleases". European Journal of Biochemistry. 246 (1): 1–22. doi:10.1111/j.1432-1033.1997.t01-6-00001.x. PMID 9210460.
  5. Gyohda A, Komano T (2000). "Purification and characterization of the R64 shufflon-specific recombinase". Journal of Bacteriology. 182 (10): 2787–2792. doi:10.1128/JB.182.10.2787-2792.2000. PMC 101987. PMID 10781547.
  6. Birge, E.A. (2006). "15: Site Specific Recombination". Bacterial and Bacteriophage Genetics (5th ed.). Springer. pp. 463–478. ISBN 978-0-387-23919-4.
  7. Campbell A (1963). "Fine Structure Genetics and its Relation to Function". Annual Review of Microbiology. 17 (1): 2787–2792. doi:10.1146/annurev.mi.17.100163.000405. PMID 14145311.
  8. Jacob F, Monod J (1961). "Genetic regulatory mechanisms in the synthesis of proteins". Journal of Molecular Biology. 3 (3): 318–356. doi:10.1016/S0022-2836(61)80072-7. PMID 13718526. S2CID 19804795.
  9. Gilbert W, Maxam A (1973). "The nucleotide sequence of the lac operator". Proceedings of the National Academy of Sciences of the United States of America. 70 (12): 3581–3584. Bibcode:1973PNAS...70.3581G. doi:10.1073/pnas.70.12.3581. PMC 427284. PMID 4587255.
  10. Maniatis T, Ptashne M, Barrell BG, Donelson J (1974). "Sequence of a repressor-binding site in the DNA of bacteriophage lambda". Nature. 250 (465): 394–397. Bibcode:1974Natur.250..394M. doi:10.1038/250394a0. PMID 4854243. S2CID 4204720.
  11. Nash H. A. (1975). "Integrative recombination of bacteriophage lambda DNA in vitro". Proceedings of the National Academy of Sciences of the United States of America. 72 (3): 1072–1076. Bibcode:1975PNAS...72.1072N. doi:10.1073/pnas.72.3.1072. PMC 432468. PMID 1055366.
  12. Elnitski L, Jin VX, Farnham PJ, Jones SJ (2006). "Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques". Genome Research. 16 (12): 1455–1464. doi:10.1101/gr.4140006. PMID 17053094.
  13. Baaske P, Wienken CJ, Reineck P, Duhr S, Braun D (Feb 2010). "Optical Thermophoresis quantifies Buffer dependence of Aptamer Binding". Angew. Chem. Int. Ed. 49 (12): 2238–41. doi:10.1002/anie.200903998. PMID 20186894. S2CID 42489892.
  14. Wienken CJ; et al. (2010). "Protein-binding assays in biological liquids using microscale thermophoresis". Nature Communications. 1 (7): 100. Bibcode:2010NatCo...1..100W. doi:10.1038/ncomms1093. PMID 20981028.
  15. Schneider T.D. (2002). "Consensus sequence Zen". Applied Bioinformatics. 1 (3): 111–119. PMC 1852464. PMID 15130839.
  16. Bulyk M.L.; Johnson P.L.; Church G.M. (2002). "Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors". Nucleic Acids Research. 30 (5): 1255–1261. doi:10.1093/nar/30.5.1255. PMC 101241. PMID 11861919.
  17. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A (1986). "Information content of binding sites on nucleotide sequences". Journal of Molecular Biology. 188 (3): 415–431. doi:10.1016/0022-2836(86)90165-8. PMID 3525846.
  18. Erill I; O'Neill MC (2009). "A reexamination of information theory-based methods for DNA-binding site identification". BMC Bioinformatics. 10 (1): 57. doi:10.1186/1471-2105-10-57. PMC 2680408. PMID 19210776.
  19. Bisant D, Maizel J (1995). "Identification of ribosome binding sites in Escherichia coli using neural network models". Nucleic Acids Research. 23 (9): 1632–1639. doi:10.1093/nar/23.9.1632. PMC 306908. PMID 7784221.
  20. O'Neill M.C. (1991). "Training back-propagation neural networks to define and detect DNA-binding sites". Nucleic Acids Research. 19 (2): 313–318. doi:10.1093/nar/19.2.313. PMC 333596. PMID 2014171.
  21. Bailey T.L. (2008). "Discovering Sequence Motifs". Bioinformatics. Methods in Molecular Biology. Vol. 452. pp. 231–251. doi:10.1007/978-1-60327-159-2_12. ISBN 978-1-58829-707-5. PMID 18566768.
  22. Bailey T.L. (2002). "Discovering novel sequence motifs with MEME". Current Protocols in Bioinformatics. 2 (4): 2.4.1–2.4.35. doi:10.1002/0471250953.bi0204s00. PMID 18792935. S2CID 205157795.
  23. Stormo GD, Hartzell GW 3rd (1989). "Identifying protein-binding sites from unaligned DNA fragments". Proceedings of the National Academy of Sciences of the United States of America. 86 (4): 1183–1187. Bibcode:1989PNAS...86.1183S. doi:10.1073/pnas.86.4.1183. PMC 286650. PMID 2919167.
  24. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC (1993). "Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment". Science. 262 (5131): 208–214. Bibcode:1993Sci...262..208L. doi:10.1126/science.8211139. PMID 8211139. S2CID 3040614.
  25. Favorov, A V; M S Gelfand; A V Gerasimova; D A Ravcheev; A A Mironov; V J Makeev (2005-05-15). "A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length". Bioinformatics. 21 (10): 2240–2245. doi:10.1093/bioinformatics/bti336. ISSN 1367-4803. PMID 15728117.
  26. Kulakovskiy, I V; V A Boeva; A V Favorov; V J Makeev (2010-08-24). "Deep and wide digging for binding motifs in ChIP-Seq data". Bioinformatics. 26 (20): 2622–3. doi:10.1093/bioinformatics/btq488. ISSN 1367-4811. PMID 20736340.
  27. Das MK, Dai HK (2007). "A survey of DNA motif finding algorithms". BMC Bioinformatics. 8 (Suppl 7): S21. doi:10.1186/1471-2105-8-S7-S21. PMC 2099490. PMID 18047721.
  28. Siddharthan R, Siggia ED, van Nimwegen E (2005). "PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny". PLOS Comput Biol. 1 (7): e67. Bibcode:2005PLSCB...1...67S. doi:10.1371/journal.pcbi.0010067. PMC 1309704. PMID 16477324.
  29. Salama RA, Stekel DJ (2010). "Inclusion of neighboring base interdependencies substantially improves genome-wide prokaryotic transcription factor binding site prediction". Nucleic Acids Research. 38 (12): e135. doi:10.1093/nar/gkq274. PMC 2896541. PMID 20439311.