Phylogenetic footprinting

Last updated May 02, 2019

Phylogenetic footprinting is a technique used to identify transcription factor binding sites (TFBS) within a non-coding region of DNA by comparing that region to the orthologous sequence in different species. When the technique is applied to a large number of closely related species, it is called phylogenetic shadowing.^[1]

In molecular biology, a transcription factor (TF) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. The function of TFs is to regulate—turn on and off—genes in order to make sure that they are expressed in the right cell at the right time and in the right amount throughout the life of the cell and the organism. Groups of TFs function in a coordinated fashion to direct cell division, cell growth, and cell death throughout life; cell migration and organization during embryonic development; and intermittently in response to signals from outside the cell, such as a hormone. There are up to 2600 TFs in the human genome.

Researchers have found that non-coding stretches of DNA contain binding sites for regulatory proteins that govern the spatiotemporal expression of genes. These transcription factor binding sites (TFBS), or regulatory motifs, have proven hard to identify, primarily because they are short and can vary in sequence. Because transcriptional regulation matters to many fields of biology, researchers have developed strategies for predicting the presence of TFBS, many of which have led to publicly available databases. One such technique is phylogenetic footprinting.

Phylogenetic footprinting relies upon two major concepts:

The function and DNA binding preferences of transcription factors are well-conserved between diverse species.
Important non-coding DNA sequences that are essential for regulating gene expression will show differential selective pressure. A slower rate of change occurs in TFBS than in other, less critical, parts of the non-coding genome.^[2]

History

Phylogenetic footprinting was first used and published by Tagle et al. in 1988. It allowed the researchers to predict evolutionarily conserved cis-regulatory elements responsible for embryonic ε and γ globin gene expression in primates.^[3]

Before phylogenetic footprinting, binding sites were mapped by DNase footprinting, in which a protein bound to its transcription factor binding site (TFBS) protects that stretch of DNA from DNase digestion. One of the problems with that technique was the amount of time and labor it required. Unlike DNase footprinting, phylogenetic footprinting relies on evolutionary constraints within the genome, with the "important" parts of the sequence being conserved among the different species.^[4]

Protocol

When using this technique, it is important to decide which genomes the sequence should be aligned to. More divergent species have less sequence similarity between orthologous genes. The key is therefore to pick species that are related closely enough for homology to be detected, but divergent enough that non-functional sequence has accumulated changes, so that conserved sites stand out from the background "noise". The stepwise approach to phylogenetic footprinting is as follows:

Decide on the gene of interest.
Carefully choose species with orthologous genes.
Decide on the length of the upstream or downstream region to examine.
Align the sequences.
Look for conserved regions and analyse them.
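
The last two steps can be illustrated with a minimal sketch. It assumes the orthologous sequences have already been aligned by some external tool, and it uses the simplest possible criterion: a run of columns that are identical in every species. The function names, the min_length threshold and the toy alignment are all invented for illustration; published tools generally score conservation against a phylogenetic tree rather than requiring perfect identity.

```python
def column_identity(alignment):
    """Return one boolean per column: True if every sequence has the same non-gap base."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    return [len(set(col)) == 1 and col[0] != "-" for col in zip(*alignment)]


def conserved_blocks(alignment, min_length=6):
    """Return half-open (start, end) column ranges of fully conserved runs."""
    blocks = []
    start = None
    # A trailing False closes any run that reaches the end of the alignment.
    for i, same in enumerate(column_identity(alignment) + [False]):
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Toy alignment of three hypothetical orthologous upstream regions ("-" is a gap).
toy_alignment = [
    "ACGTTGACGTCATTGCA-GT",
    "ATGATGACGTCATAGCAAGT",
    "TCGCTGACGTCACTG-ACGT",
]

for start, end in conserved_blocks(toy_alignment):
    print(start, end, toy_alignment[0][start:end])
```

Shorter conserved runs in the toy alignment fall below min_length and are ignored, which is the same trade-off discussed under "Very short binding sites" below.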

Not all TFBS are found

Not all transcription factor binding sites can be found using phylogenetic footprinting, because of the statistical nature of the technique. Several reasons why some TFBS are not found are given below.

Species-specific binding sites

Some binding sites seem to have no significant matches in most other species. Therefore, detecting these sites by phylogenetic footprinting is likely impossible unless a large number of closely related species are available.

Very short binding sites

Some binding sites show excellent conservation, but over a shorter region than the length that was searched for. Such short motifs (e.g., the GC-box) often occur by chance in nonfunctional sequences, so detecting them can be challenging.
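
A rough calculation shows why short motifs are hard to distinguish from chance. The sketch below assumes random DNA with independent, equally frequent bases, which real genomes do not satisfy, and the function name and example lengths are chosen purely for illustration.

```python
def expected_chance_hits(motif_length, sequence_length, both_strands=True):
    """Expected number of exact matches to one fixed motif in random DNA.

    Assumption: each base is independent and A, C, G, T are equally likely.
    Palindromic motifs are double-counted when both_strands is True.
    """
    positions = max(sequence_length - motif_length + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * 0.25 ** motif_length


# Compare a 6 bp motif with a 12 bp motif in a 1000 bp region.
for k in (6, 12):
    print(k, expected_chance_hits(k, 1000))
```

Under these assumptions a 6 bp motif is expected about once in every two 1000 bp regions by chance alone, while a 12 bp motif is several thousand times rarer, so a conserved 6-mer carries much weaker evidence of function.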

Less specific binding factors

Some binding sites show some conservation but have undergone insertions or deletions. It is not obvious whether sequences with such insertions or deletions are still functional, although they may be if the binding factor is less specific. Because deletions and insertions are rare in binding sites, allowing for them in the search would detect a few more true TFBSs, but would likely also introduce many more false positives.

Not enough data

Some motifs are quite well conserved but are statistically insignificant in a specific dataset, because the motif might have appeared in the different species by chance. These motifs could be detected if sequences from more organisms were available, so this is expected to become less of a problem as more genomes are sequenced.

Compound binding regions

Some transcription factors bind as dimers. Their binding sites may therefore consist of two conserved regions separated by a few variable nucleotides. Because of the variable internal sequence, such a motif is not detected as a single conserved block. However, a program that searches for motifs containing a variable stretch in the middle, without counting mutations in that stretch, could discover these motifs.
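
A minimal sketch of such a search, using Python's standard re module, is shown here. The half-site sequences, spacer range, species labels and function name are hypothetical and chosen only to illustrate the idea; the spacer is matched as any bases within the allowed length and is not scored.

```python
import re


def find_spaced_motif(sequence, left, right, min_gap, max_gap):
    """Find occurrences of left half-site + variable spacer + right half-site.

    Returns (start, end, spacer) tuples. A lookahead is used so that
    overlapping occurrences are all reported; the lazy quantifier picks
    the shortest acceptable spacer at each start position.
    """
    pattern = re.compile(
        "(?=(%s([ACGT]{%d,%d}?)%s))"
        % (re.escape(left), min_gap, max_gap, re.escape(right))
    )
    return [(m.start(1), m.end(1), m.group(2)) for m in pattern.finditer(sequence)]


# Hypothetical orthologous regions: the half-sites are conserved, the spacer is not.
orthologs = {
    "species_a": "TTAGTGACAAAGTCACCT",
    "species_b": "GCTGACCGGTCATTAA",
    "species_c": "ATGACTTTTGTCAGG",
}

for name, seq in orthologs.items():
    print(name, find_spaced_motif(seq, "TGAC", "GTCA", 2, 4))
```

In the toy data the spacer differs in both sequence and length between species, so a search for one contiguous conserved block would miss the site, while the two-part pattern finds it in all three.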

Accuracy

It is important to keep in mind that not all conserved sequences are under selection pressure. To eliminate false positives, a statistical analysis must be performed to show that the reported motifs have a mutation rate meaningfully lower than that of the surrounding nonfunctional sequence.
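
As a crude illustration of this kind of check, the sketch below compares the number of variable columns in a candidate window with every other window of the same width in the same alignment. This is a simplified stand-in, not the statistical model used by any published tool: it ignores the phylogeny, and overlapping windows are not independent. The function names and toy alignment are invented for the example.

```python
def variable_columns(alignment, start, end):
    """Count columns in [start, end) that are not identical across all species."""
    columns = list(zip(*alignment))[start:end]
    return sum(len(set(col)) > 1 for col in columns)


def empirical_p_value(alignment, start, end):
    """Fraction of same-width windows that vary no more than the candidate window."""
    width = end - start
    n_cols = len(alignment[0])
    observed = variable_columns(alignment, start, end)
    scores = [
        variable_columns(alignment, s, s + width)
        for s in range(n_cols - width + 1)
    ]
    return sum(score <= observed for score in scores) / len(scores)


# Toy alignment of three hypothetical orthologous regions ("-" is a gap).
toy_alignment = [
    "ACGTTGACGTCATTGCA-GT",
    "ATGATGACGTCATAGCAAGT",
    "TCGCTGACGTCACTG-ACGT",
]

print(empirical_p_value(toy_alignment, 4, 12))
```

A small value means that few windows in the surrounding sequence are as well conserved as the candidate. With an alignment this short the result is only indicative.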

Moreover, results can be more accurate if prior knowledge about the sequence is taken into account. For example, some regulatory elements occur many times in a promoter region (e.g., some metallothionein promoters have up to 15 metal response elements (MREs)). To eliminate false motifs whose order is inconsistent across species, one can require that the orientation and order of regulatory elements in a promoter region be the same in all species. This type of information can also help to identify regulatory elements that are not adequately conserved but occur in several copies in the input sequence.^[5]

References

[1] Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome. doi:10.1126/science.1081331
[2] Neph, S. and Tompa, M. 2006. MicroFootPrinter: a tool for phylogenetic footprinting in prokaryotic genomes. Nucleic Acids Research. 34:366-368.
[3] Tagle, D. A., Koop, B. F., Goodman, M., Slightom, J. L., Hess, D., and Jones, R. T. 1988. Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus): nucleotide and amino acid sequences, developmental regulation, and phylogenetic footprints. J. Mol. Biol. 203:439-455.
[4] Zhang, Z. and Gerstein, M. 2003. Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements. J. Biol. 2:11-11.4.
[5] Blanchette, M. and Tompa, M. 2002. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Res. 12:739-748.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
