Conserved non-coding sequence

A conserved non-coding sequence (CNS) is a stretch of noncoding DNA that is evolutionarily conserved. These sequences are of interest for their potential to regulate gene expression. [1]

CNSs in plants [2] and animals [1] are highly associated with transcription factor binding sites and other cis-acting regulatory elements. Conserved non-coding sequences can be important sites of evolutionary divergence [3] as mutations in these regions may alter the regulation of conserved genes, producing species-specific patterns of gene expression. These features have made them an invaluable resource in comparative genomics.

Sources

All CNSs are likely to perform some function, since function is what places constraints on their evolution, but they can be distinguished by where in the genome they are found and how they arose.

Introns

Introns are stretches of sequence, found mostly in eukaryotic organisms, that interrupt the coding regions of genes; their lengths in base pairs vary across three orders of magnitude. Intron sequences may be conserved, often because they contain expression-regulating elements that put functional constraints on their evolution. [4] Patterns of conserved introns between species of different kingdoms have been used to make inferences about intron density at different points in evolutionary history. This makes them an important resource for understanding the dynamics of intron gain and loss in eukaryotes. [4] [5]

Untranslated regions

Some of the most highly conserved noncoding regions are found in the untranslated regions (UTRs) at the 3' end of mature RNA transcripts, rather than in the introns. This suggests an important function operating at the post-transcriptional level. If these regions perform an important regulatory function, the increase in 3'-UTR length over evolutionary time suggests that conserved UTRs contribute to organism complexity. Regulatory motifs in UTRs, which are often conserved among genes belonging to the same metabolic family, could potentially be used to develop highly specific medicines that target RNA transcripts. [4]

Transposable elements

Repetitive elements can accumulate in an organism's genome as the result of a few different transposition processes. The extent to which this has taken place during the evolution of eukaryotes varies greatly: repetitive DNA accounts for just 3% of the fly genome, but accounts for 50% of the human genome. [4]

There are different theories explaining the conservation of transposable elements. One holds that, like pseudogenes, they provide a source of new genetic material, allowing for faster adaptation to changes in the environment. A simpler alternative is that, because eukaryotic genomes may have no means to prevent the proliferation of transposable elements, they are free to accumulate as long as they are not inserted into or near a gene in such a way that they would disrupt essential functions. [6] A recent study showed that transposons contribute at least 16% of the eutherian-specific CNSs, marking them as a "major creative force" in the evolution of gene regulation in mammals. [7] There are three major classes of transposable elements, distinguished by the mechanisms by which they proliferate. [6]

Classes

DNA transposons encode a transposase protein, and the element as a whole is flanked by inverted repeat sequences. The transposase excises the element and reintegrates it elsewhere in the genome. By excising immediately after DNA replication and inserting into target sites that have not yet been replicated, a transposon can increase its copy number in the genome. [6]
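The inverted-repeat structure described above can be illustrated with a short check: the left end of an element should match the reverse complement of its right end. The following is a minimal Python sketch under stated assumptions, not a method from the cited source. The function names are hypothetical, only perfect (mismatch-free) repeats are measured, and the input is assumed to be the element's sequence alone, with no flanking genomic DNA.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element, max_len=None):
    """Length of the longest perfect terminal inverted repeat.

    Counts how many bases at the left end of `element` equal the
    reverse complement of the same number of bases at the right end.
    The search is capped at half the element length, so the two arms
    never overlap, and optionally at `max_len`.
    """
    seq = element.upper()
    limit = len(seq) // 2
    if max_len is not None:
        limit = min(limit, max_len)
    # The first k bases of the reverse complement of the whole element
    # are the reverse complement of its last k bases.
    rc = reverse_complement(seq)
    k = 0
    while k < limit and seq[k] == rc[k]:
        k += 1
    return k


# Toy element: 6 bp left arm, unrelated interior, 6 bp right arm.
toy_element = "CAGGGT" + "AAAAATTTCC" + "ACCCTG"
tir_length = terminal_inverted_repeat_length(toy_element)
```

Real inverted repeats usually tolerate some mismatches, so a practical search would align the two ends rather than require exact identity.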

Retrotransposons use reverse transcriptase to generate a cDNA from the TE transcript. These are further divided into long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs). In LTR retrotransposons, after the RNA template is degraded, a DNA strand complementary to the reverse-transcribed cDNA returns the element to a double-stranded state. Integrase, an enzyme encoded by the LTR retrotransposon, then reincorporates the element at a new target site. These elements are flanked by long terminal repeats (300–500 bp) which mediate the transposition process. [6]

LINEs use a simpler method in which the cDNA is synthesized at the target site following cleavage by a LINE-encoded endonuclease. LINE-encoded reverse transcriptase is not highly sequence-specific. The incorporation by LINE machinery of unrelated RNA transcripts gives rise to non-functional processed pseudogenes. If a small gene's promoter is included in the transcribed portion of the gene, the stable transcript can be duplicated and reinserted into the genome multiple times. The elements produced by this process are called SINEs. [6]

Conserved regulatory transposable elements

When conserved regulatory transposable elements are active in a genome, they can introduce new promoter regions, disrupt existing regulatory sites, or, if inserted into transcribed regions, alter splicing patterns. A particular transposed element will be positively selected for if the altered expression it produces confers an adaptive advantage. This has resulted in some of the conserved regions found in humans. Nearly 25% of characterized promoters in humans contain transposed elements. [8] This is of particular interest in light of the fact that most transposable elements in humans are no longer active. [6]

Pseudogenes

Pseudogenes are vestiges of once-functional genes disabled by sequence deletions, insertions, or mutations. The primary evidence for this process is the presence of fully functioning orthologues to these inactivated sequences in other related genomes. [4] Pseudogenes commonly emerge following a gene duplication or polyploidization event. With two functional copies of a gene, there is no selective pressure to maintain expressibility of both, leaving one free to accumulate mutations as a nonfunctioning pseudogene. This is the typical case, whereby neutral selection allows pseudogenes to accumulate mutations, serving as "reservoirs" of new genetic material, with potential to be reincorporated into the genome. However, some pseudogenes have been found to be conserved in mammals. [9] The simplest explanation for this is that these noncoding regions may serve some biological function, and this has been found to be the case for several conserved pseudogenes. Makorin1 mRNA, for example, was found to be stabilized by its paralogous pseudogene, Makorin1-p1, which is conserved in several mouse species. Other pseudogenes have also been found to be conserved between humans and mice and between humans and chimpanzees, originating from duplication events prior to the divergence of the species. Evidence of these pseudogenes' transcription also supports the hypothesis that they have a biological function. [10] Findings of potentially functional pseudogenes create difficulty in defining them, since the term was originally meant for degenerate sequences with no biological function. [11]

An example of a pseudogene is the gene for L-gulonolactone oxidase, a liver enzyme necessary for biosynthesis of L-ascorbic acid (vitamin C) in most birds and mammals. The gene is mutated in the Haplorhini suborder of primates, including humans, who must obtain ascorbic acid or ascorbate from food. The remains of this non-functional gene, with many mutations, are still present in the genomes of guinea pigs and humans. [12]

Ultraconserved regions

Ultraconserved regions (UCRs) are regions over 200 bp in length with 100% identity across species. These unique sequences are mostly found in noncoding regions. It is still not fully understood why the negative selective pressure on these regions is so much stronger than the selection in protein-coding regions. [13] [14] Though these regions can be seen as unique, the distinction between regions with a high degree of sequence conservation and those with perfect sequence conservation is not necessarily one of biological significance. One study in Science found that all extremely conserved noncoding sequences have important regulatory functions regardless of whether the conservation is perfect, making the distinction of ultraconservation appear somewhat arbitrary. [14]
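The definition above (runs longer than 200 bp with 100% identity) can be checked mechanically once sequences from the species of interest have been aligned. The following is a minimal Python sketch, assuming a pre-computed multiple alignment supplied as equal-length strings with `-` marking gaps. The function name and the `min_len` parameter are illustrative and not taken from any cited tool.

```python
def ultraconserved_runs(aligned_seqs, min_len=200):
    """Find alignment-column ranges that are identical in every sequence.

    `aligned_seqs` is a list of equal-length aligned strings ('-' = gap).
    Returns half-open (start, end) column ranges of at least `min_len`
    consecutive columns where all sequences carry the same base and
    none has a gap.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    run_start = None
    # Iterate one column past the end so a run reaching the last
    # column is closed by the same code path as any other run.
    for i in range(length + 1):
        conserved = False
        if i < length:
            column = {s[i].upper() for s in aligned_seqs}
            conserved = len(column) == 1 and "-" not in column
        if conserved:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    return runs


# Toy alignment with a small threshold so the behaviour is visible.
toy_alignment = ["ACGTACGTAA", "ACGTACGTCA", "ACGTACG-AA"]
toy_runs = ultraconserved_runs(toy_alignment, min_len=5)
```

Lowering the identity requirement below 100% would turn this into a search for highly conserved rather than ultraconserved regions, which is the distinction the paragraph above describes as somewhat arbitrary.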

In comparative genomics

The conservation of both functional and nonfunctional noncoding regions provides an important tool for comparative genomics, though conservation of cis-regulatory elements has proven particularly useful. [4] The presence of CNSs could be due in some cases to a lack of divergence time, [15] though the more common thinking is that they perform functions which place varying degrees of constraint on their evolution. Consistent with this theory, cis-regulatory elements are commonly found in conserved noncoding regions. Thus, sequence similarity is often used as a parameter to limit the search space when trying to identify regulatory elements conserved across species, though this is most useful in analyzing distantly related organisms, since closer relatives have sequence conservation among nonfunctional elements as well. [4] [16] [17]
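Using sequence similarity to limit the search space can be sketched as a sliding-window identity scan over a pairwise alignment: only windows above an identity threshold are kept as candidate regions. The Python below is a minimal illustration, not a reimplementation of any tool listed in this article. The window size and threshold defaults are assumptions chosen for the example, and the input is assumed to be two already-aligned, equal-length strings with `-` marking gaps.

```python
def conserved_windows(seq_a, seq_b, window=100, min_identity=0.7):
    """Scan a pairwise alignment for windows of high identity.

    Returns a list of (start, end, identity) tuples, one per window of
    `window` alignment columns whose fraction of identical, non-gap
    columns is at least `min_identity`. Ranges are half-open.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if window <= 0 or n < window:
        return []

    # 1 where both sequences carry the same base, 0 for mismatch or gap.
    match = [
        1 if a == b and a != "-" else 0
        for a, b in zip(seq_a.upper(), seq_b.upper())
    ]

    hits = []
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window: add the entering column, drop the leaving one.
            count += match[start + window - 1] - match[start - 1]
        identity = count / window
        if identity >= min_identity:
            hits.append((start, start + window, identity))
    return hits


# Toy example with a short window so the result is easy to follow.
toy_hits = conserved_windows("ACGTACGTAC", "ACGTTTTTAC", window=4, min_identity=1.0)
```

Overlapping windows that pass the threshold would normally be merged into a single candidate region before searching it for regulatory elements. As the text notes, the threshold is only informative when the species are distant enough that nonfunctional sequence has diverged.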

Conservation of noncoding sequence is important for the analysis of paralogs within a single species as well. CNSs shared by paralogous clusters of Hox genes are candidates for expression-regulating regions, possibly coordinating the similar expression patterns of these genes. [16]

Comparative genomic studies of the promoter regions of orthologous genes can also detect differences in the presence and relative positioning of transcription factor binding sites in promoter regions. [20] Orthologues with high sequence similarity may not share the same regulatory elements. [18] These differences may account for different expression patterns across species. [19]
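A comparison of binding-site presence and relative positioning between two orthologous promoters can be sketched as follows. This is a deliberately simplified Python illustration: sites are found by exact matching against consensus strings, whereas real analyses typically score position weight matrices. The function names, the motif names, and the consensus strings in the example are hypothetical placeholders, not data from the cited studies.

```python
def site_positions(promoter, motifs):
    """Map each motif name to the start offsets of its exact matches."""
    seq = promoter.upper()
    positions = {}
    for name, consensus in motifs.items():
        pattern = consensus.upper()
        hits = []
        start = seq.find(pattern)
        while start != -1:
            hits.append(start)
            # Resume one base later so overlapping matches are found too.
            start = seq.find(pattern, start + 1)
        positions[name] = hits
    return positions


def compare_promoters(promoter_a, promoter_b, motifs):
    """Report which motifs are shared and whether their order agrees."""
    sites_a = site_positions(promoter_a, motifs)
    sites_b = site_positions(promoter_b, motifs)
    shared = sorted(n for n in motifs if sites_a[n] and sites_b[n])
    only_a = sorted(n for n in motifs if sites_a[n] and not sites_b[n])
    only_b = sorted(n for n in motifs if sites_b[n] and not sites_a[n])
    # Relative positioning: order shared motifs by their first occurrence.
    order_a = sorted(shared, key=lambda n: sites_a[n][0])
    order_b = sorted(shared, key=lambda n: sites_b[n][0])
    return {
        "shared": shared,
        "only_a": only_a,
        "only_b": only_b,
        "same_order": order_a == order_b,
    }


# Hypothetical consensus strings and toy promoter sequences.
toy_motifs = {"TATA": "TATAAA", "GC": "GGGCGG", "CAAT": "CCAAT"}
toy_promoter_a = "GGGCGGTTTCCAATGGTATAAAGC"
toy_promoter_b = "TTCCAATCCGGGCGGAATATAAATT"
toy_report = compare_promoters(toy_promoter_a, toy_promoter_b, toy_motifs)
```

In this toy case both promoters contain all three motifs but in a different order, the kind of positional difference that could accompany divergent expression despite similar sequence.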

The regulatory functions commonly associated with conserved non-coding regions are thought to play a role in the evolution of eukaryotic complexity. On average, plants contain fewer CNSs per gene than mammals. This is thought to be related to their having undergone more polyploidization, or genome duplication events. During the subfunctionalization that ensues following gene duplication, there is potential for a greater rate of CNS loss per gene. Thus, genome duplication events may account for the fact that plants have more genes, each with fewer CNSs. Assuming the number of CNSs to be a proxy for regulatory complexity, this may account for the disparity in complexity between plants and mammals. [21]

Because changes in gene regulation are thought to account for most of the differences between humans and chimpanzees, researchers have looked to CNSs for evidence of this. A subset of the CNSs shared by humans and other primates is enriched for human-specific single-nucleotide polymorphisms (SNPs), suggesting positive selection for these SNPs and accelerated evolution of those CNSs. Many of these SNPs are also associated with changes in gene expression, suggesting that these CNSs played an important role in human evolution. [22]

Online bioinformatic software

Program         Website [4]
Consite         http://consite.genereg.net/ (archived 2009-01-05 at the Wayback Machine)
Ancora          http://ancora.genereg.net/
FootPrinter     http://bio.cs.washington.edu/software (archived 2011-11-22 at the Wayback Machine)
GenomeTrafac    http://genometrafac.cchmc.org/genome-trafac/index.jsp (archived 2020-08-12 at the Wayback Machine)
rVISTA          http://rvista.dcode.org/
Toucan          http://homes.esat.kuleuven.be/~saerts/software/toucan.php
Trafac          http://trafac.chmcc.org/trafac/index.jsp
UCNEbase        http://ccg.vital-it.ch/UCNEbase/


References

  1. Hardison, RC. (Sep 2000). "Conserved noncoding sequences are reliable guides to regulatory elements". Trends Genet. 16 (9): 369–72. doi:10.1016/s0168-9525(00)02081-3. PMID 10973062. Archived from the original on 2000-12-04. Retrieved 2011-02-18.
  2. Freeling, M; Subramaniam, S (Apr 2009). "Conserved noncoding sequences (CNSs) in higher plants". Curr Opin Plant Biol. 12 (2): 126–32. doi:10.1016/j.pbi.2009.01.005. PMID 19249238.
  3. Prabhakar, S.; Noonan, JP.; Pääbo, S.; Rubin, EM. (Nov 2006). "Accelerated evolution of conserved noncoding sequences in humans". Science. 314 (5800): 786. doi:10.1126/science.1130738. PMID 17082449. S2CID 15049725.
  4. Jegga, AG.; Aronow, BJ. (Apr 2008). Evolutionarily Conserved Noncoding DNA. doi:10.1002/9780470015902.a0006126.pub2. ISBN 978-0470016176.
  5. Rogozin, IB.; Wolf, YI.; Sorokin, AV.; Mirkin, BG.; Koonin, EV. (Sep 2003). "Remarkable Interkingdom Conservation of Intron Positions and Massive, Lineage-Specific Intron Loss and Gain in Eukaryotic Evolution". Current Biology. 13 (17): 1512–1517. doi:10.1016/S0960-9822(03)00558-X. PMID 12956953.
  6. Eickbush, TH.; Eickbush, DJ. (July 2006). "Transposable Elements: Evolution". eLS. doi:10.1038/npg.els.0005130. ISBN 9780470016176.
  7. Mikkelsen, T.S.; et al. (2007). "Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences". Nature. 447 (7141): 167–177. Bibcode:2007Natur.447..167M. doi:10.1038/nature05805. PMID 17495919.
  8. Feschotte, Cédric (May 2008). "Transposable Elements and the Evolution of Regulatory Networks". Nature Reviews Genetics. 9 (5): 397–405. doi:10.1038/nrg2337. PMC 2596197. PMID 18368054.
  9. Cooper, DN. Human Gene Evolution. Oxford: BIOS Scientific Publishers, Sept, 1988, p. 265–292.
  10. Svensson, O.; Arvestad, L.; Lagergren, J. (May 2005). "Genome-wide survey for biologically functional pseudogenes". PLOS Comput. Biol. 2 (5): 46. doi:10.1371/journal.pcbi.0020046. PMC 1456316. PMID 16680195.
  11. Podlaha, Ondrej.; Zhang, Jianzhi. (Nov 2010). "Pseudogenes and Their Evolution". eLS. doi:10.1002/9780470015902.a0005118.pub2. ISBN 9780470016176.
  12. Nishikimi M, Kawai T, Yagi K (October 1992). "Guinea pigs possess a highly mutated gene for L-gulono-gamma-lactone oxidase, the key enzyme for L-ascorbic acid biosynthesis missing in this species". J. Biol. Chem. 267 (30): 21967–72. doi:10.1016/S0021-9258(19)36707-9. PMID 1400507.
  13. Bejerano, G.; Pheasant, M.; Makunin, I.; Stephen, S.; Kent, W.J.; Mattick, J.S.; Haussler, David. (May 2004). "Ultraconserved Elements in the Human Genome". Science. 304 (5675): 1321–1325. Bibcode:2004Sci...304.1321B. CiteSeerX 10.1.1.380.9305. doi:10.1126/science.1098119. PMID 15131266. S2CID 2790337.
  14. Katzman, Sol.; Kern, A.D.; Bejerano, G.; Fewell, G.; Fulton, L.; Wilson, R.K.; Salama, S.R.; Haussler, David. (Aug 2007). "Human Genome Ultraconserved Elements are Ultraselected". Science. 317 (5840): 915. Bibcode:2007Sci...317..915K. doi:10.1126/science.1142430. PMID 17702936. S2CID 35322654.
  15. Dubchack, I.; Brudno, M.; Loots, GG.; Pachter, L.; Mayor, C.; Rubin, EM.; Frazer, KA. (2000). "Active Conservation of Noncoding Sequences Revealed by Three-Way Species Comparisons". Genome Res. 10 (9): 1304–1306. doi:10.1101/gr.142200. PMC 310906. PMID 10984448.
  16. Matsunami, M.; Sumiyama, K.; Saitou, N. (Sep 2010). "Evolution of Conserved Non-Coding Sequences Within the Vertebrate Hox Clusters Through the Two-Round Whole Genome Duplications Revealed by Phylogenetic Footprinting Analysis". Journal of Molecular Evolution. 71 (5–6): 427–463. Bibcode:2010JMolE..71..427M. doi:10.1007/s00239-010-9396-1. PMID 20981416. S2CID 9733304.
  17. Santini, S.; Boore, JL.; Meyer, A. (2003). "Evolutionary Conservation of Regulatory Elements in Vertebrate Hox Gene Clusters". Genome Res. 13 (6A): 1111–1122. doi:10.1101/gr.700503. PMC 403639. PMID 12799348.
  18. Greaves, D.R.; et al. (1998). "Functional Comparison of the Murine Macrosialin and Human CD68 Promoters in Macrophage and Nonmacrophage Cell Lines". Genomics. 54 (1): 165–168. doi:10.1006/geno.1998.5546. PMID 9806844.
  19. Marchese, A.; et al. (1994). "Mapping Studies of Two G Protein-Coupled Receptor Genes: An Amino Acid Difference May Confer a Functional Variation Between a Human and Rodent Receptor". Biochem Biophys Res Commun. 205 (3): 1952–1958. doi:10.1006/bbrc.1994.2899. PMID 7811287.
  20. Margarit, Ester; et al. (1998). "Identification of Conserved Potentially Regulatory Sequences of the SRY Gene from 10 Different Species of Mammals". Biochem Biophys Res Commun. 245 (2): 370–377. doi:10.1006/bbrc.1998.8441. PMID 9571157.
  21. Lockton, Steven.; Gaut, BS. (Jan 2005). "Plant conserved non-coding sequences and paralogue evolution". Trends in Genetics. 21 (1): 60–65. doi:10.1016/j.tig.2004.11.013. PMID 15680516.
  22. Bird, Christine P.; et al. (2007). "Fast-evolving noncoding sequences in the human genome". Genome Biology. 8 (6): R118. doi:10.1186/gb-2007-8-6-r118. PMC 2394770. PMID 17578567.