An ultra-conserved element (UCE) was originally defined as a genome segment longer than 200 base pairs (bp) that is absolutely conserved, with no insertions or deletions and 100% identity, between orthologous regions of the human, rat, and mouse genomes. [1] [2] A total of 481 ultra-conserved elements have been identified in the human genome. [1] [2] If ribosomal DNA (rDNA) regions are excluded, these range in size from 200 bp to 781 bp. [2] UCEs, also called ultra-conserved regions (UCRs), are found on all chromosomes except 21 and Y. [3] A database collecting genomic information about ultra-conserved elements (UCbase) is available at http://ucbase.unimore.it. [4]
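The original definition can be read as a scan over an alignment of orthologous sequences for long runs of columns that are identical and gap-free in every species. The following is a minimal sketch under stated assumptions, not the pipeline used in the cited studies: the function name `find_ultraconserved`, the representation of the alignment as equal-length strings with `-` for gaps, and the choice to accept runs of at least `min_len` columns are all illustrative.

```python
from typing import List, Tuple


def find_ultraconserved(aligned: List[str], min_len: int = 200) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column intervals in which every aligned
    sequence is identical and gap-free for at least min_len columns.

    Assumption: `aligned` holds rows of one multiple alignment, so all
    strings have the same length and '-' marks an insertion or deletion.
    """
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")

    reference = aligned[0].upper()
    others = [s.upper() for s in aligned[1:]]
    n = len(reference)

    runs: List[Tuple[int, int]] = []
    start = None  # first column of the current conserved run, if any
    for i in range(n + 1):
        # A column is conserved if it is not a gap and all rows agree.
        conserved = (
            i < n
            and reference[i] != "-"
            and all(row[i] == reference[i] for row in others)
        )
        if conserved:
            if start is None:
                start = i
        else:
            # The run (if any) ends here; keep it only if it is long enough.
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy alignment with a short threshold so the example stays readable.
human = "ACGTACGTAC" + "T" + "GGCC"
mouse = "ACGTACGTAC" + "A" + "GGCC"
rat = "ACGTACGTAC" + "T" + "GG-C"
print(find_ultraconserved([human, mouse, rat], min_len=5))  # [(0, 10)]
```

Keeping `min_len` as a parameter means the same scan also covers relaxed definitions, for example `min_len=100`, and passing more rows covers additional species.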
Since its creation, the term's usage has broadened to include more evolutionarily distant species or shorter segments, for example 100 bp instead of 200 bp. [1] [2] By some definitions, segments need not be syntenic between species. [1] Human UCEs also show high conservation with more evolutionarily distant species, such as chicken and fugu. [2] Out of 481 identified human UCEs, approximately 97% align with high identity to the chicken genome, though only 4% of the human genome can be reliably aligned to the chicken genome. [2] Similarly, the same sequences in the fugu genome have 68% identity to human UCEs, despite the human genome reliably aligning to only 1.8% of the fugu genome. [2] Despite often being noncoding DNA, [5] some ultra-conserved elements have been found to be transcriptionally active, producing non-coding RNA molecules. [6]
Researchers originally assumed that perfect conservation of these long stretches of DNA implied evolutionary importance, as these regions appear to have experienced strong negative (purifying) selection for 300–400 million years. [2] [5] [7] More recently, this assumption has been replaced by two main hypotheses: that UCEs are created through a reduced negative selection rate, or through reduced mutation rates, also known as a "cold spot" of evolution. [1] [2] Many studies have examined the validity of each hypothesis. The probability of finding ultra-conserved elements by chance (under neutral evolution) has been estimated at less than 10⁻²² in 2.9 billion bases. [2] In support of the cold spot hypothesis, UCEs were found to be mutating at a rate 20-fold lower than expected under conservative models for neutral mutation rates. [2] This fold difference in mutation rates was consistent between humans, chimpanzees, and chickens. [2]
Ultra-conserved elements are not exempt from mutations, as exemplified by the presence of 29,983 polymorphisms in the UCE regions of the human genome assembly GRCh38. [8] However, only 112 of these polymorphisms affected phenotype, and most of those were located in coding regions of the UCEs. [8]
A study performed in mice determined that deleting UCEs from the genome did not create obvious deleterious phenotypes, even though the deleted UCEs lay close to promoters and protein-coding genes. [9] Affected mice were fertile, and targeted screens of the nearby coding genes showed no altered phenotype. [9] A separate mouse study demonstrated that ultra-conserved enhancers were robust to mutagenesis, concluding that perfect conservation of UCE sequences is not required for their function, which would suggest a reason for the sequence consistency other than evolutionary importance. [10]
Computational analysis of human ultra-conserved noncoding elements (UCNEs) found that the regions are enriched for A-T sequences and are generally GC poor. [11] However, the UCNEs were also found to be enriched for CpG, or highly methylated. [11] This may indicate that there is some change to DNA structure in these regions favoring their precise retention, but this possibility has not been validated through testing. [11]
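The two composition measures mentioned above, GC content and CpG enrichment, are straightforward to compute for a single sequence. The sketch below is illustrative only and is not the method of the cited analysis: the function names are hypothetical, and CpG enrichment is expressed with the commonly used observed/expected ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G). Note that sequence composition alone says nothing about whether a given CpG site is actually methylated.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Returns 0.0 when the sequence has no C or no G, since no CpG
    dinucleotide is possible in that case.
    """
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Toy sequence: AT-rich overall, but its only C and G form a CpG.
example = "ATATCGATTA"
print(gc_content(example))   # 0.2
print(cpg_obs_exp(example))  # 10.0
```

The toy sequence shows how a region can be GC poor and still have a high CpG ratio, because the ratio is normalized by how many C and G bases are present.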
Often, ultra-conserved elements are located near transcriptional regulators or developmental genes, performing functions such as gene enhancement and splicing regulation. [1] [2] [12] A study comparing ultra-conserved elements between humans and the Japanese puffer fish Takifugu rubripes proposed that they are important in vertebrate development. [13] Double-knockouts of UCEs near the ARX gene in mice caused a shrunken hippocampus in the brain, though the effect was not lethal. [14] Some UCEs are not transcribed, and are referred to as ultra-conserved noncoding elements. [11] However, many UCRs in humans are extensively transcribed. [6] A small number of those which are transcribed, known as transcribed UCRs (T-UCRs), have been connected with human carcinomas and leukemias. [6] For example, TUC338 is strongly upregulated in human hepatocellular carcinoma cells. [15] Indeed, UCEs are affected by copy number variation far more often in cancer cells than in healthy contexts, suggesting that altering the copy number of T-UCRs may be deleterious. [16] [17] [18]
Research has demonstrated that T-UCRs have tissue-specific expression, and a differential expression profile between tumors and other diseases. [3] UCRs also tend to accumulate fewer mutations than flanking segments, in both neoplastic and non-neoplastic samples from persons with hereditary non-polyposis colorectal cancer. [19] The tables below highlight transcripts and polymorphisms within UCRs that have been shown to contribute to human diseases. [3] [8]
miR/methylation/transcript factor associated with T-UCRs | Disease | References |
miR-24-1/uc.160 | Leukemia | Calin et al., 2007 [6] |
miR-130b/uc.63 | Prostate CA | Sekino et al., 2017 [20] |
miR-153/uc.416 | Colorectal and renal CA | Goto et al., 2016; [21] Sekino et al., 2017 [20] |
miR-155/uc.160 | Gastric CA | Calin et al., 2007; [6] Pang et al., 2018 [22] |
miR-155/uc346A | Leukemia | Calin et al., 2007 [6] |
miR-195/uc.283 | Bladder CA | Liz et al., 2014 [23] |
miR-195, miR-4668/uc.372 | Lipid metabolism | Guo et al., 2018 [24] |
miR-195/uc.173 | Gastrointestinal tract | Xiao et al., 2018 [25] |
miR-214/uc.276 | Colorectal CA | Wojcik et al., 2010 [26] |
miR-291a-3p/uc.173 | Nervous system | Nan et al., 2016 [27] |
miR-29b/uc.173 | Gastrointestinal tract | J. Y. Wang et al., 2018 [28] |
miR-339-3p, miR-663b-3p, miR-95-5p/uc.339 | Lung CA | Vannini et al., 2017 [29] |
miR-596/uc.8 | Bladder CA | Olivieri et al., 2016 [30] |
DNA methylation/uc.160, uc.283, and uc.346 | Colorectal CA | Kottorou et al., 2018 [31] |
DNA methylation/uc.158 + A, uc.160+, uc.241 + A, uc.283 + A, uc.346 + A | Gastric CA | Goto et al., 2016; [21] Lujambio et al., 2010 [20] |
Transcription factor SP1/uc.138 (TRA2β4) | Colorectal CA | Kajita et al., 2016 [32] |
Transcription factor YY1/uc.8 | Bladder CA | Terreri et al., 2016 [33] |
Polymorphism name | Associated phenotype description | Source |
rs17105335 | Amyotrophic lateral sclerosis | Cronin et al. (2008) [34] |
rs2020906 | Lynch syndrome | Hansen et al. (2014) [35] |
rs10496382 | Height | Chiang et al. (2012) [36] |
rs13382811 | Severe myopia | Khor et al. (2013) [37] |
rs104893634 | Vertical talus congenital | Dobbs et al. (2006); [38] Shrimpton et al. (2004) [38] |
rs2307121 | Central corneal thickness | Lu et al. (2013) [39] |
rs587777277 | Bosch-Boonstra-Schaaf optic atrophy syndrome | Bosch et al. (2014) [40] |
rs587777275 | Bosch-Boonstra-Schaaf optic atrophy syndrome | Bosch et al. (2014) [40] |
rs587777274 | Bosch-Boonstra-Schaaf optic atrophy syndrome | Bosch et al. (2014) [40] |
rs387906239 | Familial adenomatous polyposis 1 attenuated | Soravia et al. (1999) [41] |
rs3797704 | No association with breast cancer | Chang et al. (2016) [42] |
rs387906232 | Familial adenomatous polyposis 1 | Fodde et al. (1992) [43] |
rs387906237 | Familial adenomatous polyposis 1 attenuated | Curia et al. (1998) [44] |
rs121434591 | Distal myopathy | Senderek et al. (2009) [12] |
rs587777300 | Amyotrophic lateral sclerosis 21 | Johnson et al. (2014) [45] |
rs863223403 | Au-Kline syndrome | Au et al. (2015) [46] |
rs121917900 | Cockayne syndrome B | Mallery et al. (1998) [47] |
rs75462234 | Papillorenal syndrome | Schimmenti et al. (1999) [48] |
rs77453353 | Renal coloboma syndrome | Amiel et al. (2000) [49] |
rs76675173 | Papillorenal syndrome | Schimmenti et al. (1997) [50] |
rs587777708 | Focal segmental glomerulosclerosis 7 | Barua et al. (2014) [51] |
rs11190870 | Adolescent idiopathic scoliosis, no association with breast cancer | Chettier et al. (2015); [52] Gao et al. (2013); [53] Grauers et al. (2015); [54] Jiang et al. (2013); [55] Londono et al. (2014); [56] Miyake et al. (2013); [57] Shen et al. (2011); [58] Takahashi et al. (2011) [59] |
rs724159963 | Peroxisomal fatty acyl-CoA reductase 1 disorder | Buchert et al. (2014) [60] |
rs16932455 | Capecitabine sensitivity | O'Donnell et al. (2012) [61] |
rs997295 | Motion sickness; BMI | De et al. (2015); [62] Guo et al. (2013); [63] Hromatka et al. [64] |
rs587777373 | Congenital heart defects multiple types 4 | Al Turki et al. (2014) [65] |
rs398123839 | Duchenne muscular dystrophy | Hofstra et al. (2004); [66] Roberts et al. (1992) [67] |
rs863224976 | Becker muscular dystrophy | Tuffery-Giraud et al. (2005) [68] |
rs132630295 | Spastic paraplegia 2 X-linked | Gorman et al. (2007) [69] |
rs132630287 | Spastic paraplegia 2 X-linked | Saugier-Veber et al. (1994) [70] |
rs132630292 | Pelizaeus/Merzbacher disease atypical | Hodes et al. (1997) [71] |
rs137852350 | Mental retardation X-linked 94 | Wu et al. (2007) [72] |
rs122459149 | Emery-Dreifuss muscular dystrophy 6 X-linked | Gueneau et al. (2009); [73] Knoblauch et al. (2010) [74] |
rs122458141 | Myopathy X-linked with postural muscle atrophy | Schoser et al. (2009); [75] Windpassinger et al. (2008) [76] |
rs786200914 | Myopathy X-linked with postural muscle atrophy | Schoser et al. (2009) [75] |
rs267606811 | Myopathy X-linked with postural muscle atrophy | Windpassinger et al. (2008) [76] |
rs62621672 | Rett syndrome (nonpathogenic variant) | Zahorakova et al. (2007) [77] |
In genetics, a promoter is a sequence of DNA to which proteins bind to initiate transcription of a single RNA transcript from the DNA downstream of the promoter. The RNA transcript may encode a protein (mRNA), or can have a function in and of itself, such as tRNA or rRNA. Promoters are located near the transcription start sites of genes, upstream on the DNA. Promoters can be about 100–1000 base pairs long, and their sequence is highly dependent on the gene and product of transcription, the type or class of RNA polymerase recruited to the site, and the species of organism.
A non-coding RNA (ncRNA) is a functional RNA molecule that is not translated into a protein. The DNA sequence from which a functional non-coding RNA is transcribed is often called an RNA gene. Abundant and functionally important types of non-coding RNAs include transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), as well as small RNAs such as microRNAs, siRNAs, piRNAs, snoRNAs, snRNAs, exRNAs, scaRNAs and the long ncRNAs such as Xist and HOTAIR.
The CpG sites or CG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' → 3' direction. CpG sites occur with high frequency in genomic regions called CpG islands.
Pseudogenes are nonfunctional segments of DNA that resemble functional genes. Most arise as superfluous copies of functional genes, either directly by gene duplication or indirectly by reverse transcription of an mRNA transcript. Pseudogenes are usually identified when genome sequence analysis finds gene-like sequences that lack regulatory sequences needed for transcription or translation, or whose coding sequences are obviously defective due to frameshifts or premature stop codons. Pseudogenes are a type of junk DNA.
An Alu element is a short stretch of DNA originally characterized by the action of the Arthrobacter luteus (Alu) restriction endonuclease. Alu elements are the most abundant transposable elements in the human genome, present in excess of one million copies. Alu elements were thought to be selfish or parasitic DNA, because their sole known function is self reproduction. However, they are likely to play a role in evolution and have been used as genetic markers. They are derived from the small cytoplasmic 7SL RNA, a component of the signal recognition particle. Alu elements are highly conserved within primate genomes and originated in the genome of an ancestor of Supraprimates.
In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.
T-box transcription factor T, also known as Brachyury protein, is encoded for in humans by the TBXT gene. Brachyury functions as a transcription factor within the T-box family of genes. Brachyury homologs have been found in all bilaterian animals that have been screened, as well as the freshwater cnidarian Hydra.
Serine/threonine kinase 11 (STK11) also known as liver kinase B1 (LKB1) or renal carcinoma antigen NY-REN-19 is a protein kinase that in humans is encoded by the STK11 gene.
Transcription factor SOX-9 is a protein that in humans is encoded by the SOX9 gene.
Long non-coding RNAs are a type of RNA, generally defined as transcripts more than 200 nucleotides that are not translated into protein. This arbitrary limit distinguishes long ncRNAs from small non-coding RNAs, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), Piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), and other short RNAs. Given that some lncRNAs have been reported to have the potential to encode small proteins or micro-peptides, the latest definition of lncRNA is a class of RNA molecules of over 200 nucleotides that have no or limited coding capacity. Long intervening/intergenic noncoding RNAs (lincRNAs) are sequences of lncRNA which do not overlap protein-coding genes.
HOTAIR is a human gene located between HOXC11 and HOXC12 on chromosome 12. It is the first example of an RNA expressed on one chromosome that has been found to influence transcription of HOXD cluster posterior genes located on chromosome 2. The sequence and function of HOTAIR differ between human and mouse. Sequence analysis of HOTAIR revealed that it exists in mammals, has poorly conserved sequences and considerably conserved structures, and has evolved faster than nearby HoxC genes. A subsequent study identified a 32-nucleotide conserved noncoding element (CNE) in HOTAIR that has a paralogous copy in the HOXD cluster region, suggesting that the HOTAIR conserved sequence predates the whole-genome duplication events at the root of vertebrates. While the conserved sequence paralogous with the HOXD cluster is 32 nucleotides long, the HOTAIR sequence conserved from human to fish is about 200 nucleotides long and is marked by active enhancer features.
UCbase is a database of ultraconserved sequences that were first described by Bejerano, G. et al. in 2004. They are highly conserved genome regions that share 100% identity among human, mouse and rat. UCRs are 481 sequences longer than 200 bases. They are frequently located at genomic regions involved in cancer, are differentially expressed in human leukemias and carcinomas, and in some instances are regulated by microRNAs. The first release of UCbase was published by Taccioli, C. et al. in 2009. Recent updates include new annotation based on the hg19 human genome assembly, information about disorders related to the chromosome coordinates using the SNOMED CT classification, a query tool to search for SNPs, and a new text box to directly interrogate the database using a MySQL interface. Moreover, a sequence comparison tool allows researchers to match selected sequences against ultraconserved elements located in genomic regions involved in specific disorders. To facilitate the interactive, visual interpretation of UCR chromosomal coordinates, the authors have implemented the graph visualization feature of UCbase by creating a link to the UCSC Genome Browser. UCbase 2.0 no longer provides microRNA (miRNA) information and focuses only on UCRs. The official release of UCbase 2.0 was published in 2014.
A conserved non-coding sequence (CNS) is a DNA sequence of noncoding DNA that is evolutionarily conserved. These sequences are of interest for their potential to regulate gene production.
CDKN2B-AS, also known as ANRIL, is a long non-coding RNA consisting of 19 exons, spanning 126.3 kb in the genome, and its spliced product is a 3834 bp RNA. It is located within the p15/CDKN2B-p16/CDKN2A-p14/ARF gene cluster, in the antisense direction. Single nucleotide polymorphisms (SNPs) which alter the expression of CDKN2B-AS are associated with human healthy life expectancy, as well as with multiple diseases, including coronary artery disease, diabetes and many cancers. It binds to chromobox 7 (CBX7) within the polycomb repressive complex 1 and to SUZ12, a component of polycomb repressive complex 2, and through these interactions is involved in transcriptional repression.
TUC338 is an ultra-conserved element which is transcribed to give a non-coding RNA. The TUC338 gene was first identified as uc.338, along with 480 other ultra-conserved elements in the human genome. Expression of this RNA gene has been found to dramatically increase in hepatocellular carcinoma (HCC) cells.
Long interspersed nuclear elements (LINEs) are a group of non-LTR retrotransposons that are widespread in the genome of many eukaryotes. LINEs contain an internal Pol II promoter to initiate transcription into mRNA, and encode one or two proteins, ORF1 and ORF2. The functional domains present within ORF1 vary greatly among LINEs, but often exhibit RNA/DNA binding activity. ORF2 is essential to successful retrotransposition, and encodes a protein with both reverse transcriptase and endonuclease activity.
LINE1 is a family of related class I transposable elements in the DNA of some organisms, classified with the long interspersed elements (LINEs). L1 transposons comprise approximately 17% of the human genome. These active L1s can interrupt the genome through insertions, deletions, rearrangements, and copy number variations. L1 activity has contributed to the instability and evolution of genomes and is tightly regulated in the germline by DNA methylation, histone modifications, and piRNA. L1s can further impact genome variation through mispairing and unequal crossing over during meiosis due to its repetitive DNA sequences.
Short interspersed nuclear elements (SINEs) are non-autonomous, non-coding transposable elements (TEs) that are about 100 to 700 base pairs in length. They are a class of retrotransposons, DNA elements that amplify themselves throughout eukaryotic genomes, often through RNA intermediates. SINEs compose about 13% of the mammalian genome.
Generally, in progression to cancer, hundreds of genes are silenced or activated. Although silencing of some genes in cancers occurs by mutation, a large proportion of carcinogenic gene silencing is a result of altered DNA methylation. DNA methylation causing silencing in cancer typically occurs at multiple CpG sites in the CpG islands that are present in the promoters of protein coding genes.
Solute carrier family 39 member 12 is a protein that in humans is encoded by the SLC39A12 gene.