Gene deserts are regions of the genome that are devoid of protein-coding genes. They constitute an estimated 25% of the entire genome, which has led to recent interest in their true functions. [1] Originally believed to contain inessential "junk DNA" because they do not encode proteins, gene deserts have since been linked to several vital regulatory functions, including distal enhancing and conservatory inheritance. Accordingly, an increasing number of risk factors for several major diseases, including a handful of cancers, have been attributed to irregularities found in gene deserts.
One of the most notable examples is the 8q24 gene region, which, when affected by certain single nucleotide polymorphisms (SNPs), leads to a myriad of diseases. The major identifying features of gene deserts lie in their low GC content and their relatively high levels of repeats, which are not observed in coding regions. Recent studies have further categorized gene deserts into variable and stable forms, based on their behavior through recombination and their genetic contents. Although current knowledge of gene deserts is rather limited, ongoing research and improved techniques are beginning to open the door to exploring the various important effects of these noncoding regions.
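The definition above (long stretches with no protein-coding genes) can be illustrated with a minimal sketch that scans gene coordinates on one chromosome for long gene-free gaps. The function name `find_gene_deserts`, the coordinate convention, and the 500 kilobase minimum length are assumptions made for illustration; the article does not specify a length threshold.

```python
from typing import List, Tuple

# (start, end) on one chromosome, 0-based and end-exclusive (an assumed convention).
Interval = Tuple[int, int]


def find_gene_deserts(
    genes: List[Interval],
    chrom_length: int,
    min_length: int = 500_000,  # hypothetical cutoff, not taken from the article
) -> List[Interval]:
    """Return gene-free gaps that are at least min_length base pairs long."""
    deserts: List[Interval] = []
    cursor = 0  # end of the furthest gene seen so far
    for start, end in sorted(genes):
        if start - cursor >= min_length:
            deserts.append((cursor, start))
        # max() handles overlapping or nested gene annotations.
        cursor = max(cursor, end)
    # Check the gap between the last gene and the chromosome end.
    if chrom_length - cursor >= min_length:
        deserts.append((cursor, chrom_length))
    return deserts


if __name__ == "__main__":
    toy_genes = [(100_000, 150_000), (900_000, 950_000), (960_000, 1_000_000)]
    print(find_gene_deserts(toy_genes, chrom_length=2_000_000))
```

The toy coordinates are invented; a real analysis would read protein-coding gene intervals from a genome annotation and would likely need a threshold chosen for the species in question.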
Although the possibility of function in gene deserts was predicted as early as the 1960s, genetic identification tools were unable to uncover any specific characteristics of the long noncoding regions, other than that no coding occurred in those regions. [2]
Before the completion of the human genome in 2001 through the Human Genome Project, most of the early associative gene comparisons relied on the belief that essential housekeeping genes were clustered in the same areas of the genome for ease of access and tight regulation. This belief later constructed a hypothesis that gene deserts are therefore previous regulatory sequences that are highly linked (and hence do not undergo recombination), but have had substitutions between them over time. [2] [3] These substitutions could cause tightly conserved genes to separate over time, thus forming regions of nonsense codes with a few essential genes. However, uncertainty due to differential gene conservation rates in different portions of chromosomes prevented accurate identification. Later associations were remodeled when regulatory sequences were associated with transcription factors, leading to the birth of large-scale genome-wide mapping. Thus began the hunt for the contents and functions of gene deserts.
Recent advancements in the screening of chromatin signatures on chromosomes (for instance, chromosome conformation capture, also known as 3C) have allowed the confirmation of the long-range gene activation model, which postulates that there are indeed physical links between regulatory enhancers and their target promoters. [2] Research on gene deserts, although centered on human genetics, has also been applied to mice, various birds, and Drosophila melanogaster. [4] [5] Although conservation is variable among selected species’ genomes, orthologous gene deserts function similarly. Thus, the prevailing contention about gene deserts is that these noncoding sequences harbor active and important regulatory elements.
One study focused on a regulatory archipelago, a region with “islands” of coding sequences surrounded by vast noncoding regions. The study, which explored the effects of regulation on the Hox genes, initially focused on two enhancer sequences, GCR and Prox, which are located 200 base pairs and 50 base pairs respectively upstream of the Hox D locus. [5] To manipulate the region, the study inverted the two enhancer sequences and discovered no major effects on the transcription of the Hox D gene, even though the two sequences were the closest sequences to the gene. Thus, the study turned to the gene desert that flanked the GCR sequence upstream and found five regulatory islands within it that could regulate the gene. To select the most likely candidate, the study then applied several individual and multiple deletions to the five islands to observe the effects. These varied deletions resulted only in minor effects, including physical abnormalities or a few missing digits.
When the experiment was taken a step further and the entire 830 kilobase gene desert was deleted, the entire Hox D locus was rendered inactive. [5] This indicates that the neighboring gene desert, as an entire 830 kilobase unit (including the five island sequences within it), serves as an important regulator of a single gene that spans merely 50 kilobases. These results therefore hinted at the regulatory effects of flanking gene deserts. The study was supported by a later comparison between fluorescence in situ hybridization and chromosome conformation capture, which found that the Hox D locus was the most decondensed portion of the region. This meant that it had relatively higher activity than the flanking gene deserts. [6] Hence, the Hox D locus could be regulated by specific nearby enhancer sequences that were not expressed in unison. However, the comparison also cautions that proximity is an inaccurate measure when either analytical method is used. [6] Thus, associations between regulatory gene deserts and their target promoters seem to span variable distances, and the deserts are not required to border their targets.
The variability in distance demonstrates that distance may be another important factor that is determined by gene deserts. For instance, distal enhancers may interact with their target promoters through looping interactions which must act over a certain distance. [7] Thus, proximity is not an accurate predictor of enhancers: enhancers do not need to border their target sequence to regulate them. While this leads to a variation in distances, the average distance between transcription start sites and the interaction complex mediated by their enhancer elements is 120 kilobases upstream of the start site. [7]
Gene deserts may play a role in constructing this distance to allow maximal looping to occur. Given that enhancer complex formation is a fairly simply regulated mechanism (the structures recruited into the enhancing complex have various regulatory controls that govern its construction), more than 50% of promoters have several long-range interactions. Certain core genes even have up to 20 possible enhancing interactions. There is a curious bias for complexes to form only upstream of the promoters. [7] Thus, given that many regulatory gene deserts appear upstream of their target promoters, it is possible that the more immediate role of gene deserts is the long-range regulation of key sequences. As the ideal formation of enhancer interactions requires specific constructs, a possible side product of the regulatory roles of gene deserts may be the conservation of genes: to retain the specific lengths of loops and the order of regulating genes hidden in gene deserts, certain portions of gene deserts are more highly conserved than others when passing through inheritance events. These conserved noncoding sequences (CNSs) are directly associated with syntenic inheritance in all vertebrates. [8] Thus, the presence of these CNSs could serve to conserve large regions of genes.
Although distance may vary in regulatory gene deserts, distance appears to have an upper limit in conservative gene deserts. CNSs were initially thought to occur close to their conserved genes: earlier estimates placed most CNSs in proximity of gene sequences. [8] However, the expansion of genetic data has revealed that several CNSs reside up to 2.5 megabases from their target genes, with the majority of CNSs falling between 1 and 2 megabases. This range, which was measured for the human genome, is varied among different species. For instance, in comparison to humans, the Fugu fish has a smaller range, with an estimated maximum distance of a few hundred kilobases. Regardless of the difference in lengths, CNSs work in similar methods in both species. [8] Thus, as functions differ between gene deserts, so do their contents.
Certain gene deserts are heavy regulators, while others may be deleted without any effect. As a possible classification, gene deserts can be broken down into two subtypes: stable and variable. [1] Stable gene deserts have fewer repeats and relatively higher guanine-cytosine (GC) content than variable gene deserts.
Guanine and cytosine content is indicative of protein-coding functionality. For example, in a study on chromosomes 2 and 4, which have been linked to several genetic diseases, there was elevated GC content in certain regions. [9] Mutations in these GC-rich regions caused a variety of diseases, revealing the necessary integrity of these genes. High-density CpG regions serve as regulatory regions for DNA methylation. [10] Therefore, essential coding genes should be represented by high-CpG regions. In particular, regions with high GC content should tend to have high densities of genes that are devoted mainly to essential housekeeping and tissue-specific processes. [11] These processes would require the most protein production to express functionality. Stable gene deserts, which have higher levels of GC content, should therefore contain the essential enhancer sequences. This could determine the conservatory functions of stable gene deserts.
On the other hand, approximately 80% of gene deserts have low GC content, indicating that they have very few essential genes. [9] Thus, the majority of gene deserts are variable gene deserts, which may have alternate functions. One prevalent theory regarding the origins of gene deserts postulates that gene deserts are accumulations of essential genes that act at a distance. [1] [10] This may hold true: given the low numbers of essential genes within them, these regions would have been less conserved. As a result, the prevalence of cytosine-to-thymine conversions, the most common SNP, would cause a gradual separation between the few essential genes within variable gene deserts. These essential sequences would have been maintained and conserved, leading to small regions of high density that regulate at a distance. [10] GC content is therefore an indication of the presence of coding or regulatory processes in DNA.
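The two measures discussed here, GC content and CpG density, can be computed directly from a sequence string. The sketch below is illustrative only: the function names are hypothetical, and the CpG observed/expected ratio is a commonly used normalization that the article itself does not mention.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0  # no unambiguous bases, e.g. a run of N
    return (seq.count("G") + seq.count("C")) / total


def cpg_obs_exp(seq: str) -> float:
    """Observed CpG count divided by the count expected from C and G frequencies.

    Uses the common formula (CpG count * length) / (C count * G count).
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count finds every occurrence.
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    example = "ATGCGCGTTACG"  # invented toy sequence
    print(gc_content(example), cpg_obs_exp(example))
```

Any cutoff used to call a region "GC-rich" or "CpG-dense" would be an additional assumption; none is given in the text.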
While stable gene deserts have higher GC content, this relative value is only an average. Within stable gene deserts, although the ends contain very high levels of GC content, the main bulk of the DNA contains even less GC content than observed in variable gene deserts. This indicates that there are very few highly conserved regions in stable gene deserts that do not recombine, or do so at very low rates. [9] Given that the ends of the stable gene deserts have particularly high levels of GC contents, these sequences must be extremely conserved. This conservation may in turn cause the flanking genes to also have higher conservation rates. Thus, stable genes should be directly linked to at least one of their flanking genes and cannot be separated from coding sequences by recombination events. [1] Most gene deserts appear to cluster in pairs around a small number of genes. This clustering creates long loci that have very low gene density; small regions with high numbers of genes are surrounded by long stretches of gene deserts, creating a low gene average. Therefore, the minimized probability of recombination events in these long loci creates syntenic blocks that are inherited together over time. [1] These syntenic blocks can be conserved for very long periods of time, preventing loss of essential material, even while the distance between essential genes may grow in time.
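The point that an average GC value can hide GC-rich ends and a GC-poor interior can be made concrete with a sliding-window profile. This is a minimal sketch; the function name, window size, and toy sequence are assumptions chosen for illustration rather than values from the cited studies.

```python
from typing import List, Tuple


def windowed_gc(seq: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window start, GC fraction) for each full window along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = seq.upper()
    profile: List[Tuple[int, float]] = []
    # Only full-length windows are reported; a trailing partial window is skipped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        profile.append((start, gc))
    return profile


if __name__ == "__main__":
    # Toy sequence: GC-rich ends flanking an AT-rich interior.
    toy = "GGGG" + "ATAT" + "CCCC"
    for start, gc in windowed_gc(toy, window=4, step=4):
        print(start, gc)
```

On real data the window would span thousands of bases, and the resulting profile could be compared against the whole-region average to see where the GC-rich segments sit.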
Although this effect should theoretically be amplified through the even lower GC content in variable gene deserts (thereby truly minimizing gene density), the gene conservation rates in variable gene deserts are even lower than those observed in stable gene deserts; in fact, the rate is far lower than in the rest of the genome. A possible explanation for this phenomenon is that variable gene deserts may be recently evolved regions that have not yet been fixed into stable gene deserts. [1] Therefore, shuffling may still occur before stabilizing regions within the variable gene deserts begin to cluster as whole units. There are a few exceptions to this minimal rate of conservation, as a few GC gene deserts are subjected to hypermethylation, which greatly reduces the accessibility of the DNA, thus effectively protecting the region from recombination. [11] However, these are rarely observed.
Although stable and variable gene deserts differ in content and function, both wield conservatory abilities. It is possible that since most variable gene deserts have regulatory elements that can act at a distance, conservation of the entire gene desert into a syntenic locus would not have been necessary, so long as these regulatory elements themselves were conserved as units. Given the particularly low levels of GC content, the regulatory elements would therefore be in a minimal gene density situation similar to that observed in flanking stable gene deserts, with the same effect. Thus, both types of gene deserts serve to retain essential genes within the genome.
The conservative nature of gene deserts suggests that these stretches of noncoding bases are essential to proper functioning. Indeed, a wide range of studies on irregularities in noncoding regions have discovered several associations with genetic diseases. One of the most studied gene deserts is the 8q24 region. Early genome-wide association studies focused on the 8q24 region (residing on chromosome 8) due to the abnormally high rates of SNPs that seem to occur there. These studies found that the region was linked to increased risks for a variety of cancers, notably of the prostate, breast, ovary, colon, and pancreas. [12] [13] Using inserts of the gene desert into bacterial artificial chromosomes, one study was able to produce enhancer activity in certain regions, which were isolated via cloning systems. [14] This study successfully identified an enhancer sequence hidden in the region. Within this enhancer sequence, a SNP that conferred risk for prostate cancer, labeled rs6983267, was discovered in diseased mice. However, the 8q24 region is not limited to conferring risk of prostate cancer. A study in 2008 screened human subjects (and controls) with variations in the gene desert region, discovering five different regions that conferred different risks when affected by different SNPs. [12] This study used identified SNP markers in the gene desert to link the risk conferred by each region to expression in a specific tissue. Although these risks were successfully linked to various forms of cancer, Ghoussaini et al. note their uncertainty as to whether the SNPs functioned merely as markers or were the direct causes of the cancers.
These varied effects occur due to the different interactions between the SNPs in this region and MYC promoters of different organs. MYC, whose promoter is located a short distance downstream of the 8q24 region, is perhaps the most studied oncogene due to its association with a myriad of diseases. [13] Normal functioning of the MYC promoter ensures that cells divide regularly. The study postulates that the 8q24 region, which underwent a chromosomal translocation in humans, could have moved an essential enhancer for the MYC promoter. [13] The areas around this region could have been subjected to recombination that may have hidden the essential MYC enhancer within the gene desert over time, although its enhancing effects are still very much retained. This analysis stems from disease associations observed in several mouse species in which this region is retained in proximity to the MYC promoter. [13] Thus, the 8q24 gene desert should have been somewhat linked to the MYC promoter. The desert resembles a stable gene desert that has had very little recombination after the translocation event. A potential hypothesis, then, is that SNPs affecting this region disrupt the important tissue-specific genes within the stable gene desert, which could explain the risks of cancer in various tissue forms. This effect of hidden enhancer elements can also be observed in other locations in the genome. For instance, SNPs in the 5p13.1 region deregulate the PTGER4 coding region, leading to Crohn's disease. [15] Another affected region in the 9p21 gene desert causes several coronary artery diseases. [16] However, none of these risk-conferring gene deserts seem to be affected as much as the 8q24 region. Current studies are still unsure about the SNP-affected processes in the 8q24 region that result in particularly amplified responses to the MYC promoter.
With the aid of a more accessible population and more specific markers for genome wide association mapping, an increasing number of risk alleles are now being marked in gene deserts, where small, isolated, and seemingly-unimportant regions of genes may moderate important genes.
The majority of the contents in gene deserts are still likely to be disposable.[ citation needed ] Naturally, this is not to say that the roles that gene deserts play are inessential or unimportant, but rather that their functions may include buffering effects. An example of an essential gene desert with inessential DNA content is the telomeres that protect the ends of chromosomes. Telomeres can be categorized as true gene deserts, given that they solely contain repeats of TTAGGG (in humans) and do not have apparent protein-coding functions. Without these telomeres, human genomes would be severely mutated within a fixed number of cell cycles. On the other hand, since telomeres do not code for proteins, their loss has no direct effect on important processes. Therefore, the term “junk” DNA should no longer be applied to any region of the genome; every portion of the genome should play a role in protecting, regulating, or repairing the protein-coding regions that determine the functions of life. Although there is still much to learn about the nooks and crannies of the immense (yet limited) human genome, with the aid of various new technologies and the synthesis of the full human genome, we may perhaps unravel a great collection of secrets about the marvels of our genetic code in the approaching years.
In genetics, a promoter is a sequence of DNA to which proteins bind to initiate transcription of a single RNA transcript from the DNA downstream of the promoter. The RNA transcript may encode a protein (mRNA), or can have a function in and of itself, such as tRNA or rRNA. Promoters are located near the transcription start sites of genes, upstream on the DNA . Promoters can be about 100–1000 base pairs long, the sequence of which is highly dependent on the gene and product of transcription, type or class of RNA polymerase recruited to the site, and species of organism.
The human genome is a complete set of nucleic acid sequences for humans, encoded as the DNA within each of the 24 distinct chromosomes in the cell nucleus. A small DNA molecule is found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.
Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules. Other functional regions of the non-coding DNA fraction include regulatory sequences that control gene expression; scaffold attachment regions; origins of DNA replication; centromeres; and telomeres. Some non-coding regions appear to be mostly nonfunctional, such as introns, pseudogenes, intergenic DNA, and fragments of transposons and viruses. Regions that are completely nonfunctional are called junk DNA.
The coding region of a gene, also known as the coding sequence (CDS), is the portion of a gene's DNA or RNA that codes for a protein. Studying the length, composition, regulation, splicing, structures, and functions of coding regions compared to non-coding regions over different species and time periods can provide a significant amount of important information regarding gene organization and evolution of prokaryotes and eukaryotes. This can further assist in mapping the human genome and developing gene therapy.
In molecular biology and genetics, transcriptional regulation is the means by which a cell regulates the conversion of DNA to RNA (transcription), thereby orchestrating gene activity. A single gene can be regulated in a range of ways, from altering the number of copies of RNA that are transcribed, to the temporal control of when the gene is transcribed. This control allows the cell or organism to respond to a variety of intra- and extracellular signals and thus mount a response. Some examples of this include producing the mRNA that encode enzymes to adapt to a change in a food source, producing the gene products involved in cell cycle specific activities, and producing the gene products responsible for cellular differentiation in multicellular eukaryotes, as studied in evolutionary developmental biology.
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).
In biology, the epigenome of an organism is the collection of chemical changes to its DNA and histone proteins that affects when, where, and how the DNA is expressed; these changes can be passed down to an organism's offspring via transgenerational epigenetic inheritance. Changes to the epigenome can result in changes to the structure of chromatin and changes to the function of the genome. The human epigenome, including DNA methylation and histone modification, is maintained through cell division. The epigenome is essential for normal development and cellular differentiation, enabling cells with the same genetic code to perform different functions. The human epigenome is dynamic and can be influenced by environmental factors such as diet, stress, and toxins.
An intergenic region is a stretch of DNA sequences located between genes. Intergenic regions may contain functional elements and junk DNA.
Hox genes, a subset of homeobox genes, are a group of related genes that specify regions of the body plan of an embryo along the head-tail axis of animals. Hox proteins encode and specify the characteristics of 'position', ensuring that the correct structures form in the correct places of the body. For example, Hox genes in insects specify which appendages form on a segment, and Hox genes in vertebrates specify the types and shape of vertebrae that will form. In segmented animals, Hox proteins thus confer segmental or positional identity, but do not form the actual segments themselves.
Cis-regulatory elements (CREs) or cis-regulatory modules (CRMs) are regions of non-coding DNA which regulate the transcription of neighboring genes. CREs are vital components of genetic regulatory networks, which in turn control morphogenesis, the development of anatomy, and other aspects of embryonic development, studied in evolutionary developmental biology.
In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA that is transcribed to produce a functional RNA. There are two types of molecular genes: protein-coding genes and non-coding genes.
Eukaryotic chromosome fine structure refers to the structure of sequences for eukaryotic chromosomes. Some fine sequences are included in more than one class, so the classification listed is not intended to be completely separate.
A gene family is a set of homologous genes within one organism. A gene cluster is a group of two or more genes found within an organism's DNA that encode similar polypeptides, or proteins, which collectively share a generalized function and are often located within a few thousand base pairs of each other. The size of gene clusters can vary significantly, from a few genes to several hundred genes. Portions of the DNA sequence of each gene within a gene cluster are found to be identical; however, the resulting protein of each gene is distinct from the resulting protein of another gene within the cluster. Genes found in a gene cluster may be observed near one another on the same chromosome or on different, but homologous, chromosomes. An example of a gene cluster is the Hox gene cluster, which is made up of eight genes and is part of the Homeobox gene family.
Phylogenetic footprinting is a technique used to identify transcription factor binding sites (TFBS) within a non-coding region of DNA of interest by comparing it to the orthologous sequence in different species. When this technique is used with a large number of closely related species, this is called phylogenetic shadowing.
Long non-coding RNAs are a type of RNA, generally defined as transcripts more than 200 nucleotides that are not translated into protein. This arbitrary limit distinguishes long ncRNAs from small non-coding RNAs, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), Piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), and other short RNAs. Given that some lncRNAs have been reported to have the potential to encode small proteins or micro-peptides, the latest definition of lncRNA is a class of RNA molecules of over 200 nucleotides that have no or limited coding capacity. Long intervening/intergenic noncoding RNAs (lincRNAs) are sequences of lncRNA which do not overlap protein-coding genes.
HOTAIR is a human gene located between HOXC11 and HOXC12 on chromosome 12. It is the first example of an RNA expressed on one chromosome that has been found to influence the transcription of the HOXD cluster posterior genes located on chromosome 2. The sequence and function of HOTAIR are different in humans and mice. Sequence analysis of HOTAIR revealed that it exists in mammals, has poorly conserved sequences and considerably conserved structures, and has evolved faster than nearby HoxC genes. A subsequent study found that HOTAIR has a 32-nucleotide-long conserved noncoding element (CNE) with a paralogous copy in the HOXD cluster region, suggesting that the HOTAIR conserved sequences predate the whole genome duplication events at the root of vertebrates. While the conserved sequence paralogous with the HOXD cluster is 32 nucleotides long, the HOTAIR sequence conserved from human to fish is about 200 nucleotides long and is marked by active enhancer features.
A conserved non-coding sequence (CNS) is a DNA sequence of noncoding DNA that is evolutionarily conserved. These sequences are of interest for their potential to regulate gene production.
Genome evolution is the process by which a genome changes in structure (sequence) or size over time. The study of genome evolution involves multiple fields such as structural analysis of the genome, the study of genomic parasites, gene and ancient genome duplications, polyploidy, and comparative genomics. Genome evolution is a constantly changing and evolving field due to the steadily growing number of sequenced genomes, both prokaryotic and eukaryotic, available to the scientific community and the public at large.
Epigenetics of human development is the study of how epigenetics affects human development.
Nuclear organization refers to the spatial distribution of chromatin within a cell nucleus. There are many different levels and scales of nuclear organisation. Chromatin is a higher order structure of DNA.