Exon shuffling is a molecular mechanism for the formation of new genes. It is a process through which two or more exons from different genes can be brought together ectopically, or the same exon can be duplicated, to create a new exon-intron structure. [1] Exon shuffling occurs through several mechanisms: transposon-mediated exon shuffling, crossover during sexual recombination of parental genomes, and illegitimate recombination.
Exon shuffling follows certain splice frame rules. Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons (phase 0 introns), between the first and second nucleotides of a codon (phase 1 introns), or between the second and third nucleotides of a codon (phase 2 introns). Exons can in turn be classified into nine groups based on the phases of their flanking introns (symmetrical: 0-0, 1-1, 2-2; asymmetrical: 0-1, 0-2, 1-0, 1-2, 2-0, 2-1). Symmetric exons are the only ones that can be inserted into introns, undergo duplication, or be deleted without changing the reading frame. [2]
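The phase rules above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for illustration. The sketch assumes that an intron's phase equals the number of coding nucleotides upstream of it modulo 3, from which it follows that an exon flanked by introns of phase a (5' side) and phase b (3' side) has a length congruent to b - a modulo 3. Symmetric exons therefore have lengths divisible by three, which is why inserting, duplicating, or deleting them leaves the reading frame intact.

```python
from itertools import product

PHASES = (0, 1, 2)

def intron_phase(coding_nt_upstream: int) -> int:
    """Phase of an intron, given how many coding nucleotides precede it."""
    return coding_nt_upstream % 3

def exon_class(five_prime_phase: int, three_prime_phase: int) -> str:
    """Label an exon by the phases of its flanking introns, e.g. '1-1'."""
    return f"{five_prime_phase}-{three_prime_phase}"

def is_symmetric(five_prime_phase: int, three_prime_phase: int) -> bool:
    """An exon is symmetric when both flanking introns share a phase."""
    return five_prime_phase == three_prime_phase

def exon_length_mod3(five_prime_phase: int, three_prime_phase: int) -> int:
    """Exon length modulo 3 implied by the flanking intron phases."""
    return (three_prime_phase - five_prime_phase) % 3

# Enumerate all nine exon classes and pick out the symmetric ones.
classes = {
    exon_class(a, b): is_symmetric(a, b) for a, b in product(PHASES, repeat=2)
}
symmetric = [name for name, sym in classes.items() if sym]

# Only classes whose implied length is a multiple of 3 preserve the frame.
frame_preserving = [
    exon_class(a, b)
    for a, b in product(PHASES, repeat=2)
    if exon_length_mod3(a, b) == 0
]
```

Under these assumptions, `symmetric` and `frame_preserving` contain the same three classes (0-0, 1-1, 2-2), matching the rule stated above.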
Exon shuffling was first proposed in 1978, when Walter Gilbert suggested that the existence of introns could play a major role in the evolution of proteins. [3] It was noted that recombination within introns could help assort exons independently and that repetitive segments in the middle of introns could create hotspots for recombination to shuffle the exonic sequences. However, the presence of introns in eukaryotes and their absence in prokaryotes created a debate about when introns appeared. Two theories arose: the "introns early" theory and the "introns late" theory. Supporters of the "introns early" theory believed that introns and RNA splicing were relics of the RNA world and that both prokaryotes and eukaryotes therefore had introns in the beginning. In this view, prokaryotes eliminated their introns to obtain higher efficiency, while eukaryotes retained the introns and the genetic plasticity of their ancestors. Supporters of the "introns late" theory, on the other hand, believed that prokaryotic genes resemble the ancestral genes and that introns were inserted later into the genes of eukaryotes. What is clear now is that the eukaryotic exon-intron structure is not static: introns are continually inserted into and removed from genes, and introns evolve in parallel with exon shuffling.[ citation needed ]
For exon shuffling to begin playing a major role in protein evolution, spliceosomal introns first had to appear. The self-splicing introns of the RNA world were unsuitable for exon shuffling by intronic recombination: these introns had an essential function and therefore could not be recombined. Additionally, there is strong evidence that spliceosomal introns evolved fairly recently and are restricted in their evolutionary distribution. Exon shuffling therefore came to play a major role in the construction of younger proteins.[ citation needed ]
Moreover, to define more precisely when exon shuffling became significant in eukaryotes, the evolutionary distribution of modular proteins that evolved through this mechanism was examined in different organisms such as Escherichia coli, Saccharomyces cerevisiae, and Arabidopsis thaliana. These studies suggested that there was an inverse relationship between genome compactness and the proportion of intronic and repetitive sequences, and that exon shuffling became significant after the metazoan radiation. [4]
Evolution of eukaryotes is mediated by sexual recombination of parental genomes, and since introns are longer than exons, most crossovers occur in noncoding regions. These introns contain large numbers of transposable elements and repeated sequences, which promote recombination of nonhomologous genes. In addition, it has been shown that mosaic proteins are composed of mobile domains that have spread to different genes during evolution and that are capable of folding independently.[ citation needed ]
The modularization hypothesis describes a mechanism for the formation and shuffling of these domains. The mechanism is divided into three stages. The first stage is the insertion of introns at positions that correspond to the boundaries of a protein domain. The second stage is when the resulting "protomodule" undergoes tandem duplications by recombination within the inserted introns. The third stage is when one or more protomodules are transferred to a different, nonhomologous gene by intronic recombination. All stages of modularization have been observed in different domains, such as those of hemostatic proteins. [2]
A potential mechanism for exon shuffling is long interspersed element-1 (LINE-1, or L1)-mediated 3' transduction. LINEs are a group of genetic elements that are found in abundance in eukaryotic genomes. [5] LINE-1 is the most common LINE found in humans. It is transcribed by RNA polymerase II to give an mRNA that codes for two proteins, ORF1 and ORF2, which are necessary for transposition. [6]
Upon transposition, L1 associates with 3' flanking DNA and carries the non-L1 sequence to a new genomic location. This new location does not have to be in a homologous sequence or in close proximity to the donor DNA sequence. The donor DNA sequence remains unchanged throughout this process because L1 transposes in a copy-paste manner via RNA intermediates; however, only regions located 3' of the L1 have been shown to be duplicated in this way.[ citation needed ]
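The copy-paste behavior described above can be sketched as a toy string model in Python. This is an illustration under simplifying assumptions, not a model of the actual biochemistry: sequences are plain strings, the L1 element is written in lowercase and host DNA in uppercase purely for readability, the RNA intermediate is represented by the same letters as the DNA, and the function name and example sequences are hypothetical.

```python
def transduce_3prime(donor: str, l1_start: int, transcript_end: int,
                     target: str, insert_at: int) -> tuple[str, str]:
    """Toy model of L1-mediated 3' transduction.

    The transcript starts at the L1 element and reads through into the
    3' flanking host DNA. A copy of it is inserted into the target
    sequence, while the donor is returned unchanged (copy-paste).
    """
    # Read-through transcript: the L1 element plus its 3' flank.
    transcript = donor[l1_start:transcript_end]
    # Insert the copied segment at the new location.
    new_target = target[:insert_at] + transcript + target[insert_at:]
    return donor, new_target

# Hypothetical example: lowercase is the L1 element, "ATGGCC" a flanking exon.
donor = "ACGT" + "ggccaatt" + "ATGGCC" + "TTTT"
target = "CATCAT"
donor_after, target_after = transduce_3prime(
    donor, l1_start=4, transcript_end=18, target=target, insert_at=3
)
# target_after is "CATggccaattATGGCCCAT"; donor_after equals donor.
```

In this sketch the flanking exon ends up at a second location without being removed from the donor, which is the property that makes 3' transduction a candidate mechanism for exon shuffling.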
Nevertheless, there is reason to believe that this may not hold true every time, as shown by the following example. The human ATM gene is responsible for the human autosomal-recessive disorder ataxia-telangiectasia and is located on chromosome 11. However, a partial ATM sequence is found on chromosome 7. Molecular features suggest that this duplication was mediated by L1 retrotransposition: the derived sequence was flanked by 15 bp target site duplications (TSDs), the sequence around the 5' end matched the consensus sequence for the L1 endonuclease cleavage site, and a poly(A) tail preceded the 3' TSD. But since the L1 element was present in neither the retrotransposed segment nor the original sequence, the mobilization of the segment cannot be explained by 3' transduction. Additional information has led to the belief that trans-mobilization of the DNA sequence is another mechanism by which L1 shuffles exons, but more research on the subject must be done. [7]
Another mechanism through which exon shuffling occurs involves helitrons. Helitron transposons were first discovered during studies of repetitive DNA segments of the rice, worm and thale cress genomes. Helitrons have been identified in all eukaryotic kingdoms, but the number of copies varies from species to species.[ citation needed ]
Helitron-encoded proteins are composed of a rolling-circle (RC) replication initiator (Rep) domain and a DNA helicase (Hel) domain. The Rep domain is involved in the catalytic reactions for endonucleolytic cleavage, DNA transfer and ligation. In addition, this domain contains three motifs. The first motif is necessary for DNA binding. The second motif has two histidines and is involved in metal ion binding. Lastly, the third motif has two tyrosines and catalyzes DNA cleavage and ligation.[ citation needed ]
There are three models of gene capture by helitrons: the "read-through" model 1 (RTM1), the "read-through" model 2 (RTM2) and the filler DNA model (FDNA). According to the RTM1 model, an accidental "malfunction" of the replication terminator at the 3' end of the Helitron leads to transposition of genomic DNA. The transposed segment consists of the read-through Helitron element and its downstream genomic regions, flanked by a random DNA site that serves as a "de novo" RC terminator. According to the RTM2 model, the 3' terminus of another Helitron serves as an RC terminator of transposition; this occurs after a malfunction of the RC terminator. Lastly, in the FDNA model, portions of genes or non-coding regions can accidentally serve as templates during repair of double-stranded DNA breaks occurring in helitrons. [8] Even though helitrons have been shown to be a very important evolutionary tool, the specific details of their mechanisms of transposition are yet to be defined.[ citation needed ]
An example of helitron-driven evolution is the diversity commonly found in maize. Helitrons in maize cause constant change in genic and nongenic regions, leading to diversity among different maize lines.[ citation needed ]
Long terminal repeat (LTR) retrotransposons are part of another mechanism through which exon shuffling takes place. They usually encode two open reading frames (ORFs). The first ORF, named gag, is related to viral structural proteins. The second ORF, named pol, is a polyprotein composed of an aspartic protease (AP), which cleaves the polyprotein; an RNase H (RH), which splits the DNA-RNA hybrid; a reverse transcriptase (RT), which produces a cDNA copy of the transposon's RNA; and a DDE integrase, which inserts the cDNA into the host's genome. Additionally, LTR retrotransposons are classified into five subfamilies: Ty1/copia, Ty3/gypsy, Bel/Pao, retroviruses and endogenous retroviruses. [9]
LTR retrotransposons require an RNA intermediate in their transposition cycle. Retrotransposons synthesize a cDNA copy from the RNA strand using a reverse transcriptase related to retroviral RT. The cDNA copy is then inserted into a new genomic position to form a retrogene. [10] This mechanism has been shown to be important in the gene evolution of rice and other grass species through exon shuffling.[ citation needed ]
DNA transposons with terminal inverted repeats (TIRs) can also contribute to gene shuffling. In plants, some non-autonomous elements called Pack-TYPE can capture gene fragments during their mobilization. [11] This process appears to be mediated by the acquisition of genic DNA residing between neighbouring Pack-TYPE transposons and its subsequent mobilization. [12]
Lastly, illegitimate recombination (IR) is another of the mechanisms through which exon shuffling occurs. IR is the recombination between short homologous sequences or nonhomologous sequences. [13]
There are two classes of IR. The first corresponds to errors of enzymes that cut and join DNA (i.e., DNases). This process is initiated by a replication protein that helps generate a primer for DNA synthesis. While one DNA strand is being synthesized, the other is being displaced. The process ends when the displaced strand is joined at its ends by the same replication protein. The second class of IR corresponds to the recombination of short homologous sequences that are not recognized by the previously mentioned enzymes. However, they can be recognized by non-specific enzymes that introduce cuts between the repeats. The ends are then removed by an exonuclease to expose the repeats. The repeats then anneal, and the resulting molecule is repaired using polymerase and ligase. [14]
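The outcome of the second class of IR can be sketched in Python as a toy string operation. This is a deliberately simplified illustration, assuming a single DNA strand written as a string, two exact copies of a short direct repeat, and a net result in which the sequence between the repeats is lost and one repeat copy remains. The function name and example sequence are hypothetical, and the enzymatic steps (cutting, exonuclease resection, annealing, repair) are collapsed into one step.

```python
def recombine_direct_repeats(seq: str, repeat: str) -> str:
    """Toy outcome of recombination between two short direct repeats.

    Finds the first two non-overlapping occurrences of `repeat` and
    returns the sequence with the intervening region and one repeat
    copy removed. If fewer than two copies exist, seq is returned as is.
    """
    first = seq.find(repeat)
    if first == -1:
        return seq
    second = seq.find(repeat, first + len(repeat))
    if second == -1:
        return seq
    # Keep everything upstream of the first repeat, then resume at the
    # second repeat, so exactly one repeat copy survives.
    return seq[:first] + seq[second:]

# Hypothetical example: "GATC" repeats flank a "TTTTT" segment.
before = "AAA" + "GATC" + "TTTTT" + "GATC" + "CCC"
after = recombine_direct_repeats(before, "GATC")
# after is "AAAGATCCCC"
```

If the lost segment carried an exon, or if the repeats lay in two different genes on the same molecule, the same kind of event could juxtapose previously separate exons, which is how illegitimate recombination can contribute to exon shuffling.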
An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. The term exon refers to both the DNA sequence within a gene and to the corresponding sequence in RNA transcripts. In RNA splicing, introns are removed and exons are covalently joined to one another as part of generating the mature RNA. Just as the entire set of genes for a species constitutes the genome, the entire set of exons constitutes the exome.
In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA. The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as regulatory sequences, and often a substantial fraction of junk DNA with no evident function. Almost all eukaryotes have mitochondria and a small mitochondrial genome. Algae and plants also contain chloroplasts with a chloroplast genome.
An intron is any nucleotide sequence within a gene that is not expressed or operative in the final RNA product. The word intron is derived from the term intragenic region, i.e., a region inside a gene. The term intron refers to both the DNA sequence within a gene and the corresponding RNA sequence in RNA transcripts. The non-intron sequences that become joined by this RNA processing to form the mature RNA are called exons.
Retroposons are repetitive DNA fragments that are inserted into chromosomes after being reverse transcribed from any RNA molecule.
A transposable element is a nucleic acid sequence in DNA that can change its position within a genome, sometimes creating or reversing mutations and altering the cell's genetic identity and genome size. Transposition often results in duplication of the same genetic material. In the human genome, L1 and Alu elements are two examples. Barbara McClintock's discovery of transposable elements earned her a Nobel Prize in 1983. Their importance in personalized medicine is becoming increasingly relevant, and they are gaining more attention in data analytics given the difficulty of analysis in very high dimensional spaces.
Retrotransposons are a type of transposon that copy and paste themselves into different genomic locations by converting an RNA transposition intermediate back into DNA through reverse transcription.
An intergenic region is a stretch of DNA sequences located between genes. Intergenic regions may contain functional elements and junk DNA. Intergenic regions should not be confused with intragenic regions, which are non-coding regions that are found within genes, especially within the genes of eukaryotic organisms.
P elements are transposable elements that were discovered in Drosophila as the causative agents of genetic traits called hybrid dysgenesis. The transposon is responsible for the P trait of the P element and it is found only in wild flies. They are also found in many other eukaryotes.
Mobile genetic elements (MGEs), sometimes called selfish genetic elements, are a type of genetic material that can move around within a genome, or that can be transferred from one species or replicon to another. MGEs are found in all organisms. In humans, approximately 50% of the genome is thought to be MGEs. MGEs play a distinct role in evolution. Gene duplication events can happen through the mechanism of MGEs. MGEs can also cause mutations in protein-coding regions, which alter protein function. These mechanisms can also rearrange genes in the host genome, generating variation, and can increase fitness by conferring new or additional functions. An example of MGEs in an evolutionary context is that virulence factors and antibiotic resistance genes carried by MGEs can be transferred to neighboring bacteria. However, MGEs can also decrease fitness by introducing disease-causing alleles or mutations. The set of MGEs in an organism is called a mobilome, which is composed of a large number of plasmids, transposons and viruses.
Transposon mutagenesis, or transposition mutagenesis, is a biological process that allows genes to be transferred to a host organism's chromosome, interrupting or modifying the function of an extant gene on the chromosome and causing mutation. Transposon mutagenesis is much more effective than chemical mutagenesis, with a higher mutation frequency and a lower chance of killing the organism. Other advantages include being able to induce single hit mutations, being able to incorporate selectable markers in strain construction, and being able to recover genes after mutagenesis. Disadvantages include the low frequency of transposition in living systems, and the inaccuracy of most transposition systems.
LTR retrotransposons are class I transposable elements characterized by the presence of long terminal repeats (LTRs) directly flanking an internal coding region. As retrotransposons, they mobilize through reverse transcription of their mRNA and integration of the newly created cDNA into another location. Their mechanism of retrotransposition is shared with retroviruses, with the difference that most LTR retrotransposons do not form infectious particles that leave the cells and therefore only replicate inside their genome of origin. Those that do (occasionally) form virus-like particles are classified under Ortervirales.
Helitrons are one of the three groups of eukaryotic class 2 transposable elements (TEs) so far described. They are the eukaryotic rolling-circle transposable elements which are hypothesized to transpose by a rolling circle replication mechanism via a single-stranded DNA intermediate. They were first discovered in plants and in the nematode Caenorhabditis elegans, and now they have been identified in a diverse range of species, from protists to mammals. Helitrons make up a substantial fraction of many genomes where non-autonomous elements frequently outnumber the putative autonomous partner. Helitrons seem to have a major role in the evolution of host genomes. They frequently capture diverse host genes, some of which can evolve into novel host genes or become essential for Helitron transposition.
A conserved non-coding sequence (CNS) is a DNA sequence of noncoding DNA that is evolutionarily conserved. These sequences are of interest for their potential to regulate gene production.
Genome evolution is the process by which a genome changes in structure (sequence) or size over time. The study of genome evolution involves multiple fields such as structural analysis of the genome, the study of genomic parasites, gene and ancient genome duplications, polyploidy, and comparative genomics. Genome evolution is a constantly changing and evolving field due to the steadily growing number of sequenced genomes, both prokaryotic and eukaryotic, available to the scientific community and the public at large.
Periannan Senapathy is a molecular biologist, geneticist, author and entrepreneur. He is the founder, president and chief scientific officer at Genome International Corporation, a biotechnology, bioinformatics, and information technology firm based in Madison, Wisconsin, which develops computational genomics applications of next-generation DNA sequencing (NGS) and clinical decision support systems for analyzing patient genome data that aids in diagnosis and treatment of diseases.
Long interspersed nuclear elements (LINEs) are a group of non-LTR retrotransposons that are widespread in the genome of many eukaryotes. LINEs contain an internal Pol II promoter to initiate transcription into mRNA, and encode one or two proteins, ORF1 and ORF2. The functional domains present within ORF1 vary greatly among LINEs, but often exhibit RNA/DNA binding activity. ORF2 is essential to successful retrotransposition, and encodes a protein with both reverse transcriptase and endonuclease activity.
Transposable elements are short strands of repetitive DNA that can self-replicate and translocate within the eukaryotic genome, and are generally perceived as parasitic in nature. Their transcription can lead to the production of dsRNAs, which resemble retroviral transcripts. While most host cellular RNA has a single, unpaired sense strand, dsRNA possesses sense and anti-sense transcripts paired together, and this difference in structure allows a host organism to detect dsRNA production, and thereby the presence of transposons. Plants lack distinct divisions between somatic cells and reproductive cells, and also generally have larger genomes than animals, making them an intriguing case-study kingdom for better understanding the epigenetic function of transposable elements.
Short interspersed nuclear elements (SINEs) are non-autonomous, non-coding transposable elements (TEs) that are about 100 to 700 base pairs in length. They are a class of retrotransposons, DNA elements that amplify themselves throughout eukaryotic genomes, often through RNA intermediates. SINEs compose about 13% of the mammalian genome.
DNA transposons are DNA sequences, sometimes referred to as "jumping genes", that can move and integrate into different locations within the genome. They are class II transposable elements (TEs) that move through a DNA intermediate, as opposed to class I TEs, retrotransposons, that move through an RNA intermediate. DNA transposons can move in the DNA of an organism via a single- or double-stranded DNA intermediate. DNA transposons have been found in both prokaryotic and eukaryotic organisms. They can make up a significant portion of an organism's genome, particularly in eukaryotes. In prokaryotes, TEs can facilitate the horizontal transfer of antibiotic resistance genes or other genes associated with virulence. After replicating and propagating in a host, all transposon copies become inactivated and are lost unless the transposon passes to another genome, starting a new life cycle through horizontal transfer. DNA transposons do not insert themselves randomly into the genome, but rather show a preference for specific sites.
The split gene theory is a theory of the origin of introns, long non-coding sequences in eukaryotic genes between the exons. The theory holds that the randomness of primordial DNA sequences would only permit small (< 600 bp) open reading frames (ORFs), and that important intron structures and regulatory sequences are derived from stop codons. In this introns-first framework, the spliceosomal machinery and the nucleus evolved due to the necessity to join these ORFs into larger proteins, and intronless bacterial genes are less ancestral than the split eukaryotic genes. The theory originated with Periannan Senapathy.