Coding region

The coding region of a gene, also known as the coding DNA sequence (CDS), is the portion of a gene's DNA or RNA that codes for a protein. [1] Comparing the length, composition, regulation, splicing, structure, and function of coding regions with those of non-coding regions, across species and over time, can reveal a great deal about gene organization and the evolution of prokaryotes and eukaryotes. [2] This can further assist in mapping the human genome and developing gene therapy. [3]

Definition

Although the term is sometimes used interchangeably with exon, the two are not the same: exons comprise the coding region together with the 3' and 5' untranslated regions of the RNA, so an exon is only partially made up of coding sequence. The 3' and 5' untranslated regions of the RNA, which do not code for protein, are termed non-coding regions and are not discussed on this page. [4]

Coding regions are also often confused with exomes, but the terms are distinct. The exome refers to all exons within a genome, whereas a coding region refers to a single section of DNA or RNA that codes for a specific protein.

History

In 1978, Walter Gilbert published "Why genes in pieces?", which first explored the idea that the gene is a mosaic: each full nucleic acid strand is not coded continuously but is interrupted by "silent" non-coding regions. This was the first indication that a distinction was needed between the parts of the genome that code for protein, now called coding regions, and those that do not. [5]

Composition

Point mutation types: transitions (blue) are elevated compared to transversions (red) in GC-rich coding regions.

The evidence suggests a general interdependence between base composition patterns and coding region availability. [6] The coding region is thought to have a higher GC-content than non-coding regions. Further research found that the longer the coding strand, the higher its GC-content. Short coding strands are comparatively GC-poor, similar to the low GC-content of the translational stop codons TAG, TAA, and TGA. [7]
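GC-content is simply the fraction of bases in a sequence that are G or C. The following is a minimal Python sketch of that calculation; the function name `gc_content` is an illustrative choice, not something taken from the cited studies, and it is applied here to the three stop codons named above.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA or RNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence is empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


# GC-content of the three translational stop codons mentioned in the text.
stop_codons = ["TAG", "TAA", "TGA"]
stop_gc = {codon: gc_content(codon) for codon in stop_codons}

for codon, fraction in stop_gc.items():
    print(f"{codon}: {fraction:.2f}")
```

Each stop codon contains at most one G or C out of three bases, which is what "GC-poor" means in this context.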

GC-rich areas are also where the ratio of point mutation types shifts slightly: there are more transitions, which are changes from purine to purine or pyrimidine to pyrimidine, than transversions, which are changes from purine to pyrimidine or pyrimidine to purine. Transitions are less likely to change the encoded amino acid and more likely to remain silent mutations (especially if they occur in the third nucleotide of a codon), which is usually beneficial to the organism during translation and protein formation. [8]
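The purine/pyrimidine rule above can be written down directly. This is a small Python sketch, assuming DNA bases only and two aligned sequences of equal length; `substitution_type` and `count_ts_tv` are hypothetical helper names introduced for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are identical; no substitution")
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(reference: str, variant: str) -> tuple[int, int]:
    """Count (transitions, transversions) between two aligned, equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned and of equal length")
    transitions = transversions = 0
    for ref, alt in zip(reference.upper(), variant.upper()):
        if ref == alt:
            continue
        if substitution_type(ref, alt) == "transition":
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions
```

For example, `count_ts_tv("AAAA", "GGCA")` counts two transitions (A to G) and one transversion (A to C).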

This indicates that essential coding regions (gene-rich) are higher in GC-content and more stable and resistant to mutation compared to accessory and non-essential regions (gene-poor). [9] However, it is still unclear whether this came about through neutral and random mutation or through a pattern of selection. [10] There is also debate on whether the methods used, such as gene windows, to ascertain the relationship between GC-content and coding region are accurate and unbiased. [11]

Structure and function

Transcription: RNA Polymerase (RNAP) uses a template DNA strand and begins coding at the promoter sequence (green) and ends at the terminator sequence (red) in order to encompass the entire coding region into the pre-mRNA (teal). The pre-mRNA is polymerised 5' to 3' and the template DNA read 3' to 5'.
An electron-micrograph of DNA strands decorated by hundreds of RNAP molecules too small to be resolved. Each RNAP is transcribing an RNA strand, which can be seen branching off from the DNA. "Begin" indicates the 3' end of the DNA, where RNAP initiates transcription; "End" indicates the 5' end, where the longer RNA molecules are completely transcribed.

In DNA, the coding region is flanked by the promoter sequence on the 5' end of the coding strand (the 3' end of the template strand, which is read 3' to 5') and the termination sequence on the opposite end. During transcription, the RNA Polymerase (RNAP) binds to the promoter sequence and moves along the template strand to the coding region. RNAP then adds RNA nucleotides complementary to the template strand in order to form the mRNA, substituting uracil in place of thymine. [12] This continues until the RNAP reaches the termination sequence. [12]
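The base-pairing step of transcription can be sketched in a few lines of Python. The sketch assumes the template strand is supplied as a string written 3' to 5' (the direction in which it is read), so the returned mRNA reads 5' to 3'; promoter and terminator recognition are not modelled, and the function names are illustrative only.

```python
# DNA template base -> complementary RNA base (uracil pairs with adenine).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') complementary to a template strand given 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5.upper())


def coding_strand_to_mrna(coding_5_to_3: str) -> str:
    """Equivalent shortcut: the mRNA matches the coding strand with T replaced by U."""
    return coding_5_to_3.upper().replace("T", "U")


template = "TACGGCATT"   # template strand, written 3'->5'
coding = "ATGCCGTAA"     # the matching coding strand, written 5'->3'
mrna = transcribe(template)
print(mrna)
```

Both routes give the same mRNA, which is why sequence databases usually report the coding strand rather than the template strand.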

After transcription and maturation, the mature mRNA encompasses multiple parts important for its eventual translation into protein. The coding region in an mRNA is flanked by the 5' untranslated region (5'-UTR) and 3' untranslated region (3'-UTR), [1] the 5' cap, and the poly(A) tail. During translation, the ribosome facilitates the attachment of the tRNAs to the coding region, three nucleotides (one codon) at a time. [13] The tRNAs transfer their associated amino acids to the growing polypeptide chain, eventually forming the protein defined in the initial DNA coding region.
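Reading a coding region codon by codon can be illustrated with a short Python sketch. It assumes the standard genetic code and an input that already starts at the start codon; the compact 64-letter string is a common way of writing the standard codon table in U, C, A, G order, and `translate` is an illustrative name rather than a reference implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_region: str) -> str:
    """Translate an mRNA coding region into one-letter amino acids, stopping at a stop codon."""
    rna = coding_region.upper().replace("T", "U")  # accept DNA spelling as well
    protein = []
    for i in range(0, len(rna) - 2, 3):            # step through whole codons only
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("AUGGCCAAAUGA"))
```

The example codons AUG, GCC, AAA, and UGA correspond to methionine, alanine, lysine, and a stop signal.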

The coding region (teal) is flanked by untranslated regions, the 5' cap, and the poly(A) tail which together form the mature mRNA.

Regulation

The coding region can be modified in order to regulate gene expression.

Alkylation is one form of regulation of the coding region. [15] The gene that would have been transcribed can be silenced by targeting a specific sequence. The bases in this sequence would be blocked using alkyl groups, which create the silencing effect. [16]

Regulation of gene expression controls the abundance of RNA or protein made in a cell. This regulation can itself be controlled by a regulatory sequence located before the start of the open reading frame in a strand of DNA. The regulatory sequence determines where and when a protein coding region is expressed. [17]

RNA splicing ultimately determines which part of the sequence is translated and expressed; the process involves cutting out introns and joining exons together. Where the spliceosome cuts is guided by the recognition of splice sites, in particular the 5' splice site, which is one of the substrates for the first step in splicing. [18] The coding regions lie within the exons, which become covalently joined together to form the mature messenger RNA.
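The outcome of splicing, as opposed to its mechanism, can be shown with a toy Python sketch. It assumes the exon coordinates are already known and given as 0-based, half-open `(start, end)` pairs; recognising splice sites, which is the hard part described above, is not attempted. The sequence and the `splice` helper are made up for illustration.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the given exon intervals (0-based, half-open) in order, dropping the introns."""
    ordered = sorted(exons)
    for (_, end_a), (start_b, _) in zip(ordered, ordered[1:]):
        if start_b < end_a:
            raise ValueError("exon intervals overlap")
    return "".join(pre_mrna[start:end] for start, end in ordered)


# Toy transcript: exon 1, one intron (beginning GU and ending AG), then exon 2.
exon_1 = "AUGGCC"
intron = "GUAAGUUUUCAG"
exon_2 = "AAAUGA"
pre_mrna = exon_1 + intron + exon_2

mature = splice(pre_mrna, [(0, 6), (18, 24)])
print(mature)
```

Here the two exons are joined into a single continuous reading frame once the intron is removed.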

Mutations

Mutations in the coding region can have very diverse effects on the phenotype of the organism. While some mutations in this region of DNA/RNA can result in advantageous changes, others can be harmful and sometimes even lethal to an organism's survival. In contrast, changes in the non-coding region may not always result in detectable changes in phenotype.

Mutation types

Examples of the various forms of point mutations that may exist within coding regions. Such alterations may or may not have phenotypic changes, depending on whether or not they code for different amino acids during translation.

Several forms of mutation can occur in coding regions. In a silent mutation, a change in nucleotides does not result in any change in amino acid after transcription and translation. [20] In a nonsense mutation, a base alteration in the coding region creates a premature stop codon, producing a shorter final protein. Point mutations, or single base pair changes in the coding region, that code for a different amino acid during translation are called missense mutations. Other types include frameshift mutations, which are caused by insertions or deletions that shift the reading frame. [20]
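These categories can be distinguished mechanically by comparing a reference coding sequence with a mutated one. The following Python sketch assumes the standard genetic code, DNA input that starts in frame, and a simple length comparison to detect indels; the label "in-frame indel" and the name `classify_mutation` are additions made for the example, and a lost stop codon is not separated from a missense change.

```python
from itertools import product

# Standard genetic code for DNA codons in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_mutation(ref_cds: str, alt_cds: str) -> str:
    """Label the effect of a mutated coding sequence relative to a reference."""
    ref_cds, alt_cds = ref_cds.upper(), alt_cds.upper()
    length_change = abs(len(ref_cds) - len(alt_cds))
    if length_change:
        # Indels whose length is not a multiple of three shift the reading frame.
        return "frameshift" if length_change % 3 else "in-frame indel"
    effect = "no change"
    for i in range(0, len(ref_cds) - 2, 3):
        ref_codon, alt_codon = ref_cds[i:i + 3], alt_cds[i:i + 3]
        if ref_codon == alt_codon:
            continue
        ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
        if alt_aa == "*" and ref_aa != "*":
            return "nonsense"          # premature stop codon
        if alt_aa != ref_aa:
            effect = "missense"        # different amino acid
        elif effect == "no change":
            effect = "silent"          # codon changed, amino acid did not
    return effect


reference = "ATGGAAGGCTAA"  # Met-Glu-Gly-stop
print(classify_mutation(reference, "ATGGAGGGCTAA"))  # GAA -> GAG
print(classify_mutation(reference, "ATGGTAGGCTAA"))  # GAA -> GTA
print(classify_mutation(reference, "ATGTAAGGCTAA"))  # GAA -> TAA
print(classify_mutation(reference, "ATGGAGGCTAA"))   # one base deleted
```

The four calls correspond, in order, to the silent, missense, nonsense, and frameshift cases described in the paragraph.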

Formation

Some mutations are hereditary (germline mutations), that is, passed on from a parent to its offspring. [21] Such mutated coding regions are present in all cells within the organism. Other mutations are acquired during an organism's lifetime (somatic mutations) and may not be the same from cell to cell. [21] These changes can be caused by mutagens, carcinogens, or other environmental agents (e.g. UV radiation). Acquired mutations can also result from copying errors during DNA replication and are not passed down to offspring. Changes in the coding region can also be de novo (new); such changes are thought to occur shortly after fertilization, resulting in a mutation present in the offspring's DNA while being absent in both the sperm and egg cells. [21]

Prevention

Multiple mechanisms exist to limit the harm caused by deleterious mutations in the coding region. These include proofreading by some DNA polymerases during replication, mismatch repair following replication, [22] and the degeneracy of the third base within an mRNA codon, as described by the 'wobble hypothesis'. [23]

Constrained coding regions (CCRs)

While it is well known that the genome of one individual can differ extensively from that of another, recent research has found that some coding regions are highly constrained, or resistant to mutation, between individuals of the same species. This is similar to the concept of interspecies constraint in conserved sequences. Researchers termed these highly constrained sequences constrained coding regions (CCRs), and have also found that such regions may be under strong purifying selection. On average, there is approximately 1 protein-altering mutation every 7 coding bases, but some CCRs have over 100 consecutive bases with no observed protein-altering mutations, and some lack even synonymous mutations. [24] These patterns of constraint between genomes may provide clues to the sources of rare developmental diseases or potentially even embryonic lethality. Clinically validated variants and de novo mutations in CCRs have previously been linked to disorders such as infantile epileptic encephalopathy, developmental delay and severe heart disease. [24]

Coding sequence detection

Further information: Karyotype

Schematic karyogram of a human, showing an overview of the human genome on G banding (which includes Giemsa-staining), wherein coding DNA regions occur to a greater extent in lighter (GC rich) regions.

While identification of open reading frames within a DNA sequence is straightforward, identifying coding sequences is not, because the cell translates only a subset of all open reading frames to proteins. [26] Currently CDS prediction uses sampling and sequencing of mRNA from cells, although there is still the problem of determining which parts of a given mRNA are actually translated to protein. CDS prediction is a subset of gene prediction, the latter also including prediction of DNA sequences that code not only for protein but also for other functional elements such as RNA genes and regulatory sequences.
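The mechanical part of this task, listing open reading frames, can be sketched in Python as follows. The sketch assumes ATG as the only start codon and the three standard stop codons, scans the three frames of each strand, and reports coordinates relative to the strand being scanned; `find_orfs` and `min_codons` are hypothetical names. Its output is only a list of candidates, since nothing in it tells which ORFs a cell actually translates.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[str, int, int, str]]:
    """List candidate ORFs as (strand, start, end, sequence) over all six reading frames.

    Coordinates are 0-based, half-open, and relative to the scanned strand
    (for "-" that is the reverse complement, not the original string).
    """
    orfs = []
    for strand, dna in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                       # first start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, dna[start:i + 3]))
                    start = None                    # the stop codon closes the ORF
    return orfs


sequence = "CCATGGCCAAATGACC"
for orf in find_orfs(sequence):
    print(orf)
```

In this toy sequence only one frame on the forward strand contains a start codon followed by an in-frame stop codon.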

Gene overlap occurs in both prokaryotes and eukaryotes, and relatively often in both DNA and RNA viruses, where it is an evolutionary advantage that reduces genome size while retaining the ability to produce various proteins from the available coding regions. [27] [28] For both DNA and RNA, pairwise alignments can detect overlapping coding regions, including short open reading frames in viruses, but they require a known coding strand against which to compare the potential overlapping coding strand. [29] An alternative method that uses single genome sequences does not need multiple genome sequences for comparison, but requires an overlap of at least 50 nucleotides in order to be sensitive. [30]

References

  1. Twyman, Richard (1 August 2003). "Gene Structure". The Wellcome Trust. Archived from the original on 28 March 2007. Retrieved 6 April 2003.
  2. Höglund M, Säll T, Röhme D (February 1990). "On the origin of coding sequences from random open reading frames". Journal of Molecular Evolution. 30 (2): 104–108. Bibcode:1990JMolE..30..104H. doi:10.1007/bf02099936. ISSN 0022-2844. S2CID 5978109.
  3. Sakharkar MK, Chow VT, Kangueane P (2004). "Distributions of exons and introns in the human genome". In Silico Biology. 4 (4): 387–93. PMID 15217358.
  4. Parnell, Laurence D. (2012-01-01). "Advances in Technologies and Study Design". In Bouchard, C.; Ordovas, J. M. (eds.). Recent Advances in Nutrigenetics and Nutrigenomics. Vol. 108. Academic Press. pp. 17–50. doi:10.1016/B978-0-12-398397-8.00002-2. ISBN 9780123983978. PMID 22656372. Retrieved 2019-11-07.
  5. Gilbert W (February 1978). "Why genes in pieces?". Nature. 271 (5645): 501. Bibcode:1978Natur.271..501G. doi:10.1038/271501a0. PMID 622185. S2CID 4216649.
  6. Lercher MJ, Urrutia AO, Pavlícek A, Hurst LD (October 2003). "A unification of mosaic structures in the human genome". Human Molecular Genetics. 12 (19): 2411–5. doi:10.1093/hmg/ddg251. PMID 12915446.
  7. Oliver JL, Marín A (September 1996). "A relationship between GC content and coding-sequence length". Journal of Molecular Evolution. 43 (3): 216–23. Bibcode:1996JMolE..43..216O. doi:10.1007/pl00006080. PMID 8703087.
  8. "ROSALIND | Glossary | Gene coding region". rosalind.info. Retrieved 2019-10-31.
  9. Vinogradov AE (April 2003). "DNA helix: the importance of being GC-rich". Nucleic Acids Research. 31 (7): 1838–44. doi:10.1093/nar/gkg296. PMC 152811. PMID 12654999.
  10. Bohlin J, Eldholm V, Pettersson JH, Brynildsrud O, Snipen L (February 2017). "The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes". BMC Genomics. 18 (1): 151. doi:10.1186/s12864-017-3543-7. PMC 5303225. PMID 28187704.
  11. Sémon M, Mouchiroud D, Duret L (February 2005). "Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance". Human Molecular Genetics. 14 (3): 421–7. doi:10.1093/hmg/ddi038. PMID 15590696.
  12. Overview of transcription. (n.d.). Retrieved from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription .
  13. Clancy, Suzanne (2008). "Translation: DNA to mRNA to Protein". Scitable: By Nature Education.
  14. Plociam (2005-08-08), The structure of a mature eukaryotic mRNA. A fully processed mRNA includes the 5' cap, 5' UTR, coding region, 3' UTR, and poly(A) tail, retrieved 2019-11-19
  15. Shinohara K, Sasaki S, Minoshima M, Bando T, Sugiyama H (2006-02-13). "Alkylation of template strand of coding region causes effective gene silencing". Nucleic Acids Research. 34 (4): 1189–95. doi:10.1093/nar/gkl005. PMC 1383623. PMID 16500890.
  16. "DNA alkylation Gene Ontology Term (GO:0006305)". www.informatics.jax.org. Retrieved 2019-10-30.
  17. Shafee T, Lowe R (2017). "Eukaryotic and prokaryotic gene structure". WikiJournal of Medicine. 4 (1). doi:10.15347/wjm/2017.002.
  18. Konarska MM (1998). "Recognition of the 5' splice site by the spliceosome". Acta Biochimica Polonica. 45 (4): 869–81. doi:10.18388/abp.1998_4346. PMID 10397335.
  19. Jonsta247 (2013-05-10), Example of silent mutation, retrieved 2019-11-19
  20. Yang, J. (2016, March 23). What are Genetic Mutation? Retrieved from https://www.singerinstruments.com/resource/what-are-genetic-mutation/ .
  21. What is a gene mutation and how do mutations occur? - Genetics Home Reference - NIH. (n.d.). Retrieved from https://ghr.nlm.nih.gov/primer/mutationsanddisorders/genemutation .
  22. "DNA proofreading and repair (article)". Khan Academy. Retrieved 2023-05-22.
  23. Peretó J. (2011) Wobble Hypothesis (Genetics). In: Gargaud M. et al. (eds) Encyclopedia of Astrobiology. Springer, Berlin, Heidelberg
  24. Havrilla, J. M., Pedersen, B. S., Layer, R. M., & Quinlan, A. R. (2018). A map of constrained coding regions in the human genome. Nature Genetics, 88–95. doi:10.1101/220814
  25. Romiguier J, Roux C (2017). "Analytical Biases Associated with GC-Content in Molecular Evolution". Front Genet. 8: 16. doi:10.3389/fgene.2017.00016. PMC 5309256. PMID 28261263.
  26. Furuno M, Kasukawa T, Saito R, Adachi J, Suzuki H, Baldarelli R, et al. (June 2003). "CDS annotation in full-length cDNA sequence". Genome Research. 13 (6B). Cold Spring Harbor Laboratory Press: 1478–87. doi:10.1101/gr.1060303. PMC 403693. PMID 12819146.
  27. Rogozin IB, Spiridonov AN, Sorokin AV, Wolf YI, Jordan IK, Tatusov RL, Koonin EV (May 2002). "Purifying and directional selection in overlapping prokaryotic genes". Trends in Genetics. 18 (5): 228–32. doi:10.1016/S0168-9525(02)02649-5. PMID 12047938.
  28. Chirico N, Vianelli A, Belshaw R (December 2010). "Why genes overlap in viruses". Proceedings. Biological Sciences. 277 (1701): 3809–17. doi:10.1098/rspb.2010.1052. PMC 2992710. PMID 20610432.
  29. Firth AE, Brown CM (February 2005). "Detecting overlapping coding sequences with pairwise alignments". Bioinformatics. 21 (3): 282–92. doi:10.1093/bioinformatics/bti007. PMID 15347574.
  30. Schlub TE, Buchmann JP, Holmes EC (October 2018). Malik H (ed.). "A Simple Method to Detect Candidate Overlapping Genes in Viruses Using Single Genome Sequences". Molecular Biology and Evolution. 35 (10): 2572–2581. doi:10.1093/molbev/msy155. PMC 6188560. PMID 30099499.