Chargaff's rules

Figure: A diagram of DNA base pairing, demonstrating the basis for Chargaff's rules.

Chargaff's rules state that in the DNA of any species and any organism, the amount of guanine should equal the amount of cytosine, and the amount of adenine should equal the amount of thymine. It follows that purine and pyrimidine bases should occur in a 1:1 stoichiometric ratio (i.e., A+G=T+C). This pattern is found in both strands of the DNA. The rules were discovered by the Austrian-born chemist Erwin Chargaff [1] [2] in the late 1940s.

Definitions

First parity rule

The first rule holds that a double-stranded DNA molecule, taken as a whole, shows base-percentage equality: A% = T% and G% = C%. The rigorous validation of this rule forms the basis of Watson–Crick base pairing in the DNA double helix model.

Second parity rule

The second rule holds that both A% ≈ T% and G% ≈ C% are valid for each of the two DNA strands. [3] This describes only a global feature of the base composition in a single DNA strand. [4]
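The two rules can be contrasted on a concrete sequence. The following is a minimal sketch in Python, assuming plain A/C/G/T strings; the function names and the toy sequence are hypothetical and chosen only for illustration. For the duplex the equalities are exact by construction, while for a single strand they hold only approximately and only for long sequences.

```python
from collections import Counter

def base_percentages(strand):
    """Return the percentage of A, C, G and T in a sequence (other symbols are ignored)."""
    counts = Counter(base for base in strand.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}

def complement_strand(strand):
    """Return the reverse complement, i.e. the partner strand read 5' to 3'."""
    return strand.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Hypothetical toy sequence; far too short to show the second rule reliably.
top = "ATGCGCGTATTAGCCGATAACGCT"
bottom = complement_strand(top)

single = base_percentages(top)           # second rule: A% ~ T%, G% ~ C% (approximate)
duplex = base_percentages(top + bottom)  # first rule: A% = T%, G% = C% (exact)

print("single strand:", single)
print("duplex:       ", duplex)
```

In the duplex every A on one strand is matched by a T on the other, so the first rule is a direct consequence of base pairing; the single-strand percentages carry no such guarantee.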

Research

The second parity rule was discovered in 1968. [3] It states that, in single-stranded DNA, the number of adenine units is approximately equal to that of thymine (%A ≈ %T), and the number of cytosine units is approximately equal to that of guanine (%C ≈ %G).

The first empirical generalization of Chargaff's second parity rule, called the Symmetry Principle, was proposed by Vinayakumar V. Prabhu [5] in 1993. This principle states that the frequency of any given oligonucleotide is approximately equal to the frequency of its reverse-complement oligonucleotide. A theoretical generalization [6] was mathematically derived by Michel E. B. Yamagishi and Roberto H. Herai in 2011. [7]
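The Symmetry Principle can be examined by counting every oligonucleotide of length k in one strand and comparing each count with that of its reverse complement. The sketch below is one possible way to do this in Python; the helper names (`kmer_counts`, `symmetry_pairs`) and the toy sequence are assumptions made for illustration and are not taken from the cited papers.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def kmer_counts(strand, k):
    """Count overlapping k-mers in one strand, skipping windows with non-ACGT symbols."""
    strand = strand.upper()
    windows = (strand[i:i + k] for i in range(len(strand) - k + 1))
    return Counter(w for w in windows if set(w) <= set("ACGT"))

def symmetry_pairs(strand, k):
    """Map each k-mer (one representative per pair) to (its count, count of its reverse complement)."""
    counts = kmer_counts(strand, k)
    pairs = {}
    for kmer in counts:
        key = min(kmer, reverse_complement(kmer))
        pairs[key] = (counts[key], counts[reverse_complement(key)])
    return pairs

# Hypothetical toy input: a sequence joined to its own reverse complement,
# which makes the symmetry exact; real genomes satisfy it only approximately.
seed = "ATGCGTTAGC"
strand = seed + reverse_complement(seed)
for kmer, (n, n_rc) in sorted(symmetry_pairs(strand, 3).items()):
    print(kmer, n, reverse_complement(kmer), n_rc)
```

Setting k = 1 recovers the second parity rule itself, since the reverse complement of a single base is simply its complement.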

In 2006, it was shown that this rule applies to four [2] of the five types of double-stranded genomes; specifically, it applies to eukaryotic chromosomes, bacterial chromosomes, double-stranded DNA viral genomes, and archaeal chromosomes. [8] It does not apply to organellar genomes (mitochondria and plastids) smaller than about 20–30 kbp, nor does it apply to single-stranded DNA (viral) genomes or any type of RNA genome. The basis for this rule is still under investigation, although genome size may play a role.

Figure: Histogram showing how 20309 chromosomes adhere to Chargaff's second parity rule.

The rule itself has consequences. In most bacterial genomes (which are generally 80–90% coding), genes are arranged in such a fashion that approximately 50% of the coding sequence lies on either strand. Wacław Szybalski, in the 1960s, showed that in bacteriophage coding sequences purines (A and G) exceed pyrimidines (C and T). [9] This observation has since been confirmed in other organisms and should probably now be termed "Szybalski's rule". While Szybalski's rule generally holds, exceptions are known to exist. [10] [11] [12] The biological basis for Szybalski's rule is not yet known.
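Whether a given coding sequence follows Szybalski's rule can be checked by comparing its purine and pyrimidine counts. A minimal Python sketch follows; the function name `purine_excess` and the normalization by sequence length are illustrative assumptions rather than a definition from the cited work.

```python
def purine_excess(coding_sequence):
    """Return (purines - pyrimidines) / length for an A/C/G/T sequence.

    Positive values mean purines (A, G) outnumber pyrimidines (C, T),
    which is what Szybalski's rule predicts for coding sequences.
    """
    bases = [b for b in coding_sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T")
    purines = sum(b in "AG" for b in bases)
    pyrimidines = len(bases) - purines
    return (purines - pyrimidines) / len(bases)

# Hypothetical toy sequence, used only to show the call.
print(purine_excess("ATGGAAGGCAAAGAGTAA"))
```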

The combined effect of Chargaff's second rule and Szybalski's rule can be seen in bacterial genomes where the coding sequences are not equally distributed. The genetic code has 64 codons, of which 3 function as termination codons, whereas only 20 amino acids are normally present in proteins. (Two uncommon amino acids, selenocysteine and pyrrolysine, are found in a limited number of proteins and are encoded by the stop codons TGA and TAG respectively.) The mismatch between the number of codons and the number of amino acids allows several codons to code for a single amino acid; such codons normally differ only at the third codon base position.

Multivariate statistical analysis of codon use within genomes with unequal quantities of coding sequences on the two strands has shown that codon use in the third position depends on the strand on which the gene is located. This seems likely to be the result of Szybalski's and Chargaff's rules. Because of the asymmetry in pyrimidine and purine use in coding sequences, the strand with the greater coding content will tend to have the greater number of purine bases (Szybalski's rule). Because the number of purine bases will, to a very good approximation, equal the number of their complementary pyrimidines within the same strand and, because the coding sequences occupy 80–90% of the strand, there appears to be (1) a selective pressure on the third base to minimize the number of purine bases in the strand with the greater coding content; and (2) that this pressure is proportional to the mismatch in the length of the coding sequences between the two strands.

Figure: Chargaff's 2nd parity rule for prokaryotic 6-mers.

The deviation from Chargaff's second rule in organelles has been suggested to be a consequence of the mechanism of replication. [13] During replication the DNA strands separate. In single-stranded DNA, cytosine slowly and spontaneously deaminates to uracil, which pairs with adenine and is therefore replaced by thymine after replication (a C to T transition). The longer the strands are separated, the greater the amount of deamination. For reasons that are not yet clear, the strands tend to remain single-stranded longer in mitochondria than in chromosomal DNA. This process tends to yield one strand that is enriched in guanine (G) and thymine (T), with its complement enriched in cytosine (C) and adenine (A), and it may have given rise to the deviations found in mitochondria. [citation needed]

Chargaff's second rule appears to be the consequence of a more complex parity rule: within a single strand of DNA, any oligonucleotide (k-mer or n-gram; length ≤ 10) is present in approximately equal numbers to its reverse-complement oligonucleotide. Because of the computational requirements, this has not been verified in all genomes for all oligonucleotides. It has been verified for triplet oligonucleotides for a large data set. [14] Albrecht-Buehler has suggested that this rule is the consequence of genomes evolving by a process of inversion and transposition. [14] This process does not appear to have acted on the mitochondrial genomes. Chargaff's second parity rule also appears to extend from the nucleotide level to populations of codon triplets, in the case of the whole single-stranded human genome DNA. [15] A kind of "codon-level second Chargaff's parity rule" is proposed as follows:

Intra-strand relation among percentages of codon populations

| First codon | Second codon | Relation proposed | Details |
| --- | --- | --- | --- |
| Twx (1st base position is T) | yzA (3rd base position is A) | % Twx ≈ % yzA | Twx and yzA are mirror codons, e.g. TCG and CGA |
| Cwx (1st base position is C) | yzG (3rd base position is G) | % Cwx ≈ % yzG | Cwx and yzG are mirror codons, e.g. CTA and TAG |
| wTx (2nd base position is T) | yAz (2nd base position is A) | % wTx ≈ % yAz | wTx and yAz are mirror codons, e.g. CTG and CAG |
| wCx (2nd base position is C) | yGz (2nd base position is G) | % wCx ≈ % yGz | wCx and yGz are mirror codons, e.g. TCT and AGA |
| wxT (3rd base position is T) | Ayz (1st base position is A) | % wxT ≈ % Ayz | wxT and Ayz are mirror codons, e.g. CTT and AAG |
| wxC (3rd base position is C) | Gyz (1st base position is G) | % wxC ≈ % Gyz | wxC and Gyz are mirror codons, e.g. GGC and GCC |

Examples: counting codon populations over the whole human genome, using the first codon reading frame, gives:

36530115 TTT and 36381293 AAA (ratio = 1.00409); 2087242 TCG and 2085226 CGA (ratio = 1.00097); etc.
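The codon-level comparison amounts to counting non-overlapping triplets in a fixed reading frame and dividing each count by the count of its mirror (reverse-complement) codon. The Python sketch below shows one way this might be done; the function names and the `frame` parameter are assumptions for illustration and do not reproduce the method of the cited study. The last lines recompute the two ratios from the counts quoted above.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def mirror_codon(codon):
    """Return the reverse complement of a codon, e.g. TCG -> CGA."""
    return codon.upper().translate(COMPLEMENT)[::-1]

def codon_counts(strand, frame=0):
    """Count non-overlapping triplets of one strand, starting at offset `frame`."""
    strand = strand.upper()
    return Counter(strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))

def mirror_ratio(counts, codon):
    """Return count(codon) / count(mirror codon), or None if the mirror is absent."""
    mirror = mirror_codon(codon)
    if counts[mirror] == 0:
        return None
    return counts[codon] / counts[mirror]

# Ratios recomputed from the whole-genome counts quoted in the text.
ttt_aaa = 36530115 / 36381293
tcg_cga = 2087242 / 2085226
print(f"TTT/AAA = {ttt_aaa:.5f}, TCG/CGA = {tcg_cga:.5f}")
```

A ratio close to 1 for every codon and its mirror is what the proposed codon-level rule predicts.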

In 2020, it was suggested that the physical properties of double-stranded DNA (dsDNA), together with the tendency of all physical systems toward maximum entropy, are the cause of Chargaff's second parity rule. [16] On this view, the symmetries and patterns present in dsDNA sequences can emerge from the physical peculiarities of the dsDNA molecule and the maximum entropy principle alone, rather than from biological or environmental evolutionary pressure.

Percentages of bases in DNA

The following table is a representative sample of Erwin Chargaff's 1952 data, listing the base composition of DNA from various organisms; the data support both of Chargaff's rules. [17] A significant deviation of A/T and G/C from one, as in the bacteriophage φX174, is indicative of single-stranded DNA.

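| Organism | Taxon | %A | %G | %C | %T | A / T | G / C | %GC | %AT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Maize | Zea | 26.8 | 22.8 | 23.2 | 27.2 | 0.99 | 0.98 | 46.1 | 54.0 |
| Octopus | Octopus | 33.2 | 17.6 | 17.6 | 31.6 | 1.05 | 1.00 | 35.2 | 64.8 |
| Chicken | Gallus | 28.0 | 22.0 | 21.6 | 28.4 | 0.99 | 1.02 | 43.7 | 56.4 |
| Rat | Rattus | 28.6 | 21.4 | 20.5 | 28.4 | 1.01 | 1.00 | 42.9 | 57.0 |
| Human | Homo | 29.3 | 20.7 | 20.0 | 30.0 | 0.98 | 1.04 | 40.7 | 59.3 |
| Grasshopper | Orthoptera | 29.3 | 20.5 | 20.7 | 29.3 | 1.00 | 0.99 | 41.2 | 58.6 |
| Sea urchin | Echinoidea | 32.8 | 17.7 | 17.3 | 32.1 | 1.02 | 1.02 | 35.0 | 64.9 |
| Wheat | Triticum | 27.3 | 22.7 | 22.8 | 27.1 | 1.01 | 1.00 | 45.5 | 54.4 |
| Yeast | Saccharomyces | 31.3 | 18.7 | 17.1 | 32.9 | 0.95 | 1.09 | 35.8 | 64.4 |
| E. coli | Escherichia | 24.7 | 26.0 | 25.7 | 23.6 | 1.05 | 1.01 | 51.7 | 48.3 |
| φX174 | PhiX174 | 24.0 | 23.3 | 21.5 | 31.2 | 0.77 | 1.08 | 44.8 | 55.2 |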
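The A/T and G/C columns are simple quotients of the percentage columns, so a deviation from parity can be flagged mechanically. The Python sketch below uses three rows copied from the table; the tolerance of 0.1 is an arbitrary threshold chosen for illustration, not a value from Chargaff's work.

```python
# Rows copied from the table above, as (%A, %G, %C, %T).
COMPOSITION = {
    "Human": (29.3, 20.7, 20.0, 30.0),
    "E. coli": (24.7, 26.0, 25.7, 23.6),
    "PhiX174": (24.0, 23.3, 21.5, 31.2),
}

def parity_ratios(a, g, c, t):
    """Return the (A/T, G/C) ratios for one row of base percentages."""
    return a / t, g / c

def deviates_from_parity(a, g, c, t, tolerance=0.1):
    """Flag a row whose A/T or G/C ratio differs from 1 by more than `tolerance`."""
    at, gc = parity_ratios(a, g, c, t)
    return abs(at - 1.0) > tolerance or abs(gc - 1.0) > tolerance

flags = {name: deviates_from_parity(*row) for name, row in COMPOSITION.items()}
for name, row in COMPOSITION.items():
    at, gc = parity_ratios(*row)
    print(f"{name}: A/T = {at:.2f}, G/C = {gc:.2f}, deviates = {flags[name]}")
```

With this threshold only the φX174 row is flagged, consistent with its single-stranded genome.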

References

  1. Elson D, Chargaff E (1952). "On the deoxyribonucleic acid content of sea urchin gametes". Experientia. 8 (4): 143–145. doi:10.1007/BF02170221. PMID 14945441. S2CID 36803326.
  2. Chargaff E, Lipshitz R, Green C (1952). "Composition of the deoxypentose nucleic acids of four genera of sea-urchin". J Biol Chem. 195 (1): 155–160. doi:10.1016/S0021-9258(19)50884-5. PMID 14938364. S2CID 11358561.
  3. Rudner R, Karkas JD, Chargaff E (1968). "Separation of B. subtilis DNA into complementary strands. 3. Direct analysis". Proceedings of the National Academy of Sciences of the United States of America. 60 (3): 921–922. Bibcode:1968PNAS...60..921R. doi:10.1073/pnas.60.3.921. PMC 225140. PMID 4970114.
  4. Prabhu VV (1993). "Symmetry observation in long nucleotide sequences". Nucleic Acids Research. 21 (12): 2797–2800. doi:10.1093/nar/21.12.2797. PMC 309655. PMID 8332488.
  5. Yamagishi MEB (2017). Mathematical Grammar of Biology. SpringerBriefs in Mathematics. Springer. arXiv:1112.1528. doi:10.1007/978-3-319-62689-5. ISBN 978-3-319-62688-8. S2CID 16742066.
  6. Yamagishi ME, Herai RH (2011). Chargaff's "Grammar of Biology": New Fractal-like Rules. SpringerBriefs in Mathematics. arXiv:1112.1528. doi:10.1007/978-3-319-62689-5. ISBN 978-3-319-62688-8. S2CID 16742066.
  7. Mitchell D, Bridge R (2006). "A test of Chargaff's second rule". Biochem Biophys Res Commun. 340 (1): 90–94. doi:10.1016/j.bbrc.2005.11.160. PMID 16364245.
  8. Szybalski W, Kubinski H, Sheldrick O (1966). "Pyrimidine clusters on the transcribing strand of DNA and their possible role in the initiation of RNA synthesis". Cold Spring Harb Symp Quant Biol. 31: 123–127. doi:10.1101/SQB.1966.031.01.019. PMID 4966069.
  9. Cristillo AD (1998). Characterization of G0/G1 switch genes in cultured T lymphocytes. Kingston, Ontario, Canada: Queen's University.
  10. Bell SJ, Forsdyke DR (1999). "Deviations from Chargaff's second parity rule correlate with direction of transcription". J Theor Biol. 197 (1): 63–76. Bibcode:1999JThBi.197...63B. doi:10.1006/jtbi.1998.0858. PMID 10036208.
  11. Lao PJ, Forsdyke DR (2000). "Thermophilic Bacteria Strictly Obey Szybalski's Transcription Direction Rule and Politely Purine-Load RNAs with Both Adenine and Guanine". Genome Research. 10 (2): 228–236. doi:10.1101/gr.10.2.228. PMC 310832. PMID 10673280.
  12. Nikolaou C, Almirantis Y (2006). "Deviations from Chargaff's second parity rule in organellar DNA. Insights into the evolution of organellar genomes". Gene. 381: 34–41. doi:10.1016/j.gene.2006.06.010. PMID 16893615.
  13. Albrecht-Buehler G (2006). "Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversions and inverted transpositions". Proc Natl Acad Sci USA. 103 (47): 17828–17833. Bibcode:2006PNAS..10317828A. doi:10.1073/pnas.0605553103. PMC 1635160. PMID 17093051.
  14. Perez JC (2010). "Codon populations in single-stranded whole human genome DNA are fractal and fine-tuned by the Golden Ratio 1.618". Interdisciplinary Sciences: Computational Life Sciences. 2 (3): 228–240. doi:10.1007/s12539-010-0022-0. PMID 20658335. S2CID 54565279.
  15. Fariselli P, Taccioli C, Pagani L, Maritan A (2020). "DNA sequence symmetries from randomness: the origin of the Chargaff's second parity rule". Briefings in Bioinformatics. 22 (2): 2172–2181. doi:10.1093/bib/bbaa041. PMC 7986665. PMID 32266404.
  16. Bansal M (2003). "DNA structure: Revisiting the Watson-Crick double helix" (PDF). Current Science. 85 (11): 1556–1563. Archived from the original (PDF) on 2014-07-26. Retrieved 2013-07-26.

Further reading

  1. Hallin PF, Ussery D (2004). "CBS Genome Atlas Database: A dynamic storage for bioinformatic results and sequence data". Bioinformatics. 20 (18): 3682–3686. doi:10.1093/bioinformatics/bth423. PMID 15256401.