GC skew describes an over- or under-abundance of the nucleotides guanine and cytosine in a particular region of DNA or RNA. The term also refers to a statistical method for measuring strand-specific guanine overrepresentation. [1]
Under equilibrium conditions (without mutational or selective pressure and with nucleotides randomly distributed within the genome) there is an equal frequency of the four DNA bases (adenine, guanine, thymine, and cytosine) on both single strands of a DNA molecule. [2] However, in most bacteria (e.g. E. coli) and some archaea (e.g. Sulfolobus solfataricus), nucleotide compositions are asymmetric between the leading strand and the lagging strand: the leading strand contains more guanine (G) and thymine (T), whereas the lagging strand contains more adenine (A) and cytosine (C). [2] This phenomenon is referred to as GC and AT skew, and the corresponding statistics were defined [2] as:
GC skew = (G − C)/(G + C)
AT skew = (A − T)/(A + T)
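The two statistics above can be computed directly from base counts. The following is a minimal sketch in Python; the function names `skew`, `gc_skew`, and `at_skew` are illustrative, and returning 0 when neither base is present is an assumption made here, not part of the definition.

```python
from collections import Counter


def skew(seq: str, plus: str, minus: str) -> float:
    """Return (plus - minus) / (plus + minus) for two bases counted in seq."""
    counts = Counter(seq.upper())
    total = counts[plus] + counts[minus]
    if total == 0:
        # Assumption: report 0.0 when neither base occurs (ratio undefined).
        return 0.0
    return (counts[plus] - counts[minus]) / total


def gc_skew(seq: str) -> float:
    """GC skew = (G - C) / (G + C)."""
    return skew(seq, "G", "C")


def at_skew(seq: str) -> float:
    """AT skew = (A - T) / (A + T)."""
    return skew(seq, "A", "T")


# Toy sequence with 4 G, 1 C, 1 A, 3 T.
example = "GGGTTTGCA"
print(gc_skew(example))  # (4 - 1) / (4 + 1)
print(at_skew(example))  # (1 - 3) / (1 + 3)
```

A sequence with G but no C gives a GC skew of +1, and one with C but no G gives −1, which are the two extremes of the statistic.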
Erwin Chargaff's work in 1950 demonstrated that, in DNA, the bases guanine and cytosine were found in equal abundance, and the bases adenine and thymine were found in equal abundance. However, there was no equality between the amount of one pair versus the other. [3] Chargaff's finding is referred to as Chargaff's rule or parity rule 1. [3] Three years later, Watson and Crick used this fact in deriving the structure of DNA, their double helix model.
A natural consequence of parity rule 1 at equilibrium, when neither DNA strand is subject to mutational or selective bias and substitution rates are equal, is that each strand contains equal amounts of a given base and its complement. [4] In other words, within each DNA strand the frequency of T equals that of A and the frequency of G equals that of C, because the substitution rates are presumed equal. This is referred to as parity rule 2. Hence, the second parity rule holds only when there is no mutational or selective bias between the strands.
Any deviation from parity rule 2 results in an asymmetric base composition that distinguishes the leading strand (the DNA strand that is replicated in the forward direction) from the lagging strand. This asymmetry is referred to as GC or AT skew. [2]
In some bacterial genomes, the leading strand is enriched in guanine over cytosine and in thymine over adenine, and the reverse holds for the lagging strand. Each skew value ranges from −1, which corresponds to G = 0 (GC skew) or A = 0 (AT skew), to +1, which corresponds to C = 0 (GC skew) or T = 0 (AT skew). [2] A positive GC skew therefore indicates an excess of G over C, and a negative GC skew an excess of C over G. As a result, one expects a positive GC skew and a negative AT skew in the leading strand, and a negative GC skew and a positive AT skew in the lagging strand. [5] GC or AT skew changes sign at the boundaries of the two replichores, which correspond to the DNA replication origin or terminus. [2] [4] [5] Originally, this asymmetric nucleotide composition was attributed to the different mechanisms by which the leading and lagging strands are replicated. DNA replication is semi-conservative and is itself an asymmetric process. [6] This asymmetry is due to the formation of the replication fork and its division into nascent leading and lagging strands. The leading strand is synthesized continuously; the lagging strand, in contrast, is replicated as short polynucleotide fragments (Okazaki fragments), each synthesized in the 5' to 3' direction. [6]
There are three major approaches to calculate and graphically demonstrate GC skew and its properties.
The first approach is GC and AT asymmetry. [2] Jean R. Lobry was the first to report, in 1996, [7] the presence of compositional asymmetry in the genomes of three bacteria: E. coli, Bacillus subtilis, and Haemophilus influenzae. The original formulas at the time were not called skew, but rather deviation from [A] = [T] or [C] = [G]:
deviation from [A] = [T] as (A − T)/(A + T);
deviation from [C] = [G] as (C − G)/(C + G);
where A, T, G, and C represent the frequency of occurrence of the corresponding base in a sequence window of defined length. A sliding-window strategy is used to calculate the deviation from [C] = [G] along the genome. In these plots, a positive deviation from [C] = [G] corresponds to the lagging strand and a negative deviation corresponds to the leading strand. [8] Furthermore, the site where the deviation changes sign corresponds to the origin or terminus. The x-axis represents chromosome position plotted 5′ to 3′ and the y-axis represents the deviation value. The major weakness of this method is its dependence on window size: the choice of window size greatly affects the resulting plot. Other techniques should be combined with the deviation in order to locate the origin of DNA replication with greater accuracy.
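The sliding-window calculation can be sketched as follows. This is an illustrative Python example, not Lobry's original implementation; the function name `windowed_cg_deviation`, the `window` and `step` parameters, and the choice to report 0 for windows with no C or G are assumptions. It uses the sign convention given above, (C − G)/(C + G).

```python
def windowed_cg_deviation(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (window start, (C - G)/(C + G)) for each window along seq."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        # Assumption: a window with no C or G is reported as 0.0.
        deviation = (c - g) / (c + g) if (c + g) else 0.0
        values.append((start, deviation))
    return values


# Toy sequence: a G-rich half followed by a C-rich half.
toy = "G" * 10 + "C" * 10
for start, deviation in windowed_cg_deviation(toy, window=10, step=10):
    print(start, deviation)
```

In the toy sequence the deviation changes sign between the two windows, which is the kind of switch that marks an origin or terminus in a real genome. Rerunning with a different `window` changes how sharply that switch appears, which illustrates the window-size dependence noted above.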
The second approach is referred to as cumulative GC skew (CGC skew). [9] This method still uses the sliding window strategy, but it sums the skew values of adjacent windows starting from an arbitrary position. In this scheme, the entire genome is usually plotted 5' to 3' using an arbitrary start and an arbitrary strand. In the cumulative GC skew plot, the peaks correspond to the switch points (terminus or origin).
In contrast to Lobry's earlier paper, recent implementations of GC skew flip the original definition, redefining it to be:
GC skew = (G − C)/(G + C).
With the flipped definition of GC skew, the maximum value of the cumulative skew corresponds to the terminus, and the minimum value corresponds to the origin of replication.
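A cumulative GC skew under the flipped definition, (G − C)/(G + C), can be sketched as below. The function names `cumulative_gc_skew` and `switch_points`, the use of non-overlapping windows, and reporting each cumulative value at the end of its window are assumptions made for illustration.

```python
from itertools import accumulate


def cumulative_gc_skew(seq: str, window: int) -> list[float]:
    """Running sum of (G - C)/(G + C) over adjacent, non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Assumption: a window with no G or C contributes 0.0.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return list(accumulate(skews))


def switch_points(seq: str, window: int) -> tuple[int, int]:
    """Return (origin estimate, terminus estimate) as window end positions."""
    cum = cumulative_gc_skew(seq, window)
    i_min = min(range(len(cum)), key=cum.__getitem__)  # minimum -> origin
    i_max = max(range(len(cum)), key=cum.__getitem__)  # maximum -> terminus
    return (i_min + 1) * window, (i_max + 1) * window


# Toy "genome": C-rich, then G-rich, then C-rich again.
toy = "C" * 20 + "G" * 40 + "C" * 20
print(cumulative_gc_skew(toy, window=10))
print(switch_points(toy, window=10))
```

In the toy sequence the cumulative curve falls while C dominates, rises while G dominates, and falls again, so the minimum and maximum land at the two composition switches.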
The final approach is the Z curve. [10] Unlike the previous methods, this method does not use the sliding window strategy and is thought to perform better at finding the origin of replication. [10] In this method, each base's cumulative frequency with respect to the base at the beginning of the sequence is investigated. The Z curve uses a three-dimensional representation with the following parameters:
x_n = (A_n + G_n) − (C_n + T_n)
y_n = (A_n + C_n) − (G_n + T_n)
z_n = (A_n + T_n) − (C_n + G_n)
where A_n, C_n, G_n, and T_n are the cumulative counts of each base in the first n positions of the sequence. Here x_n represents the excess of purines over pyrimidines, y_n denotes the excess of amino over keto bases, and z_n shows the relationship between weak and strong hydrogen-bonding bases. The x_n and y_n components alone can detect the replication origin and the asymmetric composition of the strands. A combination of these methods should be used to predict the replication origin and terminus, in order to compensate for their individual weaknesses.
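The Z curve components can be computed in a single pass over the sequence. The sketch below assumes the standard Z-curve definition, with x_n = (A_n + G_n) − (C_n + T_n), y_n = (A_n + C_n) − (G_n + T_n), and z_n = (A_n + T_n) − (C_n + G_n), where A_n, C_n, G_n, and T_n are cumulative base counts over the first n positions. The function name `z_curve` and the choice to let ambiguous symbols such as N leave the counts unchanged are assumptions.

```python
def z_curve(seq: str) -> list[tuple[int, int, int]]:
    """Return the (x_n, y_n, z_n) point for every prefix length n of seq."""
    a = c = g = t = 0
    points = []
    for base in seq.upper():
        if base == "A":
            a += 1
        elif base == "C":
            c += 1
        elif base == "G":
            g += 1
        elif base == "T":
            t += 1
        # Assumption: any other symbol (e.g. N) leaves the counts unchanged.
        x = (a + g) - (c + t)  # purines minus pyrimidines
        y = (a + c) - (g + t)  # amino minus keto bases
        z = (a + t) - (c + g)  # weak minus strong hydrogen-bonding bases
        points.append((x, y, z))
    return points


for point in z_curve("AGCT"):
    print(point)
```

Because every prefix contributes a point, no window size has to be chosen, which is the practical difference from the two window-based approaches.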
There is a lack of consensus in the scientific community regarding the mechanism underlying the bias in nucleotide composition within each DNA strand. Two major schools of thought explain the mechanism behind strand-specific nucleotide composition in bacteria. [4]
The first describes a biased, asymmetric mutational pressure on each DNA strand during replication and transcription. [4] [11] Because the replication process is asymmetric, unequal mutation frequencies and DNA repair efficiencies during replication can introduce more mutations in one strand than in the other. [5] Furthermore, the time used for replication differs between the two strands and may lead to asymmetric mutational pressure between the leading and lagging strands. [12] In addition to mutations during DNA replication, transcriptional mutations can create strand-specific nucleotide composition skew. [5] Deamination of cytosine, and ultimately mutation of cytosine to thymine, in one DNA strand can increase the number of guanines and thymines relative to cytosines and adenines on that strand. [5] In most bacteria, the majority of genes are encoded on the leading strand. [4] For instance, the leading strand in Bacillus subtilis encodes 75% of the genes. [5] In addition, an excess of deamination and conversion of cytosine to thymine in the coding strand compared to the non-coding strand has been reported. [4] [5] [13] One possible explanation is that the non-transcribed strand (coding strand) is single-stranded during transcription and is therefore more vulnerable to deamination than the transcribed strand (non-coding strand). [5] [14] Another explanation is that deamination repair activity during transcription does not occur on the coding strand. [5] Only the transcribed strand benefits from these deamination repair events.
The second school of thought describes GC and AT skew as resulting from a difference in selective pressure between the leading and lagging strands. [4] [5] [14] Examination of prokaryotic genomes shows a preference at the third codon position for G over C and T over A. [5] This preference creates an asymmetric nucleotide composition if coding sequences are unequally distributed between the leading and lagging strands, as is the case in bacteria. In addition, highly transcribed genes, such as those for ribosomal proteins, have been shown to be located mostly on the leading strand in bacteria. [5] Therefore, a bias in third-position codon choice of G over C can lead to GC skew. Additionally, some signal sequences, such as chi sequences, are rich in guanine and thymine, and these sequences might occur more frequently in one strand than in the other. [4] [5]
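The third-codon-position preference can be checked on a single coding sequence by applying the GC skew formula only to every third base. The sketch below is illustrative: the function name `third_position_gc_skew` is hypothetical, and it assumes the input is an in-frame coding sequence that starts at the first base of a codon.

```python
def third_position_gc_skew(cds: str) -> float:
    """(G - C)/(G + C) counted only at third codon positions of an in-frame CDS."""
    # Assumption: cds starts at codon position 1, so indices 2, 5, 8, ...
    # are the third positions.
    third = cds.upper()[2::3]
    g = third.count("G")
    c = third.count("C")
    # Assumption: report 0.0 when no third position holds G or C.
    return (g - c) / (g + c) if (g + c) else 0.0


# Toy CDS with codons ATG, GCG, AAC: third positions are G, G, C.
print(third_position_gc_skew("ATGGCGAAC"))
```

A positive value indicates that G is used more often than C at third positions in that sequence. Comparing values for genes on the leading and lagging strands would be one way to explore the selection-based explanation, although a single gene is far too little data to draw a conclusion.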
Both mutational and selective pressure can independently introduce asymmetry in DNA strands. However, the combination and cumulative effect of both mechanisms is the most plausible explanation for GC and AT skew. [4] [14]
The GC skew has proven useful as an indicator of the DNA leading strand, lagging strand, replication origin, and replication terminus. [2] [4] [5] Most bacteria and archaea contain only one DNA replication origin. [2] The GC skew is positive in the leading strand and negative in the lagging strand; therefore, a switch in the sign of the GC skew is expected at the DNA replication origin and terminus. [4] GC skew can also be used to study strand biases and the mechanisms behind them by calculating the excess of one base over its complementary base in different milieus. [4] [5] [14] Methods such as GC skew, CGC skew, and the Z curve are tools for investigating the mechanism of DNA replication in different organisms.