Codon Adaptation Index

Last updated November 01, 2024

The Codon Adaptation Index (CAI)^[1] is the most widespread technique for analyzing codon usage bias. Unlike other measures of codon usage bias, such as the 'effective number of codons' (Nc), which measure deviation from uniform codon usage (the null hypothesis), CAI measures how far the codon usage of a given protein-coding gene sequence deviates from that of a reference set of genes. CAI is used as a quantitative method of predicting the level of expression of a gene based on its codon sequence.^[1]

Rationale

Ideally, the reference set in CAI is composed of highly expressed genes, so that CAI provides an indication of gene expression level under the assumption that there is translational selection to optimize gene sequences according to their expression levels. The rationale is twofold: in fast-growing organisms, highly expressed genes need to compete for resources (i.e. ribosomes), and it is also advantageous for them to be translated more accurately. Both hypotheses predict that highly expressed genes mostly use codons read by tRNA species that are abundant in the cell.

Implementation

For each amino acid in a gene, the weight of each of its codons, represented by a parameter termed relative adaptiveness ($w_i$), is computed from a reference sequence set as the ratio between the observed frequency of the codon $f_i$ and the frequency of the most frequent synonymous codon $f_j$ for that amino acid.

$$w_{i}={\frac {f_{i}}{\max(f_{j})}}\qquad i,j\in [{\text{synonymous codons for amino acid}}]$$
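The weight calculation above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `relative_adaptiveness`, the two-amino-acid `SYNONYMOUS_CODONS` table, and the toy reference sequences are all assumptions made for the example. Within one amino acid, the ratio of raw codon counts equals the ratio of frequencies, so counts are used directly.

```python
from collections import Counter

# Hypothetical, deliberately partial synonymous-codon table (two amino acids only).
# A real analysis would cover the full genetic code of the organism.
SYNONYMOUS_CODONS = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
}


def relative_adaptiveness(reference_genes, synonymous_codons=SYNONYMOUS_CODONS):
    """Return {codon: w} computed from in-frame reference coding sequences."""
    counts = Counter()
    for gene in reference_genes:
        # Read the sequence codon by codon, ignoring any trailing partial codon.
        for pos in range(0, len(gene) - len(gene) % 3, 3):
            counts[gene[pos:pos + 3]] += 1

    weights = {}
    for amino_acid, codons in synonymous_codons.items():
        max_count = max(counts[codon] for codon in codons)
        if max_count == 0:
            continue  # amino acid never appears in the reference set
        for codon in codons:
            # w_i = f_i / max(f_j) over the synonymous codons of this amino acid
            weights[codon] = counts[codon] / max_count
    return weights


# Toy reference set (assumed data): AAA x3, AAG x1, TTT x1, TTC x3.
reference = ["AAAAAGAAATTC", "AAATTTTTCTTC"]
weights = relative_adaptiveness(reference)
```

In this toy reference set the most frequent codon for each amino acid (AAA and TTC) receives a weight of 1, and each rarer synonym receives the ratio of its count to that maximum.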

The CAI of a gene is defined as the geometric mean of the weights associated with its codons, taken over the length ($L$) of the gene sequence (measured in codons).^[2]

$$\text{CAI}=\left(\prod_{i=1}^{L}w_{i}\right)^{\frac{1}{L}}$$
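The geometric mean can be sketched in Python as follows. The function name `codon_adaptation_index` and the `example_weights` values are hypothetical, chosen only so the arithmetic is easy to follow. The sketch sums logarithms instead of multiplying weights directly, which avoids numerical underflow for long genes. It also assumes that codons without a weight (for example, stop codons) are skipped, so that $L$ counts only the scored codons; this is one possible convention rather than part of the definition above.

```python
import math


def codon_adaptation_index(gene, weights):
    """Geometric mean of the relative adaptiveness of the codons in `gene`."""
    # Split the in-frame sequence into codons, ignoring a trailing partial codon.
    codons = [gene[pos:pos + 3] for pos in range(0, len(gene) - len(gene) % 3, 3)]
    # Assumption: codons with no weight are skipped rather than treated as errors.
    scored = [weights[codon] for codon in codons if codon in weights]
    if not scored:
        raise ValueError("no codon in the sequence has a weight")
    if any(w == 0 for w in scored):
        return 0.0  # a zero weight makes the whole product zero
    # exp(mean(log w_i)) equals (prod w_i) ** (1 / L)
    log_sum = sum(math.log(w) for w in scored)
    return math.exp(log_sum / len(scored))


# Hypothetical weights, for illustration only.
example_weights = {"AAA": 1.0, "AAG": 0.25, "TTC": 1.0, "TTT": 0.5}

# Codons AAA, AAG, TTT have weights 1.0, 0.25, 0.5; their product is 0.125,
# and its cube root is 0.5.
cai = codon_adaptation_index("AAAAAGTTT", example_weights)
```

A gene built only from the most frequent codon of each amino acid would score 1 under this definition, while heavy use of rare codons pulls the value toward 0.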

Related Research Articles

In molecular biology, a stop codon is a codon that signals the termination of the translation process of the current protein. Most codons in messenger RNA correspond to the addition of an amino acid to a growing polypeptide chain, which may ultimately become a protein; stop codons signal the termination of this process by binding release factors, which cause the ribosomal subunits to dissociate, releasing the amino acid chain.

Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides that encodes a specific amino acid residue in a polypeptide chain or for the termination of translation.

In molecular biology, a reading frame is a way of dividing the sequence of nucleotides in a nucleic acid molecule into a set of consecutive, non-overlapping triplets. Where these triplets equate to amino acids or stop signals during translation, they are called codons.

Transfer RNA is an adaptor molecule composed of RNA, typically 76 to 90 nucleotides in length. In a cell, it provides the physical link between the genetic code in messenger RNA (mRNA) and the amino acid sequence of proteins, carrying the correct sequence of amino acids to be combined by the protein-synthesizing machinery, the ribosome. Each three-nucleotide codon in mRNA is complemented by a three-nucleotide anticodon in tRNA. As such, tRNAs are a necessary component of translation, the biological synthesis of new proteins in accordance with the genetic code.

In molecular biology and genetics, GC-content is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C). This measure indicates the proportion of G and C bases out of an implied four total bases, also including adenine and thymine in DNA and adenine and uracil in RNA.

Silent mutations, also called synonymous or samesense mutations, are mutations in DNA that do not have an observable effect on the organism's phenotype. The phrase silent mutation is often used interchangeably with the phrase synonymous mutation; however, synonymous mutations are not always silent, nor vice versa. Synonymous mutations can affect transcription, splicing, mRNA transport, and translation, any of which could alter phenotype, rendering the synonymous mutation non-silent. The substrate specificity of the tRNA to the rare codon can affect the timing of translation, and in turn the co-translational folding of the protein. This is reflected in the codon usage bias that is observed in many species. Mutations that cause the altered codon to produce an amino acid with similar functionality are often classified as silent; if the properties of the amino acid are conserved, this mutation does not usually significantly affect protein function.

In bioinformatics, GLIMMER (Gene Locator and Interpolated Markov ModelER) is used to find genes in prokaryotic DNA. "It is effective at finding genes in bacteria, archea, viruses, typically finding 98-99% of all relatively long protein coding genes". GLIMMER was the first system that used the interpolated Markov model to identify coding regions. The GLIMMER software is open source and is maintained by Steven Salzberg, Art Delcher, and their colleagues at the Center for Computational Biology at Johns Hopkins University. The original GLIMMER algorithms and software were designed by Art Delcher, Simon Kasif and Steven Salzberg and applied to bacterial genome annotation in collaboration with Owen White.

A synonymous substitution is the evolutionary substitution of one base for another in an exon of a gene coding for a protein, such that the produced amino acid sequence is not modified. This is possible because the genetic code is "degenerate", meaning that some amino acids are coded for by more than one three-base-pair codon; since some of the codons for a given amino acid differ by just one base pair from others coding for the same amino acid, a mutation that replaces the "normal" base by one of the alternatives will result in incorporation of the same amino acid into the growing polypeptide chain when the gene is translated. Synonymous substitutions and mutations affecting noncoding DNA are often considered silent mutations; however, it is not always the case that the mutation is silent.

In evolutionary biology, conserved sequences are identical or similar sequences in nucleic acids or proteins across species, or within a genome, or between donor and receptor taxa. Conservation indicates that a sequence has been maintained by natural selection.

Pancrustacea is the clade that comprises all crustaceans, and all hexapods. This grouping is contrary to the Atelocerata hypothesis, in which Hexapoda and Myriapoda are sister taxa, and Crustacea are only more distantly related. As of 2010, the Pancrustacea taxon was considered well accepted, with most studies recovering Hexapoda within Crustacea. The clade has also been called Tetraconata, referring to having four cone cells in the ommatidia. The term "Tetraconata" is preferred by some scientists in order to avoid confusion with the use of "pan-" to indicate a clade that includes a crown group and all of its stem group representatives.

Neutral mutations are changes in DNA sequence that are neither beneficial nor detrimental to the ability of an organism to survive and reproduce. In population genetics, mutations in which natural selection does not affect the spread of the mutation in a species are termed neutral mutations. Neutral mutations that are inheritable and not linked to any genes under selection will be lost or will replace all other alleles of the gene. That loss or fixation of the gene proceeds based on random sampling known as genetic drift. A neutral mutation that is in linkage disequilibrium with other alleles that are under selection may proceed to loss or fixation via genetic hitchhiking and/or background selection.

In bioinformatics, k-mers are substrings of length $k$ contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides, k-mers are capitalized upon to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length $k$, such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length $L$ will have $L-k+1$ k-mers and $n^{k}$ total possible k-mers, where $n$ is the number of possible monomers.

In bioinformatics, the template modeling score or TM-score is a measure of similarity between two protein structures. The TM-score is intended as a more accurate measure of the global similarity of full-length protein structures than the often used RMSD measure. The TM-score indicates the similarity between two structures by a score in the interval $(0, 1]$, where 1 indicates a perfect match between two structures. Generally, scores below 0.20 correspond to randomly chosen unrelated proteins, whereas structures with a score higher than 0.5 assume roughly the same fold. A quantitative study shows that proteins of TM-score = 0.5 have a posterior probability of 37% in the same CATH topology family and of 13% in the same SCOP fold family. The probabilities increase rapidly when TM-score > 0.5. The TM-score is designed to be independent of protein lengths.

The McDonald–Kreitman test is a statistical test often used by evolutionary and population biologists to detect and measure the amount of adaptive evolution within a species by determining whether adaptive evolution has occurred, and the proportion of substitutions that resulted from positive selection. To do this, the McDonald–Kreitman test compares the amount of variation within a species (polymorphism) to the divergence between species (substitutions) at two types of sites, neutral and nonneutral. A substitution refers to a nucleotide that is fixed within one species, but a different nucleotide is fixed within a second species at the same base pair of homologous DNA sequences. A site is nonneutral if it is either advantageous or deleterious. The two types of sites can be either synonymous or nonsynonymous within a protein-coding region. In a protein-coding sequence of DNA, a site is synonymous if a point mutation at that site would not change the amino acid, also known as a silent mutation. Because the mutation did not result in a change in the amino acid that was originally coded for by the protein-coding sequence, the phenotype, or the observable trait, of the organism is generally unchanged by the silent mutation. A site in a protein-coding sequence of DNA is nonsynonymous if a point mutation at that site results in a change in the amino acid, resulting in a change in the organism's phenotype. Typically, silent mutations in protein-coding regions are used as the "control" in the McDonald–Kreitman test.

GC skew is when the nucleotides guanine and cytosine are over- or under-abundant in a particular region of DNA or RNA. GC skew is also a statistical method for measuring strand-specific guanine overrepresentation.

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.

The ambush hypothesis is a hypothesis in the field of molecular genetics that suggests that the prevalence of “hidden” or off-frame stop codons in DNA selectively deters off-frame translation of mRNA to save energy, molecular resources, and to reduce strain on biosynthetic machinery by truncating the production of non-functional, potentially cytotoxic protein products. Typical coding sequences of DNA lack in-frame internal stop codons to avoid the premature reduction of protein products when translation proceeds normally. The ambush hypothesis suggests that kinetic, cis-acting mechanisms are responsible for the productive frameshifting of translational units so that the degeneracy of the genetic code can be used to prevent deleterious translation. Ribosomal slippage is the most well described mechanism of translational frameshifting where the ribosome moves one codon position either forward (+1) or backward (-1) to translate the mRNA sequence in a different reading frame and thus produce different protein products.

Paul Martin Sharp is a British bioinformatician who is a professor of genetics at the University of Edinburgh, where he holds the Alan Robertson chair of genetics in the Institute of Evolutionary Biology.

PBX/Knotted 1 Homeobox 2 (PKNOX2) protein belongs to the three amino acid loop extension (TALE) class of homeodomain proteins, and is encoded by PKNOX2 gene in humans. The protein regulates the transcription of other genes and affects anatomical development.

This glossary of cellular and molecular biology is a list of definitions of terms and concepts commonly used in the study of cell biology, molecular biology, and related disciplines, including molecular genetics, biochemistry, and microbiology. It is split across two articles.

References

[1] Sharp, Paul M.; Li, Wen-Hsiung (1987). "The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications". Nucleic Acids Research. 15 (3): 1281–1295. doi:10.1093/nar/15.3.1281. PMC 340524. PMID 3547335.
[2] Gerstein, Mark; Bussemaker, Harmen J.; Jansen, Ronald (2003-04-15). "Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models". Nucleic Acids Research. 31 (8): 2242–2251. doi:10.1093/nar/gkg306. ISSN 0305-1048. PMC 153734. PMID 12682375.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
