Nucleic acid notation

The nucleic acid notation currently in use was first formalized by the International Union of Pure and Applied Chemistry (IUPAC) in 1970. [1] This universally accepted notation uses the Roman characters G, C, A, and T to represent the four nucleotides commonly found in deoxyribonucleic acids (DNA).

Given the rapidly expanding role of genetic sequencing, synthesis, and analysis in biology, some researchers have developed alternative notations to further support the analysis and manipulation of genetic data. These notations generally exploit size, shape, and symmetry to accomplish these objectives.

IUPAC notation

IUPAC degenerate base symbols [2]

Description     Symbol   No. of bases   Bases represented   Complementary bases
Adenine         A        1              A                   T
Cytosine        C        1              C                   G
Guanine         G        1              G                   C
Thymine         T        1              T                   A
Uracil          U        1              U                   A
Weak            W        2              A, T                W
Strong          S        2              C, G                S
Amino           M        2              A, C                K
Ketone          K        2              G, T                M
Purine          R        2              A, G                Y
Pyrimidine      Y        2              C, T                R
Not A           B        3              C, G, T             V
Not C           D        3              A, G, T             H
Not G           H        3              A, C, T             D
Not T [a]       V        3              A, C, G             B
Any one base    N        4              A, C, G, T          N
Gap             -        0                                  -

  a. Not U for RNA

Degenerate base symbols in biochemistry are an IUPAC [2] [3] representation for a position on a DNA sequence that can have multiple possible alternatives. These should not be confused with non-canonical bases, because each particular sequence will in fact have one of the regular bases at that position. Degenerate symbols are used to encode the consensus sequence of a population of aligned sequences, for example in phylogenetic analysis to summarise multiple sequences into one, or in BLAST searches, even though IUPAC degenerate symbols are masked (as they are not coded).
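As a minimal sketch of how a set of aligned sequences can be summarised into one degenerate consensus, the following Python maps the set of bases seen in each alignment column to its IUPAC symbol. The function name `degenerate_consensus` and the dictionary name are hypothetical, and the sketch assumes upper-case, equal-length DNA sequences with no gaps.

```python
# Set of observed bases -> IUPAC symbol (values from the IUPAC table).
IUPAC_BY_BASES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned):
    """Summarise equal-length, gap-free DNA sequences as one IUPAC string."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) walks the alignment one column at a time.
    return "".join(IUPAC_BY_BASES[frozenset(column)] for column in zip(*aligned))


print(degenerate_consensus(["ACGT", "ACGA", "GCGT"]))  # prints RCGW
```

In this example the first column contains A and G (a purine, R) and the last contains T and A (weak, W), while the middle two columns are invariant.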

Under the commonly used IUPAC system, nucleobases are represented by the first letters of their chemical names: guanine, cytosine, adenine, and thymine. [1] This shorthand also includes eleven "ambiguity" characters, one for every possible combination of two or more of the four DNA bases. [4] The ambiguity characters were designed to encode positional variations in order to report DNA sequencing errors, consensus sequences, or single-nucleotide polymorphisms. The IUPAC notation, including ambiguity characters and suggested mnemonics, is shown in Table 1.
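One way to see how ambiguity characters encode positional variation is to translate an IUPAC pattern into a regular expression and search a concrete sequence with it. The sketch below uses only Python's standard `re` module; the names `iupac_to_regex` and `find_matches` are hypothetical, and U and the gap symbol are deliberately left out.

```python
import re

# IUPAC symbol -> bases it stands for (DNA only; U and gap omitted).
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Compile an IUPAC pattern into a regex with one character class per symbol."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC_BASES[symbol]  # KeyError on an unknown symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


def find_matches(pattern, sequence):
    """Return start positions of non-overlapping matches of pattern in sequence."""
    return [m.start() for m in iupac_to_regex(pattern).finditer(sequence.upper())]


# Eleven symbols stand for more than one base, matching the count in the text.
print(sum(1 for bases in IUPAC_BASES.values() if len(bases) > 1))  # prints 11
print(find_matches("GTYRAC", "AAGTTAACGGGTCGACTT"))  # prints [2, 10]
```

Here the pattern GTYRAC matches both GTTAAC and GTCGAC, because Y accepts C or T and R accepts A or G.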

Despite its broad and nearly universal acceptance, the IUPAC system has a number of limitations, which stem from its reliance on the Roman alphabet. The poor legibility of upper-case Roman characters, which are generally used when displaying genetic data, may be chief among these limitations. The value of external projections in distinguishing letters has been well documented. [5] However, these projections are absent from upper case letters, which in some cases are only distinguishable by subtle internal cues. Take for example the upper case C and G used to represent cytosine and guanine. These characters generally comprise half the characters in a genetic sequence but are differentiated by a small internal tick (depending on the typeface). Nevertheless, these Roman characters are available in the ASCII character set most commonly used in textual communications, which reinforces this system's ubiquity.

Another shortcoming of the IUPAC notation arises from the fact that its eleven ambiguity characters have been selected from the remaining characters of the Roman alphabet. The authors of the notation endeavored to select ambiguity characters with logical mnemonics. For example, S is used to represent the possibility of finding cytosine or guanine at genetic loci, both of which form strong cross-strand binding interactions. Conversely, the weaker interactions of thymine and adenine are represented by a W. However, convenient mnemonics are not as readily available for the other ambiguity characters displayed in Table 1. This has made ambiguity characters difficult to use and may account for their limited application.

Nucleic acid nomenclature

Numbered ribose carbons on cytidine.

The positions of the carbons in the ribose sugar that forms the backbone of the nucleic acid chain are numbered, and are used to indicate the direction of nucleic acids (5'->3' versus 3'->5'). This is referred to as directionality. [3]

Alternative visually enhanced notations

Legibility issues associated with IUPAC-encoded genetic data have led biologists to consider alternative strategies for displaying genetic data. These creative approaches to visualizing DNA sequences have generally relied on the use of spatially distributed symbols and/or visually distinct shapes to encode lengthy nucleic acid sequences. Several alternative notations for nucleotide sequences have been attempted, but general uptake has been low. Some of these approaches are summarized below.

Stave projection

The Stave Projection uses spatially distributed dots to enhance the legibility of DNA sequences.

In 1986, Cowin et al. described a novel method for visualizing DNA sequences known as the Stave Projection. [6] Their strategy was to encode nucleotides as circles on a series of horizontal bars, akin to notes on a musical stave. As illustrated in Figure 1, each gap on the five-line staff corresponds to one of the four DNA bases. The spatial distribution of the circles makes it far easier to distinguish individual bases and compare genetic sequences than with IUPAC-encoded data.

The order of the bases (from top to bottom, G, A, T, C) is chosen so that the complementary strand can be read by turning the projection upside down.
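The upside-down property can be checked with a small text mock-up. This is only an illustrative sketch, not Cowin et al.'s implementation: it draws one text row per base in the stated top-to-bottom order (G, A, T, C), marks each position with an `o`, and rotates the picture by 180 degrees. All function names are hypothetical.

```python
ROWS = "GATC"  # top-to-bottom row order given in the text


def stave(sequence):
    """Render one text row per base; 'o' marks the base present at each position."""
    return ["".join("o" if base == row else "-" for base in sequence) for row in ROWS]


def rotate_180(rows):
    """Turn the picture upside down: reverse the row order and each row."""
    return [row[::-1] for row in reversed(rows)]


def read_stave(rows):
    """Read a stave back into letters using the same top-to-bottom row order."""
    width = len(rows[0])
    return "".join(ROWS[[row[i] for row in rows].index("o")] for i in range(width))


picture = stave("GATTC")
print("\n".join(picture))
print(read_stave(rotate_180(picture)))  # prints GAATC
```

Reading the rotated picture gives GAATC, which is the complementary strand of GATTC written in its own 5' to 3' direction: the G and C rows swap places, as do the A and T rows, and the left-to-right order is reversed.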

Geometric symbols

Zimmerman et al. took a different approach to visualizing genetic data. [7] Rather than relying on spatially distributed circles to highlight genetic features, they exploited four geometrically diverse symbols found in a standard computer font to distinguish the four bases. The authors developed a simple WordPerfect macro to translate IUPAC characters into the more visually distinct symbols.

DNA Skyline

With the growing availability of font editors, Jarvius and Landegren devised a novel set of genetic symbols, known as the DNA Skyline font, which uses progressively taller blocks to represent the different DNA bases. [8] While reminiscent of Cowin et al.'s spatially distributed Stave Projection, the DNA Skyline font is easy to download and permits translation to and from the IUPAC notation by simply changing the font in most standard word processing applications.

Ambigraphic notations

AmbiScript uses ambigrams to reflect DNA symmetries and support the manipulation and analysis of genetic data.

Ambigrams (symbols that convey different meaning when viewed in a different orientation) have been designed to mirror structural symmetries found in the DNA double helix. [9] By assigning ambigraphic characters to complementary bases (i.e. guanine: b, cytosine: q, adenine: n, and thymine: u), it is possible to complement DNA sequences by simply rotating the text 180 degrees. [10] An ambigraphic nucleic acid notation also makes it easy to identify genetic palindromes, such as endonuclease restriction sites, as sections of text that can be rotated 180 degrees without changing the sequence.
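The rotation trick can be simulated in a few lines of Python, using the letter assignments given above (guanine: b, cytosine: q, adenine: n, thymine: u). Rotating text by 180 degrees reverses the character order and turns b into q and n into u, and vice versa. The function names are hypothetical and the sketch handles only the four unambiguous bases.

```python
TO_AMBI = {"G": "b", "C": "q", "A": "n", "T": "u"}
FROM_AMBI = {glyph: base for base, glyph in TO_AMBI.items()}
# What each glyph looks like after a 180-degree rotation.
ROTATED = {"b": "q", "q": "b", "n": "u", "u": "n"}


def encode(sequence):
    """Write a DNA sequence in the ambigraphic letters."""
    return "".join(TO_AMBI[base] for base in sequence)


def decode(text):
    """Read ambigraphic letters back as IUPAC bases."""
    return "".join(FROM_AMBI[glyph] for glyph in text)


def rotate(text):
    """Rotate the text 180 degrees: reverse the order and flip each glyph."""
    return "".join(ROTATED[glyph] for glyph in reversed(text))


def is_palindrome(sequence):
    """A genetic palindrome reads the same after the text is rotated."""
    return rotate(encode(sequence)) == encode(sequence)


print(encode("GATTACA"))                  # prints bnuunqn
print(decode(rotate(encode("GATTACA"))))  # prints TGTAATC
print(is_palindrome("GAATTC"))            # prints True
```

TGTAATC is the complementary strand of GATTACA read 5' to 3', and GAATTC (the EcoRI recognition site) is unchanged by the rotation, which is how a palindromic restriction site shows up in this notation.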

One example of an ambigraphic nucleic acid notation is AmbiScript, a rationally designed nucleic acid notation that combines many of the visual and functional features of its predecessors. [11] The notation also uses spatially offset characters to facilitate the visual review and analysis of genetic data. AmbiScript was also designed to indicate ambiguous nucleotide positions via compound symbols, a strategy that aimed to offer a more intuitive alternative to the ambiguity characters first proposed by the IUPAC. [4] As with Jarvius and Landegren's DNA Skyline fonts, AmbiScript fonts can be downloaded and applied to IUPAC-encoded sequence data.

Triple helix base pairing

Watson and Crick base pairs are indicated by "•", "-", or "." (for example, A•T or poly(rC)•2poly(rC)).

Hoogsteen triple helix base pairs are indicated by "*" or ":" (for example, C•G*G+, T•A*T, C•G*G, or T•A*A).

References

  1. IUPAC-IUB Commission on Biochemical Nomenclature (1970). "Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents". Biochemistry. 9 (20): 4022–4027. doi:10.1021/bi00822a023.
  2. Nomenclature Committee of the International Union of Biochemistry (NC-IUB) (1984). "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences". Nucleic Acids Research. 13 (9): 3021–3030. doi:10.1093/nar/13.9.3021. PMC 341218. PMID 2582368.
  3. Cornish-Bowden A (May 1985). "Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984". Nucleic Acids Research. 13 (9): 3021–3030. doi:10.1093/nar/13.9.3021. PMC 341218. PMID 2582368.
  4. Nomenclature Committee of the International Union of Biochemistry (NC-IUB) (1986). "Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984". Proc. Natl. Acad. Sci. USA. 83 (1): 4–8. Bibcode:1986PNAS...83....4O. doi:10.1073/pnas.83.1.4. PMC 322779. PMID 2417239.
  5. Tinker, M. A. (1963). Legibility of Print. Ames, IA: Iowa State University Press.
  6. Cowin, J. E.; Jellis, C. H.; Rickwood, D. (1986). "A new method of representing DNA sequences which combines ease of visual analysis with machine readability". Nucleic Acids Research. 14 (1): 509–515. doi:10.1093/nar/14.1.509. PMC 339435. PMID 3003680.
  7. Zimmerman, P. A.; Spell, M. L.; Rawls, J.; Unnasch, T. R. (1991). "Transformation of DNA sequence data into geometric symbols". BioTechniques. 11 (1): 50–52. PMID 1954017.
  8. Jarvius, J.; Landegren, U. (2006). "DNA Skyline: fonts to facilitate visual inspection of nucleic acid sequences". BioTechniques. 40 (6): 740. doi:10.2144/000112180. PMID 16774117.
  9. Hofstadter, Douglas R. (1985). Metamagical Themas: Questioning the Essence of Mind and Pattern. New York: Basic Books. ISBN 978-0465045662.
  10. Rozak, D. A. (2006). "The practical and pedagogical advantages of an ambigraphic nucleic acid notation". Nucleosides, Nucleotides & Nucleic Acids. 25 (7): 807–813. doi:10.1080/15257770600726109. PMID 16898419. S2CID 23600737.
  11. Rozak, David A.; Rozak, Anthony J. (2008). "Simplicity, function, and legibility in an enhanced ambigraphic nucleic acid notation". BioTechniques. 44 (6): 811–813. doi:10.2144/000112727. PMID 18476835.