Pseudo K-tuple nucleotide composition

Last updated January 15, 2025

The pseudo K-tuple nucleotide composition, or PseKNC, is a method for converting a nucleotide sequence (DNA or RNA) into a numerical vector that can be used in pattern recognition techniques. Generally, the K-tuple is a dinucleotide (K = 2) or a trinucleotide (K = 3); in these cases the technique is also called PseDNC or PseTNC, respectively.^[1]

Background

PseAAC

PseKNC was derived from an analogous method in proteomics known as PseAAC (pseudo amino acid composition).^[2] Previously, predictions of protein properties relied either on sequential models, which use the protein sequence itself, or on a discrete model, which in its simplest form is the amino acid composition: a vector of twenty elements, each representing the frequency of one amino acid in the protein sample. The discrete model, however, fails to account for sequence-order information. The PseAAC model extends the 20-element vector of the discrete model with λ components, each of which captures some sequence-order information, and the extended vector becomes the basis for making predictions.^[3]

Analogous problem in genomics

Analogously, a discrete model of a nucleotide sequence based on its dinucleotide composition would involve a vector of 16 elements, each representing the frequency of one dinucleotide in the sequence:^[1]

$\mathbf {D} ={\begin{bmatrix}f(AA)&f(AC)&\cdots &f(TT)\end{bmatrix}}^{\mathbf {T} }$

where D is the vector representing the DNA sequence, T is the transpose operator, and f(AA) is the normalized occurrence frequency of AA in the DNA sequence. A trinucleotide representation can be denoted as:^[1]

$\mathbf {D} ={\begin{bmatrix}f(AAA)&f(AAC)&\cdots &f(TTT)\end{bmatrix}}^{\mathbf {T} }$
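The discrete model described here can be illustrated with a short Python sketch. The function name `ktuple_composition` and the choice to skip windows containing ambiguous bases are assumptions made for the example, not part of any published tool.

```python
from itertools import product


def ktuple_composition(seq, k=2, alphabet="ACGT"):
    """Return the normalized occurrence frequency of every k-tuple.

    The result has len(alphabet)**k entries in lexicographic order
    (AA, AC, ..., TT for k=2), matching the layout of the vector D.
    """
    seq = seq.upper()
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    total = 0
    # Slide a window of width k along the sequence, one position at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # assumption: skip windows with bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid k-tuples")
    return [counts[t] / total for t in tuples]


dinucleotide_vector = ktuple_composition("AACGT", k=2)
print(len(dinucleotide_vector))  # 16 elements for K = 2
```

Because only counts are kept, any two sequences with the same k-tuple counts map to the same vector, which is the loss of sequence-order information that motivates PseKNC.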

As can be seen, these discrete models fail to consider any global or long-range sequence-order information. To address this for both DNA and RNA sequences, the pseudo K-tuple nucleotide composition or PseKNC was proposed.^[4]^[5]^[6]

PseKNC

PseKNC extends the discrete model by adding λ components to represent sequence-order and physico-chemical properties of the nucleotide sequence. The original KNC model will involve 4^K components. In a dinucleotide situation where K = 2, 4² = 16 components will be included. The extension by PseKNC results in (4^K + λ) components.^[1]
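The following Python sketch illustrates how a vector of 4^K + λ components might be assembled. The text above does not give the formula for the λ components, so the correlation factor used here (a mean squared difference in physicochemical properties between dinucleotides a fixed lag apart), the weight parameter, and the normalization are assumptions. The property table `TOY_PROPERTIES` is entirely made up for illustration, and all function names are hypothetical.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

# HYPOTHETICAL property table: two made-up values per dinucleotide, used only
# to make the sketch runnable. Real work would substitute published,
# standardized physicochemical values.
TOY_PROPERTIES = {d: ((i % 4) / 3.0, (i // 4) / 3.0)
                  for i, d in enumerate(DINUCLEOTIDES)}


def ktuple_frequencies(seq, k):
    """Normalized frequency of each of the 4**k k-tuples, lexicographic order."""
    tuples = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        counts[seq[i:i + k]] += 1
    return [counts[t] / n_windows for t in tuples]


def correlation_factor(seq, lag, props):
    """Assumed form: mean squared property difference between dinucleotides
    that start `lag` positions apart."""
    n_pairs = len(seq) - 1 - lag
    total = 0.0
    for i in range(n_pairs):
        first = props[seq[i:i + 2]]
        second = props[seq[i + lag:i + lag + 2]]
        total += sum((a - b) ** 2 for a, b in zip(first, second)) / len(first)
    return total / n_pairs


def pseknc(seq, k=2, lam=2, weight=0.5, props=TOY_PROPERTIES):
    """Return a vector with 4**k composition terms followed by lam
    sequence-order terms, normalized so that all components sum to 1."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G and T")
    if len(seq) < max(k, lam + 2):
        raise ValueError("sequence too short for the chosen k and lam")
    freqs = ktuple_frequencies(seq, k)
    thetas = [correlation_factor(seq, lag, props) for lag in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pseknc("ACGTACGTAC", k=2, lam=3)
print(len(vector))  # 4**2 + 3 = 19 components
```

In this sketch, `weight` controls how much the sequence-order terms contribute relative to the composition terms, and `lam` must be smaller than the sequence length minus one so that every lag has at least one dinucleotide pair.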

Applications

The PseKNC method has been applied to a wide range of problems.^[7] For example, it has become an integral component of many algorithms designed to predict the locations of recombination hotspots and coldspots from sequence information.^[8]^[9]

Web servers

For the convenience of the scientific community, a freely available web server called PseKNC^[4] and an open-source package called PseKNC-General^[5] were developed in 2013 and 2014, respectively, which can convert large-scale sequence datasets to pseudo nucleotide compositions with numerous choices of physicochemical property combinations. PseKNC-General can generate several modes of pseudo nucleotide compositions, including conventional k-tuple nucleotide compositions, Moreau–Broto autocorrelation coefficients, Moran autocorrelation coefficients, Geary autocorrelation coefficients, Type I PseKNC and Type II PseKNC.

Another web server, Pse-in-One, lets users choose among the existing PseAAC and PseKNC methods for protein, RNA, and DNA sequences, along with any of the available physicochemical property combinations for these methods.^[10]

References

[1] Chen, Wei; Lei, Tian-Yu; Jin, Dian-Chuan; Lin, Hao; Chou, Kuo-Chen (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001.
[2] Chou, Kuo-Chen (2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins: Structure, Function, and Genetics. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174. S2CID 28406797.
[3] Chou, Kuo-Chen (2011-03-21). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–247. doi:10.1016/j.jtbi.2010.12.024. ISSN 0022-5193. PMC 7125570.
[4] Chen, Wei; Lei, Tian-Yu; Jin, Dian-Chuan; Lin, Hao; Chou, Kuo-Chen (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001. PMID 24732113.
[5] Chen, Wei; Zhang, Xitong; Brooker, Jordan; Lin, Hao; Zhang, Liqing; Chou, Kuo-Chen (2015). "PseKNC-General: A cross-platform package for generating various modes of pseudo nucleotide compositions". Bioinformatics. 31 (1): 119–20. doi:10.1093/bioinformatics/btu602. PMID 25231908.
[6] Chen, Wei; Lin, Hao; Chou, Kuo-Chen (2015). "Pseudo nucleotide composition or PseKNC: An effective formulation for analyzing genomic sequences". Molecular BioSystems. 11 (10): 2620–34. doi:10.1039/c5mb00155b. PMID 26099739.
[7] Chen, Wei; Lin, Hao; Chou, Kuo-Chen (2015). "Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences". Molecular BioSystems. 11 (10): 2620–2634. doi:10.1039/C5MB00155B. ISSN 1742-206X.
[8] Liu, Bin; Wang, Shanyi; Long, Ren; Chou, Kuo-Chen (2017-01-01). "iRSpot-EL: identify recombination spots with an ensemble learning approach". Bioinformatics. 33 (1): 35–41. doi:10.1093/bioinformatics/btw539. ISSN 1367-4803.
[9] Ye, Dong-Xin; Yu, Jun-Wen; Li, Rui; Hao, Yu-Duo; Wang, Tian-Yu; Yang, Hui; Ding, Hui (2024-06-12). "The Prediction of Recombination Hotspot Based on Automated Machine Learning". Journal of Molecular Biology: 168653. doi:10.1016/j.jmb.2024.168653. ISSN 0022-2836.
[10] Liu, Bin; Liu, Fule; Wang, Xiaolong; Chen, Junjie; Fang, Longyun; Chou, Kuo-Chen (2015-07-01). "Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences". Nucleic Acids Research. 43 (W1): W65–W71. doi:10.1093/nar/gkv458. ISSN 0305-1048. PMC 4489303. PMID 25958395.

