Pseudo amino acid composition

Pseudo amino acid composition, or PseAAC, in molecular biology, was originally introduced by Kuo-Chen Chou in 2001 to represent protein samples for improving protein subcellular localization prediction and membrane protein type prediction. [1] Like the vanilla amino acid composition (AAC) method, it characterizes the protein mainly using a matrix of amino-acid frequencies, which helps when dealing with proteins that lack significant sequence homology to other proteins. Compared to AAC, additional information is also included in the matrix to represent some local features, such as the correlation between residues a certain distance apart. [2] When dealing with PseAAC, Chou's invariance theorem has often been used.

Background

To predict the subcellular localization of proteins and other attributes based on their sequence, two kinds of models are generally used to represent protein samples: (1) the sequential model, and (2) the non-sequential model or discrete model.

The most typical sequential representation for a protein sample is its entire amino acid (AA) sequence, which contains the most complete information about the protein. This is an obvious advantage of the sequential model. Predictions under this model are usually made with sequence-similarity-search-based tools.

Consider a protein sequence P with $L$ amino acid residues, i.e.,

$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{1}$$

where $R_1$ represents the 1st residue of the protein P, $R_2$ the 2nd residue, and so forth. This is the representation of the protein under the sequential model.

However, this kind of approach fails when a query protein does not have significant homology to the known protein(s). Thus, various discrete models were proposed that do not rely on sequence-order. The simplest discrete model is using the amino acid composition (AAC) to represent protein samples. Under the AAC model, the protein P of Eq.1 can also be expressed by

$$\mathbf{P} = [f_1, f_2, \ldots, f_{20}]^{\mathbf{T}} \tag{2}$$

where $f_u$ ($u = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in P, and $\mathbf{T}$ is the transposing operator. The AAC of a protein is trivially derived when its primary structure is known, as in Eq.1; it can also be determined by hydrolysis without knowing the exact sequence, and such a step is in fact often a prerequisite for protein sequencing. [3]
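The AAC vector can be sketched in a few lines of Python. This is only an illustration: the names `AMINO_ACIDS` and `amino_acid_composition` are hypothetical, and the alphabetical ordering of the one-letter codes is an assumption, since the text does not fix the order of the 20 components.

```python
from collections import Counter

# The 20 native amino acids in one-letter code.
# Alphabetical ordering is an assumption made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 normalized occurrence frequencies of a sequence."""
    counts = Counter(sequence.upper())
    # Only the 20 native amino acids contribute to the normalization.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no native amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two sequences with the same residues in a different order
# give the same AAC vector: the sequence order is not represented.
aac_forward = amino_acid_composition("AACD")
aac_shuffled = amino_acid_composition("DACA")
```

Both calls return the same vector, which illustrates why a composition-only representation cannot distinguish proteins that differ only in residue order.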

Owing to its simplicity, the amino acid composition (AAC) model was widely used in many earlier statistical methods for predicting protein attributes. However, all the sequence-order information is lost. This is its main shortcoming.

Concept

To avoid completely losing the sequence-order information, the concept of PseAAC (pseudo amino acid composition) was proposed. [1] In contrast with the conventional amino acid composition (AAC), which contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein, the PseAAC contains a set of more than 20 discrete factors, where the first 20 represent the components of its conventional amino acid composition while the additional factors incorporate some sequence-order information via various pseudo components.

The additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combination of other factors, so long as they reflect some kind of sequence-order effect. The essence of PseAAC is therefore that, on the one hand, it covers the AA composition and, on the other hand, it contains information beyond the AA composition, and hence can better reflect the features of a protein sequence through a discrete model.

Meanwhile, various modes to formulate the PseAAC vector have also been developed, as summarized in a 2009 review article. [2]

Algorithm

Figure 1. A schematic drawing to show (a) the 1st-tier, (b) the 2nd-tier, and (c) the 3rd-tier sequence-order-correlation mode along a protein sequence, where R1 represents the amino acid residue at the sequence position 1, R2 at position 2, and so forth (cf. Eq.1), and the coupling factors $J_{i,j}$ are given by Eq.6. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the 2nd most contiguous residues, and panel (c) that between all the 3rd most contiguous residues.

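According to the PseAAC model, the protein P of Eq.1 can be formulated as

$$\mathbf{P} = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{\mathbf{T}}, \quad (\lambda < L) \tag{3}$$

where the ($20 + \lambda$) components are given by

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \theta_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{4}$$

where $f_u$ ($u = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids as in the AAC model (Eq.2), $w$ is the weight factor, and $\theta_k$ the $k$-th tier correlation factor that reflects the sequence order correlation between all the $k$-th most contiguous residues as formulated by

$$\theta_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L) \tag{5}$$

with

$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ \Phi_q(R_{i+k}) - \Phi_q(R_i) \right]^2 \tag{6}$$

where $\Phi_q(R_i)$ is the $q$-th function of the amino acid $R_i$, and $\Gamma$ the total number of the functions considered. For example, in the original paper by Chou, [1] $\Phi_1(R_i)$, $\Phi_2(R_i)$ and $\Phi_3(R_i)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $R_i$, while $\Phi_1(R_{i+1})$, $\Phi_2(R_{i+1})$ and $\Phi_3(R_{i+1})$ are the corresponding values for the amino acid $R_{i+1}$. Therefore, the total number of functions considered there is $\Gamma = 3$. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional AA composition of the protein, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st tier, 2nd tier, ..., and the $\lambda$-th tier sequence order correlation patterns (Figure 1). It is through these additional $\lambda$ factors that some important sequence-order effects are incorporated.

$\lambda$ in Eq.3 is an integer parameter, and choosing a different integer for $\lambda$ leads to a PseAA composition of a different dimension. [4]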
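The procedure of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `pseaac` and its parameters are hypothetical, each coupling factor is taken to be the mean squared difference of the property values of the two residues (the form of Eq.6), the property tables are supplied by the caller (any normalization of the values is left to the caller), and the `toy_scale` values below are placeholders rather than real physicochemical data.

```python
# The 20 native amino acids; alphabetical order is an assumption of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pseaac(sequence, properties, lam, w):
    """Return a (20 + lam)-dimensional PseAAC vector.

    sequence   -- protein sequence in one-letter code
    properties -- list of dicts, each mapping amino acid -> property value
    lam        -- number of correlation tiers (0 <= lam < len(sequence))
    w          -- weight factor applied to the correlation factors
    """
    seq = sequence.upper()
    length = len(seq)
    if any(aa not in AMINO_ACIDS for aa in seq):
        raise ValueError("sequence contains a non-native residue code")
    if not 0 <= lam < length:
        raise ValueError("lam must satisfy 0 <= lam < len(sequence)")
    if not properties:
        raise ValueError("at least one property table is required")

    # Conventional amino acid composition (the first 20 components).
    freqs = [seq.count(aa) / length for aa in AMINO_ACIDS]

    # k-th tier correlation factors: average coupling over all residue
    # pairs that are k positions apart along the chain.
    n_props = len(properties)
    thetas = []
    for k in range(1, lam + 1):
        total = 0.0
        for i in range(length - k):
            a, b = seq[i], seq[i + k]
            # Coupling factor: mean squared property difference (assumed form).
            total += sum((phi[b] - phi[a]) ** 2 for phi in properties) / n_props
        thetas.append(total / (length - k))

    # Shared denominator, so that all components sum to 1.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Placeholder property scale (NOT real data): the index of each letter.
toy_scale = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}

# w = 0.5 and lam = 2 are arbitrary choices for illustration.
vector = pseaac("ACDA", [toy_scale], lam=2, w=0.5)
```

With `lam=2` the result has 22 components. Unlike the plain composition, the last two components change when the residues of the sequence are reordered, which is the sequence-order effect the pseudo components are meant to capture.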

Using Eq.6 is just one of the many modes for deriving the correlation factors in PseAAC or its components. The others, such as the physicochemical distance mode [5] and amphiphilic pattern mode, [6] can also be used to derive different types of PseAAC, as summarized in a 2009 review article. [2] In 2011, the formulation of PseAAC (Eq.3) was extended to a form of the general PseAAC as given by: [7]

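$$\mathbf{P} = [\psi_1, \psi_2, \ldots, \psi_u, \ldots, \psi_\Omega]^{\mathbf{T}}$$

where the subscript $\Omega$ is an integer, and its value and the components $\psi_1, \psi_2, \ldots$ will depend on how the desired information is extracted from the amino acid sequence of P in Eq.1.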

The general PseAAC can be used to reflect any desired features according to the targets of research, including core features such as functional domain, sequential evolution, and gene ontology, to improve the prediction quality for the subcellular localization of proteins [8] [9] as well as many of their other important attributes.

References

  1. Chou KC (May 2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174. S2CID 28406797.
  2. Chou KC (2009). "Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology". Current Proteomics. 6 (4): 262–274. doi:10.2174/157016409789973707.
  3. Michail A. Alterman; Peter Hunziker (2 December 2011). Amino Acid Analysis: Methods and Protocols. Humana Press. ISBN 978-1-61779-444-5.
  4. Chou KC, Shen HB (November 2007). "Recent progress in protein subcellular location prediction". Anal. Biochem. 370 (1): 1–16. doi:10.1016/j.ab.2007.07.006. PMID 17698024.
  5. Chou KC (November 2000). "Prediction of protein subcellular locations by incorporating quasi-sequence-order effect". Biochem. Biophys. Res. Commun. 278 (2): 477–83. doi:10.1006/bbrc.2000.3815. PMID 11097861.
  6. Chou KC (January 2005). "Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes". Bioinformatics. 21 (1): 10–9. doi:10.1093/bioinformatics/bth466. PMID 15308540.
  7. Chou KC (March 2011). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–47. Bibcode:2011JThBi.273..236C. doi:10.1016/j.jtbi.2010.12.024. PMC 7125570. PMID 21168420.
  8. Chou KC, Shen HB (2008). "Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms". Nat Protoc. 3 (2): 153–62. doi:10.1038/nprot.2007.494. PMID 18274516. S2CID 226104. Archived from the original on 2007-08-27. Retrieved 2008-03-24.
  9. Shen HB, Chou KC (February 2008). "PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition". Anal. Biochem. 373 (2): 386–8. doi:10.1016/j.ab.2007.10.012. PMID 17976365.