Predicted Aligned Error

Filename extension: .json
Internet media type: application/json
Developed by: DeepMind, EMBL-EBI
Type of format: Bioinformatics
Website: https://alphafold.ebi.ac.uk/faq

The Predicted Aligned Error (PAE) is a quantitative output produced by AlphaFold, a protein structure prediction system developed by DeepMind. [1] For each pair of residues, PAE estimates the expected positional error at one residue if the predicted and true structures were aligned on the other. This measurement helps scientists assess confidence in the relative positions and orientations of different parts of the predicted protein model. [2]

Structure

PAE is presented as a two-dimensional (2D) interactive plot where the color at coordinates (x, y) represents the predicted position error at residue x if the predicted and true structures were aligned on residue y. [3] Lower PAE values for residue pairs from different domains suggest well-defined relative positions and orientations in the prediction, while higher PAE values indicate uncertainty in the relative positions or orientations. Users can download the raw PAE data for all residue pairs in a custom JSON format for further analysis or visualization using a programming language such as Python. The format of the JSON file is as follows:

[
    {
        "predicted_aligned_error": [[0, 1, 4, 7, 9, ...], ...],
        "max_predicted_aligned_error": 31.75
    }
]

In the JSON file, the field predicted_aligned_error provides the PAE value for each residue pair, rounded to the nearest integer, and the field max_predicted_aligned_error gives the maximum possible PAE value, which is capped at 31.75 Å. All PAE values are measured in ångströms.
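A minimal sketch of reading this format in Python, assuming the layout shown above (a one-element list wrapping a single object). The sample values and the function name load_pae are illustrative and are not part of any AlphaFold tooling.

```python
import json

# Tiny hand-made example in the layout described above (values are made up).
SAMPLE = (
    '[{"predicted_aligned_error": [[0, 3, 12], [2, 0, 9], [14, 10, 0]], '
    '"max_predicted_aligned_error": 31.75}]'
)

def load_pae(text):
    """Return (matrix, max_pae) parsed from PAE JSON text."""
    record = json.loads(text)[0]  # the top level is a one-element list
    matrix = record["predicted_aligned_error"]
    max_pae = record["max_predicted_aligned_error"]
    n = len(matrix)
    # The matrix holds one value per residue pair, so it must be square.
    if any(len(row) != n for row in matrix):
        raise ValueError("PAE matrix is not square")
    return matrix, max_pae

matrix, max_pae = load_pae(SAMPLE)
print(len(matrix), max_pae)
```

For a downloaded file, the same function could be applied to the file's contents, for example the string returned by reading the file in text mode.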

A separately developed 3D viewer of PAE allows for more intuitive visualization. [4]

Figure: Predicted Aligned Error 3D viewer

Interpretation

PAE values indicate how much confidence to place in the predicted arrangement of a protein's parts. Lower PAE values between residue pairs from different domains indicate that the model predicts well-defined relative positions and orientations for those domains. Higher PAE values for such residue pairs suggest that the relative positions and/or orientations of these domains in the 3D structure are uncertain and should not be interpreted. [5]
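One way to apply this reading is to compare the average PAE within a domain against the average between two domains. The sketch below assumes a made-up 4-residue matrix and hypothetical domain boundaries; the helper mean_block_pae is illustrative, and no particular cutoff for "low" or "high" is implied.

```python
def mean_block_pae(matrix, rows, cols):
    """Average PAE over the block matrix[i][j] for i in rows and j in cols."""
    values = [matrix[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

# Hypothetical protein: residues 0-1 form domain A, residues 2-3 form domain B.
pae = [
    [0, 2, 20, 22],
    [2, 0, 21, 23],
    [19, 20, 0, 3],
    [21, 22, 3, 0],
]
domain_a = range(0, 2)
domain_b = range(2, 4)

within_a = mean_block_pae(pae, domain_a, domain_a)
a_vs_b = mean_block_pae(pae, domain_a, domain_b)
b_vs_a = mean_block_pae(pae, domain_b, domain_a)
# PAE is not symmetric, so both off-diagonal blocks are averaged.
inter = (a_vs_b + b_vs_a) / 2

print(within_a, inter)
```

In this toy matrix the inter-domain average is far larger than the within-domain average, which is the pattern the text describes as an uncertain relative placement of the two domains.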

Caveats

Although PAE provides valuable information, it is asymmetric: the PAE value for (x, y) may differ from the value for (y, x), particularly between loop regions with highly uncertain orientations. [6] Moreover, while AlphaFold can make useful inter-domain predictions, its intra-domain predictions are expected to be more reliable, based on CASP14 validation.
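The asymmetry can be checked directly on the raw matrix. The following sketch, using a made-up 3-residue matrix and an illustrative helper named max_asymmetry, finds the residue pair whose two PAE values differ the most.

```python
def max_asymmetry(matrix):
    """Return (difference, x, y) for the pair with the largest |PAE[x][y] - PAE[y][x]|."""
    best = (0, 0, 0)
    n = len(matrix)
    for x in range(n):
        for y in range(x + 1, n):  # visit each unordered pair once
            diff = abs(matrix[x][y] - matrix[y][x])
            if diff > best[0]:
                best = (diff, x, y)
    return best

# Made-up values for illustration only.
pae = [
    [0, 3, 12],
    [2, 0, 9],
    [14, 10, 0],
]
print(max_asymmetry(pae))
```

A perfectly symmetric matrix would yield a difference of 0, so a large value flags pairs where the two alignment directions should be inspected separately.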

References

  1. "AlphaFold Protein Structure Database". alphafold.ebi.ac.uk. 2023-06-12. Archived from the original on 2023-06-13. Retrieved 2023-06-12.
  2. "AlphaFold Error Estimates". www.rbvi.ucsf.edu. Archived from the original on 2023-06-13. Retrieved 2023-06-12.
  3. "Enabling high-accuracy protein structure prediction at the proteome scale". www.deepmind.com. 2023-06-13. Archived from the original on 2023-06-13. Retrieved 2023-06-13.
  4. Elfmann, Christoph; Stülke, Jörg (2023-05-04). "PAE viewer: a webserver for the interactive visualization of the predicted aligned error for multimer structure predictions and crosslinks". Nucleic Acids Research. 51 (W1): W404–W410. doi:10.1093/nar/gkad350. ISSN 0305-1048. PMC 10320053. PMID 37140053.
  5. Varadi, Mihaly (2023-06-13). "NIH: National Library of Medicine: AlphaFold Database". Nucleic Acids Research. 50 (D1): D439–D444. doi:10.1093/nar/gkab1061. PMC 8728224. PMID 34791371.
  6. "Why is the Alphafold PAE (predicted aligned error) not symmetric?". Matter Modeling Stack Exchange. 2023-06-12. Archived from the original on 2023-06-13. Retrieved 2023-06-12.