Protein subcellular localization prediction (or just protein localization prediction) involves the prediction of where a protein resides in a cell, its subcellular localization.
In general, prediction tools take as input information about a protein, such as a protein sequence of amino acids, and produce a predicted location within the cell as output, such as the nucleus, Endoplasmic reticulum, Golgi apparatus, extracellular space, or other organelles. The aim is to build tools that can accurately predict the outcome of protein targeting in cells.
Prediction of protein subcellular localization is an important component of bioinformatics based prediction of protein function and genome annotation, and it can aid the identification of drug targets.
Experimentally determining the subcellular localization of a protein can be a laborious and time consuming task. Immunolabeling or tagging (such as with a green fluorescent protein) to view localization using fluorescence microscope are often used. A high throughput alternative is to use prediction.
Through the development of new approaches in computer science, coupled with an increased dataset of proteins of known localization, computational tools can now provide fast and accurate localization predictions for many organisms. This has resulted in subcellular localization prediction becoming one of the challenges being successfully aided by bioinformatics, and machine learning.
Many prediction methods now exceed the accuracy of some high-throughput laboratory methods for the identification of protein subcellular localization. [1] [2] [3] Particularly, some predictors have been developed [4] that can be used to deal with proteins that may simultaneously exist, or move between, two or more different subcellular locations. Experimental validation is typically required to confirm the predicted localizations.
In 1999 PSORT was the first published program to predict subcellular localization. [5] Subsequent tools and websites have been released using techniques such as artificial neural networks, support vector machine and protein motifs. Predictors can be specialized for proteins in different organisms. Some are specialized for eukaryotic proteins, [6] some for human proteins, [7] and some for plant proteins. [8] Methods for the prediction of bacterial localization predictors, and their accuracy, have been reviewed. [9] In 2021, SCLpred-MEM, a membrane protein prediction tool powered by artificial neural networks was published. [10] SCLpred-EMS is another tool powered by Artificial neural networks that classify proteins into endomembrane system and secretory pathway (EMS) versus all others. [11] Similarly, Light-Attention uses machine learning methods to predict ten different common subcellular locations. [12]
The first model to generalize protein subcellular localization to all cell line does so by leveraging images of subcellular landmark stains (i.e., nuclear, plasma membrane, and endoplasmic reticulum markers) across multiple cell stains. Coupling multimodal data of landmark stains along with a pre-trained protein language model, the Prediction of Unseen Proteins' Subcellular Localization (PUPS) model is capable of generative subcellular localization prediction of any protein in any cell line given the protein's amino acid sequence and reference stains of the cell line. [13]
The development of protein subcellular location prediction has been summarized in two comprehensive review articles. [14] [15] Recent tools and an experience report can be found in a recent paper by Meinken and Min (2012).
Knowledge of the subcellular localization of a protein can significantly improve target identification during the drug discovery process. For example, secreted proteins and plasma membrane proteins are easily accessible by drug molecules due to their localization in the extracellular space or on the cell surface.
Bacterial cell surface and secreted proteins are also of interest for their potential as vaccine candidates or as diagnostic targets. Aberrant subcellular localization of proteins has been observed in the cells of several diseases, such as cancer and Alzheimer's disease. Secreted proteins from some archaea that can survive in unusual environments have industrially important applications.
By using prediction a high number of proteins can be assessed in order to find candidates that are trafficked to the desired location.
The results of subcellular localization prediction can be stored in databases. Examples include the multi-species database Compartments, FunSecKB2, a fungal database; [16] PlantSecKB, a plant database; [17] MetazSecKB, an animal and human database; [18] and ProtSecKB, a protist database. [19]
Topology of a transmembrane protein refers to locations of N- and C-termini of membrane-spanning polypeptide chain with respect to the inner or outer sides of the biological membrane occupied by the protein.
PSORT is a bioinformatics tool used for the prediction of protein localisation sites in cells. It receives the information of an amino acid sequence and its taxon of origin as inputs. Then it analyses the input sequence by applying the stored rules for various sequence features of known protein sorting signals. Finally, it reports the possibility for the input protein to be localised at each candidate site with additional information.
The cells of eukaryotic organisms are elaborately subdivided into functionally-distinct membrane-bound compartments. Some major constituents of eukaryotic cells are: extracellular space, plasma membrane, cytoplasm, nucleus, mitochondria, Golgi apparatus, endoplasmic reticulum (ER), peroxisome, vacuoles, cytoskeleton, nucleoplasm, nucleolus, nuclear matrix and ribosomes.
TRAPPC14 also known as MAP11 is a protein that in human is encoded by the gene TRAPPC14. It was previously referred to by the generic name C7orf43. C7orf43 has no other human alias, but in mice can be found as BC037034.
Pseudo amino acid composition, or PseAAC, in molecular biology, was originally introduced by Kuo-Chen Chou in 2001 to represent protein samples for improving protein subcellular localization prediction and membrane protein type prediction. Like the vanilla amino acid composition (AAC) method, it characterizes the protein mainly using a matrix of amino-acid frequencies, which helps with dealing with proteins without significant sequential homology to other proteins. Compared to AAC, additional information are also included in the matrix to represent some local features, such as correlation between residues of a certain distance. When dealing the cases of PseAAC, the Chou's invariance theorem has been often used.
PSORTdb is a database of protein subcellular localization (SCL) for bacteria and archaea. It is a member of the PSORT family of bioinformatics tools. The database consists of two datasets, ePSORTdb and cPSORTdb, which contain information determined through experimental validation and computational prediction, respectively. The ePSORTdb dataset is the largest curated collection of experimentally verified SCL data.
Secretomics is a type of proteomics which involves the analysis of the secretome—all the secreted proteins of a cell, tissue or organism. Secreted proteins are involved in a variety of physiological processes, including cell signaling and matrix remodeling, but are also integral to invasion and metastasis of malignant cells. Secretomics has thus been especially important in the discovery of biomarkers for cancer and understanding molecular basis of pathogenesis. The analysis of the insoluble fraction of the secretome has been termed matrisomics.
Burkhard Rost is a scientist leading the Department for Computational Biology & Bioinformatics at the Faculty of Informatics of the Technical University of Munich (TUM). Rost chairs the Study Section Bioinformatics Munich involving the TUM and the Ludwig Maximilian University of Munich (LMU) in Munich. From 2007-2014 Rost was President of the International Society for Computational Biology (ISCB).
The secretome is the set of proteins expressed by an organism and secreted into the extracellular space. In humans, this subset of the proteome encompasses 13-20% of all proteins, including cytokines, growth factors, extracellular matrix proteins and regulators, and shed receptors. The secretome of a specific tissue can be measured by mass spectrometry and its analysis constitutes a type of proteomics known as secretomics.
Proteome Analyst (PA) is a freely available web server and online toolkit for predicting protein subcellular localization, or where a protein resides in a cell. In the field of proteomics, accurately predicting a protein's subcellular localization, or where a specific protein is located inside a cell, is an important step in the large scale study of proteins. This computational prediction problem is known as Protein subcellular localization prediction. Over the last decade, more than a dozen web servers and computer programs have been developed to attempt to solve this problem. Proteome Analyst is an example of one of the better performing subcellular prediction tools. Proteome Analyst makes predictions for both prokaryotic eukaryotic proteins using a text mining approach. Proteome Analyst was originally developed by the Proteome Analyst Research Group at the University of Alberta, and was initially released in March 2004. It was recently updated in January 2014.
Relative accessible surface area or relative solvent accessibility (RSA) of a protein residue is a measure of residue solvent exposure. It can be calculated by formula:
KIAA0825 is a protein that in humans is encoded by the gene of the same name, located on chromosome 5, 5q15. It is a possible risk factor in Type II Diabetes, and associated with high levels of glucose in the blood. It is a relatively fast mutating gene, compared to other coding genes. There is however one region which is highly conserved across the species that have the gene, known as DUF4495. It is predicted to travel between the nucleus and the cytoplasm.
UPF0575 protein C19orf67 is a protein which in humans is encoded by the C19orf67 gene. Orthologs of C19orf67 are found in many mammals, some reptiles, and most jawed fish. The protein is expressed at low levels throughout the body with the exception of the testis and breast tissue. Where it is expressed, the protein is predicted to be localized in the nucleus to carry out a function. The highly conserved and slowly evolving DUFF3314 region is predicted to form numerous alpha helices and may be vital to the function of the protein.
Transmembrane protein 179 is a protein that in humans is encoded by the TMEM179 gene. The function of transmembrane protein 179 is not yet well understood, but it is believed to have a function in the nervous system.
FAM237A is a protein coding gene which encodes a protein of the same name. Within Homo sapiens, FAM237A is believed to be primarily expressed within the brain, with moderate heart and lesser testes expression,. FAM237A is hypothesized to act as a specific activator of receptor GPR83.
C13orf42 is a protein which, in humans, is encoded by the gene chromosome 13 open reading frame 42 (C13orf42). RNA sequencing data shows low expression of the C13orf42 gene in a variety of tissues. The C13orf42 protein is predicted to be localized in the mitochondria, nucleus, and cytosol. Tertiary structure predictions for C13orf42 indicate multiple alpha helices.
C6orf163 is a human protein encoded by the C6orf163 gene.
CIMAP1C is a gene in humans that encodes the CIMAP1C protein. It is also often referred to as ODF3L1. CIMAP1C is expressed in low levels throughout the body with high expression levels in the testes. It is highly conserved in mammals and reptiles but not present in birds or amphibians, indicating it arose around 300 million years ago.