Protein subcellular localization prediction

Protein subcellular localization prediction (or just protein localization prediction) is the prediction of where a protein resides in a cell, that is, its subcellular localization.

In general, prediction tools take as input information about a protein, such as its amino acid sequence, and produce as output a predicted location within the cell, such as the nucleus, endoplasmic reticulum, Golgi apparatus, extracellular space, or other organelles. The aim is to build tools that can accurately predict the outcome of protein targeting in cells.
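
The input-to-output mapping described above can be illustrated with a minimal Python sketch. It assumes amino acid composition as the only feature and a nearest-centroid rule as the classifier; both are simplifications chosen for brevity and are not the method of any particular published tool. The function names, the two location labels, and the toy sequences used to build the centroids are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence."""
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def predict_location(sequence, centroids):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    features = amino_acid_composition(sequence)

    def distance(label):
        return sum((f - c) ** 2 for f, c in zip(features, centroids[label]))

    return min(centroids, key=distance)


# Hypothetical centroids built from invented toy sequences, for illustration only.
# A real predictor would estimate them from many proteins of known localization.
TOY_CENTROIDS = {
    "nucleus": amino_acid_composition("KKKKRRRR"),
    "membrane": amino_acid_composition("LLLLVVVV"),
}

print(predict_location("KRKRKL", TOY_CENTROIDS))
```

Real predictors use much richer features and models, but the interface is the same: a sequence goes in and a location label comes out.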

Prediction of protein subcellular localization is an important component of bioinformatics-based prediction of protein function and genome annotation, and it can aid the identification of drug targets.

Background

Experimentally determining the subcellular localization of a protein can be laborious and time-consuming. Common approaches are immunolabeling or tagging (for example with green fluorescent protein), with localization viewed under a fluorescence microscope. Prediction offers a high-throughput alternative.

Advances in computer science, coupled with a growing dataset of proteins of known localization, mean that computational tools can now provide fast and accurate localization predictions for many organisms. As a result, subcellular localization prediction has become one of the challenges successfully addressed with bioinformatics and machine learning.

Many prediction methods now exceed the accuracy of some high-throughput laboratory methods for identifying protein subcellular localization. [1] [2] [3] In particular, some predictors [4] can handle proteins that exist simultaneously in, or move between, two or more subcellular locations. Experimental validation is still typically required to confirm predicted localizations.

Tools

In 1999 PSORT was the first published program to predict subcellular localization. [5] Subsequent tools and websites have used techniques such as artificial neural networks, support vector machines and protein motifs. Predictors can be specialized for proteins from different organisms: some for eukaryotic proteins, [6] some for human proteins, [7] and some for plant proteins. [8] Methods for predicting bacterial protein localization, and their accuracy, have been reviewed. [9] In 2021, SCLpred-MEM, a membrane protein prediction tool powered by artificial neural networks, was published. [10] SCLpred-EMS is another tool powered by artificial neural networks; it classifies proteins into endomembrane system and secretory pathway (EMS) versus all others. [11] Similarly, Light-Attention uses machine learning methods to predict ten common subcellular locations. [12]
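
As a rough illustration of the motif-based approach mentioned above, the following Python sketch scans a sequence for three well-known sorting signals: a C-terminal KDEL or HDEL (endoplasmic reticulum retention), a C-terminal SKL (peroxisomal targeting), and a short basic cluster resembling a classical monopartite nuclear localization signal. The function name and the simplified patterns are assumptions made for this example; it is not the rule set of PSORT or any other tool, and real signals are considerably more varied.

```python
import re

# Simplified pattern for a classical monopartite nuclear localization signal:
# K, then K or R, then any residue, then K or R.
NLS_PATTERN = re.compile(r"K[KR].[KR]")


def motif_hints(sequence):
    """Return a list of locations suggested by simple sequence motifs.

    The list may be empty or contain several locations, since a protein
    can carry more than one signal.
    """
    seq = sequence.upper()
    hints = []
    if seq.endswith(("KDEL", "HDEL")):   # C-terminal ER retention signal
        hints.append("endoplasmic reticulum")
    if seq.endswith("SKL"):              # C-terminal peroxisomal targeting signal
        hints.append("peroxisome")
    if NLS_PATTERN.search(seq):          # basic cluster anywhere in the sequence
        hints.append("nucleus")
    return hints


# Toy sequences invented for illustration; PKKKRKV is the SV40 large T antigen NLS.
for toy in ("MAAAKDEL", "MPKKKRKVAAA", "MAAASKL"):
    print(toy, motif_hints(toy))
```

A motif match of this kind is only a hint; the neural network and support vector machine predictors described above weigh many such features together.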

The development of protein subcellular location prediction has been summarized in two comprehensive review articles. [13] [14] More recent tools and an experience report are described in a paper by Meinken and Min (2012).

Application

Knowledge of the subcellular localization of a protein can significantly improve target identification during the drug discovery process. For example, secreted proteins and plasma membrane proteins are easily accessible by drug molecules due to their localization in the extracellular space or on the cell surface.

Bacterial cell surface and secreted proteins are also of interest for their potential as vaccine candidates or as diagnostic targets. Aberrant subcellular localization of proteins has been observed in cells affected by several diseases, such as cancer and Alzheimer's disease. Secreted proteins from some archaea that can survive in unusual environments have industrially important applications.

Prediction allows a large number of proteins to be assessed in order to find candidates that are trafficked to the desired location.

Databases

The results of subcellular localization prediction can be stored in databases. Examples include Compartments, a multi-species database; FunSecKB2, a fungal database; [15] PlantSecKB, a plant database; [16] MetazSecKB, an animal and human database; [17] and ProtSecKB, a protist database. [18]

References

  1. Kaleel, M; Zheng, Y; Chen, J; Feng, X; Simpson, JC; Pollastri, G; Mooney, C (6 March 2020). "SCLpred-EMS: subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks". Bioinformatics. 36 (11): 3343–3349. doi:10.1093/bioinformatics/btaa156. hdl:10197/12182. PMID 32142105.
  2. Rey S, Gardy JL, Brinkman FS (2005). "Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria". BMC Genomics. 6: 162. doi:10.1186/1471-2164-6-162. PMC 1314894. PMID 16288665.
  3. Kaleel, Manaz; Ellinger, Liam; Lalor, Clodagh; Pollastri, Gianluca; Mooney, Catherine (2021). "SCLpred-MEM: Subcellular localization prediction of membrane proteins by deep N-to-1 convolutional neural networks". Proteins: Structure, Function, and Bioinformatics. 89 (10): 1233–1239. doi:10.1002/prot.26144. hdl:2346/90320. PMID 33983651. S2CID 234484678.
  4. Chou KC, Shen HB (2008). "Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms". Nature Protocols. 3 (2): 153–62. doi:10.1038/nprot.2007.494. PMID 18274516. S2CID 226104.
  5. "Protein Subcellular Localization Prediction". www.ncbi.nlm.nih.gov. Retrieved 2016-12-31.
  6. Chou KC, Wu ZC, Xiao X (2011). "iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins". PLOS ONE. 6 (3): e18258. Bibcode:2011PLoSO...618258C. doi:10.1371/journal.pone.0018258. PMC 3068162. PMID 21483473.
  7. Shen HB, Chou KC (Nov 2009). "A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0". Analytical Biochemistry. 394 (2): 269–74. doi:10.1016/j.ab.2009.07.046. PMID 19651102.
  8. Chou KC, Shen HB (2010). "Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization". PLOS ONE. 5 (6): e11335. Bibcode:2010PLoSO...511335C. doi:10.1371/journal.pone.0011335. PMC 2893129. PMID 20596258.
  9. Gardy JL, Brinkman FS (Oct 2006). "Methods for predicting bacterial protein subcellular localization". Nature Reviews. Microbiology. 4 (10): 741–51. doi:10.1038/nrmicro1494. PMID 16964270. S2CID 62781755.
  10. Kaleel, Manaz; Ellinger, Liam; Lalor, Clodagh; Pollastri, Gianluca; Mooney, Catherine (2021). "SCLpred-MEM: Subcellular localization prediction of membrane proteins by deep N-to-1 convolutional neural networks". Proteins: Structure, Function, and Bioinformatics. 89 (10): 1233–1239. doi:10.1002/prot.26144. hdl:2346/90320. PMID 33983651. S2CID 234484678.
  11. Kaleel, Manaz; Zheng, Yandan; Chen, Jialiang; Feng, Xuanming; Simpson, Jeremy C; Pollastri, Gianluca; Mooney, Catherine (1 June 2020). "SCLpred-EMS: subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks". Bioinformatics. 36 (11): 3343–3349. doi:10.1093/bioinformatics/btaa156. hdl:10197/12182. PMID 32142105.
  12. Stärk, Hannes; Dallago, Christian; Heinzinger, Michael; Rost, Burkhard (26 April 2021). "Light Attention Predicts Protein Location from the Language of Life". bioRxiv. doi:10.1101/2021.04.25.441334. S2CID 233449747.
  13. Nakai, K (2000). "Protein sorting signals and prediction of subcellular localization". Adv. Protein Chem. 54: 277–344.
  14. Chou KC, Shen HB (2007). "Review: Recent progresses in protein subcellular location prediction". Anal. Biochem. 370: 1–16.
  15. "FunSecKB2 (The Fungal Secretome and Subcellular Proteome KnowledgeBase 2.1)". bioinformatics.ysu.edu. Archived from the original on 2016-04-10. Retrieved 2017-09-17.
  16. "PlantSecKB (The Plant Secretome and Subcellular Proteome KnowledgeBase)". bioinformatics.ysu.edu. Archived from the original on 2016-04-06. Retrieved 2017-09-17.
  17. "MetazSecKB (The Metazoa (Human & Animal) Protein Subcelluar Location, Secretome and Subcellular Proteome Database)". bioinformatics.ysu.edu. Archived from the original on 2016-04-06. Retrieved 2017-09-17.
  18. "ProtSecKB (The Protist Secretome and Subcellular Proteome KnowledgeBase)". proteomics.ysu.edu. Retrieved 2017-09-17.

Further reading