Trans-Proteomic Pipeline

TPP
Developer(s): Institute for Systems Biology
Initial release: 10 December 2004
Stable release: 5.0.0 / 11 October 2016 [1]
Written in: C++, Perl, Java
Operating system: Linux, Windows, OS X
Type: Bioinformatics / Mass spectrometry software
License: GPL v. 2.0 and LGPL
Website: TPP Wiki

The Trans-Proteomic Pipeline (TPP) is an open-source data analysis software suite for proteomics, developed at the Institute for Systems Biology (ISB) by the Ruedi Aebersold group under the Seattle Proteome Center. The TPP includes PeptideProphet, [2] ProteinProphet, [3] ASAPRatio, XPRESS and Libra.

Open-source software: software licensed to ensure source code usage rights

Open-source software (OSS) is a type of computer software in which source code is released under a license in which the copyright holder grants users the rights to study, change, and distribute the software to anyone and for any purpose. Open-source software may be developed in a collaborative public manner. Open-source software is a prominent example of open collaboration.

Proteomics: study of proteins

Proteomics is the large-scale study of proteins. Proteins are vital parts of living organisms, with many functions. The term proteomics was coined in 1997, in analogy to genomics, the study of the genome. The word proteome is a portmanteau of protein and genome, and was coined by Marc Wilkins in 1994 while he was a Ph.D. student at Macquarie University. Macquarie University also founded the first dedicated proteomics laboratory in 1995.

Institute for Systems Biology

Institute for Systems Biology (ISB) is a non-profit research institution located in Seattle, Washington, United States. ISB concentrates on systems biology, the study of relationships and interactions between various parts of biological systems, and advocates an interdisciplinary approach to biological research.

Software Components

Probability Assignment and Validation

PeptideProphet performs statistical validation of peptide-spectrum matches (PSMs) from search engine results by estimating a false discovery rate (FDR) at the PSM level. [4] The initial version of PeptideProphet fitted a Gaussian distribution to the correct identifications and a gamma distribution to the incorrect identifications. A later modification of the program allowed the use of a target-decoy approach, using either a variable component mixture model or a semi-parametric mixture model. [5] In PeptideProphet, specifying a decoy tag selects the variable component mixture model, while selecting a non-parametric model selects the semi-parametric mixture model.
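As a rough illustration of the original two-distribution model described above, the following Python sketch computes the posterior probability that a PSM is correct from a Gaussian density for correct identifications and a gamma density for incorrect ones. The function names, the prior, and every parameter value are assumptions chosen for illustration; they are not fitted values, and the sketch assumes scores have already been shifted to be positive so that the gamma density is defined.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution (shape/scale form); zero for x <= 0."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def posterior_correct(score, prior_correct, mu, sigma, shape, scale):
    """Bayes' rule for a two-component mixture: P(correct | score)."""
    pos = prior_correct * gaussian_pdf(score, mu, sigma)
    neg = (1.0 - prior_correct) * gamma_pdf(score, shape, scale)
    total = pos + neg
    return pos / total if total > 0.0 else 0.0


# Hypothetical mixture parameters (illustrative only, not fitted to data).
PARAMS = dict(prior_correct=0.3, mu=6.0, sigma=1.5, shape=2.0, scale=1.0)

probabilities = [posterior_correct(s, **PARAMS) for s in (1.0, 3.0, 6.0)]
```

With these assumed parameters, higher scores receive higher posterior probabilities, because the Gaussian component dominates at the high end of the score range.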

The false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the expected proportion of "discoveries" that are false. FDR-controlling procedures provide less stringent control of Type I errors compared to familywise error rate (FWER) controlling procedures, which control the probability of at least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of increased numbers of Type I errors.
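The trade-off between FDR control and FWER control can be made concrete with a small sketch. The Benjamini-Hochberg step-up procedure is used here as one widely known FDR-controlling procedure and the Bonferroni correction as one FWER-controlling procedure; neither is named in the text above, and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value satisfies p_(k) <= alpha * k / m.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up p-values for ten hypothetical tests.
P_VALUES = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh_rejections = benjamini_hochberg(P_VALUES)
bonferroni_rejections = bonferroni(P_VALUES)
```

On this toy input the FDR-controlling procedure rejects two hypotheses while the FWER-controlling one rejects only one, which is the gain in power the paragraph above refers to.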

Normal distribution: probability distribution

In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

Gamma distribution: probability distribution

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are three different parametrizations in common use:

  1. With a shape parameter k and a scale parameter θ.
  2. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter.
  3. With a shape parameter k and a mean parameter μ = kθ = α/β.
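The three parametrizations describe the same family of densities, with mean μ = kθ = α/β. The following sketch, using only the Python standard library, evaluates the density in each form and checks that the results agree; the function names and the numeric values are illustrative assumptions.

```python
import math


def gamma_pdf_shape_scale(x, k, theta):
    """Density with shape k and scale theta (x must be positive)."""
    return math.exp((k - 1.0) * math.log(x) - x / theta
                    - math.lgamma(k) - k * math.log(theta))


def gamma_pdf_shape_rate(x, alpha, beta):
    """Density with shape alpha and rate beta = 1 / theta."""
    return math.exp(alpha * math.log(beta) + (alpha - 1.0) * math.log(x)
                    - beta * x - math.lgamma(alpha))


def gamma_pdf_shape_mean(x, k, mu):
    """Density with shape k and mean mu; the implied scale is mu / k."""
    return gamma_pdf_shape_scale(x, k, mu / k)


# Arbitrary example values.
k, theta = 2.5, 1.2
alpha, beta = k, 1.0 / theta
mu = k * theta

x = 2.0
densities = (
    gamma_pdf_shape_scale(x, k, theta),
    gamma_pdf_shape_rate(x, alpha, beta),
    gamma_pdf_shape_mean(x, k, mu),
)
```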

ProteinProphet identifies proteins based on the results of PeptideProphet. [6]

Mayu performs statistical validation of protein identifications by estimating a false discovery rate (FDR) at the protein level. [7]

Spectral library handling

The SpectraST tool is able to generate spectral libraries and search datasets using these libraries. [8]

Related Research Articles

Peptide mass fingerprinting: analytical technique for protein identification

Peptide mass fingerprinting (PMF) is an analytical technique for protein identification in which the unknown protein of interest is first cleaved into smaller peptides, whose absolute masses can be accurately measured with a mass spectrometer such as MALDI-TOF or ESI-TOF. The method was developed in 1993 by several groups independently. The peptide masses are compared to either a database containing known protein sequences or even the genome. This is achieved by using computer programs that translate the known genome of the organism into proteins, then theoretically cut the proteins into peptides, and calculate the absolute masses of the peptides from each protein. They then compare the masses of the peptides of the unknown protein to the theoretical peptide masses of each protein encoded in the genome. The results are statistically analyzed to find the best match.
The advantage of this method is that only the masses of the peptides have to be known. Time-consuming de novo peptide sequencing is then unnecessary. A disadvantage is that the protein sequence has to be present in the database of interest. Additionally, most PMF algorithms assume that the peptides come from a single protein. The presence of a mixture can significantly complicate the analysis and potentially compromise the results. PMF-based protein identification therefore typically requires an isolated protein. Mixtures of more than two or three proteins typically require the additional use of MS/MS-based protein identification to achieve sufficient specificity of identification (6). Consequently, typical PMF samples are isolated proteins from two-dimensional gel electrophoresis or isolated SDS-PAGE bands. Additional analyses by MS/MS can either be direct, e.g., MALDI-TOF/TOF analysis, or downstream, e.g., nanoLC-ESI-MS/MS analysis of gel spot eluates.
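The database-comparison step described above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not the algorithm of any particular PMF program: it uses a simple trypsin rule (cleave after K or R unless the next residue is P), neutral monoisotopic masses rather than ion m/z values, no missed cleavages or modifications, and a plain count of matching masses in place of a statistical score. The protein names and sequences are hypothetical.

```python
import re

# Monoisotopic amino acid residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # added once per peptide for the terminal H and OH


def tryptic_digest(protein):
    """Theoretically cut a protein after K or R, except before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def count_matches(observed_masses, protein, tolerance=0.01):
    """Number of observed masses within tolerance of a theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed_masses
    )


def best_match(observed_masses, database):
    """Name of the database protein explaining the most observed masses."""
    return max(database,
               key=lambda name: count_matches(observed_masses, database[name]))


# Hypothetical two-protein database and simulated measurements.
DATABASE = {"protein_A": "MKWVTFISLLR", "protein_B": "GASPVTCLINDQK"}
observed = [peptide_mass(p) + 0.002
            for p in tryptic_digest(DATABASE["protein_A"])]
identified = best_match(observed, DATABASE)
```

A simple match count like this makes the single-protein assumption visible: masses from a second protein in the sample would simply go unmatched or match the wrong entry by chance.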

SEQUEST is a tandem mass spectrometry data analysis program used for protein identification. SEQUEST matches collections of tandem mass spectra to peptide sequences that have been generated from databases of protein sequences.

Ruedi Aebersold: Swiss biologist

Rudolf Aebersold is a Swiss biologist, regarded as a pioneer in the fields of proteomics and systems biology. He has primarily researched techniques for measuring proteins in complex samples, in many cases via mass spectrometry. Ruedi Aebersold is a professor of Systems biology at the Institute of Molecular Systems Biology (IMSB) in ETH Zurich. He was one of the founders of the Institute for Systems Biology in Seattle, Washington, where he previously had a research group.

Mascot is a software search engine that uses mass spectrometry data to identify proteins from peptide sequence databases. Mascot is widely used by research facilities around the world. Mascot uses a probabilistic scoring algorithm for protein identification that was adapted from the MOWSE algorithm. Mascot is freely available to use on the website of Matrix Science. A license is required for in-house use, where more features can be incorporated.

PEAKS is a proteomics software program for tandem mass spectrometry designed for peptide sequencing, protein identification and quantification.

Protein mass spectrometry

Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. Mass spectrometry is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instrumentations have been developed for its many uses. Its applications include the identification of proteins and their post-translational modifications, the elucidation of protein complexes, their subunits and functional interactions, as well as the global measurement of proteins in proteomics. It can also be used to localize proteins to the various organelles, and determine the interactions between different proteins as well as with membrane lipids.

Shotgun proteomics refers to the use of bottom-up proteomics techniques to identify proteins in complex mixtures using high-performance liquid chromatography combined with mass spectrometry. The name is derived from shotgun sequencing of DNA, which is itself named after the rapidly expanding, quasi-random firing pattern of a shotgun. In the most common method of shotgun proteomics, the proteins in the mixture are digested and the resulting peptides are separated by liquid chromatography. Tandem mass spectrometry is then used to identify the peptides.

Bottom-up proteomics

Bottom-up proteomics is a common method to identify proteins and characterize their amino acid sequences and post-translational modifications by proteolytic digestion of proteins prior to analysis by mass spectrometry. The major alternative workflow used in proteomics is called top-down proteomics where intact proteins are purified prior to digestion and/or fragmentation either within the mass spectrometer or by 2D electrophoresis. Essentially, bottom-up proteomics is a relatively simple and reliable means of determining the protein make-up of a given sample of cells, tissues, etc.

Quantitative proteomics

Quantitative proteomics is an analytical chemistry technique for determining the amount of proteins in a sample. The methods for protein identification are identical to those used in general proteomics, but include quantification as an additional dimension. Rather than just providing lists of proteins identified in a certain sample, quantitative proteomics yields information about the physiological differences between two biological samples. For example, this approach can be used to compare samples from healthy and diseased patients. Quantitative proteomics is mainly performed by two-dimensional gel electrophoresis (2-DE) or mass spectrometry (MS). However, a recently developed method, quantitative dot blot (QDB) analysis, is able to measure both the absolute and relative quantity of an individual protein in the sample in a high-throughput format, thus opening a new direction for proteomic research. In contrast to 2-DE, which requires MS for the downstream protein identification, MS technology can both identify and quantify the changes.

Isobaric tag for relative and absolute quantitation

Isobaric tags for relative and absolute quantitation (iTRAQ) is an isobaric labeling method used in quantitative proteomics by tandem mass spectrometry to determine the amount of proteins from different sources in a single experiment. It uses stable-isotope-labeled molecules that can be covalently bonded to the N-terminus and side chain amines of proteins.

An isotope-coded affinity tag (ICAT) is an isotopic labeling method used for quantitative proteomics by mass spectrometry that uses chemical labeling reagents. These chemical probes consist of three elements: a reactive group for labeling an amino acid side chain, an isotopically coded linker, and a tag for the affinity isolation of labeled proteins/peptides. The samples are combined and then separated through chromatography, then sent through a mass spectrometer to determine the mass-to-charge ratio between the proteins.

OpenMS is an open-source project for data analysis and processing in protein mass spectrometry and is released under the 3-clause BSD licence. It supports most common operating systems including Microsoft Windows, OS X and Linux.

Selected reaction monitoring

Selected reaction monitoring (SRM) is a method used in tandem mass spectrometry in which an ion of a particular mass is selected in the first stage of a tandem mass spectrometer and an ion product of a fragmentation reaction of the precursor ion is selected in the second mass spectrometer stage for detection.

A peptide spectral library is a curated, annotated and non-redundant collection/database of LC-MS/MS peptide spectra. One essential utility of a peptide spectral library is to serve as consensus templates supporting the identification of peptide/proteins based on the correlation between the templates with experimental spectra.
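The idea of correlating library templates with an experimental spectrum can be illustrated with a normalized dot product (cosine similarity) over binned peaks. This is a generic sketch, not the scoring function of SpectraST or any other tool; the bin width, the library entry names, and all peak values are invented for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_similarity(a, b):
    """Normalized dot product of two binned spectra (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_library_match(query_peaks, library):
    """Return the library entry name most similar to the query spectrum."""
    query = bin_spectrum(query_peaks)
    return max(
        library,
        key=lambda name: cosine_similarity(query, bin_spectrum(library[name])),
    )


# Hypothetical consensus spectra as (m/z, intensity) pairs.
LIBRARY = {
    "entry_1": [(175.1, 35.0), (300.2, 100.0), (450.3, 50.0)],
    "entry_2": [(147.1, 80.0), (262.1, 100.0), (401.2, 20.0)],
}
query = [(175.1, 30.0), (300.2, 100.0), (450.3, 55.0)]

match = best_library_match(query, LIBRARY)
self_similarity = cosine_similarity(bin_spectrum(query), bin_spectrum(query))
```

Because the measure is normalized, it depends on the relative intensity pattern rather than on absolute signal, so spectra acquired at different total intensities remain comparable.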

Proteogenomics

Proteogenomics is a field of biological research that utilizes a combination of proteomics, genomics, and transcriptomics to aid in the discovery and identification of peptides. Proteogenomics is used to identify new peptides by comparing MS/MS spectra against a protein database that has been derived from genomic and transcriptomic information. Proteogenomics often refers to studies that use proteomic information, often derived from mass spectrometry, to improve gene annotations. Genomics deals with the genetic code of entire organisms, while transcriptomics deals with the study of RNA sequencing and transcripts. Proteomics utilizes tandem mass spectrometry and liquid chromatography to identify and study the functions of proteins. Proteomics is being utilized to discover all the proteins expressed within an organism, known as its proteome. The issue with proteomics is that it relies on the assumption that current gene models are correct and that the correct protein sequences can be found using a reference protein sequence database; however, this is not always the case, as some peptides cannot be located in the database. In addition, novel protein sequences can occur through mutations. These issues can be addressed with the use of proteomic, genomic, and transcriptomic data. The utilization of both proteomics and genomics led to proteogenomics, which became its own field in 2004.

Systematic Protein Investigative Research Environment (SPIRE) provides web-based experiment-specific mass spectrometry (MS) proteomics analysis in order to identify proteins and peptides, and label-free expression and relative expression analyses. SPIRE provides a web-interface and generates results in both interactive and simple data formats.

MassMatrix is mass spectrometry data analysis software that uses a statistical model to achieve increased mass accuracy over other database search algorithms. This search engine is set apart from others due to its ability to provide extremely efficient judgement between true and false positives for high mass accuracy data that has been obtained from present-day mass spectrometer instruments. It is useful for identifying disulphide bonds in tandem mass spectrometry data.

In mass spectrometry, data-independent acquisition (DIA) is a method of molecular structure determination in which all ions within a selected m/z range are fragmented and analyzed in a second stage of tandem mass spectrometry. Tandem mass spectra are acquired either by fragmenting all ions that enter the mass spectrometer at a given time or by sequentially isolating and fragmenting ranges of m/z. DIA is an alternative to data-dependent acquisition (DDA) where a fixed number of precursor ions are selected and analyzed by tandem mass spectrometry.

References

  1. TPP 5.0.0 Release is Available
  2. Software:PeptideProphet - SPCTools
  3. Software:ProteinProphet - SPCTools
  4. Keller, A.; Nesvizhskii, A.; Kolker, E.; Aebersold, R. (2002). "Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search". Anal Chem. 74 (20): 5383–5392. doi:10.1021/ac025747h. PMID 12403597.
  5. Choi, Hyungwon; Ghosh, Debashis; Nesvizhskii, Alexey I. (2008). "Statistical Validation of Peptide Identifications in Large-Scale Proteomics Using the Target-Decoy Database Search Strategy and Flexible Mixture Modeling". Journal of Proteome Research. 7 (1): 286–292. doi:10.1021/pr7006818. ISSN 1535-3893. PMID 18078310.
  6. Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. (2003). "A statistical model for identifying proteins by tandem mass spectrometry". Anal Chem. 75: 4646–58.
  7. Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. (Nov 2009). "Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry". Mol Cell Proteomics. 8 (11): 2405–17. doi:10.1074/mcp.M900317-MCP200. PMC 2773710. PMID 19608599.
  8. Software:SpectraST - SPCTools