Original author(s) | David Perkins and Darryl Pappin |
---|---|
Initial release | 1999 |
Stable release | 2.6.00 / December 2016 |
Operating system | Linux or Windows |
Available in | C |
Type | Protein identification Bioinformatics |
License | proprietary, free for online use |
Website | http://www.matrixscience.com/ |
Mascot is a software search engine that uses mass spectrometry data to identify proteins from peptide sequence databases. [1] [2] Mascot is widely used by research facilities around the world. Mascot uses a probabilistic scoring algorithm for protein identification that was adapted from the MOWSE algorithm. Mascot is freely available to use on the website of Matrix Science. [3] A license is required for in-house use, which allows more features to be incorporated.
MOWSE was one of the first algorithms developed for protein identification using peptide mass fingerprinting. [4] It was originally developed in 1993 as a collaboration between Darryl Pappin of the Imperial Cancer Research Fund (ICRF) and Alan Bleasby of the Science and Engineering Research Council (SERC). MOWSE stood apart from other protein identification algorithms in that it produced a probability-based score for identification. It was also the first to take into account the non-uniform distribution of peptide sizes, caused by the enzymatic digestion of a protein that is needed for mass spectrometry analysis. However, MOWSE was only applicable to peptide mass fingerprint searches and depended on pre-compiled databases, which were inflexible with regard to post-translational modifications and enzymes other than trypsin. To overcome these limitations, to take advantage of multi-processor systems and to add non-enzymatic search functionality, David Perkins at the Imperial Cancer Research Fund began development again from scratch. The first versions were developed for Silicon Graphics Irix and Digital Unix systems. Eventually this software was named Mascot, and to reach a wider audience an external bioinformatics company named Matrix Science was created by David Creasy and John Cottrell to develop and distribute it. Legacy software versions exist for Tru64, Irix, AIX, Solaris, Microsoft Windows NT4 and Microsoft Windows 2000. Mascot has been available as a free service on the Matrix Science website since 1999 and has been cited in scientific literature over 5,000 times. Matrix Science continues to work on improving Mascot’s functionality.
Mascot identifies proteins by interpreting mass spectrometry data. The prevailing experimental method for protein identification is a bottom-up approach, where a protein sample is typically digested with trypsin to form smaller peptides. While most intact proteins are too big, peptides usually fall within the limited mass range that a typical mass spectrometer can measure. Mass spectrometers measure the molecular weights of peptides in a sample. Mascot then compares these molecular weights against a database of known peptides. The program cleaves every protein in the specified search database in silico, according to rules that depend on the cleavage enzyme used for digestion, and calculates the theoretical mass of each peptide. Mascot then computes a score based on the probability that the peptides from a sample match those in the selected protein database. The more peptides Mascot identifies from a particular protein, the higher the Mascot score for that protein.
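The in silico digestion step can be illustrated with a minimal sketch. It is not Mascot's implementation: the function names `tryptic_digest` and `peptide_mass` are hypothetical, and the cleavage rule used (cut after lysine or arginine unless the next residue is proline) is the commonly quoted trypsin rule, assumed here for illustration. Masses are monoisotopic residue masses plus one water per peptide, with no modifications.

```python
import re

# Monoisotopic residue masses in daltons (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # mass of H2O added to the summed residues


def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after K or R unless followed by P (assumed trypsin rule)."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        # Join up to `missed_cleavages` additional neighbouring pieces.
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


if __name__ == "__main__":
    for pep in tryptic_digest("AAKRPGGKCC"):
        print(pep, round(peptide_mass(pep), 4))
```

A search engine would repeat this for every database entry and compare the resulting theoretical masses with the measured ones within a mass tolerance.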
The software processes data from mass spectrometers of the following companies:
Mascot’s fundamental approach to identifying peptides is to calculate the probability that an observed match between experimental data and peptide sequences found in a reference database has occurred by chance. The match with the lowest probability of occurring by chance is returned as the most significant match. The significance of the match depends on the size of the database that is being queried. Mascot employs the widely used significance level of 0.05, meaning that in a single test the probability of observing an event at random is less than or equal to 1 in 20. In this light, a probability of 10⁻⁵ might seem very promising. However, if the database being searched contains 10⁶ sequences, several matches of this magnitude would be expected by chance alone because the algorithm carried out 10⁶ individual comparisons. For a database of that size, applying a Bonferroni correction to account for multiple comparisons lowers the significance threshold to 0.05 / 10⁶ = 5 × 10⁻⁸. [1]
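The arithmetic behind this correction can be checked with a short sketch. The function names are hypothetical and the code reproduces only the Bonferroni calculation described above, not Mascot's internal scoring.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def expected_chance_matches(p_value, n_comparisons):
    """Expected number of matches at least this good arising by chance."""
    return p_value * n_comparisons


if __name__ == "__main__":
    n_sequences = 10**6  # database size used in the example above
    print(bonferroni_threshold(0.05, n_sequences))
    print(expected_chance_matches(1e-5, n_sequences))
```

With 10⁶ comparisons, a match probability of 10⁻⁵ corresponds to about ten chance matches, which is why only probabilities below the corrected threshold are treated as significant.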
In addition to the calculated peptide scores, Mascot also estimates the False Discovery Rate (FDR) by searching against a decoy database. When performing a decoy search, Mascot generates a randomized sequence of the same length for every sequence in the target database. The decoy sequence is generated such that it has the same average amino acid composition as the target database. The FDR is estimated as the ratio of decoy database matches to target database matches. This relates to the standard formula FDR = FP / (FP + TP), where FP are false positives and TP are true positives. The decoy matches are certain to be spurious identifications, so their count serves as an estimate of FP, because true and false positives identified in the target database cannot be told apart directly. FDR estimation was added in response to journals' guidelines on protein identification reports, such as those from Molecular and Cellular Proteomics. [5] Mascot's FDR calculation incorporates ideas from different publications. [6] [7]
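A minimal sketch of the decoy idea follows. It is an illustration under stated assumptions, not Mascot's procedure: `make_decoys` and `estimate_fdr` are hypothetical names, and each decoy residue is drawn independently according to the overall amino acid frequencies of the target database, which preserves the average composition only in expectation.

```python
import random
from collections import Counter


def make_decoys(targets, seed=0):
    """One random decoy per target, same length, target-wide composition."""
    rng = random.Random(seed)
    counts = Counter("".join(targets))
    residues = list(counts)
    weights = [counts[r] for r in residues]
    return [
        "".join(rng.choices(residues, weights=weights, k=len(seq)))
        for seq in targets
    ]


def estimate_fdr(decoy_matches, target_matches):
    """FDR estimate: decoy matches divided by target matches."""
    if target_matches == 0:
        return 0.0  # no target matches, so nothing to be falsely discovered
    return decoy_matches / target_matches


if __name__ == "__main__":
    targets = ["MKWVTFISLLR", "AAKRPGGKCC"]
    decoys = make_decoys(targets)
    print(decoys)
    # Hypothetical counts of matches above the significance threshold.
    print(estimate_fdr(decoy_matches=5, target_matches=100))
```

Here the number of decoy matches stands in for FP, and the number of target matches stands in for FP + TP.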
The most common alternative database search programs are listed in the Mass spectrometry software article. The performance of a variety of mass spectrometry software, including Mascot, can be observed in the 2011 iPRG study. Genome-based peptide fingerprint scanning is another method that compares the peptide fingerprints to the entire genome instead of only annotated genes.
Tandem mass spectrometry, also known as MS/MS or MS2, is a technique in instrumental analysis where two or more mass analyzers are coupled together using an additional reaction step to increase their abilities to analyse chemical samples. A common use of tandem MS is the analysis of biomolecules, such as proteins and peptides.
Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide. This may serve to identify the protein or characterize its post-translational modifications. Typically, partial sequencing of a protein provides sufficient information to identify it with reference to databases of protein sequences derived from the conceptual translation of genes.
Peptide mass fingerprinting (PMF) is an analytical technique for protein identification in which the unknown protein of interest is first cleaved into smaller peptides, whose absolute masses can be accurately measured with a mass spectrometer such as MALDI-TOF or ESI-TOF. The method was developed in 1993 by several groups independently. The peptide masses are compared to either a database containing known protein sequences or even the genome. This is achieved by using computer programs that translate the known genome of the organism into proteins, then theoretically cut the proteins into peptides, and calculate the absolute masses of the peptides from each protein. They then compare the masses of the peptides of the unknown protein to the theoretical peptide masses of each protein encoded in the genome. The results are statistically analyzed to find the best match.
SEQUEST is a tandem mass spectrometry data analysis program used for protein identification. SEQUEST matches collections of tandem mass spectra to peptide sequences that have been generated from databases of protein sequences.
Genome-based peptide fingerprint scanning (GFS) is a system in bioinformatics analysis that attempts to identify the genomic origin of sample proteins by scanning their peptide-mass fingerprint against the theoretical translation and proteolytic digest of an entire genome. This method is an improvement from previous methods because it compares the peptide fingerprints to an entire genome instead of comparing it to an already annotated genome. This improvement has the potential to improve genome annotation and identify proteins with incorrect or missing annotations.
The Trans-Proteomic Pipeline (TPP) is an open-source data analysis software for proteomics developed at the Institute for Systems Biology (ISB) by the Ruedi Aebersold group under the Seattle Proteome Center. The TPP includes PeptideProphet, ProteinProphet, ASAPRatio, XPRESS and Libra.
A peptide sequence tag is a piece of information about a peptide obtained by tandem mass spectrometry that can be used to identify this peptide in a protein database.
PEAKS is a proteomics software program for tandem mass spectrometry designed for peptide sequencing, protein identification and quantification.
Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. Mass spectrometry is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instrumentations have been developed for its many uses. Its applications include the identification of proteins and their post-translational modifications, the elucidation of protein complexes, their subunits and functional interactions, as well as the global measurement of proteins in proteomics. It can also be used to localize proteins to the various organelles, and determine the interactions between different proteins as well as with membrane lipids.
Shotgun proteomics refers to the use of bottom-up proteomics techniques in identifying proteins in complex mixtures using a combination of high performance liquid chromatography combined with mass spectrometry. The name is derived from shotgun sequencing of DNA which is itself named after the rapidly expanding, quasi-random firing pattern of a shotgun. The most common method of shotgun proteomics starts with the proteins in the mixture being digested and the resulting peptides are separated by liquid chromatography. Tandem mass spectrometry is then used to identify the peptides.
Bottom-up proteomics is a common method to identify proteins and characterize their amino acid sequences and post-translational modifications by proteolytic digestion of proteins prior to analysis by mass spectrometry. The major alternative workflow used in proteomics is called top-down proteomics where intact proteins are purified prior to digestion and/or fragmentation either within the mass spectrometer or by 2D electrophoresis. Essentially, bottom-up proteomics is a relatively simple and reliable means of determining the protein make-up of a given sample of cells, tissues, etc.
Isobaric tags for relative and absolute quantitation (iTRAQ) is an isobaric labeling method used in quantitative proteomics by tandem mass spectrometry to determine the amount of proteins from different sources in a single experiment. It uses stable isotope labeled molecules that can be covalently bonded to the N-terminus and side chain amines of proteins.
MOWSE is a method for identification of proteins from the molecular weight of peptides created by proteolytic digestion and measured with mass spectrometry.
A peptide spectral library is a curated, annotated and non-redundant collection/database of LC-MS/MS peptide spectra. One essential utility of a peptide spectral library is to serve as consensus templates supporting the identification of peptide/proteins based on the correlation between the templates with experimental spectra.
Proteogenomics is a field of biological research that utilizes a combination of proteomics, genomics, and transcriptomics to aid in the discovery and identification of peptides. Proteogenomics is used to identify new peptides by comparing MS/MS spectra against a protein database that has been derived from genomic and transcriptomic information. Proteogenomics often refers to studies that use proteomic information, often derived from mass spectrometry, to improve gene annotations. Genomics deals with the genetic code of entire organisms, while transcriptomics deals with the study of RNA sequencing and transcripts. Proteomics utilizes tandem mass spectrometry and liquid chromatography to identify and study the functions of proteins. Proteomics is being utilized to discover all the proteins expressed within an organism, known as its proteome. The issue with proteomics is that it relies on the assumption that current gene models are correct and that the correct protein sequences can be found using a reference protein sequence database; however, this is not always the case, as some peptides cannot be located in the database. In addition, novel protein sequences can occur through mutations. These issues can be addressed by combining proteomic, genomic, and transcriptomic data. The utilization of both proteomics and genomics led to proteogenomics, which became its own field in 2004.
In bio-informatics, a peptide-mass fingerprint or peptide-mass map is a mass spectrum of a mixture of peptides that comes from a digested protein being analyzed. The mass spectrum serves as a fingerprint in the sense that it is a pattern that can serve to identify the protein. The method for forming a peptide-mass fingerprint, developed in 1993, consists of isolating a protein, breaking it down into individual peptides, and determining the masses of the peptides through some form of mass spectrometry. Once formed, a peptide-mass fingerprint can be used to search in databases for related protein or even genomic sequences, making it a powerful tool for annotation of protein-coding genes.
MassMatrix is a mass spectrometry data analysis software that uses a statistical model to achieve increased mass accuracy over other database search algorithms. This search engine is set apart from others due to its ability to provide extremely efficient judgement between true and false positives for high mass accuracy data that has been obtained from present day mass spectrometer instruments. It is useful for identifying disulphide bonds in tandem mass spectrometry data.
In mass spectrometry, de novo peptide sequencing is the method in which a peptide amino acid sequence is determined from tandem mass spectrometry.
Paleoproteomics is a relatively young and rapidly growing field of molecular science in which proteomics-based sequencing technology is used to resolve species identification and evolutionary relationships of extinct taxa. While complementary to paleogenomics in application, the study of ancient proteins has the potential to reveal older, more complete phylogenies due to the relative stability of amino acids in proteins as compared to the nucleic acids of DNA. Ancient protein studies can further reveal types and sources of recovered tissues, as well as the developmental stages of fossilized specimens. Paleoproteomics can also be extended to archaeological materials such as textiles, animal skins, food remains, and pottery.