Genome-based peptide fingerprint scanning

Genome-based peptide fingerprint scanning (GFS) is a bioinformatics system that attempts to identify the genomic origin of sample proteins (that is, what species they come from) by scanning their peptide-mass fingerprints against the theoretical translation and proteolytic digest of an entire genome. [1] The method improves on earlier approaches because it compares the peptide fingerprints with an entire genome rather than only with sequences that have already been annotated. [2] It therefore has the potential to improve genome annotation and to identify proteins whose annotations are incorrect or missing.
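The idea of scanning a fingerprint against a translated and digested genome can be illustrated with a minimal Python sketch. It is not the published GFS implementation: the function names, the trypsin-only cleavage rule (cut after K or R unless followed by P, no missed cleavages), the fixed mass tolerance, and the minimum peptide length are all assumptions made for illustration. The sketch translates a DNA sequence in all six reading frames, splits each translation at stop codons, digests the segments in silico, and reports the peptides whose monoisotopic mass matches an observed mass.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in daltons; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Yield (strand, frame, translation) for all six reading frames."""
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
            # Codons with ambiguous bases are translated as "X".
            yield strand, frame, "".join(CODON_TABLE.get(c, "X") for c in codons)


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def scan_genome(dna, observed_masses, tol=0.01, min_len=5):
    """Return (strand, frame, peptide, mass) for peptides matching an observed mass.

    tol (daltons) and min_len are illustrative defaults, not values from the source.
    """
    hits = []
    for strand, frame, translation in six_frame_translations(dna.upper()):
        for segment in translation.split("*"):  # stop codons end a segment
            for pep in tryptic_peptides(segment):
                if len(pep) < min_len or any(aa not in RESIDUE_MASS for aa in pep):
                    continue
                mass = peptide_mass(pep)
                if any(abs(mass - m) <= tol for m in observed_masses):
                    hits.append((strand, frame, pep, round(mass, 4)))
    return hits


if __name__ == "__main__":
    # Toy "genome" whose forward frame 0 encodes MAGICK and EAAAR, then a stop.
    genome = "ATGGCTGGTATTTGTAAAGAAGCTGCTGCTCGTTAA"
    observed = [peptide_mass("MAGICK")]
    for hit in scan_genome(genome, observed):
        print(hit)
```

Because the search runs over every reading frame rather than over a list of annotated proteins, a matching peptide is reported together with the strand and frame in which it was found, which is what allows unannotated coding regions to surface. A real tool would also need statistical scoring of matches, handling of missed cleavages and modifications, and far more efficient indexing than this linear scan.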

History and background

GFS was designed by Michael C. Giddings (University of North Carolina, Chapel Hill) et al. and released in 2003. Giddings developed the algorithms for GFS from earlier ideas. Two papers published in 1993 described techniques for identifying proteins in sequence databases. These methods determined the masses of peptides using mass spectrometry and then used those masses to search protein databases to identify the proteins. [3] [4] In 1999 a more complex program called Mascot was released that integrated three types of protein database search: peptide molecular weights, tandem mass spectrometry data from one or more peptides, and mass data combined with amino acid sequence. [5] The drawback of this widely used program is that it cannot detect alternative splice sites that are not currently annotated, and it is usually unable to find proteins that have not been annotated. Giddings built on these sources to create GFS, which compares peptide mass data with entire genomes to identify the proteins. Giddings's system can therefore find features that have not yet been annotated, such as undocumented genes and undocumented alternative splice sites.

Research examples

In 2012 research was published in which genes and proteins were found in a model organism that could not have been found without GFS because they had not been previously annotated. The planarian Schmidtea mediterranea has been used in research for over 100 years. This planarian is capable of regenerating missing body parts and is therefore emerging as a potential model organism for stem cell research. Planarians are covered in mucus, which aids in locomotion, protects them from predation, and supports their immune system. The genome of Schmidtea mediterranea is sequenced but mostly unannotated, making it a prime candidate for genome-based peptide fingerprint scanning. When the proteins were analyzed with GFS, 1,604 proteins were identified, most of which had not been annotated before. The researchers were also able to identify the mucous subproteome (all the proteins associated with mucus production). They found that this subproteome was conserved in the parasitic flatworm Schistosoma mansoni. The mucous subproteome is so conserved that 119 of its proteins have orthologs in humans. Because of this similarity, the planarian can now be used as a model to study mucous protein function in humans. This is relevant to infections and diseases related to mucous aberrancies, such as cystic fibrosis, asthma, and other lung diseases. [6]

In February 2013, a proteogenomic mapping study used ENCODE cell line data to identify translated regions in the human genome. The researchers applied peptide fingerprint scanning and Mascot to the protein data to find regions of the human genome that had not previously been annotated as translated. This search against the whole genome revealed that approximately 4% of the unique peptides found lay outside previously annotated regions. The whole-genome search also yielded 15% more hits than a protein database search (such as Mascot) alone. GFS can therefore serve as a complementary method for annotation, because it can find new genes or splice sites that have not been annotated before. However, the whole-genome approach used by GFS can be less sensitive than programs that look only at annotated regions. [7]

References

  1. Giddings, M. C.; Shah, A. A.; Gesteland, R.; Moore, B. (2003). "Genome-based peptide fingerprint scanning". PNAS. 100 (1): 20–25. doi:10.1073/pnas.0136893100. PMC 140871. PMID 12518051.
  2. Shinoda, Kosaku; Yachie, Nozomu; Masuda, Takeshi; Sugiyama, Naoyuki; Sugimoto, Masahiro; Soga, Tomoyoshi; Tomita, Masaru (29 October 2006). "HybGFS: a hybrid method for genome-fingerprint scanning". BMC Bioinformatics. 7: 479. doi:10.1186/1471-2105-7-479. PMC 1643838. PMID 17069662.
  3. Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. (1 June 1993). "Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases". PNAS. 90 (11): 5011–5015. Bibcode:1993PNAS...90.5011H. doi:10.1073/pnas.90.11.5011. PMC 46643. PMID 8506346.
  4. Mann, Matthias; Højrup, Peter; Roepstorff, Peter (June 1993). "Use of mass spectrometric molecular weight information to identify proteins in sequence databases". Biological Mass Spectrometry. 22 (6): 338–345. doi:10.1002/bms.1200220605. PMID 8329463.
  5. Perkins, David N.; Pappin, Darryl J. C.; Creasy, David M.; Cottrell, John S. (1 December 1999). "Probability-based protein identification by searching sequence databases using mass spectrometry data". Electrophoresis. 20 (18): 3551–3567. doi:10.1002/(sici)1522-2683(19991201)20:18<3551::aid-elps3551>3.0.co;2-2. PMID 10612281.
  6. Bocchinfuso, Donald G. (September 2012). "Proteomic Profiling of the Planarian Schmidtea mediterranea and Its Mucous Reveals Similarities with Human Secretions and Those Predicted for Parasitic Flatworms". Molecular & Cellular Proteomics. 11 (9): 681–91. doi:10.1074/mcp.M112.019026. PMC 3434776. PMID 22653920.
  7. Khatun, Jainab (February 2013). "Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions". BMC Genomics. 14: 141. doi:10.1186/1471-2164-14-141. PMC 3607840. PMID 23448259.