Proteogenomics is a field of biological research that combines proteomics, genomics, and transcriptomics to aid in the discovery and identification of peptides. It identifies new peptides by comparing MS/MS spectra against a protein database derived from genomic and transcriptomic information. The term often refers to studies that use proteomic information, typically derived from mass spectrometry, to improve gene annotations. The combined use of proteomics and genomics data, alongside advances in the availability and power of spectrographic and chromatographic technology, led to the emergence of proteogenomics as its own field in 2004.
Proteomics studies an organism's proteins in the same way that genomics studies the genetic code of entire organisms, while transcriptomics deals with RNA sequencing and transcripts. While all three fields might use forms of mass spectrometry and chromatography to identify and study the functions of DNA, RNA, and proteins, proteomics relies on the assumption that current gene models are correct and that all relevant protein sequences can be found in a reference database such as the Proteomics Identifications Database. Proteogenomics reduces this reliance on existing, limited genetic models by combining datasets from multiple fields to produce a database of proteins or genetic markers. In addition, novel protein sequences that arise from mutations often cannot be accounted for in traditional proteomic databases, but can be predicted and studied using a synthesis of genomic and transcriptomic data.
The resulting research has applications in improving gene annotations, studying mutations, and understanding the effects of genetic manipulation.
More recently, the joint profiling of surface proteins and mRNA transcripts from single cells by methods such as CITE-Seq and ESCAPE [1] has been referred to as single-cell proteogenomics, [2] [3] [4] although the goals of these studies are not related to peptide identification. Since 2019, these methods have more commonly been referred to as multimodal omics or multi-omics. [5]
Proteogenomics emerged as an independent field in 2004, based on the integration of technological advancements in next-generation sequencing genomics and mass spectrometry proteomics. [6] The term itself came into use that year, with the publication of a paper by George Church's research group describing a proteogenomic mapping technique that used proteomics data to better annotate the genome of the bacterium Mycoplasma pneumoniae. Using a modern protein database, the lab mapped peptides detected in a whole cell onto a genetic scaffold using tandem mass spectrometry, then used the generated "hits" to create a "proteogenomic map" based on traditional genetic signals. The resulting map proved highly accurate, with over 81% of predicted genomic reading frames being detected in the bacterial cells studied. In addition, the lab discovered several new frames not predicted via purely genetic methods, as well as some evidence that several predictions based on genetic models could be false, demonstrating the accuracy and cost-effectiveness of the hybrid technique. [7] [8]
The field expanded over the next two decades, initially using proteomics data to help refine genetic models via protein databases. [6] In the 2020s, one of the most common techniques for identifying peptides uses tandem mass spectrometry. This technique, which originated with Eng and Yates in 1994, compares theoretical peptide fragment spectra with an experimentally derived peptide spectrum and outputs the most likely matches. [7] However, in the absence of an established peptide database, proteogenomics compares the experimental spectrum to a genomic database instead, and the results can then be used for genome annotation, as described in George Church's work. [3] The latter technique has become more widely used over the last decade, in large part due to the increasing affordability and speed of genomic sequencing techniques coupled with the increasing sensitivity of mass spectrometry-based proteomics. [6]
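The spectrum-matching idea described above can be illustrated with a minimal Python sketch. It is not the scoring used by Eng and Yates; it swaps in a simple shared-peak count, and the function names, the 0.02 m/z tolerance, and the candidate peptides are all assumptions made for illustration. Only singly charged b- and y-ions are modeled.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mzs(peptide):
    """Return sorted m/z values of singly charged b- and y-ions."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        b_ion = sum(masses[:i]) + PROTON          # N-terminal fragment
        y_ion = sum(masses[i:]) + WATER + PROTON  # C-terminal fragment
        ions.extend([b_ion, y_ion])
    return sorted(ions)


def shared_peak_count(observed, peptide, tol=0.02):
    """Count theoretical fragments that have an observed peak within tol."""
    theoretical = fragment_mzs(peptide)
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


def best_match(observed, candidates, tol=0.02):
    """Pick the candidate peptide whose theoretical spectrum fits best."""
    return max(candidates, key=lambda p: shared_peak_count(observed, p, tol))


# Hypothetical example: an idealized "experimental" spectrum of PEPTIDE.
observed = fragment_mzs("PEPTIDE")
candidates = ["GASTRIC", "PEPTIDK", "PEPTIDE"]
print(best_match(observed, candidates))
```

In a proteogenomic search, the candidate list would come from a database derived from genomic or transcriptomic data rather than from a curated protein database; the scoring step itself is unchanged.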
The main idea behind the proteogenomic approach is to identify peptides by comparing MS/MS data to protein databases that contain predicted protein sequences. [9] The protein database is generated in a variety of ways through the utilization of genomic and transcriptomic data. Below are some of the ways in which protein databases are generated:
Six-frame translations can be utilized to generate a database that predicts protein sequences. The limitation of this method is that databases will be very large due to the number of sequences that are generated, some of which do not exist in nature. [10]
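As an illustration of six-frame translation, the following Python sketch translates a DNA string in the three forward frames and the three frames of its reverse complement. It assumes the standard genetic code; the function names and frame labels are hypothetical, and a practical pipeline would also split frames at stop codons and filter short fragments.

```python
# Standard genetic code, indexed by the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; a trailing partial codon is ignored.

    Codons containing characters other than A, C, G, T become 'X'.
    """
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Every input sequence yields six candidate translations, most of which do not correspond to real proteins, which is why databases built this way grow so large.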
In the gene-prediction method, a protein database is generated by gene-prediction algorithms that identify protein-coding regions. Like a database generated through six-frame translation, the resulting database can be very large. [10]
Six-frame translations can also be applied to expressed sequence tag (EST) data to generate protein databases. EST data provide transcription information that can aid in the creation of the database. The database can be very large and has the disadvantage of containing multiple copies of a given sequence; however, this problem can be circumvented by using computational strategies to compress the generated protein sequences. [10]
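The source does not specify which compression strategy is used. One simple possibility, sketched here in Python under that caveat, is to collapse identical translated sequences into a single entry that remembers every identifier it came from. The function names and the EST identifiers are hypothetical.

```python
def collapse_redundant(entries):
    """Merge entries with identical sequences.

    entries: iterable of (identifier, sequence) pairs.
    Returns a dict mapping each distinct sequence to its identifiers,
    in order of first appearance.
    """
    merged = {}
    for identifier, sequence in entries:
        merged.setdefault(sequence, []).append(identifier)
    return merged


def to_fasta(merged):
    """Write one FASTA record per distinct sequence, joining identifiers."""
    lines = []
    for sequence, identifiers in merged.items():
        lines.append(">" + "|".join(identifiers))
        lines.append(sequence)
    return "\n".join(lines)


# Hypothetical translated EST fragments; two of them are identical.
entries = [
    ("est1_frame1", "MAGK"),
    ("est2_frame3", "MAGK"),
    ("est3_frame2", "LLPR"),
]
merged = collapse_redundant(entries)
print(to_fasta(merged))
```

Exact-duplicate removal is the most conservative option; it shrinks the search space without discarding any distinct sequence.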
Protein databases can also be created by using RNA sequencing data, annotated RNA transcripts, and variant protein sequences. Also, there are other more specialized protein databases that can be made to appropriately identify the peptide of interest. [10]
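As a rough illustration of how a variant protein sequence might be added to such a database, the Python sketch below applies a single amino acid substitution, written in the common "G12D" style, to a reference sequence. The function name, the notation parser, and the example sequence are assumptions for illustration; real pipelines derive variants from sequencing data and handle many more variant types.

```python
import re


def apply_substitution(protein, variant):
    """Return the protein with one substitution applied.

    variant uses the form <reference residue><1-based position><new residue>,
    for example "G12D". A ValueError is raised if the notation is malformed,
    the position is out of range, or the reference residue does not match.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant!r}")
    ref, position, alt = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= position <= len(protein):
        raise ValueError(f"position {position} is outside the sequence")
    if protein[position - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {protein[position - 1]}"
        )
    return protein[:position - 1] + alt + protein[position:]


# Short illustrative reference sequence (hypothetical database entry).
reference = "MTEYKLVVVGAGGVGKS"
print(apply_substitution(reference, "G12D"))
```

Adding the variant sequence alongside the reference lets a search engine match spectra from the mutated peptide, which a reference-only database would miss.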
Another method in the identification of proteins through proteogenomics is comparative proteogenomics. Comparative proteogenomics compares proteomic data from multiple related species concurrently and exploits the homology between their proteins to improve annotations with higher statistical confidence. [11] [12]
Proteogenomics can be applied in different ways. One application is the improvement of gene annotations in various organisms. Gene annotation involves discovering genes and their functions. [13] Proteogenomics has become especially useful in the discovery and improvement of gene annotations in prokaryotic organisms. For example, various microorganisms have had their genomic annotation studied through the proteogenomic approach, including Escherichia coli, Mycobacterium, and multiple species of Shewanella bacteria. [14]
Besides improving gene annotations, proteogenomic studies can also provide valuable information about the presence of programmed frameshifts, N-terminal methionine excision, signal peptides, proteolysis and other post-translational modifications. [15] [11]
Proteogenomics has potential applications in medicine, especially in oncology research. Cancer arises through genetic and epigenetic alterations such as methylation changes, translocations, and somatic mutations. Research has shown that both genomic and proteomic information are needed to understand the molecular variations that lead to cancer. [16] [17] Proteogenomics has aided in this through the identification of protein sequences that may have functional roles in cancer. [18] For example, a proteogenomic study of colon cancer resulted in the discovery of potential targets for cancer treatment. [16] Proteogenomics has also led to personalized cancer-targeting immunotherapies, in which antibody epitopes for cancer antigens are predicted using proteogenomics to create medicines that act on the patient's specific tumor. [19] In addition to treatment, proteogenomics may provide insight into cancer diagnosis. In studies involving colon and rectal cancer, proteogenomics was used to identify somatic mutations, and identifying such mutations in patients could be used to diagnose cancer. Beyond direct applications in cancer treatment and diagnosis, a proteogenomic approach can be used to study proteins that result in resistance to chemotherapy. [17]
Proteogenomics may offer methods of peptide identification that avoid the incomplete or inaccurate protein databases faced by proteomics; however, the proteogenomic approach has challenges of its own. [10] One of the biggest is the sheer size of the protein databases generated. Statistically, a larger protein database is more likely to produce incorrect matches between database entries and the MS/MS data, which can hinder the identification of new peptides. False positives are a related issue: in extremely large protein databases, mismatched data can lead to incorrect identifications. Another issue is the incorrect matching of MS/MS spectra to protein sequence data that corresponds to a similar peptide instead of the actual peptide. In some cases a peptide maps to multiple gene sites, which leaves the data open to different interpretations. Despite these challenges, there are ways to reduce many of the errors that occur. For example, when dealing with a very large protein database, one could compare the identified novel peptide sequences to all of the sequences within the database and then compare the post-translational modifications. It can then be determined whether the two sequences represent the same peptide or two different peptides. [10]
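The effect of database size on incorrect matches can be made concrete with a small calculation. The Python sketch below assumes, purely for illustration, that each unrelated database entry has the same small, independent probability of matching a spectrum by chance; under that assumption, the chance of at least one spurious match among n candidates is 1 - (1 - p)^n. The per-candidate probability and the database sizes are hypothetical.

```python
def chance_of_spurious_match(p_single, n_candidates):
    """Probability that at least one of n independent candidates
    matches a spectrum by chance, given per-candidate probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_candidates


# Hypothetical per-candidate chance of a random match.
p = 1e-6
for size in (10_000, 1_000_000, 10_000_000):
    print(size, round(chance_of_spurious_match(p, size), 4))
```

Real search engines do not satisfy the independence assumption exactly, but the trend is the same: the larger the search space, the stricter the score threshold needed to keep false positives under control.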
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.
Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.
Proteomics is the large-scale study of proteins. Proteins are vital parts of living organisms, with many functions such as the formation of structural fibers of muscle tissue, enzymatic digestion of food, or synthesis and replication of DNA. In addition, other kinds of proteins include antibodies that protect an organism from infection, and hormones that send important signals throughout the body.
The branches of science known informally as omics are various disciplines in biology whose names end in the suffix -omics, such as genomics, proteomics, metabolomics, metagenomics, phenomics and transcriptomics. Omics aims at the collective characterization and quantification of pools of biological molecules that translate into the structure, function, and dynamics of an organism or organisms.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
Functional genomics is a field of molecular biology that attempts to describe gene functions and interactions. Functional genomics makes use of the vast data generated by genomic and transcriptomic projects. It focuses on dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.
Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide. This may serve to identify the protein or characterize its post-translational modifications. Typically, partial sequencing of a protein provides sufficient information to identify it with reference to databases of protein sequences derived from the conceptual translation of genes.
The completion of human genome sequencing in the early 2000s was a turning point in genomics research. Scientists have since conducted a series of studies into the activities of genes and the genome as a whole. The human genome contains around 3 billion nucleotide base pairs, and the huge quantity of data created necessitates the development of accessible tools to explore and interpret this information in order to investigate the genetic basis of disease, evolution, and biological processes. The field of genomics has continued to grow, with new sequencing technologies and computational tools making it easier to study the genome.
Genome-based peptide fingerprint scanning (GFS) is a system in bioinformatics analysis that attempts to identify the genomic origin of sample proteins by scanning their peptide-mass fingerprint against the theoretical translation and proteolytic digest of an entire genome. This method is an improvement from previous methods because it compares the peptide fingerprints to an entire genome instead of comparing it to an already annotated genome. This improvement has the potential to improve genome annotation and identify proteins with incorrect or missing annotations.
Mascot is a software search engine that uses mass spectrometry data to identify proteins from peptide sequence databases. Mascot is widely used by research facilities around the world. Mascot uses a probabilistic scoring algorithm for protein identification that was adapted from the MOWSE algorithm. Mascot is freely available to use on the website of Matrix Science. A license is required for in-house use where more features can be incorporated.
Protein mass spectrometry refers to the application of mass spectrometry to the study of proteins. Mass spectrometry is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instrumentations have been developed for its many uses. Its applications include the identification of proteins and their post-translational modifications, the elucidation of protein complexes, their subunits and functional interactions, as well as the global measurement of proteins in proteomics. It can also be used to localize proteins to the various organelles, and determine the interactions between different proteins as well as with membrane lipids.
Shotgun proteomics refers to the use of bottom-up proteomics techniques to identify proteins in complex mixtures using high performance liquid chromatography combined with mass spectrometry. The name is derived from shotgun sequencing of DNA, which is itself named after the rapidly expanding, quasi-random firing pattern of a shotgun. In the most common method of shotgun proteomics, the proteins in the mixture are digested and the resulting peptides are separated by liquid chromatography. Tandem mass spectrometry is then used to identify the peptides.
GeneCards is a database of human genes that provides genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes. It is being developed and maintained by the Crown Human Genome Center at the Weizmann Institute of Science, in collaboration with LifeMap Sciences.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.
In the field of cellular biology, single-cell analysis and subcellular analysis are the study of genomics, transcriptomics, proteomics, metabolomics and cell–cell interactions at the single-cell level. The concept of single-cell analysis originated in the 1970s. Before the discovery of heterogeneity, single-cell analysis mainly referred to the analysis or manipulation of an individual cell in a bulk population of cells under a particular condition using an optical or electron microscope. Today, because of the heterogeneity seen in both eukaryotic and prokaryotic cell populations, analyzing a single cell makes it possible to discover mechanisms not seen when studying a bulk population of cells. Technologies such as fluorescence-activated cell sorting (FACS) allow the precise isolation of selected single cells from complex samples, while high-throughput single-cell partitioning technologies enable the simultaneous molecular analysis of hundreds or thousands of single unsorted cells; this is particularly useful for the analysis of transcriptome variation in genotypically identical cells, allowing the definition of otherwise undetectable cell subtypes. The development of new technologies is increasing our ability to analyze the genome and transcriptome of single cells, as well as to quantify their proteome and metabolome. Mass spectrometry techniques have become important analytical tools for proteomic and metabolomic analysis of single cells. Recent advances have enabled the quantification of thousands of proteins across hundreds of single cells, making new types of analysis possible. In situ sequencing and fluorescence in situ hybridization (FISH) do not require that cells be isolated and are increasingly being used for analysis of tissues.
Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.
Translatomics is the study of all open reading frames (ORFs) that are being actively translated in a cell or organism. This collection of ORFs is called the translatome. Characterizing a cell's translatome can give insight into the array of biological pathways that are active in the cell. According to the central dogma of molecular biology, the DNA in a cell is transcribed to produce RNA, which is then translated to produce a protein. Thousands of proteins are encoded in an organism's genome, and the proteins present in a cell cooperatively carry out many functions to support the life of the cell. Under various conditions, such as during stress or specific timepoints in development, the cell may require different biological pathways to be active, and therefore require a different collection of proteins. Depending on intrinsic and environmental conditions, the collection of proteins being made at one time varies. Translatomic techniques can be used to take a "snapshot" of this collection of actively translating ORFs, which can give information about which biological pathways the cell is activating under the present conditions.
Precision diagnostics is a branch of precision medicine that involves precisely managing a patient's healthcare model and diagnosing specific diseases based on customized omics data analytics.