Protein sequencing

Using a Beckman-Spinco Protein-Peptide Sequencer, 1970

Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide. This may serve to identify the protein or characterize its post-translational modifications. Typically, partial sequencing of a protein provides sufficient information (one or more sequence tags) to identify it with reference to databases of protein sequences derived from the conceptual translation of genes.

The two major direct methods of protein sequencing are mass spectrometry and Edman degradation using a protein sequenator (sequencer). Mass spectrometry methods are now the most widely used for protein sequencing and identification, but Edman degradation remains a valuable tool for characterizing a protein's N-terminus.

Determining amino acid composition

Protein sequence interpretation: a scheme for a new protein to be engineered in yeast

It is often desirable to know the unordered amino acid composition of a protein before attempting to determine the ordered sequence, because this knowledge can help to detect errors in the sequencing process or to resolve ambiguous results. Knowledge of the frequency of certain amino acids may also guide the choice of protease for digesting the protein. The misincorporation of low levels of non-standard amino acids (e.g. norleucine) into proteins may also be determined. [1] A generalized method for determining amino acid frequency, often referred to as amino acid analysis, [2] is as follows:

  1. Hydrolyse a known quantity of protein into its constituent amino acids.
  2. Separate and quantify the amino acids in some way.
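
As a rough illustration of how a measured composition can flag sequencing errors, the sketch below compares the residue counts of a candidate sequence with counts from amino acid analysis. The function names and the example values are hypothetical. The pooling of Asn with Asp and of Gln with Glu is an assumption of this sketch, reflecting the fact that acid hydrolysis converts the amides to the corresponding acids.

```python
from collections import Counter

# Acid hydrolysis converts Asn to Asp and Gln to Glu, so each pair is pooled
# here as "Asx" and "Glx" (a simplifying assumption of this sketch).
POOLED = {"N": "Asx", "D": "Asx", "Q": "Glx", "E": "Glx"}


def composition(sequence):
    """Count residues in a one-letter sequence, pooling Asn/Asp and Gln/Glu."""
    return Counter(POOLED.get(aa, aa) for aa in sequence.upper())


def composition_mismatches(sequence, measured):
    """Return {residue: (expected, measured)} wherever the counts disagree."""
    expected = composition(sequence)
    residues = set(expected) | set(measured)
    return {
        r: (expected.get(r, 0), measured.get(r, 0))
        for r in sorted(residues)
        if expected.get(r, 0) != measured.get(r, 0)
    }


# Hypothetical example: the analysis found two Gly, the candidate has only one.
measured_counts = {"G": 2, "A": 1, "K": 1, "Asx": 1}
print(composition_mismatches("GAKD", measured_counts))
```

An empty result means the candidate sequence is consistent with the measured composition; it does not prove that the residue order is correct.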

Hydrolysis

Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However, these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan, glutamine, and cysteine) are degraded. To circumvent this problem, Biochemistry Online suggests heating separate samples for different times, analysing each resulting solution, and extrapolating back to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of amide hydrolysis.
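
The extrapolation to zero hydrolysis time can be sketched as a straight-line fit of recovered amount against time, with the intercept taken as the estimate of the undegraded quantity. This is a minimal sketch that assumes approximately linear loss over the sampled period; real degradation may be better described by first-order kinetics, in which case the logarithm of the amount would be fitted instead. The data values are hypothetical.

```python
def extrapolate_to_zero(times_h, amounts):
    """Least-squares line through (time, amount); returns (intercept, slope).

    The intercept estimates the amount present before any degradation.
    """
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_a = sum(amounts) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(times_h, amounts))
    slope = sxy / sxx
    intercept = mean_a - slope * mean_t
    return intercept, slope


# Hypothetical serine recoveries (nmol) after 24, 48 and 72 hours of hydrolysis.
zero_time_amount, loss_per_hour = extrapolate_to_zero([24, 48, 72], [9.0, 8.0, 7.0])
print(zero_time_amount, loss_per_hour)
```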

Separation and quantitation

The amino acids can be separated by ion-exchange chromatography then derivatized to facilitate their detection. More commonly, the amino acids are derivatized then resolved by reversed phase HPLC.

An example of ion-exchange chromatography is given by the NTRC, using sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through the column. Amino acids are eluted when the pH reaches their respective isoelectric points. Once the amino acids have been separated, their respective quantities are determined by adding a reagent that will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be used for this; it gives a yellow colour when reacted with proline, and a vivid purple with other amino acids. The concentration of amino acid is proportional to the absorbance of the resulting solution. With very small quantities, down to 10 pmol, fluorescent derivatives can be formed using reagents such as ortho-phthalaldehyde (OPA) or fluorescamine.

Pre-column derivatization may use the Edman reagent to produce a derivative that is detected by UV light. Greater sensitivity is achieved using a reagent that generates a fluorescent derivative. The derivatized amino acids are subjected to reversed phase chromatography, typically using a C8 or C18 silica column and an optimised elution gradient. The eluting amino acids are detected using a UV or fluorescence detector and the peak areas compared with those for derivatised standards in order to quantify each amino acid in the sample.
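
The comparison of peak areas with standards can be illustrated with a single-point calibration, in which each amino acid's detector response is taken from a standard of known amount. This is only a sketch: it assumes a linear detector response through the origin, and the function names and numbers are hypothetical. A real analysis would normally use multi-point calibration curves and an internal standard.

```python
def quantify(sample_areas, standard_areas, standard_pmol):
    """Estimate pmol of each amino acid from peak areas (single-point calibration)."""
    amounts = {}
    for amino_acid, area in sample_areas.items():
        response = standard_areas[amino_acid] / standard_pmol  # area units per pmol
        amounts[amino_acid] = area / response
    return amounts


def mole_ratios(amounts, reference):
    """Express each amount relative to a chosen reference amino acid."""
    return {aa: amount / amounts[reference] for aa, amount in amounts.items()}


# Hypothetical peak areas; every standard was injected at 100 pmol.
sample = {"Gly": 500.0, "Ala": 250.0}
standards = {"Gly": 1000.0, "Ala": 1000.0}
amounts = quantify(sample, standards, 100.0)
print(amounts)
print(mole_ratios(amounts, "Ala"))
```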

N-terminal amino acid analysis

Sanger's method of peptide end-group analysis: (A) derivatization of the N-terminal end with Sanger's reagent (DNFB); (B) total acid hydrolysis of the dinitrophenyl peptide

Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid the ordering of individual peptide fragments' sequences into a whole chain, and because the first round of Edman degradation is often contaminated by impurities and therefore does not give an accurate determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis follows:

  1. React the peptide with a reagent that will selectively label the terminal amino acid.
  2. Hydrolyse the protein.
  3. Determine the amino acid by chromatography and comparison with standards.

Many different reagents can be used to label terminal amino acids. They all react with amine groups and will therefore also bind to amine groups in the side chains of amino acids such as lysine; for this reason, chromatograms must be interpreted carefully to ensure that the right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman degradation, can also be used. The same considerations apply here as in the determination of amino acid composition, except that no stain is needed, because the reagents produce coloured derivatives and only qualitative analysis is required. The amino acid therefore does not have to be eluted from the chromatography column, just compared with a standard. In addition, since any amine groups will have reacted with the labelling reagent, ion-exchange chromatography cannot be used, and thin-layer chromatography or high-pressure liquid chromatography should be used instead.

C-terminal amino acid analysis

The number of methods available for C-terminal amino acid analysis is much smaller than the number available for N-terminal analysis. The most common method is to add carboxypeptidases to a solution of the protein, take samples at regular intervals, and determine the terminal amino acid by analysing a plot of amino acid concentrations against time. This method is particularly useful for polypeptides and proteins with blocked N-termini. C-terminal sequencing also helps to verify the primary structures of proteins predicted from DNA sequences and to detect any post-translational processing of gene products from known codon sequences.

Edman degradation

The Edman degradation is a key reaction for protein sequencing because it allows the ordered amino acid sequence of a protein to be determined. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are elaborated on subsequently.

  1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol. A protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
  2. Separate and purify the individual chains of the protein complex, if there is more than one.
  3. Determine the amino acid composition of each chain.
  4. Determine the terminal amino acids of each chain.
  5. Break each chain into fragments under 50 amino acids long.
  6. Separate and purify the fragments.
  7. Determine the sequence of each fragment.
  8. Repeat with a different pattern of cleavage.
  9. Construct the sequence of the overall protein.

Digestion into peptide fragments

Peptides longer than about 50–70 amino acids cannot be sequenced reliably by the Edman degradation. Because of this, long protein chains need to be broken up into small fragments that can then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns, and the overlap between fragments can be used to construct an overall sequence.
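
The use of overlaps can be sketched with a toy example: one set of fragments is ordered so that every fragment from a second digest appears as a contiguous stretch of the assembled chain. The brute-force search below is only practical for a handful of fragments, and the example sequence is hypothetical. It assumes idealized cleavage by trypsin (after Lys or Arg) and by cyanogen bromide (after Met), and it writes the C-terminal residue of each cyanogen bromide fragment as M, although the reaction actually converts that methionine to homoserine lactone.

```python
from itertools import permutations


def assemble(fragments_a, fragments_b):
    """Return each ordering of fragments_a (joined into one chain) that
    contains every fragment of fragments_b as a contiguous substring."""
    solutions = []
    for order in permutations(fragments_a):
        candidate = "".join(order)
        if all(fragment in candidate for fragment in fragments_b):
            if candidate not in solutions:
                solutions.append(candidate)
    return solutions


# Hypothetical sequenced fragments from two digests of the same chain.
tryptic = ["TFMSK", "DW", "AVGK", "MLER"]        # cleaved after K or R
cyanogen_bromide = ["SKDW", "AVGKM", "LERTFM"]   # cleaved after M
print(assemble(tryptic, cyanogen_bromide))
```

If more than one ordering is returned, the two digests do not determine the sequence uniquely and a third cleavage pattern would be needed.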

Reaction

The peptide to be sequenced is adsorbed onto a solid surface. One common substrate is glass fibre coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts with the amine group of the N-terminal amino acid.

The terminal amino acid can then be selectively detached by the addition of anhydrous acid. The derivative then isomerises to give a substituted phenylthiohydantoin, which can be washed off and identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%, which allows about 50 amino acids to be reliably determined.
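
The link between per-cycle efficiency and usable read length can be made concrete with a simple model in which each cycle succeeds independently with the same efficiency, so the fraction of chains still in phase after n cycles is the efficiency raised to the power n. This is a sketch under that assumption; real sequencers also accumulate background signal, which the model ignores. The function names are illustrative.

```python
import math


def cumulative_yield(efficiency, cycles):
    """Fraction of peptide chains still in phase after the given number of cycles."""
    return efficiency ** cycles


def max_cycles(efficiency, min_yield):
    """Largest number of cycles for which the cumulative yield stays at or above min_yield."""
    return math.floor(math.log(min_yield) / math.log(efficiency))


# With 98% efficiency per cycle, roughly 36% of chains remain in phase at cycle 50.
print(round(cumulative_yield(0.98, 50), 3))
print(max_cycles(0.98, 0.5))
```

Under this model, raising the per-cycle efficiency by even a fraction of a percent extends the readable length noticeably, which is why repetitive yield is the main figure of merit for a sequencer.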

A Beckman-Coulter Porton LF3000G protein sequencing machine

Protein sequencer

A protein sequenator [3] is a machine that performs Edman degradation in an automated manner. A sample of the protein or peptide is immobilized in the reaction vessel of the protein sequenator and the Edman degradation is performed. Each cycle releases and derivatises one amino acid from the protein or peptide's N-terminus and the released amino-acid derivative is then identified by HPLC. The sequencing process is done repetitively for the whole polypeptide until the entire measurable sequence is established or for a pre-determined number of cycles.

Identification by mass spectrometry

Protein identification is the process of assigning a name to a protein of interest (POI), based on its amino-acid sequence. Typically, only part of the protein’s sequence needs to be determined experimentally in order to identify the protein with reference to databases of protein sequences deduced from the DNA sequences of their genes. Further protein characterization may include confirmation of the actual N- and C-termini of the POI, determination of sequence variants and identification of any post-translational modifications present.

Proteolytic digests

A general scheme for protein identification is described. [4] [5]

  1. The POI is isolated, typically by SDS-PAGE or chromatography.
  2. The isolated POI may be chemically modified to stabilise cysteine residues (e.g. S-amidomethylation or S-carboxymethylation).
  3. The POI is digested with a specific protease to generate peptides. Trypsin, which cleaves selectively on the C-terminal side of lysine or arginine residues, is the most commonly used protease. Its advantages include i) the frequency of Lys and Arg residues in proteins, ii) the high specificity of the enzyme, iii) the stability of the enzyme and iv) the suitability of tryptic peptides for mass spectrometry.
  4. The peptides may be desalted to remove ionizable contaminants and subjected to MALDI-TOF mass spectrometry. Direct measurement of the masses of the peptides may provide sufficient information to identify the protein (see Peptide mass fingerprinting) but further fragmentation of the peptides inside the mass spectrometer is often used to gain information about the peptides’ sequences. Alternatively, peptides may be desalted and separated by reversed phase HPLC and introduced into a mass spectrometer via an ESI source. LC-ESI-MS may provide more information than MALDI-MS for protein identification but uses more instrument time.
  5. Depending on the type of mass spectrometer, fragmentation of peptide ions may occur via a variety of mechanisms such as collision-induced dissociation (CID) or post-source decay (PSD). In each case, the pattern of fragment ions of a peptide provides information about its sequence.
  6. Information including the measured mass of the putative peptide ions and those of their fragment ions is then matched against calculated mass values from the conceptual (in-silico) proteolysis and fragmentation of databases of protein sequences. A match is considered successful if its score exceeds a threshold based on the analysis parameters. Even if the actual protein is not represented in the database, error-tolerant matching allows for the putative identification of a protein based on similarity to homologous proteins. A variety of software packages are available to perform this analysis.
  7. Software packages usually generate a report showing the identity (accession code) of each identified protein and its matching score, and provide a measure of the relative strength of the matching where multiple proteins are identified.
  8. A diagram of the matched peptides on the sequence of the identified protein is often used to show the sequence coverage (% of the protein detected as peptides). Where the POI is thought to be significantly smaller than the matched protein, the diagram may suggest whether the POI is an N- or C-terminal fragment of the identified protein.
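
The matching and coverage steps above can be sketched in a few lines. The code performs an idealized in-silico tryptic digest (cleavage after every Lys or Arg, with no missed cleavages and no proline exception), computes monoisotopic peptide masses, and reports which peptides match a list of observed masses within a tolerance. It is a naive illustration, not the probabilistic scoring used by real search engines; the function names, tolerance and example sequence are hypothetical.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Split after every K or R; returns (start_index, peptide) pairs."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            peptides.append((start, sequence[start:i + 1]))
            start = i + 1
    if start < len(sequence):
        peptides.append((start, sequence[start:]))
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_peptides(sequence, observed_masses, tolerance=0.02):
    """Return the tryptic peptides whose calculated mass matches an observed mass."""
    return [
        (start, pep)
        for start, pep in tryptic_digest(sequence)
        if any(abs(peptide_mass(pep) - m) <= tolerance for m in observed_masses)
    ]


def sequence_coverage(sequence, matched):
    """Percentage of residues covered by the matched peptides."""
    return 100.0 * sum(len(pep) for _, pep in matched) / len(sequence)


# Hypothetical database entry and simulated observed masses for two of its peptides.
protein = "AVGKMLERTFMSKDW"
observed = [peptide_mass("AVGK") + 0.005, peptide_mass("TFMSK") - 0.005]
hits = match_peptides(protein, observed)
print(hits, sequence_coverage(protein, hits))
```

Because each hit carries its start index, the same output can be used to draw the coverage diagram described in the last step.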

De novo sequencing

The pattern of fragmentation of a peptide allows for direct determination of its sequence by de novo sequencing. This sequence may be used to match databases of protein sequences or to investigate post-translational or chemical modifications. It may provide additional evidence for protein identifications performed as above.
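
A minimal sketch of the idea, assuming a complete and noise-free series of singly charged b-ions: the mass difference between successive fragment ions equals the mass of one residue, so the sequence can be read directly from the ladder. Real spectra have missing peaks, mixed ion series and noise, which this example ignores; the function names and tolerance are illustrative. Leucine and isoleucine have identical masses and are reported together.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def b_ion_ladder(peptide):
    """Simulate m/z values of singly charged b-type prefix ions for a peptide."""
    ladder, total = [], PROTON
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        ladder.append(total)
    return ladder


def read_ladder(peaks, tolerance=0.01):
    """Assign a residue to each gap between successive peaks.

    Ambiguous gaps are reported as 'I/L'; unexplained gaps as '?'.
    """
    calls, previous = [], PROTON
    for peak in sorted(peaks):
        delta = peak - previous
        candidates = sorted(
            aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tolerance
        )
        calls.append("/".join(candidates) if candidates else "?")
        previous = peak
    return calls


print(read_ladder(b_ion_ladder("SALKE")))
```

The tolerance matters: lysine and glutamine differ by only about 0.036 Da, so an instrument with lower mass accuracy would report that position as ambiguous as well.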

N- and C-termini

The peptides matched during protein identification do not necessarily include the N- or C-termini predicted for the matched protein. This may result from the N- or C-terminal peptides being difficult to identify by MS (e.g. being either too short or too long), being post-translationally modified (e.g. N-terminal acetylation) or genuinely differing from the prediction. Post-translational modifications or truncated termini may be identified by closer examination of the data (i.e. de novo sequencing). A repeat digest using a protease of different specificity may also be useful.

Post-translational modifications

Whilst detailed comparison of the MS data with predictions based on the known protein sequence may be used to define post-translational modifications, targeted approaches to data acquisition may also be used. For instance, specific enrichment of phosphopeptides may assist in identifying phosphorylation sites in a protein. Alternative methods of peptide fragmentation in the mass spectrometer, such as ETD or ECD, may give complementary sequence information.

Whole-mass determination

The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water molecule, adjusted for any post-translational modifications. Although proteins ionize less well than the peptides derived from them, a protein in solution can often be subjected to ESI-MS and its mass measured to an accuracy of 1 part in 20,000 or better. This is often sufficient to confirm the termini (by showing that the protein’s measured mass matches that predicted from its sequence) and to infer the presence or absence of many post-translational modifications.
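
The calculation can be written out directly. The sketch below sums residue masses, adds one water, applies mass shifts for any listed modifications, and checks whether a measured mass agrees with the prediction to within 1 part in 20,000. Monoisotopic masses are used to keep the table short; for intact proteins, average masses are normally more appropriate, so the table should be treated as a placeholder. The function names and the small modification table are illustrative.

```python
# Monoisotopic residue masses in daltons (swap in average masses for large proteins).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
# Mass shifts for two common modifications (monoisotopic).
MODIFICATION_MASS = {"phospho": 79.96633, "acetyl": 42.01057}


def protein_mass(sequence, modifications=()):
    """Sum of residue masses plus one water, adjusted for listed modifications."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    mass += sum(MODIFICATION_MASS[name] for name in modifications)
    return mass


def within_accuracy(measured, predicted, parts=20000):
    """True if measured and predicted masses agree to within 1 part in `parts`."""
    return abs(measured - predicted) <= predicted / parts


unmodified = protein_mass("AVGKMLERTFMSKDW")
acetylated = protein_mass("AVGKMLERTFMSKDW", ["acetyl"])
print(unmodified, acetylated, within_accuracy(acetylated, unmodified))
```

A measured mass that matches the acetylated prediction but not the unmodified one would, under these assumptions, point to a single acetylation without locating it.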

Limitations

Proteolysis does not always yield a set of readily analyzable peptides covering the entire sequence of the POI. The fragmentation of peptides in the mass spectrometer often does not yield ions corresponding to cleavage at each peptide bond. Thus, the deduced sequence for each peptide is not necessarily complete. The standard methods of fragmentation do not distinguish between leucine and isoleucine residues since they are isomeric.

Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminus has been chemically modified (e.g. by acetylation or formation of pyroglutamic acid). Edman degradation is generally not useful for determining the positions of disulfide bridges. It also requires peptide amounts of 1 picomole or above for discernible results, making it less sensitive than mass spectrometry.

Predicting from DNA/RNA sequences

In biology, proteins are produced by translation of messenger RNA (mRNA) with the protein sequence deriving from the sequence of codons in the mRNA. The mRNA is itself formed by the transcription of genes and may be further modified. These processes are sufficiently understood to use computer algorithms to automate predictions of protein sequences from DNA sequences, such as from whole-genome DNA-sequencing projects, and have led to the generation of large databases of protein sequences such as UniProt. Predicted protein sequences are an important resource for protein identification by mass spectrometry.
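
The core of such a prediction, conceptual translation of a coding sequence, can be sketched with the standard genetic code. This is a deliberately minimal example: it assumes the input is already an intron-free coding sequence read in the correct frame from its first base, and it does not handle ambiguous bases or alternative genetic codes. Gene-finding, splicing and start-site selection, which real pipelines must perform, are outside its scope.

```python
# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(coding_sequence):
    """Translate a DNA or mRNA coding sequence, stopping at the first stop codon."""
    dna = coding_sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))
```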

Historically, short protein sequences (10 to 15 residues) determined by Edman degradation were back-translated into DNA sequences that could be used as probes or primers to isolate molecular clones of the corresponding gene or complementary DNA. The sequence of the cloned DNA was then determined and used to deduce the full amino-acid sequence of the protein.

Bioinformatics tools

Bioinformatics tools exist to assist with interpretation of mass spectra (see de novo peptide sequencing), to compare or analyze protein sequences (see sequence analysis), or search databases using peptide or protein sequences (see BLAST).

Applications to cryptography

The difficulty of protein sequencing was recently proposed as a basis for creating k-time programs, which are programs that run exactly k times before self-destructing. Such programs cannot be built purely in software, because software can be copied an unlimited number of times.

References

  1. Bogosian G, Violand BN, Dorward-King EJ, Workman WE, Jung PE, Kane JF (January 1989). "Biosynthesis and incorporation into protein of norleucine by Escherichia coli". The Journal of Biological Chemistry. 264 (1): 531–9. doi:10.1016/S0021-9258(17)31291-7. PMID 2642478.
  2. Alterman MA, Hunziker P (2 December 2011). Amino Acid Analysis: Methods and Protocols. Humana Press. ISBN 978-1-61779-444-5.
  3. Edman P, Begg G (March 1967). "A protein sequenator". European Journal of Biochemistry. 1 (1): 80–91. doi:10.1111/j.1432-1033.1967.tb00047.x. PMID 6059350.
  4. Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M (2006). "In-gel digestion for mass spectrometric characterization of proteins and proteomes". Nature Protocols. 1 (6): 2856–60. doi:10.1038/nprot.2006.468. PMID 17406544. S2CID 8248224.
  5. Gundry RL, White MY, Murray CI, Kane LA, Fu Q, Stanley BA, Van Eyk JE (October 2009). "Preparation of proteins and peptides for mass spectrometry analysis in a bottom-up proteomics workflow". Current Protocols in Molecular Biology. Chapter 10: Unit10.25. doi:10.1002/0471142727.mb1025s88. ISBN 978-0471142720. PMC 2905857. PMID 19816929.