Protein primary structure

Figure: Diagram of protein structure, using PCNA as an example (PDB: 1AXC).

Protein primary structure is the linear sequence of amino acids in a peptide or protein. [1] By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthesis is most commonly performed by ribosomes in cells. Peptides can also be synthesized in the laboratory. Protein primary structures can be directly sequenced, or inferred from DNA sequences.

Formation

Biological

Amino acids are polymerised via peptide bonds to form a long backbone, with the different amino acid side chains protruding along it. In biological systems, proteins are produced during translation by a cell's ribosomes. Some organisms can also make short peptides by non-ribosomal peptide synthesis, which often uses amino acids other than the standard 20; the resulting peptides may be cyclised, modified and cross-linked.

Chemical

Peptides can be synthesised chemically via a range of laboratory methods. Chemical methods typically synthesise peptides in the opposite order (starting at the C-terminus) to biological protein synthesis (starting at the N-terminus).
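
The difference in direction can be made concrete with a minimal sketch. The function names and the example peptide are hypothetical, and the code only illustrates the order in which residues are added, not any real synthesis protocol.

```python
def ribosomal_order(sequence: str) -> list[str]:
    """Order in which residues are added biologically (N-terminus first).

    `sequence` is written N-terminus to C-terminus, as is conventional.
    """
    return list(sequence)


def chemical_order(sequence: str) -> list[str]:
    """Order in which a typical chemical synthesis adds residues (C-terminus first)."""
    return list(reversed(sequence))


peptide = "MKTAY"  # hypothetical example peptide, written N to C
print(ribosomal_order(peptide))  # starts with the N-terminal residue
print(chemical_order(peptide))   # starts with the C-terminal residue
```

Both functions describe the same primary structure; only the order of assembly differs.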

Notation

Protein sequence is typically notated as a string of letters, listing the amino acids from the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 20 naturally occurring amino acids, as well as mixtures or ambiguous amino acids (similar to nucleic acid notation). [1] [2] [3]

Peptides can be directly sequenced, or inferred from DNA sequences. Large sequence databases now exist that collate known protein sequences.
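
Inferring a protein sequence from DNA can be sketched as follows. This is a minimal illustration that assumes the standard genetic code, a coding strand already in the correct reading frame, and no introns; the function name `translate` and the choice to report unrecognised codons as `X` are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in the
# order TTT, TTC, TTA, TTG, TCT, ... GGG ("*" marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, N-terminus first, stopping at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown residue
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))
```

Real pipelines must also choose the reading frame and strand and handle alternative genetic codes, which this sketch leaves out.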

20 natural amino acid notation

Amino acid       3-letter [4]   1-letter [4]
Alanine          Ala            A
Arginine         Arg            R
Asparagine       Asn            N
Aspartic acid    Asp            D
Cysteine         Cys            C
Glutamic acid    Glu            E
Glutamine        Gln            Q
Glycine          Gly            G
Histidine        His            H
Isoleucine       Ile            I
Leucine          Leu            L
Lysine           Lys            K
Methionine       Met            M
Phenylalanine    Phe            F
Proline          Pro            P
Serine           Ser            S
Threonine        Thr            T
Tryptophan       Trp            W
Tyrosine         Tyr            Y
Valine           Val            V

Ambiguous amino acid notation

Symbol   Description                   Residues represented
X        Any amino acid, or unknown    All
B        Aspartate or asparagine       D, N
Z        Glutamate or glutamine        E, Q
J        Leucine or isoleucine         I, L
Φ        Hydrophobic                   V, I, L, F, W, M
Ω        Aromatic                      F, W, Y, H
Ψ        Aliphatic                     V, I, L, M
π        Small                         P, G, A, S
ζ        Hydrophilic                   S, T, H, N, Q, E, D, K, R, Y
+        Positively charged            K, R, H
-        Negatively charged            D, E
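
The two tables above translate directly into lookup tables. The following sketch converts a three-letter sequence to one-letter code and checks a sequence against a pattern written with the ambiguity symbols. The function names, the `-` separator for three-letter input, and the rule that a pattern must match the whole sequence position by position are assumptions made for this example.

```python
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Residue sets copied from the ambiguous-notation table above.
AMBIGUOUS = {
    "X": "".join(sorted(THREE_TO_ONE.values())),
    "B": "DN", "Z": "EQ", "J": "IL",
    "Φ": "VILFWM", "Ω": "FWYH", "Ψ": "VILM", "π": "PGAS",
    "ζ": "STHNQEDKRY", "+": "KRH", "-": "DE",
}


def to_one_letter(three_letter: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Lys-Trp' (N to C) into 'MKW'."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in three_letter.split(sep))


def matches(pattern: str, sequence: str) -> bool:
    """True if every residue is allowed by the symbol at the same position."""
    if len(pattern) != len(sequence):
        return False
    return all(
        residue in AMBIGUOUS.get(symbol, symbol)  # plain letters match themselves
        for symbol, residue in zip(pattern, sequence)
    )


print(to_one_letter("Met-Lys-Trp"))
print(matches("BXΦ+", "NGLK"))
```

Because `-` is also the symbol for a negatively charged residue, three-letter input and ambiguity patterns are handled by separate functions here.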

Modification

In general, polypeptides are unbranched polymers, so their primary structure can often be specified by the sequence of amino acids along their backbone. However, proteins can become cross-linked, most commonly by disulfide bonds, and the primary structure also requires specifying the cross-linking atoms, e.g., specifying the cysteines involved in the protein's disulfide bonds. Other crosslinks include desmosine.
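
A primary structure with cross-links can therefore be represented as a sequence plus a list of linked positions. The sketch below is one possible representation, not a standard format: the class name, the 1-based position convention, and the example peptide are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PrimaryStructure:
    sequence: str  # one-letter code, N-terminus to C-terminus
    # Disulfide bonds as pairs of 1-based residue positions.
    disulfides: list[tuple[int, int]] = field(default_factory=list)

    def validate(self) -> None:
        """Check that each bond joins two distinct cysteines, each used only once."""
        used: set[int] = set()
        for pair in self.disulfides:
            for pos in pair:
                if not 1 <= pos <= len(self.sequence):
                    raise ValueError(f"position {pos} is outside the sequence")
                if self.sequence[pos - 1] != "C":
                    raise ValueError(f"residue {pos} is not a cysteine")
                if pos in used:
                    raise ValueError(f"cysteine {pos} is used in more than one bond")
                used.add(pos)


# Hypothetical peptide with one disulfide between Cys2 and Cys8.
peptide = PrimaryStructure("GCKACTWCE", [(2, 8)])
peptide.validate()
```

Two proteins with the same sequence but different disulfide pairings would compare as different under this representation, which is the point made in the paragraph above.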

Isomerisation

The chiral centers of a polypeptide chain can undergo racemization. Although this does not change the sequence, it does affect the chemical properties of the chain. In particular, the L-amino acids normally found in proteins can spontaneously isomerize at the Cα atom to form D-amino acids, which cannot be cleaved by most proteases. Additionally, proline can form stable cis-isomers at the peptide bond.

Post-translational modification

Additionally, the protein can undergo a variety of post-translational modifications, which are briefly summarized here.

The N-terminal amino group of a polypeptide can be modified covalently, e.g.,

Acetylation: The positive charge on the N-terminal amino group may be eliminated by changing it to an acetyl group (N-terminal blocking). (Fig. 1: N-terminal acetylation.)
Formylation: The N-terminal methionine usually found after translation has an N-terminus blocked with a formyl group. This formyl group (and sometimes the methionine residue itself, if followed by Gly or Ser) is removed by the enzyme deformylase.
Pyroglutamate: An N-terminal glutamine can attack itself, forming a cyclic pyroglutamate group. (Fig. 2: Formation of pyroglutamate from an N-terminal glutamine.)
Myristoylation: Similar to acetylation. Instead of a simple methyl group, the myristoyl group has a tail of 14 hydrophobic carbons, which makes it ideal for anchoring proteins to cellular membranes.

The C-terminal carboxylate group of a polypeptide can also be modified, e.g.,

Amidation: The C-terminus can also be blocked (thus neutralizing its negative charge) by amidation. (Fig. 3: C-terminal amidation.)
Glycosyl phosphatidylinositol (GPI) attachment: GPI is a large, hydrophobic phospholipid prosthetic group that anchors proteins to cellular membranes. It is attached to the polypeptide C-terminus through an amide linkage that then connects to ethanolamine, thence to sundry sugars and finally to the phosphatidylinositol lipid moiety.

Finally, the peptide side chains can also be modified covalently, e.g.,

Phosphorylation: Aside from cleavage, phosphorylation is perhaps the most important chemical modification of proteins. A phosphate group can be attached to the sidechain hydroxyl group of serine, threonine and tyrosine residues, adding a negative charge at that site and producing an unnatural amino acid. Such reactions are catalyzed by kinases and the reverse reaction is catalyzed by phosphatases. The phosphorylated tyrosines are often used as "handles" by which proteins can bind to one another, whereas phosphorylation of Ser/Thr often induces conformational changes, presumably because of the introduced negative charge. The effects of phosphorylating Ser/Thr can sometimes be simulated by mutating the Ser/Thr residue to glutamate.
Glycosylation: A catch-all name for a set of very common and very heterogeneous chemical modifications. Sugar moieties can be attached to the sidechain hydroxyl groups of Ser/Thr or to the sidechain amide groups of Asn. Such attachments can serve many functions, ranging from increasing solubility to complex recognition. N-linked glycosylation (at Asn) can be blocked with certain inhibitors, such as tunicamycin.
Deamidation: In this modification, an asparagine or aspartate side chain attacks the following peptide bond, forming a symmetrical succinimide intermediate. Hydrolysis of the intermediate produces either aspartate or the β-amino acid, iso(Asp). For asparagine, either product results in the loss of the amide group, hence "deamidation".
Hydroxylation: Proline residues may be hydroxylated at either of two atoms, as can lysine (at one atom). Hydroxyproline is a critical component of collagen, which becomes unstable upon its loss. The hydroxylation reaction is catalyzed by an enzyme that requires ascorbic acid (vitamin C), deficiencies in which lead to many connective-tissue diseases such as scurvy.
Methylation: Several protein residues can be methylated, most notably the positive groups of lysine and arginine. Arginine residues interact with the nucleic acid phosphate backbone and commonly form hydrogen bonds with the base residues, particularly guanine, in protein–DNA complexes. Lysine residues can be singly, doubly and even triply methylated. Methylation does not alter the positive charge on the side chain, however.
Acetylation: Acetylation of the lysine amino groups is chemically analogous to the acetylation of the N-terminus. Functionally, however, the acetylation of lysine residues is used to regulate the binding of proteins to nucleic acids. The cancellation of the positive charge on the lysine weakens the electrostatic attraction for the (negatively charged) nucleic acids.
Sulfation: Tyrosines may become sulfated on their Oη atom (the side-chain hydroxyl oxygen). Somewhat unusually, this modification occurs in the Golgi apparatus, not in the endoplasmic reticulum. Similar to phosphorylated tyrosines, sulfated tyrosines are used for specific recognition, e.g., in chemokine receptors on the cell surface. As with phosphorylation, sulfation adds a negative charge to a previously neutral site.
Prenylation and palmitoylation: The hydrophobic isoprene (e.g., farnesyl, geranyl, and geranylgeranyl groups) and palmitoyl groups may be added to the Sγ atom of cysteine residues to anchor proteins to cellular membranes. Unlike the GPI and myristoyl anchors, these groups are not necessarily added at the termini.
Carboxylation: A relatively rare modification that adds an extra carboxylate group (and, hence, a double negative charge) to a glutamate side chain, producing a Gla residue. This is used to strengthen the binding to "hard" metal ions such as calcium.
ADP-ribosylation: The large ADP-ribosyl group can be transferred to several types of side chains within proteins, with heterogeneous effects. This modification is a target for the powerful toxins of disparate bacteria, e.g., Vibrio cholerae, Corynebacterium diphtheriae and Bordetella pertussis.
Ubiquitination: Various full-length, folded proteins can be attached at their C-termini to the sidechain ammonium groups of lysines of other proteins. Ubiquitin is the most common of these, and usually signals that the ubiquitin-tagged protein should be degraded.
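
As noted above, the effects of phosphorylating Ser/Thr can sometimes be simulated by mutating the residue to glutamate. At the sequence level that substitution can be sketched as follows. The function name and the 1-based position convention are assumptions, and the code only edits the sequence string; it says nothing about whether the mutant actually mimics the phosphorylated protein.

```python
def phosphomimetic(sequence: str, position: int) -> str:
    """Replace the Ser or Thr at a 1-based position with glutamate (E)."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] not in "ST":
        raise ValueError(f"residue {position} is {sequence[index]}, not Ser or Thr")
    return sequence[:index] + "E" + sequence[index + 1:]


# Hypothetical peptide: mutate Ser3 to Glu.
print(phosphomimetic("MKSAT", 3))
```

Tyrosine is deliberately rejected, since the text describes the glutamate substitution only for Ser/Thr.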

Most of the polypeptide modifications listed above occur post-translationally, i.e., after the protein has been synthesized on the ribosome, typically occurring in the endoplasmic reticulum, a subcellular organelle of the eukaryotic cell.

Many other chemical reactions (e.g., cyanylation) have been applied to proteins by chemists, although they are not found in biological systems.

Cleavage and ligation

In addition to those listed above, the most important modification of primary structure is peptide cleavage (by chemical hydrolysis or by proteases). Proteins are often synthesized in an inactive precursor form; typically, an N-terminal or C-terminal segment blocks the active site of the protein, inhibiting its function. The protein is activated by cleaving off the inhibitory peptide.

Some proteins even have the power to cleave themselves. Typically, the hydroxyl group of a serine (rarely, threonine) or the thiol group of a cysteine residue will attack the carbonyl carbon of the preceding peptide bond, forming a tetrahedrally bonded intermediate [classified as a hydroxyoxazolidine (Ser/Thr) or hydroxythiazolidine (Cys) intermediate]. This intermediate tends to revert to the amide form, expelling the attacking group, since the amide form is usually favored by free energy (presumably due to the strong resonance stabilization of the peptide group). However, additional molecular interactions may render the amide form less stable; in that case the amino group is expelled instead, resulting in an ester (Ser/Thr) or thioester (Cys) bond in place of the peptide bond. This chemical reaction is called an N-O acyl shift.

The ester/thioester bond can be resolved in several ways:

Sequence compression

The compression of amino acid sequences is a comparatively challenging task. Existing specialized amino acid sequence compressors achieve low compression ratios compared with DNA sequence compressors, mainly because of the characteristics of the data. For example, modeling inversions is harder because of the information lost in the reverse direction (from amino acids to DNA sequence). Among current lossless compressors, AC2 is reported to provide the highest compression. [5] AC2 mixes various context models using neural networks and encodes the data using arithmetic coding.
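
To give a feel for why protein sequences are hard to compress, the sketch below computes the order-0 (context-free) Shannon entropy of a sequence in bits per residue. This is not AC2 or any published compressor; it is only a baseline under the assumption that each residue is coded independently from its observed frequency. For comparison, a uniform 20-letter alphabet needs log2(20), roughly 4.32 bits per residue.

```python
import math
from collections import Counter


def order0_entropy(sequence: str) -> float:
    """Shannon entropy in bits per residue, from single-residue frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


uniform_bound = math.log2(20)  # bits per residue for 20 equally likely residues
example = "ACDEFGHIKLMNPQRSTVWY"  # each standard residue exactly once
print(order0_entropy(example), uniform_bound)
```

An ideal arithmetic coder driven by these frequencies would approach this figure; context models such as those mixed in AC2 try to go below it by conditioning on preceding residues.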

History

The proposal that proteins were linear chains of α-amino acids was made nearly simultaneously by two scientists at the same conference in 1902, the 74th meeting of the Society of German Scientists and Physicians, held in Karlsbad. Franz Hofmeister made the proposal in the morning, based on his observations of the biuret reaction in proteins. Hofmeister was followed a few hours later by Emil Fischer, who had amassed a wealth of chemical details supporting the peptide-bond model. For completeness, the proposal that proteins contained amide linkages was made as early as 1882 by the French chemist E. Grimaux. [6]

Despite these data and later evidence that proteolytically digested proteins yielded only oligopeptides, the idea that proteins were linear, unbranched polymers of amino acids was not accepted immediately. Some well-respected scientists such as William Astbury doubted that covalent bonds were strong enough to hold such long molecules together; they feared that thermal agitations would shake such long molecules asunder. Hermann Staudinger faced similar prejudices in the 1920s when he argued that rubber was composed of macromolecules. [6]

Thus, several alternative hypotheses arose. The colloidal protein hypothesis stated that proteins were colloidal assemblies of smaller molecules. This hypothesis was disproved in the 1920s by ultracentrifugation measurements by Theodor Svedberg that showed that proteins had a well-defined, reproducible molecular weight and by electrophoretic measurements by Arne Tiselius that indicated that proteins were single molecules. A second hypothesis, the cyclol hypothesis advanced by Dorothy Wrinch, proposed that the linear polypeptide underwent a chemical cyclol rearrangement, C=O + HN → C(OH)-N, that crosslinked its backbone amide groups, forming a two-dimensional fabric. Other primary structures of proteins were proposed by various researchers, such as the diketopiperazine model of Emil Abderhalden and the pyrrol/piperidine model of Troensegaard in 1942. Although never given much credence, these alternative models were finally disproved when Frederick Sanger successfully sequenced insulin and by the crystallographic determination of myoglobin and hemoglobin by Max Perutz and John Kendrew.

Primary structure in other molecules

Any linear-chain heteropolymer can be said to have a "primary structure" by analogy to the usage of the term for proteins, but this usage is rare compared to the extremely common usage in reference to proteins. In RNA, which also has extensive secondary structure, the linear chain of bases is generally just referred to as the "sequence" as it is in DNA (which usually forms a linear double helix with little secondary structure). Other biological polymers such as polysaccharides can also be considered to have a primary structure, although the usage is not standard.

Relation to secondary and tertiary structure

The primary structure of a biological polymer to a large extent determines the three-dimensional shape (tertiary structure). Protein sequence can be used to predict local features, such as segments of secondary structure, or trans-membrane regions. However, the complexity of protein folding currently prohibits predicting the tertiary structure of a protein from its sequence alone. Knowing the structure of a similar homologous sequence (for example a member of the same protein family) allows highly accurate prediction of the tertiary structure by homology modeling. If the full-length protein sequence is available, it is possible to estimate its general biophysical properties, such as its isoelectric point.
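
Estimating the isoelectric point from sequence alone can be sketched with the Henderson–Hasselbalch equation and a bisection search for the pH at which the net charge is zero. The pKa values below are approximate, textbook-style numbers chosen for illustration (published sets differ and give slightly different results), and the model ignores modifications, disulfide bonds and the influence of neighbouring residues.

```python
# Approximate side-chain and terminal pKa values (an assumption; sets vary).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERMINUS = 8.6
PKA_C_TERMINUS = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of an unmodified linear peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))   # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))  # deprotonated C-terminus
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 1e-6) -> float:
    """Find the pH of zero net charge by bisection on the interval [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive, so the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


print(round(isoelectric_point("ACDEFGHIK"), 2))  # hypothetical peptide
```

Bisection works here because the net charge decreases monotonically as the pH rises, so there is exactly one zero crossing.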

Sequence families are often determined by sequence clustering, and structural genomics projects aim to produce a set of representative structures to cover the sequence space of possible non-redundant sequences.

Notes and references

  1. Sanger F (1952). "The arrangement of amino acids in proteins". In Anson ML; Bailey K; Edsall JT (eds.). Advances in Protein Chemistry. Vol. 7. pp. 1–67. doi:10.1016/S0065-3233(08)60017-0. ISBN 9780120342075. PMID 14933251.
  2. Aasland R, Abrams C, Ampe C, Ball LJ, Bedford MT, Cesareni G, Gimona M, Hurley JH, Jarchau T (2002-02-20). "Normalization of nomenclature for peptide motifs as ligands of modular protein domains". FEBS Letters. 513 (1): 141–144. doi:10.1016/S0014-5793(01)03295-1. ISSN 1873-3468. PMID 11911894.
  3. IUPAC-IUB Commission on Biochemical Nomenclature (1968-07-01). "A One-Letter Notation for Amino Acid Sequences". European Journal of Biochemistry. 5 (2): 151–153. doi:10.1111/j.1432-1033.1968.tb00350.x. ISSN 1432-1033.
  4. Hausman RE, Cooper GM (2004). The cell: a molecular approach. Washington, D.C: ASM Press. p. 51. ISBN 978-0-87893-214-6.
  5. Silva M, Pratas D, Pinho AJ (April 2021). "AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models". Entropy. 23 (5): 530. Bibcode:2021Entrp..23..530S. doi:10.3390/e23050530. PMC 8146440. PMID 33925812.
  6. Fruton JS (May 1979). "Early theories of protein structure". Ann. N. Y. Acad. Sci. 325 (1): xiv, 1–18. Bibcode:1979NYASA.325....1F. doi:10.1111/j.1749-6632.1979.tb14125.x. PMID 378063. S2CID 39125170.
