Micropeptides (also referred to as microproteins) are polypeptides with a length of less than 100-150 amino acids that are encoded by short open reading frames (sORFs). [1] [2] [3] In this respect, they differ from many other active small polypeptides, which are produced through the posttranslational cleavage of larger polypeptides. [1] [4] In terms of size, micropeptides are considerably shorter than "canonical" proteins, which have an average length of 330 and 449 amino acids in prokaryotes and eukaryotes, respectively. [5] Micropeptides are sometimes named according to their genomic location. For example, the translated product of an upstream open reading frame (uORF) might be called a uORF-encoded peptide (uPEP). [6] Micropeptides generally lack N-terminal signaling sequences, suggesting that they are likely to be localized to the cytoplasm. [1] However, some micropeptides have been found in other cell compartments, as indicated by the existence of transmembrane micropeptides. [7] [8] They are found in both prokaryotes and eukaryotes. [1] [9] [10] The sORFs from which micropeptides are translated can be encoded in 5' UTRs, small genes, or polycistronic mRNAs. Some micropeptide-coding genes were originally mis-annotated as long non-coding RNAs (lncRNAs). [11]
Given their small size, sORFs were originally overlooked. However, hundreds of thousands of putative micropeptides have been identified through various techniques in a multitude of organisms. Only a small fraction of these with coding potential have had their expression and function confirmed. Those that have been functionally characterized, in general, have roles in cell signaling, organogenesis, and cellular physiology. As more micropeptides are discovered, so are more of their functions. One regulatory function is that of peptoswitches, which stall ribosomes upon direct or indirect activation by small molecules and thereby inhibit expression of downstream coding sequences. [11]
Various experimental techniques exist for identifying potential sORFs and their translational products. These techniques are only useful for identifying sORFs that may produce micropeptides, not for direct functional characterization.
One method for finding potential sORFs, and therefore micropeptides, is through RNA sequencing (RNA-Seq). RNA-Seq uses next-generation sequencing (NGS) to determine which RNAs are expressed in a given cell, tissue, or organism at a specific point in time. This collection of data, known as a transcriptome, can then be used as a resource for finding potential sORFs. [1] Because of the strong likelihood of sORFs less than 100 aa occurring by chance, further study is necessary to determine the validity of data obtained using this method. [11]
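As a rough illustration of why transcriptome-derived sORF candidates need further validation, the following Python sketch scans a transcript for ATG-initiated ORFs of at most 100 codons in the three forward frames, and estimates how often a run of random codons stays free of stop codons. The function names, the 10-codon lower bound, and the assumption that all 64 codons are equally likely are illustrative choices, not part of any published pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq, min_codons=10, max_codons=100):
    """Return (start, end, frame) tuples for forward-strand sORFs.

    An ORF runs from the first in-frame ATG after a stop codon to the
    next in-frame stop codon; 'end' is the index just past the stop.
    Lengths are counted in codons, excluding the stop codon.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or cDNA input
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((start, i + 3, frame))
                start = None
    return hits


def chance_of_no_stop(n_codons):
    """Probability that n random codons contain no stop codon,
    assuming (hypothetically) that all 64 codons are equally likely."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    # Hypothetical transcript: 2-nt leader, a 13-codon ORF, a stop codon.
    transcript = "CC" + "ATG" + "GCT" * 12 + "TAA" + "GG"
    print(find_sorfs(transcript))
    print(chance_of_no_stop(30), chance_of_no_stop(300))
```

Under this simplified model, short stop-free stretches are far more probable than long ones, which is why a short ORF on its own is weak evidence of a coding sequence.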
Ribosome profiling has been used to identify potential micropeptides in a growing number of organisms, including fruit flies, zebrafish, mice and humans. [11] One method uses compounds such as harringtonine, puromycin or lactimidomycin to stop ribosomes at translation initiation sites. [12] This indicates where active translation is taking place. Translation elongation inhibitors, such as emetine or cycloheximide, may also be used to obtain ribosome footprints which are more likely to result in a translated ORF. [13] If a ribosome is bound at or near a sORF, it putatively encodes a micropeptide. [1] [2] [14]
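The idea that ribosome occupancy marks candidate sORFs can be sketched as a simple interval-overlap count. The coordinates, sORF names, and the function `footprints_per_sorf` below are hypothetical; real analyses work from aligned sequencing reads and apply further filters that are not shown here.

```python
def footprints_per_sorf(sorfs, footprints):
    """Count ribosome footprints that overlap each annotated sORF.

    sorfs: dict mapping a name to a half-open (start, end) interval.
    footprints: iterable of half-open (start, end) intervals on the
    same transcript coordinate system.
    """
    counts = {name: 0 for name in sorfs}
    for f_start, f_end in footprints:
        for name, (s_start, s_end) in sorfs.items():
            # Two half-open intervals overlap when each starts before
            # the other ends.
            if f_start < s_end and s_start < f_end:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sORF annotations and footprint positions.
    sorfs = {"sORF_a": (100, 160), "sORF_b": (300, 390)}
    footprints = [(95, 125), (130, 160), (160, 190), (350, 380)]
    print(footprints_per_sorf(sorfs, footprints))
```

An sORF with no overlapping footprints would be a weaker candidate under this scheme, while one with many footprints would merit follow-up.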
Mass spectrometry (MS) is the gold standard for identifying and sequencing proteins. Using this technique, investigators are able to determine if polypeptides are, in fact, translated from a sORF.
Proteogenomics combines proteomics, genomics, and transcriptomics. This is important when looking for potential micropeptides. One method of using proteogenomics entails using RNA-Seq data to create a custom database of all possible polypeptides. Liquid chromatography followed by tandem MS (LC-MS/MS) is performed to provide sequence information for translation products. Comparison of the transcriptomic and proteomics data can be used to confirm the presence of micropeptides. [1] [2]
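The custom-database step can be sketched in Python as follows: each transcript is translated in the three forward frames using the standard genetic code, complete ATG-to-stop products are collected, and peptide sequences reported by LC-MS/MS are then looked up in that collection. This is a deliberately simplified, string-level stand-in; actual search engines score fragmentation spectra rather than compare strings, and the function names and `min_len` threshold are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def candidate_peptides(transcript, min_len=5):
    """Return the set of Met-initiated, stop-terminated translation
    products found in the three forward frames of a transcript."""
    seq = transcript.upper().replace("U", "T")
    peptides = set()
    for frame in range(3):
        protein = "".join(
            CODON_TABLE.get(seq[i:i + 3], "X")  # X = unrecognized codon
            for i in range(frame, len(seq) - 2, 3)
        )
        # Drop the final segment: it is not followed by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_len:
                peptides.add(segment[start:])
    return peptides


def match_observed(observed, database):
    """Map each observed peptide fragment to the database entries
    that contain it as a substring."""
    return {frag: sorted(p for p in database if frag in p) for frag in observed}


if __name__ == "__main__":
    # Hypothetical transcript and MS-derived fragments.
    db = candidate_peptides("GGATGAAACCCGGGTTTTAA")
    print(db)
    print(match_observed(["KPGF", "AAAA"], db))
```

A fragment that maps to a database entry supports translation of the corresponding sORF; a fragment with no match is left unexplained by this transcript set.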
Phylogenetic conservation can be a useful tool, particularly when sifting through a large database of sORFs. A sORF is more likely to result in a functional micropeptide if it is conserved across numerous species. [11] [12] However, this will not work for all sORFs. For example, those that are encoded by lncRNAs are less likely to be conserved, given that lncRNAs themselves do not have high sequence conservation. [2] Further experimentation will be necessary to determine if a functional micropeptide is in fact produced.
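One very simple way to put a number on conservation is the fraction of alignment columns that are identical in every species. The sketch below assumes the orthologous peptide sequences have already been aligned to equal length (with "-" for gaps) by some external tool; the function name and example sequences are hypothetical, and dedicated conservation scores used in practice are considerably more sophisticated.

```python
def conserved_fraction(aligned):
    """Fraction of alignment columns with the same residue in all
    sequences. Columns containing only gaps do not count as conserved."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need one or more aligned sequences of equal length")
    length = len(aligned[0])
    if length == 0:
        return 0.0
    identical = sum(
        1 for column in zip(*aligned)
        if len(set(column)) == 1 and column[0] != "-"
    )
    return identical / length


if __name__ == "__main__":
    # Hypothetical aligned orthologs from three species.
    print(conserved_fraction(["MKPGF", "MKPGF", "MRPGF"]))
```

Candidates could then be ranked by this fraction, keeping in mind that a low score does not rule out function, as noted above for lncRNA-encoded sORFs.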
Custom antibodies targeted to the micropeptide of interest can be useful for quantifying expression or determining intracellular localization. As is the case with most proteins, low expression may make detection difficult. The small size of the micropeptide can also make it difficult to find an epitope against which to target the antibody. [2]
Genome editing can be used to add FLAG/MYC or other small peptide tags to an endogenous sORF, thus creating fusion proteins. In most cases, this method is beneficial in that it can be performed more quickly than developing a custom antibody. It is also useful for micropeptides for which no epitope can be targeted. [2]
In vitro translation entails cloning the full-length micropeptide cDNA into a plasmid containing a T7 or SP6 promoter. This method utilizes a cell-free protein-synthesizing system in the presence of 35S-methionine to produce the peptide of interest. The products can then be analyzed by gel electrophoresis and the 35S-labeled peptide is visualized using autoradiography. [2]
There are several repositories and databases that have been created for both sORFs and micropeptides. A repository of small ORFs discovered by ribosome profiling can be found at sORFs.org. [15] [16] A repository of putative sORF-encoded peptides in Arabidopsis thaliana can be found at ARA-PEPs. [17] [18] A database of small proteins, especially those encoded by non-coding RNAs, can be found at SmProt. [19] [20]
To date, most micropeptides have been identified in prokaryotic organisms. While most have yet to be fully characterized, of those that have been studied, many appear to be critical to the survival of these organisms. Because of their small size, prokaryotes are particularly susceptible to changes in their environment, and as such have developed methods to ensure their existence.
Micropeptides expressed in E. coli exemplify bacterial environmental adaptations. Most of these have been classified into three groups: leader peptides, ribosomal proteins, and toxic proteins. Leader peptides regulate transcription and/or translation of proteins involved in amino acid metabolism when amino acids are scarce. Ribosomal proteins include L36 (rpmJ) and L34 (rpmH), two components of the 50S ribosomal subunit. Toxic proteins, such as ldrD, are toxic at high levels and can kill cells or inhibit growth, which functions to reduce the host cell's viability. [21]
In S. enterica, the MgtC virulence factor is involved in adaptation to low magnesium environments. The hydrophobic peptide MgtR binds to MgtC, causing its degradation by the FtsH protease. [9]
The 46 aa Sda micropeptide, expressed by B. subtilis, represses sporulation when replication initiation is impaired. By inhibiting the histidine kinase KinA, Sda prevents the activation of the transcription factor Spo0A, which is required for sporulation. [10]
In S. aureus, there is a group of micropeptides, 20-22 aa in length, that are secreted during host infection to disrupt neutrophil membranes, causing cell lysis. These micropeptides allow the bacterium to avoid degradation by the main defenses of the human immune system. [22] [23]
Micropeptides have been discovered in eukaryotic organisms from Arabidopsis thaliana to humans. They play diverse roles in tissue and organ development, as well as maintenance and function once fully developed. While many are yet to be functionally characterized, and likely more remain to be discovered, below is a summary of recently identified eukaryotic micropeptide functions.
In A. thaliana, the POLARIS (PLS) gene encodes a 36 aa micropeptide. It is necessary for proper vascular leaf patterning and cell expansion in the root. This micropeptide interacts with developmental PIN proteins to form a critical network for hormonal crosstalk between auxin, ethylene, and cytokinin. [24] [25] [26]
ROTUNDIFOLIA (ROT4) in A. thaliana encodes a 53 aa peptide, which localizes to the plasma membrane of leaf cells. The mechanism of ROT4 function is not well understood, but mutants have short rounded leaves, indicating that this peptide may be important in leaf morphogenesis. [27]
Brick1 (Brk1) encodes a 76 aa micropeptide, which is highly conserved in both plants and animals. In Z. mays, it was found to be involved in morphogenesis of leaf epithelia, by promoting multiple actin-dependent cell polarization events in the developing leaf epidermis. [28] Zm401p10 is an 89 aa micropeptide, which plays a role in normal pollen development in the tapetum. After mitosis it also is essential in the degradation of the tapetum. [29] Zm908p11 is a micropeptide 97 aa in length, encoded by the Zm908 gene that is expressed in mature pollen grains. It localizes to the cytoplasm of pollen tubes, where it aids in their growth and development. [30]
The evolutionarily conserved polished rice (pri) gene, known as tarsal-less (tal) in D. melanogaster, is involved in epidermal differentiation. This polycistronic transcript encodes four similar peptides, which range from 11 to 32 aa in length. They function to truncate the transcription factor Shavenbaby (Svb). This converts Svb into an activator that directly regulates the expression of target effectors, including miniature (m) and shavenoid (sha), which are together responsible for trichome formation. [31]
The Elabela gene (Ela) (a.k.a. Apela, Toddler) is important for embryogenesis. [32] It is specifically expressed during late blastula and gastrula stages. During gastrulation, it is critical in promoting the internalization and animal-pole directed movement of mesendodermal cells. After gastrulation, Ela is expressed in the lateral mesoderm, the endoderm, and the anterior and posterior notochord. Although it was annotated as a lncRNA in zebrafish, mouse, and human, the 58-aa ORF was found to be highly conserved among vertebrate species. Ela is processed by removal of its N-terminal signal peptide and then secreted into the extracellular space. Its 34-aa mature peptide serves as the first endogenous ligand to a GPCR known as the Apelin Receptor. [33] [32] The genetic inactivation of Ela or Aplnr in zebrafish results in heartless phenotypes. [34] [35]
Myoregulin (Mln) is encoded by a gene originally annotated as a lncRNA. Mln is expressed in all three types of skeletal muscle, and works similarly to the micropeptides phospholamban (Pln) in cardiac muscle and sarcolipin (Sln) in slow (Type I) skeletal muscle. These micropeptides interact with sarcoplasmic reticulum Ca2+-ATPase (SERCA), a membrane pump responsible for regulating Ca2+ uptake into the sarcoplasmic reticulum (SR). By inhibiting Ca2+ uptake into the SR, they modulate muscle contractility, since SERCA-mediated Ca2+ re-uptake is what allows muscle to relax. Similarly, the endoregulin (ELN) and another-regulin (ALN) genes code for transmembrane micropeptides that contain the SERCA binding motif, and are conserved in mammals. [7]
Myomixer (Mymx), encoded by the gene Gm7325, is a muscle-specific peptide, 84 aa in length, which plays a role during embryogenesis in myoblast fusion and skeletal muscle formation. It localizes to the plasma membrane, associating with a fusogenic membrane protein, Myomaker (Mymk). In humans, the gene encoding Mymx is annotated as uncharacterized LOC101929726. Orthologs are found in the turtle, frog and fish genomes as well. [8]
In humans, NoBody (non-annotated P-body dissociating polypeptide), a 68 aa micropeptide, was discovered in the long intervening noncoding RNA (lincRNA) LINC01420. It has high sequence conservation among mammals, and localizes to P-bodies. It enriches proteins associated with 5’ mRNA decapping. It is thought to interact directly with Enhancer of mRNA Decapping 4 (EDC4). [36]
ELABELA (ELA) (a.k.a. APELA) is an endogenous hormone that is secreted as a 32 amino acid micropeptide by human embryonic stem cells. [32] It is essential to maintain the self-renewal and pluripotency of human embryonic stem cells. It signals in an autocrine fashion through the PI3K/AKT pathway via an as yet unidentified cell surface receptor. [37] In differentiating mesendodermal cells, ELA binds to, and signals via, APLNR, a GPCR which can also respond to the hormonal peptide APLN.
The CYREN gene, conserved in mammals, is predicted to produce three micropeptides when alternatively spliced. MRI-1 was previously found to be a modulator of retrovirus infection. The second predicted micropeptide, MRI-2, may be important in non-homologous end joining (NHEJ) of DNA double strand breaks. In co-immunoprecipitation experiments, MRI-2 bound to Ku70 and Ku80, two subunits of Ku, which play a major role in the NHEJ pathway. [38]
The 24 amino acid micropeptide, Humanin (HN), interacts with the apoptosis-inducing protein Bcl2-associated X protein (Bax). In its active state, Bax undergoes a conformational change which exposes membrane-targeting domains. This causes it to move from the cytosol to the mitochondrial membrane, where it inserts and releases apoptogenic proteins such as cytochrome c. By interacting with Bax, HN prevents Bax targeting of the mitochondria, thereby blocking apoptosis. [39]
A micropeptide of 90 aa, ‘Small Regulatory Polypeptide of Amino Acid Response’ or SPAAR, was found to be encoded in the lncRNA LINC00961. It is conserved between human and mouse, and localizes to the late endosome/lysosome. SPAAR interacts with four subunits of the v-ATPase complex, inhibiting mTORC1 translocation to the lysosomal surface where it is activated. Down-regulation of this micropeptide enables mTORC1 activation by amino acid stimulation, promoting muscle regeneration. [40]
Protein biosynthesis is a core biological process, occurring inside cells, balancing the loss of cellular proteins through the production of new proteins. Proteins perform a number of critical functions as enzymes, structural proteins or hormones. Protein synthesis is a very similar process for both prokaryotes and eukaryotes but there are some distinct differences.
Ribosomes are macromolecular machines, found within all cells, that perform biological protein synthesis. Ribosomes link amino acids together in the order specified by the codons of messenger RNA molecules to form polypeptide chains. Ribosomes consist of two major components: the small and large ribosomal subunits. Each subunit consists of one or more ribosomal RNA molecules and many ribosomal proteins. The ribosomes and associated molecules are also known as the translational apparatus.
In biology, translation is the process in living cells in which proteins are produced using RNA molecules as templates. The generated protein is a sequence of amino acids. This sequence is determined by the sequence of nucleotides in the RNA. The nucleotides are considered three at a time. Each such triple results in addition of one specific amino acid to the protein being generated. The matching from nucleotide triple to amino acid is called the genetic code. The translation is performed by a large complex of functional RNA and proteins called ribosomes. The entire process is called gene expression.
A signal peptide is a short peptide present at the N-terminus of most newly synthesized proteins that are destined toward the secretory pathway. These proteins include those that reside inside certain organelles, are secreted from the cell, or are inserted into most cellular membranes. Although most type I membrane-bound proteins have signal peptides, most type II and multi-spanning membrane-bound proteins are targeted to the secretory pathway by their first transmembrane domain, which biochemically resembles a signal sequence except that it is not cleaved. Signal peptides are a kind of target peptide.
In molecular biology, a reading frame is a way of dividing the sequence of nucleotides in a nucleic acid molecule into a set of consecutive, non-overlapping triplets. Where these triplets equate to amino acids or stop signals during translation, they are called codons.
Bacterial translation is the process by which messenger RNA is translated into proteins in bacteria.
Eukaryotic translation is the biological process by which messenger RNA is translated into proteins in eukaryotes. It consists of four phases: initiation, elongation, termination, and recycling.
Systemin is a plant peptide hormone involved in the wound response in the family Solanaceae. It was the first plant hormone proven to be a peptide, having been isolated from tomato leaves in 1991 by a group led by Clarence A. Ryan. Since then, other peptides with similar functions have been identified in tomato and outside of the Solanaceae. Hydroxyproline-rich glycopeptides were found in tobacco in 2001 and AtPeps were found in Arabidopsis thaliana in 2006. Their precursors are found both in the cytoplasm and cell walls of plant cells; upon insect damage, the precursors are processed to produce one or more mature peptides. The receptor for systemin was first thought to be the same as the brassinolide receptor, but this is now uncertain. The signal transduction processes that occur after the peptides bind are similar to the cytokine-mediated inflammatory immune response in animals. Early experiments showed that systemin travelled around the plant after insects had damaged the plant, activating systemic acquired resistance; it is now thought that it increases the production of jasmonic acid, causing the same result. The main function of systemins is to coordinate defensive responses against insect herbivores, but they also affect plant development. Systemin induces the production of protease inhibitors, which protect against insect herbivores; other peptides activate defensins and modify root growth. They have also been shown to affect plants' responses to salt stress and UV radiation. AtPeps have been shown to affect resistance against oomycetes and may allow A. thaliana to distinguish between different pathogens. In Nicotiana attenuata, some of the peptides have stopped being involved in defensive roles and instead affect flower morphology.
EF-Tu is a prokaryotic elongation factor responsible for catalyzing the binding of an aminoacyl-tRNA (aa-tRNA) to the ribosome. It is a G-protein, and facilitates the selection and binding of an aa-tRNA to the A-site of the ribosome. As a reflection of its crucial role in translation, EF-Tu is one of the most abundant and highly conserved proteins in prokaryotes. It is found in eukaryotic mitochondria as TUFM.
Eukaryotic translation termination factor 1 (eRF1), also referred to as TB3-1 or SUP45L1, is a protein that is encoded by the ERF1 gene. In eukaryotes, eRF1 is an essential protein involved in stop codon recognition in translation, termination of translation, and nonsense mediated mRNA decay via the SURF complex.
Ribophorins are dome shaped transmembrane glycoproteins which are located in the membrane of the rough endoplasmic reticulum, but are absent in the membrane of the smooth endoplasmic reticulum. There are two types of ribophorins: ribophorin I and II. These act as two different subunits of the oligosaccharyltransferase (OST) protein complex. Ribophorin I and II are only present in eukaryotic cells.
EF-G is a prokaryotic elongation factor involved in mRNA translation. As a GTPase, EF-G catalyzes the movement (translocation) of transfer RNA (tRNA) and messenger RNA (mRNA) through the ribosome.
Long non-coding RNAs (lncRNAs) are a type of RNA, generally defined as transcripts of more than 200 nucleotides that are not translated into protein. This arbitrary limit distinguishes long ncRNAs from small non-coding RNAs, such as microRNAs (miRNAs), small interfering RNAs (siRNAs), Piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), and other short RNAs. Given that some lncRNAs have been reported to have the potential to encode small proteins or micropeptides, the latest definition of lncRNA is a class of transcripts of over 200 nucleotides that have no or limited coding capacity. However, John S. Mattick and colleagues have suggested changing the definition of long non-coding RNAs to transcripts of more than 500 nt, which are mostly generated by Pol II. The exact definition of lncRNA is therefore still under discussion in the field. Long intervening/intergenic noncoding RNAs (lincRNAs) are sequences of transcripts that do not overlap protein-coding genes.
Peptide signaling plays a significant role in various aspects of plant growth and development and specific receptors for various peptides have been identified as being membrane-localized receptor kinases, the largest family of receptor-like molecules in plants. Signaling peptides include members of the following protein families.
Ribosomes are large and complex molecular machines that catalyze the synthesis of proteins, a process referred to as translation. The ribosome selects aminoacylated transfer RNAs (tRNAs) based on the sequence of a protein-encoding messenger RNA (mRNA) and covalently links the amino acids into a polypeptide chain. Ribosomes from all organisms share a highly conserved catalytic center. However, the ribosomes of eukaryotes are much larger than prokaryotic ribosomes and subject to more complex regulation and biogenesis pathways. Eukaryotic ribosomes are also known as 80S ribosomes, referring to their sedimentation coefficients in Svedberg units, because they sediment faster than the prokaryotic (70S) ribosomes. Eukaryotic ribosomes have two unequal subunits, designated small subunit (40S) and large subunit (60S) according to their sedimentation coefficients. Both subunits contain dozens of ribosomal proteins arranged on a scaffold composed of ribosomal RNA (rRNA). The small subunit monitors the complementarity between tRNA anticodon and mRNA, while the large subunit catalyzes peptide bond formation.
Chloroplast DNA (cpDNA), also known as plastid DNA (ptDNA) is the DNA located in chloroplasts, which are photosynthetic organelles located within the cells of some eukaryotic organisms. Chloroplasts, like other types of plastid, contain a genome separate from that in the cell nucleus. The existence of chloroplast DNA was identified biochemically in 1959, and confirmed by electron microscopy in 1962. The discoveries that the chloroplast contains ribosomes and performs protein synthesis revealed that the chloroplast is genetically semi-autonomous. The first complete chloroplast genome sequences were published in 1986, Nicotiana tabacum (tobacco) by Sugiura and colleagues and Marchantia polymorpha (liverwort) by Ozeki et al. Since then, tens of thousands of chloroplast genomes from various species have been sequenced.
CLE peptides are a group of peptides found in plants that are involved with cell signaling. Production is controlled by the CLE genes. Upon binding to a CLE peptide receptor in another cell, a chain reaction of events occurs, which can lead to various physiological and developmental processes. This signaling pathway is conserved in diverse land plants.
EF-Tu receptor, abbreviated as EFR, is a pattern-recognition receptor (PRR) in Arabidopsis thaliana that binds to the prokaryotic protein EF-Tu. This receptor is an important part of the plant immune system as it allows the plant cells to recognize and bind to EF-Tu, preventing genetic transformation by and protein synthesis in pathogens such as Agrobacterium.
Translatomics is the study of all open reading frames (ORFs) that are being actively translated in a cell or organism. This collection of ORFs is called the translatome. Characterizing a cell's translatome can give insight into the array of biological pathways that are active in the cell. According to the central dogma of molecular biology, the DNA in a cell is transcribed to produce RNA, which is then translated to produce a protein. Thousands of proteins are encoded in an organism's genome, and the proteins present in a cell cooperatively carry out many functions to support the life of the cell. Under various conditions, such as during stress or specific timepoints in development, the cell may require different biological pathways to be active, and therefore require a different collection of proteins. Depending on intrinsic and environmental conditions, the collection of proteins being made at one time varies. Translatomic techniques can be used to take a "snapshot" of this collection of actively translating ORFs, which can give information about which biological pathways the cell is activating under the present conditions.
Arabidopsis SUMO-conjugation enzyme (AtSCE1) is an enzyme that is a member of the small ubiquitin-like modifier (SUMO) post-translational modification pathway. This process, and the SCE1 enzyme with it, is highly conserved across eukaryotes yet absent in prokaryotes. In short, this pathway results in the attachment of a small polypeptide through an isopeptide bond between modifying enzyme and the ε-amino group of a lysine residue in the substrate. In plants, the 160 amino acid SCE1 enzyme was first characterized in 2003. One functional gene copy, SCE1a, was found on chromosome 3.
This article was adapted from the following source under a CC BY 4.0 license (2018) (reviewer reports): Maria E. Sousa; Michael H. Farkas (13 December 2018). "Micropeptide". PLOS Genetics. 14 (12): e1007764. doi:10.1371/JOURNAL.PGEN.1007764. ISSN 1553-7390. PMC 6292567. PMID 30543625. Wikidata Q60017699.