Protein structural phylogenetics

Protein structural phylogenetics (or structural phylogenetics) is the branch of molecular evolution that incorporates three-dimensional information from protein structure to understand phylogenetic relationships, and translates those evolutionary insights into an understanding of protein structure and function. [1] Protein structures are more robust than amino acid sequences over long evolutionary time scales. [2] [3] [4] The number of protein sequences that can fold into a given structure is astronomical: one study estimated that even a small protein structure of fewer than 100 amino acids can be attained by a number of sequences that exceeds the Avogadro constant. [5] These properties make structures useful for understanding deep evolutionary relationships, where sequences have become saturated with mutations and share very low levels of similarity. [6]

History

Protein structures have been used to explore evolutionary relationships since the 1970s. [7] The approach gained popularity in the 1990s and 2000s as techniques in structural biology advanced, namely X-ray crystallography, nuclear magnetic resonance, and electron microscopy. Throughout this period, several studies investigated deep evolutionary relationships through the analysis of aligned protein structures, for example, the immunoglobulins, [8] aminoacyl-tRNA synthetases, [9] and metallo-β-lactamases. [10] However, the field was still constrained by the limited number of entries in the Protein Data Bank.

In the early 2020s, the arrival of protein structure prediction methods such as AlphaFold2 [11] made high-quality structural data readily available. Although predicted structures are still less accurate than experimentally solved structures, [12] three-dimensional protein structure has nevertheless become increasingly important within the field of phylogenetics. Recent studies have used it to gain insights into the evolution of Flavivirus glycoproteins, [13] fungal virulence factors, [14] and gamete fusion proteins. [15] Despite the abundance of protein structural data, the methods for analyzing these structures have not kept pace with those used to estimate phylogenies from sequence.

Structural data and alignment

Inferring phylogenetic trees from protein structure usually relies on a structural alignment. Numerous software packages are available to perform this task, each with its own strengths and limitations. [16]

Methods for estimating phylogenies from protein structure

Atomic coordinates

The simplest methods for building phylogenies from protein structures are based on atomic-level comparisons using measures such as RMSD and TM-score, among others. [16] [1] The resulting pairwise distances are often turned into a tree using the neighbor joining algorithm. A key limitation of this distance-based approach is that it cannot readily quantify statistical uncertainty, such as through bootstrap or posterior clade support. Some have used molecular dynamics simulations to estimate bootstrap support, although this approach is computationally demanding. [17]
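As an illustration of the distance-based approach, the following is a minimal sketch that computes a pairwise RMSD matrix after optimal rigid-body superposition (the Kabsch algorithm). It assumes each structure has already been reduced to an (N, 3) array of matched atom coordinates, for example the aligned C-alpha atoms taken from a structural alignment; the function names `kabsch_rmsd` and `rmsd_matrix` are hypothetical and do not belong to any particular package.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumes row i of P and row i of Q are equivalent (already aligned) atoms.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centring both structures on their centroids.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)

    # Kabsch: optimal rotation from the SVD of the covariance matrix.
    H = P0.T @ Q0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    diff = P0 @ R.T - Q0
    return float(np.sqrt((diff ** 2).sum() / len(P0)))


def rmsd_matrix(structures):
    """Symmetric matrix of pairwise Kabsch RMSDs for a list of (N, 3) arrays."""
    n = len(structures)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = kabsch_rmsd(structures[i], structures[j])
    return D
```

A matrix produced this way could be passed to any neighbor joining implementation. Note that it yields a single point estimate of the tree, which is the limitation described above.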

More advanced methods are model-based, meaning they describe probabilistic generative processes and can provide a more reliable means of quantifying uncertainty in a maximum likelihood or Bayesian phylogenetic framework. The Challis-Schmidler model describes protein structural drift over long evolutionary time frames as an Ornstein–Uhlenbeck process. [18] [19] This Bayesian total-evidence model estimates the sequence alignment and the structural alignment within a single analysis. A key limitation of this method is its energetically unrealistic assumption of independent drift across all positions in the protein. This restriction was later addressed by the Larson-Thorne-Schmidler model. [20]
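For reference, a generic one-dimensional Ornstein–Uhlenbeck process can be written as below. The symbols (mean-reversion rate θ, long-run mean μ, diffusion scale σ, Wiener process W_t) are standard textbook notation chosen for illustration and are not necessarily the parameterisation used in the cited models.

```latex
% Stochastic differential equation of a one-dimensional OU process
dX_t = -\theta\,(X_t - \mu)\,dt + \sigma\,dW_t

% Transition density after time t, starting from X_0 = x_0
X_t \mid X_0 = x_0 \;\sim\; \mathcal{N}\!\left(
    \mu + (x_0 - \mu)\,e^{-\theta t},\;
    \frac{\sigma^{2}}{2\theta}\left(1 - e^{-2\theta t}\right)
\right)

% Stationary distribution as t \to \infty
X_\infty \;\sim\; \mathcal{N}\!\left(\mu,\; \frac{\sigma^{2}}{2\theta}\right)
```

Because the variance is bounded as time grows, a coordinate evolving under such a process fluctuates around its mean rather than diffusing without limit. Applying the process independently to each position corresponds to the independence assumption noted above.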

Structural alphabets

Protein structures can also be represented as sequences of characters from a structural alphabet. Typically, one character is assigned to each amino acid residue in the sequence. Changes between characters can then be modelled as a continuous-time Markov process, which enables structural phylogenies to be built using the same methodologies that are used in sequence phylogenetics, including maximum likelihood and Bayesian inference. The earliest efforts involved simple alphabets that describe protein secondary structure and surface accessibility. [21] [22] The 3Di alphabet employed by Foldseek [23] is widely used today. [24] [25] [26] This alphabet consists of twenty characters informed by the protein tertiary structure. While 3Di phylogenetics has become widely applied in recent years, its key limitation is the standard phylogenetic assumption of independence between sites, which is violated by the 3Di characters themselves because they are defined by tertiary structure interactions.
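The continuous-time Markov treatment of a structural alphabet can be sketched as follows. This toy example uses a three-state secondary-structure alphabet (H, E, C) with made-up symmetric exchange rates and uniform state frequencies; the rates, the function names, and the lack of rate normalisation are all illustrative assumptions, not values or APIs from any published model.

```python
import numpy as np

# Toy structural alphabet: helix, strand, coil (illustrative only).
STATES = ("H", "E", "C")

# Hypothetical symmetric exchange rates between states.
RATES = {("H", "E"): 0.1, ("H", "C"): 1.0, ("E", "C"): 1.0}


def rate_matrix(states=STATES, rates=RATES):
    """Build a symmetric rate matrix Q whose rows sum to zero."""
    k = len(states)
    Q = np.zeros((k, k))
    for (a, b), r in rates.items():
        i, j = states.index(a), states.index(b)
        Q[i, j] = Q[j, i] = r
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q


def transition_probabilities(Q, t):
    """P(t) = exp(Q t), computed by eigendecomposition of a symmetric Q."""
    w, V = np.linalg.eigh(Q)
    return (V * np.exp(w * t)) @ V.T


def pair_log_likelihood(seq_a, seq_b, Q, t, states=STATES):
    """Log-likelihood of two aligned structural strings separated by time t.

    Sites are treated as independent, which is the standard assumption
    discussed in the text; state frequencies are taken to be uniform.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    P = transition_probabilities(Q, t)
    pi = 1.0 / len(states)
    idx = {s: i for i, s in enumerate(states)}
    return float(sum(np.log(pi * P[idx[a], idx[b]])
                     for a, b in zip(seq_a, seq_b)))
```

For example, `pair_log_likelihood("HHEEC", "HHECC", rate_matrix(), 0.2)` scores one candidate branch length; maximising this quantity over `t` gives a model-based distance between the two structures under these assumptions.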

References

  1. Puente-Lelievre, C; Malik, A & Douglas, J (2025). "Protein Structural Phylogenetics". Genome Biology and Evolution. 17 (8): evaf139. doi:10.1093/gbe/evaf139. PMC 12369579. PMID 40839422.
  2. Chothia, C & Lesk, AM (1986). "The relation between the divergence of sequence and structure in proteins". EMBO J. 5 (4): 823–826. doi:10.1002/j.1460-2075.1986.tb04288.x. PMC 1166865. PMID 3709526.
  3. Flores, TP; Orengo, CA; Moss, DS & Thornton, JM (1993). "Comparison of conformational characteristics in structurally similar protein pairs". Protein Sci. 2 (11): 1811–1826. doi:10.1002/pro.5560021104. PMC 2142289. PMID 8268794.
  4. Illergård, K; Ardell, DH & Elofsson, A (2009). "Structure is three to ten times more conserved than sequence—a study of structural response in protein cores". Proteins. 77 (3): 499–508. doi:10.1002/prot.22458. PMID 19507241.
  5. Tian, P; Best, RB (17 October 2017). "How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis". Biophysical Journal. 113 (8): 1719–1730. doi:10.1016/j.bpj.2017.08.039. PMC 5647607. PMID 29045866.
  6. Rost, B (1999). "Twilight zone of protein sequence alignments". Protein Eng. 12 (2): 85–94. doi:10.1093/protein/12.2.85. PMID 10195279.
  7. Eventoff, W; Rossmann, MG & Brändén, C-I (1975). "The evolution of dehydrogenases and kinase". CRC Crit Rev Biochem. 3 (2): 111–140. doi:10.3109/10409237509102554. PMID 1100315.
  8. Johnson, MS; Sutcliffe, MJ & Blundell, TL (1990). "Molecular anatomy: phyletic relationships derived from three-dimensional structures of proteins". J Mol Evol. 30 (1): 43–59. Bibcode:1990JMolE..30...43J. doi:10.1007/BF02102452. PMID 2107323.
  9. O’Donoghue, P & Luthey-Schulten, Z (2003). "On the evolution of structure in aminoacyl-tRNA synthetases". Microbiol Mol Biol Rev. 67 (4): 550–573. doi:10.1128/MMBR.67.4.550-573.2003. PMC 309052. PMID 14665676.
  10. Garau, G; Di Guilmi, AM & Hall, BG (2005). "Structure-based phylogeny of the metallo-β-lactamases". Antimicrob Agents Chemother. 49 (7): 2778–2784. doi:10.1128/AAC.49.7.2778-2784.2005. PMC 1168685. PMID 15980349.
  11. Jumper, J; Evans, R; Pritzel, A; Green, T; Figurnov, M; Ronneberger, O & Hassabis, D (2021). "Highly accurate protein structure prediction with AlphaFold". Nature. 596 (7873): 583–589. Bibcode:2021Natur.596..583J. doi:10.1038/s41586-021-03819-2. PMC 8371605. PMID 34265844.
  12. Terwilliger, TC; Liebschner, D; Croll, TI; Williams, CJ; McCoy, AJ; Poon, BK & Adams, PD (2024). "AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination". Nat Methods. 21 (1): 110–116. doi:10.1038/s41592-023-02087-4. PMC 10776388. PMID 38036854.
  13. Mifsud, JC; Lytras, S; Oliver, MR; Toon, K; Costa, VA; Holmes, EC & Grove, J (2024). "Mapping glycoprotein structure reveals Flaviviridae evolutionary history". Nature. 633 (8030): 695–703. Bibcode:2024Natur.633..695M. doi:10.1038/s41586-024-07899-8. PMC 11410658. PMID 39232167.
  14. Lahfa, M; Barthe, P; De Guillen, K; Cesari, S; Raji, M; Kroj, T & Padilla, A (2024). "The structural landscape and diversity of Pyricularia oryzae MAX effectors revisited". PLOS Pathog. 20 (5): e1012176. doi:10.1371/journal.ppat.1012176. PMC 11132498. PMID 38709846.
  15. Moi, D; Nishio, S; Li, X; Valansi, C; Langleib, M; Brukman, NG & Podbilewicz, B (2022). "Discovery of archaeal fusexins homologous to eukaryotic HAP2/GCS1 gamete fusion proteins". Nat Commun. 13 (1): 3880. Bibcode:2022NatCo..13.3880M. doi:10.1038/s41467-022-31564-1. PMC 9259645. PMID 35794124.
  16. Hasegawa, H & Holm, L (2009). "Advances and pitfalls of protein structural alignment". Curr Opin Struct Biol. 19 (3): 341–348. doi:10.1016/j.sbi.2009.04.003. PMID 19481444.
  17. Malik, AJ; Poole, AM & Allison, JR (2020). "Structural phylogenetics with confidence". Mol Biol Evol. 37 (9): 2711–2726. doi:10.1093/molbev/msaa100. PMC 7475046. PMID 32302382.
  18. Challis, CJ & Schmidler, SC (2012). "A stochastic evolutionary model for protein structure alignment and phylogeny". Mol Biol Evol. 29 (11): 3575–3587. doi:10.1093/molbev/mss167. PMC 3697813. PMID 22723302.
  19. Herman, JL; Challis, CJ; Novák, A; Hein, J & Schmidler, SC (2014). "Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure". Mol Biol Evol. 31 (9): 2251–2266. doi:10.1093/molbev/msu184. PMC 4137710. PMID 24899668.
  20. Larson, G; Thorne, JL & Schmidler, S (2020). "Incorporating nearest-neighbor site dependence into protein evolution models". J Comput Biol. 27 (3): 361–375. doi:10.1089/cmb.2019.0500. PMC 7081252. PMID 32053390.
  21. Le, SQ & Gascuel, O (2010). "Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial". Syst Biol. 59 (3): 277–287. doi:10.1093/sysbio/syq002. PMID 20525635.
  22. Lai, J-S; Rost, B; Kobe, B & Bodén, M (2020). "Evolutionary model of protein secondary structure capable of revealing new biological relationships". Proteins. 88 (9): 1251–1259. doi:10.1002/prot.v88.9. PMID 32394426.
  23. Van Kempen, M; Kim, SS; Tumescheit, C; Mirdita, M; Lee, J; Gilchrist, CL & Steinegger, M (2024). "Fast and accurate protein structure search with Foldseek". Nat Biotechnol. 42 (2): 243–246. doi:10.1038/s41587-023-01773-0. PMC 10869269. PMID 37156916.
  24. Moi, D; Bernard, C; Steinegger, M; Nevers, Y; Langleib, M & Dessimoz, C (2023). "Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses". bioRxiv. doi:10.1101/2023.09.19.558401.
  25. Puente-Lelievre, C; Malik, AJ; Douglas, J; Ascher, D; Baker, M; Allison, J & Matzke, N (2023). "Tertiary-interaction characters enable fast, model-based structural phylogenetics beyond the twilight zone". bioRxiv. doi:10.1101/2023.12.12.571181.
  26. Garg, SG & Hochberg, GKA (2025). "A general substitution matrix for structural phylogenetics". Mol Biol Evol. 42 (6): msaf124. doi:10.1093/molbev/msaf124. PMC 12198762. PMID 40476610.