Protein structural phylogenetics

Last updated January 12, 2026

Protein structural phylogenetics (or structural phylogenetics) is the branch of molecular evolution that incorporates three-dimensional information from protein structure to understand phylogenetic relationships, and translates those evolutionary insights into understanding protein structure and function.^[1] Protein structures are robust over long evolutionary time scales compared with amino acid sequences.^[2]^[3]^[4] The number of protein sequences that can fold into a given structure is astronomical: one study estimated that even a small protein structure with fewer than 100 amino acids can be attained by a number of sequences that exceeds the Avogadro constant (about 6 × 10^23).^[5] These properties make structures useful for understanding deep evolutionary relationships, where sequences have become saturated with mutations and share very low levels of similarity.^[6]

History

Protein structures have been used to explore evolutionary relationships since the 1970s.^[7] The approach gained popularity in the 1990s and 2000s as techniques in structural biology matured, namely X-ray crystallography, nuclear magnetic resonance, and electron microscopy. Throughout this period, several studies investigated deep evolutionary relationships through the analysis of aligned protein structures, for example those of the immunoglobulins,^[8] aminoacyl-tRNA synthetases,^[9] and metallo-β-lactamases.^[10] However, the field was still constrained by the limited number of entries in the Protein Data Bank.

In the early 2020s, with the arrival of protein structure prediction methods like AlphaFold2,^[11] high-quality structural data became readily available. Although predicted structures are still less accurate than experimentally solved ones,^[12] three-dimensional protein structure has become increasingly important within the field of phylogenetics, leading to recent insights into the evolution of Flaviviridae glycoproteins,^[13] fungal virulence factors,^[14] and gamete fusion proteins.^[15] Despite the abundance of protein structural data, the methodologies for analyzing these structures have not kept pace with those used to estimate phylogenies from sequence.

Structural data and alignment

Inferring phylogenetic trees from protein structure usually relies on a structural alignment. Numerous software packages are available to perform this task, each with its own strengths and limitations.^[16]

Methods for estimating phylogenies from protein structure

Atomic coordinates

The simplest methods for building phylogenies from protein structures are based on atomic-level comparisons using measures such as root-mean-square deviation (RMSD) and TM-score, among others.^[16]^[1] These pairwise measures are converted into a distance matrix, from which a tree is often built using the neighbor joining algorithm. A key limitation of this distance-based approach is that it offers no direct way to quantify statistical uncertainty, such as bootstrap or posterior clade support. Some studies have used molecular dynamics simulations to estimate bootstrap support, although this approach is computationally demanding.^[17]
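The distance-based workflow can be sketched in a few lines of Python. The sketch below is illustrative only and is not taken from any of the cited tools: `tm_score` evaluates the standard TM-score sum for one fixed superposition (real programs also search for the superposition that maximizes it), and `neighbor_joining` is a plain implementation of the textbook algorithm. The taxon names, per-residue deviations, and the distance matrix are hypothetical values chosen for the example, and using 1 − TM-score as a distance is an assumption of this sketch.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: deviation (in angstroms) of each aligned residue pair.
    target_length: length of the reference protein.
    Assumption of this sketch: proteins shorter than 19 residues are
    rejected, because the d0 scale is not positive for them.
    """
    if target_length < 19:
        raise ValueError("d0 is not meaningful for very short proteins")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining; return a Newick string."""
    nodes = {i: name for i, name in enumerate(names)}  # active clusters
    d = {(i, j): dist[i][j] for i in nodes for j in nodes if i != j}
    next_id = len(names)
    while len(nodes) > 2:
        ids = list(nodes)
        n = len(ids)
        # Total distance from each active node to all other active nodes.
        r = {i: sum(d[i, k] for k in ids if k != i) for i in ids}
        # Choose the pair that minimizes the Q criterion.
        best = None
        for a in range(n):
            for b in range(a + 1, n):
                i, j = ids[a], ids[b]
                q = (n - 2) * d[i, j] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the joined pair to their new parent node.
        li = 0.5 * d[i, j] + (r[i] - r[j]) / (2.0 * (n - 2))
        lj = d[i, j] - li
        u = next_id
        next_id += 1
        # Distances from the new node to every remaining node.
        for k in ids:
            if k != i and k != j:
                d[u, k] = d[k, u] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        nodes[u] = f"({nodes[i]}:{li:.3f},{nodes[j]}:{lj:.3f})"
        del nodes[i], nodes[j]
    a, b = list(nodes)
    # The last two nodes are joined by a single branch of length d[a, b].
    return f"({nodes[a]}:{d[a, b]:.3f},{nodes[b]}:0.000);"


# Hypothetical deviations: 80 well-superposed residues and 20 poorly
# superposed ones in a 100-residue reference protein.
example_distance = 1.0 - tm_score([0.5] * 80 + [3.0] * 20, 100)

# Hypothetical pairwise structural distances for four proteins.
names = ["A", "B", "C", "D"]
matrix = [
    [0.0, 0.3, 0.9, 1.0],
    [0.3, 0.0, 1.0, 1.1],
    [0.9, 1.0, 0.0, 0.7],
    [1.0, 1.1, 0.7, 0.0],
]
tree = neighbor_joining(names, matrix)
print(tree)
```

In this toy matrix A pairs with B and C pairs with D. The output is a single tree with no measure of support, which is the limitation described above.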

More advanced methods are model-based: they describe probabilistic generative processes and can provide a more reliable means of quantifying uncertainty in a maximum likelihood or Bayesian phylogenetic framework. The Challis–Schmidler model describes protein structural drift over long evolutionary time frames as an Ornstein–Uhlenbeck process.^[18]^[19] This Bayesian total-evidence model estimates the sequence and structural alignment within a single analysis. A key limitation of this method is its energetically unrealistic assumption of independent drift across all positions in the protein. This restriction was later addressed by the Larson–Thorne–Schmidler model.^[20]
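To give a feel for what an Ornstein–Uhlenbeck description of structural drift implies, the sketch below uses the standard closed-form transition distribution of a one-dimensional OU process, dX = θ(μ − X)dt + σdW. It is a generic illustration, not the Challis–Schmidler implementation: the function names, parameter values, and coordinate values are all hypothetical, and each coordinate is evolved independently, which mirrors the independence assumption criticized above.

```python
import math
import random


def ou_transition(x0, t, mu=0.0, theta=1.0, sigma=1.0):
    """Mean and variance of X(t) given X(0) = x0 under an OU process."""
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var


def simulate_descendant(coords, t, rng, **params):
    """Draw descendant coordinates after time t, one coordinate at a time.

    Every coordinate drifts independently of the others, which is the
    simplifying assumption discussed in the text.
    """
    descendant = []
    for x in coords:
        mean, var = ou_transition(x, t, **params)
        descendant.append(rng.gauss(mean, math.sqrt(var)))
    return descendant


rng = random.Random(0)            # fixed seed for reproducibility
ancestor = [1.5, -0.3, 2.0]       # hypothetical coordinate values
child = simulate_descendant(ancestor, t=0.5, rng=rng, theta=0.8, sigma=1.2)

# The variance grows with time but levels off at sigma^2 / (2 * theta).
times = (0.1, 1.0, 10.0)
variances = [ou_transition(0.0, t, theta=0.8, sigma=1.2)[1] for t in times]
for t, v in zip(times, variances):
    print(t, round(v, 4))
```

Unlike Brownian motion, whose variance grows without bound, the OU variance approaches a finite stationary value. That bounded behavior is the general reason an OU process can be a natural choice for coordinates that fluctuate around a conserved fold.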

Structural alphabets

Protein structures can also be represented as sequences of characters from a structural alphabet. Typically, there is one character assigned to each amino acid residue in the sequence. This enables structural phylogenies to be built using the same methodologies that are used in sequence phylogenetics, including maximum likelihood and Bayesian inference, with character change modeled as a continuous-time Markov process. The earliest efforts involved simple alphabets that describe protein secondary structure and surface accessibility.^[21]^[22] The 3Di alphabet employed by Foldseek^[23] is widely used today.^[24]^[25]^[26] This alphabet consists of twenty characters informed by the protein tertiary structure. While 3Di phylogenetics has become widely applied in recent years, its key limitation is the standard phylogenetic assumption of independence between sites, which is violated by the very definition of 3Di characters in terms of tertiary structure interactions.
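The idea of treating structural characters like sequence characters can be illustrated with the simplest possible continuous-time Markov model: a 20-state model in which every change between distinct states occurs at the same rate. This equal-rates model is a deliberate simplification chosen for the sketch; published 3Di analyses use empirically estimated substitution matrices instead. The two aligned strings below are made-up examples, not real Foldseek output, and the function names are hypothetical.

```python
import math

K = 20  # number of states in the structural alphabet


def p_same(t, k=K):
    """Probability that a site shows the same state after branch length t.

    Time is scaled so that t is the expected number of changes per site.
    """
    return 1.0 / k + (k - 1.0) / k * math.exp(-k * t / (k - 1.0))


def p_diff(t, k=K):
    """Probability of ending in one particular different state."""
    return (1.0 - p_same(t, k)) / (k - 1.0)


def log_likelihood(seq_a, seq_b, t, k=K):
    """Pairwise log-likelihood of two aligned strings at distance t.

    Sites are treated as independent, which is the assumption the text
    identifies as problematic for tertiary-structure characters.
    """
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        p = p_same(t, k) if a == b else p_diff(t, k)
        total += math.log(p / k)  # uniform stationary frequencies 1/k
    return total


def ml_distance(seq_a, seq_b, k=K):
    """Closed-form maximum likelihood distance under the equal-rates model."""
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    p = mismatches / len(seq_a)
    limit = (k - 1.0) / k
    if p >= limit:
        raise ValueError("strings are saturated; distance is undefined")
    return -limit * math.log(1.0 - p / limit)


# Hypothetical aligned structural-alphabet strings (3 of 14 sites differ).
seq_a = "DPVVLVVCCQVSLD"
seq_b = "DPVLLVVSCQVSVD"
d_hat = ml_distance(seq_a, seq_b)
print(round(d_hat, 4))
```

The estimated distance is larger than the raw fraction of differing sites because the model corrects for multiple changes at the same site. Full phylogenetic software extends the same calculation from pairs of strings to whole trees.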

References

[1] Puente-Lelievre, C; Malik, A & Douglas, J (2025). "Protein Structural Phylogenetics". Genome Biology and Evolution. 17 (8) evaf139. doi:10.1093/gbe/evaf139. PMC 12369579. PMID 40839422.
[2] Chothia, C & Lesk, AM (1986). "The relation between the divergence of sequence and structure in proteins". EMBO J. 5 (4): 823–826. doi:10.1002/j.1460-2075.1986.tb04288.x. PMC 1166865. PMID 3709526.
[3] Flores, TP; Orengo, CA; Moss, DS & Thornton, JM (1993). "Comparison of conformational characteristics in structurally similar protein pairs". Protein Sci. 2 (11): 1811–1826. doi:10.1002/pro.5560021104. PMC 2142289. PMID 8268794.
[4] Illergård, K; Ardell, DH & Elofsson, A (2009). "Structure is three to ten times more conserved than sequence—a study of structural response in protein cores". Proteins. 77 (3): 499–508. doi:10.1002/prot.22458. PMID 19507241.
[5] Tian, P & Best, RB (2017). "How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis". Biophysical Journal. 113 (8): 1719–1730. Bibcode:2017BpJ...113.1719T. doi:10.1016/j.bpj.2017.08.039. PMC 5647607. PMID 29045866.
[6] Rost, B (1999). "Twilight zone of protein sequence alignments". Protein Eng. 12 (2): 85–94. doi:10.1093/protein/12.2.85. PMID 10195279.
[7] Eventoff, W; Rossmann, MG & Brändén, C-I (1975). "The evolution of dehydrogenases and kinase". CRC Crit Rev Biochem. 3 (2): 111–140. doi:10.3109/10409237509102554. PMID 1100315.
[8] Johnson, MS; Sutcliffe, MJ & Blundell, TL (1990). "Molecular anatomy: phyletic relationships derived from three-dimensional structures of proteins". J Mol Evol. 30 (1): 43–59. Bibcode:1990JMolE..30...43J. doi:10.1007/BF02102452. PMID 2107323.
[9] O'Donoghue, P & Luthey-Schulten, Z (2003). "On the evolution of structure in aminoacyl-tRNA synthetases". Microbiol Mol Biol Rev. 67 (4): 550–573. Bibcode:2003MMBR...67..550O. doi:10.1128/MMBR.67.4.550-573.2003. PMC 309052. PMID 14665676.
[10] Garau, G; Di Guilmi, AM & Hall, BG (2005). "Structure-based phylogeny of the metallo-β-lactamases". Antimicrob Agents Chemother. 49 (7): 2778–2784. doi:10.1128/AAC.49.7.2778-2784.2005. PMC 1168685. PMID 15980349.
[11] Jumper, J; Evans, R; Pritzel, A; Green, T; Figurnov, M; Ronneberger, O & Hassabis, D (2021). "Highly accurate protein structure prediction with AlphaFold". Nature. 596 (7873): 583–589. Bibcode:2021Natur.596..583J. doi:10.1038/s41586-021-03819-2. PMC 8371605. PMID 34265844.
[12] Terwilliger, TC; Liebschner, D; Croll, TI; Williams, CJ; McCoy, AJ; Poon, BK & Adams, PD (2024). "AlphaFold predictions are valuable hypotheses and accelerate but do not replace experimental structure determination". Nat Methods. 21 (1): 110–116. doi:10.1038/s41592-023-02087-4. PMC 10776388. PMID 38036854.
[13] Mifsud, JC; Lytras, S; Oliver, MR; Toon, K; Costa, VA; Holmes, EC & Grove, J (2024). "Mapping glycoprotein structure reveals Flaviviridae evolutionary history". Nature. 633 (8030): 695–703. Bibcode:2024Natur.633..695M. doi:10.1038/s41586-024-07899-8. PMC 11410658. PMID 39232167.
[14] Lahfa, M; Barthe, P; De Guillen, K; Cesari, S; Raji, M; Kroj, T & Padilla, A (2024). "The structural landscape and diversity of Pyricularia oryzae MAX effectors revisited". PLOS Pathog. 20 (5) e1012176. doi:10.1371/journal.ppat.1012176. PMC 11132498. PMID 38709846.
[15] Moi, D; Nishio, S; Li, X; Valansi, C; Langleib, M; Brukman, NG & Podbilewicz, B (2022). "Discovery of archaeal fusexins homologous to eukaryotic HAP2/GCS1 gamete fusion proteins". Nat Commun. 13 (1): 3880. Bibcode:2022NatCo..13.3880M. doi:10.1038/s41467-022-31564-1. PMC 9259645. PMID 35794124.
[16] Hasegawa, H & Holm, L (2009). "Advances and pitfalls of protein structural alignment". Curr Opin Struct Biol. 19 (3): 341–348. doi:10.1016/j.sbi.2009.04.003. PMID 19481444.
[17] Malik, AJ; Poole, AM & Allison, JR (2020). "Structural phylogenetics with confidence". Mol Biol Evol. 37 (9): 2711–2726. doi:10.1093/molbev/msaa100. PMC 7475046. PMID 32302382.
[18] Challis, CJ & Schmidler, SC (2012). "A stochastic evolutionary model for protein structure alignment and phylogeny". Mol Biol Evol. 29 (11): 3575–3587. doi:10.1093/molbev/mss167. PMC 3697813. PMID 22723302.
[19] Herman, JL; Challis, CJ; Novák, A; Hein, J & Schmidler, SC (2014). "Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure". Mol Biol Evol. 31 (9): 2251–2266. doi:10.1093/molbev/msu184. PMC 4137710. PMID 24899668.
[20] Larson, G; Thorne, JL & Schmidler, S (2020). "Incorporating nearest-neighbor site dependence into protein evolution models". J Comput Biol. 27 (3): 361–375. doi:10.1089/cmb.2019.0500. PMC 7081252. PMID 32053390.
[21] Le, SQ & Gascuel, O (2010). "Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial". Syst Biol. 59 (3): 277–287. doi:10.1093/sysbio/syq002. PMID 20525635.
[22] Lai, J-S; Rost, B; Kobe, B & Bodén, M (2020). "Evolutionary model of protein secondary structure capable of revealing new biological relationships". Proteins. 88 (9): 1251–1259. doi:10.1002/prot.v88.9. PMID 32394426.
[23] Van Kempen, M; Kim, SS; Tumescheit, C; Mirdita, M; Lee, J; Gilchrist, CL & Steinegger, M (2024). "Fast and accurate protein structure search with Foldseek". Nat Biotechnol. 42 (2): 243–246. Bibcode:2024NatBi..42..243V. doi:10.1038/s41587-023-01773-0. PMC 10869269. PMID 37156916.
[24] Moi, D; Bernard, C; Steinegger, M; Nevers, Y; Langleib, M & Dessimoz, C (2023). "Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses". bioRxiv 10.1101/2023.09.19.558401.
[25] Puente-Lelievre, C; Malik, AJ; Douglas, J; Ascher, D; Baker, M; Allison, J & Matzke, N (2023). "Tertiary-interaction characters enable fast, model-based structural phylogenetics beyond the twilight zone". bioRxiv 10.1101/2023.12.12.571181.
[26] Garg, SG & Hochberg, GKA (2025). "A general substitution matrix for structural phylogenetics". Mol Biol Evol. 42 (6) msaf124. doi:10.1093/molbev/msaf124. PMC 12198762. PMID 40476610.
