Protein structural phylogenetics (or structural phylogenetics) is the branch of molecular evolution that incorporates three-dimensional information from protein structure to understand phylogenetic relationships, and translates those evolutionary insights into understanding protein structure and function. [1] Protein structures are robust over long evolutionary time scales compared with amino acid sequences. [2] [3] [4] The number of protein sequences that can fold into a given structure is astronomical, with one study estimating that even a small protein structure of fewer than 100 amino acids can be attained by a number of sequences that exceeds the Avogadro constant. [5] These properties make structures useful for understanding deep evolutionary relationships, where sequences have become saturated with mutations and share very low levels of similarity. [6]
Protein structures have been used to explore evolutionary relationships since the 1970s. [7] The approach gained popularity in the 1990s and 2000s as structural biology techniques, namely X-ray crystallography, nuclear magnetic resonance, and electron microscopy, took off. Throughout this period, several studies investigated deep evolutionary relationships through the analysis of aligned protein structures, for example, the immunoglobulins, [8] aminoacyl-tRNA synthetases, [9] and metallo-β-lactamases. [10] However, the field was still constrained by the limited number of entries in the Protein Data Bank.
In the early 2020s, with the arrival of protein structure prediction methods like AlphaFold2, [11] high-quality data became readily available. Although structural predictions are still less accurate than solved structures, [12] their availability has made three-dimensional protein structure increasingly important within the field of phylogenetics, yielding recent insights into the evolution of Flavivirus glycoproteins, [13] fungal virulence factors, [14] and gamete fusion proteins. [15] Despite the abundance of protein structural data, the methodologies to analyze these structures have not kept pace with those used to estimate phylogenies from sequence.
Inferring phylogenetic trees from protein structure usually relies on a structural alignment. Numerous software packages are available to perform this task, each with its own strengths and limitations. [16]
The simplest methods for building phylogenies from protein structures are based on atomic-level comparisons using measures such as the root-mean-square deviation (RMSD) and the TM-score, among others. [16] [1] The resulting pairwise distances are often converted into a tree using the neighbor joining algorithm. A key limitation of this distance-based approach is that it cannot readily quantify statistical uncertainty, such as through bootstrap or posterior clade support. Some have used molecular dynamics simulations to estimate bootstrap support, although this approach is computationally demanding. [17]
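The distance-based workflow can be illustrated with a minimal sketch of neighbor joining in Python. It assumes the pairwise structural distances have already been computed by some alignment tool; the function name `neighbor_joining`, the taxon labels, and the distance values are hypothetical and chosen only for illustration, not taken from any cited study or software package.

```python
def neighbor_joining(labels, dist):
    """Build an unrooted Newick tree from a symmetric distance matrix.

    labels: list of taxon names (at least three).
    dist:   list of lists with dist[i][j] the distance between taxa i and j,
            e.g. an RMSD in angstroms (hypothetical input, see example below).
    """
    if len(labels) < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    nodes = list(labels)              # Newick fragment for each active node
    d = [row[:] for row in dist]      # working copy of the matrix
    while len(nodes) > 3:
        n = len(nodes)
        r = [sum(row) for row in d]   # net divergence of each node
        # Pick the pair minimising Q(i, j) = (n - 2) * d_ij - r_i - r_j.
        best, bi, bj = None, 0, 1
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - r[i] - r[j]
                if best is None or q < best:
                    best, bi, bj = q, i, j
        # Branch lengths from the new internal node to the joined pair.
        li = 0.5 * d[bi][bj] + (r[bi] - r[bj]) / (2 * (n - 2))
        lj = d[bi][bj] - li
        joined = f"({nodes[bi]}:{li:.2f},{nodes[bj]}:{lj:.2f})"
        # Distances from the new node to every remaining node.
        keep = [k for k in range(n) if k not in (bi, bj)]
        new_row = [0.5 * (d[bi][k] + d[bj][k] - d[bi][bj]) for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[x]] for x, a in enumerate(keep)]
        d.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [joined]
    # Resolve the final three nodes around a central trifurcation.
    a, b, c = nodes
    la = 0.5 * (d[0][1] + d[0][2] - d[1][2])
    lb = 0.5 * (d[0][1] + d[1][2] - d[0][2])
    lc = 0.5 * (d[0][2] + d[1][2] - d[0][1])
    return f"({a}:{la:.2f},{b}:{lb:.2f},{c}:{lc:.2f});"


# Hypothetical pairwise structural distances for four proteins.
names = ["A", "B", "C", "D"]
rmsd = [
    [0.0, 2.0, 5.0, 5.0],
    [2.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 2.0],
    [5.0, 5.0, 2.0, 0.0],
]
print(neighbor_joining(names, rmsd))
# prints (C:1.00,D:1.00,(A:1.00,B:1.00):3.00);
```

The procedure returns a single tree with branch lengths but no measure of support for any grouping, which is the limitation described above.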
More advanced methods are model-based, meaning they describe probabilistic generative processes and can provide a more reliable means of quantifying uncertainty in a maximum likelihood or Bayesian phylogenetic framework. The Challis-Schmidler model describes protein structural drift over long evolutionary time frames as an Ornstein–Uhlenbeck process. [18] [19] This Bayesian total-evidence model estimates the sequence and structural alignment within a single analysis. A key limitation of this method is its energetically unrealistic assumption of independent drift across all positions in the protein. This restriction was later addressed by the Larson-Thorne-Schmidler model. [20]
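To make the idea of structural drift concrete, the following is a minimal sketch of a generic one-dimensional Ornstein–Uhlenbeck process, in which a quantity x (for instance a single atomic coordinate) diffuses with volatility sigma while being pulled toward a mean mu at rate theta. The function names, parameter names, and default values are assumptions made for illustration; the sketch does not reproduce the parameterization of the cited models.

```python
import math
import random


def ou_transition(x0, t, theta=1.0, mu=0.0, sigma=1.0):
    """Mean and variance of x after time t, given that it started at x0.

    Under an Ornstein-Uhlenbeck process the transition density is Gaussian:
      mean = mu + (x0 - mu) * exp(-theta * t)
      var  = sigma^2 / (2 * theta) * (1 - exp(-2 * theta * t))
    """
    decay = math.exp(-theta * t)
    mean = mu + (x0 - mu) * decay
    var = sigma ** 2 / (2 * theta) * (1 - decay ** 2)
    return mean, var


def sample_descendant(x0, t, rng, **params):
    """Draw the value at the end of a branch of length t starting from x0."""
    mean, var = ou_transition(x0, t, **params)
    return rng.gauss(mean, math.sqrt(var))


# One ancestral coordinate evolving independently along two branches.
rng = random.Random(0)
ancestor = 2.0
child_a = sample_descendant(ancestor, 0.1, rng)  # short branch: stays near 2.0
child_b = sample_descendant(ancestor, 5.0, rng)  # long branch: near-stationary draw
```

Over short branches a descendant stays close to its ancestor, while over long branches the distribution approaches a stationary Gaussian with variance sigma^2 / (2 * theta). Applying such a process to every position separately corresponds to the independence assumption noted above.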
Protein structures can also be represented as sequences of characters from a structural alphabet. Typically, one character is assigned to each amino acid residue in the sequence. This enables structural phylogenies to be built using the same methodologies that are used in sequence phylogenetics, including maximum likelihood and Bayesian inference, with character change modeled as a continuous-time Markov process. The earliest efforts involved simple alphabets that describe protein secondary structure and surface accessibility. [21] [22] The 3Di alphabet employed by Foldseek [23] is widely used today. [24] [25] [26] This alphabet consists of twenty characters informed by the protein tertiary structure. While 3Di phylogenetics has become widely applied in recent years, its key limitation comes from the standard phylogenetic assumption of independence between sites, an assumption violated by the 3Di characters themselves, which are defined by tertiary structure interactions.
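As an illustration of how a continuous-time Markov process applies to structural characters, the sketch below uses the simplest possible model, in which all twenty states are equally frequent and interchange at equal rates (a Jukes-Cantor-style assumption made here purely for brevity; published 3Di analyses typically rely on empirically estimated rate matrices instead). The function names and the two example strings are hypothetical.

```python
import math

K = 20  # number of states in the structural alphabet


def transition_prob(same, t, k=K):
    """P(state j at time t | state i at time 0) under the equal-rates model.

    Time t is measured in expected substitutions per site.
    """
    decay = math.exp(-k / (k - 1) * t)
    if same:
        return 1 / k + (k - 1) / k * decay
    return 1 / k - decay / k


def pair_log_likelihood(seq_a, seq_b, t, k=K):
    """Log-likelihood of two aligned strings, treating sites as independent."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    # Each site contributes pi_i * P_ij(t), with pi_i = 1 / k.
    return sum(
        math.log(transition_prob(a == b, t, k) / k)
        for a, b in zip(seq_a, seq_b)
    )


def ml_distance(seq_a, seq_b, k=K):
    """Closed-form maximum likelihood distance under the equal-rates model."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= (k - 1) / k:
        return math.inf  # saturated: the distance cannot be estimated
    return -(k - 1) / k * math.log(1 - k / (k - 1) * p)


# Two hypothetical aligned structural-alphabet strings differing at 2 of 10 sites.
s1, s2 = "DVVLCPVSNN", "DVLLCPVSQN"
t_hat = ml_distance(s1, s2)
```

The sum over sites in `pair_log_likelihood` is where the independence assumption enters: each position contributes its own factor to the likelihood, regardless of which other residues it contacts in three dimensions.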