Bacterial phylodynamics is the study of immunology, epidemiology, and phylogenetics of bacterial pathogens to better understand the evolutionary role of these pathogens. [1] [2] [3] Phylodynamic analysis includes analyzing the genetic diversity, natural selection, and population dynamics of infectious disease pathogen phylogenies during pandemics, and studying the intra-host evolution of viruses. [4] Phylodynamics combines phylogenetic analysis with the study of ecological and evolutionary processes to better understand the mechanisms that drive the spatiotemporal incidence and phylogenetic patterns of bacterial pathogens. [2] [4] Bacterial phylodynamics uses genome-wide single-nucleotide polymorphisms (SNPs) to better understand the evolutionary mechanisms of bacterial pathogens. [5] Many phylodynamic studies have been performed on viruses, specifically RNA viruses (see Viral phylodynamics), which have high mutation rates. The field of bacterial phylodynamics has grown substantially owing to advances in next-generation sequencing and the amount of data now available.
Studies can be designed to observe intra-host or inter-host interactions. Bacterial phylodynamic studies usually focus on inter-host interactions, with samples taken from many different hosts in one geographical location or in several. [4] The most important part of a study design is the sampling strategy. [4] For example, the number of sampled time points, the sampling interval, and the number of sequences per time point are crucial to phylodynamic analysis. [4] Sampling bias causes problems when analyzing taxonomically diverse samples. [3] For example, sampling from a limited geographical area may affect estimates of effective population size. [6]
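The sampling quantities mentioned above (time points, intervals, and sequences per time point) can be tabulated before any analysis begins. The sketch below is only an illustration: the metadata records, the site names, and the helper names `sampling_table` and `sampling_intervals` are hypothetical and do not come from the cited studies.

```python
from collections import Counter

# Hypothetical sample metadata: (isolate ID, collection year, location).
samples = [
    ("iso1", 2010, "site_A"),
    ("iso2", 2010, "site_A"),
    ("iso3", 2010, "site_B"),
    ("iso4", 2011, "site_A"),
    ("iso5", 2012, "site_A"),
]


def sampling_table(records):
    """Count sequences per (year, location) stratum."""
    return Counter((year, place) for _, year, place in records)


def sampling_intervals(records):
    """Gaps, in years, between consecutive sampled time points."""
    years = sorted({year for _, year, _ in records})
    return [later - earlier for earlier, later in zip(years, years[1:])]


table = sampling_table(samples)
intervals = sampling_intervals(samples)

# Print one row per stratum so thinly sampled strata are easy to spot.
for (year, place), count in sorted(table.items()):
    print(year, place, count)
```

A table like this makes uneven coverage visible early, for instance a location that was sampled in only one year.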
Which genome or genomic regions to sequence, and which sequencing technique to use, are important experimental choices in phylodynamic analysis. Whole genome sequencing is often performed on bacterial genomes, although, depending on the design of the study, many different methods can be used. Bacterial genomes are much larger than those of RNA viruses and evolve more slowly, which has limited studies of bacterial phylodynamics. Advances in sequencing technology have made bacterial phylodynamics possible, but proper preparation of whole bacterial genomes is essential.
When a new data set with samples for phylodynamic analysis is obtained, its sequences are aligned. [4] A BLAST search is frequently run to find similar strains of the pathogen of interest. Sequences collected from BLAST for an alignment must be annotated with the appropriate metadata, such as sample collection date and geographical location, before they are added to the data set. Multiple sequence alignment algorithms (e.g., MUSCLE, [7] MAFFT, [8] and CLUSTAL W [9] ) then align all selected sequences. After running a multiple sequence alignment algorithm, manually editing the alignment is highly recommended. [4] Multiple sequence alignment algorithms can introduce many indels into the alignment where none actually exist. [4] Manually correcting these indels allows a more accurate phylogenetic tree to be inferred. [4]
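Manual review of an alignment is easier when gap-rich columns are located first. The following is a minimal sketch, assuming the alignment has already been produced by one of the tools above and loaded into a dictionary of equal-length strings; the function names, the toy sequences, and the 0.5 threshold are illustrative assumptions, not part of any aligner.

```python
def gap_fraction_per_column(alignment):
    """Return the fraction of gap characters ('-') in each alignment column."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    # zip(*...) walks the alignment column by column.
    return [col.count("-") / n_seqs for col in zip(*alignment.values())]


def flag_gappy_columns(alignment, threshold=0.5):
    """Indices of columns whose gap fraction meets the (assumed) threshold."""
    fractions = gap_fraction_per_column(alignment)
    return [i for i, frac in enumerate(fractions) if frac >= threshold]


# Toy alignment with hypothetical strain names.
alignment = {
    "strain_A": "ATG-CGT-A",
    "strain_B": "ATG-CGTTA",
    "strain_C": "ATGACGT-A",
    "strain_D": "ATG-CGT-A",
}

flagged = flag_gappy_columns(alignment)
print(flagged)
```

The flagged indices only point to columns worth inspecting. Whether an indel is real remains a judgment made during manual editing.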
For an accurate phylodynamic analysis, quality control must be performed. This includes checking the samples in the data set for possible contamination, measuring the phylogenetic signal of the sequences, and checking the sequences for signs of recombination. [4] Contamination of samples can be ruled out by various laboratory methods and by proper DNA/RNA extraction methods. There are several ways to check for phylogenetic signal in an alignment, such as likelihood mapping, plots of transitions and transversions versus divergence, and the Xia test for saturation. [4] If the phylogenetic signal of an alignment is too low, a longer alignment or an alignment of another gene from the organism may be needed for phylogenetic analysis. [4] Substitution saturation is typically an issue only in data sets of viral sequences. Most algorithms used for phylogenetic analysis do not take recombination into account, and recombination can alter the molecular clock and coalescent estimates derived from a multiple sequence alignment. [4] Strains that show signs of recombination should either be excluded from the data set or analyzed on their own. [4]
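The numbers behind a transition/transversion versus divergence plot can be computed directly from pairwise comparisons. The sketch below assumes unambiguous nucleotide alignments and uses the uncorrected p-distance as the divergence measure; the function names and toy sequences are hypothetical. In such plots, transitions levelling off while transversions keep rising is commonly read as a hint of saturation.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv_counts(seq1, seq2):
    """Count transitions, transversions, and compared sites for two aligned sequences."""
    ts = tv = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A change within purines or within pyrimidines is a transition.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv, compared


def ts_tv_vs_divergence(alignment):
    """Return {(name1, name2): (p_distance, transitions, transversions)}."""
    points = {}
    for n1, n2 in combinations(sorted(alignment), 2):
        ts, tv, compared = ts_tv_counts(alignment[n1], alignment[n2])
        p_distance = (ts + tv) / compared if compared else float("nan")
        points[(n1, n2)] = (p_distance, ts, tv)
    return points


# Toy sequences (hypothetical).
alignment = {
    "s1": "ACGTACGTAC",
    "s2": "GCGTACGTAT",
    "s3": "ACGTTCGTAC",
}
points = ts_tv_vs_divergence(alignment)
for pair, values in points.items():
    print(pair, values)
```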
Selecting the best-fitting nucleotide or amino acid substitution model for a multiple sequence alignment is the first step in phylodynamic analysis. This can be done with several different programs (e.g., IQ-TREE, [10] MEGA [11] ).
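Model-selection programs commonly rank candidate models with information criteria such as AIC and BIC. The sketch below shows only that ranking step, under stated assumptions: the log-likelihood values and alignment length are made up for illustration, and the parameter counts cover the substitution model only (branch lengths are omitted because they add the same number of parameters to every model on a fixed tree).

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model -> (log-likelihood, free substitution-model parameters).
fits = {
    "JC69": (-5230.4, 0),
    "HKY85": (-5105.2, 4),
    "GTR": (-5101.9, 8),
}
n_sites = 1200  # assumed alignment length

aic_scores = {m: aic(lnl, k) for m, (lnl, k) in fits.items()}
bic_scores = {m: bic(lnl, k, n_sites) for m, (lnl, k) in fits.items()}

# Lower scores indicate a better trade-off between fit and complexity.
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(best_aic, best_bic)
```

With these invented numbers the extra parameters of the richer model do not improve the likelihood enough to offset their penalty, which is the trade-off both criteria are designed to capture.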
There are several different methods to infer phylogenies. These include tree-building algorithms such as UPGMA, neighbor joining, maximum parsimony, maximum likelihood, and Bayesian analysis. [4]
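Of the methods listed, UPGMA is the simplest to write out in full. The sketch below is a toy implementation for illustration only: it repeatedly merges the closest pair of clusters and averages distances weighted by cluster size. The taxon names and distance matrix are invented, and UPGMA itself assumes a constant evolutionary rate, which often does not hold for real pathogen data.

```python
from itertools import combinations


def upgma(names, matrix):
    """Build a rooted tree (nested tuples) by UPGMA.

    matrix is a symmetric list of lists of pairwise distances ordered like
    names. Returns the tree and the node height of each merge.
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    heights = []
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        heights.append(dist[(i, j)] / 2)
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Distance from every remaining cluster to the merged one is the
        # size-weighted average of its distances to the two parts.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights


# Invented ultrametric distances among four taxa.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, heights = upgma(names, matrix)
print(tree, heights)
```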
Testing the reliability of the tree after it has been inferred is a crucial step in the phylodynamic pipeline. [4] Methods to test the reliability of a tree include bootstrapping, maximum likelihood estimation, and posterior probabilities in Bayesian analysis. [4]
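Bootstrapping resamples alignment columns with replacement, rebuilds the tree for each replicate, and reports how often a grouping recurs. The sketch below illustrates the idea on three taxa, where the closest pair by Hamming distance stands in for full tree inference; that shortcut, the function names, and the toy sequences are assumptions made to keep the example short.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def closest_pair(alignment):
    """Stand-in for tree inference: the pair of taxa with the fewest differences."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: hamming(alignment[pair[0]], alignment[pair[1]]),
    )


def bootstrap_support(alignment, clade, n_reps=200, seed=1):
    """Fraction of replicates in which `clade` is recovered as the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        set(closest_pair(bootstrap_alignment(alignment, rng))) == set(clade)
        for _ in range(n_reps)
    )
    return hits / n_reps


# Toy data: A and B differ at one site, C differs from A at five sites.
alignment = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "TCGTACGTACGTACGTACGT",
    "C": "ACGTACGTACTGCATTACGT",
}
support = bootstrap_support(alignment, ("A", "B"))
print(support)
```

In a real analysis each replicate would be passed to the same tree-building method used for the original alignment, and support would be recorded for every branch rather than for a single pair.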
Several methods are used to assess the phylodynamic reliability of a data set. These include estimating the data set's molecular clock, demographic history, population structure, and gene flow, and performing selection analysis. [4] The phylodynamic results from a data set can also inform better study designs for future experiments.
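A quick way to explore whether a data set carries enough temporal signal for molecular clock estimation is to regress root-to-tip distance on sampling date. This is offered as an illustrative sketch rather than the method of the cited studies: the slope approximates the substitution rate and the x-intercept approximates the date of the most recent common ancestor. The dates and distances below are invented and assumed to come from an already rooted tree.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, intercept, tmrca_date), where rate is the slope in
    substitutions per site per unit time and tmrca_date is the x-intercept.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    tmrca_date = -intercept / rate
    return rate, intercept, tmrca_date


# Invented sampling years and root-to-tip distances (substitutions per site).
dates = [2010, 2011, 2012, 2013]
distances = [0.005, 0.006, 0.007, 0.008]

rate, intercept, tmrca_date = root_to_tip_regression(dates, distances)
print(rate, tmrca_date)
```

A slope near zero or a negative slope suggests that the sampling window is too narrow for the organism's evolutionary rate, a common concern for slowly evolving bacteria.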
Cholera is a diarrheal disease caused by the bacterium Vibrio cholerae. V. cholerae has been a popular bacterium for phylodynamic analysis since the 2010 cholera outbreak in Haiti. The outbreak began soon after the 2010 Haiti earthquake, which caused critical infrastructure damage and initially led to the conclusion that V. cholerae had most likely been introduced naturally into Haitian waters by the earthquake. Soon after the earthquake, the UN sent MINUSTAH troops from Nepal to Haiti. Rumors began circulating about terrible conditions in the MINUSTAH camp, and people claimed that the troops were disposing of their waste in the Artibonite River, the major water source in the surrounding area. Soon after the MINUSTAH troops' arrival, the first cholera case was reported near the location of the camp. [12] Phylodynamic analysis was used to investigate the source of the outbreak. Whole genome sequencing of V. cholerae revealed that the outbreak had a single point source and that the outbreak strain was similar to O1 strains circulating in South Asia. [12] [13] A cholera outbreak had occurred in Nepal just before the MINUSTAH troops were sent from there to Haiti. In the original research tracing the origin of the outbreak, the Nepalese strains were not available. [12] Once the Nepalese strain became available, phylodynamic analyses of the Haitian and Nepalese strains affirmed that the Haitian strain was most similar to the Nepalese strain. [14] The Haitian outbreak strain showed signs of being an altered or hybrid strain of V. cholerae associated with high virulence. [5] Typically, high-quality single-nucleotide polymorphisms (hqSNPs) from whole genome V. cholerae sequences are used for phylodynamic analysis. [5] Using phylodynamic analysis to study cholera helps in predicting and understanding V. cholerae evolution during bacterial epidemics. [5]
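The SNP-based comparison described above can be illustrated by extracting variable columns from a whole-genome alignment and counting pairwise differences. The sketch below is a simplification under stated assumptions: real hqSNP pipelines apply read-depth and quality filters, whereas here a column is kept only if every isolate has an unambiguous base. The isolate names and sequences are invented and are not data from the Haitian outbreak.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_sites(alignment):
    """Indices of columns that vary and contain only unambiguous bases."""
    sites = []
    for i, column in enumerate(zip(*alignment.values())):
        bases = {b.upper() for b in column}
        if not bases <= UNAMBIGUOUS:
            continue  # assumed quality filter: drop gaps and ambiguity codes
        if len(bases) > 1:
            sites.append(i)
    return sites


def snp_profiles(alignment):
    """Reduce each sequence to its alleles at the retained SNP sites."""
    sites = snp_sites(alignment)
    return {name: "".join(seq[i] for i in sites) for name, seq in alignment.items()}


def pairwise_snp_distances(alignment):
    """Number of SNP differences for every pair of isolates."""
    profiles = snp_profiles(alignment)
    return {
        (a, b): sum(x != y for x, y in zip(profiles[a], profiles[b]))
        for a, b in combinations(sorted(profiles), 2)
    }


# Invented alignment of four isolates.
alignment = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAC",
    "isolate_3": "ACGTACGTAT",
    "isolate_4": "ACCTNCGAAT",
}
sites = snp_sites(alignment)
distances = pairwise_snp_distances(alignment)
print(sites)
print(distances)
```

Isolates separated by few or no SNPs are candidates for a shared recent source, which is the kind of evidence used to relate outbreak strains to one another.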
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is often referred to as computational biology, though the distinction between the two terms is often disputed.
In biology, phylogenetics is the study of the evolutionary history of life using genetics, which is known as phylogenetic inference. It establishes the relationships between organisms from empirical data and observed heritable traits such as DNA sequences, protein amino acid sequences, and morphology. The result is a phylogenetic tree, a diagram depicting the hypothetical relationships between organisms and their evolutionary history.
Vibrio cholerae is a species of Gram-negative, facultatively anaerobic, comma-shaped bacteria. The bacteria naturally live in brackish or saltwater, where they attach themselves easily to the chitin-containing shells of crabs, shrimp, and other shellfish. Some strains of V. cholerae are pathogenic to humans and cause a deadly disease called cholera, which can be contracted by consuming undercooked or raw marine species or by drinking contaminated water.
Vibrio is a genus of Gram-negative bacteria with a curved-rod (comma) shape, several species of which can cause foodborne infection or a soft-tissue infection called vibriosis. Infection is commonly associated with eating undercooked seafood. Being highly salt tolerant and unable to survive in freshwater, Vibrio spp. are commonly found in various saltwater environments. Vibrio spp. are facultative anaerobes that test positive for oxidase and do not form spores. All members of the genus are motile. They can have polar or lateral flagella, with or without sheaths. Vibrio species typically possess two chromosomes, which is unusual for bacteria. Each chromosome has a distinct and independent origin of replication, and the two are conserved together over time in the genus. Recent phylogenies have been constructed based on a suite of genes.
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.
Comparative genomics is a branch of biological research that examines genome sequences across a spectrum of species, spanning from humans and mice to a diverse array of organisms from bacteria to chimpanzees. This large-scale holistic approach compares two or more genomes to discover the similarities and differences between them and to study the biology of the individual genomes. Comparison of whole genome sequences provides a highly detailed view of how organisms are related to each other at the gene level. By comparing whole genome sequences, researchers gain insights into genetic relationships between organisms and study evolutionary changes. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Comparative genomics therefore provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved or common among species, as well as genes that give each organism its unique characteristics. Moreover, these studies can be performed at different levels of the genomes to obtain multiple perspectives on the organisms.
Metagenomics is the study of genetic material recovered directly from environmental or clinical samples by a method called sequencing. The broad field may also be referred to as environmental genomics, ecogenomics, community genomics or microbiomics.
Multilocus sequence typing (MLST) is a technique in molecular biology for the typing of multiple loci, using DNA sequences of internal fragments of multiple housekeeping genes to characterize isolates of microbial species.
In bioinformatics, MAFFT is a program used to create multiple sequence alignments of amino acid or nucleotide sequences. Published in 2002, the first version used an algorithm based on progressive alignment, in which the sequences were clustered with the help of the fast Fourier transform. Subsequent versions of MAFFT have added other algorithms and modes of operation, including options for faster alignment of large numbers of sequences, higher accuracy alignments, alignment of non-coding RNA sequences, and the addition of new sequences to existing alignments.
16S ribosomal RNA is the RNA component of the 30S subunit of a prokaryotic ribosome. It binds to the Shine-Dalgarno sequence and provides most of the SSU structure.
Pathogenomics is a field which uses high-throughput screening technology and bioinformatics to study encoded microbe resistance, as well as virulence factors (VFs), which enable a microorganism to infect a host and possibly cause disease. This includes studying genomes of pathogens which cannot be cultured outside of a host. In the past, researchers and medical professionals found it difficult to study and understand pathogenic traits of infectious organisms. With newer technology, pathogen genomes can be identified and sequenced in a much shorter time and at a lower cost, thus improving the ability to diagnose, treat, and even predict and prevent pathogenic infections and disease. It has also allowed researchers to better understand genome evolution events - gene loss, gain, duplication, rearrangement - and how those events impact pathogen resistance and ability to cause disease. This influx of information has created a need for bioinformatics tools and databases to analyze and make the vast amounts of data accessible to researchers, and it has raised ethical questions about the wisdom of reconstructing previously extinct and deadly pathogens in order to better understand virulence.
The 2010s Haiti cholera outbreak was the first modern large-scale outbreak of cholera—a disease once considered beaten back largely due to the invention of modern sanitation. The disease was reintroduced to Haiti in October 2010, not long after the disastrous earthquake earlier that year, and since then cholera has spread across the country and become endemic, causing high levels of both morbidity and mortality. Nearly 800,000 Haitians have been infected by cholera, and more than 9,000 have died, according to the United Nations (UN). Cholera transmission in Haiti today is largely a function of eradication efforts including WASH, education, oral vaccination, and climate variability. Early efforts were made to cover up the source of the epidemic, but thanks largely to the investigations of journalist Jonathan M. Katz and epidemiologist Renaud Piarroux, it is widely believed to be the result of contamination by infected United Nations peacekeepers deployed from Nepal. In terms of total infections, the outbreak has since been surpassed by the war-fueled 2016–2021 Yemen cholera outbreak, although the Haiti outbreak is still one of the most deadly modern outbreaks. After a three-year hiatus, new cholera cases reappeared in October 2022.
In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genomes. Binning methods can be based on compositional features, on alignment (similarity), or on both.
Viral phylodynamics is the study of how epidemiological, immunological, and evolutionary processes act and potentially interact to shape viral phylogenies. Since the term was coined in 2004, research on viral phylodynamics has focused on transmission dynamics in an effort to shed light on how these dynamics impact viral genetic variation. Transmission dynamics can be considered at the level of cells within an infected host, individual hosts within a population, or entire populations of hosts.
In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives to alignment-based approaches.
Metatranscriptomics is the set of techniques used to study gene expression of microbes within natural environments, i.e., the metatranscriptome.
Ancient pathogen genomics is a scientific field related to the study of pathogen genomes recovered from ancient human, plant or animal remains. Ancient pathogens are microorganisms, now extinct, that in past centuries caused several epidemics and deaths worldwide. Their genomes, referred to as ancient DNA (aDNA), are isolated from the buried remains of victims of the pandemics caused by these pathogens.
Genome skimming is a sequencing approach that uses low-pass, shallow sequencing of a genome to generate fragments of DNA, known as genome skims. These genome skims contain information about the high-copy fraction of the genome, which consists of the ribosomal DNA, plastid genome (plastome), mitochondrial genome (mitogenome), and nuclear repeats such as microsatellites and transposable elements. The approach employs high-throughput, next-generation sequencing technology to generate these skims. Although these skims are merely 'the tip of the genomic iceberg', phylogenomic analysis of them can still provide insights into evolutionary history and biodiversity at a lower cost and larger scale than traditional methods. Because genome skimming requires only a small amount of DNA, its methodology can be applied in fields other than genomics. Such applications include determining the traceability of products in the food industry, enforcing international regulations regarding biodiversity and biological resources, and forensics.
In the field of epidemiology, source attribution refers to a category of methods with the objective of reconstructing the transmission of an infectious disease from a specific source, such as a population, individual, or location. For example, source attribution methods may be used to trace the origin of a new pathogen that recently crossed from another host species into humans, or from one geographic region to another. It may be used to determine the common source of an outbreak of a foodborne infectious disease, such as a contaminated water supply. Finally, source attribution may be used to estimate the probability that an infection was transmitted from one specific individual to another, i.e., "who infected whom".