Bacterial phylodynamics

Bacterial phylodynamics is the study of the immunology, epidemiology, and phylogenetics of bacterial pathogens, with the aim of better understanding the evolutionary role of these pathogens. [1] [2] [3] Phylodynamic analysis examines the genetic diversity, natural selection, and population dynamics reflected in infectious disease pathogen phylogenies during pandemics, and has also been used to study the intra-host evolution of viruses. [4] Phylodynamics combines phylogenetic analysis with the study of ecological and evolutionary processes to better understand the mechanisms that drive the spatiotemporal incidence and phylogenetic patterns of bacterial pathogens. [2] [4] Bacterial phylodynamics uses genome-wide single-nucleotide polymorphisms (SNPs) to better understand the evolutionary mechanisms of bacterial pathogens. [5] Many phylodynamic studies have been performed on viruses, specifically RNA viruses (see Viral phylodynamics), which have high mutation rates. The field of bacterial phylodynamics has grown substantially with the advancement of next-generation sequencing and the increasing amount of available data.

Methods

Novel hypothesis (study design)

Studies can be designed to observe intra-host or inter-host interactions. Bacterial phylodynamic studies usually focus on inter-host interactions, with samples taken from many different hosts in one specific geographical location or in several different geographical locations. [4] The most important part of a study design is the sampling strategy. [4] For example, the number of sampled time points, the sampling interval, and the number of sequences per time point are crucial to phylodynamic analysis. [4] Sampling bias causes problems when analyzing taxonomically diverse samples. [3] For example, sampling from a limited geographical location may affect estimates of effective population size. [6]

Generating data

Experimental settings

The choice of whether to sequence the whole genome or specific genomic regions, and of which sequencing technique to use, is an important experimental setting in phylodynamic analysis. Whole genome sequencing is often performed on bacterial genomes, although, depending on the design of the study, many different methods can be used for phylodynamic analysis. Bacterial genomes are much larger than those of RNA viruses and evolve more slowly, which has limited studies of bacterial phylodynamics. The advancement of sequencing technology has made bacterial phylodynamics possible, but proper preparation of whole bacterial genomes remains essential.

Alignment

When a new data set with samples for phylodynamic analysis is obtained, the sequences in the new data set are aligned. [4] A BLAST search is frequently run to find similar strains of the pathogen of interest. Sequences collected from BLAST for an alignment need the proper information added to the data set, such as the collection date and geographical location of each sample. Multiple sequence alignment algorithms (e.g., MUSCLE, [7] MAFFT, [8] and Clustal W [9] ) then align all selected sequences in the data set. After running a multiple sequence alignment algorithm, manually editing the alignment is highly recommended. [4] Multiple sequence alignment algorithms can introduce a large number of indels into the alignment where no indels actually exist. [4] Manually editing the indels in the data set allows a more accurate phylogenetic tree to be inferred. [4]
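As a rough aid to the manual editing step, the following minimal Python sketch flags alignment columns that consist mostly of gaps so they can be inspected by eye. It assumes the alignment is already loaded as a list of equal-length strings with `-` as the gap character. The function names and the 0.5 threshold are illustrative choices, not part of any aligner mentioned above.

```python
def gap_fraction_per_column(alignment):
    """Return the fraction of gap characters ('-') in each alignment column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    return [sum(seq[i] == "-" for seq in alignment) / n for i in range(length)]


def flag_gappy_columns(alignment, threshold=0.5):
    """Return 0-based indices of columns whose gap fraction exceeds threshold.

    The threshold is an arbitrary illustrative value; flagged columns are
    candidates for manual review, not for automatic deletion.
    """
    fractions = gap_fraction_per_column(alignment)
    return [i for i, f in enumerate(fractions) if f > threshold]


# Toy alignment (hypothetical sequences): column 3 is a gap in 3 of 4 rows.
toy_alignment = [
    "ATG-CA",
    "ATG-CA",
    "ATGTCA",
    "A-G-CA",
]
columns_to_review = flag_gappy_columns(toy_alignment)
```

A column flagged this way is only a hint that an indel may be an alignment artifact; whether to edit it still depends on inspecting the surrounding region.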

Quality control

An accurate phylodynamic analysis requires quality control. This includes checking the samples in the data set for possible contamination, measuring the phylogenetic signal of the sequences, and checking the sequences for possible signs of recombinant strains. [4] Contamination of samples in the data set can be ruled out by various laboratory methods and by proper DNA/RNA extraction methods. There are several ways to check for phylogenetic signal in an alignment, such as likelihood mapping, plots of transitions/transversions versus divergence, and the Xia test for saturation. [4] If the phylogenetic signal of an alignment is too low, then a longer alignment or an alignment of another gene in the organism may be necessary to perform phylogenetic analysis. [4] Typically, substitution saturation is only an issue in data sets with viral sequences. Most algorithms used for phylogenetic analysis do not take recombination into account, which can alter the molecular clock and coalescent estimates of a multiple sequence alignment. [4] Strains that show signs of recombination should either be excluded from the data set or analyzed on their own. [4]
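The transitions/transversions versus divergence check can be illustrated with a short Python sketch. It assumes an alignment held as a dictionary of equal-length nucleotide strings and uses the uncorrected proportion of differing sites (p-distance) as the divergence measure. Dedicated tools typically plot against a model-corrected distance, so this is only a simplified stand-in, and all names are illustrative.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Count transitions, transversions and compared sites for two aligned sequences."""
    ts = tv = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv, compared


def ts_tv_versus_divergence(alignment):
    """Return (name_a, name_b, p_distance, ts_rate, tv_rate) for every sequence pair."""
    points = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
        ts, tv, n = count_ts_tv(seq_a, seq_b)
        if n == 0:
            continue  # no comparable sites for this pair
        points.append((name_a, name_b, (ts + tv) / n, ts / n, tv / n))
    return points


# Hypothetical toy sequences; real input would be the quality-checked alignment.
toy = {"s1": "AAGCT", "s2": "AGGTA", "s3": "AAGCT"}
points = ts_tv_versus_divergence(toy)
```

Plotting the transition and transversion rates against divergence shows whether transitions level off as divergence increases, which is the usual visual sign of saturation.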

Data analysis

Evolutionary model

Selecting the best-fitting nucleotide or amino acid substitution model for a multiple sequence alignment is the first step in phylodynamic analysis. This can be accomplished with several different programs (e.g., IQ-TREE, [10] MEGA [11] ).
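Model selection of this kind is commonly framed as comparing information criteria across candidate models. The Python sketch below assumes that the log-likelihood of each candidate model has already been computed by an external program. The log-likelihood values are invented placeholders, and the use of AIC and BIC is an assumption made for illustration rather than something stated in the sources cited here.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L, with n = alignment length."""
    return k * math.log(n_sites) - 2 * log_likelihood


def best_model(candidates, n_sites, criterion="bic"):
    """Return (name of lowest-scoring model, all scores).

    candidates maps model name -> (log_likelihood, number of free parameters).
    """
    scores = {}
    for name, (lnl, k) in candidates.items():
        scores[name] = bic(lnl, k, n_sites) if criterion == "bic" else aic(lnl, k)
    return min(scores, key=scores.get), scores


# Placeholder log-likelihoods (NOT real results). Parameter counts cover the
# substitution model only; branch lengths add the same constant to every model
# on a fixed tree and so do not change the ranking.
candidates = {
    "JC69": (-5200.0, 0),
    "HKY85": (-5100.0, 4),
    "GTR": (-5098.0, 8),
}
chosen, scores = best_model(candidates, n_sites=1000)
```

In this toy case the extra parameters of GTR do not improve the likelihood enough to offset the penalty, so the simpler HKY85 model scores best under both criteria.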

Phylogeny inference

There are several different methods to infer phylogenies. These include tree-building algorithms such as UPGMA, neighbor joining, maximum parsimony, maximum likelihood, and Bayesian analysis. [4]
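Of the methods listed, UPGMA is the simplest to write down. The Python sketch below clusters a small, made-up distance matrix and returns the topology as a Newick string. It omits branch lengths and assumes a symmetric matrix of pairwise distances is already available, so it illustrates the idea rather than replacing an established implementation.

```python
def upgma(names, matrix):
    """Cluster taxa with UPGMA and return the topology as a Newick string.

    names:  list of taxon labels
    matrix: symmetric list-of-lists of pairwise distances, same order as names
    """
    # Active clusters: id -> (Newick label, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances keyed by (smaller id, larger id)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        label_i, size_i = clusters.pop(i)
        label_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        # Drop every distance that involves one of the merged clusters.
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = (f"({label_i},{label_j})", size_i + size_j)
        next_id += 1
    label, _ = next(iter(clusters.values()))
    return label + ";"


# Made-up distances: A and B are closest, C joins them next, D is the outlier.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, matrix)
```

UPGMA assumes a constant rate of evolution across lineages, which is one reason likelihood-based and Bayesian methods are usually preferred for real data sets.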

Hypothesis testing

Assessing phylogenetic support

Testing the reliability of the tree after inferring its phylogeny is a crucial step in the phylodynamic pipeline. [4] Methods to test the reliability of a tree include bootstrapping, maximum likelihood estimation, and posterior probabilities in Bayesian analysis. [4]
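The resampling step behind bootstrapping can be sketched in a few lines of Python. The example assumes an alignment stored as a dictionary of equal-length strings. Tree inference on each pseudo-replicate is left to an external program, so `clade_support` simply takes the clades found in each replicate tree as input. Both function names and the input format are illustrative assumptions.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate built by sampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of pseudo-replicate alignments (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


def clade_support(clade, replicate_clades):
    """Fraction of replicate trees that contain the given clade.

    replicate_clades: one set of frozensets (clades) per replicate tree,
    produced by whatever tree-building program is run on each replicate.
    """
    clade = frozenset(clade)
    return sum(clade in clades for clades in replicate_clades) / len(replicate_clades)


# Hypothetical toy alignment.
aln = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "ACCTAC"}
replicates = bootstrap_replicates(aln, n_replicates=5, seed=1)
```

Each replicate has the same taxa and the same length as the original alignment; only the sampled columns differ, so a clade recovered in most replicate trees is supported by many sites rather than a few.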

Phylodynamics inference

Several methods are used to assess the phylodynamic reliability of a data set. These methods include estimating the data set's molecular clock, demographic history, population structure, and gene flow, as well as performing selection analysis. [4] The phylodynamic results of a data set can also inform better study designs in future experiments.
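One common first look at the molecular clock is a root-to-tip regression: the genetic distance from the root to each sampled tip is regressed on the sampling date, the slope gives a rough substitution rate, and the x-intercept gives a rough date for the root. The Python sketch below assumes the root-to-tip distances have already been read from a rooted tree. The dates and distances are invented toy values, and this simple regression is not the Bayesian clock estimation used in full phylodynamic analyses.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, intercept, root_date). rate is in substitutions per site
    per unit of time; root_date is the x-intercept, or None when the slope
    is not positive (no usable temporal signal).
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two dated tips with matching distances")
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_y - rate * mean_x
    root_date = -intercept / rate if rate > 0 else None
    return rate, intercept, root_date


# Invented toy data: four tips sampled one year apart.
dates = [2010.0, 2011.0, 2012.0, 2013.0]
distances = [0.001, 0.002, 0.003, 0.004]
rate, intercept, root_date = root_to_tip_regression(dates, distances)
```

A flat or negative slope suggests that the sampling window is too short relative to the evolutionary rate, which is a particular concern for slowly evolving bacterial genomes.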

Examples

Phylodynamics of cholera

Cholera is a diarrheal disease caused by the bacterium Vibrio cholerae. V. cholerae has been a popular bacterium for phylodynamic analysis since the 2010 cholera outbreak in Haiti. The outbreak followed the 2010 earthquake in Haiti, which caused critical infrastructure damage and led to the initial conclusion that the outbreak was most likely due to V. cholerae being introduced naturally into Haitian waters by the earthquake. Soon after the earthquake, the UN sent MINUSTAH troops from Nepal to Haiti. Rumors began to circulate about terrible conditions at the MINUSTAH camp, and people claimed that the MINUSTAH troops were disposing of their waste in the Artibonite River, the major water source in the surrounding area. Soon after the MINUSTAH troops' arrival, the first cholera case was reported near the location of the MINUSTAH camp. [12]

Phylodynamic analysis was used to investigate the source of the Haiti cholera outbreak. Whole genome sequencing of V. cholerae revealed that the outbreak in Haiti had a single point source and that the outbreak strain was similar to O1 strains circulating in South Asia. [12] [13] A cholera outbreak had occurred in Nepal shortly before the MINUSTAH troops from Nepal were sent to Haiti. The Nepalese strains were not available for the original research tracing the origin of the outbreak. [12] Once the Nepalese strain became available, phylodynamic analyses of the Haitian and Nepalese strains confirmed that the Haitian cholera strain was most similar to the Nepalese cholera strain. [14]

The outbreak strain in Haiti showed signs of being an altered or hybrid strain of V. cholerae associated with high virulence. [5] Typically, high-quality single-nucleotide polymorphisms (hqSNPs) from whole genome V. cholerae sequences are used for phylodynamic analysis. [5] Using phylodynamic analysis to study cholera helps in predicting and understanding V. cholerae evolution during bacterial epidemics. [5]
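As a rough illustration of how SNP sites are pulled out of aligned genomes, the Python sketch below keeps only the columns where every isolate has an unambiguous base and at least two different bases occur. It assumes an alignment stored as a dictionary of equal-length strings. Real hqSNP pipelines apply further filters (for example on read depth and base quality) that are not modeled here, and the isolate names and sequences are hypothetical.

```python
def core_snp_sites(alignment):
    """Return 0-based positions that are variable and fully resolved in all isolates."""
    sequences = [seq.upper() for seq in alignment.values()]
    sites = []
    for pos, column in enumerate(zip(*sequences)):
        if any(base not in "ACGT" for base in column):
            continue  # drop columns with gaps or ambiguity codes
        if len(set(column)) > 1:
            sites.append(pos)
    return sites


def snp_matrix(alignment):
    """Return (sites, {isolate: concatenated bases at those sites})."""
    sites = core_snp_sites(alignment)
    reduced = {
        name: "".join(seq[i].upper() for i in sites)
        for name, seq in alignment.items()
    }
    return sites, reduced


# Hypothetical toy alignment of three isolates.
genomes = {
    "isolate_1": "ACGTACGT",
    "isolate_2": "ACGTTCGT",
    "isolate_3": "ACG-ACGA",
}
sites, reduced = snp_matrix(genomes)
```

The reduced SNP alignment is far smaller than the full genomes, which makes downstream tree inference tractable for large sets of isolates.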

References

  1. Volz, Erik M.; Koelle, Katia; Bedford, Trevor (2013-03-21). "Viral Phylodynamics". PLOS Computational Biology. 9 (3): e1002947. Bibcode:2013PLSCB...9E2947V. doi:10.1371/journal.pcbi.1002947. ISSN 1553-7358. PMC 3605911. PMID 23555203.
  2. Grenfell, Bryan T.; Pybus, Oliver G.; Gog, Julia R.; Wood, James L. N.; Daly, Janet M.; Mumford, Jenny A.; Holmes, Edward C. (2004-01-16). "Unifying the epidemiological and evolutionary dynamics of pathogens". Science. 303 (5656): 327–332. Bibcode:2004Sci...303..327G. doi:10.1126/science.1090727. ISSN 1095-9203. PMID 14726583. S2CID 4017704.
  3. Frost, Simon D.W.; Pybus, Oliver G.; Gog, Julia R.; Viboud, Cecile; Bonhoeffer, Sebastian; Bedford, Trevor (2015). "Eight challenges in phylodynamic inference". Epidemics. 10: 88–92. doi:10.1016/j.epidem.2014.09.001. PMC 4383806. PMID 25843391.
  4. Norström, Melissa M.; Karlsson, Annika C.; Salemi, Marco (2012-04-01). "Towards a new paradigm linking virus molecular evolution and pathogenesis: experimental design and phylodynamic inference". The New Microbiologica. 35 (2): 101–111. ISSN 1121-7138. PMID 22707126.
  5. Azarian, Taj; Ali, Afsar; Johnson, Judith A.; Mohr, David; Prosperi, Mattia; Veras, Nazle M.; Jubair, Mohammed; Strickland, Samantha L.; Rashid, Mohammad H. (2014-12-31). "Phylodynamic Analysis of Clinical and Environmental Vibrio cholerae Isolates from Haiti Reveals Diversification Driven by Positive Selection". mBio. 5 (6): e01824–14. doi:10.1128/mBio.01824-14. ISSN 2150-7511. PMC 4278535. PMID 25538191.
  6. Biek, Roman; Pybus, Oliver G.; Lloyd-Smith, James O.; Didelot, Xavier (2015). "Measurably evolving pathogens in the genomic era". Trends in Ecology & Evolution. 30 (6): 306–313. doi:10.1016/j.tree.2015.03.009. PMC 4457702. PMID 25887947.
  7. Edgar, Robert C. (2004-01-01). "MUSCLE: multiple sequence alignment with high accuracy and high throughput". Nucleic Acids Research. 32 (5): 1792–1797. doi:10.1093/nar/gkh340. ISSN 1362-4962. PMC 390337. PMID 15034147.
  8. Katoh, Kazutaka; Misawa, Kazuharu; Kuma, Kei-ichi; Miyata, Takashi (2002-07-15). "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform". Nucleic Acids Research. 30 (14): 3059–3066. doi:10.1093/nar/gkf436. ISSN 0305-1048. PMC 135756. PMID 12136088.
  9. Larkin, M. A.; Blackshields, G.; Brown, N. P.; Chenna, R.; McGettigan, P. A.; McWilliam, H.; Valentin, F.; Wallace, I. M.; Wilm, A. (2007-11-01). "Clustal W and Clustal X version 2.0". Bioinformatics. 23 (21): 2947–2948. doi:10.1093/bioinformatics/btm404. ISSN 1367-4811. PMID 17846036.
  10. Nguyen, Lam-Tung; Schmidt, Heiko A.; von Haeseler, Arndt; Minh, Bui Quang (2015-01-01). "IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies". Molecular Biology and Evolution. 32 (1): 268–274. doi:10.1093/molbev/msu300. ISSN 0737-4038. PMC 4271533. PMID 25371430.
  11. Kumar, Sudhir; Stecher, Glen; Tamura, Koichiro (2016-07-01). "MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets". Molecular Biology and Evolution. 33 (7): 1870–1874. doi:10.1093/molbev/msw054. ISSN 1537-1719. PMC 8210823. PMID 27004904.
  12. Piarroux, Renaud (2011). "Understanding the Cholera Epidemic, Haiti". Emerging Infectious Diseases. 17 (7): 1161–1168. doi:10.3201/eid1707.110059. PMC 3381400. PMID 21762567.
  13. Orata, Fabini D.; Keim, Paul S.; Boucher, Yan (2014-04-03). "The 2010 Cholera Outbreak in Haiti: How Science Solved a Controversy". PLOS Pathogens. 10 (4): e1003967. doi:10.1371/journal.ppat.1003967. ISSN 1553-7374. PMC 3974815. PMID 24699938.
  14. Katz, Lee S.; Petkau, Aaron; Beaulaurier, John; Tyler, Shaun; Antonova, Elena S.; Turnsek, Maryann A.; Guo, Yan; Wang, Susana; Paxinos, Ellen E. (2013-08-30). "Evolutionary Dynamics of Vibrio cholerae O1 following a Single-Source Introduction to Haiti". mBio. 4 (4): e00398–13. doi:10.1128/mBio.00398-13. ISSN 2150-7511. PMC 3705451. PMID 23820394.