Paleogenomics is a field of science based on the reconstruction and analysis of genomic information in extinct species. Improved methods for extracting ancient DNA (aDNA) from museum artifacts, ice cores, and archeological or paleontological sites, together with next-generation sequencing (NGS) technologies, have spurred this field. It is now possible to detect genetic drift, trace ancient population migrations and interrelationships, reconstruct the evolutionary history of extinct plant, animal and Homo species, and identify phenotypic features across geographic regions. Scientists can also use paleogenomics to compare ancient ancestors with modern-day humans. [1] The rising importance of paleogenomics is evident from the fact that the 2022 Nobel Prize in Physiology or Medicine was awarded to the Swedish geneticist Svante Pääbo (born 1955), who worked on paleogenomics.
Initially, aDNA sequencing involved cloning small fragments into bacteria, which proceeded with low efficiency because of the oxidative damage the aDNA had suffered over millennia. [2] aDNA is difficult to analyze because nucleases degrade it readily, although certain environments and postmortem conditions improve isolation and analysis. Protocols for extraction and contamination control were necessary for reliable analyses. [3] With the development of the polymerase chain reaction (PCR) in 1983, scientists could study DNA samples up to approximately 100,000 years old, a limit imposed by the relatively short fragments that could be isolated. Through advances in isolation, amplification, sequencing, and data reconstruction, progressively older samples have become analyzable. Over the past 30 years, high-copy-number mitochondrial DNA answered many questions; the advent of NGS techniques made it possible to address far more. Moreover, this technological revolution allowed the transition from paleogenetics to paleogenomics. [1]
PCR, second-generation NGS, and various library preparation methods are available for sequencing aDNA, along with many bioinformatics tools. With each of these methods, it is important to consider that aDNA can be altered post-mortem. [2] Specific alterations arise from:
Specific patterns and onset of these alterations help scientists to estimate the sample's age.
Formerly, scientists diagnosed post-mortem damage using enzymatic reactions or gas chromatography coupled with mass spectrometry; in more recent years they began to detect it from mutational patterns in sequence data. This strategy makes it possible to identify an excess of C->T mutations following treatment with uracil DNA glycosylase. Nowadays, high-throughput sequencing (HTS) is used to identify depurination (a process that drives post-mortem DNA fragmentation; younger samples present more adenine than guanine), single-strand breaks in the DNA double helix, and abasic sites (created by C->T mutation).
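As a rough illustration of how an excess of C->T mutations can be read from sequence data, the sketch below counts, for each position from the 5' end of a read, how often a reference C appears as T in the read. It assumes gap-free read-to-reference alignments supplied as plain string pairs; the function name `c_to_t_by_position`, the `max_pos` parameter, and the toy data are hypothetical and not taken from any published tool.

```python
from collections import Counter


def c_to_t_by_position(pairs, max_pos=5):
    """Return the C->T mismatch rate for each of the first `max_pos` positions.

    `pairs` is an iterable of (reference, read) strings of equal length,
    aligned without gaps and written 5'->3' (an assumption of this sketch).
    Positions where the reference never shows a C yield None.
    """
    c_total = Counter()   # reference C observed at this position
    c_to_t = Counter()    # reference C read as T at this position
    for ref, read in pairs:
        if len(ref) != len(read):
            raise ValueError("reference and read must have equal length")
        for pos, (r, q) in enumerate(zip(ref[:max_pos], read[:max_pos])):
            if r == "C":
                c_total[pos] += 1
                if q == "T":
                    c_to_t[pos] += 1
    return [
        c_to_t[p] / c_total[p] if c_total[p] else None
        for p in range(max_pos)
    ]


# Toy alignments (invented for illustration only).
toy_pairs = [
    ("CACGT", "TACGT"),
    ("CCGTA", "TCGTA"),
    ("CAGCT", "CAGCT"),
    ("GCCAT", "GCTAT"),
]
rates = c_to_t_by_position(toy_pairs)
print(rates)
```

In this toy data the rate is highest at the first position, which is the kind of terminal excess such a profile is meant to expose; real analyses work on mapped reads and treat both strands and both read ends.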
A single fragment of aDNA can be sequenced in its full length with HTS. With these data we can create a fragment-length distribution, a size decay curve, that enables a direct quantitative comparison of fragmentation across specimens from different locations and environmental conditions. From the decay curve it is possible to obtain the median fragment length of the aDNA. This length reflects the level of fragmentation after death, which generally increases with depositional temperature. [4]
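A minimal sketch of how such a curve might be summarized is shown below. It computes the median fragment length and, under the assumption (made here for illustration) that the declining side of the length distribution is roughly exponential, estimates a per-base decay rate by a least-squares fit on the log counts. The function name `fragment_summary` and the toy lengths are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def fragment_summary(lengths):
    """Summarize a list of fragment lengths (in bases).

    Returns the median length and a decay rate estimated from the
    declining part of the distribution (lengths >= the most common length),
    assuming counts fall off approximately exponentially with length.
    """
    counts = Counter(lengths)
    mode_len = max(counts, key=counts.get)
    xs = [length for length in sorted(counts) if length >= mode_len]
    ys = [math.log(counts[length]) for length in xs]
    decay_rate = None
    if len(xs) >= 2:
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        decay_rate = -sxy / sxx  # negative slope of log(count) vs. length
    return {"median": median(lengths), "decay_rate": decay_rate}


# Toy data (invented): counts halve every 10 bases beyond the mode at 40.
toy_lengths = [30] * 10 + [40] * 80 + [50] * 40 + [60] * 20 + [70] * 10
summary = fragment_summary(toy_lengths)
print(summary)
```

Comparing the two numbers across specimens gives a simple way to rank samples by fragmentation; a higher decay rate and a shorter median both indicate more heavily fragmented DNA.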
Two different types of library can be prepared for aDNA sequencing using PCR for genome amplification:
The first one is created using the blunt-end approach. This technique uses two different adaptors, which bind randomly to the ends of the fragment so that it can then be amplified. A fragment that does not carry both adaptors cannot be amplified, which is a source of error. To reduce this error, Illumina T/A ligation was introduced: this method adds an A tail to the DNA sample to facilitate the ligation of T-tailed adaptors. This method optimizes the amplification of the aDNA.
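The loss caused by random adaptor binding can be illustrated with a small enumeration. The sketch assumes, purely for illustration, that each fragment end receives one of the two adaptors independently and with equal probability; the labels `adaptor_1` and `adaptor_2` and the function name are hypothetical.

```python
from itertools import product

ADAPTORS = ("adaptor_1", "adaptor_2")  # hypothetical labels


def amplifiable_fraction():
    """Fraction of fragments that carry two different adaptors.

    Enumerates every (left end, right end) adaptor combination and keeps
    those with one adaptor of each kind, the only ones that can be amplified.
    """
    combos = list(product(ADAPTORS, repeat=2))
    amplifiable = [c for c in combos if c[0] != c[1]]
    return len(amplifiable) / len(combos)


fraction = amplifiable_fraction()
print(fraction)
```

Under these assumptions only half of the ligated fragments are usable, which shows why a ligation scheme that avoids such losses is attractive for scarce aDNA.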
To obtain ssDNA libraries, DNA is first denatured with heat. The obtained ssDNA is then ligated to two adaptors in order to generate the complementary strand and finally PCR is applied. [4]
As aDNA extracts may contain DNA from bacteria or other microorganisms, the process requires enrichment. To separate the endogenous and exogenous fractions, various methods are employed:
Many studies in different fields have by now led to the conclusion that the present-day non-African population is the result of the diversification into several lineages of an ancestral, well-structured metapopulation, which took part in an out-of-Africa expansion carrying a subset of African genetic heritage. In this context, the analysis of ancient DNA was fundamental for testing previously formulated hypotheses and for providing new insights. First, it has made it possible to narrow down the timing and structure of this diversification by calibrating the autosomal and mitochondrial mutation rates. [7] Admixture analysis has demonstrated that at least two independent gene flow events occurred between ancestors of modern humans and archaic humans, such as Neanderthal and Denisovan populations, supporting the “leaky replacement” model of Eurasian human population history. According to all these data, the divergence of the non-African human lineages occurred around 45,000 – 55,000 BP. [7] In addition, in many cases ancient DNA has made it possible to track the historical processes that led, over time, to the current population genetic structure, which would have been difficult to do relying only on the analysis of present-day genomes. Among the still unresolved questions, some of the most studied are the identity of the first inhabitants of the Americas, the peopling of Europe and the origin of agriculture in Europe. [1]
Analysis of ancient DNA makes it possible to study mutations affecting phenotypic traits that followed changes in environment and human behavior. Migration to new habitats, dietary shifts (following the transition to agriculture) and the building of large communities exposed humans to new conditions that ultimately resulted in biological adaptation.
Migration of humans out of Africa to higher latitudes involved less exposure to sunlight. Since UVA and UVB rays are crucial for the synthesis of vitamin D, which regulates calcium absorption and thus is essential for bone health, living at higher latitudes would mean a substantial reduction in vitamin D synthesis. This put a new selective pressure on the skin colour trait, favouring lighter skin at higher latitudes. The two most important genes involved in skin pigmentation are SLC24A5 and SLC45A2. Nowadays the “light skin” alleles of these genes are fixed in Europe, but they reached a relatively high frequency only fairly recently (about 5000 years ago). [7] Such a slow depigmentation process suggests that ancient Europeans could have faced the downsides of low vitamin D production, such as musculoskeletal and cardiovascular conditions. Another hypothesis is that pre-agricultural Europeans could have met their vitamin D requirements through their diet, since meat and fish contain some vitamin D. [8]
One of the major examples of adaptation following the switch to an agricultural diet is the persistence of lactase production into adulthood. This enzyme is essential for digesting the lactose present in milk and dairy products, and its absence leads to diarrhea following the consumption of these products. Lactase persistence is determined predominantly by a single-base mutation in the MCM6 gene, and ancient DNA data show that this mutation became common only within the past 5000 years, thousands of years after the beginning of dairying practices. [7] Thus, even in the case of lactase persistence there is a long delay between the onset of a new habit and the spread of the adaptive allele, so milk consumption may have been restricted to children or to lactose-reduced products.
Another example of a mutation positively selected after the switch to agriculture is the increase in the number of AMY1 gene copies. AMY1 encodes the starch-digesting enzyme amylase present in saliva, and modern humans have a higher number of gene copies than chimpanzees. [8]
The human immune system has undergone intense selection through the millennia, adapting to different pathogen landscapes. Several environmental and cultural changes have imposed selective pressure on different immune-associated genes. Migrations, for example, exposed humans to new habitats carrying new pathogens or pathogen vectors (e.g. mosquitoes). The switch to agriculture also involved exposure to different pathogens and health conditions, due both to increased population density and to living close to livestock. However, it is difficult to directly correlate particular ancient genome changes with improved resistance to particular pathogens, given the vastness and complexity of the human immune system. Besides studying changes in the human immune system directly, it is also possible to study the ancient genomes of pathogens, such as those causing tuberculosis, leprosy, plague, smallpox or malaria. For example, researchers have discovered that all strains of Yersinia pestis from before 3600 years ago lacked the ymt gene, which is essential for the pathogen to survive in the intestine of fleas. [8] This suggests that in the ancient past plague may have been less virulent than in more recent Y. pestis outbreaks.
A study of ancient DNA supported or confirmed [9] that recent human evolution to resist infection by pathogens also increased inflammatory disease risk in post-Neolithic Europeans over the last 10,000 years, estimating the nature, strength, and time of onset of selection due to pathogens. [10]
The genomes of many non-hominin vertebrates, including ancient mammoth, polar bear, dog and horse, have been reconstructed through aDNA recovery from fossils and from samples preserved at low temperature or high altitude. Mammoth studies are the most frequent because soft tissue and hair are abundantly preserved in permafrost; they are used to identify relationships with, and demographic changes relative to, more recent elephants. Polar bear studies are performed to identify the impact of climate change on evolution and biodiversity. Dog and horse studies give insights into domestication. In plants, aDNA has been isolated from seeds, pollen and wood. A correlation has been identified between ancient and extant barley. Another application was the detection of the domestication and adaptation process of maize, which involved genes for drought tolerance and sugar content. [1]
The analysis of ancient genomes of anatomically modern humans has, in recent years, completely revolutionized the study of population migrations, transformation and evolution. Nevertheless, much still remains unknown. The first and most obvious problem with this approach, which is being partially overcome by the continuous improvement of ancient DNA extraction techniques, is the difficulty of recovering well-preserved ancient genomes, a challenge that is particularly acute in Africa and in Asia, where temperatures are higher than in colder regions of the world. Further, Africa is, among all the continents, the one that harbors the most genetic diversity. [7] Besides DNA degradation, exogenous contamination also limits paleogenomic sequencing and assembly. [1] As we do not possess ancient DNA from the time and region inhabited by the original ancestors of the present-day non-African population, we still know little about their structure and location. The second and more important challenge the field has to face is the recovery of DNA from early modern humans (100,000 – 200,000 BP). These data, together with a larger number of archaic genomes to analyze and with knowledge of the timing and distribution of archaic genetic admixture, will allow scientists to reconstruct the history of our species more easily. In fact, collecting more data about our genetic history will allow us to track human evolution not only in terms of migrations and natural selection, but also in terms of culture. In the next decade the paleogenomics research field is expected to focus mainly on three topics: the fine-scale definition of past human interactions through denser sampling; an understanding of how these interactions contributed to the agricultural transition, through analysis of DNA from understudied regions; and, finally, the quantification of the contribution of natural selection to present-day phenotypes.
To interpret all these data, geneticists will need to cooperate with historians, as they have already done with anthropologists and archaeologists. [7]
Bioethics in paleogenomics concerns ethical questions that arise in the study of ancient human remains, due to the complex relationships among scientists, governments and indigenous populations. In addition, paleogenomic studies have the potential to harm community or individual histories and identities, as well as to reveal sensitive information about descendants. For these reasons, these kinds of studies remain a sensitive subject. Paleogenomic studies can have negative consequences mainly because of discrepancies between the articulation of ethical principles and actual practice. In fact, ancestors’ remains are usually considered legally and scientifically as “artifacts” rather than “human subjects”, which is used to justify questionable behaviors and a lack of engagement with communities. Testing of ancestral remains is therefore used in disputes, treaty claims, repatriation, or other legal cases. Acknowledgement of the importance and sensitivity of this subject is leading towards ethical commitment and guidance applicable to different contexts, in order to preserve the dignity of ancestral remains and avoid ethical issues. [11] Finally, another pioneering area of interest is the so-called “de-extinction” project, which aims at the resurrection of extinct species, such as the mammoth. This project, which appears to be possible thanks to CRISPR/Cas9 technology, nevertheless raises many ethical issues. [1]
The polymerase chain reaction (PCR) is a method widely used to make millions to billions of copies of a specific DNA sample rapidly, allowing scientists to amplify a very small sample of DNA sufficiently to enable detailed study. PCR was invented in 1983 by American biochemist Kary Mullis at Cetus Corporation. Mullis and biochemist Michael Smith, who had developed other essential ways of manipulating DNA, were jointly awarded the Nobel Prize in Chemistry in 1993.
In molecular biology, restriction fragment length polymorphism (RFLP) is a technique that exploits variations in homologous DNA sequences, known as polymorphisms, to distinguish individuals, populations, or species or to pinpoint the locations of genes within a sequence. The term may refer to a polymorphism itself, as detected through the differing locations of restriction enzyme sites, or to a related laboratory technique by which such differences can be illustrated. In RFLP analysis, a DNA sample is digested into fragments by one or more restriction enzymes, and the resulting restriction fragments are then separated by gel electrophoresis according to their size.
A microsatellite is a tract of repetitive DNA in which certain DNA motifs are repeated, typically 5–50 times. Microsatellites occur at thousands of locations within an organism's genome. They have a higher mutation rate than other areas of DNA leading to high genetic diversity. Microsatellites are often referred to as short tandem repeats (STRs) by forensic geneticists and in genetic genealogy, or as simple sequence repeats (SSRs) by plant geneticists.
Archaeogenetics is the study of ancient DNA using various molecular genetic methods and DNA resources. This form of genetic analysis can be applied to human, animal, and plant specimens. Ancient DNA can be extracted from various fossilized specimens including bones, eggshells, and artificially preserved tissues in human and animal specimens. In plants, ancient DNA can be extracted from seeds and tissue. Archaeogenetics provides us with genetic evidence of ancient population group migrations, domestication events, and plant and animal evolution. The ancient DNA cross referenced with the DNA of relative modern genetic populations allows researchers to run comparison studies that provide a more complete analysis when ancient DNA is compromised.
AFLP-PCR or just AFLP is a PCR-based tool used in genetics research, DNA fingerprinting, and in the practice of genetic engineering. Developed in the early 1990s by KeyGene, AFLP uses restriction enzymes to digest genomic DNA, followed by ligation of adaptors to the sticky ends of the restriction fragments. A subset of the restriction fragments is then selected to be amplified. This selection is achieved by using primers complementary to the adaptor sequence, the restriction site sequence and a few nucleotides inside the restriction site fragments. The amplified fragments are separated and visualized by denaturing agarose gel electrophoresis, either through autoradiography or fluorescence methodologies, or via automated capillary sequencing instruments.
Genotyping is the process of determining differences in the genetic make-up (genotype) of an individual by examining the individual's DNA sequence using biological assays and comparing it to another individual's sequence or a reference sequence. It reveals the alleles an individual has inherited from their parents. Traditionally genotyping is the use of DNA sequences to define biological populations by use of molecular tools. It does not usually involve defining the genes of an individual.
Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps: the first step is to select only the subset of DNA that encodes proteins. These regions are known as exons—humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.
Combined Bisulfite Restriction Analysis is a molecular biology technique that allows for the sensitive quantification of DNA methylation levels at a specific genomic locus on a DNA sequence in a small sample of genomic DNA. The technique is a variation of bisulfite sequencing, and combines bisulfite conversion based polymerase chain reaction with restriction digestion. Originally developed to reliably handle minute amounts of genomic DNA from microdissected paraffin-embedded tissue samples, the technique has since seen widespread usage in cancer research and epigenetics studies.
Microfluidic whole genome haplotyping is a technique for the physical separation of individual chromosomes from a metaphase cell followed by direct resolution of the haplotype for each allele.
DNA nanoball sequencing is a high-throughput sequencing technology that is used to determine the entire genomic sequence of an organism. The method uses rolling circle replication to amplify small fragments of genomic DNA into DNA nanoballs. Fluorescent nucleotides bind to complementary nucleotides and are then polymerized to anchor sequences bound to known sequences on the DNA template. The base order is determined via the fluorescence of the bound nucleotides. This DNA sequencing method allows large numbers of DNA nanoballs to be sequenced per run at lower reagent costs compared to other next-generation sequencing platforms. However, a limitation of this method is that it generates only short sequences of DNA, which presents challenges to mapping its reads to a reference genome. After purchasing Complete Genomics, the Beijing Genomics Institute (BGI) refined DNA nanoball sequencing to sequence nucleotide samples on their own platform.
Multiple Annealing and Looping Based Amplification Cycles (MALBAC) is a quasilinear whole genome amplification method. Unlike conventional DNA amplification methods that are non-linear or exponential, MALBAC utilizes special primers that allow amplicons to have complementary ends and therefore to loop, preventing DNA from being copied exponentially. This results in amplification of only the original genomic DNA and therefore reduces amplification bias. MALBAC is “used to create overlapped shotgun amplicons covering most of the genome”. For next generation sequencing, MALBAC is followed by regular PCR which is used to further amplify amplicons.
STARR-seq is a method to assay enhancer activity for millions of candidates from arbitrary sources of DNA. It is used to identify the sequences that act as transcriptional enhancers in a direct, quantitative, and genome-wide manner.
Viral metagenomics uses metagenomic technologies to detect viral genomic material from diverse environmental and clinical samples. Viruses are the most abundant biological entity and are extremely diverse; however, only a small fraction of viruses have been sequenced and only an even smaller fraction have been isolated and cultured. Sequencing viruses can be challenging because viruses lack a universally conserved marker gene so gene-based approaches are limited. Metagenomics can be used to study and analyze unculturable viruses and has been an important tool in understanding viral diversity and abundance and in the discovery of novel viruses. For example, metagenomics methods have been used to describe viruses associated with cancerous tumors and in terrestrial ecosystems.
Single-cell sequencing examines the nucleic acid sequence information from individual cells with optimized next-generation sequencing technologies, providing a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. For example, in cancer, sequencing the DNA of individual cells can give information about mutations carried by small populations of cells. In development, sequencing the RNAs expressed by individual cells can give insight into the existence and behavior of different cell types. In microbial systems, a population of the same species can appear genetically clonal. Still, single-cell sequencing of RNA or epigenetic modifications can reveal cell-to-cell variability that may help populations rapidly adapt to survive in changing environments.
Transposon insertion sequencing (Tn-seq) combines transposon insertional mutagenesis with massively parallel sequencing (MPS) of the transposon insertion sites to identify genes contributing to a function of interest in bacteria. The method was originally established by concurrent work in four laboratories under the acronyms HITS, INSeq, TraDIS, and Tn-Seq. Numerous variations have been subsequently developed and applied to diverse biological systems. Collectively, the methods are often termed Tn-Seq as they all involve monitoring the fitness of transposon insertion mutants via DNA sequencing approaches.
Circulating tumor DNA (ctDNA) is tumor-derived fragmented DNA in the bloodstream that is not associated with cells. ctDNA should not be confused with cell-free DNA (cfDNA), a broader term which describes DNA that is freely circulating in the bloodstream, but is not necessarily of tumor origin. Because ctDNA may reflect the entire tumor genome, it has gained traction for its potential clinical utility; "liquid biopsies" in the form of blood draws may be taken at various time points to monitor tumor progression throughout the treatment regimen.
Duplex sequencing is a library preparation and analysis method for next-generation sequencing (NGS) platforms that employs random tagging of double-stranded DNA to detect mutations with higher accuracy and lower error rates.
Ancient pathogen genomics is a scientific field related to the study of pathogen genomes recovered from ancient human, plant or animal remains. Ancient pathogens are microorganisms, now extinct, that in past centuries caused several epidemics and deaths worldwide. Their genome, referred to as ancient DNA (aDNA), is isolated from the buried remains of victims of the pandemics caused by these pathogens.
Clinical metagenomic next-generation sequencing (mNGS) is the comprehensive analysis of microbial and host genetic material in clinical samples from patients by next-generation sequencing. It uses the techniques of metagenomics to identify and characterize the genome of bacteria, fungi, parasites, and viruses without the need for a prior knowledge of a specific pathogen directly from clinical specimens. The capacity to detect all the potential pathogens in a sample makes metagenomic next generation sequencing a potent tool in the diagnosis of infectious disease especially when other more directed assays, such as PCR, fail. Its limitations include clinical utility, laboratory validity, sense and sensitivity, cost and regulatory considerations.
Linked-read sequencing, a type of DNA sequencing technology, uses a specialized technique that tags DNA molecules with unique barcodes before fragmenting them. In traditional sequencing technology, DNA is broken into small fragments that are then sequenced individually, resulting in short reads from which it is difficult to reconstruct the original DNA sequence accurately. By contrast, the unique barcodes of linked-read sequencing allow scientists to link together DNA fragments that come from the same DNA molecule. A pivotal benefit of this technology lies in the small quantities of DNA required for large genome information output, effectively combining the advantages of long-read and short-read technologies.