Human Pangenome Reference

Shared sequences and structural variants between genomes in Human Pangenome Reference

The Human Pangenome Reference is a collection of genomes compiled by the Human Pangenome Reference Consortium (HPRC). The first draft comprises 47 phased, diploid assemblies from a genetically diverse cohort of individuals and is intended to capture the genetic diversity of the human population. It seeks to address perceived shortcomings of the current human reference genome by offering a more comprehensive and inclusive resource for genomic research and analysis. [1]

The pangenome concept, originating from the study of prokaryotes, has been extended to multicellular eukaryotic organisms, including humans. The human pangenome has significant implications for population genetics, phylogenetics, and public health policy, as it can inform the genetic basis of diseases and personalized treatments by providing insights into the genetic diversity of human populations. [2]

The new human pangenome reference integrates the missing 8% of the human genome sequence, adding over 100 million new bases. It aims to capture more population diversity than the previous reference sequence and is based on 94 high-quality haploid assemblies from individuals with broad genetic diversity. The generation of this reference genome focuses on eliminating gaps, incorporating complex genomic sequence features, and encompassing a broader spectrum of human genome diversity. [3]

History

The human reference genome, initially drafted over 20 years ago, is a composite of merged haplotypes from more than 20 individuals, with a single individual contributing approximately 70% of the sequence. It has limitations, including biases and errors, and, like any linear human genome reference sequence, it cannot fully represent global human genomic variation. Most genomic research has focused on individuals of European descent, which biases the datasets available for analysis. Consequently, precision medicine relies primarily on genomic variation found within populations of European ancestry. This limited scope overlooks a significant portion of the global genetic diversity that is crucial for understanding clinical phenotypes. [4] To overcome this, the Human Pangenome Reference Consortium (HPRC) has been working on a more complete human reference genome: a graph-based, telomere-to-telomere representation of global genomic diversity that integrates genome sequences from a diverse array of individuals. Its primary objectives include enhancing gene-disease association studies across populations and serving as an extensive genetic resource for future biomedical research and precision medicine. [1] [4]

Properties of Human Pangenome Reference

The Human Pangenome Reference Consortium (HPRC) has developed a draft human pangenome reference, which includes 47 phased, diploid assemblies from a genetically diverse cohort of individuals. The HPRC samples were sequenced using Pacific Biosciences (PacBio) high-fidelity (HiFi) sequencing, Oxford Nanopore Technologies (ONT) long-read sequencing, Bionano optical maps, and high-coverage Hi-C Illumina short-read sequencing. [1]

Capturing variants

These assemblies are reported to cover more than 99% of the expected sequence in each genome and exhibit an accuracy of over 99% at both the structural and base pair levels. The pangenome captures known variants and haplotypes, reveals new alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,115 gene duplications relative to the existing reference GRCh38, with roughly 90 million of the additional base pairs derived from structural variation. Using this draft pangenome for analyzing short-read data has shown a 34% reduction in small variant discovery errors and a 104% increase in the detection of structural variants per haplotype compared to GRCh38-based workflows. [1]
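
The two percentages are easier to compare when read as multipliers: a 34% reduction leaves 66% of the baseline error count, while a 104% increase is slightly more than a doubling. The short sketch below illustrates the arithmetic; the baseline counts of 1000 are hypothetical placeholders chosen for readability and are not figures from [1].

```python
def apply_percent_change(baseline, percent_change):
    """Return the value after a relative change expressed in percent.

    A negative percent_change is a reduction, a positive one an increase.
    """
    return baseline * (100 + percent_change) / 100


# Hypothetical baselines for a GRCh38-based workflow (illustrative only).
baseline_small_variant_errors = 1000
baseline_svs_per_haplotype = 1000

# 34% fewer small-variant discovery errors with the pangenome workflow.
errors_with_pangenome = apply_percent_change(baseline_small_variant_errors, -34)

# 104% more structural variants detected per haplotype.
svs_with_pangenome = apply_percent_change(baseline_svs_per_haplotype, 104)

print(errors_with_pangenome)  # 660.0
print(svs_with_pangenome)     # 2040.0
```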

Representation of diversity

The HPRC's efforts are part of a broader initiative to sequence and assemble genomes from individuals across diverse populations, with the goal of better representing the genomic landscape of human diversity. The consortium aims to increase the number of genome sequences to 350 by mid-2024, providing a more complete and inclusive resource for genomic research and analysis. [1] The development of the human pangenome reference marks a notable advance in genomics, as it offers a more accurate and diverse depiction of global genomic variation. It is expected to enhance gene-disease association studies across populations, broaden the scope of genomics research to encompass the most repetitive and polymorphic regions of the genome, and serve as a valuable genetic resource for future studies. [1]

HPRC sample subpopulations include ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest US; CHS, Han Chinese South; CLM, Colombian in Medellin, Colombia; ESN, Esan in Nigeria; GWD, Gambian in Western Division; KHV, Kinh in Ho Chi Minh City, Vietnam; MKK, Maasai in Kinyawa, Kenya; MSL, Mende in Sierra Leone; PEL, Peruvian in Lima, Peru; PJL, Punjabi in Lahore, Pakistan; PUR, Puerto Rican in Puerto Rico; and YRI, Yoruba in Ibadan, Nigeria. [1] The human pangenome reference is more comprehensive than previous reference sequences: it incorporates over 100 million new bases from 47 people with diverse ancestries. [5] [3]

Human Pangenome generation

A brief overview of different steps in genome de novo assembly

Sample selection and sequencing

The pangenome reference includes 47 fully phased diploid genomes. Among these, 29 genomes were entirely generated by HPRC, while the remaining 18 were produced by other efforts. [1]

The following sequencing technologies were used: Pacific Biosciences (PacBio) high-fidelity (HiFi) sequencing at 39.7× depth of coverage, Oxford Nanopore Technologies (ONT) long-read sequencing, Bionano optical maps, and high-coverage Hi-C Illumina short-read sequencing. For the 18 additional samples, the consortium employed the unsheared nanopore long-read sequencing protocol, yielding approximately 60× coverage of unsheared sequencing data. [1]

Assembling genomes

The Trio-Hifiasm tool [6] [7] was selected as the primary assembler after thorough benchmarking of multiple alternatives. Trio-Hifiasm uses PacBio HiFi long-read sequences together with parental Illumina short-read sequences to generate highly phased contig assemblies. [1]

Constructing the pangenome graph

Three different tools were used to construct the pangenome graph: Minigraph [8], Minigraph-Cactus [10], and the PanGenome Graph Builder (PGGB). [1]
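
As a rough illustration of what such a graph represents, the sketch below models a pangenome graph as a set of sequence segments plus one walk per haplotype. It is a toy data structure under stated assumptions, not the output format or algorithm of any of the construction tools; the segment sequences, segment names, and sample names are all made up for the example.

```python
from collections import Counter

# Toy segments (node id -> sequence); all values are invented for illustration.
segments = {
    "s1": "ACGT",
    "s2": "G",          # one allele of a single-base variant
    "s3": "T",          # the alternative allele
    "s4": "CCA",
    "s5": "TTTTGGGG",   # inserted sequence carried by only some haplotypes
    "s6": "GA",
}

# Each haplotype is a walk through the graph (hypothetical sample names).
paths = {
    "sampleA#1": ["s1", "s2", "s4", "s6"],
    "sampleA#2": ["s1", "s3", "s4", "s5", "s6"],
    "sampleB#1": ["s1", "s2", "s4", "s5", "s6"],
}


def spell(path):
    """Reconstruct the linear haplotype sequence spelled by a walk."""
    return "".join(segments[s] for s in path)


def edges(all_paths):
    """Directed links (from, to) used by at least one haplotype."""
    return {(a, b) for p in all_paths.values() for a, b in zip(p, p[1:])}


def shared_segments(all_paths):
    """Segments traversed by every haplotype."""
    counts = Counter(s for p in all_paths.values() for s in set(p))
    return sorted(s for s, c in counts.items() if c == len(all_paths))


def variable_segments(all_paths):
    """Segments missing from at least one haplotype (sites of variation)."""
    return sorted(set(segments) - set(shared_segments(all_paths)))


print(spell(paths["sampleA#1"]))   # ACGTGCCAGA
print(shared_segments(paths))      # ['s1', 's4', 's6']
print(variable_segments(paths))    # ['s2', 's3', 's5']
```

Storing each shared segment once and recording haplotypes as walks is what lets a graph hold many genomes without duplicating the sequence they have in common.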

Applications

Small variants

One notable application is pangenome-based short variant discovery, in which short reads are aligned to a pangenome graph to improve the accuracy of calling small variants such as SNPs and indels. This method is reported to perform better than traditional approaches, particularly in complex regions and in medically relevant genes. The pangenome is also reported to aid variant calling in parent-child trios, potentially improving accuracy in this context. [1]

Structural variants

Another key application lies in SV genotyping, where the sequence-resolved structural variants (SVs) within the pangenome enable the identification and genotyping of diverse SV alleles. [1]
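
The idea of sequence-resolved alleles can be sketched with a toy example. Assuming each haplotype is stored as a walk through graph segments, the allele at a site is the sub-walk between two flanking anchor segments, and a diploid genotype is the pair of alleles carried by a sample's two haplotypes. The segment names, sample names, and helper functions below are hypothetical, and the sketch reads genotypes directly from assembled walks; it is not the read-based genotyping method used in [1].

```python
# Hypothetical haplotype walks through one site flanked by anchors "s1" and "s6".
walks = {
    ("sampleA", 1): ["s1", "s2", "s6"],
    ("sampleA", 2): ["s1", "s3", "s4", "s6"],
    ("sampleB", 1): ["s1", "s6"],           # direct link: intervening sequence absent
    ("sampleB", 2): ["s1", "s2", "s6"],
}


def allele_between(walk, source, sink):
    """Return the segments a walk traverses strictly between two anchors."""
    i, j = walk.index(source), walk.index(sink)
    return tuple(walk[i + 1:j])


def genotype_site(all_walks, source, sink):
    """Enumerate distinct alleles at a site and report each sample's genotype.

    Allele ids are assigned in order of first appearance (sorted by sample and
    haplotype number); they are arbitrary labels, not reference/alternate codes.
    """
    allele_ids = {}
    genotypes = {}
    for (sample, _hap), walk in sorted(all_walks.items()):
        allele = allele_between(walk, source, sink)
        idx = allele_ids.setdefault(allele, len(allele_ids))
        genotypes.setdefault(sample, []).append(idx)
    return allele_ids, {s: "|".join(map(str, g)) for s, g in genotypes.items()}


allele_ids, genotypes = genotype_site(walks, "s1", "s6")
print(allele_ids)   # {('s2',): 0, ('s3', 's4'): 1, (): 2}
print(genotypes)    # {'sampleA': '0|1', 'sampleB': '2|0'}
```

Because three distinct alleles appear at the same site, the example also shows why a graph can represent multi-allelic structural variation that a single linear reference cannot.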

Variable number tandem repeats

Improvements in the mapping of variable number tandem repeat (VNTR) regions, RNA sequencing mapping, and chromatin immunoprecipitation and sequencing analysis were also reported. [citation needed]

In summary, the pangenome is regarded as a resource with potential for enhancing variant discovery, population genetics analyses, and the detection of complex genetic events that may not be identified with conventional reference genomes. [1]

Limitations

Currently available applications and tools for Human Pangenome Reference

Lack of established tools

Most existing analysis tools were developed for GRCh38, the linear human reference genome. Variant discovery against this reference is known to miss some variation, because the reference lacks diversity and is neither complete nor fully accurate. Using graph-based references for alignment can increase the accuracy of the analysis, as they are more diverse and complete. [12]

Scale-up problems

Estimates suggest that by 2025 between 100 million and 2 billion genomes will have been sequenced; given current price trends, storing these data would be expensive and problematic. [12] As personal genome data become more widely available, the initial dataset, currently in the thousands of gigabase-scale genomes, is poised to expand exponentially. This growth will require more efficient analysis algorithms and data representation formats that can handle the escalating demands on time, memory, and storage space. [12]

Privacy problems for expanding the dataset

Expanding the human pangenome reference to the proposed 700 haplotypes (350 individuals) poses challenges for inclusivity because of linguistic, literacy, and socioeconomic barriers, as well as distrust among racial and ethnic minorities and Indigenous peoples. Obtaining informed consent is complex, as participants need to understand the implications of the project. Balancing the release of post-analysis genomic data against ethical considerations raises dilemmas about complete information disclosure. [12]

References

  1. Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K.; Monlong, Jean; Abel, Haley J.; Buonaiuto, Silvia; Chang, Xian H.; Cheng, Haoyu; Chu, Justin; Colonna, Vincenza (May 2023). "A draft human pangenome reference". Nature. 617 (7960): 312–324. Bibcode:2023Natur.617..312L. doi:10.1038/s41586-023-05896-x. ISSN 1476-4687. PMC 10172123. PMID 37165242.
  2. Abondio, Paolo; Cilli, Elisabetta; Luiselli, Donata (June 2023). "Human Pangenomics: Promises and Challenges of a Distributed Genomic Reference". Life. 13 (6): 1360. Bibcode:2023Life...13.1360A. doi:10.3390/life13061360. ISSN 2075-1729. PMC 10304804. PMID 37374141.
  3. Lee, HoJoon; Greer, Stephanie U.; Pavlichin, Dmitri S.; Zhou, Bo; Urban, Alexander E.; Weissman, Tsachy; Liao, Wen-Wei; Asri, Mobin; Ebler, Jana; Doerr, Daniel; Haukness, Marina; Hickey, Glenn; Lu, Shuangjia; Lucas, Julian K.; Monlong, Jean (2023-08-28). "Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome". Cell Reports Methods. 3 (8): 100543. doi:10.1016/j.crmeth.2023.100543. ISSN 2667-2375. PMC 10475782. PMID 37671027.
  4. Wang, T; Antonacci-Fulton, L; Howe, K; Lawson, HA; Lucas, JK; Phillippy, AM; Popejoy, AB; Asri, M; Carson, C; Chaisson, MJP; Chang, X; Cook-Deegan, R; Felsenfeld, AL; Fulton, RS; Garrison, EP; Garrison, NA; Graves-Lindsay, TA; Ji, H; Kenny, EE; Koenig, BA; Li, D; Marschall, T; McMichael, JF; Novak, AM; Purushotham, D; Schneider, VA; Schultz, BI; Smith, MW; Sofia, HJ; Weissman, T; Flicek, P; Li, H; Miga, KH; Paten, B; Jarvis, ED; Hall, IM; Eichler, EE; Haussler, D; Human Pangenome Reference Consortium (April 2022). "The Human Pangenome Project: a global resource to map genomic diversity". Nature. 604 (7906): 437–446. Bibcode:2022Natur.604..437W. doi:10.1038/s41586-022-04601-8. PMC 9402379. PMID 35444317.
  5. "A new human "pangenome" reference". www.genome.gov. Retrieved 2024-02-23.
  6. Jarvis, Erich D.; Formenti, Giulio; Rhie, Arang; Guarracino, Andrea; Yang, Chentao; Wood, Jonathan; Tracey, Alan; Thibaud-Nissen, Francoise; Vollger, Mitchell R.; Porubsky, David; Cheng, Haoyu; Asri, Mobin; Logsdon, Glennis A.; Carnevali, Paolo; Chaisson, Mark J. P. (November 2022). "Semi-automated assembly of high-quality diploid human reference genomes". Nature. 611 (7936): 519–531. Bibcode:2022Natur.611..519J. doi:10.1038/s41586-022-05325-5. ISSN 1476-4687. PMC 9668749. PMID 36261518.
  7. Li, Heng; Bloom, Jonathan M.; Farjoun, Yossi; Fleharty, Mark; Gauthier, Laura; Neale, Benjamin; MacArthur, Daniel (August 2018). "A synthetic-diploid benchmark for accurate variant-calling evaluation". Nature Methods. 15 (8): 595–597. doi:10.1038/s41592-018-0054-7. ISSN 1548-7105. PMC 6341484. PMID 30013044.
  8. Li, Heng; Feng, Xiaowen; Chu, Chong (2020-10-16). "The design and construction of reference pangenome graphs with minigraph". Genome Biology. 21 (1): 265. doi:10.1186/s13059-020-02168-z. ISSN 1474-760X. PMC 7568353. PMID 33066802.
  9. Li, Heng (2018-05-10). "Minimap2: pairwise alignment for nucleotide sequences". Bioinformatics. 34 (18): 3094–3100. doi:10.1093/bioinformatics/bty191. ISSN 1367-4803. PMC 6137996. PMID 29750242.
  10. Hickey, Glenn; Monlong, Jean; Ebler, Jana; Novak, Adam M.; Eizenga, Jordan M.; Gao, Yan; Marschall, Tobias; Li, Heng; Paten, Benedict (2023-05-10). "Pangenome graph construction from genome alignments with Minigraph-Cactus". Nature Biotechnology. 42 (4): 663–673. doi:10.1038/s41587-023-01793-w. ISSN 1546-1696. PMC 10638906. PMID 37165083.
  11. Armstrong, Joel; Hickey, Glenn; Diekhans, Mark; Fiddes, Ian T.; Novak, Adam M.; Deran, Alden; Fang, Qi; Xie, Duo; Feng, Shaohong; Stiller, Josefin; Genereux, Diane; Johnson, Jeremy; Marinescu, Voichita Dana; Alföldi, Jessica; Harris, Robert S. (November 2020). "Progressive Cactus is a multiple-genome aligner for the thousand-genome era". Nature. 587 (7833): 246–251. Bibcode:2020Natur.587..246A. doi:10.1038/s41586-020-2871-y. ISSN 1476-4687. PMC 7673649. PMID 33177663.
  12. Singh, Vipin; Pandey, Shweta; Bhardwaj, Anshu (2022). "From the reference human genome to human pangenome: Premise, promise and challenge". Frontiers in Genetics. 13. doi:10.3389/fgene.2022.1042550. ISSN 1664-8021. PMC 9684177. PMID 36437921.