Heng Li

Known for: Bioinformatics, Burrows–Wheeler transform, SAMtools, TreeFam
Awards: Benjamin Franklin Award (Bioinformatics) (2012) [1]
Scientific career
Institutions: Wellcome Trust Sanger Institute, Broad Institute, Beijing Genomics Institute
Thesis: Constructing the TreeFam database (2006)
Doctoral advisor: Wei-Mou Zheng [2]
Website: hlilab.github.io

Heng Li is a Chinese bioinformatics scientist. He is an associate professor in the Department of Biomedical Informatics at Harvard Medical School and the Department of Data Science at the Dana-Farber Cancer Institute. [3] [4] [5] He was previously a research scientist at the Broad Institute in Cambridge, Massachusetts, working with David Reich and David Altshuler. [6] Li's work has made several important contributions to the field of next generation sequencing.

Education

Li majored in physics at Nanjing University from 1997 to 2001. [7] He received his PhD from the Institute of Theoretical Physics at the Chinese Academy of Sciences in 2006. His thesis, titled "Constructing the TreeFam database", was supervised by Wei-Mou Zheng. [2]

Research

Li was involved in a number of projects while working at the Beijing Genomics Institute from 2002 to 2006. These included finishing the rice genome, [8] sequencing the silkworm genome, [9] and mapping genetic variation in chickens. [10]

From 2006 to 2009, Li worked on a postdoctoral research fellowship with Richard M. Durbin at the Wellcome Trust Sanger Institute. [11] During this time, Li made several important contributions to the field of next generation sequencing (NGS) through the development of software such as the SAMtools NGS utilities, [12] the Burrows–Wheeler aligner (BWA), [13] MAQ, [14] TreeSoft and TreeFam. [15]

Li joined the Broad Institute in 2009, working in the core faculty lab of David Altshuler, [11] [16] which investigates the genetic causes of disease.

As of December 2018, Li's papers on SAMtools [12] and BWA [13] (sequence alignment using the Burrows–Wheeler transform) have both been cited over 16,000 times. [17]

Awards

In 2012, Li won the Benjamin Franklin Award [1] in bioinformatics. He became the fourth former member of Richard Durbin's lab to win the award, following Sean Eddy, Ewan Birney and Alex Bateman. [18]

Personal

Li lives in Boston with his wife, daughter, and fish. [6]

Related Research Articles

The Burrows–Wheeler transform (BWT) rearranges a character string into runs of similar characters. This is useful for compression, since a string with runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding. More importantly, the transformation is reversible without storing any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation, and it is used to prepare data for compressors such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at the DEC Systems Research Center in Palo Alto, California, and is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array, reaching linear time complexity.
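As a rough illustration of the transform and its reversibility, the following Python sketch builds the BWT by sorting all rotations of the input. Two points are assumptions rather than part of the description above: the sketch appends a sentinel character (`$`, assumed absent from the input) instead of storing the position of the first original character, and it uses the naive sort-based construction, not the linear-time suffix-array approach. The function names `bwt` and `inverse_bwt` are illustrative only.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # the sentinel marks the end of the original string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Invert the transform by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # After k rounds, each row holds the first k characters of a sorted rotation.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The rotation that ends with the sentinel is the original string plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

In the output `annb$aa`, the repeated letters of `banana` are grouped into short runs, which is the property that later compression stages exploit.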

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.

Genome project

Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism and to annotate protein-coding genes and other important genome-encoded features. The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome sequences.

TreeFam is a database of phylogenetic trees of animal genes. It aims at developing a curated resource that gives reliable information about ortholog and paralog assignments, and evolutionary history of various gene families.

Steven Salzberg: American biologist and computer scientist

Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.

UGENE: Computer software for bioinformatics

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

Richard M. Durbin: British computational biologist

Richard Michael Durbin is a British computational biologist and Al-Kindi Professor of Genetics at the University of Cambridge. He also serves as an associate faculty member at the Wellcome Sanger Institute where he was previously a senior group leader.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

Reference genome: Digital nucleic acid sequence database

A reference genome is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, one of the most recent human reference genomes, assembly GRCh38/hg38, is derived from more than 60 genomic clone libraries. There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or the UCSC Genome Browser.

Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. This format facilitates visual display of SNP/indel calling and alignment. It was first used by Tony Cox and Zemin Ning at the Wellcome Trust Sanger Institute, and became widely known through its implementation within the SAMtools software suite.
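To make the idea of summarizing base calls concrete, the Python sketch below tallies the bases at one position from a single pileup line. It assumes the six-column, tab-separated layout commonly produced by `samtools mpileup` (sequence name, 1-based position, reference base, depth, read bases, base qualities) and handles only the most common read-base symbols. The example line and the function name `count_pileup_bases` are made up for illustration.

```python
import re
from collections import Counter


def count_pileup_bases(line):
    """Tally base calls at one position from a six-column pileup line."""
    chrom, pos, ref, depth, read_bases, quals = line.rstrip("\n").split("\t")[:6]
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":
            i += 2  # read start marker, followed by one mapping-quality character
        elif c in "+-":
            # Indel: a sign, a decimal length, then that many inserted/deleted bases.
            length = re.match(r"\d+", read_bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
        elif c in ".,":
            counts[ref.upper()] += 1  # match to the reference (forward/reverse strand)
            i += 1
        elif c in "ACGTNacgtn":
            counts[c.upper()] += 1  # mismatch; letter case encodes the strand
            i += 1
        else:
            i += 1  # "$" (read end), "*" (deleted base) and other symbols are skipped
    return chrom, int(pos), counts


# Hypothetical pileup line: five reads cover chr1:100, reference base A.
example = "chr1\t100\tA\t5\t.,^IT$t+2AG.\tIIIII"
print(count_pileup_bases(example))
```

For the example line, three reads match the reference base `A` and two carry a `T`, so a downstream SNP caller would see a candidate A/T variant at this position.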

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. SAM files can be very large, so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

In DNA sequencing, a read is an inferred sequence of base pairs corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms. Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, as well as detecting somatic SNVs within an individual using multiple tissue samples.

Bowtie is a software package commonly used for sequence alignment and sequence analysis in bioinformatics. The source code for the package is distributed freely and compiled binaries are available for Linux, macOS and Windows platforms. As of 2017, the Genome Biology paper describing the original Bowtie method has been cited more than 11,000 times. Bowtie is open-source software and is currently maintained by Johns Hopkins University.

Gonçalo Abecasis: Portuguese researcher

Gonçalo Rocha Abecasis is a Portuguese American biomedical researcher at the University of Michigan and was chair of the Department of Biostatistics in the School of Public Health. He leads a group at the Center for Statistical Genetics in the Department of Biostatistics, where he is also the Felix E. Moore Collegiate Professor of Biostatistics and director of the Michigan Genomic Initiative. His group develops statistical tools to analyze the genetics of human disease.

Sequence Alignment Map (SAM) is a text-based format, originally for storing biological sequences aligned to a reference sequence, developed by Heng Li, Bob Handsaker and colleagues. It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT's PSL. The name SAM came from Gabor Marth of the University of Utah, who originally had a format under the same name but with a different syntax more similar to BLAST output. It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project.
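Because SAM is TAB-delimited text, a single alignment line can be split into its fields with very little code. The Python sketch below assumes the eleven mandatory columns of the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional `TAG:TYPE:VALUE` fields. The record shown is invented for illustration, and names such as `SamRecord` and `parse_sam_line` are not part of any library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities (ASCII-encoded)
    tags: dict   # optional fields keyed by two-letter tag


def parse_sam_line(line):
    """Split one SAM alignment line (not a header line) into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for item in fields[11:]:
        tag, typ, value = item.split(":", 2)
        # Only integer and float tag types are converted; others stay as strings.
        tags[tag] = int(value) if typ == "i" else float(value) if typ == "f" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


def is_unmapped(record):
    return bool(record.flag & 0x4)   # flag bit 0x4: segment unmapped


def is_reverse(record):
    return bool(record.flag & 0x10)  # flag bit 0x10: reverse-complemented


# Hypothetical record: a 5 bp read aligned to the reverse strand of chr1 at position 100.
rec = parse_sam_line("read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0")
print(rec.rname, rec.pos, rec.cigar, is_reverse(rec), rec.tags)
```

In practice, libraries built on SAMtools code handle headers, BAM and CRAM as well; this sketch covers only the plain-text alignment lines.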

Binary Alignment Map

Binary Alignment Map (BAM) is the comprehensive raw data of genome sequencing; it consists of the lossless, compressed binary representation of Sequence Alignment Map (SAM) files.

A plant genome assembly represents the complete genomic sequence of a plant species, which is assembled into chromosomes and other organelles by using DNA fragments that are obtained from different types of sequencing technology.

References

  1. "Broad's Heng Li Wins 2012 Benjamin Franklin Award - Bio-IT World". Archived from the original on 2012-04-01.
  2. Li, Heng (2006). Constructing the TreeFam database (PDF) (PhD thesis). Chinese Academy of Sciences.
  3. "Heng Li | Department of Biomedical Informatics". dbmi.hms.harvard.edu. Retrieved 2018-10-30.
  4. "Noted computational biologist Heng Li joins faculty". harvard.edu. Retrieved 2018-10-30.
  5. "HLi Lab - Home". hlilab.github.io. Retrieved 2018-10-30.
  6. "Heng Li's Homepage". sourceforge.net. Archived from the original on 2012-04-19.
  7. https://www.linkedin.com/in/lh3lh3 (self-published source)
  8. Yu, Jun; et al. (2005). "The Genomes of Oryza sativa: A History of Duplications". PLOS Biology. 3 (2): e38. doi:10.1371/journal.pbio.0030038. PMC 546038. PMID 15685292.
  9. Xia, Q; et al. (Dec 10, 2004). "A draft sequence for the genome of the domesticated silkworm (Bombyx mori)". Science. 306 (5703): 1937–40. Bibcode:2004Sci...306.1937X. doi:10.1126/science.1102210. PMID 15591204. S2CID 7227719.
  10. Ka-Shu Wong, Gane; et al. (9 December 2004). "A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms". Nature. 432 (7018): 717–722. Bibcode:2004Natur.432..717B. doi:10.1038/nature03156. PMC 2263125. PMID 15592405.
  11. "ResearcherID: Heng Li". researcherid.com. Retrieved 11 September 2014.
  12. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). "The Sequence Alignment/Map format and SAMtools". Bioinformatics. 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. PMC 2723002. PMID 19505943.
  13. Li, H.; Durbin, R. (2009). "Fast and accurate short read alignment with Burrows–Wheeler transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
  14. Li, H.; Ruan, J.; Durbin, R. (2008). "Mapping short DNA sequencing reads and calling variants using mapping quality scores". Genome Research. 18 (11): 1851–1858. doi:10.1101/gr.078212.108. PMC 2577856. PMID 18714091.
  15. Li, H.; Coghlan, A.; Ruan, J.; Coin, L. J.; Hériché, J. K.; Osmotherly, L.; Li, R.; Liu, T.; Zhang, Z.; Bolund, L.; Wong, G. K.; Zheng, W.; Dehal, P.; Wang, J.; Durbin, R. (2006). "TreeFam: A curated database of phylogenetic trees of animal gene families". Nucleic Acids Research. 34 (90001): D572–D580. doi:10.1093/nar/gkj118. PMC 1347480. PMID 16381935.
  16. "Current Lab Members - Altshuler Lab". broadinstitute.org. 2010-05-25. Retrieved 11 September 2014.
  17. "Heng Li - Google Scholar Citations". scholar.google.co.uk. Retrieved 16 April 2015.
  18. "Heng Li Credits Durbin Pedigree in Accepting Franklin Award". bio-itworld.com. Retrieved 11 September 2014.