Compression of genomic sequencing data

Last updated

High-throughput sequencing technologies have led to a dramatic decline of genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and 1001 (Arabidopsis thaliana) Genomes Project. The storage and transfer of the tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in the development of novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods for genomic data compression.

Contents

General concepts

While standard data compression tools (e.g., zip and rar) are being used to compress sequence data (e.g., GenBank flat file database), this approach has been criticized to be extravagant because genomic sequences often contain repetitive content (e.g., microsatellite sequences) or many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data. [1] [2] [3]

Figure 1: The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing of the original sequencing data (e.g., reducing the original dataset to only variations relative to a specified reference sequence; (2) Encoding the processed data into binary form; and (3) decoding the data back to text form. Basic Procedure of Genomic Data Compression.png
Figure 1: The principal steps of a workflow for compressing genomic re-sequencing data: (1) processing of the original sequencing data (e.g., reducing the original dataset to only variations relative to a specified reference sequence; (2) Encoding the processed data into binary form; and (3) decoding the data back to text form.

Base variants

With the availability of a reference template, only differences (e.g., single nucleotide substitutions and insertions/deletions) need to be recorded, thereby greatly reducing the amount of information to be stored. The notion of relative compression is obvious especially in genome re-sequencing projects where the aim is to discover variations in individual genomes. The use of a reference single nucleotide polymorphism (SNP) map, such as dbSNP, can be used to further improve the number of variants for storage. [4]

Relative genomic coordinates

Another useful idea is to store relative genomic coordinates in lieu of absolute coordinates. [4] For example, representing sequence variant bases in the format ‘Position1Base1Position2Base2…’, ‘123C125T130G’ can be shortened to ‘0C2T5G’, where the integers represent intervals between the variants. The cost is the modest arithmetic calculation required to recover the absolute coordinates plus the storage of the correction factor (‘123’ in this example).

Prior information about the genomes

Further reduction can be achieved if all possible positions of substitutions in a pool of genome sequences are known in advance. [4] For instance, if all locations of SNPs in a human population are known, then there is no need to record variant coordinate information (e.g., ‘123C125T130G’ can be abridged to ‘CTG’). This approach, however, is rarely appropriate because such information is usually incomplete or unavailable.

Encoding genomic coordinates

Encoding schemes are used to convert coordinate integers into binary form to provide additional compression gains. Encoding designs, such as the Golomb code and the Huffman code, have been incorporated into genomic data compression tools. [5] [6] [7] [8] [9] [10] Of course, encoding schemes entail accompanying decoding algorithms. Choice of the decoding scheme potentially affects the efficiency of sequence information retrieval.

Algorithm design choices

A universal approach to compressing genomic data may not necessarily be optimal, as a particular method may be more suitable for specific purposes and aims. Thus, several design choices that potentially impacts compression performance may be important for consideration.

Reference sequence

Selection of a reference sequence for relative compression can affect compression performance. Choosing a consensus reference sequence over a more specific reference sequence (e.g., the revised Cambridge Reference Sequence) can result in higher compression ratio because the consensus reference may contain less bias in its data. [4] Knowledge about the source of the sequence being compressed, however, may be exploited to achieve greater compression gains. The idea of using multiple reference sequences has been proposed. [4] Brandon et al. (2009) [4] alluded to the potential use of ethnic group-specific reference sequence templates, using the compression of mitochondrial DNA variant data as an example (see Figure 2). The authors found biased haplotype distribution in the mitochondrial DNA sequences of Africans, Asians, and Eurasians relative to the revised Cambridge Reference Sequence. Their result suggests that the revised Cambridge Reference Sequence may not always be optimal because a greater number of variants need to be stored when it is used against data from ethnically distant individuals. Additionally, a reference sequence can be designed based on statistical properties [1] [4] or engineered [11] [12] to improve the compression ratio.

Encoding schemes

The application of different types of encoding schemes have been explored to encode variant bases and genomic coordinates. [4] Fixed codes, such as the Golomb code and the Rice code, are suitable when the variant or coordinate (represented as integer) distribution is well defined. Variable codes, such as the Huffman code, provide a more general entropy encoding scheme when the underlying variant and/or coordinate distribution is not well-defined (this is typically the case in genomic sequence data).

List of genomic re-sequencing data compression tools

The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes. [4] [5] [6] [7] [8] [9] [10] [13] Very close variants or revisions of the same genome can be compressed very efficiently (for example, 18,133 compression ratio was reported [6] for two revisions of the same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.

Genomic Sequencing data compression tools compatible with standard genome sequencing files formats (BAM & FASTQ)
SoftwareDescriptionCompression RatioData Used for EvaluationApproach/Encoding SchemeLinkUse LicenceReference
PetaSuiteLossless compression tool for BAM and FASTQ.gz files; transparent on-the-fly readback through BAM and FASTQ.gz virtual files60% to 90%Human genome sequences from the 1000 Genomes Project https://petagene.com Commercial [14]
GenozipA universal compressor for genomic files – compresses FASTQ, SAM/BAM/CRAM, VCF/BCF, FASTA, GFF/GTF/GVF, PHYLIP, BED and 23andMe files [15] [16] Human genome sequences from the 1000 Genomes ProjectGenozip extensible framework http://genozip.com Commercial, but free for non-commercial use [17]
Genomic Squeeze (G-SQZ)Lossless compression tool designed for storing and analyzing sequencing read data65% to 76%Human genome sequences from the 1000 Genomes ProjectHuffman coding http://public.tgen.org/sqz -Undeclared- [8]
CRAM (part of SAMtools)Highly efficient and tunable reference-based compression of sequence data [18] European Nucleotide Archivedeflate and rANS http://www.ebi.ac.uk/ena/software/cram-toolkit Apache-2.0 [19]
Genome Compressor (GeCo)A tool using a mixture of multiple Markov models for compressing reference and reference-free sequencesHuman nuclear genome sequenceArithmetic coding http://bioinformatics.ua.pt/software/geco/ or https://pratas.github.io/geco/ GPLv3 [13]
GenomSys codecsLossless compression of BAM and FASTQ files into the standard format ISO/IEC 23092 [20] (MPEG-G)60% to 90%Human genome sequences from the 1000 Genomes Project Context-adaptive binary arithmetic coding (CABAC) https://www.genomsys.com Commercial [21]
fastafsCompression of FASTA / UCSC2Bit files into random access compressed archives. Toolkit to mount FASTA files, indices and dictionary files virtually. This allows neat file system (api-like )integration without the need to fully decompress archives for random / partial access.FASTA filesHuffman coding as implemented by Zstd https://github.com/yhoogstrate/fastafs GPL-v2.0 [22]
Genomic Sequencing data compression tools not compatible with standard genome sequencing files formats
SoftwareDescriptionCompression RatioData Used for EvaluationApproach/Encoding SchemeLinkUse LicenseReference
Genome Differential Compressor (GDC)LZ77-style tool for compressing multiple genomes of the same species180 to 250-fold / 70 to 100-foldNuclear genome sequence of human and Saccharomyces cerevisiaeHuffman coding http://sun.aei.polsl.pl/gdc GPLv2 [5]
Genome Re-Sequencing (GRS)Reference sequence-based tool independent of a reference SNP map or sequence variation information159-fold / 18,133-fold / 82-foldNuclear genome sequence of human, Arabidopsis thaliana (different revisions of the same genome), and Oryza sativaHuffman coding https://web.archive.org/web/20121209070434/http://gmdd.shgmo.org/Computational-Biology/GRS/ free of charge for non-commercial use [6]
Genome Re-sequencing Encoding (GReEN)Probabilistic copy model-based tool for compressing re-sequencing data using a reference sequence~100-foldHuman nuclear genome sequenceArithmetic coding http://bioinformatics.ua.pt/software/green/ -Undeclared- [7]
DNAzipA package of compression tools~750-foldHuman nuclear genome sequenceHuffman coding http://www.ics.uci.edu/~dnazip/ -Undeclared- [9]
GenomeZipCompression with respect to a reference genome. Optionally uses external databases of genomic variations (e.g. dbSNP)~1200-foldHuman nuclear genome sequence (Watson) and sequences from the 1000 Genomes ProjectEntropy coding for approximations of empirical distributions https://sourceforge.net/projects/genomezip/ -Undeclared- [10]

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

The Burrows–Wheeler transform rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity.

<span class="mw-page-title-main">Genomics</span> Discipline in genetics

Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

<span class="mw-page-title-main">Single-nucleotide polymorphism</span> Single nucleotide in genomic DNA at which different sequence alternatives exist

In genetics and bioinformatics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of considered population.

<span class="mw-page-title-main">Functional genomics</span> Field of molecular biology

Functional genomics is a field of molecular biology that attempts to describe gene functions and interactions. Functional genomics make use of the vast data generated by genomic and transcriptomic projects. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.

Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.

<span class="mw-page-title-main">Phred quality score</span> Measurement in DNA sequencing

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. It was originally developed for the computer program Phred to help in the automation of DNA sequencing in the Human Genome Project. Phred quality scores are assigned to each nucleotide base call in automated sequencer traces. The FASTQ format encodes phred scores as ASCII characters alongside the read sequences. Phred quality scores have become widely accepted to characterize the quality of DNA sequences, and can be used to compare the efficacy of different sequencing methods. Perhaps the most important use of Phred quality scores is the automatic determination of accurate, quality-based consensus sequences.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

FASTQ format is a text-based format for storing both a biological sequence and its corresponding quality scores. Both the sequence letter and quality score are each encoded with a single ASCII character for brevity.

<span class="mw-page-title-main">DNA annotation</span> The process of describing the structure and function of a genome

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. SAM files can be very large, so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.

Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.

<span class="mw-page-title-main">Binary Alignment Map</span>

Binary Alignment Map (BAM) is the comprehensive raw data of genome sequencing; it consists of the lossless, compressed binary representation of the Sequence Alignment Map-files.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

Compressed Reference-oriented Alignment Map (CRAM) is a compressed columnar file format for storing biological sequences aligned to a reference sequence, initially devised by Markus Hsi-Yang Fritz et al.

ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

References

  1. 1 2 Giancarlo, R.; Scaturro, D.; Utro, F. (2009). "Textual data compression in computational biology: A synopsis". Bioinformatics. 25 (13): 1575–1586. doi: 10.1093/bioinformatics/btp117 . PMID   19251772.
  2. Nalbantog̃Lu, O. U.; Russell, D. J.; Sayood, K. (2010). "Data Compression Concepts and Algorithms and their Applications to Bioinformatics". Entropy. 12 (1): 34. doi: 10.3390/e12010034 . PMC   2821113 . PMID   20157640.
  3. Hosseini, Morteza; Pratas, Diogo; Pinho, Armando (2016). "A Survey on Data Compression Methods for Biological Sequences". Information. 7 (4): 56. doi: 10.3390/info7040056 .
  4. 1 2 3 4 5 6 7 8 9 Brandon, M. C.; Wallace, D. C.; Baldi, P. (2009). "Data structures and compression algorithms for genomic sequence data". Bioinformatics. 25 (14): 1731–1738. doi:10.1093/bioinformatics/btp319. PMC   2705231 . PMID   19447783.
  5. 1 2 3 Deorowicz, S.; Grabowski, S. (2011). "Robust relative compression of genomes with random access". Bioinformatics. 27 (21): 2979–2986. doi: 10.1093/bioinformatics/btr505 . PMID   21896510.
  6. 1 2 3 4 Wang, C.; Zhang, D. (2011). "A novel compression tool for efficient storage of genome resequencing data". Nucleic Acids Research. 39 (7): e45. doi:10.1093/nar/gkr009. PMC   3074166 . PMID   21266471.
  7. 1 2 3 Pinho, A. J.; Pratas, D.; Garcia, S. P. (2012). "GReEn: A tool for efficient compression of genome resequencing data". Nucleic Acids Research. 40 (4): e27. doi:10.1093/nar/gkr1124. PMC   3287168 . PMID   22139935.
  8. 1 2 3 Tembe, W.; Lowey, J.; Suh, E. (2010). "G-SQZ: Compact encoding of genomic sequence and quality data". Bioinformatics. 26 (17): 2192–2194. doi:10.1093/bioinformatics/btq346. PMID   20605925.
  9. 1 2 3 Christley, S.; Lu, Y.; Li, C.; Xie, X. (2009). "Human genomes as email attachments". Bioinformatics. 25 (2): 274–275. doi: 10.1093/bioinformatics/btn582 . PMID   18996942.
  10. 1 2 3 Pavlichin, D. S.; Weissman, T.; Yona, G. (2013). "The human genome contracts again". Bioinformatics. 29 (17): 2199–2302. doi: 10.1093/bioinformatics/btt362 . PMID   23793748.
  11. Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin (2011). "Reference Sequence Construction for Relative Compression of Genomes". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 7024. pp. 420–425. doi:10.1007/978-3-642-24583-1_41. ISBN   978-3-642-24582-4. S2CID   16007637.
  12. Grabowski, Szymon; Deorowicz, Sebastian (2011). "Engineering Relative Compression of Genomes". arXiv: 1103.2351 [cs.CE].
  13. 1 2 Pratas, D., Pinho, A. J., and Ferreira, P. J. S. G. Efficient compression of genomic sequences. Data Compression Conference, Snowbird, Utah, 2016.
  14. "The Importance of Data Compression in the Field of Genomics". IEEE Pulse. 2019-04-26. Retrieved 2024-02-22.
  15. Lan, Divon; Llamas, Bastien (14 September 2022). "Genozip 14 - advances in compression of BAM and CRAM files". bioRxiv. doi:10.1101/2022.09.12.507582. S2CID   252357508.
  16. Lan, Divon; Hughes, Daniel S T; Llamas, Bastien (7 July 2023). "Deep FASTQ and BAM co-compression in Genozip 15". bioRxiv. doi:10.1101/2023.07.07.548069. S2CID   259764998.
  17. Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (25 August 2021). "Genozip: a universal extensible genomic data compressor". Bioinformatics. 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. PMC   8388020 . PMID   33585897.
  18. CRAM benchmarking
  19. CRAM format specification (version 3.0)
  20. "ISO/IEC 23092-2:2019 Information technology — Genomic information representation — Part 2: Coding of genomic information". iso.org.
  21. Alberti, Claudio; Paridaens, Tom; Voges, Jan; Naro, Daniel; Ahmad, Junaid J.; Ravasi, Massimo; Renzi, Daniele; Zoia, Giorgio; Ochoa, Idoia; Mattavelli, Marco; Delgado, Jaime; Hernaez, Mikel (27 September 2018). "An introduction to MPEG-G, the new ISO standard for genomic information representation". bioRxiv   10.1101/426353 .
  22. Hoogstrate, Youri; Jenster, Guido W.; van de Werken, Harmen J. G. (December 2021). "FASTAFS: file system virtualisation of random access compressed FASTA files". BMC Bioinformatics. 22 (1): 535. doi: 10.1186/s12859-021-04455-3 . PMC   8558547 . PMID   34724897.