CRAM (file format)

Last updated
CRAM
Filename extension
.cram
Developed byMarkus Hsi-Yang Fritz et al; Vadim Zalunin; James Bonfield
Type of format Bioinformatics
Open format?yes
Website www.ga4gh.org/cram/ , www.ebi.ac.uk/ena/software/cram-toolkit

Compressed Reference-oriented Alignment Map (CRAM) is a compressed columnar file format for storing biological sequences aligned to a reference sequence, initially devised by Markus Hsi-Yang Fritz et al. [1]

Contents

CRAM was designed to be an efficient reference-based alternative to the Sequence Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally uses a genomic reference to describe differences between the aligned sequence fragments and the reference sequence, reducing storage costs. Additionally each column in the SAM format is separated into its own blocks, improving compression ratio. CRAM files typically vary from 30 to 60% smaller than BAM, depending on the data held within them.

Implementations of CRAM exist in htsjdk, [2] htslib, [3] JBrowse, [4] and Scramble. [5]

The file format specification is maintained by the Global Alliance for Genomics and Health (GA4GH) [6] with the specification document available from the EBI cram toolkit page. [7]

File format

The basic structure of a CRAM file is a series of containers, the first of which holds a compressed copy of the SAM header. Subsequent containers consist of a container Compression Header followed by a series of slices which in turn hold the alignment records themselves, formatted as a series of blocks.

CRAM file:

Magic numberContainer
(SAM header)
Container
(Data)
...Container
(Data)
Container
(EOF)

Container:

Container
Header
Compression
Header
Slice...Slice

Slice:

Slice
Header
BlockBlock...Block

CRAM constructs records from a set of data series, describing the components of an alignment. The container Compression Header specifies which data series is encoded in which block, what codec will be used, and any codec specific meta-data (for example a table of Huffman symbol code lengths). While data series can be mixed together within the same block, keeping them separate usually improves compression and provides the opportunity for efficient selective decoding where only some data types are required.

Selective access to a CRAM file is granted via the index (with file-name suffix ".crai"). On chromosome and position sorted data this indicates which region is covered by each slice. On unsorted data the index may be used to simply fetch the Nth container. Selective decoding may also be achieved using the Compression Header to skip specified data series if partial records are required.

History

YearVersion(s)Notes
2010-11pre-CRAMInitial paper describing the reference based format. This did not use the name CRAM, but called it mzip. This software was implemented in Python as a prototype and demonstration of the basic concepts. [1]
2011-120.3–0.86Vadim Zalunin of the European Bioinformatics Institute (EBI) produced the first implementation named CRAM as a package called CRAMtools, [8] written in the Java programming language.
20121.0 [9] Implemented in Java CRAMtools. [10]
2013 C implementation added to the Scramble [11] [5] tool, by James Bonfield of the Wellcome Sanger Institute.
20132.0Changes included support for more than one reference per slice (useful with highly fragmented assemblies), better encoding of SAM auxiliary tags, splitting soft-clip and inserted bases into their own data-series, meta-data to track the number of records and bases per slice, and corrections to the BF (BAM flag) data-series.
2013Added to htslib (0.2.0).
20142.1 [12] Added EOF blocks, to help identify truncated files.
2014Added to htsjdk (1.127).
20143.0 [13] Inclusion of lzma and rANS codecs for block compression, along with multiple checksums for ensuring data integrity
2018JavaScript implementation as part of JBrowse [4] (1.15.0), by Rob Buels.
2021Rust implementation in Noodles [14]
20233.1 [15] Officially adopted. (Draft from 2019)

CRAM version 4.0 exists as a prototype in Scramble, [5] initially demonstrated in 2015, but has yet to be adopted as a standard.

See also

Related Research Articles

Waveform Audio File Format is an audio file format standard for storing an audio bitstream on personal computers. The format was developed and published for the first time in 1991 by IBM and Microsoft. It is the main format used on Microsoft Windows systems for uncompressed audio. The usual bitstream encoding is the linear pulse-code modulation (LPCM) format.

Audio Video Interleave is a proprietary multimedia container format and Windows standard introduced by Microsoft in November 1992 as part of its Video for Windows software. AVI files can contain both audio and video data in a file container that allows synchronous audio-with-video playback. Like the DVD video format, AVI files support multiple streaming audio and video, although these features are seldom used.

A video file format is a type of file format for storing digital video data on a computer system. Video is almost always stored using lossy compression to reduce the file size.

<span class="mw-page-title-main">FLAC</span> Lossless digital audio coding format

FLAC is an audio coding format for lossless compression of digital audio, developed by the Xiph.Org Foundation, and is also the name of the free software project producing the FLAC tools, the reference software package that includes a codec implementation. Digital audio compressed by FLAC's algorithm can typically be reduced to between 50 and 70 percent of its original size and decompresses to an identical copy of the original audio data.

ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed. The ZIP file format permits a number of compression algorithms, though DEFLATE is the most common. This format was originally created in 1989 and was first implemented in PKWARE, Inc.'s PKZIP utility, as a replacement for the previous ARC compression format by Thom Henderson. The ZIP format was then quickly supported by many software utilities other than PKZIP. Microsoft has included built-in ZIP support in versions of Microsoft Windows since 1998 via the "Plus! 98" addon for Windows 98. Native support was added as of the year 2000 in Windows ME. Apple has included built-in ZIP support in Mac OS X 10.3 and later. Most free operating systems have built in support for ZIP in similar manners to Windows and macOS.

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

CineForm Intermediate is an open source video codec developed for CineForm Inc by David Taylor, David Newman and Brian Schunck. On March 30, 2011, the company was acquired by GoPro which in particular wanted to use the 3D film capabilities of the CineForm 444 Codec for its 3D HERO System.

ZPAQ is an open source command line archiver for Windows and Linux. It uses a journaling or append-only format which can be rolled back to an earlier state to retrieve older versions of files and directories. It supports fast incremental update by adding only files whose last-modified date has changed since the previous update. It compresses using deduplication and several algorithms depending on the data type and the selected compression level. To preserve forward and backward compatibility between versions as the compression algorithm is improved, it stores the decompression algorithm in the archive. The ZPAQ source code includes a public domain API, libzpaq, which provides compression and decompression services to C++ applications. The format is believed to be unencumbered by patents.

The Variant Call Format or VCF is a standard text file format used in bioinformatics for storing gene sequence variations. The format was developed in 2010 for the 1000 Genomes Project and has since been used by other large-scale genotyping and DNA sequencing projects. VCF is a common output format for variant calling programs due to its relative simplicity and scalability. Many tools have been developed for editing and manipulating VCF files, including VCFtools, which was released in conjunction with the VCF format in 2011, and BCFtools, which was included as part of SAMtools until being split into an independent package in 2014.

SAMtools is a set of utilities for interacting with and post-processing short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li. These files are generated as output by short read aligners like BWA. Both simple and advanced tools are provided, supporting complex tasks like variant calling and alignment viewing as well as sorting, indexing, data extraction and format conversion. SAM files can be very large, so compression is used to save space. SAM files are human-readable text files, and BAM files are simply their binary equivalent, whilst CRAM files are a restructured column-oriented binary container format. BAM files are typically compressed and more efficient for software to work with than SAM. SAMtools makes it possible to work directly with a compressed BAM file, without having to uncompress the whole file. Additionally, since the format for a SAM/BAM file is somewhat complex - containing reads, references, alignments, quality information, and user-specified annotations - SAMtools reduces the effort needed to use SAM/BAM files by hiding low-level details.

High-throughput sequencing technologies have led to a dramatic decline of genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and 1001 Genomes Project. The storage and transfer of the tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in the development of novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods for genomic data compression.

<span class="mw-page-title-main">European Nucleotide Archive</span> Online database from the EBI on Nucleotides

The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.

Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence developed by Heng Li and Bob Handsaker et al. It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT’s PSL. The name of SAM came from Gabor Marth from University of Utah, who originally had a format under the same name but with a different syntax more similar to a BLAST output. It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project.

<span class="mw-page-title-main">Binary Alignment Map</span>

Binary Alignment Map (BAM) is the comprehensive raw data of genome sequencing; it consists of the lossless, compressed binary representation of the Sequence Alignment Map-files.

Zstandard is a lossless data compression algorithm developed by Yann Collet at Facebook. Zstd is the corresponding reference implementation in C, released as open-source software on 31 August 2016.

JPEG XL is a royalty-free raster-graphics file format that supports both lossy and lossless compression. It is designed to outperform existing raster formats and thus become their universal replacement.

MPEG-G is an ISO/IEC standard designed for genomic information representation by the collaboration of the ISO/IEC JTC 1/SC 29/WG 9 (MPEG) and ISO TC 276 "Biotechnology" Work Group 5. The goal of the standard is to provide interoperable solutions for data storage, access, and protection across different possible implementations for data information generated by high-throughput sequencing machines and their subsequent processing and analysis. The standard is composed of different parts, each one addressing a specific aspect, such as compression, metadata association, Application Programming Interfaces (APIs), and a reference software for data decoding. Together with the reference decoder software, commercial and open source implementations started to be available in 2019, covering progressively more of the published parts of the standard.

The BED format is a text file format used to store genomic regions as coordinates and associated annotations. The data are presented in the form of columns separated by spaces or tabs. This format was developed during the Human Genome Project and then adopted by other sequencing projects. As a result of this increasingly wide use, this format had already become a de facto standard in bioinformatics before a formal specification was written.

<span class="mw-page-title-main">VC-6</span> A video coding format

SMPTE ST 2117-1, informally known as VC-6, is a video coding format.

References

  1. 1 2 Hsi-Yang Fritz, Markus; Leinonen, Rasko; Cochrane, Guy; Birney, Ewan (May 2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research. 21 (5): 734–740. doi:10.1101/gr.114819.110. ISSN   1549-5469. PMC   3083090 . PMID   21245279.
  2. "Htsjdk by Broad Institute". samtools.github.io. Retrieved 2018-10-14.
  3. "Samtools". www.htslib.org. Retrieved 2018-10-14.
  4. 1 2 "JBrowse · A fast, embeddable genome browser built with HTML5 and JavaScript". jbrowse.org. Retrieved 2018-10-14.
  5. 1 2 3 Bonfield, James K. (2014-06-14). "The Scramble conversion tool". Bioinformatics. 30 (19): 2818–2819. doi:10.1093/bioinformatics/btu390. ISSN   1460-2059. PMC   4173023 . PMID   24930138.
  6. "GA4GH". www.ga4gh.org. Retrieved 2018-10-14.
  7. EMBL-EBI. "CRAM toolkit < Software < European Nucleotide Archive < EMBL-EBI". www.ebi.ac.uk. Retrieved 2018-10-14.
  8. "vadimzalunin/crammer". GitHub. 2017-08-08. Retrieved 2018-10-14.
  9. "CRAM 1.0 Specification" (PDF).
  10. "enasequence/cramtools". GitHub. 2018-10-02. Retrieved 2018-10-14.
  11. "jkbonfield/io_lib". GitHub. 2018-10-16. Retrieved 2018-10-14.
  12. "CRAM 2.1 Specification" (PDF).
  13. "CRAM 3.0 Specification" (PDF).
  14. https://github.com/zaeleus/noodles/ [ bare URL ]
  15. "CRAM 3.1 Specification" (PDF).