MPEG-G (ISO/IEC 23092) is an ISO/IEC standard for the representation of genomic information, developed jointly by ISO/IEC JTC 1/SC 29/WG 9 (MPEG) and ISO TC 276 "Biotechnology" Work Group 5. The goal of the standard is to provide interoperable solutions for the storage, access, and protection of data generated by high-throughput sequencing machines and by their subsequent processing and analysis, across different possible implementations. [1] [2] The standard is composed of several parts, each addressing a specific aspect, such as compression, metadata association, Application Programming Interfaces (APIs), and reference software for data decoding. Alongside the reference decoder software, commercial and open source [3] implementations became available starting in 2019 and have progressively covered more of the published parts of the standard.
The advent of high-throughput sequencing (HTS) technologies has revolutionized the field of quantitative biology. Large collections of genomic information are now part of everyday practice and have become a cornerstone of a number of disciplines, ranging from biological research to personalized medicine in the clinic. Genomic information is mostly exchanged through a variety of data formats, such as FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. The ISO/IEC 23092 (MPEG-G) standard aims to provide a unified format for the efficient representation and compression of such diverse data, both for file storage and for data transport. To that end, the standard is divided into several parts.
The MPEG-G standard uses technology and data representation architectures previously validated in the field of digital media. They make it possible to compress and transport genome sequencing data even in complex scenarios, for instance when access is needed to large amounts of possibly distributed data, or when part of the data needs to be encrypted for privacy reasons. Conceptually, such requirements lead to the definition of a number of mutually interrelated mechanisms.
In turn, some of these topics have been grouped together in order to make the standard easier to understand and implement. As a result, the ISO/IEC 23092 standard is physically structured as a series of separate documents, as follows:
Part | Number | First public release date (first edition) | Latest public release date (edition) | Latest amendment | Title | Description |
---|---|---|---|---|---|---|
Part 1 | ISO/IEC 23092-1 | 2019 | 2019 | | Transport and Storage of Genomic Information | Specification of file format, streaming and indexing [4] |
Part 2 | ISO/IEC 23092-2 | 2019 | 2019 | | Coding of Genomic Information | Compression of unmapped (raw) and aligned genome sequencing data [5] |
Part 3 | ISO/IEC 23092-3 | 2020 | 2020 | | Metadata and Application Programming Interfaces (APIs) | Specification of standard interfaces, syntax for metadata and description of content protection mechanisms [6] |
Part 4 | ISO/IEC 23092-4 | (2020) | | | Reference Software | Describes the open source implementation of a normative decoder and an informative encoder. It also provides compressed bitstreams that can be used for reference purposes. Other open source implementations developed by independent groups also exist [8] [9] |
Part 5 | ISO/IEC 23092-5 | (2020) | | | Conformance testing | Details the testing procedure and the associated compressed reference bitstreams to be used to assess the conformance of a decoder implementation with the MPEG-G standard [10] |
Part 6 | ISO/IEC 23092-6 | (2021) | | | Coding of genomic annotations | Compressed representation of genomic annotations, that is, a number of heterogeneous data types associated with intervals of the reference genome that the sequencing data has been aligned to [7] |
ISO/IEC 23092-1 specifies how the genomic data is organized within MPEG-G structures for transport (i.e., streaming) and storage. Formats of genomic record, reference record, MPEG-G file and transport stream are defined in this part. It introduces Access Unit as the container of the compressed genomic data and provides a reference conversion process among different formats.
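As a rough illustration of why a container unit combined with an index is useful, the following sketch models selective access to a genomic region. It is not the normative Part 1 syntax: the `AccessUnit` class, its field names and the `select_units` helper are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessUnit:
    """Toy stand-in for an access unit; all field names are illustrative."""
    sequence_id: str   # reference sequence the contained reads map to
    start: int         # first covered position (inclusive)
    end: int           # last covered position (inclusive)
    payload: bytes     # opaque compressed data for this unit


def select_units(units: List[AccessUnit], sequence_id: str,
                 start: int, end: int) -> List[AccessUnit]:
    """Return the units whose interval overlaps [start, end] on sequence_id."""
    return [
        u for u in units
        if u.sequence_id == sequence_id and u.start <= end and u.end >= start
    ]


# Three units; only the first two overlap the queried region on chr1.
units = [
    AccessUnit("chr1", 0, 9_999, b"..."),
    AccessUnit("chr1", 10_000, 19_999, b"..."),
    AccessUnit("chr2", 0, 9_999, b"..."),
]
hits = select_units(units, "chr1", 9_500, 10_500)
```

Under this simplified model, a reader interested in one region only needs to decode the payloads of the selected units rather than the whole file.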
ISO/IEC 23092-2 specifies the syntax and methods for MPEG-G lossless compression of sequencing data and lossy compression of the associated quality scores. As is typical for MPEG standards, MPEG-G only specifies the decoding process, while the encoding process is left open to algorithmic and implementation-specific innovations. All conforming MPEG-G decoders produce identical outputs from the multiplexed bitstreams included in MPEG-G files and from the data streams in streaming scenarios.
The input of the encoder consists of genomic records or metadata, with optional reference data, while its output is an MPEG-G file or transport stream.
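The idea of lossy quality score coding can be sketched with a simple binning quantizer. This is only an illustration of the general technique, not the quantizer defined in Part 2: the bin boundaries, the representative values and the function names are assumptions made for this example.

```python
from bisect import bisect_right
from typing import List

# Hypothetical bins (not taken from the standard):
# [0, 10), [10, 20), [20, 30), [30, inf), with one representative value each.
BIN_UPPER_BOUNDS = [10, 20, 30]
BIN_REPRESENTATIVES = [6, 15, 25, 37]


def quantize(qualities: List[int]) -> List[int]:
    """Encoder side: map each Phred score to the index of its bin (lossy)."""
    return [bisect_right(BIN_UPPER_BOUNDS, q) for q in qualities]


def reconstruct(indices: List[int]) -> List[int]:
    """Decoder side: map each bin index back to its representative score."""
    return [BIN_REPRESENTATIVES[i] for i in indices]


original = [2, 11, 19, 28, 33, 40]
indices = quantize(original)     # small alphabet, cheaper to entropy-code
decoded = reconstruct(indices)   # close to, but not equal to, the original
```

The reconstruction step depends only on the transmitted indices and the table, so any decoder applying it yields the same output, while the choice of bins remains an encoder-side decision.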
ISO/IEC 23092-3 specifies a metadata format and provides genomic data representation APIs to support interoperability among existing tools and systems. Part 3 specifies how an MPEG-G compliant bitstream can be integrated with metadata, as well as mechanisms for access control, integrity verification, authentication and authorization. This part also contains an informative section devoted to the mapping between SAM and MPEG-G data structures, including backward compatibility with existing SAM content. It defines:
Functions Group | Brief Description |
---|---|
Genomic Information | Functions used to query the structure of, and retrieve, the genomic information coded in a bitstream compliant with ISO/IEC 23092 series. |
Metadata | Functions used to query the structure of, and retrieve, the metadata associated with the coded genomic data. |
Protection | Functions used to retrieve the protection metadata associated with the coded genomic data. |
Reference | Functions used to retrieve the reference associated with a dataset. |
Statistics | Functions used to retrieve statistics associated with a dataset. |
ISO/IEC 23092-4 [9] specifies the genomic information representation reference software, referred to as the genomic model (GM). It consists of two components: the reference encoder software and the reference decoder software. While the reference decoder software is provided to assess conformance to the requirements of ISO/IEC 23092-1, [4] ISO/IEC 23092-2 [5] and ISO/IEC 23092-6, [7] the reference encoder software serves as a guide for the implementation of the aforementioned standards. The reference encoder software, called Genie, [3] is open source software developed by a group of individuals from multiple universities and companies around the world. It features the following components:
Part | Number | Component | Description |
---|---|---|---|
Part 1 [4] | ISO/IEC 23092-1 | Encapsulation | |
Part 1 [4] | ISO/IEC 23092-1 | Indexing | |
Part 2 [5] | ISO/IEC 23092-2 | Classification | |
Part 2 [5] | ISO/IEC 23092-2 | Reference engine | |
Part 2 [5] | ISO/IEC 23092-2 | Quality value quantization | |
Part 2 [5] | ISO/IEC 23092-2 | Descriptor subsequence generation | |
Part 2 [5] | ISO/IEC 23092-2 | Transformations | |
Part 2 [5] | ISO/IEC 23092-2 | Entropy encoding | |
Part 6 | ISO/IEC 23092-6 | (To be determined) | |
ISO/IEC 23092-5 specifies conformance of the coding of genomic information. Part 5 provides a means to test and validate the correct implementation of the MPEG-G technology in different devices and applications to ensure the interoperability among all systems. It specifies a normative procedure to assess conformity to the standard on an exhaustive set of compressed data.
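In spirit, such a procedure amounts to decoding a set of reference bitstreams and checking that the output matches the expected result. The sketch below is a deliberately simplified harness, not the normative Part 5 procedure: comparing SHA-256 digests, the `check_conformance` function and the toy decoder are all assumptions made for this example.

```python
import hashlib
from typing import Callable, Dict, List


def digest(data: bytes) -> str:
    """Hex SHA-256 digest used here to compare decoded outputs."""
    return hashlib.sha256(data).hexdigest()


def check_conformance(decode: Callable[[bytes], bytes],
                      bitstreams: Dict[str, bytes],
                      expected_digests: Dict[str, str]) -> List[str]:
    """Return the names of bitstreams whose decoded output does not match."""
    failures = []
    for name, stream in bitstreams.items():
        if digest(decode(stream)) != expected_digests[name]:
            failures.append(name)
    return failures


# Toy "decoder" used only to exercise the harness: it reverses the bytes.
def toy_decode(stream: bytes) -> bytes:
    return stream[::-1]


bitstreams = {"case_a": b"ACGT", "case_b": b"TTGA"}
expected = {name: digest(s[::-1]) for name, s in bitstreams.items()}
failures = check_conformance(toy_decode, bitstreams, expected)
```

An empty `failures` list means every test stream decoded to the expected output. A decoder under test would be passed in place of `toy_decode`.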
No MIME type (RFC 6838 based IANA media type) is currently defined for MPEG-G files.
No conventional file extensions are defined.