Mass spectrometry is a scientific technique for measuring the mass-to-charge ratio of ions. It is often coupled to chromatographic techniques such as gas or liquid chromatography and is widely used in analytical chemistry and biochemistry, where it can identify and characterize small molecules and proteins (proteomics). The large volume of data produced in a typical mass spectrometry experiment requires that computers be used for data storage and processing. Over the years, manufacturers of mass spectrometers have developed various proprietary data formats for handling such data, which makes it difficult for academic scientists to manipulate their data directly. To address this limitation, several open, XML-based data formats have been developed by the Trans-Proteomic Pipeline at the Institute for Systems Biology to facilitate data manipulation and innovation in the public sector. [1] These data formats are described here.
JCAMP-DX was one of the earliest attempts to supply a standardized file format for data exchange in mass spectrometry. It was initially developed for infrared spectrometry. JCAMP-DX is an ASCII-based format and is therefore not very compact, even though it includes standards for file compression. JCAMP was officially released in 1988. [2] Together with the American Society for Mass Spectrometry, a JCAMP-DX format for mass spectrometry was developed with the aim of preserving legacy data. [3]
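The ASCII layout of JCAMP-DX can be illustrated with a small parser. The following is a minimal sketch, not a complete reader: it assumes the common convention that each labelled data record starts with `##LABEL=value` and that a mass spectrum is stored as a `##PEAK TABLE=(XY..XY)` block of comma-separated x,y pairs. The function name `parse_jcamp_peak_table` and the sample record are hypothetical, and compressed encodings and `$$` comments are not handled.

```python
import re


def parse_jcamp_peak_table(text):
    """Return (headers, peaks) from JCAMP-DX-like text.

    headers maps record labels to their string values; peaks is a list
    of (m/z, intensity) tuples read from a PEAK TABLE block.
    """
    headers = {}
    peaks = []
    in_table = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("##"):
            # A labelled data record: "##LABEL=value".
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            in_table = label == "PEAK TABLE"
            if not in_table:
                headers[label] = value.strip()
            continue
        if in_table:
            # Pairs are separated by whitespace or semicolons,
            # x and y within a pair by a comma.
            for pair in re.split(r"[;\s]+", line):
                if pair:
                    x, y = pair.split(",")
                    peaks.append((float(x), float(y)))
    return headers, peaks


# Hypothetical sample record, written for this illustration only.
SAMPLE = """##TITLE=hypothetical example
##JCAMP-DX=5.00
##DATA TYPE=MASS SPECTRUM
##NPOINTS=3
##PEAK TABLE=(XY..XY)
18,100 28,12
44,35
##END=
"""

headers, peaks = parse_jcamp_peak_table(SAMPLE)
```

Because every value is stored as readable text, such a file is easy to inspect by eye, which is also why the format is not compact.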
The Analytical Data Interchange Format for Mass Spectrometry (ANDI) is a format for exchanging data. Many mass spectrometry software packages can read or write ANDI files. ANDI is specified in the ASTM E1947 Standard. [4] ANDI is based on netCDF, a software library for writing and reading data files. ANDI was initially developed for chromatography-MS data and was therefore not used in the proteomics gold rush, where new formats based on XML were developed. [5]
AnIML is a joint effort of IUPAC and ASTM International to create an XML-based standard that covers a wide variety of analytical techniques, including mass spectrometry. [6]
mzData was the first attempt by the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO) to create a standardized format for mass spectrometry data. [7] This format is now deprecated and has been replaced by mzML. [8]
mzXML is an XML (eXtensible Markup Language) based common file format for proteomics mass spectrometric data. [9] [10] This format was developed at the Seattle Proteome Center/Institute for Systems Biology while the HUPO-PSI was trying to specify the standardized mzData format, and it is still in use in the proteomics community.
Yet Another Format for Mass Spectrometry (YAFMS) is a proposal to save data in a four-table, server-less relational database schema, with data extraction and appending performed using SQL queries. [11]
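The idea of appending and extracting spectra through SQL in a server-less database can be sketched with SQLite, which runs inside the calling process. The four tables below (`run`, `parameter`, `spectrum`, `peak`) and all their columns are hypothetical and chosen only for illustration; they are not the published YAFMS schema.

```python
import sqlite3

# An in-memory SQLite database: no server process is involved.
conn = sqlite3.connect(":memory:")

# Hypothetical four-table layout (an assumption, not the YAFMS schema).
conn.executescript("""
CREATE TABLE run (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE parameter (
    run_id INTEGER REFERENCES run(id), key TEXT, value TEXT);
CREATE TABLE spectrum (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES run(id),
    retention_time_s REAL,
    ms_level INTEGER);
CREATE TABLE peak (
    spectrum_id INTEGER REFERENCES spectrum(id),
    mz REAL,
    intensity REAL);
""")

# Appending data is a matter of INSERT statements.
conn.execute("INSERT INTO run (id, name) VALUES (1, 'demo run')")
conn.execute("INSERT INTO parameter VALUES (1, 'polarity', 'positive')")
conn.executemany(
    "INSERT INTO spectrum VALUES (?, 1, ?, ?)",
    [(1, 12.5, 1), (2, 13.0, 2)],
)
conn.executemany(
    "INSERT INTO peak VALUES (?, ?, ?)",
    [(1, 100.05, 2500.0), (1, 250.10, 900.0),
     (2, 100.06, 400.0), (2, 175.12, 1200.0)],
)

# Extraction is a SELECT: all peaks within an m/z window, per spectrum.
rows = conn.execute(
    """
    SELECT s.id, p.mz, p.intensity
    FROM peak AS p
    JOIN spectrum AS s ON s.id = p.spectrum_id
    WHERE p.mz BETWEEN ? AND ?
    ORDER BY s.id, p.mz
    """,
    (100.0, 101.0),
).fetchall()
```

With this layout, selecting a slice of the data by m/z or retention time does not require reading the whole file, which is the main attraction of a relational design.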
Because having two formats (mzData and mzXML) that represent the same information is undesirable, HUPO-PSI, the SPC/ISB and instrument vendors set up a joint effort to create a unified standard that borrows the best aspects of both mzData and mzXML and is intended to replace them. Originally called dataXML, it was officially announced as mzML. [12] The first specification was published in June 2008. [13] This format was officially released at the 2008 American Society for Mass Spectrometry Meeting and has since been relatively stable, with very few updates. On 1 June 2009, mzML 1.1.0 was released. As of 2013, no further changes were planned.
Instead of defining new file formats and writing converters for proprietary vendor formats, a group of scientists proposed defining a common application programming interface, which would shift the burden of standards compliance to the instrument manufacturers' existing data access libraries. [14]
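The interface-based approach can be sketched as an abstract class that each vendor library would implement, so that analysis code depends only on the interface and never on a file format. Every name below (`SpectrumSource`, `InMemorySource`, `base_peak`) is hypothetical and invented for this illustration; the proposal's actual interface is not reproduced here.

```python
from abc import ABC, abstractmethod


class SpectrumSource(ABC):
    """Hypothetical common interface a vendor library would implement."""

    @abstractmethod
    def spectrum_count(self):
        """Return the number of spectra in the data set."""

    @abstractmethod
    def read_spectrum(self, index):
        """Return (mz_values, intensities) for the spectrum at index."""


class InMemorySource(SpectrumSource):
    """Stand-in for a wrapper around a vendor's data access library."""

    def __init__(self, spectra):
        self._spectra = list(spectra)

    def spectrum_count(self):
        return len(self._spectra)

    def read_spectrum(self, index):
        return self._spectra[index]


def base_peak(source, index):
    """Return (m/z, intensity) of the most intense peak of one spectrum.

    Written against the interface only, so it works unchanged with any
    implementation of SpectrumSource.
    """
    mz, intensity = source.read_spectrum(index)
    top = max(range(len(intensity)), key=intensity.__getitem__)
    return mz[top], intensity[top]


source = InMemorySource([([100.0, 150.0, 200.0], [10.0, 80.0, 25.0])])
result = base_peak(source, 0)
```

Under this design no conversion step exists: supporting a new instrument means supplying another implementation of the interface rather than another converter.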
The mz5 format addresses the performance problems of the earlier XML-based formats. It uses the mzML ontology but saves the data using the HDF5 backend, which reduces storage space requirements and improves read/write speed. [15]
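One cost of XML-based storage is that numeric arrays must be embedded as text. The sketch below assumes the encoding used by mzML, in which an array is packed as little-endian 32-bit or 64-bit floats, optionally zlib-compressed, and then base64-encoded. The helper names `encode_array` and `decode_array` are hypothetical.

```python
import base64
import struct
import zlib


def encode_array(values, double_precision=True, compress=False):
    """Pack floats little-endian, optionally zlib-compress, then base64."""
    code = "d" if double_precision else "f"
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, double_precision=True, compressed=False):
    """Reverse encode_array and return a list of floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    width = 8 if double_precision else 4
    code = "d" if double_precision else "f"
    count = len(raw) // width
    return list(struct.unpack("<%d%s" % (count, code), raw))


mz_values = [100.0, 200.5, 300.25]
encoded = encode_array(mz_values)
decoded = decode_array(encoded)

# Three doubles occupy 24 bytes in binary but 32 characters as base64.
binary_size = 8 * len(mz_values)
text_size = len(encoded)
```

Base64 turns every 3 bytes into 4 characters, so uncompressed arrays grow by about a third, and each array must be decoded before use. A binary container such as HDF5 can store the same arrays without that step.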
The imzML standard was proposed to exchange data from mass spectrometry imaging in a standardized XML file based on the mzML ontology. It splits the data into an XML file that holds the experimental metadata and a binary file that holds the spectral data. Both files are linked by a universally unique identifier. [16]
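The linkage between the two files can be sketched as follows. The example assumes that the binary file begins with the 16-byte identifier and that the XML records, for each spectrum, the offset and number of points of its arrays. The XML element and attribute names used here are simplified, hypothetical stand-ins and not the real imzML schema.

```python
import struct
import uuid
import xml.etree.ElementTree as ET


def build_pair(mz, intensity):
    """Create a (xml_text, binary_bytes) pair sharing one UUID."""
    file_id = uuid.uuid4()
    n = len(mz)
    payload = struct.pack("<%df" % n, *mz) + struct.pack("<%df" % n, *intensity)
    # Assumption: the binary file starts with the 16 UUID bytes.
    binary = file_id.bytes + payload
    root = ET.Element("imaging_run", uuid=file_id.hex)
    # The m/z array starts right after the UUID, at byte offset 16.
    ET.SubElement(root, "spectrum", x="1", y="1", offset="16", points=str(n))
    return ET.tostring(root, encoding="unicode"), binary


def read_spectrum(xml_text, binary):
    """Check that the files belong together, then read the spectrum."""
    root = ET.fromstring(xml_text)
    if uuid.UUID(root.get("uuid")).bytes != binary[:16]:
        raise ValueError("XML and binary files do not belong together")
    spec = root.find("spectrum")
    offset = int(spec.get("offset"))
    n = int(spec.get("points"))
    mz = struct.unpack_from("<%df" % n, binary, offset)
    # The intensity array follows the m/z array (4 bytes per float32).
    intensity = struct.unpack_from("<%df" % n, binary, offset + 4 * n)
    return list(mz), list(intensity)


xml_text, binary = build_pair([100.0, 200.5], [10.0, 20.0])
mz, intensity = read_spectrum(xml_text, binary)
```

Keeping the bulky arrays out of the XML leaves the metadata file small and readable, while the identifier check guards against pairing it with the wrong binary file.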
mzDB stores data in an SQLite database to reduce storage space and improve access times, as the data points can be queried from a relational database. [17]
Toffee is an open lossless file format for data-independent acquisition mass spectrometry. It leverages HDF5 and aims to achieve file sizes similar to those from the proprietary and closed vendor formats. [18]
mzMLb is another approach to using an HDF5 backend for efficient storage of raw data. Unlike mz5, however, it preserves the mzML XML data structure and stays compliant with the existing standard. [19]
Below is a table of different file format extensions.
Company | Extension | File type |
---|---|---|
ACD/Labs | *.spectrus | Imports LC/MS and GC/MS data from most major instrument vendors listed here |
Agilent/Bruker | .D (folder) | Agilent MassHunter, Agilent ChemStation, or Bruker BAF/YEP/TDF data format |
Agilent/Bruker | .YEP | instrument data format |
Agilent | .AEV, .ASR | ASCII Report format (for Analytical Studio Reviewer) |
Bruker | .BAF | instrument data format |
Bruker | .FID | instrument data format |
Bruker | .TDF | timsTOF instrument data format |
ABI/Sciex | .WIFF, .WIFF2 | instrument data format |
ABI/Sciex | .t2d | 4700 and 4800 file format |
ABI/Sciex | .dat | Voyager-DE series file format |
Waters | .PKL | MassLynx peak list format |
Thermo/PerkinElmer | .RAW* | Thermo Xcalibur; PerkinElmer TurboMass |
Micromass**/Waters | .RAW* (folder) | Waters MassLynx |
Chromtech/Finnigan***/VG | .DAT | Finnigan ITDS file format; MAT95 instrument data format; MassLab data format |
Finnigan*** | .MS | ITS40 instrument data format |
Shimadzu | .QGD | GCMSSolution format |
Shimadzu | .qgd | instrument data format |
Shimadzu | .lcd | QQQ/QTOF instrument data format |
Shimadzu | .spc | library data format |
Bruker/Varian | .SMS | instrument data format |
Bruker/Varian | .XMS | instrument data format |
ION-TOF | .itm | raw measurement data |
ION-TOF | .ita | analysis data |
Physical Electronics/ULVAC-PHI | .raw* | raw measurement data |
Physical Electronics/ULVAC-PHI | .tdc | spectrum data |
(*) Note that the RAW formats of each vendor are not interchangeable; software from one cannot handle the RAW files from another.
(**) Micromass was acquired by Waters in 1997.
(***) Finnigan is a division of Thermo.
There are several viewers for mzXML, mzML and mzData. These viewers are of two types: free and open-source software (FOSS) or proprietary.
In the FOSS viewer category, one can find MZmine, [20] mineXpert2 (mzXML, mzML, native timsTOF, xy, MGF, BafAscii), [21] MS-Spectre, [22] TOPPView (mzXML, mzML and mzData), [23] Spectra Viewer, [24] SeeMS, [25] msInspect, [26] and jmzML. [27]
In the proprietary category, one can find PEAKS, [28] Insilicos, [29] Mascot Distiller, [30] and Elsci Peaksel. [31]
There is a viewer for ITA images. [32] ITA and ITM images can be parsed with the pySPM Python library. [33]
Known converters for mzData to mzXML:
Known converters for mzXML:
Known converters for mzML:
Converters for proprietary formats:
Currently available converters are: