Zarr (data format)

Filename extension: .zarr
Latest release: 3
Type of format: Multidimensional array
Open format: Yes
Free format: Yes
Website: zarr.dev

Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready": it supports random access by dividing data into subsets called chunks. [1] [2] Zarr can be used from many programming languages, including Python, Java, JavaScript, C++, Rust and Julia. [3] Organizations such as Google and Microsoft have used it to publish large datasets. [4] [5] The first versions of Zarr were released in 2015 by Alistair Miles. [6] [7]

Zarr is designed to support high-throughput distributed I/O on different storage systems, which is a common requirement in cloud computing. Multiple read operations, or multiple write operations, can run efficiently in parallel against a single Zarr array. [8]
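
One way to see why parallel writes are possible is that each chunk can be stored as a separate entry, so workers writing different chunks never touch the same data. The following toy sketch illustrates the idea with a Python dictionary standing in for a storage system; the store, the `write_chunk` helper and the key format are assumptions made for illustration, not the Zarr API.

```python
from concurrent.futures import ThreadPoolExecutor

store = {}  # stand-in for a directory or an object-storage bucket

def write_chunk(chunk_coords, value, chunk_len=4):
    """Write one chunk under its own key; no other chunk is touched."""
    key = "/".join(str(c) for c in chunk_coords)
    store[key] = bytes([value] * chunk_len)
    return key

# Hypothetical 2 x 3 grid of chunks, each written by an independent task.
tasks = [((i, j), i * 3 + j) for i in range(2) for j in range(3)]
with ThreadPoolExecutor(max_workers=4) as pool:
    keys = list(pool.map(lambda t: write_chunk(*t), tasks))
```

Because every task writes to a distinct key, the order in which the workers finish does not affect the stored result.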

Format description

An illustration of Zarr's chunking data format.

The core data structure in Zarr is the multidimensional array. To allow parallel access, each array is stored and accessed as a grid of "chunks". The actual data format on disk depends on the compressor and storage plugins selected by the user. [8]
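
To make the chunk grid concrete, the following minimal sketch shows how an element index maps to a chunk and to an offset within that chunk. It is plain Python rather than a Zarr implementation; the function name `locate` and the example shape and chunk sizes are assumptions chosen for illustration.

```python
from math import ceil

def locate(index, chunk_shape):
    """Map an element index to (chunk coordinates, offset inside the chunk)."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunk_shape))
    offset = tuple(i % c for i, c in zip(index, chunk_shape))
    return chunk_coords, offset

# Hypothetical array: 10000 x 10000 elements, stored as 1000 x 1000 chunks.
shape = (10000, 10000)
chunk_shape = (1000, 1000)

# Number of chunks along each dimension (the last chunk may be partial).
grid_shape = tuple(ceil(s / c) for s, c in zip(shape, chunk_shape))

chunk_coords, offset = locate((2500, 9999), chunk_shape)
```

In this sketch, reading element (2500, 9999) only requires chunk (2, 9); the remaining chunks of the 10 x 10 grid are not needed.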

Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array. [8]
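
As a rough illustration of this model, the sketch below represents a hierarchy as a flat mapping from paths to JSON metadata documents, with user attributes stored next to each node. The paths, attribute names and the `children` and `attributes` helpers are hypothetical, and the JSON layout is simplified rather than the exact metadata schema defined by the Zarr specification.

```python
import json

# Flat key-value store: one metadata document per group or array.
store = {
    "climate": json.dumps({"node_type": "group",
                           "attributes": {"source": "example"}}),
    "climate/temperature": json.dumps({"node_type": "array",
                                       "shape": [365, 180, 360],
                                       "attributes": {"units": "K"}}),
    "climate/pressure": json.dumps({"node_type": "array",
                                    "shape": [365, 180, 360],
                                    "attributes": {"units": "Pa"}}),
}

def children(group_path):
    """List the names of nodes stored directly under a group."""
    prefix = group_path + "/"
    return sorted(k[len(prefix):] for k in store
                  if k.startswith(prefix) and "/" not in k[len(prefix):])

def attributes(path):
    """Return the key-value metadata attached to a node."""
    return json.loads(store[path])["attributes"]
```

Here `children("climate")` lists the two arrays in the group, and `attributes("climate/temperature")` returns the annotations stored alongside that array.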

Applications

Representation of microscopy data for high-content screening using OME-Zarr.

Because it handles tensors efficiently, Zarr is used to publish weather and satellite data [9] and energy data, [10] among other applications.

For bioimaging such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions. [11] The specification enables a granular representation of the outputs of complex experiments, such as high-content screening assays. Each plate read by the microscope contains multiple wells, and scanning each well requires multiple fields. Each image may have up to five dimensions (time points, imaging channels and the three spatial dimensions). The data may also include resolution pyramids, which improve the performance of visualization tools. Because Zarr organizes data into multiple directories, each of these fields can be specified and retrieved independently, for example by requesting its URL from an object store. [11]
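
The independent retrieval of fields can be sketched as simple path construction. The example assumes a plate/row/column/field/resolution-level directory nesting for high-content screening data; the base URL, the `field_url` helper and the specific names are hypothetical, and the OME-Zarr specification should be consulted for the authoritative layout.

```python
def field_url(base_url, row, column, field, level=0):
    """Build the URL prefix of one field of view at one resolution level."""
    return f"{base_url.rstrip('/')}/{row}/{column}/{field}/{level}"

# Hypothetical plate stored in an object store (assumed URL).
base = "https://example.org/bucket/plate.zarr/"
url = field_url(base, row="B", column=3, field=0, level=2)
```

A viewer that only needs a low-resolution overview of one field could request data under this single prefix without downloading the rest of the plate.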

References

  1. "Zarr - chunked, compressed, N-dimensional arrays". zarr.dev. Retrieved 2024-09-12.
  2. "Cloud-Optimized Geospatial Formats Guide: Zarr". guide.cloudnativegeo.org. Retrieved 2024-09-12.
  3. "Zarr Implementations". zarr.dev. Retrieved 2025-01-09.
  4. "Google Cloud: ERA5 data". cloud.google.com. Retrieved 2024-09-12.
  5. "Microsoft Planetary Computer: Reading Zarr Data". planetarycomputer.microsoft.com. Retrieved 2024-09-12.
  6. "zarr - PyPI". Retrieved 2025-02-10.
  7. Alistair Miles (2016-04-14). "To HDF5 and beyond". Retrieved 2025-02-10.
  8. "Zarr - Tutorial". zarr.readthedocs.io. Retrieved 2024-09-12.
  9. "Lazy loading: Making it easier to access vast datasets of weather & satellite data". openclimatefix.org. Archived from the original on 2024-09-12. Retrieved 2024-09-12.
  10. Sansal, Altay; Kainkaryam, Sribharath; Lasscock, Ben; Valenciano, Alejandro (2023). "MDIO: Open-source format for multidimensional energy data". The Leading Edge. 42 (7). Society of Exploration Geophysicists: 465–473. Bibcode: 2023LeaEd..42..465S. doi: 10.1190/tle42070465.1. ISSN 1938-3789.
  11. Moore, Josh (2023). "OME-Zarr: a cloud-optimized bioimaging file format with international community support". Histochemistry and Cell Biology. 160 (3). Springer Science and Business Media LLC: 223–251. doi: 10.1007/s00418-023-02209-1. hdl: 1721.1/151126. ISSN 1432-119X. PMC 10492740. PMID 37428210.