| Filename extension | .zarr |
|---|---|
| Latest release | 3 |
| Type of format | Multidimensional array |
| Open format? | Yes |
| Free format? | Yes |
| Website | zarr.dev |
Zarr is an open standard for storing large multidimensional array data. It specifies a protocol and data format, and is designed to be "cloud ready": data is divided into subsets referred to as chunks, which enables efficient random access. [1] [2] Zarr can be used from many programming languages, including Python, Java, JavaScript, C++, Rust and Julia. [3] It has been used by organisations such as Google and Microsoft to publish large datasets. [4] [5]
Zarr is designed to support high-throughput distributed I/O on different storage systems, a common requirement in cloud computing. Multiple reads of a Zarr array can proceed efficiently in parallel, as can multiple writes. [6]
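As an illustration of this access pattern, the following Python sketch (using the zarr-python v2-style API; the path, shape, and chunk sizes are illustrative) writes disjoint chunk-aligned regions from several threads, which needs no locking because each writer touches its own chunks:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

# Four row-blocks, each mapping to exactly one 1000x1000 chunk,
# so the parallel writers below never share a chunk.
z = zarr.open("example.zarr", mode="w", shape=(4000, 1000),
              chunks=(1000, 1000), dtype="f8")

def write_block(i):
    # Each thread writes a disjoint, chunk-aligned region;
    # no coordination between writers is required.
    z[i * 1000:(i + 1) * 1000, :] = np.random.random((1000, 1000))

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(write_block, range(4)))  # drain the iterator to surface errors
```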
Zarr's main data structure is the multidimensional array. To make access parallelisable, each array is stored and accessed as a grid of so-called "chunks". The actual data format on disk depends on the compressor and storage plugins selected by the user. [6]
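A minimal Python sketch of these choices, assuming the zarr-python v2-style API with the numcodecs Blosc compressor (the file name and sizes are illustrative):

```python
import zarr
from numcodecs import Blosc

# A 10000x10000 float array stored as a grid of 1000x1000 chunks,
# each compressed independently with Blosc/zstd.
z = zarr.open("chunked.zarr", mode="w",
              shape=(10000, 10000), chunks=(1000, 1000),
              dtype="f4", compressor=Blosc(cname="zstd", clevel=3))

# A slice assignment touches only the chunks it overlaps, which is
# what makes random access to very large arrays cheap.
z[:1000, :1000] = 42.0
```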
Zarr's design was influenced by that of HDF5, and so it includes similar features for metadata and grouping: arrays can be grouped into named hierarchies, and they can also be annotated with key-value metadata stored alongside the array. [6]
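A short Python sketch of such a hierarchy with attached metadata, again assuming the zarr-python v2-style API (all names and values here are hypothetical):

```python
import zarr

# A root group containing a named subgroup and an annotated array.
root = zarr.open_group("hierarchy.zarr", mode="w")
obs = root.create_group("observations")
temp = obs.zeros("temperature", shape=(365, 720, 1440),
                 chunks=(1, 720, 1440), dtype="f4")

# Key-value metadata is stored alongside the array.
temp.attrs["units"] = "K"
temp.attrs["description"] = "hypothetical daily temperature grid"
```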
For bioimaging disciplines such as microscopy, a consortium called the Open Microscopy Environment (OME) created a format called "OME-Zarr", based on Zarr with some discipline-specific extensions. [7] Similarly, Zarr is being used to publish weather and satellite data [8] and energy data, [9] among other datasets.
Portable Network Graphics is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF)—unofficially, the initials PNG stood for the recursive acronym "PNG's not GIF".
Waveform Audio File Format is an audio file format standard for storing an audio bitstream on personal computers. The format was first published in 1991 by IBM and Microsoft. It is the main format used on Microsoft Windows systems for uncompressed audio. The usual bitstream encoding is the linear pulse-code modulation (LPCM) format.
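As an illustration of LPCM storage, the following Python sketch uses the standard-library wave module to write one second of a 440 Hz tone as 16-bit LPCM (the output file name is hypothetical):

```python
import math
import struct
import wave

RATE = 44100  # samples per second

# One second of a 440 Hz sine wave as signed 16-bit LPCM samples.
frames = b"".join(
    struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * n / RATE)))
    for n in range(RATE)
)

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # two bytes per sample (16-bit)
    w.setframerate(RATE)
    w.writeframes(frames)
```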
Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
Google File System is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. It was replaced by Colossus in 2010.
Essbase is a multidimensional database management system (MDBMS) that provides a platform upon which to build analytic applications. Essbase began as a product from Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007. Until late 2005 IBM also marketed an OEM version of Essbase as DB2 OLAP Server.
A container format or metafile is a file format that allows multiple data streams to be embedded into a single file, usually along with metadata for identifying and further detailing those streams. Notable examples of container formats include archive files and formats used for multimedia playback. Among the earliest cross-platform container formats were Distinguished Encoding Rules and the 1985 Interchange File Format.
In computing, an archive file is a computer file that is composed of one or more files along with metadata. Many archive formats also support compression of member files. Archive files are used to collect multiple data files together into a single file for easier portability and storage, or simply to compress files to use less storage space. Archive files often store directory structures, error detection and correction information, comments, and some use built-in encryption.
COM Structured Storage is a technology developed by Microsoft as part of its Windows operating system for storing hierarchical data within a single file. Strictly speaking, the term structured storage refers to a set of COM interfaces that a conforming implementation must provide, and not to a specific implementation, nor to a specific file format. In addition to providing a hierarchical structure for data, structured storage may also provide a limited form of transactional support for data access. Microsoft provides an implementation that supports transactions, as well as one that does not.
Apache Hadoop is a collection of open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
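The MapReduce model itself is independent of Hadoop's Java API; the following single-process Python sketch illustrates the map and reduce phases with the canonical word-count example (this is a model of the idea, not Hadoop code):

```python
from collections import defaultdict

def map_phase(document):
    # The mapper emits a (key, value) pair for every word it sees.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # The reducer aggregates all values that share a key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}
```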
In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent.
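A simplified Python sketch of the idea: split data into fixed-size blocks, store each distinct block once under its content hash, and keep a list of hashes from which the original can be rebuilt (real systems typically add variable-size chunking and indexing):

```python
import hashlib

def deduplicate(data, block_size=4096):
    # Store each distinct block once, keyed by its SHA-256 digest,
    # and record the sequence of digests needed to rebuild the data.
    store, recipe = {}, []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

def reconstruct(store, recipe):
    return b"".join(store[d] for d in recipe)

data = b"A" * 8192 * 3 + b"B" * 8192  # highly repetitive input
store, recipe = deduplicate(data)
assert reconstruct(store, recipe) == data
print(f"{len(recipe)} blocks referenced, {len(store)} stored")
```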
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR), which is also the chief source of netCDF software, standards development, and updates. The format is an open standard. NetCDF Classic and 64-bit Offset Format are an international standard of the Open Geospatial Consortium.
Web storage, sometimes known as DOM storage, is a standard JavaScript API provided by web browsers. It enables websites to store persistent data on users' devices similar to cookies, but with much larger capacity and no information sent in HTTP headers. There are two main web storage types: local storage and session storage, behaving similarly to persistent cookies and session cookies respectively. Web Storage is standardized by the World Wide Web Consortium (W3C) and WHATWG, and is supported by all major browsers.
Windows Runtime (WinRT) is a platform-agnostic component and application architecture first introduced in Windows 8 and Windows Server 2012 in 2012. It is implemented in C++ and officially supports development in C++, Rust/WinRT, Python/WinRT, JavaScript-TypeScript, and the managed code languages C# and Visual Basic (.NET) (VB.NET).
HTML audio is a subject of the HTML specification, incorporating audio input, playback, and synthesis, as well as speech to text, all in the browser.
rasdaman is an Array DBMS, that is, a database management system which adds capabilities for storage and retrieval of massive multi-dimensional arrays, such as sensor, image, simulation, and statistics data. A frequently used synonym for arrays is raster data, as in 2-D raster graphics; this is what motivated the name rasdaman. However, rasdaman places no limit on the number of dimensions: it can serve, for example, 1-D measurement data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z exploration data, 4-D ocean and climate data, and even data beyond spatio-temporal dimensions.
An array database management system or array DBMS provides database services specifically for arrays, that is, homogeneous collections of data items sitting on a regular grid of one, two, or more dimensions. Arrays are often used to represent sensor, simulation, image, or statistics data. Such arrays tend to be big data, with single objects frequently ranging into terabyte and soon petabyte sizes; for example, today's Earth and space observation archives typically grow by terabytes a day. Array databases aim to offer flexible, scalable storage and retrieval for this category of information.
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from MapR developers, Drill is inspired by Google's Dremel system. Tom Shiran is the founder of the Apache Drill project, which was designated an Apache Software Foundation top-level project in December 2016.
Object storage is a computer data storage approach that manages data as "blobs" or "objects", as opposed to other storage architectures like file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks within sectors and tracks. Each object is typically associated with a variable amount of metadata, and a globally unique identifier. Object storage can be implemented at multiple levels, including the device level, the system level, and the interface level. In each case, object storage seeks to enable capabilities not addressed by other storage architectures, like interfaces that are directly programmable by the application, a namespace that can span multiple instances of physical hardware, and data-management functions like data replication and data distribution at object-level granularity.
A distributed file system for cloud is a file system that allows many clients to access data and supports operations on that data. Each data file may be partitioned into several parts called chunks, and each chunk may be stored on a different remote machine, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured; confidentiality, availability and integrity are the main requirements of a secure system.
BisQue is a free, open source web-based platform for the exchange and exploration of large, complex datasets. It is being developed at the Vision Research Lab at the University of California, Santa Barbara. BisQue specifically supports large scale, multi-dimensional multimodal-images and image analysis. Metadata is stored as arbitrarily nested and linked tag/value pairs, allowing for domain-specific data organization. Image analysis modules can be added to perform complex analysis tasks on compute clusters. Analysis results are stored within the database for further querying and processing. The data and analysis provenance is maintained for reproducibility of results. BisQue can be easily deployed in cloud computing environments or on computer clusters for scalability. BisQue has been integrated into the NSF Cyberinfrastructure project CyVerse. The user interacts with BisQue via any modern web browser.