Hierarchical Data Format

Filename extension: .hdf, .h4, .hdf4, .he2, .h5, .hdf5, .he5
Internet media type: application/x-hdf, application/x-hdf5
Magic number: \211HDF\r\n\032\n
Developed by: The HDF Group
Latest release: HDF5 1.14.3 (30 October 2023) [1]
Type of format: Scientific data format
Open format?: Yes
Website: www.hdfgroup.org/solutions/hdf5

Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

In keeping with this goal, the HDF libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms and programming languages. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView). [2]

The current version, HDF5, differs significantly in design and API from the major legacy version HDF4.

Early history

The quest for a portable scientific data format, originally dubbed AEHOO (All Encompassing Hierarchical Object Oriented format), was begun in 1987 by the Graphics Foundations Task Force (GFTF) at the National Center for Supercomputing Applications (NCSA). NSF grants received in 1990 and 1992 were important to the project. Around this time NASA investigated 15 different file formats for use in the Earth Observing System (EOS) project. After a two-year review process, HDF was selected as the standard data and information system. [3]

HDF4

HDF4 is the older version of the format, although still actively supported by The HDF Group. It supports a proliferation of different data models, including multidimensional arrays, raster images, and tables. Each defines a specific aggregate data type and provides an API for reading, writing, and organizing the data and metadata. New data models can be added by the HDF developers or users.

HDF is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects. Users can create their own grouping structures called "vgroups."

The HDF4 format has many limitations. [4] [5] It lacks a clear object model, which makes continued support and improvement difficult. Supporting many different interface styles (images, tables, arrays) leads to a complex API. Support for metadata depends on which interface is in use; SD (Scientific Dataset) objects support arbitrary named attributes, while other types only support predefined metadata. Perhaps most importantly, the use of 32-bit signed integers for addressing limits HDF4 files to a maximum of 2 GB, which is unacceptable in many modern scientific applications.

HDF5

The HDF5 format is designed to address some of the limitations of the HDF4 library, and to address current and anticipated requirements of modern systems and applications. In 2002 it won an R&D 100 Award. [6]

HDF5 simplifies the file structure to include only two major types of object:

- Datasets, which are multidimensional arrays of a homogeneous type
- Groups, which are container structures that can hold datasets and other groups

This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file can be accessed using the POSIX-like syntax /path/to/resource. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.
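
A minimal sketch of this structure, using the third-party h5py binding for Python (the choice of library and all file, group, and attribute names here are illustrative, not taken from the article):

    import numpy as np
    import h5py

    # Create a file holding one group, one dataset inside it, and some attributes.
    with h5py.File("example.h5", "w") as f:
        grp = f.create_group("experiment1")            # group: a container, like a directory
        dset = grp.create_dataset(                     # dataset: a typed multidimensional array
            "measurements", data=np.arange(12).reshape(3, 4))
        grp.attrs["operator"] = "A. Researcher"        # metadata as named attributes
        dset.attrs["units"] = "kelvin"

    # Resources are addressed with POSIX-like paths.
    with h5py.File("example.h5", "r") as f:
        data = f["/experiment1/measurements"][()]
        print(data.shape, f["/experiment1"].attrs["operator"])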

In addition to these advances in the file format, HDF5 includes an improved type system, and dataspace objects which represent selections over dataset regions. The API is also object-oriented with respect to datasets, groups, attributes, types, dataspaces and property lists.
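
A hedged illustration of a dataspace selection, again with h5py, where NumPy-style slicing on a dataset is translated into a hyperslab selection on the file dataspace (file and dataset names are hypothetical):

    import numpy as np
    import h5py

    with h5py.File("selections.h5", "w") as f:
        dset = f.create_dataset("grid", shape=(100, 100), dtype="f4")
        # Write only a rectangular region: a hyperslab selection over the dataset's dataspace.
        dset[10:20, 0:50] = np.ones((10, 50), dtype="f4")
        # Read a strided selection without loading the whole array into memory.
        corner = dset[0:100:10, 0:100:10]
        print(corner.shape)  # (10, 10)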

The latest version of NetCDF, version 4, is based on HDF5.
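
Since a netCDF-4 file is itself a valid HDF5 file, generic HDF5 tools and bindings can inspect it directly; a small sketch with h5py, assuming a hypothetical netCDF-4 file observations.nc:

    import h5py

    with h5py.File("observations.nc", "r") as f:   # a netCDF-4 file opened as plain HDF5
        f.visit(print)                             # print every group and dataset path
        print(dict(f.attrs))                       # netCDF global attributes appear as HDF5 attributes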

Because it uses B-trees to index table objects, HDF5 works well for time series data such as stock price series, network monitoring data, and 3D meteorological data. The bulk of the data goes into straightforward arrays (the table objects) that can be accessed much more quickly than the rows of an SQL database, but B-tree access is available for non-array data. The HDF5 data storage mechanism can be simpler and faster than an SQL star schema.
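
A sketch of the append-style usage this enables, with a chunked, extendable h5py dataset standing in for a growing price series (chunked datasets are the layout for which HDF5 keeps a B-tree index of data blocks; the file and dataset names are illustrative):

    import numpy as np
    import h5py

    with h5py.File("ticks.h5", "w") as f:
        prices = f.create_dataset(
            "prices",
            shape=(0,),
            maxshape=(None,),     # extendable along the time axis
            dtype="f8",
            chunks=(4096,),       # chunked layout; chunks are located via a B-tree index
            compression="gzip",
        )
        batch = np.random.default_rng().random(1000)   # stand-in for newly arrived samples
        prices.resize(prices.shape[0] + batch.shape[0], axis=0)
        prices[-batch.shape[0]:] = batch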

Criticism

Criticism of HDF5 stems from its monolithic design and lengthy specification. [7]

Interfaces

Officially supported APIs

The officially supported interfaces cover C, C++, Fortran, and Java; the Java interface and the command-line utilities ship as part of the freely available distribution described above.

Third-party bindings

Bindings to HDF5 are also maintained outside The HDF Group for many languages and environments, including the JHDF5 Java library, [8] Mathematica, [9] and the Perl Data Language (PDL::IO::HDF5). [10]

Tools

See also

References

  1. "HDF5 1.14.3". 30 October 2023.
  2. Java-based HDF Viewer (HDFView)
  3. "History of HDF Group" . Retrieved 15 July 2014.
  4. How is HDF5 different from HDF4? Archived 2009-03-30 at the Wayback Machine
  5. "Are there limitations to HDF4 files?". Archived from the original on 2016-04-19. Retrieved 2009-03-29.
  6. R&D 100 Awards Archives Archived 2011-01-04 at the Wayback Machine
  7. Rossant, Cyrille. "Moving away from HDF5". cyrille.rossant.net. Retrieved 21 April 2016.
  8. JHDF5 library
  9. HDF Import and Export Mathematica documentation
  10. PDL::IO::HDF5