An HDF5 primer

The currently recommended file format for NeXus is HDF5. Hence, understanding the basic concepts behind HDF5 is essential for working with NeXus productively.

HDF5 differs from commonly used file formats in two fundamental ways:

  1. It is a binary format.
  2. The data is organized as a tree rather than as tables.
_images/hdf5_tree.png

A very abstract view of an HDF5 tree. The basic elements are groups, datasets, attributes, and links.

An HDF5 data tree is represented using the following basic objects:

  • groups which are the nodes of the tree
  • datasets which are the data storing leaves
  • links which connect nodes and leaves
  • attributes which are, with some limitations, similar to datasets and can be attached to groups and datasets to store additional meta-data.

Though this sounds rather complex, in practice, an HDF5 tree looks quite similar to a file system tree.

_images/hdf5_tree_with_data.png

HDFview showing data stored in an HDF5 file. The left panel shows the data tree which looks quite similar to a filesystem tree.

In this picture the groups represent directories, the datasets are the files in each directory, and the links can be hard or symbolic links as used on common filesystems. There is no filesystem equivalent of HDF5 attributes in this picture, but this is not a serious limitation.
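The four basic objects can be sketched with h5py, the common Python binding for HDF5. This is a minimal illustration, not part of any NeXus definition; the file name "example.h5" and the group and dataset names are made up for the example.

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    # groups are the nodes of the tree (intermediate groups are created automatically)
    entry = f.create_group("entry")
    detector = entry.create_group("instrument/detector")

    # datasets are the data-storing leaves
    data = detector.create_dataset(
        "data", data=np.zeros((10, 256, 256), dtype="uint16")
    )

    # attributes attach additional meta-data to groups and datasets
    data.attrs["units"] = "counts"
    entry.attrs["NX_class"] = "NXentry"

    # links connect nodes and leaves, much like symbolic links on a filesystem
    entry["data"] = h5py.SoftLink("/entry/instrument/detector/data")

with h5py.File("example.h5", "r") as f:
    print(f["entry/data"].shape)  # the soft link resolves to the detector dataset
```

Addressing objects by paths such as `/entry/instrument/detector/data` underlines how closely the HDF5 tree mirrors a filesystem tree.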

Compression of individual datasets

To reduce the amount of disk space occupied by a data file, one might think about compressing it. However, compressing the entire file has several disadvantages:

  • not all data within the file are compressible with reasonable data reduction
  • accessing small amounts of meta-data stored within a fully compressed file would require decompressing far more data than is actually needed.
  • different data within the file cannot be compressed with different algorithms.

To overcome these issues, HDF5 provides the functionality to apply compression to individual datasets rather than to the entire file. Consider for instance a file with data from an experiment with a large 2D detector. The detector images are stored in a 3D dataset as a single block, yet they occupy 95% of the disk space taken by the file; everything else is small meta-data. With HDF5 it is possible to apply compression only to the dataset with the 3D data. Thus, retrieving meta-data is as fast as before, but the amount of disk space required is significantly reduced. In particular with low-noise detectors this makes a lot of sense.

It is also possible to use different compression algorithms for different datasets.
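A per-dataset compression setup can be sketched as follows, again with h5py. Here the file name and dataset layout are illustrative assumptions; `gzip` is one of HDF5's built-in compression filters, and chunked storage is required whenever a filter is applied.

```python
import os

import h5py
import numpy as np

# stand-in for low-noise 2D detector frames: 100 x 512 x 512 uint16 (~50 MB raw)
frames = np.zeros((100, 512, 512), dtype="uint16")

with h5py.File("compressed.h5", "w") as f:
    # only the large 3D dataset is compressed; filters require chunked storage,
    # here one chunk per detector frame
    f.create_dataset(
        "entry/data",
        data=frames,
        chunks=(1, 512, 512),
        compression="gzip",
        compression_opts=4,
    )
    # small meta-data stays uncompressed and remains cheap to read
    f.create_dataset("entry/title", data="example scan")

print(os.path.getsize("compressed.h5"))  # far below the ~50 MB of raw frames
```

Passing a different `compression` value (e.g. `"lzf"`, where available) to another `create_dataset` call is all it takes to mix algorithms within one file.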