=======================
NeXus: adding semantics
=======================

Remaining issues
================

HDF5 solves most of the technical issues raised by storing detector images in
individual files. However, a class of problems remains that particularly
affects automatic data analysis and long-term archiving of the data. The major
reason for this is that HDF5 is totally agnostic about what the data stored
within an HDF5 file represents in the real world. From this point of view the
situation with plain HDF5 is rather similar to the one the IUCr faced with the
`STAR`_ file format. Though the STAR format is capable of storing all kinds of
data (as ASCII) in a structured manner, it had no standardized way to add
context to the data. This ultimately led to the development of CIF and related
formats, which are subsets of the STAR file format.

To get a better idea of this class of problems, let's consider a simple
example. Imagine a very simple synchrotron experiment where data is recorded
from the following sources

* the storage ring
* the undulator
* the monochromator
* a monitor detector
* a sample stage
* a 2D detector.

In the traditional approach every data item collected during the experiment
would get a unique name and be stored under this name in a column of a table.
The major problem with this is that we have to know that, for instance, *x_s*
and *y_s* are the x and y translations of the sample stage, or that *r_curr* is
the actual current in the storage ring. Besides this, data stored in ASCII
files typically has no physical units associated with it. Thus we have to know
in advance that *mr_1*, let's say the roll of the first mirror, is recorded in
milliradians and not in degrees.

Putting it all together, some of the problems we had with plain ASCII files
remain for HDF5.
These would be

* we have to know the context of a particular data item in advance
* there is no physical unit associated with numeric data
* every attempt to add context to the name of an item quickly leads to
  rather unwieldy long names.

.. _STAR: http://pubs.acs.org/doi/abs/10.1021/ci00002a020

Adding semantics
================

This is the point where NeXus enters the stage. NeXus is not a new physical
file format. It is rather a set of rules and conventions for how to organize
data within an HDF5 file.

.. attention::

    A rather common misconception is that NeXus is a separate file format:
    NeXus files **are** HDF5 files. Any software that can read HDF5 files can
    also read NeXus files. The NeXus standard does nothing more than determine
    the structure of the file up to a certain point.

Adding units to a particular dataset (in NeXus terminology datasets are called
*fields*) is rather simple. The NeXus standard requires that every dataset has
a string attribute named *units* attached to it, storing the string
representation of the physical unit of the data stored in the field (dataset).

.. figure:: _images/nexus_physical_units.png
    :align: center
    :width: 50%

    The physical unit of a field is determined by the value of the *units*
    attribute attached to it.

The string in the *units* attribute should follow the `UDUNITS library`_
standard.

The second job, associating every dataset with a particular object at the
beamline, is a bit more complicated. The NeXus way of doing this is to add
types to every group and to store the fields belonging to a particular object
within such a group. For instance, a group storing detector data is of type
*NXdetector*. The type of a group is encoded in a string attribute named
*NX_class* attached to the group.

.. figure:: _images/nexus_types.png
    :align: center
    :width: 50%

    The *type* of a group is determined by the value of its *NX_class*
    attribute. In this case the top level group with name *entry* is of type
    *NXentry*. See below for what this means.

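
The two attribute conventions described above (*units* on fields, *NX_class*
on groups) can be sketched in a few lines of Python. This is only a minimal
illustration, assuming the ``h5py`` package is installed. The file name
``demo.nxs``, the group names ``entry``, ``instrument`` and ``pilatus``, and
the pixel size value are placeholders chosen for the example and are not
prescribed by the text.

```python
import h5py


def build_demo_file():
    """Create a small in-memory NeXus-style tree and return the open file."""
    # driver="core" with backing_store=False keeps the file in memory only,
    # so nothing is written to disk (the file name is just a label here).
    f = h5py.File("demo.nxs", "w", driver="core", backing_store=False)

    # Groups get their type from the NX_class string attribute.
    entry = f.create_group("entry")
    entry.attrs["NX_class"] = "NXentry"

    instrument = entry.create_group("instrument")
    instrument.attrs["NX_class"] = "NXinstrument"

    detector = instrument.create_group("pilatus")  # hypothetical detector name
    detector.attrs["NX_class"] = "NXdetector"

    # Fields (datasets) carry their physical unit in the units attribute.
    for name in ("x_pixel_size", "y_pixel_size"):
        field = detector.create_dataset(name, data=0.172)  # illustrative value
        field.attrs["units"] = "mm"

    return f


if __name__ == "__main__":
    with build_demo_file() as f:
        detector = f["/entry/instrument/pilatus"]
        field = detector["x_pixel_size"]
        print(detector.attrs["NX_class"])
        print(field[()], field.attrs["units"])
```

A reader of such a file no longer has to know in advance what a value means:
the class of the enclosing group and the *units* attribute travel with the
data.
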
These types are called *base classes* in NeXus terminology. But they do much
more than associate a dataset (field) with a particular object. Every *base
class* also defines a set of datasets and their names which can appear within
a group of this type, and determines their particular meaning. The *base
classes* are defined in the `base class section`_ of the NeXus reference
manual. *NXdetector*, for instance, has a field named *data* which is supposed
to store the data recorded from this particular detector. It also defines two
fields named *x_pixel_size* and *y_pixel_size* storing the size of each pixel
of this very detector. In this way, *base classes* not only help to associate
data with a particular device at the beamline, they also add meaning to the
data stored within each class. Finally, whenever we arrive at a group of a
particular class we can make educated guesses about which fields we can find
there. This greatly helps when writing automatic analysis software which
should identify the required data by itself without user interaction.

.. _UDUNITS library: http://www.unidata.ucar.edu/software/udunits/#home
.. _base class section: http://download.nexusformat.org/doc/html/classes/base_classes/index.html#base-class-definitions

The basic structure
===================

Aside from concrete devices like detectors, attenuators and the storage ring,
the NeXus standard also defines *base classes* of a rather abstract nature.
Their primary purpose is to define the basic structure of an experiment run
within a file. The most important of these base classes are

+----------------+-------------------------------------------------------------+
| base class     | description                                                 |
+================+=============================================================+
| *NXentry*      | the top level group of every experiment. It is the entry    |
|                | point for the data tree of a single experiment run.         |
+----------------+-------------------------------------------------------------+
| *NXinstrument* | located below *NXentry*, storing every component of the     |
|                | beamline except the sample and all data associated with it. |
+----------------+-------------------------------------------------------------+
| *NXsample*     | like *NXinstrument*, *NXsample* resides below *NXentry*.    |
|                | It contains all data associated with the sample             |
|                | (including for instance the position of every stage         |
|                | the sample is mounted on).                                  |
+----------------+-------------------------------------------------------------+
| *NXdata*       | below *NXentry*, collecting data from this particular run   |
|                | used for plotting.                                          |
+----------------+-------------------------------------------------------------+

.. figure:: _images/nexus_base_classes.png
    :align: center
    :width: 50%

    A typical default NeXus tree shown in HDFView. An instance of *NXentry*
    acts as the top-level element for a run.

Addressing objects within a NeXus file
======================================

To address an object within an HDF5 file a Unix-like path is used. However,
this requires that the name of each group between the target dataset and the
root of the tree is known. This would require rather strict naming
conventions, which are neither practical nor would they make NeXus easy to
use. Consequently, every analysis script or program would have to know the
exact path to every bit of information required to do its job.

The concept of *types* in NeXus can be used to relax this constraint in many
cases. Instead of searching for a path we can look for a type. So, for
instance, instead of saying ``/entry/instrument/pilatus/data``, which would
translate to "give me the field named *data* within the group *pilatus*", one
could ask ``/entry/instrument/:NXdetector/data``, which reads: give me the
field named *data* in the first instance of *NXdetector* you find.
This of course would require that the instrument group holds only one instance
of *NXdetector*. If we now assume that the file contains data from only one
run, meaning that it has only one *entry* below the root, we can generalize
the above path to ``/:NXentry/:NXinstrument/:NXdetector/data``, in which case
the only name we have to know is that of the dataset we want to access.
However, this is easy as the name *data* is standardized by the base class
*NXdetector*. With only two restrictions, namely

* there is only one entry in the file
* only one detector is used

the user does not need to know any group names. It is thus possible to write
rather generic paths for use in generic scripts.
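
How such a type-based path could be resolved is sketched below. This is an
illustration only and not the API of any particular library: the ``Node``
class is a hypothetical stand-in for an HDF5 group or field, and ``resolve``
and ``demo_tree`` are names invented for this example. Only the ``:NXclass``
path element syntax is taken from the text above.

```python
class Node(dict):
    """Hypothetical stand-in for an HDF5 group or field:
    a dict of named children plus a dict of attributes."""

    def __init__(self, children=None, **attrs):
        super().__init__(children or {})
        self.attrs = attrs


def resolve(root, path):
    """Follow path from root, one element at a time.

    An element of the form ':NXfoo' selects the first child whose NX_class
    attribute equals 'NXfoo'; any other element selects a child by name.
    """
    node = root
    for element in path.strip("/").split("/"):
        if element.startswith(":"):
            wanted = element[1:]
            matches = [child for child in node.values()
                       if child.attrs.get("NX_class") == wanted]
            if not matches:
                raise KeyError(f"no child of class {wanted!r}")
            node = matches[0]  # "the first instance you find"
        else:
            node = node[element]
    return node


def demo_tree():
    """Build the single-entry, single-detector layout used in the text."""
    detector = Node({"data": Node()}, NX_class="NXdetector")
    instrument = Node({"pilatus": detector}, NX_class="NXinstrument")
    entry = Node({"instrument": instrument}, NX_class="NXentry")
    return Node({"entry": entry})


if __name__ == "__main__":
    tree = demo_tree()
    by_name = resolve(tree, "/entry/instrument/pilatus/data")
    by_type = resolve(tree, "/:NXentry/:NXinstrument/:NXdetector/data")
    print(by_type is by_name)
```

Groups opened with ``h5py`` offer the same ``values()``, item access and
``attrs.get()`` calls, so the same function should also work on a real file,
provided the *NX_class* attributes read back as ``str`` rather than bytes.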