Hierarchical data format (HDF5)
HDF5 is a file format designed to store and organize large amounts of numerical data.
See the HDF5 API Specification for details.
The format:
- is binary (so efficient)
- handles multidimensional data
- has a wide range of types, including complex and double precision
- deals with endian issues
- stores data hierarchically: datasets (i.e. files) are kept in a tree of groups (i.e. directories)
- stores metadata as attributes, attached to groups or to datasets.
HDF5 can be read and written by:
- Mathematica (see below)
- Matlab
- Python (h5py), also PyTables (see below)
- Labview (but see below)
- Quite a lot of other languages which we don't use in the lab: Perl, IDL, and others.
Labview and HDF5
Labview has in the past used HDF5 as an internal format, but only written a limited and specific subset of it. General Labview support seems to be some time away. NI seems committed to their TDMS file format (proprietary, but well documented, doesn't support 2D arrays etc).
This lavag post seems to indicate that NI tried HDF5 and found performance issues. These are slightly worrying, but we are not likely to need to append >100 separate datastreams. The post also discusses NI's commitment to TDMS and to extending it, which is disappointing given the existence of HDF5 and other open standards like XSIL.
The best available Labview library seems to be LVHDF5, which is based on HDF5 1.6.5.
There is mailing list evidence that Tomi Maila is developing a library based on the more recent HDF5 1.8, but it doesn't seem to have been publicly released.
LVHDF5
Some limitations need to be investigated. In particular:
Arrays: "Only conversion of 1-D LabVIEW arrays is supported. Note that datasets may still be of higher dimensionality. Array datatypes are typically found only if contained by a cluster"
Not sure what this means - need to try it and see. It is of course always possible to flatten 2D to 1D and store, but highly undesirable.
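If flattening does turn out to be necessary, the usual row-major (C order) mapping can be sketched in plain C as below. The names flat_index, flatten_2d and unflat_index are made up for illustration and are not part of LVHDF5 or HDF5. Note that the number of columns has to be stored somewhere alongside the data (for example as an attribute) to recover the 2D shape, which is part of why this route is undesirable.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major (C order) offset of element (row, col) in a matrix with ncols columns. */
size_t flat_index(size_t row, size_t col, size_t ncols)
{
    assert(col < ncols);
    return row * ncols + col;
}

/* Copy a 2D matrix, given as an array of row pointers, into one 1D buffer.
   The caller must provide room for nrows * ncols elements in out. */
void flatten_2d(const int *const *rows, size_t nrows, size_t ncols, int *out)
{
    for (size_t r = 0; r < nrows; r++)
        for (size_t c = 0; c < ncols; c++)
            out[flat_index(r, c, ncols)] = rows[r][c];
}

/* Recover (row, col) from a flat offset; ncols must be known from elsewhere. */
void unflat_index(size_t offset, size_t ncols, size_t *row, size_t *col)
{
    assert(ncols > 0);
    *row = offset / ncols;
    *col = offset % ncols;
}
```

For a 2x3 matrix, element (1, 2) lands at offset 1*3 + 2 = 5, the last slot of the 1D buffer.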
The biggest problem appears to be very slow data conversion using strings.
The installer puts hdf5dll.dll, szlibdll.dll, and zlib1.dll in C:\Windows\SysWOW64.
Directly calling the HDF5 DLL from LabVIEW
This is now a development project, documented at HDF5 Direct To LabVIEW.
GUIs to work with HDF5 files
There are also some nice GUI explorers, such as ViTables (in Python), HDF Explorer (Windows only) and HDFView (Java, cross-platform).
H5LT: Lightweight HDF5 interface
This lightweight C wrapper looks promising if we need to roll our own interface. There is an H5LT tutorial. It is particularly attractive because we don't have to fuss with #defines. For example, rather than saying:
H5LTread_dataset (file_id, dset_name, H5T_NATIVE_INT, data);
which relies on H5T_NATIVE_INT being set somewhere, likely in a header file that Labview doesn't know about, we can just as well say:
H5LTread_dataset_int (file_id, dset_name, data);
which should be trivially callable from Labview. Similarly, creating a (possibly multidimensional) dataset is as easy as:
H5LTmake_dataset_int (file_id, DSET3_NAME, rank, dims, data_int_in);
Note that DSET3_NAME is not a #define, it's a string constant.
There's also H5LTdtype_to_text, which cheerfully converts opaque datatype enums to text strings, which something like Labview can then handle in a fairly platform- and library-revision-independent way.
As an example of how relatively easy this makes things, the code below:
- creates a new HDF file
- writes in a 2x3 matrix (2 rows, 3 columns, matching dims in the code) containing the numbers 1,2,3,4,5,6
- closes the file
This takes three lines of actual code, as you'd hope!
#include "hdf5.h"
#include "hdf5_hl.h"
#define RANK 2
int main(void)
{
    hid_t file_id;
    hsize_t dims[RANK] = {2, 3};      // 2 rows x 3 columns
    int data[6] = {1, 2, 3, 4, 5, 6}; // matrix contents in row-major order
    herr_t status;
    file_id = H5Fcreate("ex_lite1.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); // create a HDF5 file
    status = H5LTmake_dataset_int(file_id, "/dset", RANK, dims, data); // create and write an integer type dataset named "dset"
    status = H5Fclose(file_id); // close file
    return (status < 0) ? 1 : 0; // nonzero exit code if closing the file failed
}
Linking against hdf5dll.dll with MinGW32
This is how to compile the above code with the MinGW32-tdm compiler (see, for example, Building DLLs for LabVIEW for how this compiler is installed).
Add the line
#define _SSIZE_T_
before the #includes to stop sys/types.h arguing with the HDF5 includes about what ssize_t is. Urgh.
Then compile with
gcc -c ex_lite1.c -I"c:\Program Files\HDF5 1.8.6\include"
and link with
gcc -o ex_lite1.exe ex_lite1.o -L"c:\Program Files\HDF5 1.8.6\bin" -lhdf5dll -lhdf5_hldll
Info: resolving _H5T_NATIVE_INT_g by linking to __imp__H5T_NATIVE_INT_g (auto-import)
c:/mingw32/bin/../lib/gcc/mingw32/4.5.1/../../../../mingw32/bin/ld.exe: warning:
auto-importing has been activated without --enable-auto-import specified on the
command line.
This should work unless it involves constant data structures referencing symbols
from auto-imported DLLs.
This whinging from the linker seems harmless, but I suppose it would be nice to know exactly what is going on. The obvious settings of C_INCLUDE_PATH don't seem to let us get rid of the -I, nor does LIBRARY_PATH let us get rid of the -L. Perhaps this is down to spaces in the filenames? Slashes the wrong way around? Meh, can fix with a Makefile if necessary.
This makes ex_lite1.exe, which cheerfully produces the example h5 file. Good.
Python interfaces
A direct Python implementation of the C API, appropriately objecty, is h5py. There's no 64-bit version on the h5py project site, but you can get one here if necessary.
PyTables is a different approach, see below.
HDF5 Tables
Tabular data with columns of differing types can be stored in HDF5. You construct a compound type (a struct in C), and then make an array of them. The struct members are columns; the array elements are rows. This is basically analogous to a single table in a relational database. Clearly, these tables are useful for storing multi-channel timeseries data, amongst other things.
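As a rough sketch of the row/column picture in plain C, assume a made-up two-channel timeseries record. The sample_row struct, its field names, and the helpers column_offset and shot_column are hypothetical and not taken from any HDF5 header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row layout: one sample of a two-channel timeseries. */
typedef struct {
    double time_s;     /* column 0: timestamp in seconds */
    int    shot;       /* column 1: shot number */
    float  channel[2]; /* column 2: an array-valued column */
} sample_row;

/* Byte offset of a column within a row. This is the per-member
   information a compound datatype has to record. */
size_t column_offset(int column)
{
    static const size_t offsets[3] = {
        offsetof(sample_row, time_s),
        offsetof(sample_row, shot),
        offsetof(sample_row, channel)
    };
    assert(column >= 0 && column < 3);
    return offsets[column];
}

/* Extract the "shot" column from a table, i.e. an array of rows. */
void shot_column(const sample_row *table, size_t nrows, int *out)
{
    assert(table != NULL && out != NULL);
    for (size_t i = 0; i < nrows; i++)
        out[i] = table[i].shot;
}
```

In the HDF5 C API, offsets like these are what get passed to H5Tinsert when a compound type is built with H5Tcreate(H5T_COMPOUND, sizeof(sample_row)).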
While tables can be created out of low-level HDF5 library calls, this is tedious, and various libraries have evolved. The official HDF5 H5TABLES interface is one. PyTables is another.
PyTables
PyTables may be a much easier solution for getting tabular data into and out of HDF5. It's very object oriented, and massively faster than we'll need.
Interestingly, unlike in an RDBMS, a column can contain not just atomic types like strings and numbers, but arrays or even other tables. This is the idea of a hierarchical table system. So arguably, multiple BECs in a single HDF5 file (i.e. multiple shots with the same parameters) should be rows in a table, with the BEC images being columns, and everything else being rows in the table too. Hmm. Using paths instead requires lexical names like "/bec1/images/absorption". OTOH, other tools accessing HDF5 will likely cope much better if the tables are fairly flat.
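To make the path-naming point concrete, here is a small sketch in plain C that picks apart a path like the one above. path_depth and path_leaf are hypothetical helpers written for this note, not HDF5 library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the components of a path such as "/bec1/images/absorption".
   Repeated or trailing slashes do not add components. */
size_t path_depth(const char *path)
{
    assert(path != NULL);
    size_t depth = 0;
    int in_component = 0;
    for (const char *p = path; *p != '\0'; p++) {
        if (*p == '/') {
            in_component = 0;
        } else if (!in_component) {
            in_component = 1;
            depth++;
        }
    }
    return depth;
}

/* Return a pointer to the last component (the dataset name) within path. */
const char *path_leaf(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

With the path-based layout, every shot and every image type costs one more name of this kind, whereas the table layout would keep that information in row and column positions.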
The underlying files are still HDF5, and I don't think PyTables makes much use of metadata for its own purposes. So it shouldn't be hard to have Labview write in this format - well, it shouldn't be harder than having Labview write anything else in HDF5.
Very encouragingly, the detailed PyTables manual has this to say about interoperability with generic HDF5:
"PyTables can access a wide range of objects in generic HDF5 files, like compound type datasets (that can be mapped to Table objects), homogeneous datasets (that can be mapped to Array objects) or variable length record datasets (that can be mapped to VLArray objects). Besides, if a dataset is not supported, it will be mapped to a special UnImplemented class (see Section 4.14), that will let the user see that the data is there, although it will be unreachable (still, you will be able to access the attributes and some metadata in the dataset). With that, PyTables probably can access and modify most of the HDF5 files out there."
ViTables
ViTables is a GUI for inspecting HDF5 files in general, particularly aimed at fast access to large tabular data in PyTables format.
Installing it is slightly annoying. You need:
- Fairly recent Python, including numpy > 1.4.0. I used EPD version 7.0.1.
- PyQt4. Make sure you get the one for your version of Python! EPD-7 came with Python 2.7, so I got this one
The binary download of ViTables didn't work, so I built it from the repository. Doing
hg clone http://hg.berlios.de/repos/vitables vitables_tip
gets the code, and then the usual
python setup.py install
seemed to do the trick. It didn't like starting from cygwin, but was fine running from a cmd console as
python vitables
--
LincolnTurner - 18 Feb 2011
Mathematica and HDF5
Mathematica speaks HDF5, but compound data structures are not supported (they are ignored by Import).
A basic package to read H5Tables in Mathematica is now more-or-less working.
--
LincolnTurner - 19 Mar 2011
Mathematica calls HDF5.exe in
<Mathematica install directory>\SystemFiles\Converters\Binaries\
which uses version 1.6.5 of the HDF5 library (in version 8 of Mathematica, at least). A very limited subset of the HDF5 functionality is exposed, in addition to the above problems.
--
LincolnTurner - 07 Mar 2011