2. Data handling

2.1 Introduction

Standard Python containers, such as lists and dictionaries, are not designed for numerical computation. In order to efficiently perform scientific calculations, the NumPy package provides powerful multi-dimensional array objects, with related functions and tools for numerical calculations.
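As a minimal illustration of this difference (the variable names below are chosen only for this sketch), the same `*` operator repeats a list but scales a NumPy array element by element:

```python
import numpy as np

values = [1.0, 2.0, 3.0]

# With a list, `*` repeats the container instead of scaling the numbers.
repeated = values * 2

# Element-wise arithmetic on a list needs an explicit loop.
scaled_list = [2 * v for v in values]

# With a NumPy array, the same operator acts on every element at once.
arr = np.array(values)
scaled_arr = arr * 2

print(repeated)     # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
print(scaled_arr)   # [2. 4. 6.]
```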

In the following sections NumPy arrays and higher-level data containers built on top of them (namely pandas DataFrames and xarray Datasets) will be introduced. The last section will describe how to load scientific data and conveniently store it in the NeXus/HDF5 file format.

Important note: before going through this lecture you need to add the conda-forge channel to your environment and install the nxarray package through pip. Check the section on Packages installation to learn how to do it.

2.2 NumPy arrays

NumPy arrays, provided by the NumPy package, are the core objects for numerical calculation in Python. They are multi-dimensional data containers whose elements share a single data type and are stored in memory in a compact, efficient layout. An array could contain, for example, the measurements of an experiment, a recorded signal, the pixel intensities of an image or point values in a multi-dimensional space.
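The following short sketch shows a few basic properties of an array. The 3x4 "image" and its values are made up for illustration and are not taken from the linked tutorials:

```python
import numpy as np

# A hypothetical 3x4 "image": 3 rows, 4 columns of pixel intensities.
image = np.arange(12, dtype=np.float64).reshape(3, 4)

print(image.shape)   # (3, 4)
print(image.ndim)    # 2
print(image.dtype)   # float64

# Basic slicing returns views on the same data, not copies.
first_row = image[0, :]        # values 0, 1, 2, 3
last_column = image[:, -1]     # values 3, 7, 11

# Reductions can act on the whole array or along one axis.
total = image.sum()                # 66.0
column_means = image.mean(axis=0)  # mean over rows, one value per column
```

The tutorials referenced in this section cover array creation, indexing and broadcasting in much more depth.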

For a first overview of the NumPy package and NumPy arrays, follow the related chapter 1.4 of the SciPy lectures, the official quick-start guide and the basics for beginners. A few of these examples also introduce some basics of data plotting. For more technical insight, you can also have a look at chapter 2.2 of the SciPy lectures.

2.3 Extending NumPy

NumPy provides a set of very powerful functions for data analysis, and NumPy arrays are extremely robust and efficient for numerical computation. In practical data analysis, however, it can be difficult to handle and explore these arrays, and to relate them to one another.

To address this, several Python packages provide more flexible and expressive data structures that extend NumPy arrays (which are always under the hood), making them easier and more intuitive to use.

The most popular of these packages is pandas, which is designed to simplify the handling of ‘labeled’ and tabular data by providing the Series (1D) and DataFrame (2D) data structures. The 10-minute introduction and the basic functionality guide on the official website are good starting points for getting an idea of this package.
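As a rough sketch of these two structures (the column names, index labels and values are invented for this example), a Series and a DataFrame can be built and queried like this:

```python
import numpy as np
import pandas as pd

# A Series is a labeled 1D array: values plus an index.
temperature = pd.Series(
    [20.5, 21.0, 19.8], index=["a", "b", "c"], name="temperature"
)

# A DataFrame is a labeled 2D table with named columns.
df = pd.DataFrame(
    {
        "temperature": [20.5, 21.0, 19.8],
        "pressure": [1.01, 0.99, 1.02],
    },
    index=["a", "b", "c"],
)

# Label-based selection of a single value.
print(df.loc["b", "pressure"])   # 0.99

# The values can be retrieved as a plain NumPy array.
values = df.to_numpy()           # shape (3, 2)

# Column-wise statistics.
print(df.mean())
```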

pandas is a very powerful package that makes tabular data handling much easier. Nevertheless, it does not support higher-dimensional data well, and it lacks integrated management of metadata (the attributes related to your data), so it is not always suited to scientific research data.

The xarray package, built on top of NumPy and pandas, fills these gaps very well and is often the recommended choice for managing scientific research data. xarray provides the DataArray structure, a labeled N-dimensional array with its coordinates and attributes, and the Dataset structure, a container of DataArrays that share the same coordinates. To get a better idea of how these two data structures are implemented, check their design description; for an insight into their basic usage, have a look at the quick overview on the official website.
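A minimal sketch of the two structures follows. The dimension names, coordinate values and attributes are assumptions made for this example only; the official quick overview builds its own, different objects:

```python
import numpy as np
import xarray as xr

# Hypothetical measurement: a signal sampled at 3 positions and 4 times.
data = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("x", "time"),
    coords={"x": [10, 20, 30], "time": [0.0, 0.5, 1.0, 1.5]},
    name="data",
    attrs={"units": "V", "long_name": "signal"},  # metadata travels with the data
)

# Select by coordinate label rather than by integer position.
at_x20 = data.sel(x=20)            # 1D DataArray along "time"

# Reduce along a named dimension instead of an axis number.
time_mean = data.mean(dim="time")  # one value per x

# A Dataset groups DataArrays that share coordinates.
ds = xr.Dataset({"data": data, "doubled": 2 * data}, attrs={"title": "example"})
print(ds)
```

Compared with a bare NumPy array, the dimension names and coordinates make selections and reductions self-describing, and the attributes keep the metadata attached to the values.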

It is worth noting that another important feature of xarray is that it can easily be extended with domain-specific functionality by adding custom ‘accessors’ to xarray objects. This aspect will be covered in more detail in chapter 4. Analysis.
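To give a first impression before chapter 4, here is a minimal sketch of a custom accessor. The accessor name `demo` and the `normalized` method are hypothetical and exist only in this example; `register_dataarray_accessor` is the xarray decorator used to attach them:

```python
import numpy as np
import xarray as xr


# "demo" is a hypothetical accessor name chosen for this sketch.
@xr.register_dataarray_accessor("demo")
class DemoAccessor:
    def __init__(self, xarray_obj):
        # xarray passes in the DataArray the accessor is attached to.
        self._obj = xarray_obj

    def normalized(self):
        """Return the array divided by its maximum value."""
        return self._obj / self._obj.max()


signal = xr.DataArray(np.array([1.0, 2.0, 4.0]), dims="t")

# The new functionality is reached through the accessor namespace.
result = signal.demo.normalized()
print(result.values)   # [0.25 0.5  1.  ]
```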

2.4 Loading and saving

The examples and tutorials in the previous sections already showed some basics of data loading and saving, for NumPy, pandas and xarray respectively.

Although all these packages can import and export HDF5 (a file format designed to efficiently store and organize large amounts of data), in some cases through additional libraries, none of them provides an integrated interface to the NeXus file format, the de facto standard for scientific data storage, which is based on HDF5 and increasingly adopted in laboratories and large-scale facilities all over the world.

This is where the nxarray package comes into play, bridging xarray with the NeXus format. It extends xarray with convenient methods for loading and saving NeXus files directly from and to DataArrays and Datasets. The architecture of a NeXus file closely resembles the structure of an xarray Dataset; although the two were developed independently, both are specifically designed for handling scientific data together with its relevant metadata.

After installation, you can start using nxarray by importing it with:

import nxarray as nxr

Now the nxr.save() method is available on xarray objects. For example, the ds Dataset of the previous examples can be saved to a NeXus file on disk simply with:

ds.nxr.save('ds.nxs')

You can load it back, let’s say into another Dataset my_ds, with:

my_ds = nxr.load('ds.nxs')

and you can check that the whole structure of your Dataset is the same.
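One possible way to make this check explicit is to use xarray's own comparison methods. The sketch below is an illustration under assumptions: it builds a small made-up Dataset and uses a deep copy as a stand-in for the loaded file, so that it runs without nxarray or a file on disk. Whether identical() holds after a real round trip depends on how names and attributes are stored, which is not verified here.

```python
import numpy as np
import xarray as xr

# Small made-up Dataset for this sketch only.
ds = xr.Dataset(
    {"data": (("x",), np.array([1.0, 2.0, 3.0]), {"units": "V"})},
    coords={"x": [0, 1, 2]},
)

# Stand-in for `my_ds = nxr.load('ds.nxs')`, so the sketch runs without a file.
my_ds = ds.copy(deep=True)

# equals(): same dimensions, coordinates and values.
same_values = ds.equals(my_ds)

# identical(): additionally compares names and attributes.
same_everything = ds.identical(my_ds)

# Changing an attribute breaks identical() but not equals().
my_ds["data"].attrs["units"] = "mV"
still_equal = ds.equals(my_ds)
still_identical = ds.identical(my_ds)

print(same_values, same_everything, still_equal, still_identical)
```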

A DataArray can also be saved to a NeXus file. In this case, a Dataset containing your DataArray is created and saved to file. For example, the data DataArray of the previous examples can be saved in the same way with:

data.nxr.save('data.nxs')

This time, when you load it, a Dataset is returned, with your original DataArray inside it:

ds2 = nxr.load('data.nxs')
my_data = ds2['data']

This section concludes with a consideration. NeXus, as reported in the introduction on its website, ‘is an effort by an international group of scientists motivated to define a common data exchange format’. Indeed, NeXus/HDF5 files are the best choice for saving scientific data, and scientists are adopting them extensively (and should). Data values in a .nxs file are stored in binary format, which is the most efficient way to handle numbers in terms of disk space and computational speed. At the same time, .nxs files can be loaded easily and their content visualized quickly, together with all the associated metadata. If you are still inclined to save your data as plain text because ‘I can see what’s inside’, you are most probably just using the wrong tools to access your data. As the wise man said:

“You are a scientist, not a novelist. Save your data as binary, not as text.”