How to Read HDF5 Files Into Pandas DataFrame

Manav Narula Feb 02, 2024
  1. Export a DataFrame to HDF5 Using Pandas
  2. Read HDF5 File Into a Pandas DataFrame
  3. Conclusion

Working with big data means storing and processing large amounts of data, and file formats designed for that scale have emerged alongside the traditional storage options.

One such format is HDF5, which stands for Hierarchical Data Format version 5, the most commonly used version of the format.

We use this file format to store large quantities of data and organize the contents in a specific hierarchy. Its advantages are that it can require less storage space and allows quick access to parts of the data.

We can efficiently work with such files in Python with the h5py module. We can even use the Pandas library to load such data into data frames.

This article will demonstrate how to work with HDF5 files using the Pandas library in Python.

Export a DataFrame to HDF5 Using Pandas

We can export a DataFrame object to an HDF5 file using the DataFrame.to_hdf() method. This method writes the DataFrame to an HDF5 file using HDFStore, which relies on the PyTables package (installed under the name tables).

Before using this method, we need to understand some crucial parameters.

To specify the group identifier, we use the key parameter. We can set different modes for the file with the mode parameter.

The w mode opens the file in write mode and erases any previous content. The a mode, which is the default, opens the file in append mode and adds data while preserving the existing content.
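The difference between the two modes can be checked with a small sketch like the one below. The temporary file name and the keys first, second, and only are made up for illustration; pandas.HDFStore is used only to list what the file contains.

```python
import os
import tempfile

import pandas as pd

# Throwaway path so the sketch does not touch real data (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

first = pd.DataFrame({"C1": [1, 2]})
second = pd.DataFrame({"C1": [3, 4]})

# mode="w" creates the file, or truncates it if it already exists.
first.to_hdf(path, key="first", mode="w")
# mode="a" keeps what is already stored and adds a second group.
second.to_hdf(path, key="second", mode="a")

with pd.HDFStore(path, mode="r") as store:
    keys_after_append = sorted(store.keys())

# Writing with mode="w" again discards both earlier groups.
second.to_hdf(path, key="only", mode="w")

with pd.HDFStore(path, mode="r") as store:
    keys_after_overwrite = sorted(store.keys())

print(keys_after_append)     # ['/first', '/second']
print(keys_after_overwrite)  # ['/only']
```

After the append, both groups are present; after the second write in w mode, only the last one remains.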

We will now export a DataFrame to an HDF5 file using this method in the example below.

import pandas as pd

df = pd.DataFrame({"C1": [10, 11, 12], "C2": [20, 21, 22]}, index=[0, 1, 2])
df.to_hdf("file_data.h5", key="df", mode="w")

The above example will create an HDF5 file with the data frame’s content. We open the file in write mode, erasing any previous data.
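The quick access to parts of the data mentioned earlier depends on how the DataFrame is stored. As a minimal sketch, assuming the optional format and data_columns parameters of to_hdf() and the where parameter of read_hdf() (none of which the example above uses), a DataFrame saved in table format can be filtered while it is being read. The temporary file name is made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Throwaway path for the sketch (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")

df = pd.DataFrame({"C1": [10, 11, 12], "C2": [20, 21, 22]}, index=[0, 1, 2])

# format="table" stores the frame in a queryable layout;
# data_columns=True makes every column usable in a where clause.
df.to_hdf(path, key="df", mode="w", format="table", data_columns=True)

# Only the rows matching the condition are loaded from disk.
subset = pd.read_hdf(path, key="df", where="C1 > 10")
print(subset)
```

The default fixed format does not support such queries, so this choice has to be made when the file is written.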

Read HDF5 File Into a Pandas DataFrame

The pandas.read_hdf() function reads such files directly. However, it does not work for every HDF5 file.

Pandas understands only certain HDF5 layouts, such as the ones it writes itself through to_hdf(), so this function works only with files that follow them.

See the code below.

import pandas as pd

df = pd.read_hdf("file_data.h5")
print(df)

Output:

   C1  C2
0  10  20
1  11  21
2  12  22

In the above example, we read the HDF5 file created in the previous section using the read_hdf() function. Because that file holds a single object, no key is needed. As discussed, not every HDF5 file structure can be imported directly using the pandas.read_hdf() function.
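When a file holds more than one object, read_hdf() needs the key parameter to know which one to load. The sketch below illustrates this with a temporary file and two made-up keys, left and right.

```python
import os
import tempfile

import pandas as pd

# Throwaway path for the sketch (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "two_frames.h5")

# Store two DataFrames in the same file under different keys.
pd.DataFrame({"C1": [10, 11, 12]}).to_hdf(path, key="left", mode="w")
pd.DataFrame({"C2": [20, 21, 22]}).to_hdf(path, key="right", mode="a")

# The key tells read_hdf() which of the two objects to load.
right = pd.read_hdf(path, key="right")
print(right)

# Omitting the key is ambiguous here, so pandas raises a ValueError.
try:
    pd.read_hdf(path)
    ambiguous = False
except ValueError:
    ambiguous = True

print(ambiguous)
```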

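For files that read_hdf() cannot parse, a workaround that may help uses the h5py and numpy modules.

We use the h5py.File constructor to open the given HDF5 file, convert one of its datasets to a NumPy array with the numpy.array() function, and then wrap that array in a DataFrame using the pandas.DataFrame() function. The name passed in square brackets must refer to a dataset; a group, such as the one Pandas creates with to_hdf(), will not yield the table's values this way.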

The format for this is shown below.

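import pandas as pd
import numpy as np
import h5py

# "plain_data.h5" and "dataset_name" are placeholders for your own file and dataset
with h5py.File("plain_data.h5", "r") as f:
    df = pd.DataFrame(np.array(f["dataset_name"]))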
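To try this approach end to end, the following sketch first builds a plain HDF5 file with h5py and then loads it. The temporary file name, the dataset name measurements, and the column labels C1 and C2 are all assumptions made for illustration; a plain dataset carries no column names, so they have to be supplied when the DataFrame is created.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Throwaway path for the sketch (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "plain.h5")

# Build a plain HDF5 file with one 2-D dataset (not written by pandas).
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=np.array([[10, 20], [11, 21], [12, 22]]))

# Read the whole dataset into memory while the file is still open.
with h5py.File(path, "r") as f:
    values = f["measurements"][()]

# Column labels are supplied by hand because the dataset has none.
df = pd.DataFrame(values, columns=["C1", "C2"])
print(df)
```

Opening the file in a with block closes it as soon as the data has been copied into the NumPy array.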

Conclusion

To wrap up, we discussed how to work with HDF5 files using the Pandas library in Python. We started by learning about the HDF5 file format and its advantages.

We exported a DataFrame to such a file using the to_hdf() method and read it back using the read_hdf() function.

For files that read_hdf() cannot handle, we can combine the h5py, NumPy, and Pandas libraries to read the data into a DataFrame.
