Add Metadata to Pandas Data Frame

Metadata, often described as data about data, is structured information that describes, locates, and helps manage a resource, such as a document published on the web.

Some web servers and software tools generate metadata automatically, but it can also be written by hand.

Good metadata improves a document’s organization, discoverability, accessibility, indexing, and retrieval.

A Pandas data frame (DataFrame) is a two-dimensional, labeled data structure, built on top of NumPy, that offers functionality similar to R data frames.

It can be thought of as a dictionary of columns that share one row index, much like a table in Excel or a database with rows and columns, and it comes with built-in data analysis and manipulation functionality. This tutorial explains how to add metadata to Pandas data frames.

Add Metadata to Pandas Data Frame

To add metadata to a data frame, we follow the steps below.

  1. Create or import a data frame.
  2. Read existing metadata of the data frame.
  3. Add metadata to the data frame.

Create or Import a Data Frame

We first need a data frame to attach metadata to. For this, install the Python library pandas.

PS C:\> pip install pandas

Let’s read a data frame from a file using pandas.

Example Code (saved in demo.py):

import pandas as pd

df = pd.read_csv('Data.csv')
print(df)

The above code imports the pandas package under the alias pd. The function pd.read_csv() reads the file Data.csv and returns its contents as a data frame, which we store in a variable named df.

Let’s see what df contains.

Output (printed on console):

[Image: console output showing the data frame]
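If no CSV file is at hand, a data frame can also be created directly in memory. The following is a minimal sketch; the column names and values are made up for illustration and are not the contents of Data.csv.

```python
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
cars = {
    "brand": ["Audi", "Audi", "BMW"],
    "model": ["Q5", "A4", "X3"],
    "price": [45000.0, 40000.0, 47000.0],
}

# Each dictionary key becomes a column; each list holds that column's values.
df = pd.DataFrame(cars)

print(df.shape)  # (number of rows, number of columns)
print(df)
```

A data frame built this way behaves the same as one returned by pd.read_csv(), so the remaining steps apply to it unchanged.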

Read Existing Metadata of the Data Frame

The imported data frame already carries some metadata of its own. We can view it with the code examples below.

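  1. The Pandas info() function prints a concise summary of the data frame: the index range, each column’s name, non-null count, and dtype, and the memory usage. Parameters such as max_cols, memory_usage, and show_counts (called null_counts in older pandas versions) control how much of this is shown.

    info() prints the summary itself and returns None, so we call df.info() directly rather than wrapping it in print(), which would print an extra None.

    Example Code (saved in demo.py):

    df.info()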
    

    Output (printed on console):

    [Image: console output of df.info()]

  2. The Pandas columns attribute returns an Index, an immutable, ordered sequence that holds the label of each data frame column. Let’s run the below code, which prints df.columns.

    Example Code (saved in demo.py):

    print(df.columns)
    

    Output (printed on console):

    [Image: console output of df.columns]

  3. The Pandas describe() function generates descriptive statistics of the data frame. By default, it covers the numeric columns and reports count, mean, standard deviation (std), min, max, and percentiles.

    Let’s run the following code that calls df.describe() and prints it.

    Example Code (saved in demo.py):

    print(df.describe())
    

    Output (printed on console):

    [Image: console output of df.describe()]
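The three calls above can also be used to collect the existing metadata into ordinary Python objects instead of only printing it. The sketch below assumes a small made-up data frame; it captures the info() text in a buffer through the buf parameter and reads column labels, dtypes, and non-null counts from attributes.

```python
import io

import pandas as pd

# Hypothetical data; the None becomes a missing value in the price column.
df = pd.DataFrame({"model": ["Q5", "A4"], "price": [45000.0, None]})

# info() writes to stdout and returns None; a buffer captures the text instead.
buffer = io.StringIO()
result = df.info(buf=buffer)
summary = buffer.getvalue()

# The same structural metadata is available as plain Python objects.
column_labels = list(df.columns)
column_types = df.dtypes.astype(str).to_dict()
non_null_counts = df.count().to_dict()

print(summary)
print(column_labels, column_types, non_null_counts)
```

Collecting the values this way can be handy when the metadata has to be logged or stored rather than read on the console.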

Add Metadata to the Data Frame

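Let’s run the below code to add metadata to the Pandas data frame. It sets ordinary Python attributes on the df object, so the attribute names must not clash with existing column names or data frame methods.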

Example Code (saved in demo.py):

df.audi_car_model = 'Q5'
df.audi_car_price_in_dollars = 119843.12
print(f'Car Model: {df.audi_car_model}')
print(f'Car Price ($): {df.audi_car_price_in_dollars}')

Output (printed on console):

[Image: console output showing the car model and car price]
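As an alternative to ad hoc attributes, pandas (version 1.0 and later) exposes an attrs dictionary on every data frame, which the pandas documentation marks as experimental. The sketch below is not part of demo.py; it uses a made-up one-row data frame and the same two metadata keys as above.

```python
import pandas as pd

# Hypothetical data frame used only for this illustration.
df = pd.DataFrame({"model": ["Q5"], "price": [119843.12]})

# attrs is a plain dictionary attached to the data frame object.
df.attrs["audi_car_model"] = "Q5"
df.attrs["audi_car_price_in_dollars"] = 119843.12

print(f"Car Model: {df.attrs['audi_car_model']}")
print(f"Car Price ($): {df.attrs['audi_car_price_in_dollars']}")
```

Keeping the metadata in one dictionary avoids name clashes with columns and methods, although, being experimental, its behavior may change between pandas versions.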

Note: Pandas does not provide a robust way to propagate such metadata across operations.

For example, an operation such as groupby on a data frame with attached metadata returns a new data frame without the attached metadata.
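The loss of attribute-based metadata can be checked with a short experiment. This is a minimal sketch on a made-up data frame; it attaches one attribute and then tests for it on the original, on a groupby result, and on a copy.

```python
import pandas as pd

# Hypothetical data frame used only for this illustration.
df = pd.DataFrame({"brand": ["Audi", "Audi", "BMW"], "price": [1.0, 2.0, 3.0]})

# The attribute lives on this particular Python object only.
df.audi_car_model = "Q5"

# Both operations build new data frame objects.
grouped = df.groupby("brand").sum()
copied = df.copy()

has_on_original = hasattr(df, "audi_car_model")
has_on_grouped = hasattr(grouped, "audi_car_model")
has_on_copy = hasattr(copied, "audi_car_model")

print(has_on_original, has_on_grouped, has_on_copy)
```

Because the derived objects are new, the attribute has to be re-attached by hand after each such operation, or stored somewhere more durable.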

However, you can store the metadata in an HDF5 file for later processing. Note that pd.HDFStore requires the PyTables package (installed as tables). Let’s run the below code to save metadata in an HDF5 file.

Example Code (saved in demo.py):

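def store_in_hdf5(filename, df, **kwargs):
    # Open (or create) the HDF5 file; this needs the PyTables package.
    hdf5_file = pd.HDFStore(filename)
    # Store the data frame under the key 'car_data'.
    hdf5_file.put('car_data', df)
    # Attach the keyword arguments as a metadata dictionary on the stored node.
    hdf5_file.get_storer('car_data').attrs.metadata = kwargs
    hdf5_file.close()

filename = 'car data.hdf5'
metadata = {'audi_car_model': 'Q5', 'audi_car_price_in_dollars': 119843.12}
store_in_hdf5(filename, df, **metadata)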

The store_in_hdf5() function performs the following steps:

  1. Open (or create) an HDF5 file with pd.HDFStore(), passing the filename as an argument.
  2. Insert the data frame into the file with hdf5_file.put(), passing a key name ('car_data') and df as arguments.
  3. Save the metadata to the file by assigning the kwargs dictionary to hdf5_file.get_storer('car_data').attrs.metadata.
  4. Call hdf5_file.close() to close the file.

Now, let’s run the below code to import the data frame and metadata from a file.

Example Code (saved in demo.py):

def import_from_file(hdf5_file):
    data = hdf5_file['car_data']
    metadata = hdf5_file.get_storer('car_data').attrs.metadata
    return data, metadata

filename = 'car data.hdf5'
with pd.HDFStore(filename) as hdf5_file:
    data, metadata = import_from_file(hdf5_file)

print(f'Data: {data}')
print(f'Metadata: {metadata}')

The import_from_file() function takes the open hdf5_file as an argument and retrieves the following pieces of information:

  1. data, by indexing hdf5_file with the key 'car_data'.
  2. metadata, by reading the metadata attribute from hdf5_file.get_storer('car_data').attrs.

Now, we run the Python file demo.py as:

PS C:\> python demo.py

It prints the data and metadata returned by the import_from_file() function.

Output (printed on console):

[Image: console output showing the data and the metadata]
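Where PyTables is not available, the same save-and-restore idea can be sketched with a different technique: a CSV file for the data plus a JSON "sidecar" file for the metadata. This is not the HDF5 approach used above; the helper names save_with_sidecar and load_with_sidecar, the .meta.json suffix, and the sample data are assumptions made for illustration.

```python
import json
import os
import tempfile

import pandas as pd


def save_with_sidecar(df, csv_path, **metadata):
    """Write the data to CSV and the metadata to a JSON file next to it."""
    df.to_csv(csv_path, index=False)
    with open(csv_path + ".meta.json", "w", encoding="utf-8") as fh:
        json.dump(metadata, fh)


def load_with_sidecar(csv_path):
    """Read back the data frame and its metadata dictionary."""
    data = pd.read_csv(csv_path)
    with open(csv_path + ".meta.json", encoding="utf-8") as fh:
        metadata = json.load(fh)
    return data, metadata


# Hypothetical data frame used only for this illustration.
df = pd.DataFrame({"model": ["Q5"], "price": [119843.12]})

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "car_data.csv")
    save_with_sidecar(df, path, audi_car_model="Q5",
                      audi_car_price_in_dollars=119843.12)
    data, metadata = load_with_sidecar(path)

print(f"Data: {data}")
print(f"Metadata: {metadata}")
```

The trade-off is that the data and metadata live in two files that must be kept together, whereas the HDF5 file holds both in one place.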