How to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

Fariba Laiq Feb 02, 2024
  1. Write Code From Scratch to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame
  2. Use DuckDB to Run SQL Query to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame
  3. Use the combine_first() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame
  4. Use the bfill() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame
  5. Use the mask() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame
How to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

This tutorial will demonstrate coalescing or returning the first non-null value from multiple columns into another column in Python Pandas dataframe.

For example, use the value from column 1 for the new column 3 if it isn’t null; else, use the value from column 2 for the new column 3 if column 1 is null.

We can achieve this task in numerous ways in the Pandas dataframe.

Write Code From Scratch to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

We can write our logic from scratch to coalesce values. We have created a Pandas data frame in the following code with three columns, named Age_in_Years, Age_in_Months, and Age_in_Days.

The data frame has some missing values as well. If we want to display the age, first, we will output age in years.

If the value in that column is Null, we will show age in months. Again if the value in months is Null, we will display age in days.

For this purpose, we have written the code from scratch to get the first non-null column’s value. The function is iterating over all the data frame columns, and where it finds a non-null value, it returns that value; else, it checks for the values in other columns.

Example code:

# Python 3.x
import pandas as pd

df_age = pd.DataFrame(
    {
        "Age_in_Years": ["4 y", None, None, None],
        "Age_in_Months": ["48 m", "24 m", None, None],
        "Age_in_Days": ["1440 d", None, "2520 d", None],
    }
)


def get_first_non_null(dfrow, cols):
    for c in cols:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None


cols = ["Age_in_Years", "Age_in_Months", "Age_in_Days"]
df_age["Age"] = df_age.apply(lambda x: get_first_non_null(x, cols), axis=1)
display(df_age)

Output:

pandas coalesce

Use DuckDB to Run SQL Query to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

Example code:

DuckDB is a Python API and a database management system that uses SQL queries to interact with the database. This package has a built-in coalesce method that selects the first non-null value from the columns.

We will pass column names to coalesce method in the SQL query.

# Python 3.x
import pandas as pd
import duckdb

df_age = pd.DataFrame(
    {
        "Age_in_Years": ["4 y", None, None, None],
        "Age_in_Months": ["48 m", "24 m", None, None],
        "Age_in_Days": ["1440 d", None, "2520 d", None],
    }
)
df_age = duckdb.query(
    """SELECT Age_in_Years, Age_in_Months, Age_in_Days, coalesce(Age_in_Years, Age_in_Months, Age_in_days) as Age from df_age"""
).to_df()
display(df_age)

Output:

pandas coalesce

Use the combine_first() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

The combine_first() method fills null values in one dataframe with non-null data from the second dataframe to combine two dataframe objects.

In the following code, we will return column values. We will combine Age_in_Years with Age_in_Months, and Age_in_Months with Age_in_Days.

It will return the value from Age_in_years. If Null, it will return the value from Age_in_Months. Again if that is also Null, it will return a value from Age_in_Days.

The data in the actual dataframe will not change, and we will get our desired values in the Age column.

Example code:

# Python 3.x
import pandas as pd

df_age = pd.DataFrame(
    {
        "Age_in_Years": ["4 y", None, None, None],
        "Age_in_Months": ["48 m", "24 m", None, None],
        "Age_in_Days": ["1440 d", None, "2520 d", None],
    }
)
df_age["Age"] = (
    df_age["Age_in_Years"]
    .combine_first(df_age["Age_in_Months"])
    .combine_first(df_age["Age_in_Days"])
)
df_age

Output:

pandas coalesce

Use the bfill() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

bfill stands for backward fill. This method substitutes NaN with the next row or column value.

Here, we will specify axis=1 to return the value from the next column if the value in the current column is Null.

Example code:

# Python 3.x
import pandas as pd

df_age = pd.DataFrame(
    {
        "Age_in_Years": ["4 y", None, None, None],
        "Age_in_Months": ["48 m", "24 m", None, None],
        "Age_in_Days": ["1440 d", None, "2520 d", None],
    }
)
df_age["Age"] = df_age.bfill(axis=1).iloc[:, 0]
df_age

Output:

pandas coalesce

Use the mask() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame

The mask() method works similarly to if-then.

If the null condition is false for a column, its value will be used. Else, it will take the value from the other specified column.

Example code:

# Python 3.x
import pandas as pd

df_age = pd.DataFrame(
    {
        "Age_in_Years": ["4 y", None, None, None],
        "Age_in_Months": ["48 m", "24 m", None, None],
        "Age_in_Days": ["1440 d", None, "2520 d", None],
    }
)
df_age["Age"] = (
    df_age["Age_in_Years"]
    .mask(pd.isnull, df_age["Age_in_Months"])
    .mask(pd.isnull, df_age["Age_in_Days"])
)
df_age

Output:

pandas coalesce

Author: Fariba Laiq
Fariba Laiq avatar Fariba Laiq avatar

I am Fariba Laiq from Pakistan. An android app developer, technical content writer, and coding instructor. Writing has always been one of my passions. I love to learn, implement and convey my knowledge to others.

LinkedIn

Related Article - Pandas DataFrame