How to Convert Pandas DataFrame to Spark DataFrame

Pulamolu Sai Mohan Feb 02, 2024
  1. Use the createDataFrame() Function to Convert Pandas DataFrame to Spark DataFrame
  2. Use the createDataFrame() Function With a Schema to Convert Pandas DataFrame to Spark DataFrame
  3. Use the createDataFrame() Function With Apache Arrow Enabled to Convert Pandas DataFrame to Spark DataFrame

A DataFrame is a two-dimensional, mutable data structure whose data is stored in labeled axes: rows and columns.

Both Pandas and Spark provide DataFrames. This tutorial discusses different methods to convert a Pandas DataFrame to a Spark DataFrame.

Use the createDataFrame() Function to Convert Pandas DataFrame to Spark DataFrame

The createDataFrame() function creates a Spark DataFrame from an RDD, a list, or a pandas.DataFrame. It takes the data and an optional schema as arguments.

We will discuss the schema in more detail shortly.

Syntax of createDataFrame():

createDataFrame(data, schema=None)

Parameters:

  1. data = the data to convert: an RDD, a list, or a pandas.DataFrame
  2. schema = the column names and/or types: str, list, or StructType, optional

Returns: DataFrame
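
As a minimal sketch of these two arguments (assuming a running SparkSession named spark_session, created as in the examples below), the data can be a plain list of tuples and the schema a list of column names:

rows = [("Python", 199), ("Spark", 299)]

# schema given as a list of column names; Spark infers the data types
small_df = spark_session.createDataFrame(rows, schema=["Course", "price$"])
small_df.show()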

Approach:

  1. Import the pandas library and create a Pandas DataFrame using the DataFrame() method.
  2. Create a Spark session by importing SparkSession from the pyspark library.
  3. Pass the Pandas DataFrame to the createDataFrame() method of the SparkSession object.
  4. Print the DataFrame.

The following code uses the createDataFrame() function to convert a Pandas DataFrame to a Spark DataFrame.

# import the pandas
import pandas as pd

# from pyspark library import sql
from pyspark import sql

# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()

# Creating the pandas DataFrame using pandas.DataFrame()

data = pd.DataFrame(
    {
        "Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
        "Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
        "price$": [199, 299, 99, 250, 399],
    }
)

# Converting the pandas DataFrame into a Spark DataFrame
spark_DataFrame = spark_session.createDataFrame(data)

# printing the dataframe
spark_DataFrame.show()

Output:

+----------+---------+------+
|    Course|   Mentor|price$|
+----------+---------+------+
|    Python|   Robert|   199|
|     Spark|Elizibeth|   299|
|      Java|    Nolan|    99|
|JavaScript|    Chris|   250|
|        C#|  johnson|   399|
+----------+---------+------+
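
Since no schema was passed, Spark inferred the column types from the Pandas dtypes. A quick way to verify what was inferred is the printSchema() method:

# Inspecting the schema Spark inferred from the Pandas dtypes
spark_DataFrame.printSchema()

# Expected output (exact types may vary by Spark version):
# root
#  |-- Course: string (nullable = true)
#  |-- Mentor: string (nullable = true)
#  |-- price$: long (nullable = true)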

Use the createDataFrame() Function With a Schema to Convert Pandas DataFrame to Spark DataFrame

We discussed the createDataFrame() method in the previous example. Now we will see how to change the schema while converting the DataFrame.

This example will use the schema to change the column names: Course to Technology, Mentor to developer, and price$ to Salary.

Schema:

The schema defines the field names and their data types. In Spark, the schema is the structure of the DataFrame; it can be defined using the StructType class, which is a collection of StructField objects.

Each StructField takes a field (column) name, a data type, and a nullable flag. The nullable flag defines whether that field can contain null values.
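
For example, a minimal schema with a single nullable string field (the field name col_name here is purely illustrative) looks like this:

from pyspark.sql.types import StructType, StructField, StringType

# one string field named "col_name"; True means the field may be null
tiny_schema = StructType([StructField("col_name", StringType(), True)])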

Approach:

  1. Import the pandas library and create a Pandas DataFrame using the DataFrame() method.
  2. Create a Spark session by importing SparkSession from the pyspark library.
  3. Create the schema by passing a collection of StructField objects to the StructType class; each StructField is created by passing the name, data type, and nullability of a field.
  4. Pass the Pandas DataFrame and the schema to the createDataFrame() method of the SparkSession object.
  5. Print the DataFrame.

The following code uses createDataFrame() with a schema to convert a Pandas DataFrame to a Spark DataFrame.

# import the pandas
import pandas as pd

# from pyspark library import SparkSession
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Creating a SparkSession
spark_session = SparkSession.builder.appName("pdf to sdf").getOrCreate()


# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
    {
        "Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
        "Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
        "price$": [199, 299, 99, 250, 399],
    }
)

# Creating/Changing schema
dfSchema = StructType(
    [
        StructField("Technology", StringType(), True),
        StructField("developer", StringType(), True),
        StructField("Salary", IntegerType(), True),
    ]
)

# Converting the pandas DataFrame into a Spark DataFrame
spark_DataFrame = spark_session.createDataFrame(data, schema=dfSchema)


# printing the dataframe
spark_DataFrame.show()

Output:

+----------+---------+------+
|Technology|developer|Salary|
+----------+---------+------+
|    Python|   Robert|   199|
|     Spark|Elizibeth|   299|
|      Java|    Nolan|    99|
|JavaScript|    Chris|   250|
|        C#|  johnson|   399|
+----------+---------+------+
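
Note that the schema fields are matched to the Pandas columns by position, not by name, which is why this renames the columns. If renaming is all you need, an alternative sketch is to convert first and then call toDF() with the new column names:

# Alternative: convert with inferred types, then rename the columns positionally
renamed_df = spark_session.createDataFrame(data).toDF("Technology", "developer", "Salary")
renamed_df.show()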

Use the createDataFrame() Function With Apache Arrow Enabled to Convert Pandas DataFrame to Spark DataFrame

Apache Arrow is a language-independent columnar memory format for flat and hierarchical data. It improves the efficiency of data analytics by providing a standard columnar memory format, and PySpark uses it to speed up the transfer of data between the JVM and Python processes.

Apache Arrow support is disabled by default in PySpark; we can enable it explicitly on the session object using the following code. (In Spark 3.0 and later, the option was renamed to spark.sql.execution.arrow.pyspark.enabled; the old name still works but is deprecated.)

spark_session.conf.set("spark.sql.execution.arrow.enabled", "true")

Approach:

  1. Import the pandas library and create a Pandas DataFrame using the DataFrame() method.
  2. Create a Spark session by importing SparkSession from the pyspark library.
  3. Enable Apache Arrow using the conf property of the session.
  4. Pass the Pandas DataFrame to the createDataFrame() method of the SparkSession object.
  5. Print the DataFrame.

The following code uses the createDataFrame() function with Apache Arrow enabled to convert a Pandas DataFrame to a Spark DataFrame.

# import the pandas
import pandas as pd

# from pyspark library import sql
from pyspark import sql

# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()

# Creating the pandas DataFrame using pandas.DataFrame()

data = pd.DataFrame(
    {
        "Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
        "Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
        "price$": [199, 299, 99, 250, 399],
    }
)


# Enabling Apache Arrow for the conversion
# (on Spark 3.0+, the preferred key is "spark.sql.execution.arrow.pyspark.enabled")
spark_session.conf.set("spark.sql.execution.arrow.enabled", "true")

# Converting the pandas DataFrame into a Spark DataFrame
spark_arrow = spark_session.createDataFrame(data)

# printing the dataframe
spark_arrow.show()

Output:

+----------+---------+------+
|    Course|   Mentor|price$|
+----------+---------+------+
|    Python|   Robert|   199|
|     Spark|Elizibeth|   299|
|      Java|    Nolan|    99|
|JavaScript|    Chris|   250|
|        C#|  johnson|   399|
+----------+---------+------+
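
To confirm that the Arrow flag is set on the session, you can read it back from the session configuration:

# Reading the Arrow flag back from the session configuration
print(spark_session.conf.get("spark.sql.execution.arrow.enabled"))
# 'true'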
