How to Convert Pandas DataFrame to Spark DataFrame
-
Use the
createDataFrame()Function to Convert Pandas DataFrame to Spark DataFrame -
Use the
createDataFrame()WithschemaFunction to Convert Pandas DataFrame to Spark DataFrame -
Use the
createDataFrame()Function Withapache arrowEnabled to Convert Pandas DataFrame to Spark DataFrame
The DataFrame is two dimensional and mutable data structure. The data in DataFrame is stored in labeled axes called rows and columns.
Both Pandas and Spark have DataFrames. This tutorial will discuss different methods to convert Pandas dataframe to Spark dataframe.
Use the createDataFrame() Function to Convert Pandas DataFrame to Spark DataFrame
The createDataFrame() function is used to create a Spark DataFrame from an RDD or a pandas.DataFrame. The createDataFrame() takes the data and scheme as arguments.
We will discuss the schema more shortly.
Syntax of createDataFrame():
createDataFrame(data, schema=None)
Parameters:
data= The dataframe to be passedschema=stror list, optional
Returns: DataFrame
Approach:
- Import the
pandaslibrary and create a Pandas Dataframe using theDataFrame()method. - Create a
sparksession by importing theSparkSessionfrom thepysparklibrary. - Pass the Pandas dataframe to the
createDataFrame()method of theSparkSessionobject. - Print the DataFrame.
The following code uses the createDataFrame() function to convert Pandas dataframe to Spark dataframe.
# import the pandas
import pandas as pd
# from pyspark library import sql
from pyspark import sql
# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
{
"Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
"Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
"price$": [199, 299, 99, 250, 399],
}
)
# Converting the pandas dataframe in to spark dataframe
spark_DataFrame = spark_session.createDataFrame(data)
# printing the dataframe
spark_DataFrame.show()
Output:
+----------+---------+------+
| Course| Mentor|price$|
+----------+---------+------+
| Python| Robert| 199|
| Spark|Elizibeth| 299|
| Java| Nolan| 99|
|JavaScript| Chris| 250|
| C#| johnson| 399|
+----------+---------+------+
Use the createDataFrame() With schema Function to Convert Pandas DataFrame to Spark DataFrame
We discussed the createDataFrame() method in the previous example. Now we will see how to change the schema while converting the DataFrame.
This example will use the schema to change the column names, Course to Technology, Mentor to developer and price to Salary.
Schema:
The schema defines the field names and their data types. In Spark, the schema is the structure of the DataFrame, the schema of DataFrame can be defined using the StructType class, which is a collection of StructField.
The StructField takes a field or column’s name, data type and nullable. The nullable parameter defines if that field can be null or not.
Approach:
- Import the
pandaslibrary and create a Pandas Dataframe using theDataFrame()method. - Create a
Sparksession by importing theSparkSessionfrom thepysparklibrary. - Create schema by passing the collection of
StructFieldto theStructTypeclass; theStructFieldobject is created by passing the name, data type, and nullable of a field. - Pass the Pandas dataframe and the schema to the
createDataFrame()method of theSparkSessionobject. - Print the DataFrame.
The following code uses the createDataFrame() and schema to convert Pandas dataframe to Spark dataframe.
# import the pandas
import pandas as pd
# from pyspark library import SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import *
# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
{
"Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
"Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
"price$": [199, 299, 99, 250, 399],
}
)
# Creating/Changing schema
dfSchema = StructType(
[
StructField("Technology", StringType(), True),
StructField("developer", StringType(), True),
StructField("Salary", IntegerType(), True),
]
)
# Converting the pandas dataframe in to spark dataframe
spark_DataFrame = spark_session.createDataFrame(data, schema=dfSchema)
# printing the dataframe
spark_DataFrame.show()
Output:
+----------+---------+------+
|Technology|developer|Salary|
+----------+---------+------+
| Python| Robert| 199|
| Spark|Elizibeth| 299|
| Java| Nolan| 99|
|JavaScript| Chris| 250|
| C#| johnson| 399|
+----------+---------+------+
Use the createDataFrame() Function With apache arrow Enabled to Convert Pandas DataFrame to Spark DataFrame
The Apache Arrow is a language-independent columnar memory format for flat and hierarchical data or any structured data format. Apache Arrow improves the efficiency of data analytics by creating a standard columnar memory format.
The apache arrow is disabled by default; we can enable it explicitly using the following code.
SparkSession.conf.set("spark.sql.execution.arrow.enabled", "true")
Approach:
- Import the
pandaslibrary and create a Pandas Dataframe using theDataFrame()method. - Create a
sparksession by importing theSparkSessionfrom thepysparklibrary. - Enable the apache arrow using the
confproperty. - Pass the Pandas dataframe to the
createDataFrame()method of theSparkSessionobject. - Print the DataFrame.
The following code uses the createDataFrame() function by enabling the apache arrow to convert Pandas dataframe to Spark dataframe.
# import the pandas
import pandas as pd
# from pyspark library import sql
from pyspark import sql
# Creating a SparkSession
spark_session = sql.SparkSession.builder.appName("pdf to sdf").getOrCreate()
# Creating the pandas DataFrame using pandas.DataFrame()
data = pd.DataFrame(
{
"Course": ["Python", "Spark", "Java", "JavaScript", "C#"],
"Mentor": ["Robert", "Elizibeth", "Nolan", "Chris", "johnson"],
"price$": [199, 299, 99, 250, 399],
}
)
spark_session.conf.set("spark.sql.execution.arrow.enabled", "true")
# Converting the pandas dataframe in to spark dataframe
sprak_arrow = spark_session.createDataFrame(data)
# printing the dataframe
sprak_arrow.show()
Output:
+----------+---------+------+
| Course| Mentor|price$|
+----------+---------+------+
| Python| Robert| 199|
| Spark|Elizibeth| 299|
| Java| Nolan| 99|
|JavaScript| Chris| 250|
| C#| johnson| 399|
+----------+---------+------+