Read and Write to Parquet Files in Python

Jay Shaw Oct-07, 2022
  1. Parquet Interfaces That Read and Write to Parquet Files in Python
  2. Write DataFrames to Parquet File Using the PyArrow Module in Python
  3. Read Meta-Data of Parquet Files Using the PyArrow Module in Python
  4. Write Data to Parquet Files Using the Fastparquet Engine in Python
  5. Read Parquet Files Using Fastparquet Engine in Python
  6. Conclusion

This article explains how to write and read parquet files in Python. Parquet is a file format that stores data column by column rather than row by row.

Compared with row-based file formats like CSV, Parquet is optimized for analytical performance. Because each column is stored separately, a query can read only the columns it needs instead of scanning every row in full.

Parquet Interfaces That Read and Write to Parquet Files in Python

Pandas relies on an engine, a separate library, to write data frames to parquet files and to read them back. This article covers two such engines.

The Apache Parquet project offers a standardized open-source columnar storage format for use in data analysis systems. Apache Arrow is well suited as an in-memory transport layer for data being read from or written to Parquet files.

We will learn about two interfaces that read and write parquet files in Python: pyarrow and fastparquet.

The PyArrow Module in Python

Apache Arrow is a development platform for in-memory analytics, and PyArrow is its Python library. It contains a set of technologies that lets big data systems store, process, and transfer data quickly.

PyArrow includes Python bindings to the C++ implementation of Apache Parquet, making it possible to write and read parquet files using Pandas.

Installing pyarrow is easy with pip and conda.

For pip, use the command:

pip install pyarrow

For conda, use this command:

conda install -c conda-forge pyarrow

Write DataFrames to Parquet File Using the PyArrow Module in Python

To understand how to write data frames and read parquet files in Python, let's create a Pandas data frame and convert it to an Arrow table in the program below.

There are four imports needed:

  1. pyarrow - For creating Arrow tables.
  2. numpy - For multi-dimensional arrays; here it supplies the missing value np.nan.
  3. pandas - For creating data frames.
  4. pyarrow.parquet - The pyarrow submodule that reads and writes parquet files.

This program creates a data frame store1 with columns of several types: numeric, string, and Boolean. The index is set to list('abc'), which labels the rows a, b, and c.

In a variable table1, an Arrow table is created from the data frame using Table.from_pandas(). This table is printed to check the results.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

store1 = pd.DataFrame({'first': [5, np.nan, -9],
                       'second': ['apple', 'samsung', 'mi'],
                       'third': [False, False, True]},
                      index=list('abc'))

# Convert the Pandas data frame into an Arrow table.
table1 = pa.Table.from_pandas(store1)
print(table1)

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"
pyarrow.Table
first: double
second: string
third: bool
__index_level_0__: string
----
first: [[5,null,-9]]
second: [["apple","samsung","mi"]]
third: [[false,false,true]]
__index_level_0__: [["a","b","c"]]

Process finished with exit code 0

Now, this data is written in parquet format with write_table(). The function accepts several arguments that control different settings when writing a parquet file.

  1. data_page_size - This parameter controls the approximate size of encoded data pages within a column chunk. Currently, 1MB is the default value.
  2. flavor - This sets compatibility options particular to a Parquet consumer, such as 'spark' for Apache Spark.
  3. version - This is the Parquet format version to use. '1.0' ensures compatibility with older readers, while '2.4' and greater values enable more Parquet types and encodings.

The program in this section passes write_table() the table table1 and the file name sample_file.parquet as the destination.

The destination does not have to be a string. Any of the following are possible:

  • A file path as a string
  • A native PyArrow file
  • A file object in Python

To read the file back, the read_table() function is used, and the resulting table is stored in a variable table2.

Lastly, table2.to_pandas() converts the table into a Pandas data frame. It returns a new data frame and leaves table2 unchanged, so printing table2 still displays the Arrow table.

pq.write_table(table1, 'sample_file.parquet')

table2 = pq.read_table('sample_file.parquet')

# to_pandas() returns a new data frame; table2 itself stays an Arrow table.
df2 = table2.to_pandas()

print("\n", table2)

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"

 pyarrow.Table
first: double
second: string
third: bool
__index_level_0__: string
----
first: [[5,null,-9]]
second: [["apple","samsung","mi"]]
third: [[false,false,true]]
__index_level_0__: [["a","b","c"]]

Process finished with exit code 0

Parquet files are often large, and loading a whole file can take a long time. To read data more quickly, specific columns can be passed so that only those columns are loaded instead of the whole file.

In the variable table3, the pq.read_table function is used to read the data. Two columns are passed in the columns parameter: first and third.

table3 = pq.read_table('sample_file.parquet', columns=['first', 'third'])
print(table3)

The output will display the selected columns.

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"
pyarrow.Table
first: double
third: bool
----
first: [[5,null,-9]]
third: [[false,false,true]]

Process finished with exit code 0

When reading a subset of columns from a file that was written from a Pandas data frame, we use read_pandas to keep any extra index column data:

table4 = pq.read_pandas('sample_file.parquet', columns=['second']).to_pandas()
print(table4)

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"
    second
a    apple
b  samsung
c       mi

Process finished with exit code 0

A string file path or an instance of NativeFile (particularly memory maps) will perform better when read than a Python file object, which typically has the poorest read speed.

One or more special columns are automatically added when using pa.Table.from_pandas to convert a data frame into an Arrow table, to keep track of the index (row labels). Because storing the index takes extra storage space, it can be omitted by passing preserve_index=False if the index is not valuable.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

store = pd.DataFrame({'first': [5, np.nan, -9],
                      'second': ['apple', 'samsung', 'mi'],
                      'third': [False, False, True]},
                     index=list('abc'))

print(store)

# Drop the row labels instead of storing them as an extra column.
table = pa.Table.from_pandas(store, preserve_index=False)
pq.write_table(table, 'sample_file.parquet')
t = pq.read_table('sample_file.parquet')

print("\n", t.to_pandas())

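The data frame read back from the parquet file no longer has the original index; the row labels a, b, and c are replaced by a default integer index.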

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"
   first   second  third
a    5.0    apple  False
b    NaN  samsung  False
c   -9.0       mi   True

    first   second  third
0    5.0    apple  False
1    NaN  samsung  False
2   -9.0       mi   True

Process finished with exit code 0

Read Meta-Data of Parquet Files Using the PyArrow Module in Python

In addition to reading data from files, the ParquetFile class, which the read_table method uses, offers additional features such as reading the metadata.

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
print(parquet_file.metadata)

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"
<pyarrow._parquet.FileMetaData object at 0x000001DADCBDCA90>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 4
  num_rows: 3
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 2580

Process finished with exit code 0

Write Data to Parquet Files Using the Fastparquet Engine in Python

Fastparquet is a Python interface for the parquet file format. Like pyarrow, it must be installed separately, for example with pip install fastparquet.

This program writes to a parquet file using fastparquet. A data frame store is created with two columns: student and marks.

The data frame is written to a parquet file sample.parquet using the DataFrame.to_parquet() function.

The engine is selected as fastparquet but can also be set to pyarrow.

import pandas as pd

store = pd.DataFrame({
    'student': ['Michael', 'Jackson', 'N', 'John', 'Cena'],
    'marks': [20, 10, 22, 21, 22],
})

print(store)
store.to_parquet('sample.parquet', engine='fastparquet')

Output:

C:\python38\python.exe "C:/Users/Win 10/main.py"
   student  marks
0  Michael     20
1  Jackson     10
2        N     22
3     John     21
4     Cena     22

Process finished with exit code 0

Now that the data is written to the parquet file, let's read the file.

Read Parquet Files Using Fastparquet Engine in Python

The parquet file is read using the pd.read_parquet function, setting the engine to fastparquet and storing it inside a variable df. Then the results are printed.

import pandas as pd

df = pd.read_parquet('sample.parquet', engine='fastparquet')
print(df)

Output:

C:\python38\python.exe "C:/Users/Win 10/PycharmProjects/read_parquet/main.py"
   student  marks
0  Michael     20
1  Jackson     10
2        N     22
3     John     21
4     Cena     22

Process finished with exit code 0

Conclusion

This article explains how to write and read parquet files in Python. The program examples demonstrate both operations using pyarrow and fastparquet.

With these examples, the reader should be able to create programs that read and write parquet files in Python.
