Read GZ File in Pandas

  1. Read GZ File in Pandas
  2. Use Pandas Data Frame to Read gz File

If you use Python for data analysis and processing, you may want to read a gz file into a Pandas data frame. This tutorial shows one way to do that with the Python library pandas.

Read GZ File in Pandas

gz is the file extension for files compressed with the standard GNU zip (gzip) compression algorithm. The format is widely used on Linux and Unix operating systems; for instance, a file containing an email can be compressed into a smaller gz file.
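To make the format concrete, here is a minimal sketch that uses Python's standard gzip module to compress some CSV text in memory and decompress it again. The sample text is made up for illustration.

```python
import gzip

# Hypothetical sample data, repeated so the compressor has something to work with.
csv_text = "name,value\n" + "alpha,1\n" * 1000
raw_bytes = csv_text.encode("utf-8")

# Compress and decompress in memory.
compressed = gzip.compress(raw_bytes)
restored = gzip.decompress(compressed)

# Repetitive data shrinks considerably; the round trip is lossless.
print(len(raw_bytes), len(compressed))
print(restored == raw_bytes)
```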

Large data files are often compressed, and to use the data, the user has to read the content into an organized structure.

Pandas, a Python library, provides a data type called a data frame. It is an integral part of the Python and NumPy ecosystem and is often faster, easier to use, and more powerful for data analysis than plain tables and spreadsheets.

A data frame is a data structure used to represent two-dimensional, resizable, potentially heterogeneous tabular data. It contains labeled axes (rows and columns).

Arithmetic operations align on both row and column labels. A data frame can be thought of as a dict-like container for Series objects, similar to a spreadsheet or an SQL table.
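The following small sketch illustrates these two properties, label-based arithmetic and dict-like column access. The frames and their values are invented for the example.

```python
import pandas as pd

# Two small frames with made-up values; their row labels only partly overlap.
left = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=["x", "y"])
right = pd.DataFrame({"a": [100, 200], "b": [1000, 2000]}, index=["y", "z"])

# Addition matches cells by row and column label, not by position.
# Labels that exist in only one frame produce NaN.
total = left + right
print(total)

# Dict-like access: selecting a column by name returns a Series.
column_a = left["a"]
print(type(column_a).__name__)
```

Only the row labelled y exists in both frames, so it is the only row with numeric results.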

So, if we want to read a gz file as a Pandas data frame, note that the compressed bytes of the .gz file cannot be used as they are; the data has to be decompressed and arranged into an organized format using Python. The pandas read_csv() function can do both in one call.
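As a minimal sketch of what this means in practice, the example below writes a tiny gzip-compressed CSV to a temporary directory, shows that the bytes on disk are not plain text, and then parses the file with pd.read_csv(). The file name and its contents are hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample file, created in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
    f.write("city,population\nOslo,700000\nBergen,290000\n")

# The bytes on disk are compressed; gzip files start with the bytes 0x1f 0x8b.
with open(path, "rb") as f:
    magic = f.read(2)
print(magic)

# pd.read_csv() decompresses and parses the file in one step.
df = pd.read_csv(path, compression="gzip")
print(df)
```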

So, how to read the .gz file? For that, we will have to follow the steps given below.

  • State the absolute path of the gz file and the options for reading it.
  • Use the read_csv() function from the pandas module and pass it the path and those options.
  • Use pandas DataFrame to view and manipulate the data of the gz file.

Use Pandas Data Frame to Read gz File

Suppose we want to read 50_Startups.csv.gz, a gzip-compressed version of the CSV file 50_Startups.csv.

path_gzip_file = 'F:/50_Startups.csv.gz'

Let’s run the following code to do it.

Example Code (saved in demo.py):

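import pandas as pd

# Absolute path of the gzip-compressed CSV file
path_gzip_file = 'F:/50_Startups.csv.gz'

# Decompress the file and parse it as CSV; the first row holds the column names
gzip_file_data_frame = pd.read_csv(
    path_gzip_file, compression='gzip',
    header=0, sep=',', quotechar='"')

# Show the first five rows
print(gzip_file_data_frame.head(5))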

First, we import the pandas module and alias it as pd to work with data frames and to read files. Next, we specify an absolute path of our gz file.

After that, we call the pd.read_csv() function of the pandas module and pass it parameters. pd.read_csv() takes multiple parameters and returns a pandas data frame.

We pass five parameters that are listed below.

  1. The first is the file path, given as a string.
  2. The second is the compression type, also a string (in this case, gzip).
  3. The third is the integer header, the row number that holds the column names (here 0, the first row). Passing header=0 explicitly also allows the existing names to be replaced through the names parameter. The header can also be a list of integers specifying row positions for a multi-level column index, e.g. [0, 1, 3].
  4. The fourth is the delimiter string sep (in this case, a comma).
  5. The fifth is quotechar, an optional string of length 1: the character used to mark the beginning and end of a quoted item. Quoted items can contain the delimiter, which is then ignored.
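The sketch below tries out some of these parameters on a small, made-up file. It relies on two pandas behaviours: compression defaults to 'infer', so a path ending in .gz is decompressed even when the parameter is omitted, and header=0 combined with names replaces the column names found in the file. The file name, contents, and replacement names are hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample file whose first column contains commas inside quotes.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv.gz")
with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
    f.write('name,note\n"Smith, J",first\n"Lee, K",second\n')

# compression defaults to "infer", so the ".gz" extension is enough here.
inferred = pd.read_csv(path)

# header=0 together with names replaces the column names from the first row.
renamed = pd.read_csv(path, header=0, names=["person", "comment"])

print(inferred)
print(renamed)
```

Because the default quotechar is the double quote, the comma inside "Smith, J" is kept as part of the value instead of being treated as a delimiter.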

Finally, we call the head() function on the data frame. It takes one parameter n and returns the first n rows, which we then print.
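As a quick illustration of head(), independent of the gz file, here is a small sketch on an invented ten-row frame. When n is omitted, head() returns the first five rows.

```python
import pandas as pd

# Made-up frame with ten rows, numbered 0 to 9.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)   # first three rows
default_head = df.head()   # n defaults to 5

print(first_three)
print(len(default_head))
```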

Now, we run the above code as follows:

PS F:\> & C:/Python310/python.exe f:/demo.py

Our 50_Startups.csv.gz file is read successfully. See the first 5 rows from the file content below.

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94