If you are a Python enthusiast who uses Python for data analysis and processing, you may be interested in reading a .gz file as a Pandas data frame. This tutorial shows one possible way to read a .gz file as a data frame using a Python library called pandas.
Read GZ File in Pandas
gz is the file extension for files compressed with the standard GNU zip (gzip) compression algorithm. It is widely used as a compression format on Linux and Unix operating systems; for instance, if you have a file for an email, you can use the gz file format to compress it into a smaller file.
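To make the size reduction concrete, here is a minimal sketch that uses Python's standard gzip module. The file name email.txt and the repeated sample text are assumptions made up for illustration; real files will compress by different amounts.

```python
import gzip
import os
import tempfile

# Hypothetical sample content: repetitive text compresses very well.
text = ("Subject: weekly report\n" * 2000).encode("utf-8")

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "email.txt")
    gz_path = plain_path + ".gz"

    # Write the uncompressed file.
    with open(plain_path, "wb") as plain_file:
        plain_file.write(text)

    # Write the same bytes through gzip compression.
    with gzip.open(gz_path, "wb") as gz_file:
        gz_file.write(text)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    # Reading the .gz file back returns the original bytes.
    with gzip.open(gz_path, "rb") as gz_file:
        restored = gz_file.read()

print("plain bytes:", plain_size)
print("gzip bytes:", gz_size)
print("round trip ok:", restored == text)
```

The compressed file should be much smaller than the plain one here, because the sample text is highly repetitive.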
Large data files are often compressed using a compression algorithm, and to use this data, the user has to read the content into an organized structure.
The Python library Pandas has a data type called a data frame, which is an integral part of the Python and NumPy ecosystem. Data frames are faster, easier to use, and more powerful than tables and spreadsheets.
A data frame is a data structure used to represent two-dimensional, resizable, potentially heterogeneous tabular data. It contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. A data frame can be thought of as a dict-like container for Series objects, similar to a spreadsheet or an SQL table.
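The properties described above can be sketched with two tiny data frames. All column names, row labels, and values here are hypothetical and chosen only for illustration.

```python
import pandas as pd

# A data frame with labeled rows and columns of different types.
frame = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5]},
    index=["r1", "r2", "r3"],
)

# Dict-like access: selecting one column returns a Series.
score_column = frame["score"]

# Two frames whose row and column labels are in different orders.
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10, 20], "x": [100, 200]}, index=["r2", "r1"])

# Addition matches cells by label, not by position.
total = a + b

print(type(score_column).__name__)
print(total.loc["r1", "x"])  # 1 + 200
print(total.loc["r2", "y"])  # 4 + 10
```

Because the cells are matched by their labels, the order in which the rows and columns were declared does not affect the result.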
So, if we are interested in reading a .gz file as a Pandas data frame using Python, note that the compressed content of the .gz file cannot be used directly; we need to arrange the file's data into an organized format using Python.
So, how do we read the .gz file? For that, we follow the steps given below.
- State the absolute path of the .gz file and subsequent attributes for file reading.
- Use the read_csv() method from the pandas module and pass the parameters.
- Use the resulting DataFrame to view and manipulate the data of the .gz file.
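The steps above can be tried without any existing file. The sketch below first writes a small gzip-compressed CSV to a temporary directory and then reads it back. The file name sample.csv.gz and the sample data are assumptions made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only so the sketch is self-contained.
sample = pd.DataFrame({"city": ["Oslo", "Lima"], "sales": [120.5, 98.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Step 1: state the path of the .gz file.
    gz_path = os.path.join(tmp_dir, "sample.csv.gz")
    sample.to_csv(gz_path, index=False, compression="gzip")

    # Step 2: call read_csv() and pass the parameters.
    loaded = pd.read_csv(gz_path, compression="gzip")

# Step 3: view and manipulate the data through the DataFrame.
print(loaded.shape)
print(loaded["sales"].sum())
```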
Use Pandas Data Frame to Read gz File
Suppose we want to read a gz-compressed CSV file whose path is path_gzip_file = 'F:/50_Startups.csv.gz'.
Let’s run the following code to do it.
Example Code (saved in demo.py):
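import pandas as pd

# Absolute path of the gzip-compressed CSV file
path_gzip_file = 'F:/50_Startups.csv.gz'

# Decompress the file and parse it into a data frame
gzip_file_data_frame = pd.read_csv(
    path_gzip_file,
    compression='gzip',
    header=0,
    sep=',',
    quotechar='"')

# Print the first five rows
print(gzip_file_data_frame.head(5))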
First, we import the pandas module and alias it as pd to work with data frames and to read files. Next, we specify the absolute path of our .gz file.
After that, we call the pd.read_csv() method of the pandas module and pass parameters. pd.read_csv() takes multiple parameters and returns a pandas data frame.
We pass five parameters that are listed below.
- The first one is a string holding the file path (in this case, path_gzip_file).
- The second is the string compression type (in this case, gzip).
- The third is the integer header=0, which takes the column names from the first row and also allows the existing names to be replaced when names is passed. The header can also be a list of integers specifying row positions for a multi-index on the columns.
- The fourth is the string delimiter sep (in this case, a comma).
- The fifth one is quotechar, an optional string of length 1: the character used to mark the beginning and end of quoted items. Quoted items can contain the delimiter, which is then ignored.
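To see what quotechar and header=0 do in practice, here is a small sketch. The file name people.csv.gz, the sample rows, and the replacement column names are all hypothetical.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical CSV text: each address contains a comma inside the quotes.
csv_text = (
    'name,address\n'
    '"Ann","1 Main St, Springfield"\n'
    '"Bob","2 Oak Ave, Shelbyville"\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "people.csv.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as gz_file:
        gz_file.write(csv_text)

    # The comma inside the quoted address is not treated as a delimiter.
    people = pd.read_csv(
        gz_path, compression="gzip", header=0, sep=",", quotechar='"')

    # With header=0, passing names replaces the names found in the first row.
    renamed = pd.read_csv(
        gz_path, compression="gzip", header=0,
        names=["person", "location"], sep=",", quotechar='"')

print(people.loc[0, "address"])
print(list(renamed.columns))
```

Both calls return two rows and two columns; only the column names differ.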
Finally, we chain the data frame with the head() function, which takes one parameter n and returns the first n rows of data, and then we print the result.
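As a quick illustration of head(), the sketch below uses a hypothetical ten-row data frame. When n is omitted, pandas returns the first five rows.

```python
import pandas as pd

# Hypothetical data frame with ten rows.
numbers = pd.DataFrame({"value": range(10)})

first_three = numbers.head(3)   # explicit n
default_rows = numbers.head()   # n defaults to 5

print(len(first_three))
print(len(default_rows))
```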
Now, we run the above code as follows:
PS F:\> & C:/Python310/python.exe f:/demo.py
The 50_Startups.csv.gz file is read successfully. The first 5 rows of the file content are shown below.
   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94