Read Specific Rows From CSV in Pandas

Read Specific Rows From CSV in Pandas

  1. Use In-Place Assignment to Read Specific Rows From CSV in Pandas
  2. Use the query Function to Read Specific Rows From CSV in Pandas
  3. Use chunksize to Read Specific Rows From CSV in Pandas
  4. Use loc to Read Specific Rows From CSV in Pandas
  5. Use iloc to Read Specific Rows From CSV in Pandas

Not all the data we have access to is needed when working with data. Often, we might need just a part of the data.

With Python, we can work with specific sections of data.

If we work with CSV files, we can work and read specific rows from CSV in Pandas. This article will discuss how we will work with specific rows from CSV in Pandas.

Use In-Place Assignment to Read Specific Rows From CSV in Pandas

We can’t read specific rows from CSV during the reading process (via the read_csv function). Still, after completing the read operation, we can use an in-place assignment to select specific rows from the dataframe we want.

To illustrate, we will use a CSV file attached here. First, let’s read the CSV file - leverage.csv.

import pandas as pd

df = pd.read_csv('leverage.csv')

print(df.head())

Output:

gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq datacqtr  \
0   1300  20150331    2015     1   INDL      C      D     STD    USD   2015Q1
1   1300  20150630    2015     2   INDL      C      D     STD    USD   2015Q2
2   1300  20150930    2015     3   INDL      C      D     STD    USD   2015Q3
3   1300  20151231    2015     4   INDL      C      D     STD    USD   2015Q4
4   1300  20160331    2016     1   INDL      C      D     STD    USD   2016Q1

  datafqtr      atq    cshoq   dlttq     lctq costat   prccq
0   2015Q1  45357.0  781.707  5661.0  15432.0      A  104.31
1   2015Q2  46412.0  781.762  5562.0  15574.0      A  101.97
2   2015Q3  46625.0  770.691  5599.0  16367.0      A   94.69
3   2015Q4  49316.0  770.400  5554.0  18371.0      A  103.57
4   2016Q1  50365.0  762.115  9700.0  15659.0      A  112.05

With this dataframe, we can now select (or read) specific rows. For example, if we want only rows where the fyearq is above 2017, we can use an in-place assignment.

df = df[df['fyearq']>2017]
print(df.head())

Output:

gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq  \
59    1690  20171231    2018     1   INDL      C      D     STD    USD
107   2111  20171231    2018     1   INDL      C      D     STD    USD
179   2663  20171031    2018     1   INDL      C      D     STD    USD
335   3980  20171231    2018     1   INDL      C      D     STD    USD
670   6547  20171231    2018     1   INDL      C      D     STD    USD

    datacqtr datafqtr       atq     cshoq     dlttq      lctq costat   prccq
59    2017Q4   2018Q1  406794.0  5081.651  103922.0  115788.0      A  169.23
107   2017Q4   2018Q1   55363.0   266.242   22095.0    4895.0      A  214.06
179   2017Q3   2018Q1    7746.0   301.000    2269.0    2583.0      A   47.37
335   2017Q4   2018Q1   97734.0  1500.000   20082.0   19875.0      A  107.51
670   2017Q4   2018Q1    6701.1    48.340    2030.0    1211.3      A   75.15

Use the query Function to Read Specific Rows From CSV in Pandas

Instead of using an in-place assignment, we can use the query function, which allows us to pass a query string with a Boolean expression and whichever rows fulfill the expression is selected.

We can achieve the same effect as in the last section by passing the string "fyearq > 2017" to the query function.

df = df.query("fyearq > 2017")
print(df)

Output:

gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq  \
59    1690  20171231    2018     1   INDL      C      D     STD    USD
107   2111  20171231    2018     1   INDL      C      D     STD    USD
179   2663  20171031    2018     1   INDL      C      D     STD    USD
335   3980  20171231    2018     1   INDL      C      D     STD    USD
670   6547  20171231    2018     1   INDL      C      D     STD    USD
...

Use chunksize to Read Specific Rows From CSV in Pandas

However, if we need to work with the CSV file in batches (due to choice, constraints, or memory), we can use the chunksize parameter to read only some rows. Therefore, within the read_csv function, we specify the chunksize parameter, and the result with be an iterator object that we can loop through to access the chunks of the dataset.

Here, we specified that the chunksize be 60, so only 60 rows are loaded.

chunksize = 60
filename = 'leverage.csv'
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        print(chunk)
        print("--------------------------------")

Output:

gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq  \
0    1300  20150331    2015     1   INDL      C      D     STD    USD
1    1300  20150630    2015     2   INDL      C      D     STD    USD
2    1300  20150930    2015     3   INDL      C      D     STD    USD
3    1300  20151231    2015     4   INDL      C      D     STD    USD
...
57   1690  20170630    2017     3   INDL      C      D     STD    USD
58   1690  20170930    2017     4   INDL      C      D     STD    USD
59   1690  20171231    2018     1   INDL      C      D     STD    USD
--------------------------------
     gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq  \
60    1837  20150331    2015     1   INDL      C      D     STD    USD
61    1837  20150630    2015     2   INDL      C      D     STD    USD
62    1837  20150930    2015     3   INDL      C      D     STD    USD
...

We used the --- to indicate that only 60 rows are loaded at a time.

Use loc to Read Specific Rows From CSV in Pandas

Just like the query function or the in-place assignment, we can make use of the loc operator to pass a Boolean expression that will define the specific rows that are read.

df = df.loc[df['fyearq'] > 2017]
print(df)

Output:

gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq  \
59    1690  20171231    2018     1   INDL      C      D     STD    USD
107   2111  20171231    2018     1   INDL      C      D     STD    USD
179   2663  20171031    2018     1   INDL      C      D     STD    USD
335   3980  20171231    2018     1   INDL      C      D     STD    USD
670   6547  20171231    2018     1   INDL      C      D     STD    USD
...

Use iloc to Read Specific Rows From CSV in Pandas

If we need a set of rows and know their indexes, we can use the iloc operator and pass multiple indexes within a []. Then, all the rows with the indexes are read.

Here we read only the rows with the indexes 12, 13, and 45.

df = df.iloc[[12, 13, 45]]
print(df)

Output:

gvkey  datadate  fyearq  fqtr indfmt consol popsrc datafmt curcdq  \
12   1440  20150331    2015     1   INDL      C      D     STD    USD
13   1440  20150630    2015     2   INDL      C      D     STD    USD
45   1487  20170630    2017     2   INDL      C      D     STD    USD
Olorunfemi Akinlua avatar Olorunfemi Akinlua avatar

Olorunfemi is a lover of technology and computers. In addition, I write technology and coding content for developers and hobbyists. When not working, I learn to design, among other things.

LinkedIn

Related Article - Pandas CSV

  • Pandas Read_csv From String
  • Convert Pandas to CSV Without Index