How to Read CSV Data From String in Pandas

Salman Mehmood Feb 02, 2024
How to Read CSV Data From String in Pandas

The main aim of this article is to demonstrate how to read CSV text data from a string or package data using Pandas in Python.

Python Pandas read_csv From String

the Problem

Data can be stored in multiple forms; two of the main forms are defined as structured and unstructured. These main forms can then be further divided into many forms, CSV (comma-separated values) being one of them.

While reading data from a CSV file, be it for some complex queries or processing, depending on the situation, we may need to read data from a string or a certain package.

Consider the following code:

import pandas as pd

import pkgutil
from io import StringIO


def get_data_file(pkg, path):
    f = StringIO()
    contents = unicode(pkgutil.get_data("pymc.examples", "data/wells.dat"))
    f.write(contents)
    return f


wells = get_data_file("pymc.examples", "data/wells.dat")

data = pd.read_csv(wells, delimiter=" ", index_col="id", dtype={"switch": np.int8})

Output:

 File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 209, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 509, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 611, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 893, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 441, in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3940)
  File "parser.pyx", line 551, in pandas._parser.TextReader._get_header (pandas/src/parser.c:5096)
pandas._parser.CParserError: Passed header=0 but only 0 lines in file

In the following code, a function named get_data_file is created to obtain data from the package. To perform this operation, it takes two methods named pkg and path, which store the package’s name and the file’s path, respectively.

The returned value of get_data_file is then passed on to the read_csv method to read the data, but as seen in the output, an error is returned of type CParserError with the description Passed header=0 but only 0 lines in file.

the Solution

The solution to this problem can be approached in two different ways based on the requirement.

Approach 1 - For String With io.StringIO

Consider the following code:

import pandas as pd

from io import StringIO

df = pd.read_csv(
    StringIO(
        "id,switch, arsenic, dist, assoc, educ\n"
        "1, 2.36, 16.826000, 0, 0, 0\n"
        "2, 1, 0.71, 47.321999, 0, 0\n"
        "3, 0, 2.07, 20.966999, 0, 10\n"
        "4, 1, 1.15, 21.486000, 0, 12\n"
        "5, 1, 1.10, 40.874001, 1, 14\n"
        "6, 2, 1.22, 23.123131, 1, 23\n"
        "7, 5, 5.33, 29.232322, 0, 38\n"
        "8, 1, 2.38, 90.222221, 1, 10\n"
        "9, 9, 2.01, 10.222334, 0, 0\n"
        "10, 0, 9.12, 20.324252, 0, 10\n"
    ),
    index_col=0,
)

print(df)

Output:

    switch   arsenic       dist   assoc   educ
id
1     2.36    16.826   0.000000       0      0
2     1.00     0.710  47.321999       0      0
3     0.00     2.070  20.966999       0     10
4     1.00     1.150  21.486000       0     12
5     1.00     1.100  40.874001       1     14
6     2.00     1.220  23.123131       1     23
7     5.00     5.330  29.232322       0     38
8     1.00     2.380  90.222221       1     10
9     9.00     2.010  10.222334       0      0
10    0.00     9.120  20.324252       0     10

Using the StringIO module, it is possible to pass a string to the read_csv method to read CSV text data from a string.

The StringIO module helps in executing methods that normally require “file-like” objects; by passing a string to its constructor, it creates a “file-like” object which can then be passed to methods requiring a “file-like” object.

It is important to note that in Python 2.7, the StringIO module was defined as StringIO, while in Python 3+, it has shifted to io.StringIO.

Proper implementation must be implemented to handle the cases where the code is supposed to be flexible and is targeted to run on both Python2 and Python3.

Approach 2 - For Package With io.BytesIO

Consider the following code:

import numpy as np
import pandas as pd
import io
import pkgutil

wells = pkgutil.get_data("pymc.examples", "data/wells.dat")
print("Data Type: ", str(type(wells)))

df = pd.read_csv(
    io.BytesIO(wells),
    encoding="utf8",
    sep=" ",
    index_col="id",
    dtype={"switch": np.int8},
)
print(df.head())

Output:

Data Type: <class 'bytes'>
    switch  arsenic       dist  assoc  educ
id
1        1     2.36  16.826000      0     0
2        1     0.71  47.321999      0     0
3        0     2.07  20.966999      0    10
4        1     1.15  21.486000      0    12
5        1     1.10  40.874001      1    14

In the case of reading CSV data from a package, we can first use the pkgutil.get_data method to fetch the data, then to read the CSV data from the fetched data, we can use the read_csv method.

It is important to note that we must use the io.BytesIO method to pass the data fetched from the pkgutil.get_data method. And to mention the encoding of data, which is utf8 in our case.

Separators are defined using the sep argument, and the index column should be denoted using the index_col argument.

All in all, the read_csv method takes the following arguments in the code mentioned above.

  1. io.BytesIO(wells) corresponds to the data fetched from the desired package and must be passed in bytes form using the io.BytesIO method to convert it into bytes and pass it as a “file-like” object due to the “file-like” object requirement of read_csv.
  2. encoding='utf8' refers to the text encoding of the input, which can be different values based on the type of data passed.
  3. sep=' ' denotes what type of separators the data consists of. It can be different based on how the supposed CSV data was created in the first place.
  4. index_col="id" refers to the name of the index column of data; again, it depends on the data and can vary from file to file.
Salman Mehmood avatar Salman Mehmood avatar

Hello! I am Salman Bin Mehmood(Baum), a software developer and I help organizations, address complex problems. My expertise lies within back-end, data science and machine learning. I am a lifelong learner, currently working on metaverse, and enrolled in a course building an AI application with python. I love solving problems and developing bug-free software for people. I write content related to python and hot Technologies.

LinkedIn

Related Article - Pandas CSV