- What is Pandas
- How to Read a Single .csv File Using Pandas
- Read Multiple CSV Files in Python
- Concatenate Multiple DataFrames in Python
This tutorial shows how to read multiple .csv files and concatenate the resulting DataFrames into one.
We will use Pandas to read the data files and to create and combine the DataFrames.
What is Pandas
Pandas is a Python package that comes with a wide array of functions for reading a variety of data files and for manipulating data.
To install the pandas package on your machine, open the Command Prompt/Terminal and run pip install pandas.
How to Read a Single .csv File Using Pandas
The pandas package provides the read_csv() function to read a .csv file.
>>> import pandas as pd
>>> df = pd.read_csv(filepath_or_buffer)
Given the file path, read_csv() reads the data file and returns a DataFrame object.
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
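As a minimal sketch of the call above, the snippet below reads a small CSV from an in-memory buffer, so it runs without a file on disk. The column names and values are made up for illustration; a file path can be passed to read_csv() in the same position.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a .csv file.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv() accepts a file path or any file-like buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the header row
```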
Read Multiple CSV Files in Python
There's no single function in the pandas module that reads multiple files at once. However, we can put together a simple method for doing so.
Firstly, we need to have the path of all the data files. It will be easy if all the files are situated in one particular folder.
We start by creating a list that stores the paths and names of all the files.
>>> import pandas as pd
>>> import glob
>>> import os
>>> # This is a raw string containing the path of files
>>> path = r'D:\csv files'
>>> all_files = glob.glob(os.path.join(path, '*.csv'))
>>> all_files
['D:\\csv files\\FILE_1.csv', 'D:\\csv files\\FILE_2.csv']
In the above code, a list is created containing the file paths.
We use the glob module to find files or pathnames matching a pattern. glob follows standard Unix path expansion rules to match patterns.
There's no need to install this module separately because glob is part of the Python standard library. The glob2 package that can be installed with pip install glob2 is a separate third-party project and is not required for anything in this tutorial.
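To retrieve paths, optionally descending recursively into subdirectories, we can use the glob module's functions:
glob.glob(pathname, *, recursive=False)
glob.iglob(pathname, *, recursive=False)
glob.glob() returns a list containing the paths of all matching files, while glob.iglob() returns an iterator that yields the same paths one at a time. The search descends into subdirectories only when recursive=True is passed and the pattern contains **.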
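The difference between a plain and a recursive search can be sketched as follows. The directory tree and file names are created in a temporary folder purely for illustration; the only assumption is that the `**` pattern is combined with `recursive=True`, which is what makes glob descend into subdirectories.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.csv", os.path.join("sub", "b.csv"), "notes.txt"):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("x\n1\n")

    # Non-recursive: only matches files directly inside root.
    top_level = glob.glob(os.path.join(root, "*.csv"))

    # Recursive: "**" matches zero or more nested directories,
    # but only when recursive=True is passed.
    pattern = os.path.join(root, "**", "*.csv")
    recursive = sorted(glob.glob(pattern, recursive=True))

    # iglob() yields the same matches lazily instead of building a list.
    lazy = sorted(glob.iglob(pattern, recursive=True))

    top_names = [os.path.basename(p) for p in top_level]
    recursive_names = [os.path.basename(p) for p in recursive]

print(top_names)
print(recursive_names)
```

glob.iglob() is mainly useful when a directory tree is large and the paths are processed one at a time.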
For example, to retrieve all file names from a given path, use the asterisk symbol * at the end of the path, passing it as a string to the glob.glob() function.
>>> for files in glob.glob(r'D:\csv files\*'):
...     print(files)
...
D:\csv files\FILE_1.csv
D:\csv files\FILE_2.csv
D:\csv files\textFile1.txt
D:\csv files\textFile2.txt
Moreover, specify the file extension after the asterisk symbol to perform a more focused search.
>>> for files in glob.glob(r'D:\csv files\*.csv'):
...     print(files)
...
D:\csv files\FILE_1.csv
D:\csv files\FILE_2.csv
What are Raw Strings
In Python, a raw string is formed by prefixing a string literal with r or R. In a raw string, the backslash (\) is treated as a literal character.
This is useful when we want a string with a backslash but don’t want it to be considered an escape character.
To represent special characters such as tabs and newlines in a normal string, we use the backslash (\) to signify the start of an escape sequence.
>>> print("This\tis\nnormal\tstring")
This    is
normal  string
However, raw strings treat the backslash (\) as a literal character. For example:
>>> print(r"This\tis\nnormal\tstring")
This\tis\nnormal\tstring
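Raw strings matter most for Windows-style paths, where a folder name can accidentally start an escape sequence. The short sketch below uses a hypothetical path, `D:\new\table.csv`, chosen only because `\n` and `\t` happen to be escapes; it compares a normal literal, a raw literal, and a literal with doubled backslashes.

```python
# In a normal literal, \n and \t are interpreted as newline and tab.
normal = "D:\new\table.csv"

# The r prefix keeps every backslash as a literal character.
raw = r"D:\new\table.csv"

# Doubling each backslash is the equivalent spelling without the r prefix.
escaped = "D:\\new\\table.csv"

print(len(normal), len(raw))  # the normal string is shorter: two escapes collapsed
print(raw == escaped)
print("\n" in normal, "\n" in raw)
```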
The os module contains functions for interacting with the operating system.
os is part of Python's standard library.
This module offers a portable way of using functionality that depends on the operating system. Python's os.path module, a sub-module of the os module, is used to manipulate pathnames.
The os.path.join() function intelligently joins one or more path components. It concatenates the components, placing exactly one directory separator ("/" on Unix-like systems, "\" on Windows) after each non-empty component except the last.
If the final path component is empty, the result ends with a directory separator.
If a path component is an absolute path, all previous components are discarded and joining continues from that absolute component.
To merge different path components, use the os.path.join() function.
import os
path = 'Users'
os.path.join(path, 'Desktop', 'data.csv')
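The joining rules can be checked with a small sketch. The component names (`Users`, `Desktop`, `data`, `file.csv`) are placeholders, and the separator is read from `os.sep` so the same code behaves consistently on Unix-like systems and on Windows.

```python
import os

sep = os.sep  # "/" on Unix-like systems, "\\" on Windows

# Components are joined with exactly one separator between them.
joined = os.path.join("Users", "Desktop", "data.csv")

# An empty final component leaves a trailing separator.
trailing = os.path.join("Users", "Desktop", "")

# An absolute component discards everything that came before it.
absolute = sep + "data"
reset = os.path.join("Users", absolute, "file.csv")

print(joined)
print(trailing)
print(reset)
```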
Concatenate Multiple DataFrames in Python
Next, use the paths returned by the glob.glob() function to read the data and create dataframes. Each Pandas dataframe object is appended to a list.
dataframes = list()
for dfs in all_files:
    data = pd.read_csv(dfs)
    dataframes.append(data)
A list of dataframes is created.
>>> dataframes
[dataframe1, dataframe2]
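Finally, we concatenate the dataframes.
Note: The dataframes should all have the same columns. pd.concat() aligns columns by name, so a column that is missing from one dataframe is filled with NaN for that dataframe's rows.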
pd.concat(dataframes, ignore_index = True)
The pandas.concat() function does the heavy lifting of concatenating Pandas objects along one axis, and can optionally apply set logic (union or intersection) to the indexes on the other axis. Passing ignore_index=True discards the original row labels and numbers the rows of the result from 0 to n-1.
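To illustrate how pd.concat() treats row labels and columns, here is a small sketch with two made-up dataframes whose columns differ. The names `first`, `second`, `a`, and `b` are hypothetical; `join="inner"` is the pandas option that keeps only the columns common to every dataframe, while the default (`join="outer"`) keeps the union.

```python
import pandas as pd

# Two small, made-up frames; the second one lacks column "b".
first = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
second = pd.DataFrame({"a": [5]})

# Default: original row labels are kept, so label 0 appears twice.
kept = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([first, second], ignore_index=True)

# join="inner" keeps only the columns shared by all frames.
common = pd.concat([first, second], join="inner", ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
print(renumbered["b"].isna().tolist())  # missing values appear as NaN
print(list(common.columns))
```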
## importing the required modules
import pandas as pd
import os
import glob

## Path of the files
path = r'D:\csv files'

## joining the path and creating list of paths
all_files = glob.glob(os.path.join(path, '*.csv'))

dataframes = list()

## reading the data and appending the dataframe
for dfs in all_files:
    data = pd.read_csv(dfs)
    dataframes.append(data)

## Concatenating the dataframes
df = pd.concat(dataframes, ignore_index = True)
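For readers without a folder of .csv files at hand, the same workflow can be tried end to end with a self-contained sketch. It writes two small files with made-up contents into a temporary directory; sorting the paths is an added assumption here, used because glob returns matches in arbitrary order and the row order of the result follows the file order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as path:
    # Write two small CSV files with the same columns (made-up data).
    samples = (
        ("FILE_1.csv", "id,value\n1,10\n2,20\n"),
        ("FILE_2.csv", "id,value\n3,30\n"),
    )
    for name, rows in samples:
        with open(os.path.join(path, name), "w") as fh:
            fh.write(rows)

    # sorted() makes the row order reproducible across file systems.
    all_files = sorted(glob.glob(os.path.join(path, "*.csv")))

    # Read every file, then concatenate the dataframes in one call.
    dataframes = [pd.read_csv(f) for f in all_files]
    df = pd.concat(dataframes, ignore_index=True)

print(df)
```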