Pandas is a data analysis library for Python. Many companies and organizations that need high-quality data analysis use it on a large scale.
A data analyst must decide whether to use Pandas based on the type of data. Pandas is highly recommended when the data comes from a SQL table or a spreadsheet, or has heterogeneous columns.
The data can be ordered or unordered, and time-series data is also supported. In this tutorial, let us understand how to mask data in Pandas.
Masking is essentially a way to filter data based on one or more conditions. The mask itself is generally a Boolean object in which each entry is True or False, depending on whether the condition holds for that row.
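To make the idea concrete, here is a minimal sketch on a small, made-up Series. The name scores and its values are assumptions chosen for illustration and are not part of this tutorial's data. Comparing a Series with a value yields a Boolean Series of the same length, and that Boolean Series is the mask.

```python
import pandas as pd

# Hypothetical example data, used only to show what a mask looks like.
scores = pd.Series([12, 7, 25, 3], index=["a", "b", "c", "d"])

# The comparison is evaluated element by element and returns a Boolean Series.
mask = scores > 10
print(mask)

# Indexing with the mask keeps only the entries where the mask is True.
filtered = scores[mask]
print(filtered)
```

Only the entries labelled a and c survive the filter, because those are the positions where the mask is True.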
dates_data to Create a Dummy Dataframe in Pandas
Masking can be understood as an advanced if-else scheme for a data frame. First, however, we will create a dummy data frame named dates_data with a few rows.
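    import pandas as pd

    # 100 timestamps, 30 minutes apart, starting on 2013-01-01
    index = pd.date_range('2013-1-1', periods=100, freq='30min')
    dates_data = pd.DataFrame(data=list(range(100)), columns=['value'], index=index)
    dates_data['value2'] = 'Alpha'
    # Label the first 10 rows 'Beta' by position (iloc avoids chained assignment)
    dates_data.iloc[0:10, dates_data.columns.get_loc('value2')] = 'Beta'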
The code block creates a data frame whose rows are indexed by dates and which has two columns named value and value2. To view the entries in the data, we can print the data frame with print(dates_data), which gives the following output:
                         value value2
    2013-01-01 00:00:00      0   Beta
    2013-01-01 00:30:00      1   Beta
    2013-01-01 01:00:00      2   Beta
    2013-01-01 01:30:00      3   Beta
    2013-01-01 02:00:00      4   Beta
    ...                    ...    ...
    2013-01-02 23:30:00     95  Alpha
    2013-01-03 00:00:00     96  Alpha
    2013-01-03 00:30:00     97  Alpha
    2013-01-03 01:00:00     98  Alpha
    2013-01-03 01:30:00     99  Alpha
As we can see, we have 100 entries, with timestamps spaced at equal intervals of 30 minutes.
The two columns named value and value2 hold different kinds of data: value contains numbers, and value2 contains either Alpha or Beta.
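These properties can be checked directly. The following is a minimal sketch that rebuilds the data frame so that it runs on its own; the names step and counts are introduced here for illustration only.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this snippet is self-contained.
index = pd.date_range("2013-01-01", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": range(100)}, index=index)
dates_data["value2"] = "Alpha"
dates_data.iloc[:10, dates_data.columns.get_loc("value2")] = "Beta"

# Number of rows and columns.
print(dates_data.shape)

# Gaps between consecutive timestamps; one unique gap means even spacing.
step = dates_data.index.to_series().diff().dropna().unique()
print(step)

# How many rows carry each label in the value2 column.
counts = dates_data["value2"].value_counts()
print(counts)
```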
Masking to Filter Data in Pandas
Masking is an advanced concept in Pandas where the analyst tries to filter data based on a particular condition.
It is possible to filter this data based on one or more conditions. We will explore each of these cases in detail here.
Let us begin with a single condition: we fetch only those entries from our data frame where the value2 column equals Beta.
    mask = dates_data['value2'] == 'Beta'
    print(dates_data[mask])
                         value value2
    2013-01-01 00:00:00      0   Beta
    2013-01-01 00:30:00      1   Beta
    2013-01-01 01:00:00      2   Beta
    2013-01-01 01:30:00      3   Beta
    2013-01-01 02:00:00      4   Beta
    2013-01-01 02:30:00      5   Beta
    2013-01-01 03:00:00      6   Beta
    2013-01-01 03:30:00      7   Beta
    2013-01-01 04:00:00      8   Beta
    2013-01-01 04:30:00      9   Beta
We have entries related to only the Beta values in the value2 column of the dates_data data frame.
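It can help to look at the mask itself before applying it. The sketch below rebuilds the same data frame so that it runs on its own, then counts the rows the mask selects and inverts the mask with the ~ operator. The name not_beta is an assumption introduced for illustration.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this snippet is self-contained.
index = pd.date_range("2013-01-01", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": range(100)}, index=index)
dates_data["value2"] = "Alpha"
dates_data.iloc[:10, dates_data.columns.get_loc("value2")] = "Beta"

mask = dates_data["value2"] == "Beta"

# True counts as 1, so the sum is the number of rows the mask selects.
print(mask.sum())

# ~ flips every entry of the mask, selecting the rows that are NOT Beta.
not_beta = dates_data[~mask]
print(len(not_beta))
```

The mask has one entry per row of the data frame, so the selected and rejected rows together always add up to the full 100 rows.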
In this way, we can create a mask and then superimpose that mask on our data to filter data. This mask can also be understood as a stencil to filter out certain data.
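The stencil picture, and the if-else reading of masking, can be taken literally with the Series.where method. This is a sketch of a related Pandas method rather than the indexing approach used in this tutorial, and the names stenciled and flagged are illustrative. Instead of dropping rows, where keeps the original length and replaces the values at positions where the mask is False.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this snippet is self-contained.
index = pd.date_range("2013-01-01", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": range(100)}, index=index)
dates_data["value2"] = "Alpha"
dates_data.iloc[:10, dates_data.columns.get_loc("value2")] = "Beta"

mask = dates_data["value2"] == "Beta"

# Keep values where the mask is True; everything else becomes NaN.
stenciled = dates_data["value"].where(mask)
print(stenciled.head(12))

# The `other` argument acts as the "else" branch: -1 where the mask is False.
flagged = dates_data["value"].where(mask, other=-1)
print(flagged.head(12))
```

Indexing with dates_data[mask] is the better fit when the unmatched rows should disappear, while where is useful when the result must stay aligned with the original rows.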
Next, we will filter on two conditions at once: a certain range of values from the value column and only the Beta value from the value2 column of the dates_data data frame.
    mask = (dates_data['value2'] == 'Beta') & (dates_data['value'] > 3)
    print(dates_data[mask])
                         value value2
    2013-01-01 02:00:00      4   Beta
    2013-01-01 02:30:00      5   Beta
    2013-01-01 03:00:00      6   Beta
    2013-01-01 03:30:00      7   Beta
    2013-01-01 04:00:00      8   Beta
    2013-01-01 04:30:00      9   Beta
As we can see in the output above, we have successfully filtered the data such that we have only values greater than 3 in the value column and only the value Beta in the value2 column.
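Conditions can be combined in other ways as well. The sketch below, which rebuilds the data frame so that it runs on its own, shows the "or" operator along with the between and isin methods; the names is_beta, either, in_range and picked are assumptions introduced for illustration. Note that the operators & and | bind more tightly than comparisons such as == and >, so each comparison needs its own parentheses, and that Python's and/or keywords do not work element by element on a Series.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this snippet is self-contained.
index = pd.date_range("2013-01-01", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": range(100)}, index=index)
dates_data["value2"] = "Alpha"
dates_data.iloc[:10, dates_data.columns.get_loc("value2")] = "Beta"

is_beta = dates_data["value2"] == "Beta"

# | means "or": rows that are Beta OR have a value of at least 98.
either = dates_data[is_beta | (dates_data["value"] >= 98)]
print(either)

# between() is inclusive on both ends by default.
in_range = dates_data[is_beta & dates_data["value"].between(4, 9)]
print(in_range)

# isin() tests membership in a list of allowed values.
picked = dates_data[dates_data["value"].isin([0, 50, 99])]
print(picked)
```

Naming an intermediate mask such as is_beta keeps longer conditions readable and lets the same mask be reused in several filters.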
Therefore, with the help of the masking technique in Pandas, we can efficiently filter data according to our requirements, based on one or more conditions.
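Because the dummy data is indexed by timestamps, the same technique also applies to the index itself. The following is a minimal sketch, assuming the data frame built earlier in this tutorial; the cut-off date and the names from_day_two, second_day and beta_values are illustrative choices. It also shows that a mask can be passed to loc together with a column label.

```python
import pandas as pd

# Rebuild the tutorial's data frame so this snippet is self-contained.
index = pd.date_range("2013-01-01", periods=100, freq="30min")
dates_data = pd.DataFrame({"value": range(100)}, index=index)
dates_data["value2"] = "Alpha"
dates_data.iloc[:10, dates_data.columns.get_loc("value2")] = "Beta"

# Comparing the index with a timestamp gives a Boolean array, one entry per row.
from_day_two = dates_data.index >= pd.Timestamp("2013-01-02")
second_day = dates_data.loc[from_day_two]
print(second_day.head())

# loc accepts a mask for the rows and a label for the column in one step.
beta_values = dates_data.loc[dates_data["value2"] == "Beta", "value"]
print(beta_values)
```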