Python Apriori Algorithm

  1. Explanation of the Apriori Algorithm
  2. Apriori Algorithm in Python
  3. Implement the Apriori Algorithm in Python

This tutorial discusses how to implement the Apriori algorithm in Python.

Explanation of the Apriori Algorithm

The Apriori Algorithm is widely used for market basket analysis, i.e., to analyze which items are sold together with which other items. It is useful for shop owners who want to increase their sales by placing items that sell together close to each other or by offering discounts on them.

The algorithm relies on the property that if an itemset is frequent, all of its non-empty subsets must also be frequent. Let’s look at a small example to help illustrate this notion.

Let’s say that in our store, milk, butter, and bread are frequently sold together. This implies that the pairs {milk, butter}, {milk, bread}, and {butter, bread} are also frequently sold together.

The Apriori Algorithm also states that the frequency of an itemset can never exceed the frequency of its non-empty subsets. We can further illustrate this by expanding a little more on our previous example.

In our store, milk, butter, and bread are sold together 3 times. This implies that each of its non-empty subsets, such as {milk, butter}, {milk, bread}, and {butter, bread}, is sold together at least 3 times.
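The property can be checked directly on a handful of transactions. The sketch below uses a made-up list of baskets (the data and the helper name `count_containing` are illustrative assumptions, not part of the dataset used later) and counts how often the full itemset and each of its non-empty proper subsets occur.

```python
from itertools import combinations

# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"milk", "butter", "bread"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "bread"},
    {"butter", "jam"},
]


def count_containing(itemset, transactions):
    """Count the transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


full = frozenset({"milk", "butter", "bread"})
full_count = count_containing(full, baskets)

# Count every non-empty proper subset of the full itemset.
subset_counts = {
    frozenset(c): count_containing(frozenset(c), baskets)
    for size in range(1, len(full))
    for c in combinations(sorted(full), size)
}

print("full itemset:", full_count)
for subset, count in subset_counts.items():
    # Each subset occurs at least as often as the full itemset.
    print(sorted(subset), count)
```

No subset count can fall below the count of the full itemset, because every basket that contains all three items also contains each of their subsets.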

Apriori Algorithm in Python

Before implementing this algorithm, we need to understand how the apriori algorithm works.

At the start of the algorithm, we specify the support threshold, i.e., the minimum support an itemset must have to count as frequent. The support of an item is just the probability that the item occurs in a transaction.

$$ \text{Support}(A) = \frac{\text{Number of transactions containing item } A}{\text{Total number of transactions}} $$
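As a quick illustration of the formula, the following sketch computes support over a small hand-made list of transactions. The function name `support` and the sample data are assumptions made for this example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical toy transactions, invented for illustration only.
transactions = [
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread"],
    ["milk", "bread", "butter"],
]

milk_support = support(["milk"], transactions)           # 3 of 4 transactions
pair_support = support(["milk", "bread"], transactions)  # 2 of 4 transactions
print(milk_support, pair_support)
```

The same function works for a single item or a larger itemset, since a single item is just an itemset of size 1.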

Apart from support, there are other measures like confidence and lift, but we don’t need to worry about those in this tutorial.

The steps we need to follow to implement the apriori algorithm are listed below.

  1. The algorithm starts with 1-itemsets. Here, 1 is the number of items in each itemset.
  2. It removes all the itemsets that do not meet the minimum support requirement.
  3. It then increases the number of items (k) in the itemsets, builds larger candidates from the itemsets that survived, and repeats step 2 until the specified k is reached or no itemsets meet the minimum support requirement.
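The steps above can be sketched in plain Python as a level-wise search. This is a minimal illustration under simplifying assumptions (small in-memory data, supports recomputed by scanning every transaction), not the implementation used by any particular library. The function name `frequent_itemsets` and the toy data are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_k=None):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step 1: start with candidate 1-itemsets.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates and (max_k is None or k <= max_k):
        # Step 2: keep only the candidates that meet the minimum support.
        level = {}
        for candidate in candidates:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)

        # Step 3: build (k+1)-item candidates from the surviving k-itemsets,
        # keeping only those whose k-item subsets are all frequent.
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


# Hypothetical toy data, invented for illustration only.
toy = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]
result = frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With this toy data, all three single items and all three pairs pass the 0.5 threshold, while the triple appears in only 2 of 5 transactions and is dropped, so the search stops at k = 3.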

Implement the Apriori Algorithm in Python

To implement the Apriori Algorithm, we will be using the apyori module of Python. It is an external module, and hence we need to install it separately.

The pip command to install the apyori module is below.

pip install apyori

We’ll be using the Market Basket Optimization dataset from Kaggle.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori

We have imported all the libraries required for our operations in the code given above. Now, we need to read the dataset using pandas.

This has been implemented in the following code snippet.

market_data = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

Now, let’s check the total number of transactions in our dataset.

len(market_data)

Output:

7501

The output shows that we have 7501 records in our dataset. There is just one small problem with this data: the transactions are of variable length.

Given the real-world scenarios, this makes a lot of sense.

To run the Apriori algorithm on this data, we convert these variable-length transactions into equal-length lists of strings. This has been implemented in the following code snippet.

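# Build one list of strings per transaction (row); missing cells become the string 'nan'.
n_columns = market_data.shape[1]  # 20 for this dataset
transacts = []
for i in range(len(market_data)):
    transacts.append([str(market_data.values[i, j]) for j in range(n_columns)])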

In the above code, we initialized the list transacts and stored our transactions, each of length 20, in it. Transactions with fewer than 20 items are padded with missing values, which str() turns into the string 'nan'.

apyori treats 'nan' like any other item. Because it appears in almost every transaction, any rule involving it has a lift close to 1, so such rules are filtered out by the lift threshold we set below.
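To see what the padding looks like after conversion, the sketch below mimics one padded row without pandas. `float("nan")` stands in for the missing cells, and the row itself is made up. The second half shows one possible alternative that drops the padding instead of keeping it. The helper name `is_missing` is hypothetical.

```python
import math

# A hypothetical row as it might look after loading: two items, then padding.
row = ["milk", "bread", float("nan"), float("nan")]

# Converting every cell with str() turns the padding into the literal text "nan".
as_strings = [str(cell) for cell in row]
print(as_strings)


def is_missing(cell):
    """True for floating-point NaN cells, False for ordinary item names."""
    return isinstance(cell, float) and math.isnan(cell)


# Alternative: skip missing cells before converting, so no "nan" item is created.
without_padding = [str(cell) for cell in row if not is_missing(cell)]
print(without_padding)
```

Dropping the padding leaves transactions of different lengths, which is fine for any routine that only needs each transaction as a collection of items.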

We now generate association rules from our data with the apriori() function. This is demonstrated in the following code block.

rules = apriori(transactions = transacts, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)

We passed the minimum support, confidence, and lift thresholds to the function. We also set the minimum and the maximum number of items in an itemset to 2, i.e., we want to generate pairs of items that were frequently sold together.
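For readers who want to see what the confidence and lift thresholds act on, here is a small sketch using the commonly used definitions: confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B). The data and helper names are made up for illustration. This is not apyori's internal code, only the same quantities worked out by hand.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(lhs, rhs, transactions):
    """How often `rhs` appears among the transactions that contain `lhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence of lhs -> rhs relative to how common `rhs` is overall."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


# Hypothetical toy transactions, invented for illustration only.
toy = [
    ["pasta", "shrimp"],
    ["pasta", "shrimp"],
    ["pasta"],
    ["milk"],
    ["milk", "bread"],
    ["bread"],
    ["milk"],
    ["bread"],
]

conf_ps = confidence(["pasta"], ["shrimp"], toy)  # (2/8) / (3/8)
lift_ps = lift(["pasta"], ["shrimp"], toy)        # conf_ps / (2/8)
print(round(conf_ps, 3), round(lift_ps, 3))
```

A lift above 1 means the two items appear together more often than they would if they were sold independently, which is why a threshold such as 3 keeps only strongly associated pairs.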

The association rules produced by the apriori algorithm are returned as a generator object, which we stored in rules. We now need a mechanism to convert these rules into a pandas dataframe.

The following code snippet shows a function inspect() that takes the results produced by our apriori() call and converts them into a form that can be loaded into a pandas dataframe.

def inspect(output):
    # Each result is (items, support, ordered_statistics); we read the first rule of each.
    Left_Hand_Side = [tuple(result[2][0][0])[0] for result in output]   # antecedent item
    support = [result[1] for result in output]
    confidence = [result[2][0][2] for result in output]
    lift = [result[2][0][3] for result in output]
    Right_Hand_Side = [tuple(result[2][0][1])[0] for result in output]  # consequent item
    return list(zip(Left_Hand_Side, support, confidence, lift, Right_Hand_Side))

output = list(rules)
output_data = pd.DataFrame(inspect(output), columns = ['Left_Hand_Side', 'Support', 'Confidence', 'Lift', 'Right_Hand_Side'])
print(output_data)

Output:

         Left_Hand_Side   Support  Confidence      Lift Right_Hand_Side
0           light cream  0.004533    0.290598  4.843951         chicken
1  mushroom cream sauce  0.005733    0.300699  3.790833        escalope
2                 pasta  0.005866    0.372881  4.700812        escalope
3         fromage blanc  0.003333    0.245098  5.164271           honey
4         herb & pepper  0.015998    0.323450  3.291994     ground beef
5          tomato sauce  0.005333    0.377358  3.840659     ground beef
6           light cream  0.003200    0.205128  3.114710       olive oil
7     whole wheat pasta  0.007999    0.271493  4.122410       olive oil
8                 pasta  0.005066    0.322034  4.506672          shrimp
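Note that inspect() reads only the first entry of each result's rule list (`result[2][0]`). A result can hold more than one rule, for example A → B and B → A. The sketch below flattens every rule and uses attribute names rather than positions. To keep it runnable without the dataset, it defines stand-in namedtuples that mirror the record layout implied by the indexing above. The field names (`items`, `support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, `lift`) are an assumption to verify against the installed apyori version, and the numbers are made up.

```python
from collections import namedtuple

# Stand-ins mirroring the layout of the result records (assumed field names).
OrderedStatistic = namedtuple("OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple("RelationRecord", "items support ordered_statistics")


def flatten_rules(records):
    """Return one row per rule: (lhs, rhs, support, confidence, lift)."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append((
                ", ".join(sorted(stat.items_base)),
                ", ".join(sorted(stat.items_add)),
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows


# Hypothetical record with two rules for the same item pair (made-up numbers).
sample = [
    RelationRecord(
        items=frozenset({"pasta", "shrimp"}),
        support=0.005,
        ordered_statistics=[
            OrderedStatistic(frozenset({"pasta"}), frozenset({"shrimp"}), 0.32, 4.5),
            OrderedStatistic(frozenset({"shrimp"}), frozenset({"pasta"}), 0.07, 4.5),
        ],
    )
]

rows = flatten_rules(sample)
for row in rows:
    print(row)
```

The resulting rows can be passed to pd.DataFrame() with five column names, in the same way as the output of inspect().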

We can now sort this dataframe by lift and display the top 5 rules with the following code.

print(output_data.nlargest(n = 5, columns = 'Lift'))

Output:

      Left_Hand_Side   Support  Confidence      Lift Right_Hand_Side
3      fromage blanc  0.003333    0.245098  5.164271           honey
0        light cream  0.004533    0.290598  4.843951         chicken
2              pasta  0.005866    0.372881  4.700812        escalope
8              pasta  0.005066    0.322034  4.506672          shrimp
7  whole wheat pasta  0.007999    0.271493  4.122410       olive oil

Apriori is a very basic and simple algorithm for market basket analysis. It can provide helpful insights to increase sales of items in a market or a store.

The main disadvantage of this algorithm is that it needs a lot of memory for large datasets. This is because it generates a large number of candidate combinations of frequent items.
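To get a rough sense of why memory grows so quickly: with n frequent single items, there are up to C(n, k) possible k-itemsets to consider at level k. This is an upper bound, since pruning removes many of them. The value of n below is a hypothetical figure, not a property of the dataset used here.

```python
from math import comb

n_frequent_items = 100  # hypothetical number of frequent single items

# Upper bound on the number of candidate itemsets at each level k.
candidate_counts = {k: comb(n_frequent_items, k) for k in range(2, 5)}

for k, count in candidate_counts.items():
    print(f"k={k}: up to {count} candidates")
```

Even with only 100 frequent items, the bound grows from a few thousand pairs to several million 4-itemsets, which is why limiting the maximum itemset length or raising the support threshold helps on large datasets.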

We also experienced this limitation as this tutorial was meant to work with the UCI online retail data set, but due to memory limitations, we had to change our dataset to market basket optimization.

Muhammad Maisam Abbas

Maisam is a highly skilled and motivated Data Scientist. He has over 4 years of experience with the Python programming language. He loves solving complex problems and sharing his results on the internet.