Market basket analysis is a technique used by businesses to identify associations between products or services that are frequently purchased together. It is a type of data mining that involves analyzing customer transaction data, such as point-of-sale records or e-commerce shopping carts, to find patterns in customer buying behavior.
The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find itemsets that frequently occur together across transactions.
For this analysis we are going to use the Apriori algorithm. The Apriori algorithm works by first identifying all itemsets that have a support greater than or equal to a specified threshold. The support of an itemset is the proportion of transactions in which the itemset appears.
Once the itemsets with sufficient support have been identified, the algorithm generates new candidate itemsets by combining them with other frequent itemsets. This process is repeated until no more itemsets with sufficient support can be generated.
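To make the level-wise search concrete, here is a minimal sketch of one Apriori pass; the transactions, min_support value and helper function are invented for illustration, and the actual analysis below relies on mlxtend's implementation:
# toy transactions, invented for illustration
transactions = [{'MILK', 'BREAD'}, {'BREAD', 'COFFEE'}, {'MILK', 'BREAD', 'COFFEE'}]
min_support = 0.5
def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)
# level 1: frequent single items
items = {item for t in transactions for item in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
# level 2: combine frequent itemsets into larger candidates, keep the frequent ones
candidates = {a | b for a in frequent for b in frequent if len(a | b) == 2}
print([c for c in candidates if support(c) >= min_support])
# e.g. [frozenset({'MILK', 'BREAD'}), frozenset({'BREAD', 'COFFEE'})]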
Market basket analysis typically aims to:
- Identify frequent itemsets in a transaction database.
- Determine the association rules between items, including which items tend to be purchased together and which items tend to be purchased separately.
- Determine the minimum support and minimum confidence levels needed for the association rules to be considered significant.
- Identify which items should be placed near each other in a store or online store to encourage customers to purchase related items.
We want to calculate the support, confidence, and lift for the association rule {A, B} => {C}, which means "if a customer buys items A and B together, they are likely to buy item C as well."
Support: The support measures the frequency of occurrence of a particular item set in the transaction dataset. Support ({A, B}) = Number of transactions containing {A, B} / Total number of transactions
Confidence: The confidence measures the probability that item C is purchased given that items A and B are purchased together. Confidence ({A, B} => {C}) = Support ({A, B, C}) / Support ({A, B})
Lift: The lift measures the strength of the association between item sets. A lift value greater than 1 indicates a positive association between the item sets, while a value less than 1 indicates a negative association. Lift ({A, B} => {C}) = Support ({A, B, C}) / (Support ({A, B}) x Support ({C}))
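As a quick sanity check, the three formulas can be verified by hand; the counts below are invented for illustration:
# 10 hypothetical transactions: 4 contain {A, B}, 3 of those also contain C, 5 contain C
n_total, n_ab, n_abc, n_c = 10, 4, 3, 5
support_ab = n_ab / n_total                     # 0.4
support_abc = n_abc / n_total                   # 0.3
support_c = n_c / n_total                       # 0.5
confidence = support_abc / support_ab           # 0.75
lift = support_abc / (support_ab * support_c)   # 1.5 -> positive association
print(support_ab, confidence, lift)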
# for basic operations
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# for visualizations
import matplotlib.pyplot as plt
import seaborn as sb
import squarify
# for market basket analysis
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
# display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# read the data
store = pd.read_csv('~/MarketBasketAnalysis/GroceryStoreDataSet.csv', names=['product'], header=None)
# check the shape of the dataset
print(store.shape)
(50, 1)
# statistical description of the dataset
print(store.describe())

|  | product |
|---|---|
| count | 50 |
| unique | 47 |
| top | BREAD,COFFEE,SUGAR |
| freq | 2 |
# show the first few rows
print(store.head())

|  | product |
|---|---|
| 0 | MILK,BREAD,BISCUIT |
| 1 | PIZZA |
| 2 | PIZZA,WATER,BEER |
# check random entries in the dataset
print(store.sample(3))

|  | product |
|---|---|
| 8 | SALT,JUICE,COFFEE |
| 45 | BREAD,SUGAR,BOURNVITA |
| 15 | SHRIMP,EGGS,AVOCADO,BREAD |
# dataset prep: split the comma-separated column into several columns
store_df = store['product'].str.split(",", expand=True)
# show the first few rows
print(store_df.head())

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | MILK | BREAD | BISCUIT | None | None |
| 1 | PIZZA | None | None | None | None |
| 2 | PIZZA | WATER | BEER | None | None |
# statistical description of the split dataset
print(store_df.describe())

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| count | 50 | 49 | 45 | 16 | 1 |
| unique | 20 | 18 | 15 | 10 | 1 |
| top | BREAD | BREAD | BISCUIT | CORNFLAKES | CORNFLAKES |
| freq | 9 | 6 | 8 | 4 | 1 |
# frequency of the most popular items (based on the first item listed in each basket)
plt.rcParams['figure.figsize'] = (18, 7)
color = plt.cm.copper(np.linspace(0, 1, 40))
store_df[0].value_counts().head(40).plot.bar(color=color)
plt.title('Frequency of the most popular items', fontsize=20)
plt.xticks(rotation=90)
plt.grid()
plt.show()
# treemap of the same item counts
y = store_df[0].value_counts().head(50).to_frame()
plt.rcParams['figure.figsize'] = (20, 20)
squarify.plot(sizes=y.values, label=y.index,
              alpha=0.8,
              color=sb.color_palette("magma"),
              ec='white')
plt.title('Tree Map for Popular Items')
plt.axis('off')
plt.show()
Result: bread and coffee are the most frequent items on the list.
The Apriori step requires all the items of a transaction to be in one row, with each item one-hot encoded.
# create a list of transactions
store_list = list(store['product'].apply(lambda x: x.split(",")))
[['MILK', 'BREAD', 'BISCUIT'], ['PIZZA'], ['PIZZA', 'WATER', 'BEER'], ...]
# one transaction per row, with each product one-hot encoded
te = TransactionEncoder()
store_ap = te.fit(store_list).transform(store_list)
store_ap = pd.DataFrame(store_ap, columns=te.columns_)
print(store_ap.head())

|  | APPLE | AVOCADO | BANANA | BEER | ... |
|---|---|---|---|---|---|
| 0 | False | False | False | False | ... |
| 1 | False | False | False | False | ... |
| 2 | False | False | False | True | ... |
print(store_ap.shape)
(50, 29)
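As an aside, the same boolean table can be produced with plain pandas instead of TransactionEncoder; this alternative (a sketch, with store_ap_alt as a made-up name) should yield an equivalent 50 x 29 frame:
# alternative: one-hot encode directly from the comma-separated strings
store_ap_alt = store['product'].str.get_dummies(sep=',').astype(bool)
print(store_ap_alt.shape)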
The algorithm performs a level-wise search for frequent itemsets: every itemset with a support of at least the min_support value (0.07 here) is generated.
# min_support: the minimum relative frequency at which an itemset must appear to be kept
frequent_itemsets = apriori(store_ap, min_support=0.07, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
print(frequent_itemsets)
|  | support | itemsets | length |
|---|---|---|---|
| 0 | 0.12 | (AVOCADO) | 1 |
| 1 | 0.22 | (BISCUIT) | 1 |
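The length column added above makes filtering straightforward; for example, a sketch that keeps only the multi-item sets:
# keep only the itemsets containing at least two items
print(frequent_itemsets[frequent_itemsets['length'] >= 2])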
Typically, support is used to measure abundance or frequency (often interpreted as significance or importance): an itemset is called a "frequent itemset" when its support exceeds the specified minimum-support threshold. Next, we will generate the rules with their corresponding support, confidence and lift. Lift is the ratio of the observed support to the support expected if antecedent and consequent were independent, and confidence is a measure of the reliability of the rule.
min_support is a floating-point value between 0 and 1 indicating the minimum support required for an itemset to be selected: the number of transactions containing the itemset divided by the total number of transactions.
The antecedent is the set of items used to predict or recommend another set of items in a customer's transaction history; the consequent is the item or items being predicted or recommended based on the presence of the antecedent. In short, the antecedent refers to the items already bought, and the consequent refers to the likely additional purchase. Antecedent support is the frequency of the antecedent, and consequent support is the frequency of the consequent.
The metric can be set to confidence, lift, support, leverage or conviction:
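For completeness, the two metrics not defined earlier follow the standard definitions (also the ones mlxtend documents, to the best of my knowledge): Leverage ({A} => {C}) = Support ({A, C}) - Support ({A}) x Support ({C}), which is 0 when A and C are independent, and Conviction ({A} => {C}) = (1 - Support ({C})) / (1 - Confidence ({A} => {C})), which grows the less often the rule is violated.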
# lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.3)
rules["antecedents_length"] = rules["antecedents"].apply(lambda x: len(x))
rules["consequents_length"] = rules["consequents"].apply(lambda x: len(x))
rules.sort_values("lift")  # returns a sorted copy; assign it back to keep the order
print(rules.head())
|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | ... |
|---|---|---|---|---|---|---|---|---|
| 0 | (AVOCADO) | (BREAD) | 0.12 | 0.44 | 0.08 | 0.666667 | 1.515152 | ... |
| 1 | (BREAD) | (AVOCADO) | 0.44 | 0.12 | 0.08 | 0.181818 | 1.515152 | ... |
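Reading the first row: a lift of about 1.52 means a basket containing avocado is roughly 1.5 times as likely to also contain bread as a randomly chosen basket. Note that lift is symmetric, which is why both directions of the avocado/bread rule share the same value even though their confidences differ.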
# confidence
rules2 = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
rules2["antecedents_length"] = rules2["antecedents"].apply(lambda x: len(x))
rules2["consequents_length"] = rules2["consequents"].apply(lambda x: len(x))
rules2.sort_values("confidence")  # returns a sorted copy; assign it back to keep the order
print(rules2.head())
|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | ... |
|---|---|---|---|---|---|---|---|---|
| 0 | (AVOCADO) | (BREAD) | 0.12 | 0.44 | 0.08 | 0.666667 | 1.515152 | ... |
| 1 | (BISCUIT) | (BREAD) | 0.22 | 0.44 | 0.08 | 0.363636 | 0.826446 | ... |
The resulting dataframe can be filtered with standard pandas indexing. We are looking for a large lift (>= 2) and confidence (>= 0.6):
print(rules2.loc[(rules2['lift'] >= 2) & (rules2['confidence'] >= 0.6)])

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | ... |
|---|---|---|---|---|---|---|---|---|
| 3 | (CHEESE) | (BREAD) | 0.10 | 0.44 | 0.10 | 1.000000 | 2.272727 | ... |
| 10 | (SUGAR) | (COFFEE) | 0.14 | 0.24 | 0.10 | 0.714286 | 2.976190 | ... |
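Both surviving rules are actionable: every cheese buyer in this dataset also bought bread, and sugar buyers bought coffee far more often than chance would suggest, making these pairs natural candidates for joint shelf placement or bundled promotions.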
- Support vs Confidence
plt.scatter(rules2['support'], rules2['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
- Support vs Lift
plt.scatter(rules2['support'], rules2['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
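In both scatter plots, the most promising rules sit toward the upper right: they are backed by enough transactions (support) and are either reliable (high confidence) or markedly stronger than chance (lift well above 1).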