MarketBasketAnalysis

Market basket analysis is a technique used by businesses to identify associations between products or services that are frequently purchased together. It is a type of data mining that involves analyzing customer transaction data, such as point-of-sale records or e-commerce shopping carts, to find patterns in customer buying behavior.

The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases.

For this analysis, we are going to use the Apriori algorithm. The Apriori algorithm works by first identifying all itemsets that have a support greater than or equal to a specified threshold. The support of an itemset is the proportion of transactions in which the itemset appears.

Once the itemsets with sufficient support have been identified, the algorithm generates new candidate itemsets by combining them with other frequent itemsets. This process is repeated until no more itemsets with sufficient support can be generated.
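To make the level-wise idea concrete, here is a minimal, illustrative sketch of support counting and candidate generation in plain Python. The toy transactions and the min_support value are made up for the example; the actual analysis below uses mlxtend's implementation.

from itertools import combinations

# toy transactions (illustrative only)
transactions = [
    {"MILK", "BREAD", "BISCUIT"},
    {"PIZZA"},
    {"PIZZA", "WATER", "BEER"},
    {"MILK", "BREAD"},
]
min_support = 0.5  # illustrative threshold

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# level 1: frequent single items
items = {item for t in transactions for item in t}
frequent_1 = [frozenset([i]) for i in items
              if support(frozenset([i]), transactions) >= min_support]

# level 2: candidate pairs built only from frequent 1-itemsets
candidates = {a | b for a, b in combinations(frequent_1, 2)}
frequent_2 = [c for c in candidates
              if support(c, transactions) >= min_support]
print(frequent_2)  # -> [frozenset({'MILK', 'BREAD'})]

Larger candidates are built the same way, level by level, until no itemset reaches the threshold.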

The Apriori algorithm can be used to:

  • Identify frequent itemsets in a transaction database.
  • Determine association rules between items, including which items tend to be purchased together and which tend to be purchased separately.
  • Determine the minimum support and minimum confidence levels needed for association rules to be considered significant.
  • Identify which items should be placed near each other in a physical or online store to encourage customers to purchase related items.

Calculations

We want to calculate the support, confidence, and lift for the association rule {A, B} => {C}, which means "if a customer buys items A and B together, they are likely to buy item C as well."

Support: The support measures the frequency of occurrence of a particular itemset in the transaction dataset.

Support({A, B}) = Number of transactions containing {A, B} / Total number of transactions

Confidence: The confidence measures the probability that item C is purchased given that items A and B are purchased together.

Confidence({A, B} => {C}) = Support({A, B, C}) / Support({A, B})

Lift: The lift measures the strength of the association between itemsets. A lift value greater than 1 indicates a positive association, while a value less than 1 indicates a negative association.

Lift({A, B} => {C}) = Support({A, B, C}) / (Support({A, B}) x Support({C}))
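As a worked example with made-up numbers: suppose there are 100 transactions in total, {A, B} appears in 20 of them, {A, B, C} in 10, and {C} in 25. A quick check in Python:

total = 100                      # hypothetical transaction count
n_ab, n_abc, n_c = 20, 10, 25    # hypothetical itemset counts

support_ab  = n_ab  / total      # 0.20
support_abc = n_abc / total      # 0.10
support_c   = n_c   / total      # 0.25

confidence = support_abc / support_ab          # 0.10 / 0.20 = 0.5
lift = support_abc / (support_ab * support_c)  # 0.10 / 0.05 = 2.0

A lift of 2.0 means customers who buy A and B together are twice as likely to also buy C as customers in general, so the rule {A, B} => {C} reflects a genuine positive association.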

Getting started

Importing the required libraries

#for basic operations
import numpy as np 
import pandas as pd 
from mlxtend.preprocessing import TransactionEncoder 

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sb
import squarify

# for market basket analysis
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
#setting
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Importing the dataset

#read the data
store = pd.read_csv('~/MarketBasketAnalysis/GroceryStoreDataSet.csv',names=['product'],header=None)

#check data set
print(store.shape) 

(50, 1)

#Statistical description of the dataset.
print(store.describe())

                   product
count                   50
unique                  47
top     BREAD,COFFEE,SUGAR
freq                     2

#show the first few rows
print(store.head())

              product
0  MILK,BREAD,BISCUIT
1               PIZZA
2    PIZZA,WATER,BEER

#checking random entries in the dataset (DataFrame has no .random() method; use .sample())
print(store.sample(3))

                      product
8           SALT,JUICE,COFFEE
45      BREAD,SUGAR,BOURNVITA
15  SHRIMP,EGGS,AVOCADO,BREAD

Data Visualizations

#data set prep: split the column into several columns
store_df = store['product'].str.split(",", expand=True)

#show the first few rows
print(store_df.head())

       0      1        2     3     4
0   MILK  BREAD  BISCUIT  None  None
1  PIZZA   None     None  None  None
2  PIZZA  WATER     BEER  None  None

#Statistical description of the dataset.
print(store_df.describe())

            0      1        2           3           4
count      50     49       45          16           1
unique     20     18       15          10           1
top     BREAD  BREAD  BISCUIT  CORNFLAKES  CORNFLAKES
freq        9      6        8           4           1

Bar chart

# frequency of the most popular items, counted across all columns
plt.rcParams['figure.figsize'] = (18, 7)
color = plt.cm.copper(np.linspace(0, 1, 40))
store_df.stack().value_counts().head(40).plot.bar(color = color)
plt.title('Frequency of most popular items', fontsize = 20)
plt.xticks(rotation = 90)
plt.grid()
plt.show()

Tree map

y = store_df.stack().value_counts().head(50).to_frame()

plt.rcParams['figure.figsize'] = (20, 20)
squarify.plot(sizes = y.values, label = y.index,
              alpha = .8,
              color = sb.color_palette("magma"),
              ec = 'white')
plt.title('Tree Map for Popular Items')
plt.axis('off')
plt.show()

Result: Bread and Coffee are the most frequent items on the list.

Apriori algorithm

This analysis requires that all the data for a transaction be included in one row, with the items one-hot encoded.

#create a list of transactions
store_list = list(store['product'].apply(lambda x: x.split(",")))

[['MILK', 'BREAD', 'BISCUIT'], ['PIZZA'], ['PIZZA', 'WATER', 'BEER'], ...]

# 1 transaction per row, each product one-hot encoded
te = TransactionEncoder()
store_ap = te.fit(store_list).transform(store_list)
store_ap = pd.DataFrame(store_ap, columns=te.columns_)
print(store_ap.head())

   APPLE  AVOCADO  BANANA   BEER  ...
0  False    False   False  False  ...
1  False    False   False  False  ...
2  False    False   False   True  ...

print(store_ap.shape)

(50, 29)

Create some rules

The algorithm employs a level-wise search for frequent itemsets: all itemsets with a support value greater than or equal to the min_support value of 0.07 are kept.

#keep itemsets whose support (the relative frequency with which they show up) is at least min_support
frequent_itemsets = apriori(store_ap, min_support=0.07, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
print(frequent_itemsets)

   support   itemsets  length
0     0.12  (AVOCADO)       1
1     0.22  (BISCUIT)       1
...

Typically, support is used to measure the abundance or frequency of an itemset (often interpreted as its significance or importance). We refer to an itemset as a "frequent itemset" if its support is larger than a specified minimum-support threshold. Next, we will generate the rules with their corresponding support, confidence, and lift. Lift is the ratio of the observed support to the support expected if the two itemsets were independent, and confidence is a measure of the reliability of the rule.

min_support is a floating-point value between 0 and 1 that indicates the minimum support required for an itemset to be selected: number of observations containing the itemset / total number of observations.

The antecedent refers to the set of items used to predict or recommend another set of items in a customer's transaction history; the consequent is the item or items being predicted or recommended based on the presence of the antecedent. In short, the antecedent refers to the items already bought, and the consequent refers to the possible purchase. Antecedent support is the frequency of the antecedent, and consequent support is the frequency of the consequent.

The metric can be set to confidence, lift, support, leverage, or conviction:

#lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.3)
rules["antecedents_length"] = rules["antecedents"].apply(lambda x: len(x))
rules["consequents_length"] = rules["consequents"].apply(lambda x: len(x))
rules.sort_values("lift")  # returns a sorted copy; assign it back to keep the order
rules.head()

  antecedents consequents  antecedent support  consequent support  support  confidence      lift  ...
0   (AVOCADO)     (BREAD)                0.12                0.44     0.08    0.666667  1.515152  ...
1     (BREAD)   (AVOCADO)                0.44                0.12     0.08    0.181818  1.515152  ...
# confidence
rules2 = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
rules2["antecedents_length"] = rules2["antecedents"].apply(lambda x: len(x))
rules2["consequents_length"] = rules2["consequents"].apply(lambda x: len(x))
rules2.sort_values("confidence")  # returns a sorted copy; assign it back to keep the order
rules2.head()

  antecedents consequents  antecedent support  consequent support  support  confidence      lift  ...
0   (AVOCADO)     (BREAD)                0.12                0.44     0.08    0.666667  1.515152  ...
1   (BISCUIT)     (BREAD)                0.22                0.44     0.08    0.363636  0.826446  ...
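Besides these, association_rules also returns leverage and conviction columns. Leverage is the difference between the observed support of a rule and the support expected if the antecedent and consequent were independent; conviction compares how often the rule would be wrong under independence to how often it is actually wrong. As a sanity check, both can be recomputed from the columns above (a minimal sketch using the rules2 frame):

# leverage: support(A => C) - support(A) * support(C)
lev = rules2["support"] - rules2["antecedent support"] * rules2["consequent support"]

# conviction: (1 - support(C)) / (1 - confidence(A => C));
# infinite when confidence is 1, i.e. the rule is never wrong
conv = (1 - rules2["consequent support"]) / (1 - rules2["confidence"])

print((lev - rules2["leverage"]).abs().max())  # should be ~0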

Filter the dataframe using standard pandas code. We are looking for a large lift (>= 2) and a high confidence (>= 0.6):

print(rules2.loc[(rules2['lift'] >= 2) & (rules2['confidence'] >= 0.6)])

   antecedents consequents  antecedent support  consequent support  support  confidence      lift  ...
3     (CHEESE)     (BREAD)                0.10                0.44     0.10    1.000000  2.272727  ...
10     (SUGAR)    (COFFEE)                0.14                0.24     0.10    0.714286  2.976190  ...

Visualizing results

1. Support vs Confidence

plt.scatter(rules2['support'], rules2['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()

2. Support vs Lift

plt.scatter(rules2['support'], rules2['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
