Market basket analysis is a technique used by businesses to identify associations between products or services that are frequently purchased together. It is a type of data mining that involves analyzing customer transaction data, such as point-of-sale records or e-commerce shopping carts, to find patterns in customer buying behavior.
The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find itemsets that frequently occur together across transactions.
For this analysis we are going to use the Apriori algorithm. The Apriori algorithm works by first identifying all itemsets that have a support greater than or equal to a specified threshold. The support of an itemset is the proportion of transactions in which the itemset appears.
Once the itemsets with sufficient support have been identified, the algorithm generates new candidate itemsets by combining them with other frequent itemsets. This process is repeated until no more itemsets with sufficient support can be generated.
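To make the level-wise search concrete, here is a minimal sketch of one Apriori pass; the transactions, min_support value and helper function are invented for illustration, and the actual analysis below relies on mlxtend's implementation:
# toy transactions, invented for illustration
transactions = [{'MILK', 'BREAD'}, {'BREAD', 'COFFEE'}, {'MILK', 'BREAD', 'COFFEE'}]
min_support = 0.5
def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)
# level 1: frequent single items
items = {item for t in transactions for item in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
# level 2: combine frequent itemsets into larger candidates, keep the frequent ones
candidates = {a | b for a in frequent for b in frequent if len(a | b) == 2}
print([c for c in candidates if support(c) >= min_support])
# e.g. [frozenset({'MILK', 'BREAD'}), frozenset({'BREAD', 'COFFEE'})]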
Market basket analysis typically aims to:
- Identify frequent itemsets in a transaction database.
- Determine the association rules between items, including which items tend to be purchased together and which items tend to be purchased separately.
- Determine the minimum support and minimum confidence levels needed for the association rules to be considered significant.
- Identify which items should be placed near each other in a store or online store to encourage customers to purchase related items.
We want to calculate the support, confidence, and lift for the association rule {A, B} => {C}, which means "if a customer buys items A and B together, they are likely to buy item C as well."
Support: The support measures the frequency of occurrence of a particular item set in the transaction dataset. Support ({A, B}) = Number of transactions containing {A, B} / Total number of transactions
Confidence: The confidence measures the probability that item C is purchased given that items A and B are purchased together. Confidence ({A, B} => {C}) = Support ({A, B, C}) / Support ({A, B})
Lift: The lift measures the strength of the association between item sets. A lift value greater than 1 indicates a positive association between the item sets, while a value less than 1 indicates a negative association. Lift ({A, B} => {C}) = Support ({A, B, C}) / (Support ({A, B}) x Support ({C}))
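As a quick sanity check, the three formulas can be verified by hand; the counts below are invented for illustration:
# 10 hypothetical transactions: 4 contain {A, B}, 3 of those also contain C, 5 contain C
n_total, n_ab, n_abc, n_c = 10, 4, 3, 5
support_ab = n_ab / n_total                     # 0.4
support_abc = n_abc / n_total                   # 0.3
support_c = n_c / n_total                       # 0.5
confidence = support_abc / support_ab           # 0.75
lift = support_abc / (support_ab * support_c)   # 1.5 -> positive association
print(support_ab, confidence, lift)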
# for basic operations
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
# for visualizations
import matplotlib.pyplot as plt
import seaborn as sb
import squarify
# for market basket analysis
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth
# display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# read the data
store = pd.read_csv('~/MarketBasketAnalysis/GroceryStoreDataSet.csv', names=['product'], header=None)
# check the shape of the dataset
print(store.shape)
(50, 1)
# statistical description of the dataset
print(store.describe())

|  | product |
|---|---|
| count | 50 |
| unique | 47 |
| top | BREAD,COFFEE,SUGAR |
| freq | 2 |
# show the first few rows
print(store.head())

|  | product |
|---|---|
| 0 | MILK,BREAD,BISCUIT |
| 1 | PIZZA |
| 2 | PIZZA,WATER,BEER |
# check random entries in the dataset
print(store.sample(3))

|  | product |
|---|---|
| 8 | SALT,JUICE,COFFEE |
| 45 | BREAD,SUGAR,BOURNVITA |
| 15 | SHRIMP,EGGS,AVOCADO,BREAD |
# dataset prep: split the comma-separated column into several columns
store_df = store['product'].str.split(",", expand=True)
# show the first few rows
print(store_df.head())

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | MILK | BREAD | BISCUIT | None | None |
| 1 | PIZZA | None | None | None | None |
| 2 | PIZZA | WATER | BEER | None | None |
# statistical description of the split dataset
print(store_df.describe())

|  | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| count | 50 | 49 | 45 | 16 | 1 |
| unique | 20 | 18 | 15 | 10 | 1 |
| top | BREAD | BREAD | BISCUIT | CORNFLAKES | CORNFLAKES |
| freq | 9 | 6 | 8 | 4 | 1 |
# frequency of the most popular items (based on the first item listed in each basket)
plt.rcParams['figure.figsize'] = (18, 7)
color = plt.cm.copper(np.linspace(0, 1, 40))
store_df[0].value_counts().head(40).plot.bar(color=color)
plt.title('Frequency of the most popular items', fontsize=20)
plt.xticks(rotation=90)
plt.grid()
plt.show()
# treemap of the same item counts
y = store_df[0].value_counts().head(50).to_frame()
plt.rcParams['figure.figsize'] = (20, 20)
squarify.plot(sizes=y.values, label=y.index,
              alpha=0.8,
              color=sb.color_palette("magma"),
              ec='white')
plt.title('Tree Map for Popular Items')
plt.axis('off')
plt.show()
Result: bread and coffee are the most frequent items on the list.
The Apriori step requires all the items of a transaction to be in one row, with each item one-hot encoded.
# create a list of transactions
store_list = list(store['product'].apply(lambda x: x.split(",")))
[['MILK', 'BREAD', 'BISCUIT'], ['PIZZA'], ['PIZZA', 'WATER', 'BEER'], ...]
# one transaction per row, with each product one-hot encoded
te = TransactionEncoder()
store_ap = te.fit(store_list).transform(store_list)
store_ap = pd.DataFrame(store_ap, columns=te.columns_)
print(store_ap.head())

|  | APPLE | AVOCADO | BANANA | BEER | ... |
|---|---|---|---|---|---|
| 0 | False | False | False | False | ... |
| 1 | False | False | False | False | ... |
| 2 | False | False | False | True | ... |
print(store_ap.shape)
(50, 29)
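As an aside, the same boolean table can be produced with plain pandas instead of TransactionEncoder; this alternative (a sketch, with store_ap_alt as a made-up name) should yield an equivalent 50 x 29 frame:
# alternative: one-hot encode directly from the comma-separated strings
store_ap_alt = store['product'].str.get_dummies(sep=',').astype(bool)
print(store_ap_alt.shape)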
The algorithm performs a level-wise search for frequent itemsets: every itemset with a support of at least the min_support value (0.07 here) is generated.
# min_support: the minimum relative frequency at which an itemset must appear to be kept
frequent_itemsets = apriori(store_ap, min_support=0.07, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
print(frequent_itemsets)
|  | support | itemsets | length |
|---|---|---|---|
| 0 | 0.12 | (AVOCADO) | 1 |
| 1 | 0.22 | (BISCUIT) | 1 |
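The length column added above makes filtering straightforward; for example, a sketch that keeps only the multi-item sets:
# keep only the itemsets containing at least two items
print(frequent_itemsets[frequent_itemsets['length'] >= 2])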
Typically, support is used to measure abundance or frequency (often interpreted as significance or importance): an itemset is called a "frequent itemset" when its support exceeds the specified minimum-support threshold. Next, we will generate the rules with their corresponding support, confidence and lift. Lift is the ratio of the observed support to the support expected if antecedent and consequent were independent, and confidence is a measure of the reliability of the rule.
min_support is a floating-point value between 0 and 1 indicating the minimum support required for an itemset to be selected: the number of transactions containing the itemset divided by the total number of transactions.
The antecedent is the set of items used to predict or recommend another set of items in a customer's transaction history; the consequent is the item or items being predicted or recommended based on the presence of the antecedent. In short, the antecedent refers to the items already bought, and the consequent refers to the likely additional purchase. Antecedent support is the frequency of the antecedent, and consequent support is the frequency of the consequent.
The metric can be set to confidence, lift, support, leverage or conviction:
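For completeness, the two metrics not defined earlier follow the standard definitions (also the ones mlxtend documents, to the best of my knowledge): Leverage ({A} => {C}) = Support ({A, C}) - Support ({A}) x Support ({C}), which is 0 when A and C are independent, and Conviction ({A} => {C}) = (1 - Support ({C})) / (1 - Confidence ({A} => {C})), which grows the less often the rule is violated.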
# lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.3)
rules["antecedents_length"] = rules["antecedents"].apply(lambda x: len(x))
rules["consequents_length"] = rules["consequents"].apply(lambda x: len(x))
rules.sort_values("lift")  # returns a sorted copy; assign it back to keep the order
print(rules.head())
|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | ... |
|---|---|---|---|---|---|---|---|---|
| 0 | (AVOCADO) | (BREAD) | 0.12 | 0.44 | 0.08 | 0.666667 | 1.515152 | ... |
| 1 | (BREAD) | (AVOCADO) | 0.44 | 0.12 | 0.08 | 0.181818 | 1.515152 | ... |
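Reading the first row: a lift of about 1.52 means a basket containing avocado is roughly 1.5 times as likely to also contain bread as a randomly chosen basket. Note that lift is symmetric, which is why both directions of the avocado/bread rule share the same value even though their confidences differ.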
# confidence
rules2 = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.3)
rules2["antecedents_length"] = rules2["antecedents"].apply(lambda x: len(x))
rules2["consequents_length"] = rules2["consequents"].apply(lambda x: len(x))
rules2.sort_values("confidence")  # returns a sorted copy; assign it back to keep the order
print(rules2.head())
|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | ... |
|---|---|---|---|---|---|---|---|---|
| 0 | (AVOCADO) | (BREAD) | 0.12 | 0.44 | 0.08 | 0.666667 | 1.515152 | ... |
| 1 | (BISCUIT) | (BREAD) | 0.22 | 0.44 | 0.08 | 0.363636 | 0.826446 | ... |
The resulting dataframe can be filtered with standard pandas indexing. We are looking for a large lift (>= 2) and confidence (>= 0.6):
print(rules2.loc[(rules2['lift'] >= 2) & (rules2['confidence'] >= 0.6)])

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | ... |
|---|---|---|---|---|---|---|---|---|
| 3 | (CHEESE) | (BREAD) | 0.10 | 0.44 | 0.10 | 1.000000 | 2.272727 | ... |
| 10 | (SUGAR) | (COFFEE) | 0.14 | 0.24 | 0.10 | 0.714286 | 2.976190 | ... |
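Both surviving rules are actionable: every cheese buyer in this dataset also bought bread, and sugar buyers bought coffee far more often than chance would suggest, making these pairs natural candidates for joint shelf placement or bundled promotions.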
- Support vs Confidence
plt.scatter(rules2['support'], rules2['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
- Support vs Lift
plt.scatter(rules2['support'], rules2['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
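In both scatter plots, the most promising rules sit toward the upper right: they are backed by enough transactions (support) and are either reliable (high confidence) or markedly stronger than chance (lift well above 1).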