Al Va edited this page Mar 26, 2024 · 2 revisions

Welcome to the E-Commerce-EDA wiki!

E-Commerce Exploratory Data Analysis (EDA) in Python

This case study tests and implements an e-commerce Exploratory Data Analysis (EDA) in Python, based on the previous study Data Science for E-Commerce with Python. Both the input data and the source code can be found here.

  • Basic imports

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import numpy as np

  • Read dataset and preview

df = pd.read_csv("C:/Users/adrou/OneDrive/Documents/STOCKNEWS/e_commerce.csv")

  • Exploring data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 12 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   User_ID                     550068 non-null  int64
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  object
 7   Marital_Status              550068 non-null  int64
 8   Product_Category_1          550068 non-null  int64
 9   Product_Category_2          376430 non-null  float64
 10  Product_Category_3          166821 non-null  float64
 11  Purchase                    550068 non-null  int64
dtypes: float64(2), int64(5), object(5)
memory usage: 50.4+ MB

  • Count null values in the dataset

df.isnull().sum()

User_ID                            0
Product_ID                         0
Gender                             0
Age                                0
Occupation                         0
City_Category                      0
Stay_In_Current_City_Years         0
Marital_Status                     0
Product_Category_1                 0
Product_Category_2            173638
Product_Category_3            383247
Purchase                           0
dtype: int64
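The raw null counts are easier to judge as a share of all rows. Using the figures above, 173638 and 383247 out of 550068 rows correspond to roughly 31.6% and 69.7% missing values. A minimal sketch of that calculation, assuming a small made-up frame that stands in for `df` (the values are illustrative only, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, 14.0, np.nan, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_pct = (sample.isnull().mean() * 100).round(1)
print(missing_pct)
```

On the full dataset the same expression can be applied to `df` directly.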

  • Replace the null values with 0:

df.fillna(0, inplace=True)
# Re-check that N/A was replaced with 0.
df.isnull().sum()
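The re-check can also be reduced to a single number. The sketch below does that on a made-up frame. It also fills only the two columns that contain nulls, which is an assumption about intent and not what the original call does (`df.fillna(0)` fills every column):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({
    "Product_Category_2": [np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, 14.0, 5.0],
    "Purchase": [8370, 15200, 1422],
})

# Fill only the columns known to contain missing values.
cols = ["Product_Category_2", "Product_Category_3"]
sample[cols] = sample[cols].fillna(0)

# Total number of nulls left anywhere in the frame.
remaining_nulls = int(sample.isnull().sum().sum())
print(remaining_nulls)
```

Note that replacing a missing category with 0 treats "no second category" as its own code, so 0 should not be read as a real category ID later on.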

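  • Group by User ID:

# Sum only Purchase; summing text columns such as Product_ID or Age per user is not meaningful.
purchases = df.groupby('User_ID')['Purchase'].sum().reset_index()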

purchases.head()

df[df['User_ID'] == 1000001]
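A per-user view is often more useful with several statistics side by side. The sketch below uses pandas named aggregation on a made-up frame. The output column names `total_spent`, `n_purchases`, and `avg_purchase` are hypothetical, and the rows are invented for illustration:

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({
    "User_ID": [1000001, 1000001, 1000002, 1000001],
    "Purchase": [8370, 15200, 7969, 1422],
})

# One row per user with total spend, number of purchases, and average purchase.
user_summary = (
    sample.groupby("User_ID")
    .agg(
        total_spent=("Purchase", "sum"),
        n_purchases=("Purchase", "count"),
        avg_purchase=("Purchase", "mean"),
    )
    .reset_index()
)
print(user_summary)
```

Filtering with `df[df['User_ID'] == 1000001]`, as above, then serves as a spot check of one row of this summary.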

purchase_by_age = df.groupby('Age')['Purchase'].mean().reset_index()

print(purchase_by_age)

     Age     Purchase
0   0-17  8933.464640
1  18-25  9169.663606
2  26-35  9252.690633
3  36-45  9331.350695
4  46-50  9208.625697
5  51-55  9534.808031
6    55+  9336.280459

print(purchase_by_age['Age'])

0     0-17
1    18-25
2    26-35
3    36-45
4    46-50
5    51-55
6      55+
Name: Age, dtype: object

plt.figure(figsize=(10, 6))
plt.plot(purchase_by_age['Age'], purchase_by_age['Purchase'], color='blue', marker='x', lw=4, markersize=18)
plt.grid()
plt.xlabel('Age Group', fontsize=14)
# The plotted values are group means, so the axis is labeled as an average.
plt.ylabel('Average Purchase in $', fontsize=14)
plt.title('Average Sales distributed by age group', fontsize=16)
plt.show()

ecom_age

plt.hist(purchase_by_age['Purchase'])

ecom_purchase
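Note that `purchase_by_age['Purchase']` holds only seven group means, so the histogram above bins seven values. To look at the spread of individual purchase amounts instead, one option is to bin the raw `Purchase` column. A sketch with made-up values and an arbitrary choice of bin edges (both are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical purchase amounts; invented for illustration.
purchases_sample = pd.Series([1422, 7969, 8370, 9000, 15200, 19215], name="Purchase")

# Arbitrary bin edges; each bin is half-open except the last, which includes 20000.
counts, edges = np.histogram(purchases_sample, bins=[0, 5000, 10000, 15000, 20000])
print(counts)

# Summary statistics give the same picture in numbers.
print(purchases_sample.describe())
```

With matplotlib, `plt.hist(df['Purchase'], bins=...)` would draw the corresponding chart for the full column.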

  • Grouping by gender and age

age_and_gender = df.groupby('Age')['Gender'].count().reset_index()
gender = df.groupby('Gender')['Age'].count().reset_index()

  • Plot distribution

plt.figure(figsize=(12, 9))
plt.pie(age_and_gender['Gender'], labels=age_and_gender['Age'], autopct='%d%%',
        colors=['cyan', 'steelblue', 'peru', 'blue', 'yellowgreen', 'salmon', '#0040FF'],
        textprops={'fontsize': 16})
plt.axis('equal')
plt.title("Age Distribution", fontsize=20)
plt.show()

ecom_agepie

  • Plot gender distribution

plt.figure(figsize=(12, 9))
plt.pie(gender['Age'], labels=gender['Gender'], autopct='%d%%',
        colors=['salmon', 'steelblue'], textprops={'fontsize': 16})
plt.axis('equal')
plt.title("Gender Distribution", fontsize=20)
plt.show()

ecom_gender
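The percentages shown on the two pie charts can also be computed directly, which avoids the truncation to whole numbers that `autopct='%d%%'` applies to the labels. A sketch on a made-up frame (the rows are invented for illustration); the cross-tabulation at the end is an optional extra that counts rows per age group and gender together:

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F", "M", "M", "M"],
    "Age": ["26-35", "26-35", "18-25", "36-45", "26-35", "0-17", "26-35", "18-25"],
})

# Share of rows per gender and per age group, in percent.
gender_share = (sample["Gender"].value_counts(normalize=True) * 100).round(1)
age_share = (sample["Age"].value_counts(normalize=True).sort_index() * 100).round(1)

# Row counts for every combination of age group and gender.
age_by_gender = pd.crosstab(sample["Age"], sample["Gender"])

print(gender_share)
print(age_share)
print(age_by_gender)
```

These shares count transactions, not distinct customers, because each row of the dataset is one purchase.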

  • Group by occupation:

occupation = df.groupby('Occupation')['Purchase'].mean().reset_index()

  • Plot bar chart with line plot:

sns.set(style="white", rc={"lines.linewidth": 3})
fig, ax1 = plt.subplots(figsize=(12, 9))
sns.barplot(x=occupation['Occupation'], y=occupation['Purchase'], color='#004488', ax=ax1)
sns.lineplot(x=occupation['Occupation'], y=occupation['Purchase'], color='salmon', marker="o", ax=ax1)
plt.axis([-1, 21, 8000, 10000])
plt.title('Occupation Bar Chart', fontsize=15)
plt.show()
sns.set()

ecom_occupation

  • Group by product ID

product = df.groupby('Product_ID')['Purchase'].count().reset_index()
product.rename(columns={'Purchase': 'Count'}, inplace=True)
product_sorted = product.sort_values('Count', ascending=False)

  • Plot line plot

plt.figure(figsize=(12, 6))
plt.plot(product_sorted['Product_ID'][:10], product_sorted['Count'][:10],
         linestyle='-', color='green', marker='o', lw=4, markersize=12)
plt.title("Best-selling Products", fontsize=20)
plt.xlabel('Product ID', fontsize=18)
plt.ylabel('Products Sold', fontsize=18)
plt.show()

ecom_bestsellingproducts

  • Group by Age vs Purchase:

age_purchase = df.groupby('Age')['Purchase'].mean().reset_index()

  • Plot bar chart with line plot:

sns.set(style="white", rc={"lines.linewidth": 3})
fig, ax1 = plt.subplots(figsize=(12, 9))
sns.barplot(x=age_purchase['Age'], y=age_purchase['Purchase'], color='#004488', ax=ax1)
sns.lineplot(x=age_purchase['Age'], y=age_purchase['Purchase'], color='salmon', marker="o", ax=ax1)
plt.ylim([8800, 9600])
plt.title('Purchase vs Age Bar Chart', fontsize=15)
plt.show()
sns.set()

ecom_purchase_age

  • Group by Gender vs Purchase:

gender_purchase = df.groupby('Gender')['Purchase'].mean().reset_index()

  • Plot bar chart with line plot:

sns.set(style="white", rc={"lines.linewidth": 3})
fig, ax1 = plt.subplots(figsize=(8, 6))
sns.barplot(x=gender_purchase['Gender'], y=gender_purchase['Purchase'], color='#004488', ax=ax1)
sns.lineplot(x=gender_purchase['Gender'], y=gender_purchase['Purchase'], color='salmon', marker="o", ax=ax1)
plt.ylim([8000, 9600])
plt.title('Purchase vs Gender Bar Chart', fontsize=15)
plt.show()
sns.set()

ecom_purchase_gender
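
The occupation, age, and gender steps above repeat the same pattern: group by one categorical column and average `Purchase`. A small helper can make that pattern explicit. This is only a sketch; the function name `mean_purchase_by` and the sample rows are hypothetical and not part of the original code:

```python
import pandas as pd


def mean_purchase_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the mean Purchase for each distinct value of `column`."""
    return frame.groupby(column)["Purchase"].mean().reset_index()


# Hypothetical stand-in for df; the values are invented for illustration.
sample = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Age": ["26-35", "26-35", "18-25", "18-25"],
    "Purchase": [10000, 8000, 9000, 7000],
})

by_gender = mean_purchase_by(sample, "Gender")
by_age = mean_purchase_by(sample, "Age")
print(by_gender)
print(by_age)
```

Each returned frame has the grouping column plus `Purchase`, so it can be passed to the same `sns.barplot` and `sns.lineplot` calls used above.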