Home
Welcome to the E-Commerce-EDA wiki!
This case study implements and tests Exploratory Data Analysis (EDA) of e-commerce data in Python, based on a previous study: Data Science for E-Commerce with Python. Both the input data and the source code can be found here.
- Basic imports
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.gridspec as gridspec
    import seaborn as sns
    import numpy as np
- Read dataset and preview
    df = pd.read_csv("C:/Users/adrou/OneDrive/Documents/STOCKNEWS/e_commerce.csv")
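The step above loads the file but does not show the preview itself. A minimal sketch of the preview, assuming a tiny made-up sample in place of the real CSV (the `sample_csv` text and its values are hypothetical, only the column names come from the dataset):

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for e_commerce.csv;
# the values are made up for illustration only.
sample_csv = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Purchase\n"
    "1,P001,F,0-17,100\n"
    "1,P002,F,0-17,250\n"
    "2,P001,M,26-35,300\n"
)

sample = pd.read_csv(sample_csv)

# head(n) returns the first n rows (five by default) for a quick look.
preview = sample.head(2)
print(preview)

# shape gives (number of rows, number of columns).
print(sample.shape)
```

On the real data the same two calls would be `df.head()` and `df.shape`.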
- Exploring data
    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 550068 entries, 0 to 550067
    Data columns (total 12 columns):
     #   Column                      Non-Null Count   Dtype
     0   User_ID                     550068 non-null  int64
     1   Product_ID                  550068 non-null  object
     2   Gender                      550068 non-null  object
     3   Age                         550068 non-null  object
     4   Occupation                  550068 non-null  int64
     5   City_Category               550068 non-null  object
     6   Stay_In_Current_City_Years  550068 non-null  object
     7   Marital_Status              550068 non-null  int64
     8   Product_Category_1          550068 non-null  int64
     9   Product_Category_2          376430 non-null  float64
     10  Product_Category_3          166821 non-null  float64
     11  Purchase                    550068 non-null  int64
    dtypes: float64(2), int64(5), object(5)
    memory usage: 50.4+ MB
- Count null features in the dataset
    df.isnull().sum()
    User_ID                            0
    Product_ID                         0
    Gender                             0
    Age                                0
    Occupation                         0
    City_Category                      0
    Stay_In_Current_City_Years         0
    Marital_Status                     0
    Product_Category_1                 0
    Product_Category_2            173638
    Product_Category_3            383247
    Purchase                           0
    dtype: int64
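From the counts above, roughly 31.6% of Product_Category_2 (173638 of 550068) and 69.7% of Product_Category_3 (383247 of 550068) are missing. A short sketch of computing such shares directly, assuming a small made-up frame (`sample` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample; values are illustrative only.
sample = pd.DataFrame({
    "Product_Category_1": [3, 1, 12, 8],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Product_Category_3": [np.nan, np.nan, np.nan, 17.0],
})

# isnull() yields booleans; their column mean is the fraction of nulls.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)
```

Applied to the real data, this would be `df.isnull().mean() * 100`.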
- Replace the null features with 0:
    df.fillna(0, inplace=True)
    # Re-check N/A was replaced with 0.
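The re-check mentioned in the comment can be sketched as follows. This variant is an assumption rather than the original approach: it fills only the two product-category columns and casts them back to integers, using a hypothetical `sample` frame with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    "Product_Category_2": [np.nan, 6.0, np.nan],
    "Product_Category_3": [np.nan, np.nan, 17.0],
    "Purchase": [100, 250, 300],
})

# Fill only the columns known to contain nulls, then restore an integer dtype
# (the NaN values had forced these columns to float64).
sample = sample.fillna(
    {"Product_Category_2": 0, "Product_Category_3": 0}
).astype({"Product_Category_2": "int64", "Product_Category_3": "int64"})

# Re-check: the total number of nulls across the frame should now be zero.
remaining_nulls = int(sample.isnull().sum().sum())
print(remaining_nulls)
print(sample.dtypes)
```

Note that 0 then acts as a placeholder for "no secondary category", which is worth keeping in mind if these columns are later averaged or summed.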
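- Group by User ID:
    # Sum numeric columns only; text columns such as Product_ID cannot be summed meaningfully
    purchases = df.groupby(['User_ID']).sum(numeric_only=True).reset_index()
    purchases.head()
    # Inspect the raw rows for a single user
    df[df['User_ID'] == 1000001]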
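Summing every numeric column per user also adds up codes such as Occupation and Marital_Status, which has no clear meaning. A hedged alternative sketch that aggregates only Purchase, using named aggregation on a hypothetical `sample` frame (the output column names `total_spent` and `n_purchases` are invented for illustration):

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 3],
    "Purchase": [100, 250, 300, 50, 75, 25],
})

# One row per user: total amount spent and number of purchase records.
per_user = (
    sample.groupby("User_ID")
    .agg(total_spent=("Purchase", "sum"), n_purchases=("Purchase", "count"))
    .reset_index()
)
print(per_user)
```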
- Average purchase by age group:
    purchase_by_age = df.groupby('Age')['Purchase'].mean().reset_index()
    print(purchase_by_age)
         Age     Purchase
    0   0-17  8933.464640
    1  18-25  9169.663606
    2  26-35  9252.690633
    3  36-45  9331.350695
    4  46-50  9208.625697
    5  51-55  9534.808031
    6    55+  9336.280459
    print(purchase_by_age['Age'])
    0     0-17
    1    18-25
    2    26-35
    3    36-45
    4    46-50
    5    51-55
    6      55+
    Name: Age, dtype: object
- Plot average purchase by age group:
    plt.figure(figsize=(10,6))
    plt.plot(purchase_by_age['Age'], purchase_by_age['Purchase'], color='blue', marker='x', lw=4, markersize=18)
    plt.grid()
    plt.xlabel('Age Group', fontsize=14)
    # The values are group means, so label the axis as an average, not a total
    plt.ylabel('Average Purchase in $', fontsize=14)
    plt.title('Average Sales distributed by age group', fontsize=16)
    plt.show()
    # Histogram of the seven group means
    plt.hist(purchase_by_age['Purchase'])
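The Age column holds strings, so the order of the groups above comes from alphabetical sorting, which happens to match the natural order of these brackets. A sketch of making that order explicit with an ordered categorical, assuming a hypothetical `sample` frame with made-up values (the bracket labels are the ones printed above):

```python
import pandas as pd

# Age brackets in their natural order, as printed in the output above.
age_order = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"]

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    "Age": ["55+", "0-17", "26-35", "0-17"],
    "Purchase": [400, 100, 300, 200],
})

# An ordered categorical sorts by the declared category order, not alphabetically.
sample["Age"] = pd.Categorical(sample["Age"], categories=age_order, ordered=True)

# observed=True keeps only the brackets that actually occur in the data.
mean_by_age = sample.groupby("Age", observed=True)["Purchase"].mean().reset_index()
print(mean_by_age)
```

This guards against a future bracket label (for example one starting with a letter) silently reordering the x-axis of the plots.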
- Grouping by gender and age
    # Number of records per age group and per gender
    age_and_gender = df.groupby('Age')['Gender'].count().reset_index()
    gender = df.groupby('Gender')['Age'].count().reset_index()
- Plot age distribution
    plt.figure(figsize=(12,9))
    plt.pie(age_and_gender['Gender'], labels=age_and_gender['Age'], autopct='%d%%',
            colors=['cyan', 'steelblue', 'peru', 'blue', 'yellowgreen', 'salmon', '#0040FF'],
            textprops={'fontsize': 16})
    plt.axis('equal')
    plt.title("Age Distribution", fontsize=20)
    plt.show()
- Plot gender distribution
    plt.figure(figsize=(12,9))
    plt.pie(gender['Age'], labels=gender['Gender'], autopct='%d%%',
            colors=['salmon', 'steelblue'], textprops={'fontsize': 16})
    plt.axis('equal')
    plt.title("Gender Distribution", fontsize=20)
    plt.show()
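The shares drawn in the pie charts can also be computed as plain numbers. A minimal sketch, assuming a hypothetical `sample` frame with made-up values; it also shows that the `'%d%%'` format used for `autopct` truncates rather than rounds:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({"Gender": ["M", "M", "F", "M"]})

# normalize=True returns fractions of the total instead of raw counts.
gender_share = sample["Gender"].value_counts(normalize=True) * 100
print(gender_share)

# '%d' drops the fractional part, so 75.9 is shown as 75%, not 76%.
label = "%d%%" % 75.9
print(label)
```

If rounded labels are preferred, a format such as `'%.1f%%'` keeps one decimal place.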
- Group by occupation:
    occupation = df.groupby('Occupation')['Purchase'].mean().reset_index()
- Plot bar chart with line plot:
    sns.set(style="white", rc={"lines.linewidth": 3})
    fig, ax1 = plt.subplots(figsize=(12,9))
    sns.barplot(x=occupation['Occupation'], y=occupation['Purchase'], color='#004488', ax=ax1)
    sns.lineplot(x=occupation['Occupation'], y=occupation['Purchase'], color='salmon', marker="o", ax=ax1)
    # Zoom the y-axis to the range where the group means lie
    plt.axis([-1, 21, 8000, 10000])
    plt.title('Occupation Bar Chart', fontsize=15)
    plt.show()
    # Restore the default seaborn theme
    sns.set()
- Group by product ID
    # Count purchase records per product and sort from most to least sold
    product = df.groupby('Product_ID')['Purchase'].count().reset_index()
    product.rename(columns={'Purchase': 'Count'}, inplace=True)
    product_sorted = product.sort_values('Count', ascending=False)
- Plot line plot
    plt.figure(figsize=(12,6))
    # Top 10 products by number of purchase records
    plt.plot(product_sorted['Product_ID'][:10], product_sorted['Count'][:10],
             linestyle='-', color='green', marker='o', lw=4, markersize=12)
    plt.title("Best-selling Products", fontsize=20)
    plt.xlabel('Product ID', fontsize=18)
    plt.ylabel('Products Sold', fontsize=18)
    plt.show()
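The group, rename and sort steps used for the best-seller ranking can be condensed with `value_counts`, which counts rows per value and returns them in descending order. A sketch on a hypothetical `sample` frame with made-up values; it gives the same counts as the groupby version as long as Purchase has no nulls, which the `df.info()` output indicates:

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    "Product_ID": ["P1", "P2", "P1", "P3", "P1", "P2"],
    "Purchase": [10, 20, 30, 40, 50, 60],
})

# Rows per product, most frequent first; head(2) keeps the top two.
top_products = sample["Product_ID"].value_counts().head(2)
print(top_products)
```

On the real data, `df['Product_ID'].value_counts().head(10)` would give the ten products plotted above.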
- Group by Age vs Purchase:
    purchase_by_age = df.groupby('Age')['Purchase'].mean().reset_index()
- Plot bar chart with line plot:
    sns.set(style="white", rc={"lines.linewidth": 3})
    fig, ax1 = plt.subplots(figsize=(12,9))
    sns.barplot(x=purchase_by_age['Age'], y=purchase_by_age['Purchase'], color='#004488', ax=ax1)
    sns.lineplot(x=purchase_by_age['Age'], y=purchase_by_age['Purchase'], color='salmon', marker="o", ax=ax1)
    plt.ylim([8800, 9600])
    plt.title('Purchase vs Age Bar Chart', fontsize=15)
    plt.show()
    sns.set()
- Group by Gender vs Purchase:
    purchase_by_gender = df.groupby('Gender')['Purchase'].mean().reset_index()
- Plot bar chart with line plot:
    sns.set(style="white", rc={"lines.linewidth": 3})
    fig, ax1 = plt.subplots(figsize=(8,6))
    sns.barplot(x=purchase_by_gender['Gender'], y=purchase_by_gender['Purchase'], color='#004488', ax=ax1)
    sns.lineplot(x=purchase_by_gender['Gender'], y=purchase_by_gender['Purchase'], color='salmon', marker="o", ax=ax1)
    plt.ylim([8000, 9600])
    plt.title('Purchase vs Gender Bar Chart', fontsize=15)
    plt.show()
    sns.set()
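The age and gender groupings are computed separately above. As an optional sketch, and not part of the original analysis, both can be combined in one table of average purchase per age group and gender with `pivot_table` (the `sample` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    "Age": ["0-17", "0-17", "26-35", "26-35"],
    "Gender": ["F", "M", "F", "M"],
    "Purchase": [100, 200, 300, 500],
})

# Rows are age groups, columns are genders, cells are mean purchase amounts.
age_gender = sample.pivot_table(
    index="Age", columns="Gender", values="Purchase", aggfunc="mean"
)
print(age_gender)
```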