- About the project
- About the dataset
- Project's Main Strategy
- Libraries and Packages
- Project's Phases
- Insights
This project is a Guided Project from DataCamp (the data is already prepared and ready to use). It analyzes data from an online sports retail company. Its main goal is to answer questions about the store's products and their prices, brands, revenue, ratings, and number of reviews.
This project's dataset is divided into 5 CSV files:
- Info contains product id with its name and description.
- Finance contains product id with its listing price, sale price, discount, and revenue.
- Reviews contains product id with its rating and number of reviews.
- Traffic contains product id and the date it was last visited on the website.
- Brands contains product id and its brand.
All files have the product id as a primary key.
First, I created a MySQL database to store the data. Then I established a connection between the MySQL server and Python. After that, I executed all queries from Python and retrieved the required data. Finally, I performed the analysis and visualization using Python's libraries and packages.
- Pandas to deal with dataframes and perform data analysis
- Numpy to get statistical information
- Matplotlib and Seaborn for visualization
- mysql.connector to deal with the database
- load_dotenv and os to deal with environment variables
- HTML and warnings to customize the output plots
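The connection setup described above can be sketched as follows. This is a minimal sketch, not the project's exact code: the environment variable names (`DB_HOST`, `DB_USER`, `DB_PASSWORD`) are assumptions about what the `.env` file contains.

```python
import os

def connection_params(env):
    """Build the keyword arguments for mysql.connector.connect().

    The variable names DB_HOST, DB_USER and DB_PASSWORD are assumptions;
    the project loads its credentials from a .env file via load_dotenv().
    """
    return {
        "host": env.get("DB_HOST", "localhost"),
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "database": "sports_store",
    }

# Typical usage (requires a running MySQL server):
# from dotenv import load_dotenv
# import mysql.connector
# load_dotenv()
# conn = mysql.connector.connect(**connection_params(os.environ))
# cursor = conn.cursor()
```

Keeping the credentials in a `.env` file (read with `load_dotenv` and `os`) avoids hard-coding them in the notebook.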
-
Creating a database with the help of MySQL Server
CREATE DATABASE sports_store;
-
Creating the tables (a table for each csv file)
CREATE TABLE info(
    product_name VARCHAR(100),
    product_id   VARCHAR(15) PRIMARY KEY,
    description  VARCHAR(500)
);

CREATE TABLE finance(
    product_id    VARCHAR(15) PRIMARY KEY,
    listing_price DECIMAL(6, 2),
    sale_price    DECIMAL(6, 2),
    discount      DECIMAL(2, 2),
    revenue       DECIMAL(8, 2)
);

CREATE TABLE reviews(
    product_id VARCHAR(15) PRIMARY KEY,
    rating     DECIMAL(2, 1),
    reviews    INT
);

CREATE TABLE traffic(
    product_id   VARCHAR(15),
    last_visited DATETIME
);

CREATE TABLE brands(
    product_id VARCHAR(15),
    brand      VARCHAR(15)
);
-
Importing data into these tables
The same statement template was used for all files:
LOAD DATA LOCAL INFILE 'path\file_name.csv' INTO TABLE table_name FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
Example:
LOAD DATA LOCAL INFILE 'path\info.csv' INTO TABLE info FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
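Rather than retyping the statement five times, the template can also be filled in from Python. A sketch, assuming a connected `cursor` with `allow_local_infile` enabled; `path` stays a placeholder for the real CSV directory, as above:

```python
def load_statement(table, data_dir="path"):
    """Fill in the LOAD DATA template shown above for one table.

    `data_dir` is a placeholder for the actual CSV directory.
    """
    return (
        "LOAD DATA LOCAL INFILE '{dir}/{table}.csv' "
        "INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' IGNORE 1 ROWS;"
    ).format(dir=data_dir, table=table)

# for table in ["info", "finance", "reviews", "traffic", "brands"]:
#     cursor.execute(load_statement(table))
```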
-
Final database structure:
-
This step was done using 2 functions in Python:
-
show_columns:
def show_columns(table):
    '''
    This function simply shows the columns of the specified table.
    :param table: the name of the table whose columns you want to show
    '''
    cursor.execute("""SHOW COLUMNS FROM {};""".format(table))
    columns = cursor.fetchall()
    print("Columns of {} table are:".format(table))
    for number, column in enumerate(columns):
        print(str(number + 1) + ") " + column[0])
-
show_first_ten_results:
def show_first_ten_results(table):
    '''
    This function shows the first 10 products of the specified table.
    :param table: the name of the table whose first 10 results you want to show
    '''
    cursor.execute("""SELECT * FROM {} ORDER BY product_id LIMIT 10;""".format(table))
    columns = cursor.column_names
    results = cursor.fetchall()
    df = create_dataframe(columns, results)  # To return results in the form of a dataframe
    return df
Using these 2 functions, I can dig deeper into the data and understand exactly what each column in each table represents.
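`show_first_ten_results` calls a `create_dataframe` helper that isn't shown here; a minimal version consistent with that call might look like:

```python
import pandas as pd

def create_dataframe(columns, results):
    # Assumed helper: wrap the rows fetched by the cursor in a DataFrame,
    # using the cursor's column names as the headers.
    return pd.DataFrame(results, columns=columns)
```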
-
Data extraction:
I've used the following SQL concepts:
- Simple queries (SELECT ... FROM ... WHERE)
- CASE ... WHEN ... THEN
- GROUP BY & ORDER BY
- JOINs
- CTEs and Subqueries
- Some analytical functions
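As an illustration of the CASE ... WHEN and GROUP BY concepts, the sketch below runs a price-category revenue query. It uses an in-memory SQLite database so it is self-contained (the project runs similar queries against MySQL through the connector), and the category boundaries and sample rows are made-up values, not the project's.

```python
import sqlite3

# Tiny stand-in for the finance table; values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE finance (product_id TEXT, listing_price REAL, revenue REAL);
    INSERT INTO finance VALUES ('A1', 20, 100), ('A2', 60, 400), ('A3', 150, 900);
""")

# Bucket each product into a price category, then sum revenue per category.
rows = conn.execute("""
    SELECT CASE
             WHEN listing_price < 42  THEN 'Budget'
             WHEN listing_price < 74  THEN 'Average'
             WHEN listing_price < 129 THEN 'Expensive'
             ELSE 'Elite'
           END AS price_category,
           SUM(revenue) AS total_revenue
    FROM finance
    GROUP BY price_category
    ORDER BY total_revenue DESC;
""").fetchall()
```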
-
Data transformation:
- Changing data types
- Dealing with datetime formats
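A small pandas sketch of the kind of transformations listed above; the column names mirror the dataset, but the sample values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "last_visited": ["2019-03-01 10:00:00", "2018-11-20 08:30:00"],
    "revenue": ["1024.50", "380.00"],  # loaded as strings
})
df["last_visited"] = pd.to_datetime(df["last_visited"])  # string -> datetime
df["revenue"] = df["revenue"].astype(float)              # string -> float
df["month"] = df["last_visited"].dt.month                # extract the month
```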
-
Data Analysis:
- Understanding the results and extracting insights
This phase was done using a hybrid strategy: the analysis was done in SQL, with the queries executed through the MySQL/Python connector, while the visualization was done with the Matplotlib and Seaborn libraries in Python.
This phase can be divided into the following steps:
- The price categories that generate the highest revenue for both Adidas and Nike
- How the number of reviews affects revenue
- How the number of reviews changes across the months of the year
- The percentage of footwear versus other clothing products in the store
- The website's number of visits across the quarters of the year
- Female products vs. male products for both Adidas and Nike
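The visualizations follow the same pattern throughout: aggregate in SQL, then plot with Matplotlib/Seaborn. A minimal sketch with made-up monthly review counts (not the project's numbers):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical aggregates: total reviews per month for each brand.
monthly = pd.DataFrame({
    "month": [1, 2, 3],
    "Adidas": [5200, 4800, 5100],
    "Nike": [400, 390, 410],
})

ax = monthly.plot(x="month", y=["Adidas", "Nike"], kind="bar")
ax.set_ylabel("number of reviews")
ax.set_title("Reviews per month by brand")
plt.tight_layout()
```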
This phase builds on the previous one: we visualize the extracted data to reveal more patterns and understand more insights.
The Average and Expensive price categories generate more revenue than the Elite category, which contains the most expensive products.
Revenue increases with the number of reviews, which suggests that people tend to buy products with more reviews, so revenue rises accordingly.
Across the quarters of the year, and for both Adidas and Nike, the first quarter (the first 3 months of the year) sees more product visits than the other quarters.
In addition to the previous insights, the following insights were gathered during the analysis phase (they're explained in the notebook without visuals).
-
The store carries many more Adidas products than Nike products.
-
For all products, Adidas has an average discount of 33.45% while Nike has zero!
-
The top 2 best-selling products are the Air Jordan 10 Retro and the Nike Zoom Fly, both from Nike; they were last visited on the website in 2019 and 2018 respectively.
-
A product's description length has little effect on its rating.
-
Adidas pieces are much pricier than Nike's.
-
Because of the first point (the store carries far more Adidas products), Adidas has a much greater number of reviews than Nike across the months of the year.