This notebook, Decision_Tree.ipynb, demonstrates a complete machine learning workflow using a Decision Tree Classifier to predict user purchase decisions based on social network advertisements. The project illustrates each stage of the pipeline: data import, preprocessing, exploratory data analysis, model training, evaluation, and practical interpretation of results.
Basics of Decision Trees: https://github.com/vyasdeepti/Decision-Tree-Tutorial
- Overview
- Dataset
- Workflow
- How to Run the Notebook
- Results & Interpretation
- Requirements
- References
The notebook guides you through a supervised classification problem: Will a social network user purchase a product after seeing an ad? Using the DecisionTreeClassifier from scikit-learn, we build a predictive model based on user demographics and salary information. This workflow is suitable for students, data science beginners, and anyone seeking a practical illustration of decision trees in Python.
A Decision Tree is a popular supervised machine learning algorithm that is used for both classification and regression tasks. It works by breaking down complex decision-making processes into a series of simpler decisions, represented as a tree-like graph of nodes and branches. Decision Trees are intuitive, easy to visualize, and require minimal data preparation.
A Decision Tree is a flowchart-like structure where:
- Internal nodes represent "tests" or "decisions" on attributes/features.
- Branches represent the outcome of a test.
- Leaf nodes represent the class label (for classification) or value (for regression).
The path from the root to a leaf represents a classification or decision rule.
A tree is typically built in four steps:
- Select the Best Feature: The algorithm chooses the feature that best splits the dataset into subsets with distinct target values. Common criteria:
  - Gini Impurity (for classification)
  - Entropy/Information Gain (for classification)
  - Mean Squared Error (for regression)
- Split the Dataset: Divide the dataset into subsets based on the selected feature.
- Repeat Recursively: For each subset, repeat the process until a stopping criterion is met (e.g., all samples in a subset belong to the same class, or the maximum depth is reached).
- Assign Output: Assign a class (for classification) or value (for regression) to each leaf node.
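The four steps above can be sketched in plain Python. This is a minimal illustration for categorical features using Gini impurity, not how scikit-learn implements its trees. The function names (`gini`, `weighted_gini`, `build_tree`) and the toy rows are hypothetical and exist only for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Size-weighted Gini impurity of the subsets produced by splitting on `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def build_tree(rows, features, target, depth=0, max_depth=3):
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: pure subset, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return majority  # Step 4: assign the output for this leaf
    # Step 1: select the feature with the lowest weighted impurity.
    best = min(features, key=lambda f: weighted_gini(rows, f, target))
    remaining = [f for f in features if f != best]
    # Steps 2 and 3: split on the chosen feature and recurse into each subset.
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target, depth + 1, max_depth)
    return {best: branches}

# Hypothetical toy data, invented purely for illustration.
rows = [
    {"age": "young", "income": "low", "buys": "no"},
    {"age": "young", "income": "high", "buys": "no"},
    {"age": "young", "income": "high", "buys": "no"},
    {"age": "old", "income": "low", "buys": "no"},
    {"age": "old", "income": "high", "buys": "yes"},
    {"age": "old", "income": "high", "buys": "yes"},
]
tree = build_tree(rows, ["age", "income"], "buys")
print(tree)
```

The result is a nested dictionary: each key is a tested feature, each sub-key is a branch value, and each string is a leaf label.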
Suppose we want to build a Decision Tree to classify whether someone will play tennis based on the weather.
| Outlook | Temperature | Humidity | Windy | Play Tennis | 
|---|---|---|---|---|
| Sunny | Hot | High | False | No | 
| Sunny | Hot | High | True | No | 
| Overcast | Hot | High | False | Yes | 
| Rain | Mild | High | False | Yes | 
| Rain | Cool | Normal | False | Yes | 
| Rain | Cool | Normal | True | No | 
| Overcast | Cool | Normal | True | Yes | 
Step 1: Calculate Information Gain for each feature and choose the best one (e.g., Outlook).
Step 2: Split the dataset based on Outlook:
- Sunny → Further split based on Humidity.
- Overcast → Always Play Tennis = Yes (pure leaf).
- Rain → Further split based on Windy.
The resulting tree might look like:
```
Outlook?
├── Sunny
│   └── Humidity?
│       ├── High: No
│       └── Normal: Yes
├── Overcast: Yes
└── Rain
    └── Windy?
        ├── False: Yes
        └── True: No
```
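Step 1 can be checked numerically. The sketch below computes the information gain of each feature for the seven rows in the table above. The helper names (`entropy`, `information_gain`) are illustrative and do not come from the notebook.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_index, target_index=-1):
    """Entropy of the target minus the size-weighted entropy after the split."""
    parent = entropy([row[target_index] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[target_index])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children

features = ["Outlook", "Temperature", "Humidity", "Windy"]
# The seven rows from the Play Tennis table; the last column is the target.
rows = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rain", "Mild", "High", False, "Yes"),
    ("Rain", "Cool", "Normal", False, "Yes"),
    ("Rain", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
]

gains = {name: information_gain(rows, i) for i, name in enumerate(features)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Best feature:", best)
```

On these rows, Outlook has the largest gain (about 0.59 bits), which is why it becomes the root of the tree.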
Suppose you want to predict house prices based on features like size and location.
- At each split, the algorithm chooses the feature and threshold that minimizes the variance (mean squared error) in the target variable (house price).
- Leaf nodes contain the average house price of the subset.
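To make the regression case concrete, here is a minimal sketch of one split on a single numeric feature. The house sizes and prices are hypothetical numbers invented for the example, and `best_split` is an illustrative helper, not a scikit-learn function.

```python
def mse(values):
    """Mean squared error of the values around their own mean (their variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_mse) of the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        # Candidate threshold: midpoint between two neighbouring feature values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: size in square metres, price in thousands.
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 105, 300, 310, 305]

threshold, score = best_split(sizes, prices)
left_mean = sum(p for s, p in zip(sizes, prices) if s <= threshold) / 3
right_mean = sum(p for s, p in zip(sizes, prices) if s > threshold) / 3
print(threshold, left_mean, right_mean)
```

The chosen threshold falls between the small and large houses, and each leaf predicts the mean price of its subset.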
Advantages:
- Easy to understand and interpret: Can be visualized graphically.
- No need for feature scaling: Splits compare values against thresholds, so feature ranges do not matter. Trees handle both numerical and categorical data, although scikit-learn expects categorical features to be encoded as numbers.
- Handles non-linear relationships: No need for linearity in the data.
Disadvantages:
- Prone to overfitting: Especially with deep trees and small datasets.
- Unstable: Small variations in the data can result in a different tree.
- Biased towards features with more levels: Impurity-based splitting can favor features with many distinct categories.
Ways to improve generalization:
- Prune the tree: Limit the maximum depth or minimum samples per leaf.
- Use ensembles: Techniques like Random Forest and Gradient Boosting combine multiple trees for better generalization.
- Cross-validation: Use it to select tree parameters such as the maximum depth.
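As a rough illustration of combining depth limits with cross-validation, the sketch below searches a small parameter grid with `GridSearchCV`. It uses the Iris data as a stand-in, and the grid values are arbitrary choices for the example rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate limits on tree size; the specific values are assumptions.
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

The same pattern should carry over to the advertisement data by swapping in its feature matrix and target.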
A minimal scikit-learn example:
```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Load sample data once and reuse it for the feature and class names
iris = load_iris()
X, y = iris.data, iris.target

# Initialize and fit the classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X, y)

# Visualize the tree (class_names is passed as a plain list)
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=list(iris.target_names))
plt.show()
```
Dataset:
- File: Social_Network_Ads.csv
- Columns:
  - User ID (removed in preprocessing)
  - Gender (categorical)
  - Age (numerical)
  - EstimatedSalary (numerical)
  - Purchased (target: 0 = No, 1 = Yes)

The dataset consists of 400 entries, each representing a unique user and their response to an online advertisement.
The notebook starts by importing all the necessary libraries, including:
- Data manipulation: pandas, numpy
- Visualization: matplotlib, seaborn
- Preprocessing and modeling: scikit-learn
Data loading:
- Loads Social_Network_Ads.csv into a pandas DataFrame.
- Displays the first few rows to understand the structure.
Preprocessing:
- Drop Irrelevant Columns: Removes User ID, as it does not contribute to prediction.
- Categorical Encoding: Transforms Gender into a numeric format using label encoding.
- Feature Scaling (Optional): Scales Age and EstimatedSalary. Decision trees do not need scaled inputs, so this step matters mainly when comparing against scale-sensitive models.
- Null/Outlier Check: Recommended for real-world data.
Exploratory data analysis:
- Uses describe() to summarize numerical features (mean, std, min/max, quartiles).
- Visualizes distributions (e.g., histograms, boxplots) and examines relationships between features.
Feature selection and splitting:
- Feature Selection: Chooses relevant columns as features (Gender, Age, EstimatedSalary).
- Train-Test Split: Splits the data into training and testing sets (commonly 75% train, 25% test) using train_test_split.
Model training:
- Initializes the Decision Tree Classifier (DecisionTreeClassifier from scikit-learn).
- Fits the model to the training data.
Evaluation:
- Predicts outcomes on the test set.
- Calculates metrics:
  - Accuracy Score
  - Confusion Matrix
  - Classification Report (precision, recall, f1-score)
  - F1 Score
- Optionally, plots Precision-Recall curves and other metrics.
Visualization:
- Visualizes the decision boundaries of the trained classifier.
- Uses scatter plots to show correctly and incorrectly classified points.
- Optionally, plots the tree structure for interpretability.
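The workflow above can be condensed into one short script. This is a sketch, not the notebook's actual code. To stay runnable without the CSV, it builds a synthetic stand-in DataFrame with the same column names. The labelling rule, the random seeds, and the name `df_net` for the DataFrame are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in with the same columns as Social_Network_Ads.csv.
# With the real file, use instead: df_net = pd.read_csv("Social_Network_Ads.csv")
rng = np.random.default_rng(0)
n = 400
df_net = pd.DataFrame({
    "User ID": np.arange(n),
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 61, size=n),
    "EstimatedSalary": rng.integers(15_000, 150_001, size=n),
})
# Invented labelling rule, used only so the sketch has something to learn.
df_net["Purchased"] = ((df_net["Age"] > 45) | (df_net["EstimatedSalary"] > 90_000)).astype(int)

# Preprocessing: drop the identifier and label-encode Gender.
df_net = df_net.drop(columns=["User ID"])
df_net["Gender"] = LabelEncoder().fit_transform(df_net["Gender"])

# Feature selection and a 75/25 train-test split.
X = df_net[["Gender", "Age", "EstimatedSalary"]]
y = df_net["Purchased"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Training and prediction.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluation.
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print(cm)
print(classification_report(y_test, y_pred))
```

Scores on the synthetic data say nothing about the real dataset; only the sequence of steps is meant to carry over.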
Project structure:
```
notebooks/
  └── Decision_Tree.ipynb
data/
  └── Social_Network_Ads.csv
README.md
```
How to Run the Notebook:
- Clone the Repository or download the notebook file.
  ```bash
  git clone https://github.com/vyasdeepti/Machine-Learning.git
  cd Machine-Learning
  ```
- Install Required Libraries (if not already installed):
  ```bash
  pip install pandas numpy matplotlib seaborn scikit-learn
  ```
- Place the CSV File: Ensure Social_Network_Ads.csv is in the same directory as the notebook.
- Start Jupyter Notebook or Colab:
  - For Jupyter: Run jupyter notebook, then open Decision_Tree.ipynb.
  - For Google Colab: Use the Colab badge at the top of the notebook or upload the notebook directly.
- Run All Cells: Follow the notebook top-down, executing each cell in order.
Here's an explanation of the key concepts in the context of the code in Decision_Tree.ipynb:
What it is: Label encoding is the process of converting categorical variables (like "Gender") into numeric codes, so they can be used in machine learning models.
How it's done in the code:
This code uses LabelEncoder from scikit-learn, likely doing something like:
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Replace the text categories in Gender with integer codes
df_net['Gender'] = le.fit_transform(df_net['Gender'])
```
This converts the "Gender" column from "Female"/"Male" to 0/1 (LabelEncoder assigns codes in sorted order of the class names), making it usable for the decision tree model.
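To see which code each category receives, a small self-contained check can help. The four-row DataFrame `df_demo` below is hypothetical and only stands in for the real Gender column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values standing in for the real Gender column.
df_demo = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

le = LabelEncoder()
df_demo["Gender_code"] = le.fit_transform(df_demo["Gender"])

# classes_ holds the original labels; each label's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps codes back to the original labels.
print(le.inverse_transform([0, 1]))
```

Inspecting `le.classes_` this way avoids guessing whether 1 means "Male" or "Female" when reading the tree later.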
What it is: A correlation matrix shows the relationship (correlation coefficient) between pairs of features. Values range from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.
How it's used:
This code probably uses pandas and seaborn to compute and visualize the correlations:
```python
import seaborn as sns
# Keep numeric columns only so corr() also works if Gender is still text
corr = df_net.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True)
```
This helps you see which features are strongly related to each other or to the target "Purchased".
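For readers who want to see what a single cell of the matrix contains, the sketch below computes one Pearson coefficient by hand and compares it with pandas. The five data points are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values, invented for illustration.
df_demo = pd.DataFrame({"Age": [22, 35, 47, 52, 60], "Purchased": [0, 0, 1, 1, 1]})

x = df_demo["Age"].to_numpy(dtype=float)
y = df_demo["Purchased"].to_numpy(dtype=float)

# Pearson r: covariance of x and y divided by the product of their spreads.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same coefficient (Pearson is its default method).
r_pandas = df_demo["Age"].corr(df_demo["Purchased"])
print(round(r_manual, 4), round(r_pandas, 4))
```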

What it is: Feature scaling standardizes numeric features so that they have similar ranges, which helps scale-sensitive models (for example, k-nearest neighbors or support vector machines) perform better. Decision trees themselves are not affected by feature scale.
How it's done in the code:
We have used StandardScaler:
```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Standardize every feature column to zero mean and unit variance
X = sc.fit_transform(X)
```
This transforms features like "Age" and "EstimatedSalary" to have mean 0 and standard deviation 1.
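A common variation, which may differ from what the notebook does, is to fit the scaler on the training split only and reuse those statistics for the test split, so no test information leaks into preprocessing. The sketch below uses randomly generated stand-in values for Age and EstimatedSalary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (Age, EstimatedSalary) and labels; values are assumptions.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 61, size=200),
    rng.integers(15_000, 150_001, size=200),
]).astype(float)
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = sc.transform(X_test)        # apply the same statistics to the test data

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1. The test columns will be close to that but not exact, which is expected.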
What it is: A confusion matrix is a table that visualizes the performance of a classification algorithm, showing counts of true positives, false positives, true negatives, and false negatives.
How it's used in the code:
We have used scikit-learn's confusion_matrix:
```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```
The matrix lets us see how many correct and incorrect predictions the model made for each class (Purchased = 1 or 0). For these binary labels, the layout is [[TN, FP], [FN, TP]].
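The other metrics the notebook reports can be derived from the four cells of the matrix. The sketch below does this for a small set of hypothetical labels, so the numbers are illustrative only.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels, invented for illustration (1 = purchased, 0 = not purchased).
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted purchases, how many were real
recall = tp / (tp + fn)     # of the real purchases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

These hand-computed values should agree with `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` from scikit-learn.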
Summary Table:
| Concept | Purpose | Code Example | 
|---|---|---|
| Label Encoding | Convert categories to numbers | df_net['Gender'] = le.fit_transform(df_net['Gender']) | 
| Correlation Matrix | Show relationships between features | corr = df_net.corr(); sns.heatmap(corr, annot=True) | 
| Feature Scaling | Standardize feature ranges | X = sc.fit_transform(X) | 
| Confusion Matrix | Evaluate classifier predictions | cm = confusion_matrix(y_test, y_pred) | 
Upon completion, you will have:
- A well-trained Decision Tree model for the classification problem.
- Performance metrics showing how well the model predicts user purchases.
- Visualizations that make the model's logic and performance transparent.
- Insights into which features are most important for the prediction.
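One way to obtain the feature-importance insights mentioned above is the fitted model's `feature_importances_` attribute, with `export_text` for a printable view of the learned rules. The sketch uses the Iris data as a stand-in for the advertisement dataset, so the feature names differ from the notebook's.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

# Impurity-based importances: one value per feature, summing to 1.
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

# Text rendering of the learned decision rules.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough guide.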
Requirements:
- Python 3.x
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
References:
- scikit-learn Documentation: Decision Trees
- Pandas Documentation
- Matplotlib Documentation
- Seaborn Documentation
Q: What Python version is required?
A: Python 3.7 or higher.
Q: Can I use my own dataset?
A: Yes! Replace Social_Network_Ads.csv with your data, and update the column names used for the features and target in the preprocessing cells to match.



