This project predicts flight ticket prices by combining exploratory data analysis (EDA) with several regression algorithms: Linear Regression, Decision Trees, Random Forest, and Gradient Boosting. The dataset is sourced from Kaggle, and the objective is to build a model that accurately forecasts ticket prices from multiple features.
- Perform extensive exploratory data analysis (EDA) to understand the data distribution and feature relationships.
- Preprocess the data to handle missing values, encode categorical variables, and scale numerical features.
- Implement and compare different machine learning algorithms to identify the best-performing model.
- Fine-tune the chosen model to achieve optimal performance.
- Evaluate the model's performance using appropriate metrics.
- Data Wrangling and Preprocessing: Cleaning, transforming, and preparing the data for analysis and modeling.
- Exploratory Data Analysis (EDA): Visualizing and interpreting data to uncover insights and relationships.
- Feature Engineering: Creating new features to enhance model performance.
- Machine Learning Algorithms: Implementing and comparing multiple algorithms, including Linear Regression, Decision Trees, Random Forest, and Gradient Boosting.
- Model Evaluation and Tuning: Using metrics like RMSE, MAE, and R² to evaluate models and applying hyperparameter tuning for optimization.
- Data Visualization: Utilizing libraries such as Matplotlib, Seaborn, and Plotly for insightful visualizations.
The dataset was imported and loaded into a Pandas DataFrame for initial examination and preprocessing.
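A minimal sketch of this loading step is shown here. The file name and schema of the Kaggle dataset are not stated in this README, so the inline CSV text and every column name (`airline`, `journey_date`, `duration_min`, and so on) are hypothetical placeholders.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the Kaggle CSV. The real file name and
# column names are not given in this README.
CSV_TEXT = """airline,source,destination,journey_date,duration_min,stops,price
AirA,Delhi,Mumbai,2019-03-24,130,0,3897
AirB,Kolkata,Bangalore,2019-05-01,445,2,7662
AirA,Delhi,Cochin,2019-06-09,,1,13882
AirC,Mumbai,Hyderabad,2019-05-12,85,0,
"""


def load_flights(source) -> pd.DataFrame:
    """Read the CSV (path or file-like object) and parse the date column."""
    return pd.read_csv(source, parse_dates=["journey_date"])


df = load_flights(io.StringIO(CSV_TEXT))

# First look: column types and the number of missing values per column.
missing_per_column = df.isna().sum()
print(df.dtypes)
print(missing_per_column)
```

With a real file, `load_flights` would be called with the path to the downloaded CSV instead of the in-memory text.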
- Univariate Analysis: Analyzed the distribution of individual features.
- Bivariate Analysis: Explored relationships between pairs of features and the target variable.
- Multivariate Analysis: Investigated complex interactions between multiple features.
- Visualization: Used Matplotlib, Seaborn, and Plotly to create plots such as histograms, box plots, scatter plots, and heatmaps.
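The EDA steps above can be sketched with plain pandas summaries, which are also the tables that the plots visualize. The toy data and column names are assumptions for illustration only, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data. Column names are assumptions, not the real schema.
df = pd.DataFrame({
    "airline": ["AirA", "AirA", "AirB", "AirB", "AirC", "AirC"],
    "stops": [0, 1, 0, 2, 1, 2],
    "duration_min": [120, 300, 140, 480, 310, 500],
    "price": [3500, 6200, 4100, 9800, 6000, 10500],
})

# Univariate: summary statistics of the target (what a histogram would show).
price_summary = df["price"].describe()

# Bivariate: median price per airline (what a box plot would compare).
median_by_airline = df.groupby("airline")["price"].median()

# Multivariate: mean price for each airline/stops combination.
price_pivot = df.pivot_table(
    index="airline", columns="stops", values="price", aggfunc="mean"
)

# Correlation matrix of the numeric columns (the usual input to a heatmap).
corr = df[["stops", "duration_min", "price"]].corr()

print(price_summary, median_by_airline, price_pivot, corr, sep="\n\n")
```

The `corr` frame can be passed to a heatmap function such as `seaborn.heatmap`, and the grouped series to a bar or box plot.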
- Handling Missing Values: Imputed missing values using appropriate techniques.
- Encoding Categorical Variables: Applied One-Hot Encoding to convert categorical features into numerical format.
- Feature Scaling: Standardized numerical features (zero mean, unit variance) using StandardScaler.
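One way to combine the three preprocessing steps above is a scikit-learn `ColumnTransformer`. This is a sketch under assumptions: the README does not say which imputation strategy was used, so median (numeric) and most-frequent (categorical) imputation are illustrative choices, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in every column.
df = pd.DataFrame({
    "airline": ["AirA", "AirB", np.nan, "AirA"],
    "stops": [0, 2, 1, np.nan],
    "duration_min": [120.0, 480.0, np.nan, 300.0],
})

numeric_cols = ["stops", "duration_min"]
categorical_cols = ["airline"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

Fitting the preprocessor on the training split only, and then calling `transform` on the test split, keeps test-set statistics from leaking into the imputation and scaling parameters.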
Created new features based on domain knowledge to improve model performance. For instance, extracted day, month, and year from the date features.
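The date extraction described above might look like the following. The column name `journey_date` and the date format are assumptions, since the actual schema is not given here.

```python
import pandas as pd

# Hypothetical date column. The real name and format may differ.
df = pd.DataFrame({"journey_date": ["2019-03-24", "2019-05-01", "2019-06-09"]})
df["journey_date"] = pd.to_datetime(df["journey_date"], format="%Y-%m-%d")

# Split the timestamp into numeric parts that a regressor can use directly.
df["journey_day"] = df["journey_date"].dt.day
df["journey_month"] = df["journey_date"].dt.month
df["journey_year"] = df["journey_date"].dt.year

# Drop the raw datetime column once its parts have been extracted.
features = df.drop(columns=["journey_date"])
print(features)
```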
- Model Selection: Implemented multiple machine learning algorithms, including Linear Regression, Decision Trees, Random Forest, and Gradient Boosting.
- Model Evaluation: Evaluated models using metrics like RMSE, MAE, and R².
- Model Tuning: Applied hyperparameter tuning techniques such as Grid Search and Random Search to optimize model performance.
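The selection, evaluation, and tuning loop above can be sketched as follows. The data is synthetic, and the hyperparameter grid is an illustrative assumption rather than the grid used in this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed flight features and prices.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + X[:, 2] * X[:, 3] + rng.normal(0, 0.1, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}


def evaluate(model):
    """Fit on the training split and score on the held-out test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }


results = {name: evaluate(model) for name, model in models.items()}
for name, scores in results.items():
    print(name, scores)

# Grid search over a small, illustrative grid for the Random Forest.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```

For Random Search, `RandomizedSearchCV` from the same module takes parameter distributions plus an `n_iter` budget in place of the exhaustive grid.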
The final model was saved and prepared for deployment to predict flight ticket prices on new, unseen data.
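The README does not say how the model was serialized. A common option for scikit-learn estimators is `joblib`, sketched here with a hypothetical file name and a small synthetic model.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the tuned flight price model.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
y = X.sum(axis=1)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "flight_price_model.joblib"  # hypothetical file name
    joblib.dump(model, path)      # save the fitted estimator to disk
    restored = joblib.load(path)  # load it back, as a deployment would

# The restored model should reproduce the original predictions on new rows.
X_new = rng.uniform(size=(5, 3))
same_predictions = np.allclose(model.predict(X_new), restored.predict(X_new))
print(same_predictions)
```

Saving the preprocessing steps and the estimator together in one `Pipeline` object keeps the transformations applied at prediction time consistent with those used in training.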
Summarized the findings and insights gained from the analysis and modeling process. Highlighted the best-performing model and its practical implications.
- Best Model: The Random Forest Regressor outperformed the other models, with the lowest RMSE and the highest R² score.
- Performance Metrics: Achieved an RMSE of 0.11, MAE of 0.069, and R² of 0.94 on the test set.
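For reference, the three metrics reported above can be computed by hand as in this sketch. RMSE and MAE carry the units of the target, while R² is unitless. The numbers below are a made-up worked example, not results from this project.

```python
import numpy as np

# Made-up targets and predictions for a worked example.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred  # [1, 0, -1, -2]

# Mean absolute error: average size of the errors.
mae = np.mean(np.abs(errors))  # (1 + 0 + 1 + 2) / 4 = 1.0

# Root mean squared error: penalizes large errors more heavily.
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(6 / 4) ≈ 1.2247

# R²: share of the target's variance explained by the predictions.
ss_res = np.sum(errors ** 2)                    # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20
r2 = 1 - ss_res / ss_tot                        # 0.7

print(mae, rmse, r2)
```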
- Programming Language: Python
- Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn, Plotly
- Jupyter Notebook: For interactive analysis and visualization
For any questions or collaboration opportunities, feel free to reach out via LinkedIn or Email.