In 1912, the Titanic set off on its first voyage across the Atlantic Ocean, carrying passengers ranging from the wealthy elite to emigrants seeking a new life. Tragically, the ship collided with an iceberg and sank, resulting in the loss of over 1,500 lives. This disaster not only shook the world but also sparked discussions about maritime safety and the social dynamics of the time.
This repository explores the factors affecting passenger survival on the Titanic and aims to build a predictive model to estimate survival probabilities based on available passenger characteristics. The available dataset contains detailed records of the passengers aboard, including information such as age, gender, passenger class, fare paid, and survival outcome. However, some key data points are missing, particularly in features like age and cabin, which poses challenges for building accurate predictive models.
In this project, two different approaches are explored and compared based on model performance:
1. Removing Missing Data: This method involves deleting rows with missing values to clean the dataset. While it ensures that the remaining data is complete, it reduces the number of observations available for analysis.
2. Filling Missing Data: This approach fills in missing values in an effort to retain more data and potentially enhance the model's performance.
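As a rough illustration of how the two approaches differ, the sketch below applies both to a tiny toy table. The values are made up, only the column names mirror the Titanic data, and the median/mode fill strategy is an assumption for illustration rather than the strategy used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 8.05, 8.46, 51.86],
    "Embarked": ["S", "C", None, "Q", "S"],
})

# Approach 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Approach 2: fill the gaps instead. Here: median for the numeric column and
# most frequent value for the categorical one (one common choice among many).
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print("original rows:", len(toy))
print("rows after dropping:", len(dropped))
print("rows after filling:", len(filled))
```

Dropping keeps only the fully observed rows, while filling keeps every row at the cost of introducing estimated values.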
Overall, more robust models (Random Forest, XGBoost) were achieved using the second approach, which involved filling in missing values. A version of the developed model was also submitted to Kaggle’s Titanic-Machine Learning from Disaster competition, where it ranked in the top 9.38% (1316 out of 14036).
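The quoted percentile follows directly from the rank and the number of entries. A quick check, using only the two figures stated above:

```python
rank = 1316            # leaderboard position quoted above
total_entries = 14036  # number of entries quoted above

# Share of the leaderboard at or above this position, as a percentage.
percentile = rank / total_entries * 100
print(f"Top {percentile:.2f}%")
```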
Given that the true survival status of Titanic passengers is publicly available, some higher-ranked entries likely used manually crafted labels to achieve near-perfect accuracy. Therefore, the actual position of the provided model could be higher if all competitors strictly followed the competition rules. The Kaggle notebook is also available here.
It's important to mention that the score shown in the above image (0.78947) was achieved through a slightly modified ensemble model and different parameter tuning compared to the provided notebook (0.78468). These exact details are not shared here to encourage independent experimentation and to prevent you from overfitting. 😜
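To put the gap between the two scores in perspective, the sketch below converts each score into an approximate count of correct predictions. It assumes the score is plain accuracy computed over all 418 rows of test.csv; that assumption is ours and is not stated by this README. The helper name `correct_predictions` is hypothetical.

```python
TEST_ROWS = 418  # number of records in test.csv, as described in this README


def correct_predictions(score: float, n_rows: int = TEST_ROWS) -> int:
    """Approximate number of correct predictions behind an accuracy score.

    Assumes the score is accuracy over all n_rows test rows.
    """
    return round(score * n_rows)


notebook_correct = correct_predictions(0.78468)
submitted_correct = correct_predictions(0.78947)
print(notebook_correct, submitted_correct, submitted_correct - notebook_correct)
```

Under that assumption, the two scores differ by only a handful of passengers, which is why small tuning changes can shift the leaderboard position noticeably.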
The Titanic dataset used in this project is divided into two main files: `train.csv` and `test.csv`. Below is a brief description of each file:
- `train.csv`: This is the primary training dataset containing labeled data used to train the model. It includes 891 records and 12 columns, with the `Survived` column indicating whether a passenger survived (1) or not (0). This dataset is used to build and validate the machine learning model.
- `test.csv`: This is the test dataset that contains 418 records and 11 columns. It does not have the `Survived` column. The goal is to predict `Survived` using a model trained on the provided training data.
Among the competition's data, you will also find the `gender_submission.csv` file, which is an example submission file (not the true labels) provided by Kaggle. This file shows the expected format of the predictions, containing only the `PassengerId` and `Survived` columns.
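A minimal sketch of writing predictions in that two-column layout, using only the standard library. The function name `write_submission` and the IDs and predictions below are hypothetical and serve only to show the shape of the file.

```python
import csv
import io


def write_submission(passenger_ids, predictions, handle):
    """Write PassengerId/Survived rows in the two-column submission layout."""
    if len(passenger_ids) != len(predictions):
        raise ValueError("passenger_ids and predictions must have equal length")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for pid, pred in zip(passenger_ids, predictions):
        # Cast to int so that 0.0/1.0 model outputs are written as 0/1.
        writer.writerow([int(pid), int(pred)])


# Hypothetical IDs and predictions, for illustration only.
buffer = io.StringIO()
write_submission([1001, 1002, 1003], [0, 1, 0], buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle such as `open("submission.csv", "w", newline="")` instead of the in-memory buffer.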
The following table provides a detailed description of the columns found in `train.csv` and `test.csv`:
| Column Name | Data Type | Description |
|---|---|---|
| `PassengerId` | Integer | Unique identifier for each passenger |
| `Survived` | Integer | Survival status (0 = No, 1 = Yes) |
| `Pclass` | Integer | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) |
| `Name` | String | Name of the passenger |
| `Sex` | String | Gender of the passenger (`male`, `female`) |
| `Age` | Float | Age of the passenger |
| `SibSp` | Integer | Number of siblings/spouses aboard the Titanic |
| `Parch` | Integer | Number of parents/children aboard the Titanic |
| `Ticket` | String | Ticket number |
| `Fare` | Float | Passenger fare |
| `Cabin` | String | Cabin number |
| `Embarked` | String | Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
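The column types in the table can be passed to pandas when loading the data. The sketch below is one possible mapping: the pandas dtype names are our choice, and the two rows are made up in the `train.csv` layout (they are not real passenger records) so that the example runs without the dataset.

```python
import io

import pandas as pd

# Column types taken from the table above; the pandas dtype names are our choice.
SCHEMA = {
    "PassengerId": "int64",
    "Survived": "int64",
    "Pclass": "int64",
    "Name": "string",
    "Sex": "string",
    "Age": "float64",
    "SibSp": "int64",
    "Parch": "int64",
    "Ticket": "string",
    "Fare": "float64",
    "Cabin": "string",
    "Embarked": "string",
}

# Two made-up rows in the train.csv layout (not real passenger records).
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. John",male,30,0,0,A1,7.5,,S\n'
    '2,1,1,"Roe, Mrs. Jane",female,,1,0,B2,80.0,C10,C\n'
)

df = pd.read_csv(sample, dtype=SCHEMA)
print(df.dtypes)
print(df.isna().sum())  # missing values per column, e.g. in Age and Cabin
```

Declaring `Age` as a float matters because missing ages are stored as NaN, which an integer column cannot hold.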
To run the notebook with the data stored in Google Drive:

1. Download the required dataset from:
2. Upload the `train.csv` and `test.csv` files to your own Google Drive in your preferred folder structure.
3. Update the file paths in the notebook to reflect your own Google Drive paths.
4. Run the notebook cells as instructed to reproduce the results.
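If the notebook is run in Google Colab (an assumption; this README only mentions Google Drive), the path update could look roughly like the sketch below. The `titanic` folder name is hypothetical and should be replaced with your own Drive structure.

```python
from pathlib import Path

try:
    # Available only inside Google Colab.
    from google.colab import drive

    drive.mount("/content/drive")
    # Hypothetical folder; replace with your own Drive folder structure.
    DATA_DIR = Path("/content/drive/MyDrive/titanic")
except ImportError:
    # Outside Colab, fall back to the current working directory.
    DATA_DIR = Path.cwd()

TRAIN_PATH = DATA_DIR / "train.csv"
TEST_PATH = DATA_DIR / "test.csv"
print(TRAIN_PATH)
print(TEST_PATH)
```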
To run the notebook from a local clone of the repository:

1. Download the required dataset from:
2. Clone the repository: `git clone https://github.com/Dalageo/ML-TitanicShipwreck.git`
3. Navigate to the cloned directory: `cd ML-TitanicShipwreck`
4. Open the `Exploring the World's Most Renowned Shipwreck.ipynb` notebook using your preferred Jupyter-compatible environment (e.g., Jupyter Notebook, VS Code, or PyCharm).
5. Update file paths for `train.csv` and `test.csv` as needed.
6. Run the cells sequentially to reproduce the results.
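For a local run, one way to make the path update less error-prone is to resolve both files from a single folder and fail early if either is missing. The helper name `resolve_data_paths` is hypothetical and is not part of the notebook.

```python
from pathlib import Path


def resolve_data_paths(data_dir):
    """Return (train_path, test_path) inside data_dir, or raise if one is missing."""
    base = Path(data_dir)
    paths = (base / "train.csv", base / "test.csv")
    missing = [p.name for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {base}: {', '.join(missing)}")
    return paths
```

For example, `train_path, test_path = resolve_data_paths("data")` would work if both CSV files were placed in a `data` folder (a hypothetical location) next to the notebook.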
The dataset used in this project is provided by Kaggle as part of the Titanic-Machine Learning from Disaster competition. Special thanks to Kaggle's data science community and Will Cukierski for making this dataset available for educational and research purposes.
This work is licensed under the Apache License 2.0. It was chosen to comply with the competition rules, which require the use of an Open Source Initiative (OSI) approved license that permits commercial use while promoting open collaboration.