Prediction of life expectancy

Introduction

“Life expectancy declining in many English communities even before pandemic” (Head, 2021). This article, published on Imperial College London’s blog, caught our attention while we were searching for a dataset. After further research, we found that declining life expectancy is an alarming issue in developed countries such as the UK and the US (Head, 2021). As a team, we therefore decided to apply the skills we had learned during our course to identify and understand the factors that affect human life expectancy.

Based on the above problem statement, we designed the following research question.

Research question:

“Can we predict life expectancy at birth from World Development Indicators such as unemployment rate, infant mortality rate, GDP, GNI, and access to clean fuels and cooking technologies?”

Datasets:

For this research, we downloaded four raw datasets from the World Bank and the World Health Organization. These datasets are as follows:

| Dataset | Description |
| --- | --- |
| World Development Indicators | Contains data on 1444 development indicators for 2666 countries and country groups for the years 1960 to 2020. Downloaded from the World Bank’s data hub. |
| Health workforce | Contains health workforce information such as medical doctors (per 10,000 population), number of medical doctors, number of generalist medical practitioners, etc. |
| Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70 (%) | Contains information on mortality caused by non-communicable diseases such as cardiovascular disease (CVD), cancer and diabetes. We used two files for this dataset, one for males and one for females. Downloaded from the World Bank’s DataBank. |
| Suicide mortality rate (per 100,000 population) | Contains the suicide mortality rate per 100,000 population. We used two files for this dataset, one for males and one for females. Downloaded from the World Bank’s DataBank. |

The complete, cleaned Life Expectancy dataset is published on Kaggle.

Procedure

  • Data acquisition: We collected the public datasets listed above.
  • Data preparation and cleaning: We cleaned and merged the datasets and engineered features with Principal Component Analysis (PCA), using pandas, matplotlib, missingno and scikit-learn.
  • Exploratory data analysis: For univariate analysis, we plotted each numerical variable as a histogram and a boxplot to understand its distribution and outliers. For multivariate analysis, we plotted life expectancy against the other numerical variables in scatter plots to examine the relationships between them. We also applied dimensionality reduction using an unsupervised learning technique.
  • Machine learning prediction: We implemented Support Vector Machine (SVM), Random Forest, Decision Tree and K-Nearest Neighbour (KNN) models. However, only the Random Forest model is published in this repository.
  • Deep learning prediction: For the neural network, we experimented with hyperparameters such as the number of hidden layers, the activation function, the optimizer and the number of epochs.
  • Hyperparameter tuning: We tuned the machine learning models with RandomizedSearchCV and the neural network with Keras Tuner.
  • Evaluation of the impact of features on the model: For the deep learning model, we plotted SHAP values to understand the impact of each feature. For the machine learning model, we plotted the feature_importances_ attribute of scikit-learn’s RandomForestRegressor.
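The data preparation step can be illustrated with a minimal sketch. It assumes the World Development Indicators file has one row per country and indicator with one column per year; the column names, indicator names, the `reshape_wdi` helper and all values below are hypothetical stand-ins, not the project's actual code or data.

```python
import pandas as pd


def reshape_wdi(wdi: pd.DataFrame) -> pd.DataFrame:
    """Turn one-row-per-indicator data into one row per country and year."""
    # Year columns become rows: one (country, indicator, year, value) per line.
    long = wdi.melt(
        id_vars=["Country Code", "Indicator Name"],
        var_name="Year",
        value_name="Value",
    )
    long["Year"] = long["Year"].astype(int)
    # Each indicator becomes its own column.
    wide = long.pivot_table(
        index=["Country Code", "Year"], columns="Indicator Name", values="Value"
    ).reset_index()
    wide.columns.name = None
    return wide


# Made-up values, for illustration only.
wdi = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Indicator Name": ["Life expectancy", "GDP", "Life expectancy", "GDP"],
    "2019": [70.0, 1.0, 80.0, 2.0],
    "2020": [70.5, 1.1, 80.2, 2.1],
})
suicide = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Year": [2020, 2020],
    "Suicide rate": [10.0, 5.0],
})

wide = reshape_wdi(wdi)
# A left join keeps every country-year row even when the second table has no match.
merged = wide.merge(suicide, on=["Country Code", "Year"], how="left")
print(merged)
```

A left join is used here so that gaps in the secondary dataset show up as missing values to be handled during cleaning, rather than silently dropping rows.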

The complete report is available on GitHub.

Screenshots:

Exploratory Data Analysis (EDA)

Male vs Female life expectancy

Finding from preliminary EDA: females have a higher life expectancy than males.


Correlation Plot

Correlation plot showing multicollinearity among the variables.
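As a rough illustration of how multicollinearity can be read off a correlation matrix, the sketch below lists variable pairs whose absolute correlation exceeds a threshold. The data is synthetic, and the column names, the `highly_correlated_pairs` helper and the 0.9 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return variable pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic data: "gni" is built to track "gdp" closely, "unemployment" is independent.
rng = np.random.default_rng(0)
gdp = rng.normal(size=200)
data = pd.DataFrame({
    "gdp": gdp,
    "gni": gdp + rng.normal(scale=0.05, size=200),
    "unemployment": rng.normal(size=200),
})

result = highly_correlated_pairs(data)
print(result)
```

Pairs flagged this way are candidates for removal or for combining through a technique such as PCA.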


Biplot between Component and variable

Biplot showing the relationship between the principal components and the variables.


Visualization of variance and component in scree plot

Scree plot showing the variance explained by each principal component (PCA1 and PCA2).
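The numbers behind a scree plot are the shares of variance explained by each principal component. The following sketch computes them with scikit-learn on synthetic data; the three variables are illustrative assumptions, and standardizing before PCA is a common choice rather than a confirmed detail of this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: two strongly related variables and one independent variable.
rng = np.random.default_rng(0)
gdp = rng.normal(size=300)
gni = gdp + rng.normal(scale=0.1, size=300)
unemployment = rng.normal(size=300)
X = np.column_stack([gdp, gni, unemployment])

# Standardize so that no variable dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)
for i, (share, total) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {share:.1%} of variance, {total:.1%} cumulative")
```

Plotting `explained` against the component index gives the scree plot; the point where the curve flattens suggests how many components to keep.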


Data preparation and cleaning

Missing and invalid data

Visualization of missing and invalid data before cleaning.


Missing and invalid data

Visualization of missing and invalid data before cleaning using missingno library.


Missing and invalid data

Visualization of missing and invalid data after cleaning.


Missing and invalid data

Visualization of missing and invalid data after cleaning using missingno library.
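The before and after views can also be summarized numerically. The sketch below shows one possible cleaning rule, assumed for illustration only: drop columns with too many missing values, then fill the remaining gaps with the column median. The `clean_missing` helper, the 50% threshold and the toy data are hypothetical; the same data frames could be passed to `missingno.matrix` to draw the pictures.

```python
import numpy as np
import pandas as pd


def clean_missing(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns that are mostly missing, then median-fill what remains."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    kept = df.loc[:, missing_share <= max_missing]
    return kept.fillna(kept.median(numeric_only=True))


# Toy data: "b" is 75% missing, "a" has a single gap, "c" is complete.
raw = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

print(raw.isna().mean())  # missing share before cleaning
cleaned = clean_missing(raw)
print(cleaned.isna().mean())  # missing share after cleaning
```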


Performance of Machine Learning Algorithm on our dataset

Machine learning performance

Performance of the Random Forest model on the life expectancy dataset after hyperparameter tuning.
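For readers unfamiliar with the tuning step, here is a minimal sketch of RandomizedSearchCV applied to a RandomForestRegressor. It runs on synthetic regression data, and the search space, number of iterations and scoring metric are assumptions for the example, not the settings used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the life expectancy data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical search space; each iteration samples one combination from it.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled combinations
    cv=3,  # 3-fold cross-validation on the training set
    scoring="r2",
    random_state=0,
)
search.fit(X_train, y_train)

# The best model is refit on the full training set and scored on held-out data.
test_r2 = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Test R^2: {test_r2:.3f}")
```

Random search samples a fixed number of combinations rather than trying all of them, which keeps the cost predictable as the search space grows.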


Feature importance on model prediction

Visualization of feature importance on prediction. #Explainable AI
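A feature importance chart of this kind can be derived from the `feature_importances_` attribute of a fitted RandomForestRegressor. The sketch below uses synthetic data in which one variable is built to dominate the target; the column names and coefficients are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on "infant_mortality",
# weakly on "gdp", and not at all on "unemployment".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "infant_mortality": rng.normal(size=300),
    "gdp": rng.normal(size=300),
    "unemployment": rng.normal(size=300),
})
y = -3.0 * X["infant_mortality"] + 0.5 * X["gdp"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would turn the series into a bar chart, provided matplotlib is installed.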


Performance of Deep Learning Algorithm (Neural Network) on our dataset

Neural Network's performance

Performance of the neural network on the life expectancy dataset after hyperparameter tuning.


Visualization of SHAP values

Visualization of SHAP values for the model's predictions. #Explainable AI


References

Head, E., 2021. Life expectancy declining in many English communities even before pandemic | Imperial News | Imperial College London. [online] Imperial News. Available at: https://www.imperial.ac.uk/news/231119/life-expectancy-declining-many-english-communities/ [Accessed 23 March 2022].