In this project, we use a dataset containing bone marrow transplantation characteristics for pediatric patients from UCI’s Machine Learning Repository.
We will use the dataset to build a pipeline containing all preprocessing and data-cleaning steps, and then select the best classifier to predict patient survival.
pipeline score: 0.4879020616816332
sklearn metric: 0.4879020616816332
GridSearchCV best score: -5.409647741106873
GridSearchCV best params: {'regr__fit_intercept': True}
The best_regression_model is: Ridge(alpha=1)
The hyperparameters_of_regression_model are: {'alpha': 1, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001}
The hyperparameters_of_imputer are: {'add_indicator': False, 'copy': True, 'fill_value': None, 'missing_values': nan, 'strategy': 'most_frequent', 'verbose': 0}
"check both arrays are equal": True
-donor_age - Age of the donor at the time of hematopoietic stem cells apheresis,
-donor_age_below_35 - Is donor age less than 35 (yes, no),
-donor_ABO - ABO blood group of the donor of hematopoietic stem cells (0, A, B, AB),
-donor_CMV - Presence of cytomegalovirus infection in the donor of hematopoietic stem cells prior to transplantation (present, absent),
-recipient_age - Age of the recipient of hematopoietic stem cells at the time of transplantation,
-recipient_age_below_10 - Is recipient age below 10 (yes, no),
-recipient_age_int - Age of the recipient discretized to intervals ((0, 5], (5, 10], (10, 20]),
-recipient_gender - Gender of the recipient (female, male),
-recipient_body_mass - Body mass of the recipient of hematopoietic stem cells at the time of the transplantation,
…
-survival_status - Survival status (0 - alive, 1 - dead),
Pipeline Accuracy Test Set: 0.7894736842105263
The best classification model is: LogisticRegression()
The hyperparameters of the best classification model are: {'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
The number of components selected in the PCA step are: 37
Best Model Accuracy Test Set: 0.8157894736842105
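The workflow described for the bone marrow dataset (preprocessing, PCA, then a search over candidate classifiers) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic with random labels, only a handful of the listed columns are used, and the step names (`preprocess`, `pca`, `clf`), the candidate classifiers, and the PCA grid are all assumptions made for the example. Because the labels are random, the printed accuracy carries no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): a few columns named after the
# feature list above, filled with random values and random labels.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "donor_age": rng.uniform(18, 55, n),
    "recipient_age": rng.uniform(1, 20, n),
    "recipient_body_mass": rng.uniform(6, 100, n),
    "donor_ABO": rng.choice(["0", "A", "B", "AB"], n),
    "recipient_gender": rng.choice(["female", "male"], n),
})
# Blank out a few entries so the imputers have something to fill in.
X.loc[rng.choice(n, 10, replace=False), "recipient_body_mass"] = np.nan
X.loc[rng.choice(n, 10, replace=False), "donor_ABO"] = np.nan
y = rng.integers(0, 2, n)

num_cols = ["donor_age", "recipient_age", "recipient_body_mass"]
cat_cols = ["donor_ABO", "recipient_gender"]

# Numeric columns: mean imputation, then scaling.
num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# sparse_threshold=0 forces a dense array, which PCA expects.
preprocess = ColumnTransformer(
    [("num", num_steps, num_cols), ("cat", cat_steps, cat_cols)],
    sparse_threshold=0,
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The grid swaps the whole "clf" step and varies the PCA size.
param_grid = {
    "clf": [LogisticRegression(max_iter=1000),
            RandomForestClassifier(random_state=0)],
    "pca__n_components": [2, 4],
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3)
search.fit(X_train, y_train)

best = search.best_estimator_
test_accuracy = best.score(X_test, y_test)
print("Best classifier:", best.named_steps["clf"])
print("PCA components:", best.named_steps["pca"].n_components_)
print("Test accuracy:", test_accuracy)
```

Treating the classifier itself as a grid parameter keeps preprocessing inside cross-validation, so the imputers, scaler, and PCA are refit on each training fold and never see the held-out fold.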
MAE Baseline Score: 8.232 | MAE Score with Ratio Features: 7.948
In the figure on the right, the feature fuel_type has a low MI score, but it separates two price populations with different trends within the horsepower feature. This suggests that fuel_type, despite its low MI score, contributes to an interaction effect.
We investigated trend lines from one category to the next to identify any interaction effect.
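One simple way to compare trend lines across categories is to fit a separate least-squares line per category and compare the slopes. The sketch below does this in plain Python. The rows are made-up, noise-free numbers chosen for illustration only (they are not taken from the project's data), and the helper names `fit_line` and `trend_by_category` are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def trend_by_category(rows, cat_key, x_key, y_key):
    """Fit one trend line per category value."""
    groups = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = groups[row[cat_key]]
        xs.append(row[x_key])
        ys.append(row[y_key])
    return {cat: fit_line(xs, ys) for cat, (xs, ys) in groups.items()}


# Illustrative data (assumption): price rises with horsepower at a
# different rate for each fuel type.
horsepowers = [60, 90, 120, 150, 200]
rows = (
    [{"fuel_type": "gas", "horsepower": hp, "price": 100 * hp + 2000}
     for hp in horsepowers]
    + [{"fuel_type": "diesel", "horsepower": hp, "price": 160 * hp + 5000}
       for hp in horsepowers]
)

trends = trend_by_category(rows, "fuel_type", "horsepower", "price")
for fuel, (slope, intercept) in trends.items():
    print(f"{fuel}: slope={slope:.1f}, intercept={intercept:.1f}")
```

Clearly different slopes between categories indicate that the effect of horsepower on price depends on fuel_type, which is what an interaction effect means. Near-parallel lines would point to an additive effect instead.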