A repository containing code and data for a churn prediction project in the automotive service business.
The automotive retail business is under pressure, and the industry relies heavily on recurring aftersales service revenue to make a profit. The goal of this project is to help a large European automotive retail and service company reduce its churn rate (the ratio of customers who switch from one supplier to another in a given period) by building a predictive model that identifies customers who are about to churn. The project was a successful prototype for a model that was later implemented by the company's Data Science team. I used this project as the capstone for Udacity's Machine Learning Nanodegree.
The project was designed to answer 3 questions:
- Can customer churn be predicted? (Yes, the model achieves a reasonable overall F1-score of 0.82.)
- What are the main drivers for churn? (Top 3: car age, duration of relationship / recency, distance from home to branch)
- What can be done to prevent churn? (This question is addressed in the blogpost only, see notebook 4.)
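For readers unfamiliar with the metric quoted in the first answer: the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single class from two lists of labels. The function name `f1_for_label` and the sample labels are illustrative only and are not taken from the notebooks; the 0.82 reported above may well be an average over both classes rather than this single-class value.

```python
def f1_for_label(y_true, y_pred, positive="CHURN"):
    """Return the F1-score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only.
y_true = ["CHURN", "CHURN", "ACTIVE", "ACTIVE", "CHURN"]
y_pred = ["CHURN", "ACTIVE", "ACTIVE", "CHURN", "CHURN"]
score = f1_for_label(y_true, y_pred)
```

With one missed churner and one false alarm, precision and recall are both 2/3 in this toy example, so the score is 2/3 as well.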
The project proposal and final report for the full project can be found in the reports section. A subsequent blogpost summarizes the results. In the resources section you'll find some interesting papers on churn prediction with Machine Learning.
This project requires Python 3.x and the following Python libraries installed:
You will also need software that can run Jupyter (iPython) notebooks.
The main code is split up into 4 Jupyter notebooks, numbered 1 to 4:
- 1-prep_get geo distances.ipynb: feature engineering: calculate geo distances from customer addresses to their service branch
- 2-EDA_cleaning.ipynb: EDA and cleaning of features
- 3-modelling_evaluation.ipynb: modelling with 5 different classifiers, experimentation with PCA and tuning, result evaluation
- 4-end_to_end_run.ipynb: a concise rework of the earlier steps for the best-performing Gradient Boosting Classifier. Because of more targeted data preparation, the results here are better than those in notebook 3.
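To give an idea of the distance feature prepared in notebook 1: the dataset stores the travel distance by car, which requires a routing service. As a simplified stand-in (not the method used in the notebook), the sketch below computes the straight-line great-circle distance with the haversine formula, which will generally underestimate the driving distance. The function name `haversine_metres` and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_metres(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical coordinates: a customer's home and a service branch.
home = (47.3769, 8.5417)
branch = (46.9480, 7.4474)
dist = haversine_metres(*home, *branch)
```

A straight-line distance like this can serve as a cheap first approximation or as a sanity check on routed distances.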
In this project I started to move functions for cleaning and EDA into collections of functions that would later form my codebook (see the repository of the same name). For this project they are still stored in .py files in the main folder.
The cleaned dataset churnDataWithDisctances.csv used for modelling consists of approximately 50,000 data points (= cars), each with 43 features. (This set is a pre-cleaned version of the original dataset, which was given to me by a large European automotive retail group; see notebooks 1 and 2 for the cleaning steps.)
Target Variable
- target_event: customer status ('CHURN', 'ACTIVE')
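The two target labels map naturally onto a binary variable. The sketch below, which uses only the standard library, encodes 'CHURN' as 1 and 'ACTIVE' as 0 and computes the share of churned cars in a sample. The in-memory CSV, its rows, and the `car_id` column are invented for illustration; only the `target_event` column and its two labels come from the dataset description. Note that this share is a property of the sample, not a churn rate per period.

```python
import csv
import io

# In-memory stand-in for the real CSV; `car_id` is a hypothetical column
# and the rows are invented.
sample = io.StringIO(
    "car_id,target_event\n"
    "1,ACTIVE\n"
    "2,CHURN\n"
    "3,ACTIVE\n"
    "4,ACTIVE\n"
)

rows = list(csv.DictReader(sample))
# Encode the target as 1 for churned cars and 0 for active ones.
y = [1 if row["target_event"] == "CHURN" else 0 for row in rows]
# Share of churned cars in this sample.
churn_share = sum(y) / len(y)
```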
Features
- NUM_CONSEC_SERVICES: number of consecutive service events for a car
- SUM_INVOICE_AMOUNT: sum of the invoice amounts charged for all service events
- NUM_EVENTS: total number of service visits
- LAST_MILEAGE: mileage recorded at last visit
- MEAN_MILEAGE_PER_MNTH: calculated mean mileage per month
- age_mnth: a car's age in months
- INSPECTION_INTERVAL_UID: interval in months within which a car has to show up for mandatory service
- LIST_PRICE: car price
- CAR_BRAND_UID: car brand
- FUEL_TYPE_UID: fuel type
- GEAR_TYPE_UID: gear type
- WHEEL_DRIVE_UID: wheel drive type
- NUMBER_OF_DOORS: number of doors
- GEAR_COUNT: gear count
- BASE_MARGIN: base margin for dealer
- SALES_TYPE: car model
- PERSON_LANGUAGE_UID: language of car owner
- PERSON_STATE: address state of car owner
- PERSON_ADDRESS_COUNT: number of addresses in CRM system for car owner
- ownerAge: age of car owner
- REGION_UID: address region of car owner
- PARTNER_LANGUAGE_UID: language of garage branch the car is affiliated with
- IS_PREFERRED_PARTNER: branch type (company internal use)
- IS_DEALER: branch type (company internal use)
- PARTER_STATE: address state of garage branch
- PARTNER_ADDRESS_COUNT: number of addresses in CRM system for garage branch
- ... 13 categorical socio-demographic features that were bought from a third-party supplier ...
- dist_metres (sic!): distance in meters from customer home address to garage branch (travel by car)
- duration_days: duration of customer relationship from first to last recorded service visit
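Two of the features above are derived rather than recorded directly. As a rough sketch of how such values could be computed: `duration_days` follows the description above (first to last recorded service visit), while the definition of mean mileage per month used here (last recorded mileage divided by car age in months) is an assumption, since the exact calculation is not documented in this README. The function names and the service history are invented.

```python
from datetime import date


def duration_days(visit_dates):
    """Days between the first and the last recorded service visit."""
    return (max(visit_dates) - min(visit_dates)).days


def mean_mileage_per_month(last_mileage, age_months):
    """Assumed definition: last recorded mileage divided by car age in months."""
    if age_months <= 0:
        # Guard against division by zero for cars with no recorded age.
        return 0.0
    return last_mileage / age_months


# Invented service history for one car.
visits = [date(2016, 3, 1), date(2017, 3, 1), date(2018, 3, 1)]
days = duration_days(visits)
per_month = mean_mileage_per_month(last_mileage=48_000, age_months=40)
```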