Credit Risk Prediction

Team Members:

| Name | Student ID |
| --- | --- |
| Zerun Zhu | 2201212450 |
| Fanyuan Ma | 2301212364 |

Data Description:

  1. y: ModelChoice_Default_Flag (0: no risk; 1: has risk)
  2. Segment: Site, Industry (3 dummy variables), Age_of_Company_in_Month
  3. X: 3 modules (financial_variables, internal_behavior_variables, bureau_variables)

Basic Framework

Thought:

First, use all groups' information to build modular models.

Then, for each specific segment, use the modular models' prediction results to build a segment-specific model.

Data Processing

  1. Turn 'Age_of_Company_in_Month' into 'Age_of_Company_in_Year' and bin it into 'Age_Category'.

  2. Use Site and Industry as segment variables; the company age is kept for modular integration within each segment.

  3. Draw a stratified random sample by ['ModelChoice_Default_Flag', 'Segment2'] to obtain X_train, X_test, y_train, y_test (a sketch of these steps follows below).
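
A minimal sketch of these preprocessing steps, assuming the raw data sits in a pandas DataFrame `df` and that `Segment2` is the segment key built from Site and Industry; the age-bin edges are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: company age in years, binned into a categorical feature (bin edges are illustrative).
df['Age_of_Company_in_Years'] = df['Age_of_Company_in_Month'] / 12
df['Age_Category'] = pd.cut(df['Age_of_Company_in_Years'],
                            bins=[0, 5, 15, np.inf], labels=['young', 'mid', 'mature'])

# Step 3: stratified split on the joint (default flag, segment) key.
X = df.drop(columns=['ModelChoice_Default_Flag'])
y = df['ModelChoice_Default_Flag']
strat_key = y.astype(str) + '_' + df['Segment2'].astype(str)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=strat_key)
```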

Baseline model

  1. Do one-hot encoding.

  2. Use all variables as input to an XGBoost model, and use Optuna to search for optimal hyperparameters.

  3. Do 5-fold cross-validation.

  4. Baseline performance on the test set: Test ROC AUC = 0.7744 (a sketch of this setup is given below).

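A condensed sketch of the baseline, assuming the one-hot encoded matrices `X_train`, `y_train`, `X_test`, `y_test` from the split above; the search space and trial count are illustrative, not the exact settings used:

```python
import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Illustrative hyperparameter search space.
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 600),
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = xgb.XGBClassifier(**params, eval_metric='logloss')
    # 5-fold cross-validation scored by ROC AUC.
    return cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

best_model = xgb.XGBClassifier(**study.best_params, eval_metric='logloss')
best_model.fit(X_train, y_train)
print('Test ROC AUC:', roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
```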

Model 1: XGBoost + Logistic

  1. Apply RandomOverSampler to address the class-imbalance problem.

  2. Again use XGBoost with Optuna and 5-fold CV to find an optimal model for each group of modular variables, obtaining 3 predicted probabilities of label 1 for each row. Inside these XGBoost models, we set an imbalance weight.

  3. For each segment, use ['Age_of_Company_in_Years', 'modular0', 'modular1', 'modular2'] as input; first use a KNN imputer to fill NaNs within each segment, then build a logistic regression model (see the sketch after this list). Note that the fitted KNN imputers are recorded so the same treatment can be applied to the test set later.

  4. Model performance on the test set: Test ROC AUC = 0.5246

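A sketch of step 3, assuming a hypothetical DataFrame `train_meta` that holds, per row, the segment key 'Segment2', the target, 'Age_of_Company_in_Years', and the three modular probabilities 'modular0'-'modular2'; `segment_imputers` and `segment_models` are containers introduced here for illustration:

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression

features = ['Age_of_Company_in_Years', 'modular0', 'modular1', 'modular2']
segment_imputers, segment_models = {}, {}

for seg, part in train_meta.groupby('Segment2'):
    # Fit the KNN imputer on this segment only and keep it for the test set.
    imputer = KNNImputer(n_neighbors=5)
    X_seg = imputer.fit_transform(part[features])

    # Segment-specific logistic regression on the imputed features.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_seg, part['ModelChoice_Default_Flag'])

    segment_imputers[seg], segment_models[seg] = imputer, clf
```

At prediction time, each test row is routed to its own segment's imputer and logistic model.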

Model 2: Deal with outliers and apply PCA within each group of modular variables.

  1. There is strong correlation among the modular variables, partly due to the presence of outliers.

  2. So we first treat outliers in the training set, replacing them with boundary quantile values, and record these values so the same treatment can be applied to the test set later.

  3. Then, for each group of modular variables, we apply min-max scaling first, so that the KNN imputer treats every variable fairly, and then use the KNN imputer to fill missing values.

  4. Within each group of modular variables, we apply PCA and keep the important components; before PCA, we standardize the data (see the preprocessing sketch at the end of this section). pca_components = {'financial_variables': 10, 'internal_behavior_variables': 15, 'bureau_variables': 20}


  5. Everything else is the same as Model 1: build an XGBoost model for each group of modular variables, then use the age information and the 3 modular models' predicted probabilities as input to a logistic model for the final output.

  6. Model performance on the test set: Test ROC AUC = 0.5246

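A sketch of the Model 2 preprocessing (steps 2-4) for a single group of modular variables; `fit_module_pipeline` is a hypothetical helper, and the 1%/99% clipping quantiles and `n_neighbors=5` are assumptions, while the component counts follow `pca_components` above:

```python
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pca_components = {'financial_variables': 10,
                  'internal_behavior_variables': 15,
                  'bureau_variables': 20}

def fit_module_pipeline(train_block, n_components, q_low=0.01, q_high=0.99):
    """Clip outliers to quantile bounds, min-max scale, KNN-impute, standardize, then PCA."""
    lower, upper = train_block.quantile(q_low), train_block.quantile(q_high)
    clipped = train_block.clip(lower, upper, axis=1)

    minmax = MinMaxScaler().fit(clipped)
    imputer = KNNImputer(n_neighbors=5).fit(minmax.transform(clipped))
    imputed = imputer.transform(minmax.transform(clipped))

    standard = StandardScaler().fit(imputed)
    pca = PCA(n_components=n_components).fit(standard.transform(imputed))

    # Everything fitted on the training set is kept so the test set gets the same treatment.
    return {'bounds': (lower, upper), 'minmax': minmax, 'imputer': imputer,
            'standard': standard, 'pca': pca}
```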

Model 3: Do dimensionality reduction with an autoencoder and build an XGBoost model.

  1. Replace outliers with boundary quantile values.
  2. Apply min-max scaling and impute NaNs with the KNN imputer.
  3. Train an autoencoder to compress the data to 10 dimensions and then expand it back to the original dimensions (see the sketch after this list). Here we use only the positive training samples to build the model, and compute the loss on all training samples. Training loss = 0.0931; test loss = 0.0942.
  4. Use only the encoder part to reduce the data to 10 dimensions, then train an XGBoost model on these 10 features.
  5. Model performance on the test set: Test ROC AUC = 0.50

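A minimal PyTorch sketch of the autoencoder in step 3, assuming the preprocessed training features sit in a float tensor `X_train_t`; the hidden widths, epoch count, and MSE reconstruction loss are assumptions (only the 10-dimensional bottleneck comes from the description above):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim, bottleneck=10):
        super().__init__()
        # Encoder compresses the input to the 10-dimensional bottleneck.
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        # Decoder expands the bottleneck back to the original dimensions.
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder(in_dim=X_train_t.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):  # epoch count is illustrative
    optimizer.zero_grad()
    loss = loss_fn(model(X_train_t), X_train_t)  # reconstruction loss
    loss.backward()
    optimizer.step()

# Only the encoder is reused: the 10-dimensional codes feed the downstream classifier.
with torch.no_grad():
    codes = model.encoder(X_train_t)
```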

Model 4: Replace the XGBoost model in Model 3 with a logistic regression model.

I think that for 10-dimensional data XGBoost is overkill, so I also use a logistic regression model here on the 10-dimensional data obtained after dimension reduction. Model performance on the test set: Test ROC AUC = 0.5

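Under that view the change is essentially a one-line swap; a minimal sketch assuming the 10-dimensional `codes` and `y_train` from the Model 3 sketch above:

```python
from sklearn.linear_model import LogisticRegression

# Logistic regression on the 10-dimensional encoder output.
clf = LogisticRegression(max_iter=1000)
clf.fit(codes.numpy(), y_train)
```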

Summary

I think the autoencoder may be a strong tool, and I also found related papers that use it for credit risk prediction, but here its performance is not good. The primary problem is likely the reduction to 10 dimensions: that is too low and loses too much information; I think 30-50 dimensions would be more appropriate. Even so, the baseline XGBoost model remains the strongest.

Autoencoder

Autoencoder Network Structure

Here is the structure of a typical Autoencoder (Source: MLF_Finance_Research.pdf):

  • Unsupervised NN, fully connected.
  • Encoder + decoder, input dimension = output dimension.
  • Objective: minimize the difference between input and output.
  • Bottleneck: hidden layer with fewer dimensions.

Autoencoder for Feature Engineering and Anomaly Detection

  • Dimension reduction and nonlinear PCA.
  • Anomaly detection with imbalanced samples (non-default samples vs. default samples).
  • Utilizing overfitting: the reconstruction error $X-\hat{X}$ is low for normal samples and higher for abnormal samples.
  • Therefore, $X-\hat{X}$ can be used as features to distinguish normal and abnormal data.
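
A short sketch of turning the reconstruction error into features, assuming `model` is a trained autoencoder and `X_t` a feature tensor as in the earlier sketch:

```python
import torch

with torch.no_grad():
    X_hat = model(X_t)

# Per-sample L1 reconstruction error; the full residual X_t - X_hat can also be kept per feature.
recon_error = torch.mean(torch.abs(X_t - X_hat), dim=1)
# recon_error (or the residual matrix) is appended as extra features for the downstream classifier.
```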

$X-\hat{X}$ in our dataset with different modules

Using the autoencoder on the different feature modules, we obtain the following L1 losses:

| Module | Non-default | Default | % Diff |
| --- | --- | --- | --- |
| Financial | 0.3142 | 0.3027 | -3.6% |
| Internal Behaviour | 0.3212 | 0.4104 | 27.8% |
| Bureau | 0.1638 | 0.2125 | 29.7% |

The correlation of $X-\hat{X}$ is shown in the accompanying figures.

Possible Alternatives

  • Generator and discriminator.
  • The generator generates fake samples from random noise.
  • The discriminator classifies real samples versus generated samples.
  • A strong discriminator can then be used as a classifier of normal vs. abnormal samples (a sketch follows below).
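
A hedged PyTorch sketch of this generator/discriminator alternative; all dimensions and training details are illustrative, and this is not part of the models above:

```python
import torch
import torch.nn as nn

noise_dim, feat_dim = 16, 40  # illustrative dimensions

# Generator: random noise -> fake feature vector.
generator = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(),
                          nn.Linear(64, feat_dim))

# Discriminator: feature vector -> probability of being a real sample.
discriminator = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_batch):
    n = real_batch.size(0)

    # 1. Update the discriminator on real vs. generated samples.
    fake = generator(torch.randn(n, noise_dim)).detach()
    d_loss = (bce(discriminator(real_batch), torch.ones(n, 1)) +
              bce(discriminator(fake), torch.zeros(n, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2. Update the generator to fool the discriminator.
    fake = generator(torch.randn(n, noise_dim))
    g_loss = bce(discriminator(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```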