---
title: "Predicting Excitement at DonorsChoose.org"
output:
  html_document:
    fig_caption: yes
    fig_height: 3.5
    fig_width: 5
    number_sections: yes
    toc: yes
---
<style>
pre {
overflow-x: auto;
}
pre code {
word-wrap: normal;
white-space: pre;
}
</style>
```{r setup, echo=FALSE, cache=FALSE}
knitr::opts_knit$set(root.dir='/mnt/ssd/kdd', size="small")
options(width=110)
```
# Introduction
"[DonorsChoose.org](DonorsChoose.org) is an online charity that makes it easy to help students in need through school donations. At any time, thousands of teachers in K-12 schools propose projects requesting materials to enhance the education of their students."
![https://www.donorschoose.org/project/tablets-for-collaborative-and-immersive/3208685/](donorschoose_proj.png)
The [given challenge](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose) is to help DonorsChoose by predicting which projects are likely to be "exciting", where exciting is a business construct meaning that the project will:
* Be fully funded
* Have at least one donor referred by a teacher
* Have a high percentage of donors leaving an original message
* Have at least one donation made with the desired payment means
* Have donations from certain desirable donors
To try to predict whether a project will become exciting or not, we are given a substantial amount of data about each project, including:
* Various structured information about each project, filled when the teachers submit the projects
* Structured information about the resources requested for each project
* Project text posted by the teachers (unstructured)
* Information about the outcomes of projects in the training set
* Information about the donations to each project in the training set
# Instructions and Setup
* This notebook was written with R Markdown in RStudio 1.1.
* It was run with R 3.4.4, and it also contains pieces of code run with Python 3.6.
* To reproduce these results, the .rmd notebook file can be opened and run in RStudio.
* Note that some pieces of R code take a long time to run (several hours). To allow the notebook to run relatively quickly, I saved the long-running results, which are loaded instead of being recomputed. The notebook is already configured to do this, but it needs the corresponding .rdata files to be placed in the working directory.
* In the setup code above, please set the working directory (root.dir) as needed.
* To run the notebook, the .csv files of the challenge, as well as the 15zpallagi.csv file (provided in the GitLab repo), must also be placed in the working directory.
Let's start by loading a few packages required for this notebook.
```{r, message=FALSE}
library(data.table) # the best and fastest package for data manipulation
library(ggplot2) # the widely used and acknowledged plotting system
theme_set(theme_bw(base_size=9))
library(parallel) # parallel computation package for speeding up some operations
library(tm) # NLP / needs package libxml2-dev in Ubuntu 16.04, available through apt-get install
library(topicmodels) # NLP / needs package libgsl-dev in Ubuntu 16.04, available through apt-get install
```
# Data loading and cleaning
## Projects data: projects.csv
```{r}
projects <- fread("projects.csv", na.strings="")
```
This file contains the key information about each project. It has a total of `r nrow(projects)` instances and `r ncol(projects)` features.
The list and description of each feature can be found [here](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data).
Let's check what the features look like, and which ones are usable, by looking at the unique values of each one. The function below counts the number of unique values and, if there are fewer than the given threshold, shows them sorted by frequency.
```{r}
uniquevals <- function(feature, tablelim=100) {
  u <- length(unique(feature))
  if(is.numeric(feature)) {
    return(paste("Numeric. Unique values:",u))
  }
  if(u <= tablelim) {
    return(sort(table(feature, useNA="ifany"), decreasing=T))
  } else {
    return(paste0("Categorical. Unique values:",u))
  }
}
lapply(projects, uniquevals)
```
We can identify a number of features that will be unusable (categorical with too many values) or irrelevant (ids) for classification. We won't remove them from the data until the last step though, because they might be useful for future feature engineering.
```{r}
nonusable <- c("projectid","teacher_acctid","schoolid","school_ncesid","school_latitude","school_longitude",
"school_city","school_zip","school_district","school_county","date_posted","total_price_including_optional_support")
usable <- setdiff(colnames(projects),nonusable)
```
We could also see that there was some missing data, and some categorical features with levels that were extremely rare. Let's fix that.
```{r}
projects[is.na(school_metro), school_metro := "unknown"] # 81908 occurrences, more than the least frequent category
projects[teacher_prefix%in%c("Mr. & Mrs.","Dr."), teacher_prefix := NA] # just 2, 4, 12 occurrences
projects[is.na(secondary_focus_subject), secondary_focus_subject := "unknown"] # 207893 occurrences, more than any other
projects[is.na(secondary_focus_area), secondary_focus_area := "unknown"] # 207893 occurrences, more than any other
projects[, fulfillment_labor_materials := as.character(fulfillment_labor_materials)] # this is actually a categorical variable
projects[is.na(fulfillment_labor_materials), fulfillment_labor_materials := "unknown"] # 35082 occurrences
impute <- function(x) {
  if(is.numeric(x)) { # median
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else { # most frequent
    x[is.na(x)] <- names(which.max(table(x)))
  }
  return(x)
}
projects[, (usable) := lapply(.SD, impute), .SDcols=usable]
```
Now we'll fix the types of the columns, and check how it looks.
```{r}
projects[, date_posted := as.Date(date_posted)]
projects[, school_zip := as.numeric(school_zip)]
categorical_variables <- usable[sapply(projects[,usable,with=F],is.character)]
numeric_variables <- usable[sapply(projects[,usable,with=F],is.numeric)]
projects[,(categorical_variables) := lapply(.SD, function(x){factor(x)}), .SDcols=categorical_variables]
summary(projects)
```
Looking good! We're done cleaning up the projects data.
## Outcomes data: outcomes.csv
Let's load the data, check how it looks, and transform the categorical/binary variables to factors.
```{r}
outcomes <- fread("outcomes.csv", na.strings="")
categorical_variables = setdiff(colnames(outcomes)[sapply(outcomes,is.character)], "projectid") # exclude the projectid
outcomes[,(categorical_variables) := lapply(.SD, factor), .SDcols=categorical_variables]
summary(outcomes)
```
There are lots of NAs in some variables, always in the same number (94398), and all of these rows correspond to non-exciting projects. There is no information provided on these cases, so let's assume there is some problem with this data and discard it: we don't want potentially bad data for training, and we already have far more non-exciting projects than exciting ones, so we don't need more of those.
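As a quick check of the claim that all the affected rows are non-exciting, a small query along these lines could be used (a sketch, not evaluated here; `at_least_1_green_donation` stands in for any of the affected columns):
```{r, eval=FALSE}
# Sanity check: rows with the systematic NAs should all have is_exciting == "f"
outcomes[is.na(at_least_1_green_donation), table(is_exciting)]
```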
```{r}
outcomes <- outcomes[!is.na(at_least_1_green_donation)]
summary(outcomes)
```
There are still some NAs in great_messages_proportion. We have no information on these either, so let's assume that an NA in great_messages_proportion means no messages at all, and impute zero.
```{r}
outcomes[is.na(great_messages_proportion), great_messages_proportion := 0]
```
And we're done with the outcomes data.
## Merge projects and outcomes
We are now ready to merge the projects and outcomes data.
```{r}
merged <- merge(projects, outcomes, by="projectid", all.x=T)
```
Let's check the time span of the data we have. Remember that the objective is to predict the outcome of new projects, so if the outcomes changed considerably over time, we shouldn't use old data.
```{r}
absolute <- merged[!is.na(is_exciting), .N, by=.(year(date_posted),month(date_posted),is_exciting)]
ggplot(absolute, aes(as.Date(paste0(year,"-",month,"-01")),N,group=is_exciting)) +
geom_line(aes(colour=is_exciting)) + labs(x="Date") + ggtitle("Number of exciting/non-exciting projects over time")
ratio <- merged[!is.na(is_exciting), sum(is_exciting=="t")/.N, by=.(year(date_posted),month(date_posted))]
ggplot(ratio, aes(as.Date(paste0(year,"-",month,"-01")), V1)) +
geom_line() + labs(x="Date", y="Exciting/Total") + ggtitle("Ratio of exciting over time")
```
We can see that projects only started to get exciting around 2010, quickly rising and then becoming relatively stable in 2011. So, let's only use the data from 2011 onwards.
```{r}
merged <- merged[date_posted >= as.Date("2011-01-01")]
```
## Looking at some features
To better grasp the problem at hand, let's also look at some categorical features that intuitively seem relevant for predicting excitement.
```{r}
by_area = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(primary_focus_area)]
ggplot(by_area, aes(primary_focus_area, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per Primary focus area") + theme(axis.text.x=element_text(angle=45, hjust=1))
by_poverty = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(poverty_level)]
by_poverty[, poverty_level := factor(poverty_level, labels=paste(c("highest","high","moderate","low"),"poverty"))]
ggplot(by_poverty, aes(poverty_level, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per school's area poverty level") + theme(axis.text.x=element_text(angle=45, hjust=1))
by_res = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(resource_type)]
ggplot(by_res, aes(resource_type, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per requested resource type") + theme(axis.text.x=element_text(angle=45, hjust=1))
by_grade = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(grade_level)]
by_grade[, grade_level := relevel(grade_level, "Grades PreK-2")]
ggplot(by_grade, aes(grade_level, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per grade level") + theme(axis.text.x=element_text(angle=45, hjust=1))
by_state = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(school_state)]
ggplot(by_state, aes(school_state, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per state") + theme(axis.text.x=element_text(angle=90, hjust=1))
```
## Preparing the data for model building
I will train the models in Python, so the data needs to be exported. The exportData function exports separate files for the training X (feature vectors) and y (target vector) data, and also exports the test data that is needed to make a submission to the Kaggle challenge.
```{r}
exportData <- function(dt, basename="data", yvars="is_exciting", excludevars=c()) {
  outdata <- copy(dt)
  setorder(outdata, date_posted) # sort by date
  # all binary variables (f/t) become 0/1's
  binary_vars <- colnames(outdata)[sapply(outdata, function(x){is.factor(x) & length(levels(x))==2})]
  outdata[, (binary_vars) := lapply(.SD, function(x){as.numeric(x)-1}), .SDcols=binary_vars]
  # remove outcome vars from the X data, as well as other non-wanted features
  # and ensure projectid always comes first
  train_x <- outdata[complete.cases(outdata),
                     c("projectid", setdiff(names(outdata), c(yvars, excludevars))), with=F]
  # projectid always comes first
  train_y <- outdata[complete.cases(outdata),
                     c("projectid", setdiff(yvars,"projectid")), with=F]
  # also export the data for the Kaggle challenge
  testprojects <- fread("sampleSubmission.csv")
  test_x <- outdata[projectid %in% testprojects$projectid,
                    c("projectid", setdiff(names(outdata), c(yvars, excludevars))), with=F]
  # write to files
  write.csv(train_x, file=paste0("train_x_",basename,".csv"), row.names=F)
  write.csv(train_y, file=paste0("train_y_",basename,".csv"), row.names=F)
  write.csv(test_x, file=paste0("test_x_",basename,".csv"), row.names=F)
  rm(outdata, train_x, test_x, train_y)
}
exportData(merged, basename="basic", yvars=colnames(outcomes), excludevars=nonusable)
```
# Model building in Python
I'll use Python to build and evaluate the models, relying on the **sklearn** package. The motivation for using Python is that a single package (sklearn) implements many different models, as well as metrics, preprocessing techniques, and so on. In **R**, this would have to be done with a variety of packages, which is less clear and more cumbersome to use.
## Data loading and transformation
Let's start by defining a function in Python that will be useful for loading the data. It transforms each categorical feature into a set of dummy variables, using one-hot encoding.
```{python, eval=FALSE}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, auc, roc_curve
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

def dataload(name, folder='/mnt/ssd/kdd/'):
    X = pd.read_csv(folder + 'train_x_' + name + '.csv')
    X.drop('projectid', axis=1, inplace=True)
    X_master = pd.read_csv(folder + 'test_x_' + name + '.csv')
    X_master.drop('projectid', axis=1, inplace=True)
    # joining the train with the test data to ensure that the
    # one-hot-encoding is consistent in both
    # this should be redone using one OneHotEncoder per categorical feature
    joined = pd.concat((X, X_master), axis=0)
    joined = pd.get_dummies(joined, drop_first=True)
    # Now splitting them again
    X_master = joined.iloc[-X_master.shape[0]:]
    X = joined.iloc[:-X_master.shape[0]]
    y = pd.read_csv(folder + 'train_y_' + name + '.csv')
    y = y['is_exciting']
    # Double-check that there are no NaN values anywhere
    if len(X[X.isnull().T.any().T]) > 0:
        raise Exception("NaN values in X")
    if len(X_master[X_master.isnull().T.any().T]) > 0:
        raise Exception("NaN values in X test")
    return X, y, X_master
```
Now, a function to split the data into training and test data. The test data is the **20%** most recent data. The objective of the model is to predict the outcome of future projects, so it makes sense that the training/test is also split that way. The performance of the model on the test data should be a good indicator of its performance in a real application scenario.
```{python, eval=FALSE}
def splitdata(X, y, test_split=0.2):
    # split into train and test
    lim = int((1 - test_split) * X.shape[0])
    # the test data is the most recent data
    # we're trying to predict future outcomes, so this is the most accurate model assessment
    X_train = X.iloc[:lim, ]
    y_train = y.iloc[:lim, ]
    X_test = X.iloc[lim:, ]
    y_test = y.iloc[lim:, ]
    return X_train, y_train, X_test, y_test
```
The dataset contains features with widely disparate value ranges, so let's standardize the features. Min-max scaling could be problematic, as we've seen some outliers in a number of features. So, I'm standardizing based on the mean and standard deviation.
```{python, eval=FALSE}
def featurescaling(X, sc=None):
    # Feature Scaling: mean and stdev scaling
    if sc is None:
        sc = StandardScaler()
        X = sc.fit_transform(X)
    else:
        X = sc.transform(X)
    if not np.isfinite(X).all():
        raise Exception('Found non-finite values')
    return X, sc
```
Finally, a convenience function that loads and transforms the data as needed for feeding to the model training.
```{python, eval=FALSE}
def load_and_transform(name, folder='/mnt/ssd/kdd/', split=0.2):
    X, y, Xm = dataload(name, folder=folder)
    Xtr, ytr, Xte, yte = splitdata(X, y, test_split=split)
    # only using the train data for computing the standardization coefficients
    Xtr, sc = featurescaling(Xtr)
    if len(Xte) > 0:
        Xte, _ = featurescaling(Xte, sc)
    Xm, _ = featurescaling(Xm, sc)
    return Xtr, ytr, Xte, yte, Xm
```
## Model assessment
As stated above, the test data is the most recent data, which will not be seen by the model during training. Cross-validation should not be needed for assessing the model's performance in this scenario: it would use new projects to predict the outcome of old projects, which does not match the intended application. Cross-validation is also computationally expensive, which can be problematic given the large number of instances and features.
The data is largely imbalanced, with only around 15% of the projects belonging to the *exciting project* class. The metrics used for assessing the model should therefore be insensitive to class imbalance: for instance, a model that classifies everything as non-exciting would obtain a relatively high accuracy, yet be completely useless. The chosen metrics are far less sensitive to class imbalance:
1. **Confusion matrix**. Computed with a threshold of 0.5, that is, instances with predicted probabilities below 0.5 are considered non-exciting, and those above it are considered exciting.
2. Metrics computed from the confusion matrix: **True Positive Rate** (when it's actually exciting, how often does it predict exciting), **False Positive Rate** (when it's actually not-exciting, how often does it predict exciting), and **Precision** (when it predicts exciting, how often is it correct).
3. **ROC curve**. The Receiver Operating Characteristic curve plot has the true positive rate on the Y axis, and the false positive rate on the X axis. This means that the top left corner of the plot is the ideal point: a false positive rate of zero, and a true positive rate of one. The higher the steepness of the curve the better, meaning that it maximizes the true positive rate while minimizing the false positive rate.
4. **ROC-AUC score**. The area under the ROC curve. The higher, the better, ranging from 0.5 (random guess) to 1 (perfect predictions).
I've defined a convenience function for assessing the model's performance:
```{python, eval=FALSE}
def assessmodel(model, X_test, y_test):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = y_prob > 0.5  # cutoff
    cm = confusion_matrix(y_test, y_pred)
    print('Confusion matrix:\n' + str(cm))
    # True Positive Rate: When it's actually yes, how often does it predict yes?
    tpr = cm[1, 1] / (cm[1, 0] + cm[1, 1])
    # False Positive Rate: When it's actually no, how often does it predict yes?
    fpr = cm[0, 1] / (cm[0, 0] + cm[0, 1])
    # Precision: When it predicts yes, how often is it correct?
    precision = cm[1, 1] / (cm[0, 1] + cm[1, 1])
    print('True Positive Rate:', round(tpr, 2))
    print('False Positive Rate:', round(fpr, 2))
    print('Precision:', round(precision, 2))
    auc = roc_auc_score(y_test, y_prob)
    print('ROC AUC score: ' + str(auc))
    # ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr, color='blue', label='ROC curve')
    plt.plot([0, 1], [0, 1], color='black', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()
```
## Model building
The target for training the model is the *is_exciting* binary variable. This means we're treating this as a binary classification problem, where the model should output the probability of the project being exciting.
As a baseline, I chose to use a Logistic Regression model with L1 (Lasso) regularization. Logistic Regression is the go-to method for binary classification. There is a rather large number of features in the training data, and it is relatively sparse, due to the categorical variables that were one-hot-encoded. For instance, the school state alone corresponds to 50 features in the training data. Since there is not enough time to do proper feature selection in order to reduce the number of features, I will instead defer that task to the model. Logistic Regression with L1 regularization is capable of performing feature selection, and hence it was my choice for a base model.
```{python, eval=FALSE}
def fitLRmodel(X, y):
    from sklearn.linear_model import LogisticRegression
    # Balanced class_weight because we have way more non-exciting than exciting
    lr = LogisticRegression(penalty='l1', class_weight='balanced', verbose=1)
    lr.fit(X, y)
    return lr
```
The parameters of the model itself were mostly left untouched. The only exception is *class_weight*, which is set to *balanced*. This parameter dictates the weights associated with each class. The *balanced* option automatically computes the weights so that they are inversely proportional to the class frequencies in the input data. This is important since the classes in the training data are largely imbalanced. Not addressing this imbalance could result in models that are biased towards negative predictions (non-exciting), as this is by far the most common class. Tuning other parameters such as the *C* value (regularization strength) could lead to better models, but that would require a careful evaluation, possibly with cross-validation, which would take some time to perform.
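For intuition, here is a minimal sketch (not part of the original pipeline) of how the *balanced* weights come out on our labels; sklearn uses the formula n_samples / (n_classes * n_samples_in_class), and the values in the comment are rough, based on the ~15% positive rate mentioned above.
```{r, eval=FALSE}
# Weights inversely proportional to class frequency: n_samples / (n_classes * class_count)
counts <- table(merged[!is.na(is_exciting), is_exciting])
sum(counts) / (length(counts) * counts)
# with roughly 15% exciting projects this gives ~0.6 for "f" and ~3.3 for "t"
```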
## Training the model with the basic dataset
Let's see how the model performs using the data we have so far.
```{python, eval=FALSE}
X_train, y_train, X_test, y_test, X_master = load_and_transform('basic', split=0.2)
lr1 = fitLRmodel(X_train, y_train)
assessmodel(lr1, X_test, y_test)
```
```
# nonzeros / # features = 137/146
Confusion matrix:
[[35949 12992]
[ 4838 3222]]
True Positive Rate: 0.4
False Positive Rate: 0.27
Precision: 0.2
ROC AUC score: 0.5968095858369598
```
![ROC curve](basic_roc.png)
The performance of the model is, overall, relatively weak. The true positive rate is rather high, meaning that 40% of the projects that are actually exciting are being predicted as such. But the model is giving a large number of false positives, resulting in a very low precision. The value of just ~0.59 in the ROC-AUC score confirms the poor performance of the model.
# Basic feature engineering
To try to improve the model, let's extract additional features from the existing data, which might help the model make more accurate predictions.
## Time features
It's conceivable that the date the projects were posted influences their outcome. As we saw earlier, for instance, there was a slight upwards trend in the ratio of exciting projects. Also, it's possible that projects posted at certain times of the year are more likely to be successful. Let's extract a few features from the date_posted feature so that the posting date can also be used for learning.
```{r}
# year as numeric
merged[, year_posted := year(date_posted)]
# month as categorical, since non-linearities are likely
merged[, month_posted := factor(months(date_posted, abbreviate=T))]
# weekday as categorical because non-linearities are likely
merged[, wday_posted := factor(weekdays(date_posted, abbreviate=T))]
# day in the month as numeric to avoid an excess of categories
merged[, day_posted := mday(date_posted)]
```
Let's look at some plots to understand possible relations between the date posted and the project's success. It's clear there is some influence, mostly from the month. It makes sense: projects posted just before the beginning of the school year tend to be more successful.
```{r}
by_month = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(month_posted)]
by_month[, month_posted := factor(month_posted, # abbreviated month names; locale-dependent
                                  levels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))]
ggplot(by_month, aes(month_posted, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per month")
by_wday = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(wday_posted)]
by_wday[, wday_posted := factor(wday_posted, # abbreviated weekday names; locale-dependent
                                levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))]
ggplot(by_wday, aes(wday_posted, ExcitingRatio)) + geom_bar(stat="identity") +
ggtitle("Ratio of exciting projects per weekday")
by_mday = merged[!is.na(is_exciting), .(ExcitingRatio=sum(is_exciting=="t")/.N), by=.(day_posted)]
ggplot(by_mday, aes(day_posted, ExcitingRatio)) + geom_line() +
ggtitle("Ratio of exciting projects per day in the month")
```
## Previous outcomes
We have the schoolid and the teacher_acctid in the projects data. While these cannot be used directly for classification, it is possible to use them to see how many projects a school/teacher has submitted in the past, and the outcome of those projects. Intuitively, experience and success in submitting such projects in the past should be, to some degree, related to the success of future projects.
```{r}
if(file.exists("history.rdata")) {
# Results already previously computed
load("history.rdata")
merged <- merge(merged, history, by="projectid")
} else {
# This will take some time, let's parallelise to make this faster
cl <- makeCluster(detectCores())
clusterEvalQ(cl, library(data.table))
clusterExport(cl, list("merged"))
# Previously submitted and successful projects from the school
count_school <- function(x){
sum(merged$schoolid==x["schoolid"] & merged$date_posted < x["date_posted"])
}
merged[, school_previous_submitted := parApply(cl, .SD, 1, count_school)]
count_school_success <- function(x){
sum(merged$is_exciting=="t" & merged$schoolid==x["schoolid"] &
merged$date_posted < x["date_posted"])
}
merged[, school_previous_success := parApply(cl, .SD, 1, count_school_success)]
# Previously submitted and successful projects from the teacher
count_teacher <- function(x){
sum(merged$teacher_acctid==x["teacher_acctid"] &
merged$date_posted < x["date_posted"])
}
merged[, teacher_previous_submitted := parApply(cl, .SD, 1, count_teacher)]
count_teacher_success <- function(x){
sum(merged$is_exciting=="t" & merged$teacher_acctid==x["teacher_acctid"] &
merged$date_posted < x["date_posted"])
}
merged[, teacher_previous_success := parApply(cl, .SD, 1, count_teacher_success)]
# Let's save what we just calculated, as it will take some time to recompute it
history <- merged[,.(projectid,school_previous_submitted,school_previous_success,
teacher_previous_submitted,teacher_previous_success)]
save(history, file="history.rdata")
}
```
Let's see what these new features look like.
```{r}
summary(history[,-c("projectid")])
```
We can look at some correlations to try to validate the hypothesis that previous submissions are related to future outcomes. We can see some positive correlations with the number of previously successful projects, but negative correlations with the total number of previously submitted projects. These correlations are quite low, though, so let's not jump to conclusions.
```{r}
m <- merged[complete.cases(merged)]
cor(m$school_previous_success, as.numeric(m$is_exciting=='t'), method="spearman")
cor(m$teacher_previous_success, as.numeric(m$is_exciting=='t'), method="spearman")
cor(m$school_previous_submitted, as.numeric(m$is_exciting=='t'), method="spearman")
cor(m$teacher_previous_submitted, as.numeric(m$is_exciting=='t'), method="spearman")
rm(m)
```
## Resource data
The projects.csv dataset has the main type of resource required by the project (resource_type feature), and that's the only knowledge about the resources we have so far. Let's look into the resources.csv data to see if we can extract some more valuable information.
```{r}
resources <- fread("resources.csv")
resources <- resources[projectid %in% merged$projectid] # the only data that is worth it
head(resources, n=5)
summary(resources[, .N, by=.(projectid)]$N) # number of resources per project
cats <- unique(resources$project_resource_type) # project resource types
# number of different resource types per project
summary(resources[, .(U=length(unique(project_resource_type))), by=.(projectid)]$U)
items <- length(unique(resources$item_name)) # number of different items
```
There are `r items` different items, for a total of `r nrow(resources)` resources. This is way too sparse, and thus not useful. We can also see that there's only one resource type per project, which matches the resource_type feature already present in projects.csv, so it adds nothing new. Let's extract what we can: *average price of the resources in a project*, *average quantity per resource in a project*, *number of different resources required by the project*.
```{r}
resourcestats <- resources[, .(average_resource_price=mean(item_unit_price, na.rm=T),
average_quantity=mean(item_quantity, na.rm=T),
different_resources=.N), by=.(projectid)]
merged <- merge(merged, resourcestats, by="projectid", all.x=TRUE)
```
Let's see what these new features look like.
```{r}
summary(resourcestats[,-c("projectid")])
```
There is some missing data (very little), so let's fix that.
```{r}
merged[, average_resource_price := impute(average_resource_price)]
merged[, average_quantity := impute(average_quantity)]
```
## Re-evaluation
Let's see if the results improved with the new features we added.
```{r}
exportData(merged, basename="feature_eng_simple", yvars=colnames(outcomes), excludevars=nonusable)
```
```{python, eval=FALSE}
X_train, y_train, X_test, y_test, X_master = load_and_transform('feature_eng_simple', split=0.2)
lr2 = fitLRmodel(X_train, y_train)
assessmodel(lr2, X_test, y_test)
```
```
# nonzeros / # features = 164/172
Confusion matrix:
[[37993 10948]
[ 4996 3064]]
True Positive Rate: 0.38
False Positive Rate: 0.22
Precision: 0.22
ROC AUC score: 0.6208175256143482
```
We can see that the feature engineering worked. Compared to the previous model, we have approximately the same number of false negatives, but about 2000 fewer false positives. The False Positive Rate decreased from 0.27 to 0.22, the precision increased from 0.2 to 0.22, and the ROC-AUC score increased from 0.59 to 0.62.
# Using additional data
The results we got so far are slightly disappointing with respect to the accuracy of the model. There are two likely explanations for this:
* The outcome of a project might be largely stochastic, and thus mostly unpredictable.
* The model doesn't have enough information to make accurate predictions.
We can't do much about the first hypothesis, so let's go with the second. Let's try to gather additional information that might help predict the outcome of a project.
## Population data
In the plots shown above, we can see that the project excitement ratio varies considerably between states, and also slightly between different poverty levels. The working hypothesis is that the outcome of a project is influenced by the socio-economic context of the school. This is highly likely, as even [donorschoose.org](https://www.donorschoose.org) highlights this information on the projects' pages:
![Taken from the page of an arbitrary project: https://www.donorschoose.org/project/bounce-into-focus/3149021](donorschoose_income.png)
The data we have in the dataset about the socio-economic context is very limited: only the poverty_level attribute:
```{r}
table(merged$poverty_level)
```
We have the school zip code for each project, which in the US is a good indicator of the geographical area. After some digging, I found a very complete and accurate dataset with socio-economic data per zip code in the US, at the [irs.gov](https://www.irs.gov/) website: [Individual Income Tax Statistics - ZIP Code Data (SOI)](https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-zip-code-data-soi). Citing the source, the data includes items such as:
* Number of returns, which approximates the number of households
* Number of personal exemptions, which approximates the population
* Adjusted gross income
* ...
A full description of the features (131 in total) can be found [here](http://www.nber.org/tax-stats/zipcode/2015/desc/zipcode2015/desc.txt).
We are going to extract the following features:
1. Number of households
2. Total population number
3. Total number of dependents
4. Fraction of dependents: number of dependents / population number
5. Relative distribution of the population over the 6 defined income classes
```{r}
tax <- fread("15zpallagi.csv")
# N1: number of returns (approximates number of households)
# agi_stub: adjusted gross income class
# Calculate the proportion of returns in each income class
income_distribution <- tax[,.(STATE,zipcode,agi_stub,N1)]
income_distribution[, N1 := N1 / sum(N1), by=.(STATE,zipcode)]
# From long to wide (one column per income class)
income_distribution <- dcast(income_distribution, STATE + zipcode ~ agi_stub, value.var="N1")
colnames(income_distribution) <- c("STATE","zipcode",paste0("agi",1:6))
# N2: number of exemptions (approximates the population)
# numdep: number of dependents (likely people in school age)
totals <- tax[, .(households=sum(N1),population=sum(N2),dependents=sum(NUMDEP),
fraction_dependents=sum(NUMDEP)/sum(N2)), by=.(STATE,zipcode)]
ziptotals <- merge(totals,income_distribution,by=c("STATE","zipcode"))
```
We now have the features we want per state and zip code; let's see how they look and merge them into our data.
```{r}
summary(ziptotals)
sum(merged$school_zip %in% ziptotals$zipcode) # data for nearly all projects!
merged <- merge(merged, ziptotals, by.x=c("school_zip","school_state"), by.y=c("zipcode","STATE"), all.x=TRUE)
# there were very few cases for which zip codes were not found, impute medians of the state
vars <- setdiff(colnames(ziptotals), c("STATE","zipcode"))
merged[, (vars) := lapply(.SD, impute), by=.(school_state), .SDcols=vars]
```
We could have also easily extracted the state-wise data, but that wouldn't make much sense, as we are using the state itself directly as a feature.
## Re-evaluation after socio-economic data
Let's see if the results improved with the new features we added.
```{r}
exportData(merged, basename="feature_eng_census", yvars=colnames(outcomes), excludevars=nonusable)
```
```{python, eval=FALSE}
X_train, y_train, X_test, y_test, X_master = load_and_transform('feature_eng_census', split=0.2)
lr3 = fitLRmodel(X_train, y_train)
assessmodel(lr3, X_test, y_test)
```
```
# nonzeros / # features = 170/182
Confusion matrix:
[[37856 11081]
[ 4947 3117]]
True Positive Rate: 0.39
False Positive Rate: 0.23
Precision: 0.22
ROC AUC score: 0.6214096741364261
```
The improvement in the model's performance with the new features was minor. The ROC-AUC score increased from 0.6208 to 0.6214, which is marginal. It is possible that the information already contained in the original dataset (the *state* and the *poverty level*) was sufficient, and thus adding further information about the school's location was irrelevant. Another hypothesis is that the model does not have enough power to handle this data, as there might be non-linear relations between the new features and the projects' outcome, which our Logistic Regression model is incapable of capturing.
## Natural Language Processing for the project essays
The data available for the projects also includes the complete project essays and needs statements (file essays.csv). Although the projects.csv data already has a few structured attributes that help categorize the projects (*primary_focus_area*, *primary_focus_subject*, *secondary_focus_subject*, *secondary_focus_area*) and their needs (*resource_type*), it is conceivable that the available categories do not accurately capture the diversity of the projects' themes. It is also possible that projects using certain keywords are more likely to attract attention, regardless of the theme. To investigate these possibilities, I will try to extract useful knowledge from the essays file using text mining techniques.
The main idea is to use [probabilistic topic models](https://dl.acm.org/citation.cfm?id=2133826) to automatically extract topics from the given corpus (all the essays). After the topics are defined, we can then *classify* each essay against those topics, that is, what's the probability of a given essay belonging to each of those topics? These probabilities can then be used as features to train the model.
The implementation described in this notebook is adapted from the guide available [here](https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r):
1. Apply typical text mining techniques to clean up the essays: remove punctuation and stop words, stem the words, and remove very frequent and very infrequent terms.
2. Use topic modelling with a sample of the data to extract the topics, using the [LDA algorithm](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) in particular.
3. Use those topics to calculate the probabilities of each essay belonging to each topic.
Note that the essays.csv file also has a *short_description* attribute, which could be more useful for this task. However, I noticed that most of the short descriptions were actually truncated in arbitrary places, ending with "...". This could compromise the results, so I am using the full project essay instead.
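A quick way to gauge the extent of this truncation is a check along these lines (a sketch, not evaluated here; it assumes the `essays` table loaded further below):
```{r, eval=FALSE}
# Fraction of short descriptions that end with "..." and thus look truncated
mean(grepl("\\.\\.\\.$", trimws(essays$short_description)), na.rm=TRUE)
```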
First, let's define a function that will pre-process the essay texts to make it easier for the topic models.
```{r}
preProcessCorpus <- function(texts) {
  # create corpus from the vector source
  descrs <- Corpus(VectorSource(texts))
  # to lower case
  descrs <- tm_map(descrs, content_transformer(tolower))
  # remove punctuation
  descrs <- tm_map(descrs, removePunctuation)
  descrs <- tm_map(descrs, removeNumbers)
  # remove stopwords (and, to, with, etc, ...)
  descrs <- tm_map(descrs, removeWords, stopwords("english"))
  # remove whitespace
  descrs <- tm_map(descrs, stripWhitespace)
  # stem document (going -> go, documentation -> document, ...)
  descrs <- tm_map(descrs, stemDocument)
  return(descrs)
}
```
Now, let's load the data and pre-process it.
```{r}
if(file.exists("essays_tm.rdata")) {
# Use the previously computed objects
# this loads the objects essayscorpus, essaysdtm, essayslda, essaysprobs
load("essays_tm.rdata")
} else {
# Load the data -- the fread was not able to deal with the line brekage. read.csv works
essays <- read.csv("essays.csv", allowEscapes=TRUE, stringsAsFactors=FALSE)
setDT(essays)
# Use only the data that is worth it
essays <- essays[projectid %in% merged$projectid]
essayscorpus <- preProcessCorpus(essays$essay)
}
```
Let's see what it looks like by examining one of the essays after the pre-processing.
```{r}
writeLines(as.character(essayscorpus[[1]]))
```
Second, let's extract the Document-Term-Matrix required for topic modelling. Note that in this operation I'm also removing the overly common terms (appearing in more than 90% of the documents), and the overly sparse terms (appearing in less than 1% of the documents). This is a common operation that helps keep the number of different terms in check.
```{r}
extractDTM <- function(corpus, ids) {
  # create document-term matrix
  ndocs <- length(corpus)
  # ignore overly sparse terms (appearing in less than x% of the documents)
  minDocFreq <- ndocs * 0.01
  # ignore overly common terms (appearing in more than y% of the documents)
  maxDocFreq <- ndocs * 0.90
  dtm <- DocumentTermMatrix(corpus, control=list(bounds=list(global=c(minDocFreq, maxDocFreq))))
  # use the ids as rownames
  rownames(dtm) <- ids
  # Remove the empty entries (should be very few)
  dtm <- dtm[unique(dtm$i),]
  return(dtm)
}
if(!exists("essaysdtm")) { # previous results weren't available
  essaysdtm <- extractDTM(essayscorpus, essays$projectid)
}
inspect(essaysdtm)
```
Third, let's extract the topics with probabilistic topic modelling. I had to use a sample of the data to fit the model because this takes quite some time.
```{r}
if(!exists("essayslda")) { # previous results weren't available
essayslda <- LDA(essaysdtm[sample(1:nrow(essaysdtm),50000),], 50, method="Gibbs",
control=list(nstart=1, burnin=2000, iter=2000, verbose=1))
}
```
After letting it run overnight, let's look at the fitted topics. I'm showing the terms with the highest probability of appearing in each of the fitted topics. We can immediately see that there are a lot of topics that make sense here; looking good!
```{r}
terms(essayslda,6)
```
Fourth, let's calculate the probabilities of each essay belonging to each of the fitted topics. This also takes quite some time.
```{r}
if(!exists("essaysprobs")) { # previous results weren't available
essaysprobs <- posterior(essayslda, essaysdtm)$topics
# Lets save everything to reuse later
save(essayscorpus, essaysdtm, essayslda, essaysprobs, file="essays_tm.rdata")
}
```
Finally, let's merge the probabilities into our dataset.
```{r}
essaysprobs_dt <- data.table(essaysprobs)
cols <- paste0("essay_topic_",1:ncol(essaysprobs))
colnames(essaysprobs_dt) <- cols
essaysprobs_dt[, projectid := rownames(essaysdtm)]
merged <- merge(merged, essaysprobs_dt, all.x=TRUE, by="projectid")
# Handle NA's (very few, caused by lack of essay) - impute uniform probabilities
imputeuniform <- function(x, ncols){
  x[is.na(x)] <- 1 / ncols
  return(x)
}
merged[, (cols) := lapply(.SD, imputeuniform, length(cols)), .SDcols=cols]
```
Now let's do the same for the needs statement.
```{r}
if(file.exists("needs_tm.rdata")) {
# Load the already computed results
load("needs_tm.rdata")
} else {
needscorpus <- preProcessCorpus(essays$need_statement)
needsdtm <- extractDTM(needscorpus, essays$projectid)
needslda <- LDA(needsdtm[sample(1:nrow(needsdtm),50000),], 30,
method="Gibbs", control=list(nstart=1, burnin=2000, iter=2000, verbose=1))
needsprobs <- posterior(needslda, needsdtm)$topics
save(needscorpus, needsdtm, needslda, needsprobs, file="needs_tm.rdata")
}
```
Let's take a look at the extracted topics. I used 30 topics instead of 50 since there are much fewer different terms in the needs statements.
```{r}
inspect(needsdtm) # the most frequent terms in the corpus
terms(needslda,6) # the 6 most likely terms for each topic
```
And finally, merge this into our data.
```{r}
needsprobs_dt <- data.table(needsprobs)
cols <- paste0("needs_topic_",1:ncol(needsprobs))
colnames(needsprobs_dt) <- cols
needsprobs_dt[, projectid := rownames(needsdtm)]
merged <- merge(merged, needsprobs_dt, all.x=TRUE, by="projectid")
merged[, (cols) := lapply(.SD, imputeuniform, length(cols)), .SDcols=cols]
```
## Re-evaluation after NLP
And now export the data with these new features. It's time to check if all this work improved the predictions.
```{r}
exportData(merged, basename="feature_eng_nlp", yvars=colnames(outcomes), excludevars=nonusable)
```
```{python, eval=FALSE}
X_train, y_train, X_test, y_test, X_master = load_and_transform('feature_eng_nlp', split=0.2)
lr4 = fitLRmodel(X_train, y_train)
assessmodel(lr4, X_test, y_test)
```
```
# nonzeros / # features = 248/262
Confusion matrix:
[[37463 11478]
[ 4837 3223]]
True Positive Rate: 0.4
False Positive Rate: 0.23
Precision: 0.22
ROC AUC score: 0.6226416899509781
```
![ROC curve of the final model](final_roc.png)
Again, the performance increase is mostly minor. The ROC-AUC score went from 0.621 to 0.622. The True Positive Rate increased from 0.39 to 0.4, and the other metrics remained at similar values as before.
# Kaggle submission
Finally, let's train the model with all the data and see how the model fares in the Kaggle competition.
```{python, eval=FALSE}
X_train, y_train, _, _, X_master = load_and_transform('feature_eng_nlp', split=0)
lrfinal = fitLRmodel(X_train, y_train)
pred = lrfinal.predict_proba(X_master)[:,1]
# Write submission file
ori = pd.read_csv('/mnt/ssd/kdd/test_x_feature_eng_nlp.csv') # get the projectids
out_master = pd.DataFrame({'projectid':ori['projectid'].values,'is_exciting':pred})
out_master= out_master[['projectid','is_exciting']] # sort the columns
out_master.to_csv('data/kaggle.csv', sep=',', index=False) # write the submission file
```
![ROC-AUC score achieved in the test data used by the Kaggle competition](kaggle_result.png)
The Kaggle competition uses the ROC-AUC score to assess the submissions. The score of 0.61361 is quite close to the score of 0.622 that I achieved with my own test data. These are relatively poor results, but at least they mean that the model is not overfitted to the training data.
![Leaderboard of the Kaggle competition](kaggle_leaderboard.png)
If the competition were live, I would place 49th. There is definitely still room for improvement, but this would put me in the top 10% of the leaderboard. Not that bad, after all.
# Towards Deployment
Due to the time allowed for this challenge, the code has been developed for ease of exploration and experimentation, not for deployment. Nevertheless, care was always taken so that all the extracted features were feasible and realistic in a production environment, and so that the system could classify new instances with minor computational cost. To classify a new project instance based on the methods proposed in this notebook, an application would have to implement the following steps:
1. Clean the given project instance data as done in this notebook.
2. Extract the additional features from the posting date and resource information of the given project.
3. Extract the features based on previous outcomes of projects submitted by the same teacher/school. This would have to be done considering all the historical data available at the time of the prediction. For classifying a single new instance, this would correspond to no more than a few SELECT COUNT queries in the database.
4. Extract the features related to the socio-economic data of the school's zip code area, as done in this notebook. The dataset I used for socio-economic data is open, updated every year, extremely complete (nearly all ZIP codes in the US), and of high quality, so this wouldn't be an issue.
5. Calculate the topic probabilities for the project essay and statement of needs. Fitting the topics to the existing corpus was computationally expensive, but new instances can simply be assigned to the already-fitted topic models, which is fast. I actually did this, as the topic models were only fitted with a sample of the data (see the sketch after this list).
6. With the feature vector complete, the categorical variables would have to be one-hot-encoded using the previously fitted one-hot-encoders, and then all the features would have to be standardized with the previously calculated standardization coefficients.
7. Finally, the standardized feature vector could be fed to the previously fitted model for making the prediction.
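To make step 5 concrete, here is a minimal sketch of how a new essay could be scored against the topic model already fitted in this notebook; `new_project` is a hypothetical one-row record with an `essay` field, and the fitted objects `essayslda` and `essaysdtm` are assumed to be loaded.
```{r, eval=FALSE}
# Assign a new essay to the already-fitted topics (sketch)
new_corpus <- preProcessCorpus(new_project$essay)
# restrict the document-term matrix to the vocabulary used when fitting the model
new_dtm <- DocumentTermMatrix(new_corpus, control=list(dictionary=Terms(essaysdtm)))
# posterior topic probabilities for the new essay, usable as model features
new_topic_probs <- posterior(essayslda, new_dtm)$topics
```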
# Final remarks
From the get-go, I knew this was an extremely challenging problem. Just by looking at the Kaggle score table, we can see that first place was awarded to a submission with a ROC-AUC score of just 0.67, which can be considered a [poor-to-fair performance](http://gim.unmc.edu/dxtests/roc3.htm). Half of the submissions are below 0.56, which is only marginally better than a random guess.
Building a model with the data immediately available in the dataset (after cleaning and adequate preparation) confirmed the relatively low performance of the model. To try to improve the model's performance, I heavily invested in feature engineering, by (1) extracting additional features from the provided datasets, (2) extracting features from socio-economic data available online, and (3) applying text mining techniques to extract features from the projects' essays.
Some of these features improved the model's performance, but even the best model yielded a relatively poor performance, with a relatively good true positive rate (0.4) but low precision (0.22), meaning that only about one in five projects predicted as exciting would actually become exciting. The model is thus of limited reliability, and would naturally have limited use in a production setting.
## Future directions
To try to improve upon these results, a significantly different approach could be tried. Currently, I'm predicting whether a project is exciting or not based on the features associated with that project (binary classification). However, as stated in the Introduction, there are several criteria that must be met in order for a project to be considered exciting (fully funded, had donors leaving original messages, had donors referred by teachers, etc.). The dataset also includes this information, which is currently not being used. A different approach could be:
1. Predict the probability of fulfilment of each of the criteria independently.
2. Calculate the probability of a project being exciting based on those probabilities.
Such an approach would also be more informative for the end user, as it would be more explicit about why a given project is being predicted as exciting or not.
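A rough sketch of what this could look like, assuming the data prepared in this notebook (the three criterion columns named below come from outcomes.csv, `new_project` is a hypothetical new instance, and the independence assumption in the final step is a simplification):
```{r, eval=FALSE}
# One model per excitement criterion, then combine the predicted probabilities
criteria <- c("fully_funded", "at_least_1_teacher_referred_donor", "great_chat")
models <- lapply(criteria, function(crit)
  glm(reformulate(usable, response=crit), family=binomial,
      data=merged[!is.na(get(crit))]))
names(models) <- criteria
# predicted probability of each criterion for a new project
p_criteria <- sapply(models, predict, newdata=new_project, type="response")
# naive combination, assuming the criteria are independent
p_exciting <- prod(p_criteria)
```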
In parallel with this new approach, other topics deserve further work and investigation:
1. Fine tune the models and experiment with a wider variety of models.
2. Understand why the new features from population data and text mining barely improved the results. Experiment with non-linear models to see if these features can be leveraged that way.
3. Experiment with feature selection and understand the importance of each feature in the dataset. This would also provide important information for the end user.
# Bonus - DNNs
As a bonus, here are some experiments with Deep Neural Networks that did not amount to anything useful. Unfortunately, I did not have enough time to understand why the training of the neural network is mostly failing. The accuracy increases over the epochs, but the ROC-AUC score actually decreases, meaning that the neural network is likely becoming biased towards negative predictions (by far the most common class). The class weights should have handled this problem, but apparently it's not working as expected. A different loss function might also be desirable in this case.
```{python, eval=FALSE}
from tensorflow.python.keras.callbacks import Callback

class ROCMetric(Callback):
    """
    Custom metric to evaluate the ROC AUC at the end of each epoch in
    the validation data
    """
    def __init__(self, validation_data=(), interval=1):
        super().__init__()
        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict_proba(self.X_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            logs['rocauc'] = score
            print('ROC AUC: ' + str(score))
```
```{python, eval=FALSE}
def fitDNN(X_train, y_train, X_test, y_test):
    from tensorflow.python.keras.models import Sequential
    from tensorflow.python.keras.layers import Dense, Dropout
    # Initialise the ANN
    ann = Sequential()
    # input layer and first hidden layer
    ann.add(Dense(100, kernel_initializer='glorot_uniform', activation='relu', input_dim=X_train.shape[1]))
    ann.add(Dropout(rate=0.3))
    # second hidden layer
    ann.add(Dense(100, kernel_initializer='glorot_uniform', activation='relu'))
    ann.add(Dropout(rate=0.3))
    # output layer, only one output node: probability of being exciting
    ann.add(Dense(1, kernel_initializer='glorot_uniform', activation='sigmoid'))
    # compiling the ANN -- applying Stochastic Gradient Descent
    ann.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['accuracy'])
    print(ann.summary())
    roc = ROCMetric(validation_data=(X_test, y_test), interval=1)
    # fitting the network to data
    # using class weights since the data is largely imbalanced
    from sklearn.utils import class_weight
    class_w = dict(enumerate(class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)))
    h = ann.fit(X_train, y_train, batch_size=32, epochs=100, class_weight=class_w,
                validation_data=(X_test, y_test), callbacks=[roc], verbose=2)
    return ann, h
```
```{python, eval=FALSE}
X_train, y_train, X_test, y_test, X_master = load_and_transform('feature_eng_nlp', split=0.2)
ann, history = fitDNN(X_train, y_train, X_test, y_test)
assessmodel(ann, X_test, y_test)
```
```
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 100) 26200
_________________________________________________________________
dropout_1 (Dropout) (None, 100) 0
_________________________________________________________________
dense_2 (Dense) (None, 100) 10100
_________________________________________________________________
dropout_2 (Dropout) (None, 100) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 101
=================================================================
Total params: 36,401
Trainable params: 36,401
```
```
Confusion matrix:
[[25158 23783]
[ 3102 4958]]
True Positive Rate: 0.62
False Positive Rate: 0.49
Precision: 0.17
ROC AUC score: 0.5860088624967633
```
![ROC-AUC score in the test data over the epochs](dnn_rocauc.png)