# Common Models in Machine Learning {#sec-ml-common-models}
![](img/chapter_gp_plots/gp_plot_7.svg){width=75%}
Before really getting into some machine learning models, let's get one thing straight from the outset: **any model may be used in machine learning**, from a standard linear model to a deep neural network. The key focus in ML is on performance, and generally we'll go with what works for the situation. This means that the modeler is often less concerned with the interpretation of the model, and more with the ability of the model to predict well on new data. But, as we'll see, we can do both if desired. In this chapter, we will explore some of the more common machine learning models and techniques.
```{r}
#| include: False
#| label: setup-ml-models
```
```{python}
#| include: False
#| label: setup-ml-models-py
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_validate, KFold, cross_val_score
from sklearn.metrics import accuracy_score
```
## Key Ideas {#sec-ml-common-key-ideas}
The take home messages from this section include the following:
- Any model can be used with machine learning.
- A good and simple baseline is essential for interpreting your performance results.
- You only need a small set of tools (models) to go very far with machine learning.
### Why this matters {#sec-ml-common-why-matters}
Having the right tools in data science saves time and improves results, and using well-known tools means you'll have plenty of resources for help. It also allows you to focus more on the data and the problem, rather than the details of the model. A simple model might be all you need, but if you need something more complex, these models can still provide a performance benchmark.
### Helpful context {#sec-ml-common-good-to-know}
Before diving in, it'd be helpful to be familiar with the following:
- Linear models, esp. linear and logistic regression (@sec-foundation, @sec-glm)
- Basic machine learning concepts as outlined in @sec-ml-core-concepts
- Model estimation as outlined in @sec-estimation
## General Approach {#sec-ml-common-general-approach}
Let's start with a general approach to machine learning to help us get some bearings. Here is an example outline of the process we could typically take. It incorporates some of the ideas we also cover in other chapters, and we'll demonstrate most of this in the following sections.
- Define the problem, including the target variable(s)
- Select the model(s) to be used, including one baseline model
- Define the performance objective and metric(s) used for model assessment
- Define the search space (parameters, hyperparameters) for those models
- Define the search method (optimization)
- Implement a validation technique and collect the corresponding performance metrics
- Evaluate the chosen model on unseen data
- Interpret the results
Here is a more concrete example:
- Define the problem: predict the probability of heart disease given a set of features
- Select the model(s) to be used: ridge regression (main model), standard regression with no penalty (baseline)
- Define the objective and performance metric(s): RMSE, R-squared
- Define the search space (parameters, hyperparameters) for those models: ridge penalty parameter
- Define the search method (optimization): grid search
- Implement some sort of cross-validation technique: 5-fold cross-validation
- Evaluate the results on unseen data: RMSE on test data
- Interpret the results: the ridge regression model performed better than the baseline model, and the coefficients tell us something about the nature of the relationship between the features and the target
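As a rough sketch of how those steps might translate to code, here is a minimal, hypothetical version using [sklearn]{.pack}, assuming a feature matrix `X` and target `y` are already defined. It's only meant to show the shape of the workflow; we'll do the real thing in the sections that follow.
```{python}
#| label: workflow-sketch-py
#| eval: false
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

# hold out unseen data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# search space: the ridge penalty parameter
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

# search method: grid search with 5-fold cross-validation
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)

# evaluate the chosen model on the unseen test data
test_rmse = np.sqrt(mean_squared_error(y_test, grid.predict(X_test)))

# baseline: standard regression with no penalty
baseline_rmse = np.sqrt(mean_squared_error(
    y_test, LinearRegression().fit(X_train, y_train).predict(X_test)
))
```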
As we go along in this chapter, we'll see most of this in action. So let's get to it!
## Data Setup {#sec-ml-common-data-setup}
```{r}
#| eval: true
#| echo: false
#| label: load-heart-disease-data
# df_income = read_csv("data/census_income.csv", na = c("", "NA", '?'))
df_heart = read_csv("data/heart_disease_processed.csv") |>
mutate(
male = factor(ifelse(male == 1, 'yes', 'no'), levels = c('no', 'yes')),
across(where(is.character), as.factor)
)
df_heart_num = read_csv("data/heart_disease_processed_numeric.csv")
prevalence = mean(df_heart_num$heart_disease)
majority = pmax(prevalence, 1 - prevalence)
```
For our demonstration here, we'll use the heart disease dataset. This is a popular ML binary classification problem, where we want to predict whether a patient has heart disease given information such as age, sex, resting heart rate, etc. (@sec-dd-heart-disease-uci).
*There are two forms of the data that we'll use*: one mostly in raw form, and one that is purely numeric, where the categorical features are dummy coded and the numeric features have been standardized (@sec-data-transformations). The purely numeric version will allow us to forgo any additional data processing for some model/package implementations (like penalized regression). We have also dropped the handful of rows with missing values, even though some techniques, like tree-based models, naturally handle missing values. This form of the data will allow us to use any model and make direct comparisons among them later.
:::{.panel-tabset}
##### Python
For python we'll go ahead and do all the imports needed for this chapter.
```{python}
#| label: setup-imports-py
# Basic data packages
import pandas as pd
import numpy as np
# Models
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier
# Metrics and more
from sklearn.model_selection import (
cross_validate, RandomizedSearchCV, train_test_split
)
from sklearn.metrics import accuracy_score
from sklearn.inspection import PartialDependenceDisplay
```
```{python}
#| label: setup-data-py
df_heart = pd.read_csv('https://tinyurl.com/heartdiseaseprocessed')
df_heart_num = pd.read_csv('https://tinyurl.com/heartdiseaseprocessednumeric')
# convert appropriate features to categorical
non_num_cols = df_heart.select_dtypes(exclude='number').columns
df_heart[non_num_cols] = df_heart[non_num_cols].astype('category')
X = df_heart_num.drop(columns=['heart_disease']).to_numpy()
y = df_heart_num['heart_disease'].to_numpy()
prevalence = np.mean(y)
majority = np.max([prevalence, 1 - prevalence])
```
##### R
```{r}
#| label: setup-data-r
library(tidyverse)
df_heart = read_csv('https://tinyurl.com/heartdiseaseprocessed') |>
mutate(across(where(is.character), as.factor))
df_heart_num = read_csv('https://tinyurl.com/heartdiseaseprocessednumeric')
# for use with mlr3
X = df_heart_num |>
as_tibble() |>
mutate(heart_disease = factor(heart_disease)) |>
janitor::clean_names() # remove some symbols
prevalence = mean(df_heart_num$heart_disease)
majority = pmax(prevalence, 1 - prevalence)
```
:::
In this data, roughly `r scales::percent(prevalence)` of patients suffered from heart disease, so if we're interested in accuracy, we could get `r scales::percent(majority)` correct just by guessing the majority class of no disease. Hopefully we can do better than that!
One last thing: as we go along, performance metrics will vary depending on your setup (e.g., Python vs. R), package versions, and other factors. As such, your results may not look exactly like these, and that's okay! Your results should still be similar, and the important thing is to understand the concepts and how to apply them to your own data.
## Beat the Baseline {#sec-ml-common-baseline}
```{r}
#| eval: false
#| label: beat-the-baseline
#| echo: false
#|
l = c(NA, 'img/me_for_web.jpeg', 'img/seth.png')
library(ggimage)
performance_data = tibble(
model = c("Baseline", "Model MC", "Model SB"),
accuracy = c(0.75, 0.85, 0.88),
image = l # Path to the image file
)
# Create the bar chart
ggplot(performance_data, aes(x = model, y = accuracy, fill = model)) +
geom_col(width=.1, alpha = .5, color = NA, show.legend = FALSE) +
geom_point(data = performance_data |> slice(1), size = 10, alpha = 1, show.legend = FALSE) +
geom_image(aes(image=image), size = .075) +
scale_fill_manual(values = unname(okabe_ito)) +
coord_flip() +
labs(
title = "Beat the Baseline!",
x = "",
y = "Performance (Awesome Units)"
)
ggsave("img/ml-beat_the_baseline.svg", width = 8, height = 6, dpi = 300)
```
![Hypothetical Model Comparison](img/ml-beat_the_baseline.svg){width=75%}
Before getting carried away with models, we should have a good reference point for performance - a **baseline model**. The baseline model should serve as a way to gauge how much better your model performs over one that is simpler, probably more computationally efficient, more interpretable, and is still *viable*. It could also be a model that is sufficiently complex to capture something about the data you are exploring, but not as complex as the models you're also interested in.
Take a classification model for example. In this case we might use a logistic regression as a baseline. It is a viable model to begin answering some questions, and get a sense of performance possibilities, but it is often too simple to be adequately performant for many situations. We should be able to do better with more complex models, or if we can't, there is little justification for using them.
### Why do we do this? {#sec-ml-common-baseline-why}
Having a baseline model can help you avoid wasting time and resources implementing more complex tools, and keep you from mistakenly thinking performance is better than it really is. It is probably rare, but sometimes the relationships between the chosen features and target are mostly linear with little interaction. In that case, no amount of fancy modeling will make complex feature-target relationships exist if they don't already. Furthermore, if our baseline is itself a reasonably complex model that incorporates nonlinear relationships and interactions (e.g., a GAMM), you'll often find that more complex models don't significantly improve on it. As a last example, in time series settings, a *moving average* can often be a difficult baseline to beat, and so can be a good starting point.
So in general, you may find that the initial baseline model is good enough for present purposes, and you can then move on to other problems to solve, like acquiring data that is more predictive. This is especially true if you are working in a situation with limited time and resources, but it should be kept in mind generally.
### How much better? {#sec-ml-common-baseline-how-much}
In many settings, it often isn't enough to merely beat the baseline model; your model should perform *statistically* better. For instance, if your advanced model's accuracy is 75% and your baseline model's accuracy is 73%, that's great. But it's good to check whether this 2% difference is statistically significant. Remember, accuracy and other metrics are *estimates* and come with uncertainty[^ifonlystatdiff]. This means you can get an interval estimate for them, as well as test whether they differ from one another. @tbl-prop-test-r shows an example comparison of 75% vs. 73% accuracy at different sample sizes. If the difference is not statistically significant, you should possibly stick with the baseline model, or try a different approach to compete with it, because the next time you run the model on new data, the baseline may actually perform better, or at least you can't be sure that it won't.
[^ifonlystatdiff]: There would be far less hype and wasted time if those in ML and DL research simply did this rather than just reporting the chosen metric of their model 'winning' against other models. It's not that hard to do, yet many do not provide a ranged estimate for their metric, let alone test statistical comparisons among models. You don't even have to bootstrap many common metric estimates for binary classification, since they are just proportions. It'd also be nice if they used a more meaningful baseline than logistic regression, but that's a different story. And one more thing, although many papers also rank the competing models, ranks and mean ranks also have uncertainty, and ranks are typically very noisy.
```{r}
#| echo: false
#| label: tbl-prop-test-r
#| tbl-cap: Interval Estimates for Accuracy
# Sample size of 1000
n_1000 = prop.test(x = c(750, 730), n = c(1000, 1000), correct = FALSE)$conf.int
# Sample size of 10000
n_10000 = prop.test(x = c(7500, 7300), n = c(10000, 10000), correct = FALSE)$conf.int
# Test for difference in two values at sample size of 1000
n_1000_diff = prop.test(x = c(750, 730), n = c(1000, 1000), correct = FALSE)$p.value
# Test for difference in two values at sample size of 10000
n_10000_diff = prop.test(x = c(7500, 7300), n = c(10000, 10000), correct = FALSE)$p.value
# Combine results
tibble(
Sample_Size = c("1000", "10000"),
Lower = c(n_1000[1], n_10000[1]),
Upper = c(n_1000[2], n_10000[2]),
p_value = c(n_1000_diff, n_10000_diff)
) |>
gt() |>
cols_label(Sample_Size = "Sample Size", Lower = "Lower Bound", Upper = "Upper Bound", p_value = "p-value") |>
tab_footnote(
md("Confidence intervals are for the difference in proportions at values of .75 and .73,\n\nand p-values are for the difference in proportions.")
) |>
tab_options(
footnotes.font.size = 9
)
```
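If you want a quick way to run this kind of check yourself, here is a minimal sketch of a two-proportion z-test based on the normal approximation (the sample size and accuracies are the hypothetical values from the table).
```{python}
#| label: accuracy-diff-sketch-py
#| eval: false
import numpy as np
from scipy import stats

n = 1000                    # hypothetical test set size for both models
p1, p2 = 0.75, 0.73         # accuracies to compare

# pooled two-proportion z-test (no continuity correction)
p_pool = (p1 + p2) / 2      # equal n, so the pooled proportion is just the average
se_pool = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se_pool
p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value

# 95% confidence interval for the difference in accuracy
se_diff = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
ci = ((p1 - p2) - 1.96 * se_diff, (p1 - p2) + 1.96 * se_diff)
```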
That said, in some situations *any* performance increase is worth it, and even if we can't be certain a result is statistically better, any sign of improvement is worth pursuing. For example, if you are trying to predict the next word in a sentence, and your baseline is 70% accurate while your new model is 72% accurate, that may be meaningful in terms of user experience. You should still try to show that this is a consistent increase and not a fluke, if possible. In other settings, you'll need to make sure the cost is worth it. Is 2% worth millions of dollars? Six months of research? These are among the many practical considerations you may have to make as well.
## Penalized Linear Models {#sec-ml-common-penalized}
So let's get on with some models already! Let's use the classic linear model as our starting point for ML. We show explicitly how to estimate models like lasso and ridge regression in @sec-estim-penalty. Those work well as a baseline, and so should be in your ML modeling toolbox.
### Elastic Net {#sec-ml-common-elasticnet}
Another common linear model approach is **elastic net**, which we also saw in @sec-ml-core-concepts. It combines two techniques: lasso and ridge regression. We demonstrate the lasso and ridge penalties in @sec-estim-penalty, but all you have to know is that elastic net combines the two penalties, one for lasso and one for ridge, with a standard objective function for a numeric or categorical target. The relative proportion of the two penalties is controlled by a mixing parameter, whose optimal value is determined by cross-validation. So, for example, you might end up with a 75% lasso penalty and a 25% ridge penalty. In the end though, it's just a slightly fancier logistic regression!
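For reference, the elastic net objective is often written as follows (shown here in the glmnet-style parameterization with a linear regression loss; the classification case just swaps in the log loss):

$$
\underset{\beta}{\text{minimize}} \quad \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - x_i^\top\beta\right)^2 + \lambda\left(\alpha\sum_j \lvert\beta_j\rvert + \frac{1-\alpha}{2}\sum_j \beta_j^2\right)
$$

Here $\lambda$ is the overall penalty strength and $\alpha$ is the mixing parameter: $\alpha = 1$ gives the lasso, $\alpha = 0$ gives ridge, and something like $\alpha = 0.75$ gives the 75%/25% mix just mentioned.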
Let's apply this to the heart disease data. We are only doing simple cross-validation here to get a better performance assessment, but you are more than welcome to tune both the penalty parameter and the mixing ratio as we have demonstrated before (@sec-ml-tuning). We'll revisit hyperparameter tuning towards the end of this chapter.
:::{.panel-tabset}
##### Python
```{python}
#| label: elasticnet-py
#| eval: false
model_elastic = LogisticRegression(
penalty = 'elasticnet',
solver = 'saga',
l1_ratio = 0.5,
random_state = 42,
max_iter = 10000,
verbose = False,
)
# use cross-validation to estimate performance
model_elastic_cv = cross_validate(
model_elastic,
X,
y,
cv = 5,
scoring = 'accuracy',
)
# pd.DataFrame(model_elastic_cv) # default output
```
```{python}
#| label: elasticnet-py-save-results
#| echo: false
#| eval: false
pd.DataFrame(model_elastic_cv).to_csv('ml/data/elasticnet-py-results.csv', index=False)
```
```{python}
#| label: elasticnet-py-print-results
#| echo: false
model_elastic_cv = pd.read_csv('ml/data/elasticnet-py-results.csv')
print(
'Training accuracy: ',
np.round(model_elastic_cv['test_score'].mean(), 3),
'\nGuessing: ',
np.round(majority, 3),
)
```
##### R
```{r}
#| label: elasticnet-r
#| eval: false
library(mlr3verse)
tsk_elastic = as_task_classif(
X,
target = "heart_disease"
)
model_elastic = lrn(
"classif.cv_glmnet",
nfolds = 5,
type.measure = "class",
alpha = 0.5
)
model_elastic_cv = resample(
task = tsk_elastic,
learner = model_elastic,
resampling = rsmp("cv", folds = 5)
)
# model_elastic_cv$aggregate(msr('classif.acc')) # default output
```
```{r}
#| label: elasticnet-r-save-results
#| echo: false
#| eval: false
saveRDS(model_elastic_cv, 'ml/data/elasticnet-r-results.rds')
```
```{r}
#| label: elasticnet-r-print-results
#| echo: false
library(mlr3verse)
library(glue)
model_elastic_cv = readRDS('ml/data/elasticnet-r-results.rds')
# Evaluate
acc_inc = round(model_elastic_cv$aggregate(msr('classif.acc')) / majority - 1, 3)
glue("Training Accuracy: {round(model_elastic_cv$aggregate(msr('classif.acc')), 3)}\nGuessing: {round(majority, 3)}")
```
:::
So we're starting off with what seems to be a good model. Our average accuracy across the validation sets is definitely doing better than guessing, with a performance increase of more than `r scales::label_percent()(round(acc_inc, 1))`!
### Strengths & weaknesses {#sec-ml-common-penalized-strengths-weaknesses}
Let's take a moment to consider the strengths and weaknesses of penalized regression models.
**Strengths**
- Intuitive approach. In the end, it's still just a standard regression model you're already familiar with.
- Widely used for many problems. Lasso/Ridge/ElasticNet would be fine to use in any setting you would use linear or logistic regression.
- A good baseline for tabular data problems.
**Weaknesses**
- Does not automatically seek out interactions and non-linearity, and as such will generally not be as predictive as other techniques.
- Features have to be scaled, or the penalty will affect them unevenly and results will largely reflect the features' scales.
- May have interpretability issues with correlated features.
- Relatively weaker performance compared to other models, especially in high-dimensional settings.
### Additional thoughts {#sec-ml-common-penalized-additional}
Using penalized regression is a very good default method in the tabular data setting, and is something to strongly consider for more interpretation-focused model settings. These approaches predict better on new data than their standard, non-regularized counterparts, so they provide a nice balance between interpretability and predictive power. However, in general they are not as strong a method as others typically used in the machine learning world, and may not even be competitive without a lot of feature engineering. If prediction is all you care about, you'll likely need something else. Now let's see if we can do better with other models!
## Tree-based Models {#sec-ml-common-trees}
Let's move beyond standard linear models and get into a notably different type of approach. Tree-based methods are a class of models that are very popular in machine learning contexts, and for good reason: they work *very* well. To get a sense of how they work, consider the following classification example where we want to predict a binary target as 'Yes' or 'No'.
```{r}
#| echo: false
#| eval: false
#| label: fig-tree-graph-setup
g = DiagrammeR::grViz('img/tree.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/tree.svg")
```
![A simple classification tree](img/tree.svg){#fig-tree width=40%}
We have two numeric features, $X_1$ and $X_2$. At the start, we take $X_1$ and make a split at the value of 5. Any observation less than 5 on $X_1$ goes to the right with a prediction of *No*. Any observation greater than or equal to 5 goes to the left, where we then split based on values of $X_2$. Any observation less than 3 goes to the right with a prediction of *Yes*. Any observation greater than or equal to 3 goes to the left with a prediction of *No*. So in the end, we see that an observation that is relatively lower on $X_1$, or relatively higher on both, results in a prediction of *No*. On the other hand, an observation that is high on $X_1$ and low on $X_2$ results in a prediction of *Yes*.
This is a simple example, but it illustrates the core idea of a tree-based model, where the **tree** reflects the total process, and **branches** are represented by the splits going down, ultimately ending at **leaves** where predictions are made. We can also think of the tree as a series of `if-then` statements, where we start at the top and work our way down until we reach a leaf node, which is a prediction for all observations that qualify for that leaf.
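To make that concrete, the tree in @fig-tree boils down to a handful of if-then rules, which we could write out directly (just a sketch mirroring the figure, not something you'd do in practice):
```{python}
#| label: tree-as-rules-py
#| eval: false
def predict(x1, x2):
    # first split: X1 at 5
    if x1 < 5:
        return 'No'
    # second split (only reached when X1 >= 5): X2 at 3
    elif x2 < 3:
        return 'Yes'
    else:
        return 'No'

predict(7, 2)  # 'Yes': high on X1, low on X2
```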
A single tree is likely the most interpretable model we could come up with. Furthermore, it incorporates nonlinearities through multiple branches on a single feature, interactions by branching across different features, and feature selection by excluding features that do not result in useful splits for the objective, all in one.
However, a single tree is not a very stable model unfortunately, and so does not generalize well. For example, just a slight change in data, or even just starting with a different feature, might produce a very different tree[^cartbase]. Even though predictions could be similar, model interpretation would be very different.
[^cartbase]: A single regression/classification tree actually could serve as a decent baseline model, especially given the interpretability, and modern methods try to make them more stable.
The solution to that problem is straightforward though - by using the power of a bunch of trees, we can get predictions for each observation from each tree, and then average the predictions, resulting in a much more stable estimate. This is the concept behind both **random forests** and **gradient boosting**, which can be seen as different algorithms to produce a bunch of trees. They are also considered types of **ensemble models**, which are models that combine the predictions of multiple models, to ultimately produce a single prediction for each observation. In this case each tree serves as a model.
Random forests (RF) and boosting methods (GB) are very easy to implement, to a point. However, there are typically several hyperparameters to consider for tuning. Here are just a few to think about:
- Number of trees
- Learning rate (GB)
- Maximum **depth** of each tree
- Minimum number of observations in each leaf
- Number of features to consider at each tree/split
- Regularization parameters (GB)
- Out-of-bag sample size (RF)
The number of trees is simply how many trees you want to build, and is a key parameter setting for both RF and GB. For boosting models, the number of trees and learning rate play off of each other. Having more trees allows for a smaller rate[^gblr], which might improve the model but will take longer to train. However, it can lead to overfitting if other steps are not taken.
[^gblr]: For boosting models, the learning rate is a scaling factor for the contribution of each tree to the overall model. A smaller learning rate means that each tree contributes less to the overall model, and so you'll need more trees to get the same performance, all else being equal.
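To make the interplay between the learning rate and the number of trees concrete, a boosted ensemble's prediction can be roughly written as a sum of each tree's contribution, scaled by the learning rate $\eta$:

$$
\hat{f}(x) = f_0(x) + \eta \sum_{m=1}^{M} t_m(x)
$$

where $f_0$ is an initial (constant) prediction, $M$ is the number of trees, and $t_m$ is the $m$th tree. Shrinking $\eta$ reduces each tree's contribution, which is why a smaller learning rate typically needs to be paired with more trees.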
The depth of each tree refers to how many levels we allow the model to branch out, and is a crucial parameter. It controls the complexity of each tree, and thus the complexity of the overall model: less depth helps to avoid overfitting, but if the tree is too shallow, you won't be able to capture the nuances of the data. The minimum number of observations required for each leaf is also important for similar reasons - a lower number will allow for more complex trees, while a higher number will result in simpler trees.
It's also generally a good idea to take a random sample of features for each tree (or possibly even each branch), to also help reduce overfitting, but it's not obvious what proportion to take. The regularization parameters[^gbl1l2] are typically less important in practice, but can help reduce overfitting as in other modeling circumstances we've talked about. As with hyperparameters in other model settings, you'll use something like cross-validation to settle on final values.
[^gbl1l2]: For boosting models, the regularization parameters are basically penalties on the weights of the leaves. For example, a smaller value would reduce the contribution of that leaf to the overall model, and so would help to reduce overfitting.
### Example with LightGBM {#sec-ml-common-trees-lightgbm}
Here is an example of gradient boosting with the heart disease data. We'll explicitly set some of the parameters, and use 5-fold cross-validation to estimate performance.
:::{.panel-tabset}
##### Python
Although boosting methods are available in [scikit-learn]{.pack} for Python, in general we recommend using the [lightgbm]{.pack} or [xgboost]{.pack} packages directly for boosting, as both have an sklearn API (as demonstrated). Both also provide R and Python implementations, so you don't lose your place when switching between languages. We'll use [lightgbm]{.pack} here[^nomeow].
[^nomeow]: Some also prefer [catboost]{.pack}. Your humble authors have not actually been able to practically implement catboost in a setting where it was more predictive or as efficient/speedy as [xgboost]{.pack} or [lightgbm]{.pack} to get to the same performance level, but some have had notable success with it.
```{python}
#| label: boost-py
#| results: hide
#| eval: false
model_boost = LGBMClassifier(
n_estimators = 1000,
learning_rate = 1e-3,
max_depth = 5,
verbose = -1,
random_state=42,
)
model_boost_cv = cross_validate(
model_boost,
df_heart.drop(columns='heart_disease'),
df_heart['heart_disease'],
cv = 5,
scoring='accuracy',
)
# pd.DataFrame(model_boost_cv)
```
```{python}
#| label: boost-py-save-results
#| echo: false
#| eval: false
pd.DataFrame(model_boost_cv).to_csv('ml/data/boost-py-results.csv', index=False)
```
```{python}
#| label: boost-py-print-results
#| echo: false
model_boost_cv = pd.read_csv('ml/data/boost-py-results.csv')
print(
'Training accuracy: ',
np.round(np.mean(model_boost_cv['test_score']), 3),
'\nGuessing: ',
np.round(majority, 3),
)
```
##### R
Note that as of this writing, [mlr3]{.pack} requires one of its extension packages for its implementation of lightgbm, so we'll use the [mlr3extralearners]{.pack} package.
```{r}
#| eval: false
#| label: boost-r
library(mlr3verse)
# for lightgbm, you need mlr3extralearners and lightgbm package installed
# it is available from github via:
# remotes::install_github("mlr-org/mlr3extralearners@*release")
library(mlr3extralearners)
set.seed(42)
# Define task
# For consistency we use X, but lgbm can handle factors and missing data
# and so we can use the original df_heart if desired
tsk_boost = as_task_classif(
df_heart, # can use the 'raw' data
target = "heart_disease"
)
# Define learner
model_boost = lrn(
"classif.lightgbm",
num_iterations = 1000,
learning_rate = 1e-3,
max_depth = 5
)
# Cross-validation
model_boost_cv = resample(
task = tsk_boost,
learner = model_boost,
resampling = rsmp("cv", folds = 5)
)
```
```{r}
#| label: boost-r-save-results
#| echo: false
#| eval: false
# Note: mlr3 can sometimes fail to process factor targets (which it requires) appropriately
# this appears to have been a repeated issue https://github.com/mlr-org/mlr3/issues/91
saveRDS(model_boost_cv, "ml/data/boost-r-results.rds")
```
```{r}
#| label: boost-r-print-results
#| echo: false
model_boost_cv = readRDS("ml/data/boost-r-results.rds")
# Evaluate
glue("Training Accuracy: {round(model_boost_cv$aggregate(msr('classif.acc')), 3)}\nGuessing: {round(majority, 3)}")
```
:::
So here we have a model that is also performing well, though not significantly better or worse than our elastic net model. For most tabular data situations, we'd expect boosting to do better, but this shows why we want a good baseline or simpler model for comparison. We'll revisit hyperparameter tuning using this model later.
<!-- If you'd like to see an example of how we could implement a form boosting by hand, see @app-boosting. -->
<!-- TODO: ADD GBLINEAR BY HAND TO APPENDIX for web version in future -->
### Strengths & weaknesses {#sec-ml-common-trees-strengths-weaknesses}
Random forests and boosting methods, though not new, are still 'state of the art' in terms of performance on tabular data like the type we've been using for our demos here. You'll often find that it takes considerable effort to beat them.
**Strengths**
- A single tree is highly interpretable.
- Relatively good prediction out of the box.
- Easily incorporates features of different types (the scale of numeric features, or using categorical features, doesn't matter).
- Tolerance to irrelevant features.
- Some tolerance to correlated inputs.
- Handling of missing values. Missing values are just another value to potentially split on[^defaultmiss].
[^defaultmiss]: It's not clear why most model functions still have no default for this sort of thing in `r lubridate::year(Sys.Date())`. Is it that hard to drop or impute them with an informative message?
**Weaknesses**
- Honestly few, but like all techniques, it might be relatively less predictive in certain situations. There is [no free lunch](https://machinelearningmastery.com/no-free-lunch-theorem-for-machine-learning/).
- It does take more effort to tune relative to linear model methods.
## Deep Learning and Neural Networks {#sec-ml-common-dl-nn}
```{r}
#| echo: false
#| eval: false
#| label: nn-graph-setup
g = DiagrammeR::grViz('img/nn_basic.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_basic.svg")
```
![A neural network](img/nn_basic.svg){#fig-basic-nn width=50%}
**Deep learning has fundamentally transformed the world of data science, and, in many ways, the world itself**. It has been used to solve problems in image detection, speech recognition, natural language processing, and more, from assisting with cancer diagnosis, to writing entire novels, providing self-driving cars, and even helping the formerly blind see. It is an extremely powerful tool.
For tabular data, however, the story is a bit different. Here, deep learning has consistently struggled to outperform models like boosting, and even penalized regression in many cases. But while it is not always the best tool for the job, it should be in your modeling toolbox, if only because it can potentially be the most performant model, and may well become the dominant model for tabular data in the future. Here we'll provide a brief overview of the key concepts behind neural networks, the approach underlying deep learning, and then demonstrate how to implement a simple neural network to get things started.
### What is a neural network? {#sec-ml-common-nnet}
**Neural networks** form the basis of deep learning models. They have actually been around a while - [computationally and conceptually going back decades](https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history1.html)[^neuralbasis]^,^ [^cogneuron]. Like other models, they are computational tools that help us understand how to get outputs from inputs. However, they weren't quickly adopted due to computing limitations, similar to the slow adoption of Bayesian methods. But now, neural networks, or deep learning more generally, have become the go-to method for many problems.
[^neuralbasis]: Most consider the scientific origin with @mcculloch_logical_1943.
[^cogneuron]: On the conceptual side, they served as a rudimentary model of neuronal functioning in the brain, and a way to understand how the brain processes information. The models sprung from the cognitive revolution, a backlash against the behaviorist approach to psychology, and used the computer as [a metaphor for how the brain might operate](https://en.wikipedia.org/wiki/Connectionism).
### How do they work? {#sec-ml-common-nnet-how}
At its core, a neural network can be seen as a series of matrix multiplications and other operations to produce combinations of features, and ultimately a desired output. We've been talking about inputs and outputs since the beginning (@sec-lm-relationships), but neural networks like to put a lot more in between the inputs and outputs than we've seen with other models. However, the key operations are often no different than what we've done with a basic linear model, and sometimes even simpler! But the combinations of features they produce can represent many aspects of the data that are not easily captured by simpler models.
One notable difference from models we've been seeing is that neural networks implement *multiple combinations of features*, where each combination is referred to as a hidden **node** or unit[^hidden]. In a neural network, each feature has a weight (or coefficient), just like in a linear model of the type we've used before. These features are multiplied by their weights and then added together. But we actually create multiple such combinations, as depicted in the 'H' or 'hidden' nodes in the following visualization.
[^hidden]: The term 'hidden' is used because these nodes sit between the inputs and the output. It does not imply a latent/hidden variable in the sense used in structural equation or measurement models, state space models, and similar, though there is a lot of common ground. See the connection with principal components analysis for example (@sec-ml-more-pca-as-net).
```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-setup
g = DiagrammeR::grViz('img/nn_first_hidden_layer.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_first_hidden_layer.svg")
```
![The first hidden layer](img/nn_first_hidden_layer.svg){#fig-first-layer width=25%}
The next phase is where things can get more interesting. We take those hidden units and add in nonlinear transformations before moving deeper into the network. The transformations applied are typically referred to as **activation functions**[^relu]. So, the output of the current (typically linear) part is transformed in a way that allows the model to incorporate nonlinearities. While this might sound new, this is just like how we use link functions in generalized linear models (@sec-glm-distributions). Furthermore, these multiple combinations also allow us to incorporate interactions between features.
[^relu]: We have multiple options for our activation functions, and probably the most common activation function in deep learning is the **rectified linear unit** or [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), and its more recent variants. Others used include the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), which is exactly the same as what we used in logistic regression, the hyperbolic tangent function, and of course the linear/identity function, which does not do any transformation at all.
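As a quick illustration, these activation functions are just simple elementwise transformations of a node's output (a small [numpy]{.pack} sketch):
```{python}
#| label: activation-sketch-py
#| eval: false
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example node outputs

relu     = np.maximum(0, z)       # rectified linear unit
sigmoid  = 1 / (1 + np.exp(-z))   # same as the inverse logit in logistic regression
tanh     = np.tanh(z)             # hyperbolic tangent
identity = z                      # linear/identity, i.e., no transformation
```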
```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-activation
g = DiagrammeR::grViz('img/nn_activation2.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_activation2.svg")
```
![Activation function applied to a node's output before passing it as input to next layer](img/nn_activation2.svg){#fig-activate width=33%}
But we can go even further! We can add more layers, and more nodes in each layer, even different types of layers, to create a **deep neural network**. We can also add components specific to certain types of processing, have some parts of the network only connected to certain other parts, apply specific computations to specific components, and more. The complexity really is only limited by our imagination, *and computational capacity*! This is what helps make neural networks so powerful - given enough nodes, layers, and components, they can approximate ***any*** function, which could include the true function that connects our features to the target. Practically though, the feature inputs become an output or multiple outputs that can then be assessed in the same ways as other models.
```{r}
#| echo: false
#| eval: false
#| label: nn-deep-graph-setup
g = DiagrammeR::grViz('img/nn_deep2.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_deep.svg")
```
![A more complex neural network](img/nn_deep.svg){#fig-complex-nn width=75%}
Before getting too carried away, let's simplify things a bit by returning to some familiar ground. Consider a logistic regression model. There we take the linear combination of features and weights, apply the sigmoid function (inverse logit) to it, and that output is what we compare to our observed target when calculating the objective function.
We can revisit a plot we saw earlier (@fig-graph-logistic) to make things more concrete. The input features are $X_1$, $X_2$, and $X_3$, and the output is the probability of a positive outcome of a binary target. The weights are $w_1$, $w_2$, and $w_3$, and the bias[^biascs] is $w_0$. The hidden node is just our linear predictor which we can create via matrix multiplication of the feature matrix and weights. The sigmoid function is the activation function, and the output is the probability of the chosen label.
[^biascs]: It's not exactly clear [why computer scientists chose to call this the bias](https://stats.stackexchange.com/questions/511726/different-usage-of-the-term-bias-in-stats-machine-learning), but it's the same as the intercept in a linear model, or conceptually as an offset or constant. It has nothing to do with the word bias as used in every other modeling context.
```{r}
#| echo: false
#| eval: false
#| label: logregnn-graph-setup
g = DiagrammeR::grViz('img/nn_logreg.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_logreg.svg")
```
![A logistic regression as a neural network with a single hidden layer with one node, and sigmoid activation](img/nn_logreg.svg){#fig-logreg-nn width=50%}
This shows that we can actually think of logistic regression as a very simple neural network, with a linear combination of the inputs as a single hidden node and a sigmoid activation function adding the nonlinear transformation. Indeed, the earliest **multilayer perceptron** models were just composed of multiple layers of logistic regressions!
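Here is a small [numpy]{.pack} sketch of that idea: a logistic regression 'forward pass', and then the same thing with one hidden layer added (the weights are made up purely for illustration):
```{python}
#| label: logreg-as-nn-sketch-py
#| eval: false
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(5, 3))     # 5 observations, 3 features

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# logistic regression: a single linear combination + sigmoid activation
w = rng.normal(size=3)          # weights (coefficients)
w0 = 0.1                        # bias (intercept)
p_logreg = sigmoid(X @ w + w0)  # predicted probabilities

# add one hidden layer with 4 nodes and ReLU activation
W1 = rng.normal(size=(3, 4))    # input-to-hidden weights
W2 = rng.normal(size=4)         # hidden-to-output weights
H = np.maximum(0, X @ W1)       # hidden layer outputs after ReLU
p_mlp = sigmoid(H @ W2 + w0)    # predicted probabilities
```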
:::{.callout-note title="GAMs and Neural Networks" collapse='true'}
You can think of neural networks as nonlinear extensions of linear models. Regression approaches like GAMs and Gaussian process regression can be seen as [approximations to neural networks](https://arxiv.org/abs/1711.00165) (see also @rasmussen_gaussian_2005), bridging the gap between the simpler, more interpretable linear model and the black box of a deep neural network. This brings us back to having a good baseline: if you know some simpler tools that can approximate more complex ones, you can often get 'good enough' results with the simpler models.
:::
### Trying it out {#sec-ml-common-dl-nn-try}
The neural network model we'll use is a **multilayer perceptron** (MLP), which is a model like the one we've been showing. It consists of multiple hidden layers of potentially varying sizes, and we can incorporate activation functions as we see fit.
:::{.callout-note title='More on neural networks for tabular data' collapse='true'}
Be aware that this would be considered a bare minimum approach for a neural network, and generally you'd need to do more, even for standard tabular data. To begin with, you'd want to tune the **architecture**, or structure of hidden layers. For example, you might want to try more layers, as well as 'wider' layers, or more nodes per layer. Also, we'd usually want to use **embeddings** for categorical features as opposed to the one-hot approach used here (@sec-data-cat)[^fastaitab].
[^fastaitab]: A really good tool for a standard MLP type approach with automatic categorical embeddings is [fastai]{.pack}'s tabular learner. For a more flexible, roll-your-own type of approach, consider the recent [torch_frame]{.pack}.
:::
For our demo, we'll use the numeric heart disease data with one-hot encoded categorical features. For our architecture, we'll use three hidden layers with 200 nodes each. As noted, these and other settings are hyperparameters that you'd normally prefer to tune, but we'll just set them as fixed parameters.
:::{.panel-tabset}
##### Python
For our demonstration we'll use [sklearn]{.pack}'s built-in `MLPClassifier`. We set the initial learning rate to 0.001 and request an **adaptive learning rate**, which automatically adjusts the learning rate as the model trains (this setting, like **Nesterov momentum**, only applies if the SGD solver is used; the default solver is Adam, an SGD variant). The ReLU activation function is the default. We use a **warm start**, which allows us to train the model in stages by reusing a previous fit, and set the **validation fraction**, the proportion of data set aside as a validation set for early stopping. Finally, we use **shuffle** to shuffle the observations used in each batch during the SGD-style optimization (@sec-estim-opt-algos-sgd).
```{python}
#| label: deep-py
#| eval: false
model_mlp = MLPClassifier(
hidden_layer_sizes = (200, 200, 200),
learning_rate = 'adaptive',
learning_rate_init = 0.001,
shuffle = True,
random_state = 123,
warm_start = True,
nesterovs_momentum = True,
validation_fraction = .2,
verbose = False,
)
# with the above settings, this will take a few seconds
model_mlp_cv = cross_validate(
model_mlp,
X,
y,
cv = 5
)
# pd.DataFrame(model_mlp_cv) # default output
```
```{python}
#| echo: false
#| eval: false
#| label: deep-py-save-results
pd.DataFrame(model_mlp_cv).to_csv('ml/data/deep-py-results.csv', index=False)
```
```{python}
#| label: deep-py-print-results
#| echo: false
model_mlp_cv = pd.read_csv('ml/data/deep-py-results.csv')
print(
'Training accuracy: ',
np.round(np.mean(model_mlp_cv['test_score']), 3),
'\nGuessing: ',
np.round(majority, 3),
)
```
##### R
For R, we'll use [mlr3torch]{.pack}, which is built on [torch]{.pack} (the R interface to the PyTorch backend), with the same architecture as the Python example. The **ReLU** activation function is the default. We'll use the **adam** SGD variant as the optimizer, a popular choice in deep learning models and the default for the [sklearn]{.pack} approach, and **cross entropy** as the loss function, which is the same as the log loss objective used in logistic regression and other ML classification models. We use a **batch size** of 16, the number of observations used in each [batch of training](https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network), and 50 **epochs**, the number of passes over the entire dataset (probably way more than necessary). We set the **predict type** to **prob** so that we get predicted probabilities (required by log loss), and track both **log loss** and **classification error** during validation. As specified, this took over a minute to run.
```{r}
#| label: deep-r
#| eval: false
library(mlr3torch)
learner_mlp = lrn(
"classif.mlp",
# defining network parameters
neurons = c(200, 200, 200),
# training parameters
batch_size = 16,
epochs = 50,
# Defining the optimizer, loss, and callbacks
optimizer = t_opt("adam", lr = 1e-3),
loss = t_loss("cross_entropy"),
# Measures to track
measures_train = msrs(c("classif.logloss")),
validate = .1,
measures_valid = msrs(c("classif.logloss", "classif.ce")),
# predict type (required by logloss)
predict_type = "prob",
seed = 123
)
tsk_mlp = as_task_classif(
x = X,
target = 'heart_disease'
)
# this will take a few seconds depending on your chosen settings and hardware
model_mlp_cv = resample(
task = tsk_mlp,
learner = learner_mlp,
resampling = rsmp("cv", folds = 5),
)
model_mlp_cv$aggregate(msr("classif.acc")) # default output
```
```{r}
#| label: deep-r-save-results
#| echo: false
#| eval: false
saveRDS(model_mlp_cv, "ml/data/deep-r-results.rds")
```
```{r}
#| label: deep-r-print-results
#| echo: false
# library(mlr3torch)
model_mlp_cv = readRDS("ml/data/deep-r-results.rds")
glue('Training Accuracy: {round(model_mlp_cv$aggregate(msr("classif.acc")), 3)}\nGuessing: {round(majority, 3)}')
```
:::
This model actually did pretty well, and our accuracy is on par with the other two models. This is somewhat surprising given the nature of the data (a relatively small number of observations with mixed feature types), a situation where neural networks typically don't do as well as other methods. Just goes to show, you never know until you try!
:::{.callout-note title='Deep and Wide' collapse='true'}
A now relatively old question in deep learning is what is the better approach: deep networks, with more layers, or extremely wide (lots of neurons) and fewer layers? The answer is that it can depend on the problem, but in general, deep networks are more efficient and easier to train, and [will generalize better](https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider). Deeper networks have the ability to build upon what the previous layers have learned, basically compartmentalizing different parts of the task to learn. More important to the task is creating an architecture that is able to learn the appropriate aspects of the data, and generalize well.
:::
### Strengths & weaknesses {#sec-ml-common-dl-nn-strengths-weaknesses}
So why might we want to use a neural network for tabular data? The main reason is that they can potentially be the most performant model, capturing complex relationships in the data that other models may miss. They can also be used for a wide variety of data types and tasks. However, they are also the most complex model, and can be the most difficult to tune and interpret.
**Strengths**
- Good prediction generally.
- Incorporates the predictive power of different combinations of inputs.
- Some tolerance to correlated inputs.
- Batch processing and parallelization of many operations makes it very efficient for large datasets.
- Can be used for even standard GLM approaches.
- Can be added as a component to other deep learning models (e.g., LLMs that are handling text input).
**Weaknesses**
- Susceptible to irrelevant features.
- Doesn't consistently outperform other methods that are easier to implement on tabular data.
## A Tuned Example {#sec-ml-common-tuned-ex}
We noted in the chapter on machine learning concepts that there are often multiple hyperparameters we are concerned with for a given model (@sec-ml-tuning). We had hyperparameters for each of the models in this chapter also. For the elastic net model, we might want to tune the penalty parameters and the mixing ratio. For the boosting method, we might want to tune the number of trees, the learning rate, the maximum depth of each tree, the minimum number of observations in each leaf, and the number of features to consider at each tree/split. And for the neural network, we might want to tune the number of hidden layers, the number of nodes in each layer, the learning rate, the batch size, the number of epochs, and the activation function. There is plenty to explore!
Here is an example of a hyperparameter search using the boosting model. We'll tune the number of trees, the learning rate, the minimum number of observations in each leaf, and the maximum depth of each tree. We'll use a **randomized search** across the parameter space to sample from the set of hyperparameters, rather than searching every possible combination as in a **grid search**. This is a good approach when you have a lot of hyperparameters to tune, and/or when you have a lot of data.
:::{.panel-tabset}
##### Python
```{python}
#| label: tune-boost-py
#| eval: false
#| results: hide
# train-test split
X_train, X_test, y_train, y_test = train_test_split(
df_heart.drop(columns='heart_disease'),
df_heart_num['heart_disease'],
test_size = 0.2,
random_state = 42
)
model_boost = LGBMClassifier(
verbose = -1
)
param_grid = {
'n_estimators': [500, 1000],
'learning_rate': [1e-3, 1e-2, 1e-1],
'max_depth': [3, 5, 7, 9],
'min_child_samples': [1, 5, 10],
}
# this will take a few seconds
model_boost_cv_tune = RandomizedSearchCV(
model_boost,
param_grid,
n_iter = 10,
cv = 5,
scoring = 'accuracy',
n_jobs = -1,
random_state = 42
)
model_boost_cv_tune.fit(X_train, y_train)
test_predictions = model_boost_cv_tune.predict(X_test)
accuracy_score(y_test, test_predictions)
```
```{python}
#| label: tune-boost-py-save-results
#| echo: false
#| eval: false
#| results: hide
# save the model_boost_cv best estimator model
import joblib
#save your model or results
joblib.dump(model_boost_cv_tune.best_estimator_, 'ml/data/tune-boost-py-model.pkl')
test_predictions = model_boost_cv_tune.predict(X_test)
pd.DataFrame(model_boost_cv_tune.cv_results_).to_csv(
'ml/data/tune-boost-py-results.csv',
index=False
)
pd.DataFrame(test_predictions).to_csv(
'ml/data/tune-boost-py-predictions.csv',
index=False
)
```
```{python}
#| label: tune-boost-py-print-results
#| echo: false
from sklearn.model_selection import RandomizedSearchCV, train_test_split