# Common Models in Machine Learning {#sec-ml-common-models}
![](img/chapter_gp_plots/gp_plot_7.svg){width=75%}
Before really getting into some machine learning models, let's get one thing straight from the outset: **any model may be used in machine learning**, from a standard linear model to a deep neural network. The key focus in ML is on performance, and generally we'll go with what works for the situation. This means that the modeler is often less concerned with the interpretation of the model, and more with the ability of the model to predict well on new data. But, as we'll see, we can do both if desired. In this chapter, we will explore some of the more common machine learning models and techniques.
```{r}
#| include: False
#| label: setup-ml-models
```
```{python}
#| include: False
#| label: setup-ml-models-py
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_validate, KFold, cross_val_score
from sklearn.metrics import accuracy_score
```
## Key Ideas {#sec-ml-common-key-ideas}
The take home messages from this section include the following:
- Any model can be used with machine learning.
- A good and simple baseline is essential for interpreting your performance results.
- You only need a small set of tools (models) to go very far with machine learning.
### Why this matters {#sec-ml-common-why-matters}
Having the right tools in data science saves time and improves results, and using well-known tools means you'll have plenty of resources for help. It also allows you to focus more on the data and the problem, rather than the details of the model. A simple model might be all you need, but if you need something more complex, these models can still provide a performance benchmark.
### Helpful context {#sec-ml-common-good-to-know}
Before diving in, it'd be helpful to be familiar with the following:
- Linear models, esp. linear and logistic regression (@sec-foundation, @sec-glm)
- Basic machine learning concepts as outlined in @sec-ml-core-concepts
- Model estimation as outlined in @sec-estimation
## General Approach {#sec-ml-common-general-approach}
Let's start with a general approach to machine learning to help us get some bearings. Here is an example outline of the process we could typically take. It incorporates some of the ideas we also cover in other chapters, and we'll demonstrate most of this in the following sections.
- Define the problem, including the target variable(s)
- Select the model(s) to be used, including one baseline model
- Define the performance objective and metric(s) used for model assessment
- Define the search space (parameters, hyperparameters) for those models
- Define the search method (optimization)
- Implement a validation technique and collect the corresponding performance metrics
- Evaluate the chosen model on unseen data
- Interpret the results
Here is a more concrete example:
- Define the problem: predict the probability of heart disease given a set of features
- Select the model(s) to be used: ridge regression (main model), standard regression with no penalty (baseline)
- Define the objective and performance metric(s): RMSE, R-squared
- Define the search space (parameters, hyperparameters) for those models: ridge penalty parameter
- Define the search method (optimization): grid search
- Implement some sort of cross-validation technique: 5-fold cross-validation
- Evaluate the results on unseen data: RMSE on test data
- Interpret the results: the ridge regression model performed better than the baseline model, and the coefficients tell us something about the nature of the relationship between the features and the target
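As a rough sketch of how those steps might translate to code, here is a minimal, hypothetical version using [sklearn]{.pack}, assuming a feature matrix `X` and target `y` are already defined. It's only meant to show the shape of the workflow; we'll do the real thing in the sections that follow.
```{python}
#| label: workflow-sketch-py
#| eval: false
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

# hold out unseen data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# search space: the ridge penalty parameter
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100]}

# search method: grid search with 5-fold cross-validation
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)

# evaluate the chosen model on the unseen test data
test_rmse = np.sqrt(mean_squared_error(y_test, grid.predict(X_test)))

# baseline: standard regression with no penalty
baseline_rmse = np.sqrt(mean_squared_error(
    y_test, LinearRegression().fit(X_train, y_train).predict(X_test)
))
```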
As we go along in this chapter, we'll see most of this in action. So let's get to it!
## Data Setup {#sec-ml-common-data-setup}
```{r}
#| eval: true
#| echo: false
#| label: load-heart-disease-data
# df_income = read_csv("data/census_income.csv", na = c("", "NA", '?'))
df_heart = read_csv("data/heart_disease_processed.csv") |>
mutate(
male = factor(ifelse(male == 1, 'yes', 'no'), levels = c('no', 'yes')),
across(where(is.character), as.factor)
)
df_heart_num = read_csv("data/heart_disease_processed_numeric.csv")
prevalence = mean(df_heart_num$heart_disease)
majority = pmax(prevalence, 1 - prevalence)
```
For our demonstration here, we'll use the heart disease dataset. This is a popular ML binary classification problem, where we want to predict whether a patient has heart disease given information such as age, sex, resting heart rate, etc. (@sec-dd-heart-disease-uci).
*There are two forms of the data that we'll use*: one mostly in raw form, and one that is purely numeric, where the categorical features are dummy coded and the numeric features have been standardized (@sec-data-transformations). The purely numeric version will allow us to forgo any additional data processing for some model/package implementations (like penalized regression). We have also dropped the handful of rows with missing values, even though some techniques, like tree-based models, naturally handle missing values. This form of the data will allow us to use any model and make direct comparisons among them later.
:::{.panel-tabset}
##### Python
For python we'll go ahead and do all the imports needed for this chapter.
```{python}
#| label: setup-imports-py
# Basic data packages
import pandas as pd
import numpy as np
# Models
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier
# Metrics and more
from sklearn.model_selection import (
cross_validate, RandomizedSearchCV, train_test_split
)
from sklearn.metrics import accuracy_score
from sklearn.inspection import PartialDependenceDisplay
```
```{python}
#| label: setup-data-py
df_heart = pd.read_csv('https://tinyurl.com/heartdiseaseprocessed')
df_heart_num = pd.read_csv('https://tinyurl.com/heartdiseaseprocessednumeric')
# convert appropriate features to categorical
non_num_cols = df_heart.select_dtypes(exclude='number').columns
df_heart[non_num_cols] = df_heart[non_num_cols].astype('category')
X = df_heart_num.drop(columns=['heart_disease']).to_numpy()
y = df_heart_num['heart_disease'].to_numpy()
prevalence = np.mean(y)
majority = np.max([prevalence, 1 - prevalence])
```
##### R
```{r}
#| label: setup-data-r
library(tidyverse)
df_heart = read_csv('https://tinyurl.com/heartdiseaseprocessed') |>
mutate(across(where(is.character), as.factor))
df_heart_num = read_csv('https://tinyurl.com/heartdiseaseprocessednumeric')
# for use with mlr3
X = df_heart_num |>
as_tibble() |>
mutate(heart_disease = factor(heart_disease)) |>
janitor::clean_names() # remove some symbols
prevalence = mean(df_heart_num$heart_disease)
majority = pmax(prevalence, 1 - prevalence)
```
:::
In this data, roughly `r scales::percent(prevalence)` of patients suffered from heart disease, so if we're interested in accuracy, we could get `r scales::percent(majority)` correct just by guessing the majority class of no disease. Hopefully we can do better than that!
One last thing: as we go along, performance metrics will vary depending on your setup (e.g., Python vs. R), package versions, and other factors. As such, your results may not look exactly like these, and that's okay! Your results should still be similar, and the important thing is to understand the concepts and how to apply them to your own data.
## Beat the Baseline {#sec-ml-common-baseline}
```{r}
#| eval: false
#| label: beat-the-baseline
#| echo: false
#|
l = c(NA, 'img/me_for_web.jpeg', 'img/seth.png')
library(ggimage)
performance_data = tibble(
model = c("Baseline", "Model MC", "Model SB"),
accuracy = c(0.75, 0.85, 0.88),
image = l # Path to the image file
)
# Create the bar chart
ggplot(performance_data, aes(x = model, y = accuracy, fill = model)) +
geom_col(width=.1, alpha = .5, color = NA, show.legend = FALSE) +
geom_point(data = performance_data |> slice(1), size = 10, alpha = 1, show.legend = FALSE) +
geom_image(aes(image=image), size = .075) +
scale_fill_manual(values = unname(okabe_ito)) +
coord_flip() +
labs(
title = "Beat the Baseline!",
x = "",
y = "Performance (Awesome Units)"
)
ggsave("img/ml-beat_the_baseline.svg", width = 8, height = 6, dpi = 300)
```
![Hypothetical Model Comparison](img/ml-beat_the_baseline.svg){width=75%}
Before getting carried away with models, we should have a good reference point for performance - a **baseline model**. The baseline model should serve as a way to gauge how much better your model performs over one that is simpler, probably more computationally efficient, more interpretable, and is still *viable*. It could also be a model that is sufficiently complex to capture something about the data you are exploring, but not as complex as the models you're also interested in.
Take a classification model for example. In this case we might use a logistic regression as a baseline. It is a viable model to begin answering some questions, and get a sense of performance possibilities, but it is often too simple to be adequately performant for many situations. We should be able to do better with more complex models, or if we can't, there is little justification for using them.
### Why do we do this? {#sec-ml-common-baseline-why}
Having a baseline model can help you avoid wasting time and resources implementing more complex tools, and keep you from mistakenly thinking performance is better than it really is. It is probably rare, but sometimes the relationships between the chosen features and target are mostly linear with little interaction. In that case, no amount of fancy modeling will make complex feature-target relationships exist if they don't already. Furthermore, if our baseline is itself a reasonably complex model that incorporates nonlinear relationships and interactions (e.g., a GAMM), you'll often find that more complex models don't significantly improve on it. As a last example, in time series settings, a *moving average* can often be a difficult baseline to beat, and so can be a good starting point.
So in general, you may find that the initial baseline model is good enough for present purposes, and you can then move on to other problems to solve, like acquiring data that is more predictive. This is especially true if you are working in a situation with limited time and resources, but it should be kept in mind generally.
### How much better? {#sec-ml-common-baseline-how-much}
In many settings, it often isn't enough to merely beat the baseline model; your model should perform *statistically* better. For instance, if your advanced model's accuracy is 75% and your baseline model's accuracy is 73%, that's great. But it's good to check whether this 2% difference is statistically significant. Remember, accuracy and other metrics are *estimates* and come with uncertainty[^ifonlystatdiff]. This means you can get an interval estimate for them, as well as test whether they differ from one another. @tbl-prop-test-r shows an example comparison of 75% vs. 73% accuracy at different sample sizes. If the difference is not statistically significant, you should possibly stick with the baseline model, or try a different approach to compete with it, because the next time you run the model on new data, the baseline may actually perform better, or at least you can't be sure that it won't.
[^ifonlystatdiff]: There would be far less hype and wasted time if those in ML and DL research simply did this rather than just reporting the chosen metric of their model 'winning' against other models. It's not that hard to do, yet many do not provide a ranged estimate for their metric, let alone test statistical comparisons among models. You don't even have to bootstrap many common metric estimates for binary classification, since they are just proportions. It'd also be nice if they used a more meaningful baseline than logistic regression, but that's a different story. And one more thing, although many papers also rank the competing models, ranks and mean ranks also have uncertainty, and ranks are typically very noisy.
```{r}
#| echo: false
#| label: tbl-prop-test-r
#| tbl-cap: Interval Estimates for Accuracy
# Sample size of 1000
n_1000 = prop.test(x = c(750, 730), n = c(1000, 1000), correct = FALSE)$conf.int
# Sample size of 10000
n_10000 = prop.test(x = c(7500, 7300), n = c(10000, 10000), correct = FALSE)$conf.int
# Test for difference in two values at sample size of 1000
n_1000_diff = prop.test(x = c(750, 730), n = c(1000, 1000), correct = FALSE)$p.value
# Test for difference in two values at sample size of 10000
n_10000_diff = prop.test(x = c(7500, 7300), n = c(10000, 10000), correct = FALSE)$p.value
# Combine results
tibble(
Sample_Size = c("1000", "10000"),
Lower = c(n_1000[1], n_10000[1]),
Upper = c(n_1000[2], n_10000[2]),
p_value = c(n_1000_diff, n_10000_diff)
) |>
gt() |>
cols_label(Sample_Size = "Sample Size", Lower = "Lower Bound", Upper = "Upper Bound", p_value = "p-value") |>
tab_footnote(
md("Confidence intervals are for the difference in proportions at values of .75 and .73,\n\nand p-values are for the difference in proportions.")
) |>
tab_options(
footnotes.font.size = 9
)
```
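If you want a quick way to run this kind of check yourself, here is a minimal sketch of a two-proportion z-test based on the normal approximation (the sample size and accuracies are the hypothetical values from the table).
```{python}
#| label: accuracy-diff-sketch-py
#| eval: false
import numpy as np
from scipy import stats

n = 1000                    # hypothetical test set size for both models
p1, p2 = 0.75, 0.73         # accuracies to compare

# pooled two-proportion z-test (no continuity correction)
p_pool = (p1 + p2) / 2      # equal n, so the pooled proportion is just the average
se_pool = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p1 - p2) / se_pool
p_value = 2 * stats.norm.sf(abs(z))   # two-sided p-value

# 95% confidence interval for the difference in accuracy
se_diff = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
ci = ((p1 - p2) - 1.96 * se_diff, (p1 - p2) + 1.96 * se_diff)
```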
That said, in some situations *any* performance increase is worth it, and even if we can't be certain a result is statistically better, any sign of improvement is worth pursuing. For example, if you are trying to predict the next word in a sentence, and your baseline is 70% accurate while your new model is 72% accurate, that may be meaningful in terms of user experience. You should still try to show that this is a consistent increase and not a fluke, if possible. In other settings, you'll need to make sure the cost is worth it. Is 2% worth millions of dollars? Six months of research? These are among the many practical considerations you may have to make as well.
## Penalized Linear Models {#sec-ml-common-penalized}
So let's get on with some models already! Let's use the classic linear model as our starting point for ML. We show explicitly how to estimate models like lasso and ridge regression in @sec-estim-penalty. Those work well as a baseline, and so should be in your ML modeling toolbox.
### Elastic Net {#sec-ml-common-elasticnet}
Another common linear model approach is **elastic net**, which we also saw in @sec-ml-core-concepts. It combines two techniques: lasso and ridge regression. We demonstrate the lasso and ridge penalties in @sec-estim-penalty, but all you have to know is that elastic net combines the two penalties, one for lasso and one for ridge, with a standard objective function for a numeric or categorical target. The relative proportion of the two penalties is controlled by a mixing parameter, whose optimal value is determined by cross-validation. So, for example, you might end up with a 75% lasso penalty and a 25% ridge penalty. In the end though, it's just a slightly fancier logistic regression!
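For reference, the elastic net objective is often written as follows (shown here in the glmnet-style parameterization with a linear regression loss; the classification case just swaps in the log loss):

$$
\underset{\beta}{\text{minimize}} \quad \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - x_i^\top\beta\right)^2 + \lambda\left(\alpha\sum_j \lvert\beta_j\rvert + \frac{1-\alpha}{2}\sum_j \beta_j^2\right)
$$

Here $\lambda$ is the overall penalty strength and $\alpha$ is the mixing parameter: $\alpha = 1$ gives the lasso, $\alpha = 0$ gives ridge, and something like $\alpha = 0.75$ gives the 75%/25% mix just mentioned.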
Let's apply this to the heart disease data. We are only doing simple cross-validation here to get a better performance assessment, but you are more than welcome to tune both the penalty parameter and the mixing ratio as we have demonstrated before (@sec-ml-tuning). We'll revisit hyperparameter tuning towards the end of this chapter.
:::{.panel-tabset}
##### Python
```{python}
#| label: elasticnet-py
#| eval: false
model_elastic = LogisticRegression(
penalty = 'elasticnet',
solver = 'saga',
l1_ratio = 0.5,
random_state = 42,
max_iter = 10000,
verbose = False,
)
# use cross-validation to estimate performance
model_elastic_cv = cross_validate(
model_elastic,
X,
y,
cv = 5,
scoring = 'accuracy',
)
# pd.DataFrame(model_elastic_cv) # default output
```
```{python}
#| label: elasticnet-py-save-results
#| echo: false
#| eval: false
pd.DataFrame(model_elastic_cv).to_csv('ml/data/elasticnet-py-results.csv', index=False)
```
```{python}
#| label: elasticnet-py-print-results
#| echo: false
model_elastic_cv = pd.read_csv('ml/data/elasticnet-py-results.csv')
print(
'Training accuracy: ',
np.round(model_elastic_cv['test_score'].mean(), 3),
'\nGuessing: ',
np.round(majority, 3),
)
```
##### R
```{r}
#| label: elasticnet-r
#| eval: false
library(mlr3verse)
tsk_elastic = as_task_classif(
X,
target = "heart_disease"
)
model_elastic = lrn(
"classif.cv_glmnet",
nfolds = 5,
type.measure = "class",
alpha = 0.5
)
model_elastic_cv = resample(
task = tsk_elastic,
learner = model_elastic,
resampling = rsmp("cv", folds = 5)
)
# model_elastic_cv$aggregate(msr('classif.acc')) # default output
```
```{r}
#| label: elasticnet-r-save-results
#| echo: false
#| eval: false
saveRDS(model_elastic_cv, 'ml/data/elasticnet-r-results.rds')
```
```{r}
#| label: elasticnet-r-print-results
#| echo: false
library(mlr3verse)
library(glue)
model_elastic_cv = readRDS('ml/data/elasticnet-r-results.rds')
# Evaluate
acc_inc = round(model_elastic_cv$aggregate(msr('classif.acc')) / majority - 1, 3)
glue("Training Accuracy: {round(model_elastic_cv$aggregate(msr('classif.acc')), 3)}\nGuessing: {round(majority, 3)}")
```
:::
So we're starting off with what seems to be a good model. Our average accuracy across the validation sets is definitely doing better than guessing, with a performance increase of more than `r scales::label_percent()(round(acc_inc, 1))`!
### Strengths & weaknesses {#sec-ml-common-penalized-strengths-weaknesses}
Let's take a moment to consider the strengths and weaknesses of penalized regression models.
**Strengths**
- Intuitive approach. In the end, it's still just a standard regression model you're already familiar with.
- Widely used for many problems. Lasso/Ridge/ElasticNet would be fine to use in any setting you would use linear or logistic regression.
- A good baseline for tabular data problems.
**Weaknesses**
- Does not automatically seek out interactions and non-linearity, and as such will generally not be as predictive as other techniques.
- Features have to be scaled, or the penalty will affect them unevenly and results will largely reflect the features' scales.
- May have interpretability issues with correlated features.
- Relatively weaker performance compared to other models, especially in high-dimensional settings.
### Additional thoughts {#sec-ml-common-penalized-additional}
Using penalized regression is a very good default method in the tabular data setting, and is something to strongly consider for more interpretation-focused model settings. These approaches predict better on new data than their standard, non-regularized counterparts, so they provide a nice balance between interpretability and predictive power. However, in general they are not as strong a method as others typically used in the machine learning world, and may not even be competitive without a lot of feature engineering. If prediction is all you care about, you'll likely need something else. Now let's see if we can do better with other models!
## Tree-based Models {#sec-ml-common-trees}
Let's move beyond standard linear models and get into a notably different type of approach. Tree-based methods are a class of models that are very popular in machine learning contexts, and for good reason: they work *very* well. To get a sense of how they work, consider the following classification example where we want to predict a binary target as 'Yes' or 'No'.
```{r}
#| echo: false
#| eval: false
#| label: fig-tree-graph-setup
g = DiagrammeR::grViz('img/tree.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/tree.svg")
```
![A simple classification tree](img/tree.svg){#fig-tree width=40%}
We have two numeric features, $X_1$ and $X_2$. At the start, we take $X_1$ and make a split at the value of 5. Any observation less than 5 on $X_1$ goes to the right with a prediction of *No*. Any observation greater than or equal to 5 goes to the left, where we then split based on values of $X_2$. Any observation less than 3 goes to the right with a prediction of *Yes*. Any observation greater than or equal to 3 goes to the left with a prediction of *No*. So in the end, we see that an observation that is relatively lower on $X_1$, or relatively higher on both, results in a prediction of *No*. On the other hand, an observation that is high on $X_1$ and low on $X_2$ results in a prediction of *Yes*.
This is a simple example, but it illustrates the core idea of a tree-based model, where the **tree** reflects the total process, and **branches** are represented by the splits going down, ultimately ending at **leaves** where predictions are made. We can also think of the tree as a series of `if-then` statements, where we start at the top and work our way down until we reach a leaf node, which is a prediction for all observations that qualify for that leaf.
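To make that concrete, the tree in @fig-tree boils down to a handful of if-then rules, which we could write out directly (just a sketch mirroring the figure, not something you'd do in practice):
```{python}
#| label: tree-as-rules-py
#| eval: false
def predict(x1, x2):
    # first split: X1 at 5
    if x1 < 5:
        return 'No'
    # second split (only reached when X1 >= 5): X2 at 3
    elif x2 < 3:
        return 'Yes'
    else:
        return 'No'

predict(7, 2)  # 'Yes': high on X1, low on X2
```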
A single tree is likely the most interpretable model we could come up with. Furthermore, it incorporates nonlinearities through multiple branches on a single feature, interactions by branching across different features, and feature selection by excluding features that do not result in useful splits for the objective, all in one.
However, a single tree is not a very stable model unfortunately, and so does not generalize well. For example, just a slight change in data, or even just starting with a different feature, might produce a very different tree[^cartbase]. Even though predictions could be similar, model interpretation would be very different.
[^cartbase]: A single regression/classification tree actually could serve as a decent baseline model, especially given the interpretability, and modern methods try to make them more stable.
The solution to that problem is straightforward though - by using the power of a bunch of trees, we can get predictions for each observation from each tree, and then average the predictions, resulting in a much more stable estimate. This is the concept behind both **random forests** and **gradient boosting**, which can be seen as different algorithms to produce a bunch of trees. They are also considered types of **ensemble models**, which are models that combine the predictions of multiple models, to ultimately produce a single prediction for each observation. In this case each tree serves as a model.
Random forests (RF) and boosting methods (GB) are very easy to implement, to a point. However, there are typically several hyperparameters to consider for tuning. Here are just a few to think about:
- Number of trees
- Learning rate (GB)
- Maximum **depth** of each tree
- Minimum number of observations in each leaf
- Number of features to consider at each tree/split
- Regularization parameters (GB)
- Out-of-bag sample size (RF)
The number of trees is simply how many trees you want to build, and is a key parameter setting for both RF and GB. For boosting models, the number of trees and learning rate play off of each other. Having more trees allows for a smaller rate[^gblr], which might improve the model but will take longer to train. However, it can lead to overfitting if other steps are not taken.
[^gblr]: For boosting models, the learning rate is a scaling factor for the contribution of each tree to the overall model. A smaller learning rate means that each tree contributes less to the overall model, and so you'll need more trees to get the same performance, all else being equal.
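To make the interplay between the learning rate and the number of trees concrete, a boosted ensemble's prediction can be roughly written as a sum of each tree's contribution, scaled by the learning rate $\eta$:

$$
\hat{f}(x) = f_0(x) + \eta \sum_{m=1}^{M} t_m(x)
$$

where $f_0$ is an initial (constant) prediction, $M$ is the number of trees, and $t_m$ is the $m$th tree. Shrinking $\eta$ reduces each tree's contribution, which is why a smaller learning rate typically needs to be paired with more trees.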
The depth of each tree refers to how many levels we allow the model to branch out, and is a crucial parameter. It controls the complexity of each tree, and thus the complexity of the overall model: less depth helps to avoid overfitting, but if the tree is too shallow, you won't be able to capture the nuances of the data. The minimum number of observations required for each leaf is also important for similar reasons - a lower number will allow for more complex trees, while a higher number will result in simpler trees.
It's also generally a good idea to take a random sample of features for each tree (or possibly even each branch), to also help reduce overfitting, but it's not obvious what proportion to take. The regularization parameters[^gbl1l2] are typically less important in practice, but can help reduce overfitting as in other modeling circumstances we've talked about. As with hyperparameters in other model settings, you'll use something like cross-validation to settle on final values.
[^gbl1l2]: For boosting models, the regularization parameters are basically penalties on the weights of the leaves. For example, a smaller value would reduce the contribution of that leaf to the overall model, and so would help to reduce overfitting.
### Example with LightGBM {#sec-ml-common-trees-lightgbm}
Here is an example of gradient boosting with the heart disease data. We'll explicitly set some of the parameters, and use 5-fold cross-validation to estimate performance.
:::{.panel-tabset}
##### Python
Although boosting methods are available in [scikit-learn]{.pack} for Python, in general we recommend using the [lightgbm]{.pack} or [xgboost]{.pack} packages directly for boosting, as both have an sklearn API (as demonstrated). Both also provide R and Python implementations, so you don't lose your place when switching between languages. We'll use [lightgbm]{.pack} here[^nomeow].
[^nomeow]: Some also prefer [catboost]{.pack}. Your humble authors have not actually been able to practically implement catboost in a setting where it was more predictive or as efficient/speedy as [xgboost]{.pack} or [lightgbm]{.pack} to get to the same performance level, but some have had notable success with it.
```{python}
#| label: boost-py
#| results: hide
#| eval: false
model_boost = LGBMClassifier(
n_estimators = 1000,
learning_rate = 1e-3,
max_depth = 5,
verbose = -1,
random_state=42,
)
model_boost_cv = cross_validate(
model_boost,
df_heart.drop(columns='heart_disease'),
df_heart['heart_disease'],
cv = 5,
scoring='accuracy',
)
# pd.DataFrame(model_boost_cv)
```
```{python}
#| label: boost-py-save-results
#| echo: false
#| eval: false
pd.DataFrame(model_boost_cv).to_csv('ml/data/boost-py-results.csv', index=False)
```
```{python}
#| label: boost-py-print-results
#| echo: false
model_boost_cv = pd.read_csv('ml/data/boost-py-results.csv')
print(
'Training accuracy: ',
np.round(np.mean(model_boost_cv['test_score']), 3),
'\nGuessing: ',
np.round(majority, 3),
)
```
##### R
Note that as of this writing, [mlr3]{.pack} requires one of its extension packages for its implementation of lightgbm, so we'll use the [mlr3extralearners]{.pack} package.
```{r}
#| eval: false
#| label: boost-r
library(mlr3verse)
# for lightgbm, you need mlr3extralearners and lightgbm package installed
# it is available from github via:
# remotes::install_github("mlr-org/mlr3extralearners@*release")
library(mlr3extralearners)
set.seed(42)
# Define task
# For consistency we use X, but lgbm can handle factors and missing data
# and so we can use the original df_heart if desired
tsk_boost = as_task_classif(
df_heart, # can use the 'raw' data
target = "heart_disease"
)
# Define learner
model_boost = lrn(
"classif.lightgbm",
num_iterations = 1000,
learning_rate = 1e-3,
max_depth = 5
)
# Cross-validation
model_boost_cv = resample(
task = tsk_boost,
learner = model_boost,
resampling = rsmp("cv", folds = 5)
)
```
```{r}
#| label: boost-r-save-results
#| echo: false
#| eval: false
# Note: mlr3 can sometimes fail to process factor targets (which it requires) appropriately
# this appears to have been a repeated issue https://github.com/mlr-org/mlr3/issues/91
saveRDS(model_boost_cv, "ml/data/boost-r-results.rds")
```
```{r}
#| label: boost-r-print-results
#| echo: false
model_boost_cv = readRDS("ml/data/boost-r-results.rds")
# Evaluate
glue("Training Accuracy: {round(model_boost_cv$aggregate(msr('classif.acc')), 3)}\nGuessing: {round(majority, 3)}")
```
:::
So here we have a model that is also performing well, though not significantly better or worse than our elastic net model. For most tabular data situations, we'd expect boosting to do better, but this shows why we want a good baseline or simpler model for comparison. We'll revisit hyperparameter tuning using this model later.
<!-- If you'd like to see an example of how we could implement a form boosting by hand, see @app-boosting. -->
<!-- TODO: ADD GBLINEAR BY HAND TO APPENDIX for web version in future -->
### Strengths & weaknesses {#sec-ml-common-trees-strengths-weaknesses}
Random forests and boosting methods, though not new, are still 'state of the art' in terms of performance on tabular data like the type we've been using for our demos here. You'll often find that it takes considerable effort to beat them.
**Strengths**
- A single tree is highly interpretable.
- Relatively good prediction out of the box.
- Easily incorporates features of different types (the scale of numeric features, or using categorical features, doesn't matter).
- Tolerance to irrelevant features.
- Some tolerance to correlated inputs.
- Handling of missing values. Missing values are just another value to potentially split on[^defaultmiss].
[^defaultmiss]: It's not clear why most model functions still have no default for this sort of thing in `r lubridate::year(Sys.Date())`. Is it that hard to drop or impute them with an informative message?
**Weaknesses**
- Honestly few, but like all techniques, it might be relatively less predictive in certain situations. There is [no free lunch](https://machinelearningmastery.com/no-free-lunch-theorem-for-machine-learning/).
- It does take more effort to tune relative to linear model methods.
## Deep Learning and Neural Networks {#sec-ml-common-dl-nn}
```{r}
#| echo: false
#| eval: false
#| label: nn-graph-setup
g = DiagrammeR::grViz('img/nn_basic.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_basic.svg")
```
![A neural network](img/nn_basic.svg){#fig-basic-nn width=50%}
**Deep learning has fundamentally transformed the world of data science, and, in many ways, the world itself**. It has been used to solve problems in image detection, speech recognition, natural language processing, and more, from assisting with cancer diagnosis, to writing entire novels, providing self-driving cars, and even helping the formerly blind see. It is an extremely powerful tool.
For tabular data, however, the story is a bit different. Here, deep learning has consistently struggled to outperform models like boosting, and even penalized regression in many cases. But while it is not always the best tool for the job, it should be in your modeling toolbox, if only because it can potentially be the most performant model, and may well become the dominant model for tabular data in the future. Here we'll provide a brief overview of the key concepts behind neural networks, the approach underlying deep learning, and then demonstrate how to implement a simple neural network to get things started.
### What is a neural network? {#sec-ml-common-nnet}
**Neural networks** form the basis of deep learning models. They have actually been around a while - [computationally and conceptually going back decades](https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history1.html)[^neuralbasis]^,^ [^cogneuron]. Like other models, they are computational tools that help us understand how to get outputs from inputs. However, they weren't quickly adopted due to computing limitations, similar to the slow adoption of Bayesian methods. But now, neural networks, or deep learning more generally, have become the go-to method for many problems.
[^neuralbasis]: Most consider the scientific origin with @mcculloch_logical_1943.
[^cogneuron]: On the conceptual side, they served as a rudimentary model of neuronal functioning in the brain, and a way to understand how the brain processes information. The models sprung from the cognitive revolution, a backlash against the behaviorist approach to psychology, and used the computer as [a metaphor for how the brain might operate](https://en.wikipedia.org/wiki/Connectionism).
### How do they work? {#sec-ml-common-nnet-how}
At its core, a neural network can be seen as a series of matrix multiplications and other operations to produce combinations of features, and ultimately a desired output. We've been talking about inputs and outputs since the beginning (@sec-lm-relationships), but neural networks like to put a lot more in between the inputs and outputs than we've seen with other models. However, the key operations are often no different than what we've done with a basic linear model, and sometimes even simpler! But the combinations of features they produce can represent many aspects of the data that are not easily captured by simpler models.
One notable difference from models we've been seeing is that neural networks implement *multiple combinations of features*, where each combination is referred to as a hidden **node** or unit[^hidden]. In a neural network, each feature has a weight (or coefficient), just like in a linear model of the type we've used before. These features are multiplied by their weights and then added together. But we actually create multiple such combinations, as depicted in the 'H' or 'hidden' nodes in the following visualization.
[^hidden]: The term 'hidden' is used because these nodes sit between the inputs and the output. It does not imply a latent/hidden variable in the sense used in structural equation or measurement models, state space models, and similar, though there is a lot of common ground. See the connection with principal components analysis for example (@sec-ml-more-pca-as-net).
```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-setup
g = DiagrammeR::grViz('img/nn_first_hidden_layer.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_first_hidden_layer.svg")
```
![The first hidden layer](img/nn_first_hidden_layer.svg){#fig-first-layer width=25%}
The next phase is where things can get more interesting. We take those hidden units and add in nonlinear transformations before moving deeper into the network. The transformations applied are typically referred to as **activation functions**[^relu]. So, the output of the current (typically linear) part is transformed in a way that allows the model to incorporate nonlinearities. While this might sound new, this is just like how we use link functions in generalized linear models (@sec-glm-distributions). Furthermore, these multiple combinations also allow us to incorporate interactions between features.
[^relu]: We have multiple options for our activation functions, and probably the most common activation function in deep learning is the **rectified linear unit** or [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)), and its more recent variants. Others used include the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function), which is exactly the same as what we used in logistic regression, the hyperbolic tangent function, and of course the linear/identity function, which does not do any transformation at all.
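As a quick illustration, these activation functions are just simple elementwise transformations of a node's output (a small [numpy]{.pack} sketch):
```{python}
#| label: activation-sketch-py
#| eval: false
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example node outputs

relu     = np.maximum(0, z)       # rectified linear unit
sigmoid  = 1 / (1 + np.exp(-z))   # same as the inverse logit in logistic regression
tanh     = np.tanh(z)             # hyperbolic tangent
identity = z                      # linear/identity, i.e., no transformation
```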
```{r}
#| echo: false
#| eval: false
#| label: half-nn-graph-activation
g = DiagrammeR::grViz('img/nn_activation2.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_activation2.svg")
```
![Activation function applied to a node's output before passing it as input to next layer](img/nn_activation2.svg){#fig-activate width=33%}
But we can go even further! We can add more layers, and more nodes in each layer, even different types of layers, to create a **deep neural network**. We can also add components specific to certain types of processing, have some parts of the network only connected to certain other parts, apply specific computations to specific components, and more. The complexity really is only limited by our imagination, *and computational capacity*! This is what helps make neural networks so powerful - given enough nodes, layers, and components, they can approximate ***any*** function, which could include the true function that connects our features to the target. Practically though, the feature inputs become an output or multiple outputs that can then be assessed in the same ways as other models.
```{r}
#| echo: false
#| eval: false
#| label: nn-deep-graph-setup
g = DiagrammeR::grViz('img/nn_deep2.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_deep.svg")
```
![A more complex neural network](img/nn_deep.svg){#fig-complex-nn width=75%}
Before getting too carried away, let's simplify things a bit by returning to some familiar ground. Consider a logistic regression model. There we take the linear combination of features and weights, apply the sigmoid function (inverse logit) to it, and that output is what we compare to our observed target when calculating the objective function.
We can revisit a plot we saw earlier (@fig-graph-logistic) to make things more concrete. The input features are $X_1$, $X_2$, and $X_3$, and the output is the probability of a positive outcome of a binary target. The weights are $w_1$, $w_2$, and $w_3$, and the bias[^biascs] is $w_0$. The hidden node is just our linear predictor which we can create via matrix multiplication of the feature matrix and weights. The sigmoid function is the activation function, and the output is the probability of the chosen label.
[^biascs]: It's not exactly clear [why computer scientists chose to call this the bias](https://stats.stackexchange.com/questions/511726/different-usage-of-the-term-bias-in-stats-machine-learning), but it's the same as the intercept in a linear model, or conceptually as an offset or constant. It has nothing to do with the word bias as used in every other modeling context.
```{r}
#| echo: false
#| eval: false
#| label: logregnn-graph-setup
g = DiagrammeR::grViz('img/nn_logreg.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg("img/nn_logreg.svg")
```
![A logistic regression as a neural network with a single hidden layer with one node, and sigmoid activation](img/nn_logreg.svg){#fig-logreg-nn width=50%}
This shows that we can actually think of logistic regression as a very simple neural network, with a linear combination of the inputs as a single hidden node and a sigmoid activation function adding the nonlinear transformation. Indeed, the earliest **multilayer perceptron** models were just composed of multiple layers of logistic regressions!
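Here is a small [numpy]{.pack} sketch of that idea: a logistic regression 'forward pass', and then the same thing with one hidden layer added (the weights are made up purely for illustration):
```{python}
#| label: logreg-as-nn-sketch-py
#| eval: false
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(5, 3))     # 5 observations, 3 features

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# logistic regression: a single linear combination + sigmoid activation
w = rng.normal(size=3)          # weights (coefficients)
w0 = 0.1                        # bias (intercept)
p_logreg = sigmoid(X @ w + w0)  # predicted probabilities

# add one hidden layer with 4 nodes and ReLU activation
W1 = rng.normal(size=(3, 4))    # input-to-hidden weights
W2 = rng.normal(size=4)         # hidden-to-output weights
H = np.maximum(0, X @ W1)       # hidden layer outputs after ReLU
p_mlp = sigmoid(H @ W2 + w0)    # predicted probabilities
```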
:::{.callout-note title="GAMs and Neural Networks" collapse='true'}
You can think of neural networks as nonlinear extensions of linear models. Regression approaches like GAMs and Gaussian process regression can be seen as [approximations to neural networks](https://arxiv.org/abs/1711.00165) (see also @rasmussen_gaussian_2005), bridging the gap between the simpler, more interpretable linear model and the black box of a deep neural network. This brings us back to having a good baseline: if you know some simpler tools that can approximate more complex ones, you can often get 'good enough' results with the simpler models.
:::
### Trying it out {#sec-ml-common-dl-nn-try}
The neural network model we'll use is a **multilayer perceptron** (MLP), which is a model like the one we've been showing. It consists of multiple hidden layers of potentially varying sizes, and we can incorporate activation functions as we see fit.
:::{.callout-note title='More on neural networks for tabular data' collapse='true'}
Be aware that this would be considered a bare minimum approach for a neural network, and generally you'd need to do more, even for standard tabular data. To begin with, you'd want to tune the **architecture**, or structure of hidden layers. For example, you might want to try more layers, as well as 'wider' layers, or more nodes per layer. Also, we'd usually want to use **embeddings** for categorical features as opposed to the one-hot approach used here (@sec-data-cat)[^fastaitab].
[^fastaitab]: A really good tool for a standard MLP type approach with automatic categorical embeddings is [fastai]{.pack}'s tabular learner. For a more flexible, roll-your-own type of approach, consider the recent [torch_frame]{.pack}.
:::
For our demo, we'll use the numeric heart disease data with one-hot encoded categorical features. For our architecture, we'll use three hidden layers with 200 nodes each. As noted, these and other settings are hyperparameters that you'd normally prefer to tune, but we'll just set them as fixed parameters.
:::{.panel-tabset}
##### Python
For our demonstration we'll use [sklearn]{.pack}'s built-in `MLPClassifier`. We set the initial learning rate to 0.001 and request an **adaptive learning rate**, which automatically adjusts the learning rate as the model trains (this setting, like **Nesterov momentum**, only applies if the SGD solver is used; the default solver is Adam, an SGD variant). The ReLU activation function is the default. We use a **warm start**, which allows us to train the model in stages by reusing a previous fit, and set the **validation fraction**, the proportion of data set aside as a validation set for early stopping. Finally, we use **shuffle** to shuffle the observations used in each batch during the SGD-style optimization (@sec-estim-opt-algos-sgd).
```{python}
#| label: deep-py
#| eval: false
model_mlp = MLPClassifier(
hidden_layer_sizes = (200, 200, 200),
learning_rate = 'adaptive',
learning_rate_init = 0.001,
shuffle = True,
random_state = 123,
warm_start = True,
nesterovs_momentum = True,
validation_fraction = .2,
verbose = False,
)
# with the above settings, this will take a few seconds
model_mlp_cv = cross_validate(
model_mlp,
X,
y,
cv = 5
)
# pd.DataFrame(model_mlp_cv) # default output
```
```{python}
#| echo: false
#| eval: false
#| label: deep-py-save-results
pd.DataFrame(model_mlp_cv).to_csv('ml/data/deep-py-results.csv', index=False)
```
```{python}
#| label: deep-py-print-results
#| echo: false
model_mlp_cv = pd.read_csv('ml/data/deep-py-results.csv')
print(
'Training accuracy: ',
np.round(np.mean(model_mlp_cv['test_score']), 3),
'\nGuessing: ',
np.round(majority, 3),
)
```
##### R
For R, we'll use [mlr3torch]{.pack}, which is built on [torch]{.pack} (the R interface to the PyTorch backend), with the same architecture as the Python example. The **ReLU** activation function is the default. We'll use the **adam** SGD variant as the optimizer, a popular choice in deep learning models and the default for the [sklearn]{.pack} approach, and **cross entropy** as the loss function, which is the same as the log loss objective used in logistic regression and other ML classification models. We use a **batch size** of 16, the number of observations used in each [batch of training](https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network), and 50 **epochs**, the number of passes over the entire dataset (probably way more than necessary). We set the **predict type** to **prob** so that we get predicted probabilities (required by log loss), and track both **log loss** and **classification error** during validation. As specified, this took over a minute to run.
```{r}
#| label: deep-r
#| eval: false
library(mlr3torch)
learner_mlp = lrn(
"classif.mlp",
# defining network parameters
neurons = c(200, 200, 200),
# training parameters
batch_size = 16,
epochs = 50,
# Defining the optimizer, loss, and callbacks
optimizer = t_opt("adam", lr = 1e-3),
loss = t_loss("cross_entropy"),
# Measures to track
measures_train = msrs(c("classif.logloss")),
validate = .1,
measures_valid = msrs(c("classif.logloss", "classif.ce")),
# predict type (required by logloss)
predict_type = "prob",
seed = 123
)
tsk_mlp = as_task_classif(
x = X,
target = 'heart_disease'
)
# this will take a few seconds depending on your chosen settings and hardware
model_mlp_cv = resample(
task = tsk_mlp,
learner = learner_mlp,
resampling = rsmp("cv", folds = 5),
)
model_mlp_cv$aggregate(msr("classif.acc")) # default output
```
```{r}
#| label: deep-r-save-results
#| echo: false
#| eval: false
saveRDS(model_mlp_cv, "ml/data/deep-r-results.rds")
```
```{r}
#| label: deep-r-print-results
#| echo: false
# library(mlr3torch)
model_mlp_cv = readRDS("ml/data/deep-r-results.rds")
glue('Training Accuracy: {round(model_mlp_cv$aggregate(msr("classif.acc")), 3)}\nGuessing: {round(majority, 3)}')
```
:::
This model actually did pretty well, and our accuracy is on par with the other two models. This is somewhat surprising given the nature of the data (a relatively small number of observations with mixed feature types), a situation where neural networks typically don't do as well as other methods. Just goes to show, you never know until you try!
:::{.callout-note title='Deep and Wide' collapse='true'}
A now relatively old question in deep learning is what is the better approach: deep networks, with more layers, or extremely wide (lots of neurons) and fewer layers? The answer is that it can depend on the problem, but in general, deep networks are more efficient and easier to train, and [will generalize better](https://stats.stackexchange.com/questions/222883/why-are-neural-networks-becoming-deeper-but-not-wider). Deeper networks have the ability to build upon what the previous layers have learned, basically compartmentalizing different parts of the task to learn. More important to the task is creating an architecture that is able to learn the appropriate aspects of the data, and generalize well.
:::
### Strengths & weaknesses {#sec-ml-common-dl-nn-strengths-weaknesses}
So why might we want to use a neural network for tabular data? The main reason is that they can potentially be the most performant model, capturing complex relationships in the data that other models may miss. They can also be used for a wide variety of data types and tasks. However, they are also the most complex model, and can be the most difficult to tune and interpret.
**Strengths**
- Good prediction generally.
- Incorporates the predictive power of different combinations of inputs.
- Some tolerance to correlated inputs.
- Batch processing and parallelization of many operations makes it very efficient for large datasets.
- Can be used for even standard GLM approaches.
- Can be added as a component to other deep learning models (e.g., LLMs that are handling text input).
**Weaknesses**
- Susceptible to irrelevant features.
- Doesn't consistently outperform other methods that are easier to implement on tabular data.
## A Tuned Example {#sec-ml-common-tuned-ex}
We noted in the chapter on machine learning concepts that there are often multiple hyperparameters we are concerned with for a given model (@sec-ml-tuning). We had hyperparameters for each of the models in this chapter also. For the elastic net model, we might want to tune the penalty parameters and the mixing ratio. For the boosting method, we might want to tune the number of trees, the learning rate, the maximum depth of each tree, the minimum number of observations in each leaf, and the number of features to consider at each tree/split. And for the neural network, we might want to tune the number of hidden layers, the number of nodes in each layer, the learning rate, the batch size, the number of epochs, and the activation function. There is plenty to explore!
Here is an example of a hyperparameter search using the boosting model. We'll tune the number of trees, the learning rate, the minimum number of observations in each leaf, and the maximum depth of each tree. We'll use a **randomized search** across the parameter space to sample from the set of hyperparameters, rather than searching every possible combination as in a **grid search**. This is a good approach when you have a lot of hyperparameters to tune, and/or when you have a lot of data.
:::{.panel-tabset}
##### Python
```{python}
#| label: tune-boost-py
#| eval: false
#| results: hide
# train-test split
X_train, X_test, y_train, y_test = train_test_split(
df_heart.drop(columns='heart_disease'),
df_heart_num['heart_disease'],
test_size = 0.2,
random_state = 42
)
model_boost = LGBMClassifier(
verbose = -1
)
param_grid = {
'n_estimators': [500, 1000],
'learning_rate': [1e-3, 1e-2, 1e-1],
'max_depth': [3, 5, 7, 9],
'min_child_samples': [1, 5, 10],
}
# this will take a few seconds
model_boost_cv_tune = RandomizedSearchCV(
model_boost,
param_grid,
n_iter = 10,
cv = 5,
scoring = 'accuracy',
n_jobs = -1,
random_state = 42
)
model_boost_cv_tune.fit(X_train, y_train)
test_predictions = model_boost_cv_tune.predict(X_test)
accuracy_score(y_test, test_predictions)
```
```{python}
#| label: tune-boost-py-save-results
#| echo: false
#| eval: false
#| results: hide
# save the model_boost_cv best estimator model
import joblib
#save your model or results
joblib.dump(model_boost_cv_tune.best_estimator_, 'ml/data/tune-boost-py-model.pkl')
test_predictions = model_boost_cv_tune.predict(X_test)
pd.DataFrame(model_boost_cv_tune.cv_results_).to_csv(
'ml/data/tune-boost-py-results.csv',
index=False
)
pd.DataFrame(test_predictions).to_csv(
'ml/data/tune-boost-py-predictions.csv',
index=False
)
```
```{python}
#| label: tune-boost-py-print-results
#| echo: false
from sklearn.model_selection import RandomizedSearchCV, train_test_split