RF var importance.Rmd

---
title: "Extracting information from RF fit objects"
author: Gertjan Verhoeven
date: September 2017
output:
  pdf_document: 
  toc: true
toc_depth: 3
---


# Laden packages

```{r, results='hide', message=FALSE, warning=FALSE}
rm(list=ls())
library(randomForest)
library(party)
library(ranger)
library(data.table)
library(ggplot2)

# uniform prob density function
rdunif <- function(n,k) sample(1:k, n, replace = T)
```

# Summary

This notebook explores the various ways to extract information from RandomForest fit objects.
This includes:

* Extracting information on variable importance, as well as constructing partial dependence plots.
* Regarding variable importance, this covers Gini importance and Permutation importance.
* Regarding partial dependence plots, this covers regular PDP's as well as ICE plots.
* We explore packages `mlr` and `pdp`.
* Both randomForest() and ranger() RF implementations are considered. 
* The simulations of Strobl et al 2008 that demonstrate bias for particular variable types in the Gini importance are reproduced here.
* We end with exploring a method to detect interactions using a combination of OLS and RF. This promising approach is further tested in a separate notebook.

Conclusions:

* Gini importance is horribly biased, use permutation importance instead.
* ranger() is much faster than randomForest and appears quite stable.
* `pdp` is more suitable for our purposes than `mlr`.
* In realistic datasets ICE plots are not the panacee for interaction detection we hoped it would be, although it does work in a relatively clean dataset.


```{r eval = F, echo = F}
#varImpPlot(my_randomForest, type = 1, scale = F)
#varImpPlot(my_randomForest)
#  type=NULL, class=NULL, scale=TRUE,
# type	
# either 1 or 2, specifying the type of importance measure (1=mean decrease in accuracy, 2=mean decrease in node impurity = Gini).
# class	
# for classification problem, which class-specific measure to return.
# scale	
# For permutation based measures, should the measures be divided their “standard errors”?
```

# importance definitions as used in randomForest

Here are the definitions of the variable importance measures. 

`Permutation importance`:
The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).

`Gini importance`:
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.

First do classification (binomial dist) to stay close to Strobl et al.

# Simulate dataset function

```{r}
# simulate noise dataset with signal on x2 as function of relevance
generateData <- function(nsize, relevance, interaction = 0){
  y <- rbinom(n = nsize, size = 1, prob = 0.5)
  x1 <- rnorm(n = nsize, mean = 0, sd = 1)
  x3 <- rdunif(n = nsize, k = 4)
  x4 <- rdunif(n = nsize, k = 10)
  x5 <- rdunif(n = nsize, k = 20)
  x2 <- rep(-1, nsize)
  x2[y == 1 & x1 < 0] <- rbinom(n = sum(y == 1 & x1 < 0), size = 1, prob = 0.5 - relevance - interaction)
  x2[y == 1 & x1 >= 0] <- rbinom(n = sum(y == 1 & x1 >= 0), size = 1, prob = 0.5 - relevance + interaction)
  x2[y == 0 & x1 < 0] <- rbinom(n = sum(y == 0 & x1 < 0), size = 1, prob = 0.5 + relevance - interaction)
  x2[y == 0 & x1 >= 0] <- rbinom(n = sum(y == 0 & x1 >= 0), size = 1, prob = 0.5 + relevance + interaction)
  
  my_df <- data.frame(y = as.factor(y), x1, x2, x3, x4, x5)
  my_df
}

```

Test on the variable importance measures.

# Importance null case study (no signal present)

```{r}
fullrun <- 0

if(fullrun){
  set.seed(1)
  for(i in 1:1000){
  #set.seed <- sample(1:1000, 1)
  my_df <- generateData(120, 0)
  
  my_randomForest <- randomForest(y ~ ., data = my_df,
                                  importance = TRUE, ntree = 50,
                                  mtry = 3, replace = TRUE)
  
  if(i == 1) myres <- cbind(randomForest::importance(my_randomForest, type = NULL, scale=F), i)
  
  else myres <- rbind(myres, 
                      cbind(randomForest::importance(my_randomForest, type = NULL, scale=F), i))
  
  }
  saveRDS(myres, file = "myres.rds")
} else { myres <- readRDS("myres.rds") }

my_rownames <- row.names(myres)
myres <- data.table(myres)
myres <- myres[, varname := my_rownames]
```

# Gini importance

```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseGini)) + 
  geom_boxplot()
```


# Permutation importance

```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseAccuracy)) + 
  geom_boxplot()

```

De permutation importance is on average ok (unbiased) but the variance remains a function of the number of levels of the predictor.

# Importance signal case study (signal on x2)

We set the relevance at 0.1 , this gives a weak signal for N=120 observations.

```{r}
fullrun <- 0

if(fullrun){
  set.seed(1)
  for(i in 1:1000){
  #set.seed <- sample(1:1000, 1)
  my_df <- generateData(120, 0.1)
  
  my_randomForest <- randomForest(y ~ ., data = my_df,
                                  importance = TRUE, ntree = 50,
                                  mtry = 3, replace = TRUE)
  
  if(i == 1) myres <- cbind(randomForest::importance(my_randomForest, type = NULL, scale=F), i)
  
  else myres <- rbind(myres, 
                      cbind(randomForest::importance(my_randomForest, type = NULL, scale=F), i))
  
  }
  saveRDS(myres, file = "myres2.rds")
} else { myres <- readRDS("myres2.rds") }

my_rownames <- row.names(myres)
myres <- data.table(myres)
myres <- myres[, varname := my_rownames]
```


```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseGini)) + 
  geom_boxplot()
```

```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseAccuracy)) + 
  geom_boxplot()
```

This is for relevance 0.1, where the boxplot of x2 overlaps with the non-informative predictors.
We see this also when using logistic regression (see below)

# Importance signal case study (signal on x2)

```{r}
fullrun <- 0

if(fullrun){
  set.seed(1)
  for(i in 1:1000){
  #set.seed <- sample(1:1000, 1)
  my_df <- generateData(120, 0.2)
  
  my_randomForest <- randomForest(y ~ ., data = my_df,
                                  importance = TRUE, ntree = 50,
                                  mtry = 3, replace = TRUE)
  
  if(i == 1) myres <- cbind(randomForest::importance(my_randomForest, type = NULL, scale=F), i)
  
  else myres <- rbind(myres, 
                      cbind(randomForest::importance(my_randomForest, type = NULL, scale=F), i))
  
  }
  saveRDS(myres, file = "myres3.rds")
} else { myres <- readRDS("myres3.rds") }

my_rownames <- row.names(myres)
myres <- data.table(myres)
myres <- myres[, varname := my_rownames]
```

```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseGini)) + 
  geom_boxplot()
```

```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseAccuracy)) + 
  geom_boxplot()
```

# compare with logistic regression

If we set relevance on 0,2, we get a pretty consistent significant signal using logistic regression.
If we set the relevance lower, e.g. 0.1, in about 50% of the cases, we do not detect a significant relationship (at p < 5%).

```{r}
my_df <- generateData(120, 0.2)

my_glm <- glm(y ~ ., data = my_df, family = "binomial")

summary(my_glm)

pval_dist <- vector()
for(i in 1:1000){
  my_df <- generateData(120, 0.1)
  
  my_glm <- glm(y ~ ., data = my_df, family = "binomial")
  
  pval_x2 <- coef(summary(my_glm))[3,4]
  pval_dist[i] <- pval_x2
}

ggplot(data.frame(pval = pval_dist), aes(x = pval)) + geom_histogram() + geom_vline(xintercept = 0.05)
```

# Now check ranger implementation

Need to run it twice, once for gini importance, once for permutation importance.

First gini importance. We run with strong signal on X2.

```{r}
fullrun <- 0

if(fullrun){
  set.seed(1)
  for(i in 1:1000){
  #set.seed <- sample(1:1000, 1)
  my_df <- generateData(120, 0.2)
  
  my_ranger <- ranger(y ~ ., data = my_df,
                                  importance = "impurity", num.trees = 50,
                                  mtry = 3, replace = TRUE)
  
  if(i == 1) {
    myres_tmp <- ranger::importance(my_ranger);
    myres <- cbind(names(myres_tmp), myres_tmp,  i)
  } else{ myres_tmp <- ranger::importance(my_ranger);
    myres <- rbind(myres, 
                      cbind(names(myres_tmp), myres_tmp,  i))
  }
  }
  saveRDS(myres, file = "myres4.rds")
} else { myres <- readRDS("myres4.rds") }

#my_rownames <- row.names(myres)
myres <- data.table(myres)
setnames(myres, "V1", "varname")
setnames(myres, "myres_tmp", "MeanDecreaseGini")
myres <- myres[, varname := as.factor(varname)]
myres <- myres[, MeanDecreaseGini := as.numeric(MeanDecreaseGini)]
myres <- myres[, i := as.integer(i)]

```


```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseGini)) + 
  geom_boxplot()
```

Indeed we find the same pattern as with randomForest. x1 and x5 are artificially preferred even though the signal is on x2.

Now run with the permutation performance.

```{r}
fullrun <- 0

if(fullrun){
  set.seed(1)
  for(i in 1:1000){
  #set.seed <- sample(1:1000, 1)
  my_df <- generateData(120, 0.2)
  
  my_ranger <- ranger(y ~ ., data = my_df,
                                  importance = "permutation", num.trees = 50,
                                  mtry = 3, replace = TRUE)
  
  if(i == 1) {
    myres_tmp <- ranger::importance(my_ranger);
    myres <- cbind(names(myres_tmp), myres_tmp,  i)
  } else{ myres_tmp <- ranger::importance(my_ranger);
    myres <- rbind(myres, 
                      cbind(names(myres_tmp), myres_tmp,  i))
  }
  }
  saveRDS(myres, file = "myres5.rds")
} else { myres <- readRDS("myres5.rds") }

#my_rownames <- row.names(myres)
myres <- data.table(myres)
setnames(myres, "V1", "varname")
setnames(myres, "myres_tmp", "MeanDecreaseAccuracy")
myres <- myres[, varname := as.factor(varname)]
myres <- myres[, MeanDecreaseAccuracy := as.numeric(MeanDecreaseAccuracy)]
myres <- myres[, i := as.integer(i)]

```


```{r}
ggplot(myres, aes(x = factor(varname), y = MeanDecreaseAccuracy)) + 
  geom_boxplot()
```

# Make partial dependence plots (pdp) met mlr package

Partial dependence plots plot the effect of a predictor on the outcome, GIVEN (averaged over) the effect of the other predictors.
This is different from effects plots where the other predictors are set at typical values for instance.

Mlr package is similar to caret in that it is a resampling framework to drive multiple different learners.
It contains a pdp function.

# test op iris dataset

```{r}
library(mlr)
lrn.classif <- makeLearner("classif.ranger", predict.type = "prob")
# iris.task is predef
fit.classif <- train(lrn.classif, iris.task)

pd <- generatePartialDependenceData(fit.classif, iris.task)

plotPartialDependence(pd)
```

# test randomForest op eigen dataset

Nu testen op onze synthetische dataset.
Eerst een grote dataset (N =1200).

```{r}
my_df <- generateData(1200, 0.2)

my_task <- makeClassifTask(data = my_df, target = "y")

lrn.classif <- makeLearner("classif.ranger", 
                           predict.type = "prob",
                           par.vals = list(num.trees = 500,
                                            mtry = 3, 
                                           replace = TRUE,
                                           num.threads = 8))
getLearnerProperties("classif.ranger")

fit.classif = train(lrn.classif, my_task)
pd <- generatePartialDependenceData(fit.classif, my_task)

plotPartialDependence(pd)
```

We find that p(y = 1| x2 == 0) = 0.3 and p(y = 1| x2 = 1) = 0.7 exactly as we generated the simulated data.
For the other variables we expect p = 0.5.

This is not exactly what we find at N = 1200, there are still patterns visible. 
This is caused because we have normally distributed data, so at the tails there is not much data.
Try to include a "rug" on these kinds of plots to show where the data is at.

We conclude for now that switching to the mlr framwork is a bit risky, we keep using caret and try a different package for pdp's, called `pdp`.

# Partial dependence plots with pdp package

PDPs help visualize the relationship between a subset of the features (typically 1-3) and the response while accounting for the average effect of the other predictors in the model.

We changed the data generating function so that we now can turn on an "interaction" effect, that we can try to detect.

> x2[y == 1 & x1 < 0] <- rbinom(n = sum(y == 1 & x1 < 0), size = 1, prob = 0.5 - relevance - interaction)
> x2[y == 1 & x1 >= 0] <- rbinom(n = sum(y == 1 & x1 >= 0), size = 1, prob = 0.5 - relevance + interaction)
> x2[y == 0 & x1 < 0] <- rbinom(n = sum(y == 0 & x1 < 0), size = 1, prob = 0.5 + relevance - interaction)
> x2[y == 0 & x1 >= 0] <- rbinom(n = sum(y == 0 & x1 >= 0), size = 1, prob = 0.5 + relevance + interaction)

# random forest pdp without interaction

```{r}
set.seed(123)
library(pdp)

my_df <- generateData(1200, 0.2, 0)
my_df$y <- as.integer(my_df$y) -1
my_rf <- ranger(y ~ . , data = my_df, num.trees = 500,
                                            mtry = 3, 
                                           replace = TRUE)

my_df2 <- generateData(1200, 0.2, 0.2)
my_df2$y <- as.integer(my_df2$y) -1
my_rf2 <- ranger(y ~ . , data = my_df2, num.trees = 500,
                                            mtry = 3, 
                                           replace = TRUE)

my_preds <- predict(my_rf, data = my_df)

  
pd1 <- partial(my_rf, pred.var = "x1")
pd2 <- partial(my_rf, pred.var = "x2")
pd3 <- partial(my_rf, pred.var = "x3")
pd4 <- partial(my_rf, pred.var = "x4")
pd5 <- partial(my_rf, pred.var = "x5")

grid.arrange(autoplot(pd1, rug = TRUE, train = my_df) + geom_hline(yintercept = 0.5) + ylim(0, 1), 
              autoplot(pd2, rug = TRUE, train = my_df) + geom_hline(yintercept = 0.5) + ylim(0, 1), 
             autoplot(pd3, rug = TRUE, train = my_df) + geom_hline(yintercept = 0.5) + ylim(0, 1), 
             autoplot(pd4, rug = TRUE, train = my_df) + geom_hline(yintercept = 0.5) + ylim(0, 1),
             autoplot(pd5, rug = TRUE, train = my_df) + geom_hline(yintercept = 0.5) + ylim(0, 1),
             ncol = 3)


```

This is according to expectations:  only an effect on x2.

# Randomforest without interaction: ICE curves

This shows, FOR EACH DATAPOINT, the prediction curve y = f_hat(x2 | x1, x3, x4, x5).
This is supposed to help in detecting interactions.

```{r}
pice <- partial(my_rf, pred.var = "x2", ice = TRUE)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curve  for x2")
gp

```


```{r}
pice <- partial(my_rf, pred.var = "x4", ice = TRUE)
gp <- autoplot(pice, alpha = 0.3, main = "ICE curve  for x4")
gp

```

# centered ice curve for x4

```{r}
pice <- partial(my_rf, pred.var = "x4", ice = TRUE, center = TRUE)
gp <- autoplot(pice, alpha = 0.1, main = "c-ICE curves")
gp

```

# RandomForest pdp: Now try with an interaction between x1 and x2

```{r}
my_df2 <- generateData(1200, 0.2, 0.2)
my_df2$y <- as.integer(my_df2$y) -1
```

We have generated data with an interaction between x1 and x2.


```{r}
my_df2 <- data.table(my_df2)
my_df2[, .(avg_y = mean(y),
           avg_x1 = mean(x1),
           aant = .N), .(x2)]
```


```{r}
my_rf2 <- ranger(y ~ . , data = my_df2, num.trees = 500,
                                            mtry = 3, 
                                           replace = TRUE)

pd1 <- partial(my_rf2, pred.var = "x1")
pd2 <- partial(my_rf2, pred.var = "x2")
pd3 <- partial(my_rf2, pred.var = "x3")
pd4 <- partial(my_rf2, pred.var = "x4")
pd5 <- partial(my_rf2, pred.var = "x5")

grid.arrange(autoplot(pd1, rug = TRUE, train = my_df2) + geom_hline(yintercept = 0.5) + ylim(0, 1), 
              autoplot(pd2, rug = TRUE, train = my_df2) + geom_hline(yintercept = 0.5) + ylim(0, 1), 
             autoplot(pd3, rug = TRUE, train = my_df2) + geom_hline(yintercept = 0.5) + ylim(0, 1), 
             autoplot(pd4, rug = TRUE, train = my_df2) + geom_hline(yintercept = 0.5) + ylim(0, 1),
             autoplot(pd5, rug = TRUE, train = my_df2) + geom_hline(yintercept = 0.5) + ylim(0, 1),
             ncol = 3)
```

Surprisingly enough, this results in a correlation between x1 en y.
This can be understood because y = f(x1 | x2) and we have introduced a correlation between x1 en x2.


# RandomForest with interaction: ICE curve for x2

```{r}
pice <- partial(my_rf2, pred.var = "x2", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curves")
gp

```

# RandomForest with interaction: ICE curve for x2

```{r}
pice <- partial(my_rf2, pred.var = "x1", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curves")
gp
```

OK, something happens here. But did we actually create an interaction? Or only a correlation between two predictors?

# Randomforest with interaction: 2d PDP voor x1 en x2
  
This visualizes the  2d relationship y = f(x1, x2 | the rest)
  
```{r}
pice <- partial(my_rf2, pred.var = c("x1", "x2"))
gp <- autoplot(pice, alpha = 0.1, main = "x1 vs x2")
gp
```

# Compare to lm

```{r}
my_lm <- lm(y ~ x1, data = my_df2)
summary(my_lm)

my_lm <- lm(y ~ ., data = my_df2)
summary(my_lm)
```

Here we see that x1 correlates with y, but only if x2 is included in the model.

```{r}
my_df2 <- data.table(my_df2)

# univariate there is no relation between x1 and y
my_df2[, .(avg_y = mean(y),
           aant = .N), .(x1 > 0)]

# but given x2 the expectation op y depends on x1
my_df2[x1 < 0, .(avg_y = mean(y),
           aant = .N), .(x2)]
my_df2[x1 >= 0, .(avg_y = mean(y),
           aant = .N), .(x2)]

```

# Try again with different dataset that contains a true interaction

It is not clear if we created a proper interaction with our previous approach. PM

We create a new dataset that explicitly contain an interaction between dummy c (0 or 1) and x.

We fit a GLM model that does not specifically models this interaction.

```{r}
set.seed(1)
c <- rep(0:1,each=500)
x <- rnorm(1000)
#z <- rnorm(1000) # extra noise
lp <- -3 + 2*c*x #+ z
link_lp <- exp(lp)/(1 + exp(lp))
y <- (runif(1000) < link_lp) 

df_int <- data.frame(y = as.integer(y), x, c)

my_glm <- glm(y ~ ., data = df_int, family = "binomial")
summary(my_glm)

logit <- function(p) (
  log(p / (1 - p))
)
coef(my_glm)[1]

```

# logistic regression: create pdp plot for the interaction vars c and x

```{r}
pd1 <- partial(my_glm, pred.var = "c")
pd2 <- partial(my_glm, pred.var = "x")


grid.arrange(autoplot(pd1, rug = TRUE, train = df_int) , 
              autoplot(pd2, rug = TRUE, train = df_int) ,
             ncol = 3)
```

# logistic model without interaction on data with interaction: ICE curves

Can we now detect the interaction with ICE plot?
The covariate is x and we see the normally distributed values here.


```{r}
pice <- partial(my_glm, pred.var = "c", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curve voor c")
gp
```

These curves all differ in the value of x.

```{r}
pice <- partial(my_glm, pred.var = "x", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curve voor x", rug = T, train = df_int)
gp
```
These curves all differ in the value of c.

Insight: since these plots show the model fit, and the model does not include an interaction, the plots don't show the interaction.
We need a more flexible learner such as RF that can detect the interactions.

# Do random forest on interaction dataset


```{r}
my_rf3 <- ranger(y ~ . , data = df_int, num.trees = 2000,
                                            mtry = 1, 
                                           replace = TRUE)
my_rf3
```

# Plot pdps for x and c using the RF fit
```{r}
pd1 <- partial(my_rf3, pred.var = "c")
pd2 <- partial(my_rf3, pred.var = "x")


grid.arrange(autoplot(pd1, rug = TRUE, train = df_int) +ylim(0, 0.4) , 
              autoplot(pd2, rug = TRUE, train = df_int) + ylim(0, 0.4),
             ncol = 2)
```
This differs from the code above, here we directly see the response value y_hat, 
for the glm results above the link function has not yet been applied! PM

# Plot ICE plots for x and c for the RF fit

```{r}
pice <- partial(my_rf3, pred.var = "c", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curves")
gp
```


```{r}
pice <- partial(my_rf3, pred.var = "x", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.1, main = "ICE curves")
gp
```

We see that the expectation for y for large x is split into two groups.
These are the c = 0 and c = 1  groups!! The trees have used c in splits but only for large x.

We have indeed visualized an interaction.

# Does this also work in a large p setting?

Simulate a dataset with 30 noise vars, 20 predictors, and two 2-way interaction terms.


```{r}
# create simulated dataset
library(MASS)

set.seed(1)
n_records <- 2000
nvars_noise <- 30

# generate  model specs
var_specs <- data.frame(varname = paste("X", 1:nvars_noise, sep=''), 
                 mean = rnorm(nvars_noise , 0, 5), 
                 sd = abs(rnorm(nvars_noise, 0, 5)))

#function which returns a matrix, and takes column vectors as arguments for mean and sd
normv <- function( n , mean , sd ){
  out <- rnorm( n*length(mean) , mean = mean , sd = sd )
  return( matrix( out , nrow = n, ncol = length(mean) , byrow = TRUE ) )
}

d_noise <- data.frame(normv( n_records , var_specs$mean , var_specs$sd ))
colnames(d_noise) <- var_specs$varname
```

The noise variables are not used to create the y var with.

```{r}
set.seed(1)
# generate signal vars
nvars <- 20
# moet 4 of meer zijn ivm interacties onder

signal_var_names <- paste("X", (nvars_noise+1):(nvars_noise + nvars), sep='')

# coefs
mu_vec <- rep(0, nvars)

# generate covariance matrix
Sigma <- matrix(rnorm(nvars * nvars, 0, 1), nvars, nvars)

#Sigma_pos_def <- t(Sigma) %*% Sigma
Sigma_pos_def <- diag(nvars)

# sample data from multivariate distribution with means mu and covariance matrix sigma_pos_def
d_signal <- mvrnorm(n = n_records, mu_vec, Sigma_pos_def, tol = 1e-6)

#cor(d_signal)
```

Start with an uncorrelated dataset. diag() gives the identity matrix.
We see that there is almost no covariance.

Generate the y using only the signal variables.

```{r}
set.seed(1)
# coefs beta
beta_signal <- runif(nvars, -10, 10)

y_signal <- colSums(beta_signal * t(d_signal))

# voeg twee interacties toe
n_inter <- 2

# dit zijn de interactie vars
intvars <- signal_var_names[1:4]

y_signal <- y_signal + 10 * d_signal[,1] * d_signal[,2]
y_signal <- y_signal + 10 * d_signal[,3] * d_signal[,4]

# zet nog wat noise op de y.
y_signal <- y_signal + rnorm(n_records, 0, 50)

# zet alle data bij elkaar
colnames(d_signal) <- signal_var_names
dataset <- data.frame(Y = y_signal, d_noise, d_signal )

# rearrange columns
sleutel <- sample(2:(nvars_noise + nvars + 1), replace = F)

dataset <- dataset[, c(1, sleutel)]

koppel <- data.frame(new_name = paste("X", 1:(nvars_noise + nvars), sep = ''),
                      orig_name = paste("X", sleutel-1, sep = '')
)
koppel <- data.table(koppel)

koppel[, type_var := "noise"]
koppel[orig_name %in% signal_var_names, type_var := "signal"]

colnames(dataset) <- c("Y", paste("X", 1:(nvars_noise + nvars), sep=''))
dataset <- data.table(dataset)
```

# Screen for important vars using permutation importance measure

```{r}
mtry_val <- round(ncol(dataset)/3, 0)

my_ranger <- ranger(Y ~ ., data = dataset,
                                  importance = "permutation", num.trees = 1000,
                                  mtry = mtry_val, replace = TRUE)
my_ranger
```

# Extract importance
```{r}
myres_tmp <- ranger::importance(my_ranger)
myres <- cbind(names(myres_tmp), myres_tmp,  i)
#my_rownames <- row.names(myres)
myres <- data.table(myres)
setnames(myres, "V1", "varname")
setnames(myres, "myres_tmp", "MeanDecreaseAccuracy")
myres <- myres[, varname := as.factor(varname)]
myres <- myres[, MeanDecreaseAccuracy := as.numeric(MeanDecreaseAccuracy)]
myres <- myres[, i := as.integer(i)]

```


```{r}
ggplot(myres, 
       aes(x = reorder(factor(varname), MeanDecreaseAccuracy), y = MeanDecreaseAccuracy)) + 
  geom_point() + coord_flip()
```

By mirroring the negative Decrease in accuracy values we get a rough threshold for "truly" significant predictors.
By eye it looks like the threshold is somewhere around value 10.

# PDP/ICE plots for the variables with signal

```{r}
topvars <- myres[MeanDecreaseAccuracy > 10]$varname

koppel[new_name %in% topvars ]

koppel[orig_name %in% intvars,]$new_name %in% topvars
# drie van de vier interactie vars zitten in de screening
```

We find that with this threshhold, we only detect signal variables, no noise variables.
So the screening has worked. Now try and find the interactions

```{r}
mtry_val <- round(ncol(dataset[, c("Y", as.character(topvars)), with = F])/3, 0)

my_ranger <- ranger(Y ~ ., data = dataset[, c("Y", as.character(topvars), "X32"), with = F],
                                  importance = "permutation", num.trees = 1000,
                                  mtry = mtry_val, replace = TRUE)
my_ranger
```

Check using OLS if the signals are strong enough.

```{r}
koppel[orig_name %in% intvars]

# we now have X5 en X6 en X41, add X32 for the ICE plots

my_glm <- glm(Y ~., data = dataset[, c("Y", as.character(topvars), "X32"), with = F], 
              family = "gaussian")

my_glm_int <- glm(Y ~. + X5:X32 + X6:X41, data = dataset[, c("Y", as.character(topvars), "X32"), with = F], 
                  family = "gaussian")

AIC(my_glm)
AIC(my_glm_int)
```

We clearly see an improvement in model fit (as measured with AIC) if we include the interactions.
This sugggests the signals are strong enough to be detected with RF.

```{r}
summary(my_glm)
summary(my_glm_int)
```

# Visualize interacting vars with ICE plots

```{r}
pice <- partial(my_ranger, pred.var = "X11", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.05, main = "ICE curve")
gp
```

```{r}
pice <- partial(my_ranger, pred.var = "X11", ice = TRUE, center = T)
gp <- autoplot(pice, alpha = 0.05, main = "ICE curve")
gp
```

```{r}
pice <- partial(my_ranger, pred.var = "X6", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.05, main = "ICE curve")
gp
```

```{r}
pice <- partial(my_ranger, pred.var = "X6", ice = TRUE, center = T)
gp <- autoplot(pice, alpha = 0.05, main = "ICE curve")
gp
```

```{r}
pice <- partial(my_ranger, pred.var = "X41", ice = TRUE, center = F)
gp <- autoplot(pice, alpha = 0.05, main = "ICE curve")
gp
```

```{r}
pice <- partial(my_ranger, pred.var = "X41", ice = TRUE, center = T)
gp <- autoplot(pice, alpha = 0.05, main = "ICE curve")
gp
```

There is not a lot to see in the RF plots.

# Try and explain the difference between OLS and RF with RF

Idea: if the only difference between OLS and RF models is dat OLS has not modeled the interactions, then we should be able to use RF to discover which predictors are required to improve the OLS predictions to the level of the RF predictions.

```{r}
my_glm <- glm(Y ~., data = dataset[, c("Y", as.character(topvars), "X32"), with = F], 
              family = "gaussian")

```

```{r}
mtry_val <- round(ncol(dataset[, c("Y", as.character(topvars), "X32"), with = F])/3, 0)

my_ranger <- ranger(Y ~ ., data = dataset[, c("Y", as.character(topvars), "X32"), with = F],
                                  importance = "permutation", num.trees = 1000,
                                  mtry = mtry_val, replace = TRUE)
my_ranger
```
```{r}
pred_RF <- predict(my_ranger, data = dataset[, c("Y", as.character(topvars), "X32"), with = F])
#pred_RF$predictions
pred_GLM <- predict(my_glm, data = dataset[, c("Y", as.character(topvars), "X32"), with = F])

plot(pred_RF$predictions, pred_GLM)
```

```{r}
pred_diff <- pred_RF$predictions - pred_GLM

my_ranger_diff <- ranger(Ydiff ~ ., data = data.table(Ydiff = pred_diff, dataset[, c(as.character(topvars), "X32"), with = F]),
                                  importance = "permutation", num.trees = 1000,
                                  mtry = mtry_val, replace = TRUE)
my_ranger_diff

```
```{r}
myres_tmp <- ranger::importance(my_ranger_diff)
myres <- cbind(names(myres_tmp), myres_tmp,  i)
#my_rownames <- row.names(myres)
myres <- data.table(myres)
setnames(myres, "V1", "varname")
setnames(myres, "myres_tmp", "MeanDecreaseAccuracy")
myres <- myres[, varname := as.factor(varname)]
myres <- myres[, MeanDecreaseAccuracy := as.numeric(MeanDecreaseAccuracy)]
myres <- myres[, i := as.integer(i)]

```


```{r}
ggplot(myres, 
       aes(x = reorder(factor(varname), MeanDecreaseAccuracy), y = MeanDecreaseAccuracy)) + 
  geom_point() + coord_flip()
```
```{r}
koppel[orig_name %in% intvars,]$new_name
```