SHAP feature contribution for linear trees #4002

Closed
SpeckledJim2 opened this issue Feb 19, 2021 · 4 comments

@SpeckledJim2

Description

When using linear_tree = TRUE and predict() with predcontrib = TRUE, the sum of the feature contributions does not equal the predicted value.

Reproducible example

library(lightgbm)
x <- matrix(data = sample(rnorm(100L), size = 100L), ncol = 1L)
y <- 2L * x + runif(nrow(x), 0L, 0.1)

lgb_params_1 <- list(
    objective = "regression"
    , linear_tree = FALSE
    , verbose = -1L
    , metric = "mse"
    , seed = 0L
    , num_leaves = 2L
    , bagging_freq = 1L
    , subsample = 1.0
)

dtrain <- lgb.Dataset(data = x, label = y)
bst_lin_1 <- lgb.train(data = dtrain, nrounds = 10L, params = lgb_params_1, valids = list("train" = dtrain))

lgb_params_2 <- lgb_params_1
lgb_params_2$linear_tree <- TRUE # this is the only parameter that has changed
dtrain <- lgb.Dataset(data = x, label = y)
bst_lin_2 <- lgb.train(data = dtrain, nrounds = 10L, params = lgb_params_2, valids = list("train" = dtrain))

pred_1 <- predict(bst_lin_1, x, predcontrib = FALSE) # predict on model 1
pred_contrib_1 <- rowSums(predict(bst_lin_1, x, predcontrib = TRUE)) # predict on model 1 with feature contribs
diff_1 <- pred_1 - pred_contrib_1
sd(diff_1) # very close to zero as expected

pred_2 <- predict(bst_lin_2, x, predcontrib = FALSE) # predict on model 2
pred_contrib_2 <- rowSums(predict(bst_lin_2, x, predcontrib = TRUE)) # predict on model 2 with feature contribs
diff_2 <- pred_2 - pred_contrib_2
sd(diff_2) # not zero - rowSums do not total to predicted values
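
The check in the example above can be wrapped in a small helper. This is a minimal sketch in base R, not part of lightgbm: the name `contribs_are_additive` and the toy numbers are hypothetical, and the toy matrix only stands in for the output of predict() with predcontrib = TRUE (feature columns followed by a base value column). It compares the largest absolute difference against a tolerance, because sd() of the differences is also zero when the contributions miss the prediction by a constant offset.

```r
# Hypothetical helper (not part of lightgbm): TRUE when every row of `contrib`
# sums to the matching element of `pred`, within `tol`.
contribs_are_additive <- function(pred, contrib, tol = 1e-6) {
  stopifnot(length(pred) == nrow(contrib))
  max(abs(as.numeric(pred) - rowSums(contrib))) <= tol
}

# Made-up numbers standing in for predict() output:
# two feature columns plus a final base value column.
contrib <- cbind(c(0.5, -0.2), c(0.1, 0.3), c(1.0, 1.0))
pred <- c(1.6, 1.1)

contribs_are_additive(pred, contrib)        # rows sum to the predictions
contribs_are_additive(pred + 0.5, contrib)  # constant offset; sd() of the differences would still be 0
```

With the models from the example, the same call would be `contribs_are_additive(pred_2, predict(bst_lin_2, x, predcontrib = TRUE))`.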

Environment info

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

LightGBM installed from source (following R installation instructions), version 3.1.1.99

Additional Comments

This is possibly the same as issue #3998

I think the lgb.model.dt.tree function might need some work as well, since you might not want to show the constant leaf values in the output when linear_tree = TRUE.

Many thanks

@btrotta
Collaborator

btrotta commented Feb 20, 2021

@SpeckledJim2 feature contributions are not yet implemented for linear trees. I'll make a PR to update the docs to mention this.

Pull requests are welcome if anyone would like to work on implementing SHAP (i.e. predicting feature contributions) for linear trees.

Regarding your other comment, I agree it's somewhat confusing to have both the constant leaf values and the linear coefficients in the output. On the other hand, it might be worth keeping both: the constant values are calculated automatically even for linear trees (so they cost no extra work), and they give us the option to recover the basic constant-value tree from the output. I don't have a very strong view either way on this, so comments from others are welcome.

@SpeckledJim2
Author

Thanks for the reply. Regarding the tree output table, I agree that leaving the constant values in there makes sense, as you get them "for free" anyway. The extra thing to include, I think, would be the coefficients of the linear model for each leaf, if possible. It might not be something that has widespread use, though, so I don't have a strong opinion on it either.

Regarding feature contributions for linear trees, I am happy to help test anything that is developed. My skills are limited to R at the moment, but if there is a way to help there, do let me know.

@btrotta
Collaborator

btrotta commented Feb 23, 2021

@SpeckledJim2 the coefficients of the linear model are already available in the output of save_model. The relevant parts of the output are num_features (the number of features used in the linear model for each leaf), leaf_features (the indices of the features used in each leaf's linear model), leaf_const (the constant terms of the linear models), and leaf_coeff (the coefficients of the linear models). Note that the first tree in the list always has constant models at the leaves, so it will not have linear coefficients, but the subsequent trees will.
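
To illustrate how the four fields named above fit together, here is a minimal base R sketch. The numbers are made up rather than read from a real model file, the helper `leaf_output` is hypothetical, and two layout details are assumptions: that leaf_features and leaf_coeff are stored as flat vectors with all leaves concatenated in order, and that feature indices are 0-based. It regroups the flat vectors per leaf using num_features, then evaluates one leaf's linear model as its constant term plus the coefficient-weighted feature values.

```r
# Hypothetical values standing in for fields read from a saved model;
# they are made up for illustration, not taken from a real model file.
num_features  <- c(1L, 2L)          # features used by each leaf's linear model
leaf_features <- c(0L, 0L, 1L)      # 0-based feature indices, leaves concatenated (assumed layout)
leaf_coeff    <- c(2.0, 1.5, -0.5)  # coefficients, in the same order as leaf_features
leaf_const    <- c(0.1, 0.3)        # constant term of each leaf's linear model

# Split the flat vectors into one chunk per leaf. Using explicit factor levels
# keeps an empty chunk for any leaf whose model uses no features.
leaf_id <- factor(
  rep(seq_along(num_features), times = num_features),
  levels = seq_along(num_features)
)
features_by_leaf <- split(leaf_features, leaf_id)
coeff_by_leaf <- split(leaf_coeff, leaf_id)

# Evaluate the linear model of one leaf for a single observation.
leaf_output <- function(leaf, x_row) {
  idx <- features_by_leaf[[leaf]] + 1L  # convert 0-based indices to R's 1-based
  leaf_const[leaf] + sum(coeff_by_leaf[[leaf]] * x_row[idx])
}

x_row <- c(1.0, 2.0)
leaf_output(1L, x_row)  # 0.1 + 2.0 * 1.0
leaf_output(2L, x_row)  # 0.3 + 1.5 * 1.0 - 0.5 * 2.0
```

The sketch ignores missing values and anything else the library handles at prediction time, so it is only a way to read the coefficients, not a reimplementation of predict().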

@StrikerRUS changed the title from "Sum of feature contributions != predicted value when linear_tree = TRUE" to "SHAP feature contribution for linear trees" on Mar 7, 2021
@StrikerRUS
Collaborator

Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
