
Add approx_contrib option for feature contributions #4219

Closed
gravesee opened this issue Apr 23, 2021 · 4 comments

@gravesee (Contributor)

xgboost has two methods for calculating feature contributions: TreeSHAP and Approximate. The approximate method is much faster and has some nice monotonicity properties that TreeSHAP does not guarantee. It is also easier to port to production systems where the lightgbm object can't be used directly.

Motivation

TreeSHAP is the gold standard but there are practical reasons for preferring a fast, simple method of calculating feature contributions. The approach outlined below can still be used for interpretation, is easier to port to other production systems, and is quicker to calculate.

Description

The approximate method first distributes the leaf weights up through the internal nodes of the tree: each parent node's weight is the cover-weighted average of its left and right children's weights. If lightgbm already stores internal node values, this becomes even simpler to implement.

Once weights have been assigned to every node, the contribution of a split is the child node's weight minus the parent node's weight; these differences are accumulated per feature along each record's decision path and summed across all trees in the ensemble.
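For concreteness, here is a minimal sketch of this approach (the "Saabas" method) written against the JSON produced by Booster.dump_model(). It assumes the usual tree_structure keys (split_feature, threshold, left_child, right_child, leaf_value, leaf_count), handles only numerical <= splits, and ignores categorical splits and missing-value handling; approx_contrib and the helper names are hypothetical, not an existing lightgbm API.

import numpy as np

def node_value(node):
    # Cover-weighted average of the leaf values below this node.
    if "leaf_value" in node:  # leaf dicts carry leaf_value; internal nodes do not
        return node["leaf_value"], node.get("leaf_count", 1)
    left_value, left_count = node_value(node["left_child"])
    right_value, right_count = node_value(node["right_child"])
    total = left_count + right_count
    return (left_value * left_count + right_value * right_count) / total, total

def tree_contrib(node, x, contrib):
    # Walk one record's decision path, crediting each split's feature with
    # the change in node value that taking the split causes.
    value, _ = node_value(node)
    while "split_feature" in node:
        feature = node["split_feature"]
        child = (node["left_child"] if x[feature] <= node["threshold"]
                 else node["right_child"])
        child_value, _ = node_value(child)
        contrib[feature] += child_value - value
        node, value = child, child_value

def approx_contrib(model_json, X):
    # Returns an (n_records, n_features + 1) matrix; the last column holds
    # the bias, i.e. the sum of the trees' root values.
    n_records, n_features = X.shape
    out = np.zeros((n_records, n_features + 1))
    for tree in model_json["tree_info"]:
        root = tree["tree_structure"]
        out[:, -1] += node_value(root)[0]
        for i in range(n_records):
            tree_contrib(root, X[i], out[i, :n_features])
    return out

By construction each row sums to the model's raw prediction, which is what makes the attribution easy to audit in production. A real implementation would compute the node values once per tree instead of recomputing them per record, as this sketch does for brevity.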

References

http://blog.datadive.net/interpreting-random-forests/

@StrikerRUS (Collaborator)

@gravesee Thanks a lot for the feature request!

Just to clarify: are you aware that you can get feature importance with the help of the following function (and the corresponding Booster methods in the language wrappers)?

/*!
 * \brief Get model feature importance.
 * \param handle Handle of booster
 * \param num_iteration Number of iterations for which feature importance is calculated, <= 0 means use all
 * \param importance_type Method of importance calculation:
 *   - ``C_API_FEATURE_IMPORTANCE_SPLIT``: result contains numbers of times the feature is used in a model;
 *   - ``C_API_FEATURE_IMPORTANCE_GAIN``: result contains total gains of splits which use the feature
 * \param[out] out_results Result array with feature importance
 * \return 0 when succeed, -1 when failure happens
 */
LIGHTGBM_C_EXPORT int LGBM_BoosterFeatureImportance(BoosterHandle handle,
                                                    int num_iteration,
                                                    int importance_type,
                                                    double* out_results);
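For reference, the same aggregates are available from the Python wrapper; a quick self-contained sketch (synthetic data, parameter choices arbitrary):

import lightgbm as lgb
import numpy as np

X, y = np.random.rand(200, 5), np.random.rand(200)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, y), num_boost_round=10)
print(booster.feature_importance(importance_type="split"))  # times each feature is used
print(booster.feature_importance(importance_type="gain"))   # total gain per feature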

@gravesee (Author)

@StrikerRUS thanks for the response. The function you linked returns one aggregate importance value per feature for the whole model. What I am requesting in this issue is the approximate feature contribution of every feature for each input record. The output would be a matrix with one row per input record and one column per feature, plus an extra column for the model bias.
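For comparison, a sketch of the requested shape using the TreeSHAP contributions that LightGBM already exposes via pred_contrib=True (synthetic data; the approximate method would return a matrix of the same shape, just computed more cheaply):

import lightgbm as lgb
import numpy as np

X, y = np.random.rand(100, 5), np.random.rand(100)
booster = lgb.train({"objective": "regression", "verbose": -1},
                    lgb.Dataset(X, y), num_boost_round=10)
contrib = booster.predict(X, pred_contrib=True)
print(contrib.shape)  # (100, 6): one column per feature plus a final bias column
# Each row sums to the raw prediction for that record.
np.testing.assert_allclose(contrib.sum(axis=1), booster.predict(X), rtol=1e-6)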

@StrikerRUS (Collaborator)

@gravesee Ah, I see! Thanks for clarifying!

@StrikerRUS (Collaborator)

Closed in favor of tracking this request in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment, if you are not the topic starter) if you are actively working on implementing it.
