Add random uniform sentinels to avoid overfitting #4622
Comments
Hi @Shulito, thanks for using LightGBM. Using random uniform sentinel features to set an implicit threshold for minimum split gain (as an implicit
I don't know if it's widely used. Since this is not supported out of the box by frameworks, what people do is add these sentinel features manually to the dataset, train a model, check which features rank below the sentinels in the feature importance list (those below are discarded), and then continue to build the "real" model.
It seems that sentinel features are used to filter out some features after a model finishes training, based on the final feature importance. In that case, we don't have to provide direct support in LightGBM, since it is easy to implement and doesn't interfere with the training process.
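The manual workflow described above can be sketched in a few lines. This is a minimal illustration, not LightGBM-specific code: it uses scikit-learn's GradientBoostingClassifier as a stand-in for a LightGBM model (any estimator exposing `feature_importances_` would work the same way), and the sentinel count of 5 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Toy dataset: 10 real features, of which 4 are actually informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Append N random uniform sentinel features (uncorrelated with y by construction).
n_sentinels = 5
sentinels = rng.uniform(0.0, 1.0, size=(X.shape[0], n_sentinels))
X_aug = np.hstack([X, sentinels])

# Train once on the augmented data, then inspect feature importances.
model = GradientBoostingClassifier(random_state=0).fit(X_aug, y)
importances = model.feature_importances_

# Discard any real feature whose importance falls below the strongest sentinel,
# then retrain the "real" model on the surviving features only.
sentinel_max = importances[X.shape[1]:].max()
keep = [i for i in range(X.shape[1]) if importances[i] > sentinel_max]
print("kept features:", keep)
```

The sentinels act as an empirical noise floor: a real feature that cannot out-rank pure noise in the importance list is unlikely to carry signal.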
Closed in favor of #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
Preface
Sorry if this feature has already been added to the framework. I looked everywhere and it doesn't seem so, but it's difficult to search because searching for "random + tree" 99.99% of the time leads to random forest.
Summary
Instead of using the traditional hyperparameters to control overfitting (like max_depth), add random uniform feature variables that act as sentinels to check whether the split of a node is going to lead to an overfitted tree.
Motivation and Description
Create N random (therefore, uncorrelated) uniform feature variables between 0 and 1 and add them to the dataset. If, when constructing one of the trees, one of these sentinel features is selected as the best feature to split the node over the real features of the dataset, that means this node shouldn't be split, because the split finder found a spurious correlation that scores better than any split of the real features. If this happens at the root, stop creating trees.
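To make the proposed stopping rule concrete, here is a toy, framework-independent sketch of the per-node decision. The `stump_gain` split finder (variance reduction of the best single threshold) is a simplified stand-in for LightGBM's actual gain computation, and the feature/sentinel setup is invented for illustration.

```python
import numpy as np

def stump_gain(x, y):
    """Best variance-reduction gain over all single-threshold splits on x
    (toy split finder; real gradient boosting uses gradient/hessian stats)."""
    order = np.argsort(x)
    ys = y[order]
    n = len(y)
    total_sse = y.var() * n
    best = 0.0
    for i in range(1, n):
        left, right = ys[:i], ys[i:]
        gain = total_sse - (left.var() * i + right.var() * (n - i))
        best = max(best, gain)
    return best

rng = np.random.default_rng(0)
n = 300
x_real = rng.normal(size=n)
y = (x_real > 0).astype(float) + rng.normal(scale=0.1, size=n)  # y depends on x_real
sentinels = rng.uniform(size=(n, 3))  # 3 uncorrelated uniform sentinel features

real_gain = stump_gain(x_real, y)
sentinel_gain = max(stump_gain(sentinels[:, j], y) for j in range(3))

# Proposed rule: only split this node if some real feature beats every sentinel;
# otherwise the "best" split is indistinguishable from noise, so stop here.
should_split = real_gain > sentinel_gain
print("split this node:", bool(should_split))
```

Here the real feature genuinely drives the target, so it out-gains the sentinels and the split is allowed; on a node where only spurious correlations remain, a sentinel would win and growth would stop.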
Alternatives
Allow users to register predicate callbacks (with access to the environment) that run before a split happens and before a new tree is created, so that user-defined logic can stop node splitting and tree creation.
References
https://www.kdnuggets.com/2019/10/feature-selection-beyond-feature-importance.html -> Feature Importance + Random Features section.