Add random uniform sentinels to avoid overfitting #4622
Comments
Hi @Shulito, thanks for using LightGBM. Using random uniform sentinel features to set an implicit threshold for minimum split gain (as an implicit
I don't know if it's widely used. Since this is not supported out of the box by frameworks, what people do is add these sentinel features manually to the dataset, train a model, check which features rank below the sentinels in the feature importance list (those below are discarded), and then continue to build the "real" model.
It seems that sentinel features are used to filter out some features after a model finishes training, based on the final feature importance. In that case, we don't have to provide direct support in LightGBM, since it is easy to implement and doesn't interfere with the training process.
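The manual workflow described above can be sketched in a few lines. This is a minimal illustration, not LightGBM-specific code: it uses scikit-learn's GradientBoostingClassifier as a stand-in for a LightGBM model (any estimator exposing `feature_importances_` would work the same way), and the sentinel count of 5 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Toy dataset: 10 real features, of which 4 are actually informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Append N random uniform sentinel features (uncorrelated with y by construction).
n_sentinels = 5
sentinels = rng.uniform(0.0, 1.0, size=(X.shape[0], n_sentinels))
X_aug = np.hstack([X, sentinels])

# Train once on the augmented data, then inspect feature importances.
model = GradientBoostingClassifier(random_state=0).fit(X_aug, y)
importances = model.feature_importances_

# Discard any real feature whose importance falls below the strongest sentinel,
# then retrain the "real" model on the surviving features only.
sentinel_max = importances[X.shape[1]:].max()
keep = [i for i in range(X.shape[1]) if importances[i] > sentinel_max]
print("kept features:", keep)
```

The sentinels act as an empirical noise floor: a real feature that cannot out-rank pure noise in the importance list is unlikely to carry signal.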
Closed in favor of #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.
Preface
Sorry if this feature has already been added to the framework. I looked everywhere and it doesn't seem so, but it's difficult to search because searching for "random + tree" 99.99% of the time leads to random forest.
Summary
Instead of using the traditional hyperparameters to control overfitting (like max_depth), add random uniform feature variables that act as sentinels to check whether the split of a node is going to lead to an overfitted tree.
Motivation and Description
Create N random (therefore, uncorrelated) uniform feature variables between 0 and 1 and add them to the dataset. If, when constructing one of the trees, one of these sentinel features is selected as the best feature to split the node over the real features of the dataset, that means this node shouldn't be split, because the split finder found a spurious correlation that scores better than any split of the real features. If this happens at the root, stop creating trees.
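To make the proposed stopping rule concrete, here is a toy, framework-independent sketch of the per-node decision. The `stump_gain` split finder (variance reduction of the best single threshold) is a simplified stand-in for LightGBM's actual gain computation, and the feature/sentinel setup is invented for illustration.

```python
import numpy as np

def stump_gain(x, y):
    """Best variance-reduction gain over all single-threshold splits on x
    (toy split finder; real gradient boosting uses gradient/hessian stats)."""
    order = np.argsort(x)
    ys = y[order]
    n = len(y)
    total_sse = y.var() * n
    best = 0.0
    for i in range(1, n):
        left, right = ys[:i], ys[i:]
        gain = total_sse - (left.var() * i + right.var() * (n - i))
        best = max(best, gain)
    return best

rng = np.random.default_rng(0)
n = 300
x_real = rng.normal(size=n)
y = (x_real > 0).astype(float) + rng.normal(scale=0.1, size=n)  # y depends on x_real
sentinels = rng.uniform(size=(n, 3))  # 3 uncorrelated uniform sentinel features

real_gain = stump_gain(x_real, y)
sentinel_gain = max(stump_gain(sentinels[:, j], y) for j in range(3))

# Proposed rule: only split this node if some real feature beats every sentinel;
# otherwise the "best" split is indistinguishable from noise, so stop here.
should_split = real_gain > sentinel_gain
print("split this node:", bool(should_split))
```

Here the real feature genuinely drives the target, so it out-gains the sentinels and the split is allowed; on a node where only spurious correlations remain, a sentinel would win and growth would stop.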
Alternatives
Allow users to register predicate callbacks (with access to the environment) that run before a split happens and before a new tree is created, so that user-defined logic can stop node splitting and tree creation.
References
https://www.kdnuggets.com/2019/10/feature-selection-beyond-feature-importance.html -> Feature Importance + Random Features section.