[Feature Request] Add score_tree_interval during early stopping #5090
Comments
Em, I wanted to implement staged_predict for XGBoost.
Happy to take a stab at it. Due to certain restrictions I won't be able to open a PR into this repository until I get approval, but I can collaborate here in the meantime.
Sorry for the late reply. I will take a look into this option today.
No problem, it's not super high priority for me. I'm mostly interested in getting familiar with the internals of this library, and I think attempting to implement this kind of thing would teach me a lot.
Give it a go, and feel free to reach me if you need any help. One piece of personal advice: be careful of the prediction cache, it bites.
Great, thanks!
@trivialfis Just getting some time to start looking at this today. I think there was a misunderstanding about what I'd be implementing. I was intending to score only certain iterations of the model during fit, to improve fit performance (speed, not accuracy); staged_predict is on the prediction side. I don't know of any parameter in the scikit-learn implementation that enables this kind of thing, so it would have to be an xgboost-specific parameter. The logic should be fairly simple: at a quick glance, all that would need to be done is to skip the eval_set call in this loop when the iteration is not one of the iterations to score. Off the top of my head it would probably look something like this.
EDIT: looks like my initial idea was a bit naive; there would be modifications to the early stopping callback as well.
Oops, sorry for missing the ping. Will look into it. Might be slow in response this week.
No problem at all! Take your time |
Request:
Add a score_tree_interval option so that when you're building with really large data, the model doesn't evaluate on every tree. Similar to this:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/score_tree_interval.html
Purpose:
With large data, scoring every iteration on the validation set is extremely costly. Currently with early_stopping_rounds the behavior is still to score on every round:
https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor.fit
It would be nice to be able to space out the early stopping rounds, as H2O does; see H2O's stopping_rounds parameter for example. You can tell H2O to score every 20 trees, and if the model hasn't improved in 5 scoring iterations (i.e., 100 trees), training stops. XGBoost could do the same.
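Abstracted away from any particular library, the H2O-style rule described above ("score every k trees, stop after n stagnant scorings") boils down to a small piece of state. A minimal pure-Python sketch; the class and parameter names are invented for illustration and are not an existing xgboost or H2O API:

```python
class IntervalStopper:
    # Hypothetical helper. Tracks a "lower is better" metric that is
    # scored once every k trees, and signals stop after `patience`
    # consecutive scorings without improvement (so patience=5 with
    # k=20 means roughly 100 stagnant trees before stopping).
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def update(self, score):
        # Call only on scoring rounds; returns True when training
        # should stop.
        if score < self.best:
            self.best, self.stale = score, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```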
It would also be nice if you could enable different early stopping behavior at different points in training. For example, suppose you wanted to not score on the eval_set until the 1000th tree, and then score on every tree. This would make training more efficient if you knew beforehand (say, from prior modeling runs) roughly how many trees the model took to converge.
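The "don't score until tree 1000, then score every tree" idea could be expressed as a piecewise scoring schedule. A hypothetical sketch; should_score and the schedule format are made up here, not a real xgboost parameter:

```python
def should_score(round_idx, schedule=((0, 0), (1000, 1))):
    # `schedule` is a sorted tuple of (start_round, interval) pairs;
    # an interval of 0 means "never score during this phase". The
    # default reproduces the example above: no eval_set scoring before
    # round 1000, then scoring on every tree.
    interval = 0
    for start, k in schedule:
        if round_idx >= start:
            interval = k
    return interval > 0 and round_idx % interval == 0
```

The training loop would consult should_score(i) instead of scoring unconditionally, and a plain score_tree_interval of 20 would just be schedule=((0, 20),).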
I'm primarily concerned with the Python implementation of this library; I don't think this has been implemented elsewhere already.