LGBM sometimes strongly extrapolates in a regression problem with huge feature set #5033
Thanks for using LightGBM. Are you able to provide the details asked for in the template that was shown when you clicked "create issue", like:
Without information like that, you are asking maintainers here to just guess at what's going on, and I don't think that guessing is likely to lead you to a resolution. Based on the information you've provided so far, the only thing I can ask is: do you see issues with "extrapolating strongly" if you use a different metric?
Hello @jameslamb, thanks a lot for your quick response. I am sorry that I left a lot of uncertainty in my initial phrasing of the question, and I would hence like to address the open points and clarify further so that the maintainers/developers of LGBM can get the bigger picture here. See my description below, following the template.

Description
As stated in the original first post, we are talking about a regression problem with a large feature set (partially due to 70 one-hot features). The total number of features is ~200. The situation we are facing is that one of these 200 features shows the problematic behavior. We identified the feature by using SHAP value analysis. I will try to give you two concrete numeric examples here:
Some hypotheses: The remaining question is, did you experience something like this? Is it common or even expected behavior, and are there any tips/leads to mitigate the problem? I could imagine it isn't a bug. So please don't get me wrong here, I am not expecting any definitive answer, just maybe some interesting leads or explanations that might or might not support the hypotheses stated above.

Reproducible example
Unfortunately, in the project I am working on, LGBM is just one part of a huge architecture. The data is not publicly available and hence I cannot create a reproducible example here. I am sorry for that.

Environment info
Currently version 2.3.1 is used on the Azure Cloud. Docker and K8s are used. We are planning to bump the version soon, but believe that this is (hopefully) not the root cause of the issue and that it will likely persist with newer versions as well. But that remains to be seen.

Additional Comments
My initial description regarding the configuration/loss was too unspecific. Actually, we are using RMSE (regression_l2) as the loss function during training. We are also using the standard set of hyperparameters, besides the following that were changed from the default values: We played around with regularization and some other parameters to prevent overfitting, but our problem was not mitigated by this, as it rather looks like an extrapolation issue due to the clear differences between train and test set (where a feature might reach beyond the boundaries ever seen in the training set, see above description).
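For illustration, a training setup with the RMSE loss and some of the regularization-related parameters that are commonly tuned against extreme leaf values might look like the sketch below; the concrete values and dataset names are assumptions, since the exact non-default settings are not listed above.

```python
import lightgbm as lgb

# train_set / valid_set: lgb.Dataset objects (assumed); parameter values are illustrative
params = {
    "objective": "regression_l2",   # RMSE training loss, as stated above
    "metric": "rmse",
    "learning_rate": 0.05,
    "num_leaves": 31,
    "min_data_in_leaf": 50,         # larger leaves tend to damp extreme leaf values
    "lambda_l2": 10.0,              # L2 regularization on leaf values
    "max_depth": 8,
}
booster = lgb.train(params, train_set, num_boost_round=500, valid_sets=[valid_set])
```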
Just a quick side note: the newest stable release of LGBM was also tried out but did not mitigate the above-stated issue.
@DataDemystifier Thanks for using LightGBM. Ideally, feature values in the test set beyond the values in the training set should not produce crazy large predictions, because these feature values only determine which leaves of the decision trees a sample falls into. The output value is just the summation of leaf prediction values, which are saved in the model and unchanged after training. And a feature with a single value should be filtered out by LightGBM when preprocessing the data, so example 2 also surprised me.
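That property can be checked directly on a trained model: every prediction is the sum of one leaf value per tree, so the sums of the per-tree minimum and maximum leaf values bound all possible outputs. A minimal sketch, assuming a trained lightgbm.Booster named `booster`:

```python
# Leaf rows in trees_to_dataframe() have no split feature; their "value"
# column holds the leaf output that is summed into the prediction.
tree_df = booster.trees_to_dataframe()
leaf_df = tree_df[tree_df["split_feature"].isna()]

per_tree = leaf_df.groupby("tree_index")["value"]
lower_bound = per_tree.min().sum()
upper_bound = per_tree.max().sum()
print("prediction range implied by leaf values:", lower_bound, "to", upper_bound)
```

Any prediction the model can produce has to lie inside this range.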
Thanks for the information.
@DataDemystifier Thanks for your example. Did you use distributed training? Is it possible to get the tree models from your pipeline? It would be super helpful if we could get the tree model and some instances from the validation set which show abnormal prediction values.
@shiyu1994 thanks for your take. We don't use distributed training. Unfortunately, I am a bit unsure whether I will be able to send you the tree and some validation set examples. But maybe you can outline some steps you would conduct for your analysis. Using the trees_to_dataframe() and create_tree_digraph() methods to see how those high predictions are created could be a good option. Another question: would you also classify the problem as an overfitting problem in general, based on the information given so far? So, for example, L2 regularization would then probably be another lead to check, right? If something is found, I will keep you updated. Thanks a lot and best regards!
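As one concrete way to do that, the leaf each tree assigns to a suspicious validation sample can be looked up with pred_leaf=True and matched against the leaf values from trees_to_dataframe(), which shows which trees contribute most to the extreme prediction. A rough sketch, assuming a trained `booster` and a single-row input `x_row` (both names are placeholders):

```python
import numpy as np

# One leaf index per tree for this sample
leaf_idx = booster.predict(x_row, pred_leaf=True)[0]

tree_df = booster.trees_to_dataframe()
leaves = tree_df[tree_df["split_feature"].isna()]

# Leaf node_index entries follow the "{tree_index}-L{leaf_index}" pattern
contributions = []
for tree_i, leaf_i in enumerate(leaf_idx):
    row = leaves[(leaves["tree_index"] == tree_i) &
                 (leaves["node_index"] == f"{tree_i}-L{leaf_i}")]
    contributions.append(float(row["value"].iloc[0]))

# The largest contributions point to the trees (and, via create_tree_digraph,
# the split paths) responsible for the extreme prediction.
print(sorted(contributions)[-10:], "sum over all trees =", np.sum(contributions))
```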
Hey all,
Currently, I am using LGBM in a project where I am facing a quite remarkable problem.
It is a regression problem with a large feature set (partially due to 70 one-hot features). The total number of features is ~200.
The situation I am facing is that I have a feature whose test-time values lie outside the typical bounds seen in the training data.
The trained model goes crazy and, judging by the loss, more or less overfits directly. When predicting, in a normal situation typical MAPE values are at most 100% or so. But in this case the model extrapolates very strongly and predicts almost 800% or even more. So the predictions are really far off and actually exceed any value seen as a target in the training dataset.
The validation loss shoots up more or less from the first boosting round onwards and keeps increasing.
Played around with the hyperparameters a bit but could not mitigate the problem.
Hence I would like to ask if such problems are common and if somebody knows a good mitigation strategy.
An ideal early stopping, for example, would stop at boosting round 1, which does not seem like the right mitigation at all.
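For reference, an early-stopping and validation-monitoring setup of the kind referred to here could be sketched as follows; the callbacks are standard LightGBM ones, while the parameter values and dataset names are illustrative assumptions:

```python
import lightgbm as lgb

# params / train_set / valid_set as in the sketch further up (assumptions)
evals = {}
booster = lgb.train(
    params,
    train_set,
    num_boost_round=1000,
    valid_sets=[train_set, valid_set],
    valid_names=["train", "valid"],
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.record_evaluation(evals),   # keeps the per-round metric history
    ],
)
# evals["valid"] shows whether the validation metric really climbs from round 1 on.
```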
More generally: is it known when LGBM runs into problems with extrapolation? Thinking from a bagging perspective, my hope was that extrapolation for a tree-based model should happen only in extremely rare cases. But now I would like to understand why I see this on multiple occasions in my problem right now.
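A quick way to quantify how often test-set values leave the training range is to compare per-feature minima and maxima; a small sketch, assuming pandas DataFrames `X_train` and `X_test`:

```python
# Features whose test-set values fall outside the range seen during training
train_min, train_max = X_train.min(), X_train.max()
out_of_range = ((X_test < train_min) | (X_test > train_max)).any(axis=0)
print(out_of_range[out_of_range].index.tolist())
```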
Any insights would be very valuable for me. Thanks a lot already!
Best regards!