xgboost 1.1.1 pred failed, while 0.90 pred success #5841
Comments
So, it seems your model was trained on a dataset with a different shape?
@trivialfis but the prediction instance only has one feature (index 999), and both models were trained with a max feature index of 2000+.
There were some heuristics around this that never got documented or tested. I'm trying to re-establish them in a way that can be precisely documented.
Also, I think there's something wrong here.
From the error message, your model has 1104 features while your training dataset has 2000+ features.
From #5856 (comment):
I agree that the heuristic has issues, but we need to make a provision for reading LIBSVM files with a deficient number of features. LIBSVM files unfortunately do not store the number of columns, so the column dimension is inferred from the maximum feature ID that occurs. For example, it should be possible to feed in the LIBSVM file
into a model trained with 10 features. The correct way to interpret such a LIBSVM file is to assign 100 to the first feature and mark all the other features as missing. IMHO, rejecting LIBSVM files like this would make the built-in LIBSVM reader unusable.
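To make the interpretation described above concrete, here is a minimal plain-Python sketch (an illustration only, not XGBoost's actual LIBSVM reader) of parsing one deficient instance against a feature count supplied by the model rather than inferred from the file:

```python
import math

def parse_libsvm_line(line, num_features):
    """Parse one LIBSVM-format instance into a dense row of length
    `num_features`, treating absent feature IDs as missing (NaN).

    Assumes zero-based feature indexing; `num_features` comes from
    the trained model, not from the file itself.
    """
    parts = line.split()
    label = float(parts[0])
    row = [math.nan] * num_features
    for kv in parts[1:]:
        idx, val = kv.split(":")
        row[int(idx)] = float(val)
    return label, row

# A line whose largest feature ID is 0, fed to a 10-feature model:
label, row = parse_libsvm_line("1 0:100", num_features=10)
# row[0] is 100.0; the other nine entries are NaN (missing)
```

The key point is that the row's width is fixed by the model, so a file that never mentions the higher feature IDs still yields a valid instance.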
I also encountered this issue; the same model works on the same data with 0.9, as described here. It might be related to the following issue I opened, which is happening more often (to me at least):
@hcho3 Could you please help establish the heuristics we need to support? I'm quite confused by the old heuristics.
Also, on the Python side there's a feature-validation option; should we move it down to C++ and make it a parameter?
Yes.
No. We should raise an error in this case.
No. If we only adopt the first heuristic, the behavior should be fairly predictable, I think.
Now that feature names and types are in the C++ layer, we can. I'll let you decide.
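As a rough sketch of what such Python-side feature validation amounts to (a hypothetical helper for illustration, not XGBoost's actual implementation):

```python
def validate_features(model_names, data_names):
    """Raise if the prediction data's feature names don't match the
    booster's. Hypothetical sketch of the name check discussed above;
    order matters, mirroring a strict validation mode.
    """
    if list(model_names) != list(data_names):
        missing = set(model_names) - set(data_names)
        extra = set(data_names) - set(model_names)
        raise ValueError(
            f"feature mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )

validate_features(["f0", "f1"], ["f0", "f1"])  # matching names pass silently
```

Moving a check like this into C++ would make it apply uniformly across all language bindings instead of only the Python wrapper.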
I've just encountered this issue as well. The newer xgboost (> 1.0.0) prints a lot of warnings to the screen:
I agree that we should support this case. If it helps, we never saw this issue on xgboost=0.82
@lucagiovagnoli I think you have a different issue. XGBoost uses zero-based indexing for features by default, so the diabetes dataset would be recognized as having 9 features. To use one-based indexing, append ?indexing_mode=1 to the URI: dtrain = xgb.DMatrix('./diabetes.libsvm?indexing_mode=1'). Alternatively, you may use
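The difference between the two indexing modes can be shown with a small parser sketch (pure Python, for illustration only):

```python
def parse_indices(line, one_based=False):
    """Return the column indices encoded in one LIBSVM line.

    With zero-based indexing (XGBoost's default), feature ID `k` maps
    to column `k`, so the inferred width is max_id + 1. With one-based
    indexing (indexing_mode=1 semantics), ID `k` maps to column `k - 1`.
    """
    ids = [int(kv.split(":")[0]) for kv in line.split()[1:]]
    offset = 1 if one_based else 0
    return [i - offset for i in ids]

line = "1 1:0.5 9:2.0"
print(parse_indices(line))                  # [1, 9] -> inferred width 10
print(parse_indices(line, one_based=True))  # [0, 8] -> inferred width 9
```

This is why a dataset written with one-based IDs appears to have one extra (always-missing) column when read under the zero-based default.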
Hi @hcho3, sorry, I had deleted the comment before you replied because I noticed it was my mistake :) Original comment for reference:
Will look into this.
@trivialfis has the bug been fixed? Should xgb 1.1.1 now support predicting on a DMatrix with fewer features than the model, treating those features as missing? Or is the fix in xgb 1.2?
Yup, please try the 1.2 RC.
I am getting this error when I try to make predictions using version 1.5.1, where the original model was created on 0.9. I have followed the steps to use save_model so that I can load the binary format in a newer version. Is there something I am missing?
XGBoostError: [12:48:23] /Users/runner/work/xgboost/xgboost/src/learner.cc:1257: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (120 vs. 122) : Number of columns does not match number of features in booster.
@mshergad Your data must have all the features that the model has. The original issue here was having more features in the data than in the model, which was clarified and tested after this issue. But having fewer features in the data than in the model is invalid.
Ideally one should match the datasets for test and train in the ETL process. |
Thank you so much for the quick reply. My data has a total of 122 columns: 120 are features that the model is trained on, plus two columns holding the target and the target_probability. When I pass the entire dataset, it throws the error I showed. If I understand correctly, learner_model_param_.num_feature is 120 and p_fmat->Info().num_col_ is 122, and the dataframe has 122 columns. So I dropped the two columns and passed the 120 features that the model was originally trained on, and then it says it expects the two other column names. So I'm going in circles currently. When I use the same code with version 0.90, where the model was originally created, the same command gives me the predictions. Please advise!
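One common way out of that circle is to align the scoring frame to the model's feature list explicitly, in the model's order, before building the DMatrix. A sketch with pandas (all column names here are hypothetical; the feature list would typically come from the booster, e.g. its feature_names):

```python
import pandas as pd

# Hypothetical: the features the booster was trained on, in order.
model_features = ["f0", "f1", "f2"]

# A raw scoring frame with extra non-feature columns and shuffled order.
df = pd.DataFrame({
    "f1": [0.5], "target": [1], "f0": [0.1],
    "target_probability": [0.9], "f2": [0.3],
})

# Select exactly the model's features in the model's order. Extra
# columns such as the target are dropped; a genuinely missing feature
# would surface as a column of NaN here instead of a late C++ error.
X = df.reindex(columns=model_features)
print(list(X.columns))  # ['f0', 'f1', 'f2']
```

Passing X (rather than the full frame) to the DMatrix keeps the column count and ordering consistent with the booster.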
@mshergad Could you please share your code and data (maybe synthesized) so that I can take a closer look?
Sounds good. I will have to seek permissions. I will synthesize and share this soon. Thanks again!
One-line instance:
0 999:2000.000000
# model1.bin trained with xgb 0.90
# model2.bin trained with xgb 1.1.1
CODE1
OUTPUT1
CODE2
OUTPUT2