
xgboost 1.1.1 pred failed, while 0.90 pred success #5841

Closed
zwqjoy opened this issue Jun 30, 2020 · 22 comments · Fixed by #5955

@zwqjoy commented Jun 30, 2020

Contents of 1line_inst (LIBSVM format):
0 999:2000.000000

# model1.bin was trained with xgboost 0.90
# model2.bin was trained with xgboost 1.1.1

CODE1

import xgboost as xgb

print(xgb.__version__)
pred = xgb.DMatrix("1line_inst")

bst2 = xgb.Booster({'nthread': 4})  # init model
bst2.load_model('model2.bin')  # load the model trained with xgboost 1.1.1
print(bst2.predict(pred))

OUTPUT1

1.1.1
[15:16:14] 4x998 matrix with 2105 entries loaded from 1line_inst
Traceback (most recent call last):
  File "pred_zxb.py", line 12, in <module>
    print(bst2.predict(pred))
  File "/Users/zengwenqi/DXM/DXM-codebase/baidu/rimrdp/pipelines/venv/lib/python3.7/site-packages/xgboost/core.py", line 1580, in predict
    ctypes.byref(preds)))
  File "/Users/zengwenqi/DXM/DXM-codebase/baidu/rimrdp/pipelines/venv/lib/python3.7/site-packages/xgboost/core.py", line 190, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [15:16:14] /Users/travis/build/dmlc/xgboost/src/learner.cc:1070: Check failed: learner_model_param_.num_feature == p_fmat->Info().num_col_ (1104 vs. 998) : Number of columns does not match number of features in booster.
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x0000000118c101c0 dmlc::LogMessageFatal::~LogMessageFatal() + 112
  [bt] (1) 2   libxgboost.dylib                    0x0000000118cbda2a xgboost::LearnerImpl::ValidateDMatrix(xgboost::DMatrix*) const + 282
  [bt] (2) 3   libxgboost.dylib                    0x0000000118cbdb13 xgboost::LearnerImpl::PredictRaw(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int) const + 67
  [bt] (3) 4   libxgboost.dylib                    0x0000000118cadecc xgboost::LearnerImpl::Predict(std::__1::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, bool, bool, bool, bool, bool) + 732

CODE2

import xgboost as xgb

print(xgb.__version__)
pred = xgb.DMatrix("1line_inst")

bst2 = xgb.Booster({'nthread': 4})  # init model
bst2.load_model('model1.bin')  # load the model trained with xgboost 0.90
print(bst2.predict(pred))

OUTPUT2

0.90
[15:20:51] 1x1000 matrix with 1 entries loaded from 1line_test
[0.0208639]

@trivialfis (Member)

So it seems your model was trained on a dataset with a different shape?

@zwqjoy (Author) commented Jul 2, 2020

@trivialfis but the prediction instance has only one feature, at index 999, and both models were trained with a maximum feature index above 2000.

@trivialfis (Member) commented Jul 2, 2020

There were some heuristics around this that were never documented or tested. I'm trying to re-establish them in a way that can be precisely documented.

@trivialfis (Member)

learner_model_param.num_feature == p_fmat->Info().num_col (1104 vs. 998)

Also, I think there's something wrong here.

the 2 models training with max index feats is 2000+

From the error message, your model has 1104 features, while you say the training dataset has 2000+ features.
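
For what it's worth, a quick diagnostic sketch (assuming a recent xgboost where Booster.num_features() is available; file names are taken from the report above) to compare the model's feature count with the DMatrix column count before predicting:

import xgboost as xgb

bst = xgb.Booster()
bst.load_model("model2.bin")      # the 1.1.1 model from the report
dmat = xgb.DMatrix("1line_inst")

# Compare what the booster expects with what the data provides.
print("model features:", bst.num_features())  # 1104 in the error above
print("data columns:  ", dmat.num_col())      # 998 in the error above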

@hcho3 (Collaborator) commented Jul 6, 2020

@trivialfis @RAMitchell

From #5856 (comment):

In the older versions we had some heuristics to predict on this kind of mismatched number of features, like treating them as missing if the input has fewer trailing features, or ignoring them if the input has more. After an offline chat with @RAMitchell, neither of us wants to restore that heuristic, as it brings more issues than benefits.

I agree that the heuristic has issues, but we need to make a provision for reading LIBSVM files with a deficient number of features. LIBSVM files unfortunately do not store the number of columns, so the column dimension is inferred from the maximum feature ID that occurs. For example, it should be possible to feed the LIBSVM file

1 0:100

into a model trained with 10 features. The correct way to interpret this LIBSVM file is to assign 100 to the first feature and mark all the other features as missing. IMHO, rejecting LIBSVM files like this would make the built-in LIBSVM reader unusable.
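
A minimal workaround sketch under that constraint: load the file with scikit-learn's load_svmlight_file and force n_features to the model's feature count, so the column dimension is not inferred from the largest feature ID seen. The file name, model path, and feature count below are illustrative:

import xgboost as xgb
from sklearn.datasets import load_svmlight_file

# Force the column dimension to 10 (the model's num_feature) instead of
# letting it be inferred from the maximum feature ID in the file.
X, y = load_svmlight_file("deficient.libsvm", n_features=10, zero_based=True)

bst = xgb.Booster()
bst.load_model("model.bin")  # hypothetical model trained with 10 features
print(bst.predict(xgb.DMatrix(X)))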

@ranInc commented Jul 7, 2020

I also encountered this issue; the same model works on the same data under 0.9, as described here.

It may be related to the following issue I opened, which is happening more often (to me at least):
https://github.com/dmlc/xgboost/issues/5848

@trivialfis (Member)

@hcho3 Could you please help establish the heuristics we need to support? I'm quite confused by the old ones.

  • We should support predicting on a DMatrix with fewer features than the model. Those features will be treated as missing.
  • Should we support predicting on a DMatrix with more features than the model?
  • Should we output a warning on a mismatched number of features? With libsvm, this can be quite verbose.

@trivialfis (Member) commented Jul 13, 2020

Also, on the Python side there's a validate_features option; should we lower it down to C++ and make it a parameter?
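
For context, the option referred to is the validate_features argument on Booster.predict; a minimal sketch (model and data paths are illustrative):

import xgboost as xgb

bst = xgb.Booster()
bst.load_model("model.bin")       # hypothetical model file
dmat = xgb.DMatrix("1line_inst")

# validate_features=True (the default) checks feature names on the
# Python side; the C++ column-count check runs regardless.
preds = bst.predict(dmat, validate_features=True)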

@hcho3 (Collaborator) commented Jul 14, 2020

We should support predicting on a DMatrix with fewer features than the model. Those features will be treated as missing.

Yes.

Should we support predicting on a DMatrix with more features than the model?

No. We should raise an error in this case.

Should we output a warning on a mismatched number of features? With libsvm, this can be quite verbose.

No. If we only adopt the first heuristic, the behavior should be fairly predictable, I think.

on the Python side there's a validate_features option; should we lower it down to C++ and make it a parameter?

Now that feature names and types are in the C++ layer, we can. I'll let you decide.
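
For reference, a user-side sketch of the first heuristic, assuming a scipy CSR matrix with fewer columns than the booster expects (the model path and shapes are illustrative): pad the matrix with empty trailing columns, which the sparse DMatrix treats as missing.

import scipy.sparse as sp
import xgboost as xgb

def pad_to_num_feature(X, num_feature):
    """Append empty (all-missing) trailing columns so X matches the booster."""
    if X.shape[1] >= num_feature:
        return X
    pad = sp.csr_matrix((X.shape[0], num_feature - X.shape[1]), dtype=X.dtype)
    return sp.hstack([X, pad], format="csr")

bst = xgb.Booster()
bst.load_model("model.bin")                          # hypothetical model file
X = sp.random(5, 998, density=0.01, format="csr")    # data with too few columns
X = pad_to_num_feature(X, bst.num_features())        # widen to the model's count
print(bst.predict(xgb.DMatrix(X)))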

@lucagiovagnoli

I've just encountered this issue as well. Newer xgboost (>1.0.0) prints a lot of warnings to the screen:

WARNING: /xgboost/src/learner.cc:979: Number of columns does not match number of features in booster

We should support predicting on a DMatrix with fewer features than the model. Those features will be treated as missing

I agree that we should support this case.

If it helps, we never saw this issue with xgboost 0.82.

@hcho3 (Collaborator) commented Jul 23, 2020

@lucagiovagnoli I think you have a different issue. XGBoost uses zero-based indexing for features by default, so the diabetes dataset would be recognized as having 9 features. To use one-based indexing, append ?indexing_mode=1 to the file path:

dtrain = xgb.DMatrix('./diabetes.libsvm?indexing_mode=1')

Alternatively, you may use sklearn.datasets.load_svmlight_file().
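
For example, a sketch of the sklearn route; zero_based=False tells the loader that the file uses one-based feature indices (the path matches the comment above):

import xgboost as xgb
from sklearn.datasets import load_svmlight_file

# The diabetes file numbers its features 1..8, so declare one-based indexing.
X, y = load_svmlight_file("./diabetes.libsvm", zero_based=False)
dtrain = xgb.DMatrix(X, label=y)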

@lucagiovagnoli

Hi @hcho3, sorry, I had deleted the comment before you replied because I noticed it was my mistake :)
Thanks for the insight, I'll try that!

Original comment for reference:

Actually, this is happening with this dataset, which has no missing columns: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/diabetes
I wonder if the problem is that columns in the libsvm file are numbered from 1 to 8 without using zero?

@trivialfis (Member)

Will look into this.

@zwqjoy (Author) commented Aug 13, 2020

@trivialfis has the bug been fixed?

Does xgboost 1.1.1 now support predicting on a DMatrix with fewer features than the model, treating those features as missing?

Or is the fix in xgboost 1.2?

@trivialfis (Member)

Yup, please try the 1.2 RC.

@mshergad

I am getting this error when I try to make predictions using version 1.5.1, while the original model was created on 0.9. I have followed the steps to use save_model so that I can load the binary format in a newer version. Is there something I am missing?

XGBoostError: [12:48:23] /Users/runner/work/xgboost/xgboost/src/learner.cc:1257: Check failed: learner_model_param_.num_feature >= p_fmat->Info().num_col_ (120 vs. 122) : Number of columns does not match number of features in booster.
Stack trace:
[bt] (0) 1 libxgboost.dylib 0x0000000139c7e814 dmlc::LogMessageFatal::~LogMessageFatal() + 116
[bt] (1) 2 libxgboost.dylib 0x0000000139d3319b xgboost::LearnerImpl::ValidateDMatrix(xgboost::DMatrix*, bool) const + 587
[bt] (2) 3 libxgboost.dylib 0x0000000139d333a2 xgboost::LearnerImpl::PredictRaw(xgboost::DMatrix*, xgboost::PredictionCacheEntry*, bool, unsigned int, unsigned int) const + 50
[bt] (3) 4 libxgboost.dylib 0x0000000139d24226 xgboost::LearnerImpl::Predict(std::__1::shared_ptr<xgboost::DMatrix>, bool, xgboost::HostDeviceVector<float>*, unsigned int, unsigned int, bool, bool, bool, bool, bool) + 598
[bt] (4) 5 libxgboost.dylib 0x0000000139c756db XGBoosterPredictFromDMatrix + 1323
[bt] (5) 6 _ctypes.cpython-37m-darwin.so 0x0000000110124967 ffi_call_unix64 + 79
[bt] (6) 7 ??? 0x00007ff7b0919510 0x0 + 140701795980560

@trivialfis (Member)

@mshergad Your data must not have more features than the model does. The original issue here was having fewer features in the data than in the model, which was clarified and tested after this issue. But having more features in the data than in the model is invalid.

@trivialfis (Member)

Ideally one should match the datasets for test and train in the ETL process.

@mshergad

@mshergad Your data must not have more features than the model does. The original issue here was having fewer features in the data than in the model, which was clarified and tested after this issue. But having more features in the data than in the model is invalid.

Thank you so much for the quick reply. My data has a total of 122 columns: 120 are features that the model was trained on, plus two columns holding the target and the target_probability.

When I pass the entire dataset, it throws the error I showed. If I understand correctly, learner_model_param_.num_feature is 120 and p_fmat->Info().num_col_ is 122; the dataframe has 122 columns.

So I dropped the two columns and passed the 120 features the model was originally trained on, and then it says it expects the two other column names.

So I'm going in circles at the moment.

When I use the same code with version 0.90, where the model was originally created, the same command gives me the predictions.

Please advise!
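
A sketch of the alignment step described above, assuming a pandas DataFrame with the two extra columns; the column names are taken from this comment, while everything else (paths, file names) is illustrative:

import pandas as pd
import xgboost as xgb

bst = xgb.Booster()
bst.load_model("model.bin")    # hypothetical path to the saved model

df = pd.read_csv("data.csv")   # hypothetical 122-column dataset

# Drop the two non-feature columns, then reorder to the booster's
# recorded feature names (if any) so both names and order match.
X = df.drop(columns=["target", "target_probability"])
if bst.feature_names is not None:
    X = X[bst.feature_names]

print(bst.predict(xgb.DMatrix(X)))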

@mshergad

[Screenshot, Jan 12, 2022: error complaining about the missing column names]
Here's the error when I drop the two columns the model was not trained on.

@trivialfis (Member)

@mshergad Could you please share your code and data (maybe synthesized) so that I can take a closer look?

@mshergad

Sounds good. I will have to seek permissions. I will synthesize and share this soon. Thanks again!
