
Divergence in HistGradientBoostingClassifier's scores #1051

Open
maximilianeber opened this issue Dec 19, 2023 · 7 comments

Comments


maximilianeber commented Dec 19, 2023

Hi,

I am trying to build a standard pipeline for tabular data that works nicely with ONNX. Ideally, the pipeline would:

  1. Be based on boosted trees
  2. Gracefully support mixed types (categorical/numerical)
  3. Exploit boosted trees' native support for categoricals
  4. Exploit boosted trees' native support for missing values

To keep debugging simple, I have built a pipeline that covers points 1-3. Preprocessing converts fine, but the ONNX model exported for HistGradientBoostingClassifier returns predictions that diverge from scikit-learn's (see gist).

Any ideas why this might happen? Are there known issues with HistGradientBoostingClassifier?

Thank you!

Package versions:

scikit-learn==1.3.*
skl2onnx==1.16.*
onnxruntime==1.16.*
@maximilianeber
Author

After some digging, I think this might be related to missing support for categorical splits in the converter — everything works as expected when using one-hot encoding in the preprocessor.

@xadupre I am happy to try filing a PR if you think it's a good idea to add support for categoricals. Wdyt?


xadupre commented Jan 3, 2024

I did not check their implementation recently, but if scikit-learn supports categories the same way lightgbm does, I guess they use a rule of the form `x in {cat1, cat2, ...}`, which is not supported by ONNX. onnxmltools deals with that case by multiplying nodes (https://github.com/onnx/onnxmltools/blob/main/onnxmltools/convert/lightgbm/operator_converters/LightGbm.py#L841), but the best way would be to update ONNX to support that rule. That said, I do think it is a good idea to support categorical features.
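To illustrate the node-multiplication idea in plain Python (this is not the onnxmltools code, just a sketch of the equivalence it relies on): a set-membership split can be rewritten as a chain of single-category equality tests, one node per category, which is something a standard tree-ensemble representation can express.

```python
def split_in_set(x, categories):
    """The rule a categorical split uses internally: x in {cat1, cat2, ...}."""
    return x in categories

def split_as_equality_chain(x, categories):
    """Equivalent form using only single-value comparisons — one test
    (i.e. one tree node) per category in the set."""
    return any(x == c for c in categories)

cats = {0, 3, 7}
# The two forms agree on every input, so a converter can expand one
# set-membership split into several equality nodes without changing output.
assert all(split_in_set(x, cats) == split_as_equality_chain(x, cats)
           for x in range(10))
print("equivalent on 0..9")
```

The cost is graph size: a split over k categories becomes k nodes, which is why native support in the ONNX spec is the better long-term fix.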


xadupre commented Apr 4, 2024

The right way of doing it is to implement the latest ONNX specification (onnx/onnx#5874) and then to update onnxruntime to support it.

@ogencoglu

The problem with one-hot encoding is that histogram gradient boosting might learn spurious interactions between the individual one-hot encoded columns during training. Therefore, it may not behave the same as declaring the feature as categorical in the model definition via categorical_features: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor

@maximilianeber
Author

The right way of doing it is to implement the latest ONNX specification (onnx/onnx#5874) and then to update onnxruntime to support it.

Sorry for being so late in replying. Sadly, we haven't found the capacity to contribute upstream this quarter. 👎

Therefore, it might not be the same as specifying that feature as categorical in model definition with categorical_features

Agreed. The other downside of one-hot encoding is that you need a lot of memory when the cardinality of the categorical feature(s) is high.
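A quick back-of-the-envelope illustration of that memory point (the row count and cardinality are made up for the example): a dense one-hot encoding needs rows × cardinality cells, versus one cell per row for an integer-coded column.

```python
# Hypothetical dataset: 1M rows, one categorical feature with 10k levels.
n_rows, cardinality = 1_000_000, 10_000

# Dense float64 one-hot matrix: rows * cardinality cells, 8 bytes each.
bytes_one_hot = n_rows * cardinality * 8
print(f"{bytes_one_hot / 1e9:.0f} GB for the dense one-hot matrix")  # 80 GB

# Ordinal-encoded column: one 8-byte value per row.
bytes_ordinal = n_rows * 8
print(f"{bytes_ordinal / 1e6:.0f} MB for the ordinal-encoded column")  # 8 MB
```

Sparse one-hot output mitigates this, but HistGradientBoostingClassifier densifies its input, so the blow-up is hard to avoid without native categorical support.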

@adityagoel4512

The right way of doing it is to implement the latest ONNX specification (onnx/onnx#5874) and then to update onnxruntime to support it.

I think an update to onnxruntime is pending review :)

khoover commented Jan 23, 2025

I think an update to onnxruntime is pending review :)

It's been merged in now.

Status: Can Fix but Waiting for an Answer