Fix #168. Enforce float32 type for split condition values for GBT models created using XGBoost #188
Conversation
@StrikerRUS In case you're curious and want to play with this issue (and for historical purposes), I'm attaching a serialized (pickled) XGBoost regression model trained using the "hist" method: round_error_xgboost.bin.gz.
Note the … This is where the generated code follows the "yes" path while XGBoost does the opposite.
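For anyone who wants to poke at it locally, loading the attachment should be as simple as the sketch below (assuming the artifact is a gzipped pickle, as the file name suggests):

```python
import gzip
import pickle

# Load the attached model; the file name comes from the comment above.
with gzip.open("round_error_xgboost.bin.gz", "rb") as f:
    model = pickle.load(f)

print(type(model))  # expected: an XGBoost regressor trained with tree_method="hist"
```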
Ah, brilliant investigation! BTW, doesn't this code contradict the following from one of treelite's issues?

m2cgen/m2cgen/assemblers/tree.py, lines 54 to 57 in 52c601b

Also refer to …
Just a few minor questions.
m2cgen/assemblers/boosting.py (Outdated)

```diff
@@ -134,7 +134,7 @@ def _assemble_tree(self, tree):
         if "leaf" in tree:
             return ast.NumVal(tree["leaf"])

-        threshold = ast.NumVal(tree["split_condition"])
+        threshold = ast.NumVal(np.float32(tree["split_condition"]))
```
Maybe a more general solution would be to add an optional `dtype` constructor argument? I mean:

```python
class NumVal(NumExpr):
    def __init__(self, value, dtype=np.float64):
```
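A minimal sketch of how that could behave, assuming the dtype is simply applied to the wrapped value (the stub base class is only here to make the snippet self-contained; the real `NumExpr` lives in m2cgen's `ast` module):

```python
import numpy as np


class NumExpr:
    """Stub standing in for m2cgen's NumExpr base class."""


class NumVal(NumExpr):
    def __init__(self, value, dtype=np.float64):
        # The dtype controls the precision of the stored constant.
        self.value = dtype(value)


print(NumVal(0.671000004).value)                    # 0.671000004 (float64 default)
print(NumVal(0.671000004, dtype=np.float32).value)  # 0.671 (the float32 rounding)
```

The XGBoost assembler would then pass `dtype=np.float32` only for split conditions, while everything else keeps the current float64 default.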
That's a good idea 👍
```diff
     CLASSIFICATION,
 )


-def regression_random(model):
+def regression_random(model, test_fraction=0.02):
```
Does the `test_fraction` increase for random datasets allow reproducing the original issue?
Yes; unfortunately, the default fraction produced way too few samples to reproduce the issue reliably.
Got it! We definitely need to refactor the testing routines to be more tunable, e.g. allow adjusting `test_fraction` according to the programming language. Refer to #114 (comment).
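Something along these lines, perhaps (names and values here are purely illustrative, not an existing m2cgen API):

```python
# Hypothetical per-language override for the e2e test sample size.
TEST_FRACTION_BY_LANG = {
    "python": 0.02,
    "java": 0.02,
    "c": 0.1,
}


def get_test_fraction(lang, default=0.02):
    return TEST_FRACTION_BY_LANG.get(lang, default)
```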
For reference, a more explicit proof that thresholds are supposed to be float: … Also, it seems that the threshold can be an integer: …
@StrikerRUS Thanks for all the additional context!
That's rather weird. I clearly remember that scikit-learn used float32 in its tree implementation as well. Perhaps a more recent fix?
Have no idea... https://github.com/scikit-learn/scikit-learn/blob/38030a00a7f72a3528bd17f2345f34d1344d6d45/sklearn/tree/_tree.pyx#L186
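For what it's worth, the dtype scikit-learn exposes for fitted thresholds is easy to check (a small sketch; behavior may differ between versions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X, y = np.random.rand(100, 3), np.random.rand(100)
est = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Thresholds come back as float64 on recent versions, even though scikit-learn
# casts the input matrix X itself to float32 internally.
print(est.tree_.threshold.dtype)
```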
@StrikerRUS I only got a chance to address your comment now. Sorry about the delay.
149b0fa to 242165e
No problem at all! 🙂 Seems that the linter is unhappy with the imported … As for the scikit-learn issue (#188 (comment)), I believe it is better to have a separate PR.
Fixed the linter error and added a test. Thanks 👍
Thanks a lot for fixing this issue!
Everything looks OK to me.
As it turns out, the issue reported in #168 is not unique to the "hist" tree construction algorithm. It seems that with the "hist" method the likelihood of reproducing it is much higher due to its reliance on feature histograms. I was able to reproduce the same discrepancy with non-hist methods on a larger sample of test data.
The issue occurs due to a double-precision error and reproduces every time the feature value matches the split condition in one of the tree's nodes.
Example: feature value = `0.671`, split condition = `0.671000004`. When we hit this condition in the generated code, the outcome of `0.671 < 0.671000004` is "true" (the "yes" branch), while in XGBoost the same condition leads to the "no" branch.

After some investigation I noticed that XGBoost's `DMatrix` forces all values to be `float32` (https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L565). At the same time, our assemblers rely on default 64-bit floats. Forcing the split condition to be `float32` seems to address the issue; at least I couldn't reproduce it so far.
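The mismatch is easy to demonstrate in isolation with the numbers from the example above (a small sketch of the two comparisons):

```python
import numpy as np

x = 0.671                # raw feature value, as the generated code receives it (float64)
threshold = 0.671000004  # split condition as printed in the XGBoost model dump

# What the generated code evaluated before the fix: a pure float64 comparison.
print(x < threshold)  # True  -> "yes" branch

# What XGBoost effectively evaluates: DMatrix has already cast the feature to
# float32, and the stored split condition is a float32 as well.
print(np.float32(x) < np.float32(threshold))  # False -> "no" branch

# 0.671 is not exactly representable in float32; its rounded value is exactly
# the number that gets printed as 0.671000004 in the dump.
print(float(np.float32(x)))  # 0.6710000038146973
```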