[RFC] compatibility with scikit-learn #2628

Closed · StrikerRUS opened this issue Dec 11, 2019 · 26 comments · Fixed by #2949

@StrikerRUS (Collaborator) commented Dec 11, 2019

The new scikit-learn version (0.22) breaks compatibility with our current estimators from the sklearn wrapper (it is not available in the default conda channel yet, which is why master is still green). There are two failing tests:

============================= test session starts ==============================
platform linux -- Python 3.6.9, pytest-5.3.1, py-1.8.0, pluggy-0.13
rootdir: /home/travis/build/microsoft/LightGBM
collected 91 items                                                             
../tests/c_api_test/test_.py ..                                          [  2%]
../tests/python_package_test/test_basic.py .............                 [ 16%]
../tests/python_package_test/test_consistency.py ....                    [ 20%]
../tests/python_package_test/test_engine.py ............................ [ 51%]
................                                                         [ 69%]
../tests/python_package_test/test_plotting.py .....                      [ 74%]
../tests/python_package_test/test_sklearn.py ..........F...........F     [100%]
=================================== FAILURES ===================================
_________________________ TestSklearn.test_grid_search _________________________
self = <test_sklearn.TestSklearn testMethod=test_grid_search>
    def test_grid_search(self):
        X, y = load_iris(True)
        y = np.array(list(map(str, y)))  # utilize label encoder at its max power
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
        params = {'subsample': 0.8,
                  'subsample_freq': 1}
        grid_params = {'boosting_type': ['rf', 'gbdt'],
                       'n_estimators': [4, 6],
                       'reg_alpha': [0.01, 0.005]}
        fit_params = {'verbose': False,
                      'eval_set': [(X_test, y_test)],
                      'eval_metric': constant_metric,
                      'early_stopping_rounds': 2}
        grid = GridSearchCV(lgb.LGBMClassifier(**params), grid_params, cv=2)
>       grid.fit(X, y, **fit_params)
../tests/python_package_test/test_sklearn.py:164: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/model_selection/_search.py:652: in fit
    fit_params_values = indexable(*fit_params.values())
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/validation.py:237: in indexable
    check_consistent_length(*result)
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/validation.py:208: in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/validation.py:208: in <listcomp>
    lengths = [_num_samples(X) for X in arrays if X is not None]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
x = array(False)
    def _num_samples(x):
        """Return number of samples in array-like x."""
        message = 'Expected sequence or array-like, got %s' % type(x)
        if hasattr(x, 'fit') and callable(x.fit):
            # Don't get num_samples from an ensembles length!
            raise TypeError(message)
    
        if not hasattr(x, '__len__') and not hasattr(x, 'shape'):
            if hasattr(x, '__array__'):
                x = np.asarray(x)
            else:
                raise TypeError(message)
    
        if hasattr(x, 'shape') and x.shape is not None:
            if len(x.shape) == 0:
                raise TypeError("Singleton array %r cannot be considered"
>                               " a valid collection." % x)
E               TypeError: Singleton array array(False) cannot be considered a valid collection.
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/validation.py:152: TypeError
_____________________ TestSklearn.test_sklearn_integration _____________________
self = <test_sklearn.TestSklearn testMethod=test_sklearn_integration>
    @unittest.skipIf(sk_version < '0.19.0', 'scikit-learn version is less than 0.19')
    def test_sklearn_integration(self):
        # we cannot use `check_estimator` directly since there is no skip test mechanism
        for name, estimator in ((lgb.sklearn.LGBMClassifier.__name__, lgb.sklearn.LGBMClassifier),
                                (lgb.sklearn.LGBMRegressor.__name__, lgb.sklearn.LGBMRegressor)):
            check_parameters_default_constructible(name, estimator)
        # we cannot leave default params (see https://github.com/microsoft/LightGBM/issues/833)
            estimator = estimator(min_child_samples=1, min_data_in_bin=1)
            for check in _yield_all_checks(name, estimator):
                check_name = check.func.__name__ if hasattr(check, 'func') else check.__name__
                if check_name == 'check_estimators_nan_inf':
                    continue  # skip test because LightGBM deals with nan
                try:
>                   check(name, estimator)
../tests/python_package_test/test_sklearn.py:249: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/_testing.py:327: in wrapper
    return fn(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
name = 'LGBMClassifier'
estimator_orig = LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', ..., reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)
    @ignore_warnings(category=FutureWarning)
    def check_no_attributes_set_in_init(name, estimator_orig):
        """Check setting during init. """
        estimator = clone(estimator_orig)
        if hasattr(type(estimator).__init__, "deprecated_original"):
            return
    
        init_params = _get_args(type(estimator).__init__)
        if IS_PYPY:
            # __init__ signature has additional objects in PyPy
            for key in ['obj']:
                if key in init_params:
                    init_params.remove(key)
        parents_init_params = [param for params_parent in
                               (_get_args(parent) for parent in
                                type(estimator).__mro__)
                               for param in params_parent]
    
        # Test for no setting apart from parameters during init
        invalid_attr = (set(vars(estimator)) - set(init_params)
                        - set(parents_init_params))
        assert not invalid_attr, (
                "Estimator %s should not set any attribute apart"
                " from parameters during init. Found attributes %s."
>               % (name, sorted(invalid_attr)))
E       AssertionError: Estimator LGBMClassifier should not set any attribute apart from parameters during init. Found attributes ['_Booster', '_best_iteration', '_best_score', '_class_map', '_class_weight', '_classes', '_evals_result', '_n_classes', '_n_features', '_objective', '_other_params', 'min_data_in_bin'].
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/estimator_checks.py:2413: AssertionError

The first failing test says that we are not allowed to pass any optional arguments to the fit method, except sample_weight. There is an issue in the scikit-learn repo which, once fixed, should restore the possibility of passing scalar values in 0.22.1: scikit-learn/scikit-learn#15805. However, it will not solve our problem, because we have a lot of non-scalar arguments, e.g. eval_set, eval_names, eval_init_score, etc. Speaking of eval_set, they are considering adding a mechanism for custom validation data in scikit-learn, but haven't designed an API for it yet: scikit-learn/scikit-learn#15127 (comment). So I don't think we should expect a public API release soon.
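
The failure above can be reproduced in isolation (a minimal sketch using sklearn's public validation helper): GridSearchCV.fit in 0.22.0 ran every value from **fit_params through the same length validation as data arrays, so a scalar such as verbose=False becomes a 0-d array and is rejected.

import numpy as np
from sklearn.utils import check_consistent_length

# A scalar fit param such as verbose=False is converted to a 0-d array;
# this is exactly the check that fails in the traceback above:
check_consistent_length(np.asarray(False))
# TypeError: Singleton array array(False) cannot be considered a valid collection.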

The second failing test indicates that an estimator "should not set any attribute apart from parameters during init". It seems this means we cannot use kwargs anymore.
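
For context, this is the shape check_estimator expects (a hedged sketch, not our actual wrapper): __init__ stores constructor parameters verbatim and nothing else, while all derived state such as _Booster or _evals_result is created only in fit.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class CompliantClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical skeleton of the pattern the check enforces."""

    def __init__(self, n_estimators=100, learning_rate=0.1):
        # Only constructor parameters, stored unchanged -- no self._booster,
        # self._evals_result, etc. That is exactly what the check flags.
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate

    def fit(self, X, y):
        # All other state is created here, after seeing the data.
        self.classes_ = np.unique(y)
        self.booster_ = object()  # placeholder for a trained model
        return self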

Moreover, it turned out recently that passing the check_estimator test doesn't mean that an estimator is "fully compatible" with the scikit-learn library: scikit-learn/scikit-learn#15392 (comment). In effect, this means we have no formal indicator that our estimators work fine with tools from scikit-learn.

Given that our estimators were compatible with scikit-learn < 0.22, both in terms of passing its public API test and in practice (we have no reproducible open issues related to the sklearn wrapper in our repo at present), I think the only option we have for now is to document here that the last supported version of scikit-learn is 0.21.3 and to prohibit more recent versions by raising a fatal error here. Then we should wait for a stable and more complete version of the check_estimator test and rewrite our wrapper from scratch, preserving our current features (if that is possible at all). For now I don't think we have the resources to rewrite our wrapper completely after every new scikit-learn release, which can potentially break everything either by introducing new checks in check_estimator without any deprecation/future warnings (the second current failing test), or even without any new checks in check_estimator but via new checks directly inside a tool (the first current failing test). Even if we do that, we have no guarantee that we are "fully compatible" with a new version, and we will have to check compatibility by hand, testing every tool from scikit-learn with our estimators in different scenarios and waiting for issue reports from users.

Maybe we should ask for a help or advice from scikit-learn team as it was kindly suggested here: scikit-learn/scikit-learn#15392 (comment). IDK...

@guolinke, @chivee, @Laurae2, @jameslamb, @henry0312

@guolinke (Collaborator)

Is there any reason why they broke compatibility?

@StrikerRUS (Collaborator, Author) commented Dec 11, 2019

Unfortunately, I have no info.

According to scikit-learn/scikit-learn#15805 (comment), it seems that "optional data-dependent parameters" actually means "optional parameters should be indexable and have a length equal to the number of data samples".

https://github.com/scikit-learn/scikit-learn/blob/4256542b10bcc6bd10ddf31dfb356ecc084f2e7d/sklearn/utils/validation.py#L215-L238

And "Also it is expected that parameters with trailing _ are not to be set inside the __init__ method." should be read as "Also it is expected that parameters with trailing and leading _ are not to be set inside the __init__ method."

@guolinke (Collaborator)

@StrikerRUS without eval sets, how does sklearn do early stopping?
I think a solution may be to pass all parameters in __init__, with a callback function to partition the data,
and then only use X, y, weight, group, etc. in the fit function?
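
A toy sketch of that proposal (illustration only, not LightGBM code; all names hypothetical): the partition logic becomes a constructor parameter, so fit receives nothing but the data. The pickling concern with storing callables is raised in the next comment.

from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split

class SplitterEstimator(BaseEstimator):
    """Hypothetical estimator taking the validation split as a callback."""

    def __init__(self, validation_splitter=None):
        self.validation_splitter = validation_splitter

    def fit(self, X, y):
        if self.validation_splitter is not None:
            X_train, X_valid, y_train, y_valid = self.validation_splitter(X, y)
        else:
            X_train, y_train = X, y
        # ... train with early stopping on (X_valid, y_valid) ...
        return self

def holdout(X, y):
    return train_test_split(X, y, test_size=0.1, random_state=42)

est = SplitterEstimator(validation_splitter=holdout)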

@StrikerRUS (Collaborator, Author)

@guolinke

without eval sets, how does sklearn do early stopping?

Via a validation-data-fraction parameter and train_test_split internally: scikit-learn/scikit-learn#15127, scikit-learn/scikit-learn#14303 (comment).

I think a solution may be to pass all parameters in __init__, with a callback function to partition the data,

Not a fair replacement for the current behavior, I think. What if the validation data comes from a different file, for example? Then it would not be a function that partitions the data, but a more general function that returns some data. However, passing (actually storing) functions is dangerous due to pickle/joblib: #2189.

But hold on - seems they want to revert this check completely for now in 0.22.1: scikit-learn/scikit-learn#15805 (comment).

Speaking generally about the sklearn wrapper design, I think we should wait at least until their HistGradientBoosting estimators leave the beta stage. They will probably face the same problems as we do, because they want to mimic LightGBM (mostly)/XGBoost/CatBoost behavior: scikit-learn/scikit-learn#15127, scikit-learn/scikit-learn#15841, scikit-learn/scikit-learn#14830, ...

and then only use X, y, weight, group, etc. in the fit function?

In the general case, len(group) != X.shape[0], I think.
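
To illustrate (a minimal sketch): in learning-to-rank, group holds per-query sizes whose sum equals the number of samples, so it is aligned with X by sum rather than by length and cannot pass a len(group) == n_samples style check.

import numpy as np

X = np.random.rand(6, 4)
group = np.array([2, 3, 1])       # sizes of three queries over the 6 samples

assert group.sum() == X.shape[0]  # aligned by sum...
assert len(group) != X.shape[0]   # ...but not by length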

@adrinjalali

Yeah we'll be fixing this issue in the next minor release.

Conventionally, we expect the fit params to be sample aligned, and that's why the recent change wasn't considered backward incompatible. We know it has its limitations, and we're working on an API to fix those issues in a better way, but that'll take a while probably.

@StrikerRUS (Collaborator, Author) commented Dec 15, 2019

@adrinjalali Thank you very much for heads-up!

However, I strongly believe that a library's public integration API cannot be developed by internal "convention"; it should be documented in an appropriate place ahead of the actual changes. Unfortunately, third-party developers cannot follow all the discussions happening in the repo's issues, and the only way for them to get familiar with API conventions and agreements is public docs and warnings. Also, it would be very cool to introduce breaking checks first in the check_estimator test and only after that in a tool itself, so that third-party libraries have some time to prepare and a test to validate changes.

BTW, is it possible to revert another breaking change in 0.22.1 for the same reason (refer to #2628 (comment))?

@StrikerRUS (Collaborator, Author)

Although we have not decided anything yet, we cannot let failing CI tests and nightly releases paralyze our development process. So we are simply prohibiting newer scikit-learn versions for now: #2637.

@glemaitre (Contributor) commented Dec 24, 2019

Be aware that we are going to revert the behaviour in scikit-learn/scikit-learn#15863, which will be back-ported to 0.22.1. In your setup.py, you could add scikit-learn != 0.22.0.

EDIT: Here I am referring to passing scalar values in fit_params.
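
A minimal sketch of that exclusion pin in a setup.py (hypothetical package metadata, only the requirement line is the point):

from setuptools import setup

setup(
    name='lightgbm',
    install_requires=[
        'scikit-learn!=0.22.0',  # exclude only the broken release
    ],
)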

@ogrisel commented Dec 24, 2019

BTW, is it possible to revert another breaking change in 0.22.1 for the same reason (refer to #2628 (comment))?

Which change specifically?

@ogrisel commented Dec 24, 2019

BTW the fact that fit params should be data-dependent is not new. Early stopping parameters (early_stopping_rounds, eval_metric) should be passed as constructor parameters, not fit parameters. This was already documented as such a while ago, for instance in the documentation of scikit-learn 0.18:

Depending on the nature of the algorithm, fit can sometimes also accept additional keywords arguments. However, any parameter that can have a value assigned prior to having access to the data should be an __init__ keyword argument. fit parameters should be restricted to directly data dependent variables. For instance a Gram matrix or an affinity matrix which are precomputed from the data matrix X are data dependent. A tolerance stopping criterion tol is not directly data dependent (although the optimal value according to some scoring function probably is).

For the validation set I am not so sure. In the next major version of scikit-learn we will probably work on generalizing a consistent API for early stopping, and we will have to discuss the proper way to pass a pre-sampled validation set.
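
For comparison, scikit-learn's own histogram-based GBM already follows the quoted convention: early-stopping settings are constructor parameters and fit receives only the data. A sketch (on scikit-learn 0.21/0.22 the class additionally required the experimental enable import):

from sklearn.datasets import load_iris

# On scikit-learn 0.21/0.22 this class also required:
# from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = load_iris(return_X_y=True)
clf = HistGradientBoostingClassifier(
    max_iter=100,
    validation_fraction=0.1,  # internal holdout instead of an explicit eval_set
    n_iter_no_change=5,       # early-stopping patience
)
clf.fit(X, y)  # no early-stopping arguments passed to fit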

@StrikerRUS (Collaborator, Author)

@glemaitre This is very awesome! Thanks a lot for heads-up!

@ogrisel

Which change specifically?

The one that prohibits setting private fields in __init__:

And "Also it is expected that parameters with trailing _ are not to be set inside the __init__ method." should be read as "Also it is expected that parameters with trailing and leading _ are not to be set inside the __init__ method."
#2628 (comment)

BTW the fact that fit params should be data-dependent is not new.

Agree! However, "data dependent variables" is not a very intuitive term to describe such expected behaviour. Maybe it's better to paraphrase it as "data aligned variables" to avoid confusion? Or, best of all, say explicitly what you mean, for example: "data aligned variables, which should be indexable and have a length equal to the number of data samples in the training set".

For the validation set I am not so sure. In the next major version of scikit-learn we will probably work on generalizing a consistent API for early stopping, and we will have to discuss the proper way to pass a pre-sampled validation set.

Another awesome news!

@ogrisel commented Dec 24, 2019

Maybe it's better to paraphrase it as "data aligned variables" to avoid confusion? Or, best of all, say explicitly what you mean, for example: "data aligned variables, which should be indexable and have a length equal to the number of data samples in the training set".

We are still not fully decided about this requirement. It will probably be refined soon.

and leading _ are not to be set inside the __init__ method.

This is being rolled-back in 0.22.1: scikit-learn/scikit-learn#15947

@StrikerRUS (Collaborator, Author)

@ogrisel Great news! Looking forward to the new version!

This is being rolled-back in 0.22.1: scikit-learn/scikit-learn#15947

But I do not see any diffs for
https://github.com/scikit-learn/scikit-learn/blob/9408203ac43f69f51ec068d4ae7a721761e94c9d/sklearn/utils/estimator_checks.py#L2387
which causes the incompatibility, in scikit-learn/scikit-learn#15947. Am I missing something?

@ogrisel commented Jan 2, 2020

scikit-learn 0.22.1 is out. This restores the previous behavior.

@ogrisel commented Jan 2, 2020

Hum, I just saw the last comment. check_estimator was left unchanged. I think we could allow setting private attributes in __init__ as long as those attribute values do not depend on any constructor parameters and the constructor parameters are stored unchanged.

which causes the incompatibility, in scikit-learn/scikit-learn#15947.

This is not incompatible. scikit-learn/scikit-learn#15947 makes it possible to customize the way an estimator checks that it has been fitted when calling predict / transform.

@StrikerRUS (Collaborator, Author)

@ogrisel Great! Thank you very much for keeping us with the latest news!

This is not incompatible. scikit-learn/scikit-learn#15947 makes it possible to customize the way an estimator checks that it has been fitted when calling predict / transform.

Sorry, but I'm afraid I haven't fully understood this. Does it mean that with version 0.22.1 the following code is allowed in __init__ and will not cause check_estimator to raise errors?

self._Booster = None
self._evals_result = None
self._best_score = None

I'm asking because the new version is not yet available on conda, from which our CI routines get packages.

@ogrisel commented Jan 3, 2020

It's true that setting (fixed) private attributes in __init__ is not explicitly forbidden in our documentation: https://scikit-learn.org/dev/developers/develop.html#instantiation, but this check has been enforced for all scikit-learn estimators in check_estimator for a long time.

What has changed in scikit-learn/scikit-learn#14511 is that this check used to be run only when check_estimator was called on an estimator class rather than on an estimator instance. Now, it's always run.

Note that you can use check_estimator(lightgbm.LGBMClassifier(), generate_only=True) to return a generator of (estimator, check_function) pairs that you can introspect to decide whether you want to ignore a particular check.
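
A sketch of how that could be used to skip individual checks (assuming scikit-learn 0.22, where generate_only is available; the skip set is illustrative):

import lightgbm as lgb
from sklearn.utils.estimator_checks import check_estimator

SKIPPED = {'check_no_attributes_set_in_init', 'check_estimators_nan_inf'}

for estimator, check in check_estimator(lgb.LGBMClassifier(), generate_only=True):
    # checks are usually functools.partial objects, hence the unwrapping
    name = check.func.__name__ if hasattr(check, 'func') else check.__name__
    if name in SKIPPED:
        continue  # deliberate deviations, as discussed in this thread
    check(estimator)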

@StrikerRUS (Collaborator, Author)

@ogrisel Ah, got it! Thanks a lot for the detailed info!

@StrikerRUS (Collaborator, Author)

@ogrisel

It's true that setting (fixed) private attributes in __init__ is not explicitly forbidden in our documentation: https://scikit-learn.org/dev/developers/develop.html#instantiation, but this check has been enforced for all scikit-learn estimators in check_estimator for a long time.

Just curious, are there any plans to bring consistency between the docs and the public API tests for that aspect? Or what is the internal convention on private attributes in __init__?

@StrikerRUS (Collaborator, Author)

Gentle ping @ogrisel.
Can we expect any fixes in the next release, or will the prohibition on setting private attributes in __init__ be publicly documented? Or is it better to simply skip that confusing and controversial test for now?

@ogrisel commented Jan 27, 2020

@StrikerRUS I opened scikit-learn/scikit-learn#16241 to discuss this.

@StrikerRUS (Collaborator, Author)

Thank you, @ogrisel !

@rth commented Mar 26, 2020

Can we expect any fixes in the next release, or will the prohibition on setting private attributes in __init__ be publicly documented? Or is it better to simply skip that confusing and controversial test for now?

To give a status update on this, we are still waiting for a contributor to propose a fix in scikit-learn/scikit-learn#16241 (we have somewhat limited resources atm). Bear in mind that common checks in scikit-learn are by no means perfect, and while we are working to improve them, occasionally they will yield false positives: scikit-learn/scikit-learn#6715. They have been made more modular with parametrize_with_checks and estimator tags lately, but some issues remain.

In this case, I think the right thing is to skip that test and indicate scikit-learn != 0.22.0 in the requirements. As mentioned above, the only thing that changed for check_no_attributes_set_in_init in 0.22 is that it is run on instances as well as classes; it doesn't impact actual compatibility with scikit-learn. Likely the only thing that will happen to that test in scikit-learn is that it becomes easier to skip, so there is no reason to wait. Will try to propose a PR.

@StrikerRUS (Collaborator, Author) commented Mar 26, 2020

@rth

Thanks a lot for your feedback!

In this case, I think the right thing is to skip that test and indicate scikit-learn != 0.22.0 in requirements. ... Will try to propose a PR.

Yeah, we've been waiting for an answer to exactly this question 🙂

Or is it better to simply skip that confusing and controversial test for now?

Don't waste your time! I'll prepare the changes and ping you when they're ready for review.

@rth commented Mar 26, 2020

Just made #2946 already :)

@StrikerRUS (Collaborator, Author)

Huh, nice timing!
