LGBMRanker query group setting when using gridsearchcv #3018

Closed
chouisgiser opened this issue Apr 23, 2020 · 10 comments

Comments

@chouisgiser

#1137

Environment info

Operating System: Mac OS 10.15

CPU/GPU model: no

C++/Python/R version: python 3.7

LightGBM version or commit hash:

Error message

lightgbm.basic.LightGBMError: Sum of query counts is not same with #data

Reproducible examples

import lightgbm as lgb
import numpy as np
from sklearn.metrics import make_scorer, ndcg_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

estimator_params = {'boosting_type': 'gbdt',
                    'objective': 'lambdarank',
                    'min_child_samples': 5,
                    'importance_type': 'gain',
                    }

gbm = lgb.LGBMRanker(**estimator_params)

params_grid = {'n_estimators': [10, 20],
               'num_leaves': [10, 20],
               'max_depth': [10],
               'learning_rate': [0.1],
               }

# Expand the per-query counts into one group id per row.
cv_group_info = query_train.astype(int)
flatted_group = np.repeat(range(len(cv_group_info)), repeats=cv_group_info)

# Two splits over the same groups: one for GridSearchCV, one for deriving group sizes.
logo = LeaveOneGroupOut()
cv = logo.split(X_train, y_train, groups=flatted_group)
cv_group = logo.split(X_train, groups=flatted_group)

grid = GridSearchCV(gbm, params_grid, cv=cv, verbose=2,
                    scoring=make_scorer(ndcg_score, greater_is_better=True), refit=False)

def group_gen(flatted_group, cv):
    # Yield the query counts of the training part of each fold.
    for train, test in cv:
        yield np.unique(flatted_group[train], return_counts=True)[1]

gen = group_gen(flatted_group, cv_group)
params_fit = {
    'eval_set': [(X_test, y_test)],
    'eval_group': [query_test],
    'eval_metric': 'ndcg',
    'early_stopping_rounds': 100,
    'eval_at': [1, 2, 3],
}

grid.fit(X_train, y_train, group=next(gen), **params_fit)

Steps to reproduce

  1. The data is the sample data from the lambdarank directory.
  2. Use LeaveOneGroupOut to build the CV splits.
  3. With my own data, in which every query has the same count, the code works. With data whose query counts differ in the query file, it reports the error.
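
One possible reading of step 3, sketched in plain Python rather than with LightGBM or scikit-learn: assuming the single group array passed to fit is reused unchanged for every leave-one-group-out fold, its sum matches each fold's training size only when all queries have the same count. The helpers loo_train_group_sizes and folds_matching_first and the sizes below are hypothetical, chosen only for illustration.

```python
def loo_train_group_sizes(group_sizes):
    """Return the training group sizes of each leave-one-group-out fold."""
    return [group_sizes[:i] + group_sizes[i + 1:] for i in range(len(group_sizes))]


def folds_matching_first(group_sizes):
    """Flag, per fold, whether the first fold's sizes sum to that fold's row count.

    Assumption: the group array built for the first fold is reused for all folds.
    """
    folds = loo_train_group_sizes(group_sizes)
    reused = folds[0]
    return [sum(reused) == sum(fold) for fold in folds]


equal_ok = folds_matching_first([5, 5, 5])    # every query has 5 rows
unequal_ok = folds_matching_first([5, 3, 7])  # query counts differ
```

With equal counts every fold passes the size check; with unequal counts only the first fold does, which would be consistent with the "Sum of query counts is not same with #data" error.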
@guolinke
Collaborator

ping @StrikerRUS for the #1137 (comment)

@StrikerRUS
Collaborator

Ah, it's a pity that the workaround no longer works.

Maybe the cv and cv_group generators produce different indices for some reason?

Generally speaking, scikit-learn doesn't have any (ranking) estimators that allow passing an additional group argument into the fit function (at least, I'm not aware of any, but I'd be glad to be proven wrong). So that old dirty workaround cannot be expected to work well.

Since the scikit-learn team plans to move some old and new parameters into the fit method, it would be good to create a feature request asking them to include a group parameter in those plans as well.

Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__.
#2966 (comment)

@chouisgiser
Author

That is a pity. Thanks for all the guidance.

@lowjiajin

lowjiajin commented Nov 23, 2020

Can we reopen this issue? I'm also encountering the problem where there's no way to input different group= values for different splits within the CV. The example - which borrows from @StrikerRUS' example snippet here - doesn't actually work, because next(gen) is simply executed once, and all the remaining group values within the iterator are never yielded.
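
The single evaluation of next(gen) can be illustrated without LightGBM. In this minimal sketch, the fold sizes and the fit_kwargs dictionary are hypothetical stand-ins for what the grid search receives; the point is only that an argument expression is evaluated once, before any fold is fitted.

```python
def group_gen(fold_group_sizes):
    """Yield the training group sizes of each CV fold, one fold at a time."""
    for sizes in fold_group_sizes:
        yield sizes


# Hypothetical training group sizes for two CV folds.
fold_group_sizes = [[10, 11, 10], [10, 9, 11]]
gen = group_gen(fold_group_sizes)

# next(gen) runs exactly once, while the keyword arguments are being built.
fit_kwargs = {"group": next(gen)}

# Every fold that reuses fit_kwargs therefore sees the first fold's sizes.
seen_by_folds = [fit_kwargs["group"] for _ in fold_group_sizes]

# The second fold's sizes are still sitting unused in the generator.
leftover = next(gen)
```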

For example, if the train set of the 1st round of CV has the following group sizes: [10, 11, 10], but the 2nd round of CV has these group sizes: [10, 9, 11], the second fit is still executed with [10, 11, 10]. This will result in the stated error, since 31 != 30.

There is also the possibility of silent errors, where the sum of the group sizes might be the same, while their order differs. E.g. if the group size in the second round of CV was hypothetically [10, 10, 11] instead. In that case, the 21st element would have been mislabelled as being in the second group instead of the third, since fit is called with [10, 11, 10] (the first invocation of the gen generator).
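
Both failure modes can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: query_index is a hypothetical helper that maps a row position to its query using cumulative group sizes, not something LightGBM exposes.

```python
from bisect import bisect_right
from itertools import accumulate


def query_index(position, group_sizes):
    """Return the 0-based query that a 0-based row position falls into."""
    boundaries = list(accumulate(group_sizes))  # exclusive end of each query
    return bisect_right(boundaries, position)


stale = [10, 11, 10]        # sizes from the first fold, reused for later folds
fold2 = [10, 9, 11]         # true sizes of the second fold
fold2_alt = [10, 10, 11]    # hypothetical second fold with the same total

# Loud failure: the totals differ, so the size check raises an error.
loud_mismatch = sum(stale) != sum(fold2)

# Silent failure: the totals agree, but the 21st row (position 20) moves.
totals_agree = sum(stale) == sum(fold2_alt)
true_query = query_index(20, fold2_alt)   # third query (index 2)
assumed_query = query_index(20, stale)    # second query (index 1)
```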

There are similar unanswered questions on Stack Overflow too: https://stackoverflow.com/questions/64905119/hyperparameter-optimization-with-lgbmranker

@jameslamb jameslamb reopened this Nov 23, 2020
@StrikerRUS
Collaborator

I'm also encountering the problem where there's no way to input different group= values for different splits within the CV.

Please refer to

Generally speaking, scikit-learn doesn't have any (ranking) estimators that allow to pass additional group argument into fit function
#3018 (comment)

Deeper LGBMRanker integration into scikit-learn ecosystem can be discussed after some steps from scikit-learn devs towards this

Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__.
#2966 (comment)

Right now LGBMRanker is not even tested for compatibility with scikit-learn.

Scikit-learn doesn't have learning-to-rank applications. So there is no point to test LGBMRanker to be "compatible" with something that doesn't support ranking.
#3894 (comment)

@lowjiajin

@StrikerRUS would your recommendation be to not use the scikit-learn wrapper for LGBMRanker then? At the very least, it would be good for such limitations to be made known in the user docs, rather than have folks use the scikit-learn integration thinking that it works.

@StrikerRUS
Collaborator

@lowjiajin

would your recommendation be to not use the scikit-learn wrapper for LGBMRanker then?

No. LGBMRanker can be used to train and apply ranking models in an easy, user-friendly, sklearn-like way.

At the very least, it would be good for such limitations to be made known in the user docs

Good idea! I'll add a note that scikit-learn doesn't support ranking, and that this class is therefore not integrated with the various sklearn tools.

@lowjiajin

Thanks @StrikerRUS! I’m sure the docs change would be much appreciated by newcomers like me 🙂

@StrikerRUS
Collaborator

@lowjiajin Done in #4243. Thanks for the proposal!

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023