LGBMRanker query group setting when using gridsearchcv #3018

chouisgiser · 2020-04-23T16:28:58Z

#1137

Environment info

Operating System: Mac OS 10.15

CPU/GPU model: no

C++/Python/R version: python 3.7

LightGBM version or commit hash:

Error message

lightgbm.basic.LightGBMError: Sum of query counts is not same with #data

Reproducible examples

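import numpy as np
import lightgbm as lgb
from sklearn.metrics import make_scorer, ndcg_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut

# X_train, y_train, query_train, X_test, y_test and query_test come from the
# sample data in the lambdarank directory (loading code not shown).

estimator_params = {'boosting_type': 'gbdt',
                    'objective': 'lambdarank',
                    'min_child_samples': 5,
                    'importance_type': 'gain',
                    }

gbm = lgb.LGBMRanker(**estimator_params)

params_grid = {'n_estimators': [10, 20],
               'num_leaves': [10, 20],
               'max_depth': [10],
               'learning_rate': [0.1],
               }

# Expand the per-query sizes into one query id per row
cv_group_info = query_train.astype(int)
flatted_group = np.repeat(range(len(cv_group_info)), repeats=cv_group_info)

# Two identical splitters: one drives the CV, the other derives group sizes
logo = LeaveOneGroupOut()
cv = logo.split(X_train, y_train, groups=flatted_group)
cv_group = logo.split(X_train, groups=flatted_group)

grid = GridSearchCV(gbm, params_grid, cv=cv, verbose=2,
                    scoring=make_scorer(ndcg_score, greater_is_better=True), refit=False)

def group_gen(flatted_group, cv):
    # Yield the query sizes of each training fold
    for train, test in cv:
        yield np.unique(flatted_group[train], return_counts=True)[1]

gen = group_gen(flatted_group, cv_group)
params_fit = {
    'eval_set': [(X_test, y_test)],
    'eval_group': [query_test],
    'eval_metric': 'ndcg',
    'early_stopping_rounds': 100,
    'eval_at': [1, 2, 3],
}

# group receives the query sizes of the first training fold
grid.fit(X_train, y_train, group=next(gen), **params_fit)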

Steps to reproduce

The data is the sample data in the lambdarank directory.
Use LeaveOneGroupOut as the splitter.
When I use my own data, in which every group has the same query count, the code works. When I use data with different query counts in the query file, it reports the error above.
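As a rough illustration of what the error message is checking (a sketch, not LightGBM's actual code): the query sizes passed as group are expected to add up to the number of rows being fitted. The helper name `check_group_sizes` and the numbers below are made up for this example.

```python
def check_group_sizes(group_sizes, n_rows):
    """Return True when the query sizes account for every row exactly once."""
    return sum(group_sizes) == n_rows


# Hypothetical fold with 31 rows split across three queries.
print(check_group_sizes([10, 11, 10], 31))  # True
# The same sizes reused on a fold that only has 30 rows.
print(check_group_sizes([10, 11, 10], 30))  # False
```

Running a check like this on each training fold before calling fit may help narrow down which split triggers the mismatch.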

guolinke · 2020-04-25T04:53:35Z

ping @StrikerRUS for the #1137 (comment)

StrikerRUS · 2020-04-27T12:12:52Z

Ah, it's a pity that the workaround no longer works.

Maybe the cv and cv_group generators produce different indices for some reason?

Generally speaking, scikit-learn doesn't have any (ranking) estimators that allow passing an additional group argument to the fit function (at least, I'm not aware of any, but will be glad to be mistaken). So that old dirty workaround cannot work very well.

Since the scikit-learn team plans to move some old and new parameters into the fit method, it would be good to create a feature request asking them to include a group parameter in those plans as well.

Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__.
#2966 (comment)
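One way around the missing group support in scikit-learn's search tools is to drive the folds by hand and recompute the query sizes for each training fold. The sketch below only shows the splitting and size bookkeeping in plain Python. The function name `leave_one_query_out` is hypothetical, and it assumes rows are stored contiguously per query in the order given by the sizes.

```python
from itertools import groupby


def leave_one_query_out(query_sizes):
    """Yield (train_rows, train_sizes, test_rows, test_sizes) per held-out query.

    Assumes rows are stored contiguously per query, in the order of query_sizes.
    """
    # Per-row query id, e.g. sizes [2, 1] -> [0, 0, 1]
    query_of_row = [q for q, size in enumerate(query_sizes) for _ in range(size)]
    for held_out in range(len(query_sizes)):
        train_rows = [r for r, q in enumerate(query_of_row) if q != held_out]
        test_rows = [r for r, q in enumerate(query_of_row) if q == held_out]
        # Recompute sizes from the rows that actually landed in this fold
        train_sizes = [
            len(list(rows))
            for _, rows in groupby(train_rows, key=query_of_row.__getitem__)
        ]
        yield train_rows, train_sizes, test_rows, [len(test_rows)]


for train_rows, train_sizes, test_rows, test_sizes in leave_one_query_out([3, 2, 4]):
    # A real loop would fit a ranker on the training rows here, passing
    # group=train_sizes, and score it on the held-out rows.
    print(train_sizes, test_sizes)
```

Because the sizes are derived from the rows of each fold, their sum always matches the number of training rows in that fold. Looping over a parameter grid around this generator would stand in for GridSearchCV.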

chouisgiser · 2020-05-03T09:32:07Z

That is a pity. Thanks for the guidance all the time.

lowjiajin · 2020-11-23T17:10:03Z

Can we reopen this issue? I'm also encountering the problem where there's no way to input different group= values for different splits within the CV. The example - which borrows from @StrikerRUS' example snippet here - doesn't actually work, because next(gen) is simply executed once, and all the remaining group values within the iterator are never yielded.
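A toy illustration of that point, using a stand-in for the search's fit method rather than the real GridSearchCV: the expression next(gen) is evaluated once, before the call, so every fold sees the same value. `fake_search_fit` and the fold sizes are invented for this sketch.

```python
def group_gen(fold_sizes):
    """Yield one list of query sizes per CV fold (toy version of the generator)."""
    for sizes in fold_sizes:
        yield sizes


received = []


def fake_search_fit(n_folds, group):
    """Hypothetical stand-in for a search's fit: reuses `group` for every fold."""
    for _ in range(n_folds):
        received.append(group)


gen = group_gen([[10, 11, 10], [10, 9, 11]])
fake_search_fit(2, group=next(gen))  # next(gen) runs once, before the call

print(received)   # [[10, 11, 10], [10, 11, 10]]
print(list(gen))  # [[10, 9, 11]] was never handed to a fit
```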

For example, if the train set of the 1st round of CV has the following group sizes: [10, 11, 10], but the 2nd round of CV has these group sizes: [10, 9, 11], the second fit is still executed with [10, 11, 10]. This will result in the stated error, since 31 != 30.

There is also the possibility of silent errors, where the sum of the group sizes might be the same, while their order differs. E.g. if the group size in the second round of CV was hypothetically [10, 10, 11] instead. In that case, the 21st element would have been mislabelled as being in the second group instead of the third, since fit is called with [10, 11, 10] (the first invocation of the gen generator).
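The silent case can be checked with a few lines of plain Python. This sketch expands each list of sizes into a per-row query index and compares the two; `rows_to_query` is a hypothetical helper, and the sizes are the ones from the example above.

```python
def rows_to_query(group_sizes):
    """Expand query sizes into a per-row query index, e.g. [2, 1] -> [0, 0, 1]."""
    labels = []
    for query_index, size in enumerate(group_sizes):
        labels.extend([query_index] * size)
    return labels


stale = rows_to_query([10, 11, 10])   # sizes from the first fold, reused
actual = rows_to_query([10, 10, 11])  # hypothetical true sizes of the second fold

mislabelled = [row for row, (s, a) in enumerate(zip(stale, actual)) if s != a]

print(len(stale) == len(actual))  # True: the sums match, so no error is raised
print(mislabelled)                # [20], i.e. the 21st row (0-based index 20)
print(stale[20], actual[20])      # 1 2: second query instead of third
```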

There are similar unanswered questions on Stack Overflow too: https://stackoverflow.com/questions/64905119/hyperparameter-optimization-with-lgbmranker

StrikerRUS · 2021-04-28T12:21:32Z

> I'm also encountering the problem where there's no way to input different group= values for different splits within the CV.

Please refer to

> Generally speaking, scikit-learn doesn't have any (ranking) estimators that allow to pass additional group argument into fit function
> #3018 (comment)

Deeper integration of LGBMRanker into the scikit-learn ecosystem can be discussed after the scikit-learn devs take some steps in this direction:

> Basically, we're trying to move any parameter that is data-specific into fit, or at least out of __init__.
> #2966 (comment)

Right now LGBMRanker is not even tested for compatibility with scikit-learn.

> Scikit-learn doesn't have learning-to-rank applications. So there is no point to test LGBMRanker to be "compatible" with something that doesn't support ranking.
> #3894 (comment)

lowjiajin · 2021-04-29T00:33:42Z

@StrikerRUS would your recommendation be to not use the scikit-learn wrapper for LGBMRanker then? At the very least, it would be good for such limitations to be made known in the user docs, rather than have folks use the scikit-learn integration thinking that it works.

StrikerRUS · 2021-04-29T21:48:26Z

@lowjiajin

> would your recommendation be to not use the scikit-learn wrapper for LGBMRanker then?

No. LGBMRanker can be used to train and apply ranking models in an easy, user-friendly, sklearn-like way.

> At the very least, it would be good for such limitations to be made known in the user docs

Good idea! I'll add a note that scikit-learn doesn't support ranking, and that this class is therefore not integrated with other sklearn tools.

lowjiajin · 2021-04-30T02:08:41Z

Thanks @StrikerRUS! I’m sure the docs change would be much appreciated to newcomers like me 🙂

StrikerRUS · 2021-04-30T14:49:23Z

@lowjiajin Done in #4243. Thanks for the proposal!

github-actions · 2023-08-23T14:38:47Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

chouisgiser closed this as completed May 3, 2020

jameslamb reopened this Nov 23, 2020

StrikerRUS closed this as completed Apr 28, 2021

StrikerRUS mentioned this issue Apr 30, 2021

[docs][python][scikit-learn] added note for LGBMRanker #4243

Merged

StrikerRUS mentioned this issue Oct 9, 2021

[LightGBM] [Fatal] Sum of query counts is not same with #data #4664

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
