How to set delta_metric to identity in pairwise objective #11261
Comments
If you are referring to the RankNet loss, then simply …
If that is the case, then I can't explain the performance drop I'm seeing in version 3.0.0 compared to 1.7.8. I'm using these parameters to try to reproduce the behaviour of 1.7.8 in 3.0.0 (full query optimization by rank without normalization and the pairing method 'mean', as the metric is Spearman correlation):
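Roughly this shape, as a minimal sketch rather than my exact script; `dtrain` stands for an xgb.DMatrix with the query/group information already set, and all numeric values are only illustrative:

```r
library(xgboost)

params <- list(
  objective = "rank:pairwise",           # RankNet-style pairwise loss
  lambdarank_pair_method = "mean",       # sample pairs across the whole query
  lambdarank_num_pair_per_sample = 200,  # illustrative value
  lambdarank_normalization = FALSE,      # try to switch off the new normalization
  eta = 0.05,                            # illustrative general parameters
  max_depth = 6
)

bst <- xgb.train(params = params, data = dtrain, nrounds = 1000)
```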
I'll try to upload a portion of the data as a dataset on Kaggle along with code to reproduce the issue.
I have trained the same model with the same data using versions 1.7.8 and 3.0.0. The only changes in the code are the parameters for version 3.0.0:
The learning with version 1.7.8 is stable and reaches a correlation of 0.0317 at 13K trees:
The result with version 3.0.0 does not improve after a few trees:
This happens with different dataset subsets, different sets of predictive variables, different general parameters (depth, colsample_bytree, lambda...)
Training with the number of pairs set to 10, the behavior of 3.0.0 is similar; it quickly starts to degrade:
I'm not a C++ coder and following the differences between 1.7.8 and 3.0.0 is hard for me, as it is a whole new refactoring, but I think something is different that causes this behavior.
OK, I get it now. The RankNet loss has no delta metric (it is 1.0), but the 1.0 is normalized by the ranking score difference, which is undesirable.
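In symbols, roughly (notation mine; the exact form in the code may differ slightly): for a pair $(i, j)$ the RankNet part of the gradient gets weighted by

$$w_{ij} \;\approx\; \frac{\Delta_{ij}}{\max\bigl(\varepsilon,\; |s_i - s_j|\bigr)}, \qquad \Delta_{ij} = 1 \text{ for RankNet},$$

so as the score gap $|s_i - s_j|$ grows, the effective pair weight shrinks; that is the normalization by the ranking score difference mentioned above.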
I've seen that you've disabled pair normalization in pairwise. This is the behavior of version 1.7.
Thank you for pointing it out. It's made to prevent overfitting by "smoothing" things out. Sometimes it can make training stagnate, as in your example. There are some other similar operations like the one employed by … At the time of reworking the ranking, I did some tuning on some popular benchmarking datasets like MSLR and found the normalization useful. On the other hand, as noted in https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html#obtaining-good-result, I also find it quite difficult to get good results. Your experiments are welcome, and feel free to make suggestions! I will try to make it an option instead.
@trivialfis This seems fixed with the PR, thank you!
About normalization, I'm thinking of something like: … Thank you again!
I've tested the PR with the 3.0.0 CPU version and it is fine, but when I try to use it in a version compiled with GPU support, it doesn't work.
Could you please elaborate on "it doesn't work"?
3.0.0 with the PR, compiled as the CPU version, works fine and fixes the training stagnation.
I will try to reproduce the issue; I have been using the GPU build by default and haven't seen it yet.
@jaguerrerod Could you please provide a snippet that I can run? Note: disabling the parameter can hurt both training and validation results for MSLR. |
The dataset is very big and not public. I'll prepare an obfuscated subset and share it in a Kaggle repository with scripts.
No worries. If some scripts cannot be made public, feel free to reach out with email.
I've uploaded a dataset to Kaggle Datasets: https://www.kaggle.com/datasets/blindape/noisy-data-for-xgboost-pairwise-testing
Version 3.0.0.1:
Output
Training quickly overfits the training data and doesn't improve on the hold-out data. Version 1.7.8.1:
Output
Training doesn't overfit the training data and improves on the hold-out data.
The problem is that the new parameter (lambdarank_diff_normalization) is named lambdarank_score_normalization in xgb.train: xgboost/R-package/R/xgb.train.R Lines 793 to 855 in 4bfd4bf
Setting this parameter to FALSE:
The problem of stagnation is fixed.
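For reference, the change is just this one parameter, as a sketch; it reuses the `params` list and `dtrain` from the sketch earlier in the thread, everything else as in the shared scripts:

```r
# lambdarank_score_normalization is the R-side name of the new parameter
# discussed above; setting it to FALSE restores the 1.7-like behaviour here.
params$lambdarank_score_normalization <- FALSE
bst <- xgb.train(params = params, data = dtrain, nrounds = 1000)
```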
This really depends on the dataset and the specific set of parameters.
Thank you for sharing. Is this true even with the documented set of parameters for reproducing 1.7?
1.7.8.1 is faster (13K trees in 5h 30min vs 9K with 3.0.0.1). Changes with respect to the previous scripts: leave a gap of 5 queries, and I'm using optimal parameters for this kind of noisy data.
3.0.0.1
1.7.8.1
@trivialfis xgboost/src/objective/rank_obj.cu Lines 875 to 893 in 36eb41c

```cpp
// get lambda weight for the pairs
LambdaWeightComputerT::GetLambdaWeight(lst, &pairs);
// rescale each gradient and hessian so that the lst have constant weighted
float scale = 1.0f / param_.num_pairsample;
if (param_.fix_list_weight != 0.0f) {
  scale *= param_.fix_list_weight / (gptr[k + 1] - gptr[k]);
}
for (auto & pair : pairs) {
  const ListEntry &pos = lst[pair.pos_index];
  const ListEntry &neg = lst[pair.neg_index];
  const bst_float w = pair.weight * scale;
  const float eps = 1e-16f;
  bst_float p = common::Sigmoid(pos.pred - neg.pred);
  bst_float g = p - 1.0f;
  bst_float h = std::max(p * (1.0f - p), eps);
  // accumulate gradient and hessian in both pid, and nid
  gpair[pos.rindex] += GradientPair(g * w, 2.0f*w*h);
  gpair[neg.rindex] += GradientPair(-g * w, 2.0f*w*h);
}
```

There are two normalizations: one by num_pairsample and another one that is optional, depending on the fix_list_weight parameter. Together they form the scale applied when accumulating:

```cpp
// accumulate gradient and hessian in both pid, and nid
gpair[pos.rindex] += GradientPair(g * w, 2.0f*w*h);
gpair[neg.rindex] += GradientPair(-g * w, 2.0f*w*h);
```

I think it isn't enough, and the normalization of gradients should be done in another way. In 3.0.0 I didn't find where the normalization by num_pairsample is done (like computing scale = 1.0f / param_.num_pairsample in 1.7). The other issue is that 3.0.0 takes significantly more time to fit the model; I read that in some cases you need to do two passes to compute the gradients, and that it is related to some regularizations.
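Writing out what the quoted 1.7 code does, in my own notation, each sampled pair $(i^{+}, i^{-})$ contributes:

$$\text{scale} = \frac{1}{\texttt{num\_pairsample}} \times \begin{cases} \texttt{fix\_list\_weight} / |\text{group}| & \text{if } \texttt{fix\_list\_weight} \neq 0 \\ 1 & \text{otherwise,} \end{cases} \qquad w = \text{pair.weight} \cdot \text{scale},$$

$$p = \sigma\!\left(s^{+} - s^{-}\right), \qquad g^{+} \leftarrow g^{+} + w\,(p - 1), \qquad g^{-} \leftarrow g^{-} - w\,(p - 1), \qquad h^{\pm} \leftarrow h^{\pm} + 2\,w\,\max\!\bigl(p(1 - p),\, \varepsilon\bigr),$$

so the per-pair contribution is already divided by the number of sampled pairs.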
This is likely caused by XGBoost building deeper trees at the beginning due to the new optimal intercept. This has been observed in the past (#9452 (comment)). We can't compare the speed difference when training different models with different starting points.
The cost of calculating the objective is relatively small. It would be surprising if it contributed more than 5% of overall time, so don't worry about it.
Normalizing by the number of pairs is possible in theory. We have another normalization that uses the gradient instead of the discrete number of pairs: xgboost/src/objective/lambdarank_obj.cc Line 229 in 8fb2468
I think LightGBM uses top-k instead of random sampling, but I could be wrong. Do you have a reference? We also discard pairs with the same label. xgboost/src/objective/lambdarank_obj.cu Line 104 in 8fb2468
xgboost/src/objective/lambdarank_obj.cc Line 189 in 8fb2468
Looking again at your example parameter:
Could you please consider setting the …
I think we should focus on the main issue (model accuracy) first and on the rest (fitting time, options...) afterwards. Right, LightGBM doesn't sample, only top_k, but they enumerate all pairs and skip pairs with the same label here: … If you use: xgboost/src/objective/lambdarank_obj.cu Line 104 in 8fb2468
xgboost/src/objective/lambdarank_obj.cc Line 189 in 8fb2468
why is it needed to find the bucket boundary for each label, each query, and each iteration? I think it may take a lot of time. xgboost/src/objective/lambdarank_obj.h Lines 248 to 262 in 4bfd4bf
I think precomputing buckets by query, or random sampling and checking whether the pair has different labels, is more efficient, but it isn't the major issue right now. About the normalization in: xgboost/src/objective/lambdarank_obj.cc Line 229 in 8fb2468
The justification is the one from LightGBM referenced in the code. I think that precisely in LTR it doesn't make sense, as the outputs are not averages of labels; in fact, in LTR the leaf values are invariant under monotone transformations of the labels, at least with the pairwise objective. xgboost/src/objective/lambdarank_obj.cc Lines 228 to 232 in 8fb2468
but in my example lambdarank_normalization = FALSE, so it doesn't affect this. I've tried setting base_score to 0.5:
versus without setting it:
It has no impact on accuracy or fitting time.
@trivialfis I've detected the problem. Fitting the model using these parameters (lambdarank_num_pair_per_sample = 1):
The output
Then I change lambdarank_num_pair_per_sample to 30:
See the train Spearman in both. With 30 pairs the model is overfitting the training set. For lambdarank_num_pair_per_sample = 1:
The cover of the root is 2255480, which is the number of observations in the training set. For lambdarank_num_pair_per_sample = 30:
the cover now is 30 times the number of observations. The gains are extremely high. Tree 0 has 1602 nodes, whereas with 1 pair it has 240 nodes. This is the cause of the fast overfitting and poor accuracy. Now with 1.7.8 and 30 pairs:
See how similar it is to 3.0.0 with only 1 pair.
The cover is again the number of observations in the training set, and the tree structure (number of nodes) is the same with 30 or 1 pairs. This is the problem. When accumulating gradients and hessians in 3.0.0, it is necessary to divide by the number of pairs as you do in 1.7.8 here (line 878): xgboost/src/objective/rank_obj.cu Lines 875 to 893 in 36eb41c
or preferably to divide by the number of pairs in which each observation is present (a counter array that is incremented each time an observation appears in the pos or neg position of a pair is all you need).
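In symbols (notation mine): the cover of a node is the sum of the hessians of the observations that reach it, so with $k$ pairs per sample each observation accumulates roughly $k$ hessian terms and the root cover becomes about $k \cdot N$ instead of $N$. What I'm proposing is

$$g_i \leftarrow \frac{1}{n_i} \sum_{j:\,(i,j)\text{ sampled}} g_{ij}, \qquad h_i \leftarrow \frac{1}{n_i} \sum_{j:\,(i,j)\text{ sampled}} h_{ij},$$

where $n_i$ is the number of pairs in which observation $i$ appears (the counter array mentioned above).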
Confirmed
With 3.0.0 and 200 pairs the max Spearman was 0.0283 at 9K trees (in 5h 24min). With the problem fixed by this workaround, the Spearman is similar to 1.7.8 in both the train and hold-out datasets:
Time is still slower than 1.7.8 (13K trees in 5h 37min): 38.6 trees per minute.
Okay, got it. Thank you for the very detailed diagnosis. I will look into it:
As for the two-pass reduction in the objective, it's done to make GPU computation deterministic (parallel floating-point summation). I added the comment about upper bounds in the code as a potential optimization in the future.
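A tiny illustration of why the summation order matters for determinism, nothing XGBoost-specific, just plain doubles:

```r
# IEEE-754 addition is not associative, so a parallel reduction whose order
# changes from run to run can produce slightly different totals.
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # FALSE
```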
Could you please help take a look at #11322? It's not normalized by the per-sample number of pairs yet; it's just restoring the old behaviour from 1.7. We can experiment with other approaches later.
Sure. I'll do it this evening.
Thank you! The PR replaces the gradient-based normalization you mentioned with the …
Feel free to suggest changes.
@trivialfis It seems the problem (fast learning due to an artificial increase of the gradients and hessians) persists with 3.1.0.0.
1.7.8.1
3.1.0.0
lambdarank_normalization needs to be true after the PR, as described previously.
I didn't realize lambdarank_normalization now must be TRUE.
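So the only change needed on my side was, as a sketch:

```r
# After the PR, lambdarank_normalization must be TRUE, as noted above.
params$lambdarank_normalization <- TRUE
```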
Glad that it's fixed!
delta_metric has been introduced with the version 2 refactoring:
xgboost/src/objective/lambdarank_obj.h
Lines 111 to 126 in 600be4d
with respect to the way gradients and hessians are computed in the 1.7 versions:
xgboost/src/objective/rank_obj.cu
Lines 876 to 892 in 36eb41c
Where is delta defined, and what is its default value in the master version?
My use cases don't need any delta function to overweight the top elements of each query, as I'm optimizing the Spearman correlation of the whole query.
I would like not to use a delta function to weight pairs.
Is it possible to disable it or set it to the identity function?
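For context, my rough understanding (an outsider's sketch, notation mine) of how delta enters the pair gradient:

$$\lambda_{ij} \;\propto\; \bigl|\Delta(i,j)\bigr|\,\sigma\!\bigl(-(s_i - s_j)\bigr),$$

where $\Delta(i,j)$ is the change in the target metric (e.g. NDCG) from swapping $i$ and $j$; an "identity" delta would simply mean $\Delta(i,j) \equiv 1$, i.e. the plain RankNet pair gradient.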