How to set delta_metric to identity in pairwise objective #11261

Open
jaguerrerod opened this issue Feb 17, 2025 · 34 comments
@jaguerrerod

jaguerrerod commented Feb 17, 2025

delta_metric was introduced in the version 2 refactoring:

// Use double whenever possible as we are working on the exp space.
double delta_score = std::abs(s_high - s_low);
double const sigmoid = common::Sigmoid(s_high - s_low);
// Change in metric score like \delta NDCG or \delta MAP
double delta_metric = std::abs(delta(y_high, y_low, rank_high, rank_low));
if (best_score != worst_score) {
  delta_metric /= (delta_score + 0.01);
}
if (unbiased) {
  *p_cost = std::log(1.0 / (1.0 - sigmoid)) * delta_metric;
}
auto lambda_ij = (sigmoid - 1.0) * delta_metric;
auto hessian_ij = std::max(sigmoid * (1.0 - sigmoid), Eps64()) * delta_metric * 2.0;

compared with the way gradients and hessians were computed in the 1.7 versions:

LambdaWeightComputerT::GetLambdaWeight(lst, &pairs);
// rescale each gradient and hessian so that the lst have constant weighted
float scale = 1.0f / param_.num_pairsample;
if (param_.fix_list_weight != 0.0f) {
  scale *= param_.fix_list_weight / (gptr[k + 1] - gptr[k]);
}
for (auto & pair : pairs) {
  const ListEntry &pos = lst[pair.pos_index];
  const ListEntry &neg = lst[pair.neg_index];
  const bst_float w = pair.weight * scale;
  const float eps = 1e-16f;
  bst_float p = common::Sigmoid(pos.pred - neg.pred);
  bst_float g = p - 1.0f;
  bst_float h = std::max(p * (1.0f - p), eps);
  // accumulate gradient and hessian in both pid, and nid
  gpair[pos.rindex] += GradientPair(g * w, 2.0f*w*h);
  gpair[neg.rindex] += GradientPair(-g * w, 2.0f*w*h);
}

Where is delta defined, and what is its default value in the master version?

My use cases don't need any delta function to overweight the top elements of each query, as I'm optimizing the Spearman correlation of the whole query.
I would like not to use a delta function to weight pairs.
Is it possible to disable it or set it to the identity function?

@jaguerrerod jaguerrerod changed the title from "How to set delta_metric to identity in pairwise objetive" to "How to set delta_metric to identity in pairwise objective" Feb 17, 2025
@trivialfis
Member

If you are referring to the rank net loss, then simple rank:pairwise should suffice.

@jaguerrerod
Author

If that is the case, then I can't explain the performance drop I'm seeing in version 3.0.0 compared to 1.7.8.
My dataset has very little signal (predictions reach a correlation of 0.03). The queries are large (5,000 observations).
Something significant changed in the refactoring introduced in version 2.0 that consistently reduces performance.
With 3.0.0, correlation reaches 0.022 after just a few iterations and quickly starts overfitting, dropping below 0.02.
With 1.7.8, the model's performance on the test dataset improves continuously up to 0.03, requiring many thousands of trees to reach that level.
Is there any change in sampling, weighting, or the calculation of gradients/hessians introduced in the refactoring that could explain this?

I'm using these parameters to try to reproduce the behaviour of 1.7.8 in 3.0.0 (full-query optimization by rank, without normalization, and pairing method 'mean', as the metric is Spearman correlation):

booster = 'gbtree',
 objective = 'rank:pairwise',
 tree_method = 'hist',
 device = 'cuda',
 lambdarank_pair_method = 'mean',
 lambdarank_num_pair_per_sample = 200,
 lambdarank_normalization = FALSE,

I'll try to upload a portion of the data as a dataset on Kaggle along with code to reproduce the issue.

@trivialfis
Member

lambdarank_num_pair_per_sample is too large. Could you please experiment with 1?

@jaguerrerod
Author

jaguerrerod commented Feb 21, 2025

I have trained the same model with the same data using versions 1.7.8 and 3.0.0.
I used xgb.DMatrix in both cases to avoid introducing a difference by using xgb.QuantileDMatrix.

The only changes in the code are the parameters for version 3.0.0:

  • lambdarank_pair_method = 'mean',
  • lambdarank_num_pair_per_sample = 200,
  • lambdarank_normalization = FALSE.

The learning with version 1.7.8 is stable and reaches a correlation of 0.0317 at 13K trees:

1	0.007926573
10	0.01521309
20	0.01781933
30	0.01781363
40	0.01825633
50	0.01907869
100	0.02007266
150	0.02040283
200	0.02058564
250	0.02081901
300	0.02116498
350	0.02146332
400	0.02172211
450	0.0219907
500	0.02210538
13000   0.03174277

The result with version 3.0.0 does not improve after a few trees:

1	0.004441884
10	0.01096987
20	0.01326301
30	0.01420671
40	0.01612788
50	0.01708117
100	0.01741559
150	0.01584577
200	0.01570146
250	0.01508331
300	0.01510909
350	0.01517024
400	0.01534667
450	0.01440822
500	0.01440194

This happens with different dataset subsets, different sets of predictive variables, and different general parameters (depth, colsample_bytree, lambda, ...).
When I train version 3.0.0 with the number of pairs set to 1, the results are worse:

1	0.005942605
10	0.01068313
20	0.01301092
30	0.01300241
40	0.01367269
50	0.01433194
100	0.01474676
150	0.01513208
200	0.0140801
250	0.01407706
300	0.01390105
350	0.01460263
400	0.01426433
450	0.01336728
500	0.01359152

Training with the number of pairs set to 10, the behavior of 3.0.0 is similar: it quickly starts to degrade:

1	0.003311274
10	0.01169847
20	0.01278498
30	0.01404126
40	0.01554281
50	0.01665777
100	0.01611878
150	0.0167202
200	0.01597453
250	0.01562164
300	0.0157668
350	0.01574301
400	0.01558456
450	0.01564038
500	0.01515673

I'm not a C++ coder, and following the differences between 1.7.8 and 3.0.0 is hard for me since it is a complete refactoring, but I think something changed that causes these behaviors.
I suspected it was related to delta_metric, which is the new part I noticed in the gradient and hessian computation.

@trivialfis
Member

Ok, I get it now. The ranknet loss has no delta metric (1.0), but the 1.0 is normalized by the ranking score difference, which is undesirable.
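
Concretely (a hedged reading of the snippet quoted at the top of this issue, not code from the source tree): with rank:pairwise the delta term is the constant 1.0, so whenever the query's best and worst scores differ, the pair weight collapses to an inverse score difference:

// Sketch: effective pair weight for rank:pairwise, assuming
// delta(y_high, y_low, rank_high, rank_low) == 1.0 and
// best_score != worst_score, per the snippet at the top of the thread.
double delta_metric = 1.0 / (std::abs(s_high - s_low) + 0.01);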

@jaguerrerod
Author

jaguerrerod commented Feb 21, 2025

I've seen that you've disabled pair normalization in pairwise. This is the behavior of version 1.7.
I'll test it when it's available to see if I get similar results between both versions.
In datasets with a lot of noise and little signal, this is the best option. However, in datasets with a strong signal, normalizing a pair based on the difference in their labels might make sense.
Perhaps for future versions, including a parameter to choose whether to normalize pairs by the difference in label ranks or not could make the approach more versatile.
What intuitively makes the most sense to me is to use the label rank calculated considering the frequency of each label (like percentiles).
Determining whether pairwise works better with or without pair normalization in more predictable datasets is something worth investigating.
If this option is included in a future release, I commit to running the comparison.
EDIT:
I'll explain a bit how I think normalization by labels should work in pairwise.
In this line
delta_metric /= (delta_score + 0.01);
You are normalizing inversely by delta_score, meaning you give more relevance to pairs with similar predictions.
What is the reasoning behind this?
I believe the logic in pairwise should be to give more relevance to pairs that are more different—not based on predictions, but on the true labels.
For example:
delta_metric = std::abs(y_high_rank - y_low_rank)
where y_high_rank and y_low_rank are the percentiles of the true labels, considering their frequency distribution.
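
A minimal sketch of this percentile idea (hypothetical code, not part of XGBoost; PercentileRanks and LabelRankDelta are invented names, and labels is assumed to hold one query's labels):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Percentile rank of each label within its query; ties share the average
// rank of their block, so the percentile reflects each label's frequency.
std::vector<double> PercentileRanks(std::vector<double> const& labels) {
  std::size_t const n = labels.size();
  std::vector<std::size_t> idx(n);
  for (std::size_t i = 0; i < n; ++i) idx[i] = i;
  std::sort(idx.begin(), idx.end(),
            [&](std::size_t a, std::size_t b) { return labels[a] < labels[b]; });
  std::vector<double> pct(n);
  double const denom = n > 1 ? static_cast<double>(n - 1) : 1.0;
  std::size_t i = 0;
  while (i < n) {
    std::size_t j = i;
    while (j < n && labels[idx[j]] == labels[idx[i]]) ++j;  // tie block [i, j)
    double const mid_rank = 0.5 * static_cast<double>(i + j - 1);
    for (std::size_t k = i; k < j; ++k) pct[idx[k]] = mid_rank / denom;
    i = j;
  }
  return pct;
}

// Proposed pair weight: absolute difference of the two label percentiles.
double LabelRankDelta(std::vector<double> const& pct,
                      std::size_t high, std::size_t low) {
  return std::abs(pct[high] - pct[low]);
}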

@trivialfis
Member

You are normalizing inversely by delta_score, meaning you give more relevance to pairs with similar predictions.

Thank you for pointing it out. It's made to prevent overfitting by "smoothing" things out. Sometimes it can make training stagnate as in your example. There are some other similar operations like the one employed by lambdarank_normalization.

At the time of reworking the ranking, I did some tuning for some popular benchmarking datasets like the MSLR and found the normalization useful. On the other hand, as noted in https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html#obtaining-good-result , I also find it quite difficult to get good results. Your experiments are welcome and feel free to make suggestions!

I will try to make it an option instead.

@jaguerrerod
Author

@trivialfis This seems fixed with the PR, thank you!

1	0.00424417
10	0.01324821
20	0.01554005
30	0.01645896
40	0.01771046
50	0.01804327
100	0.01979055
150	0.0208095
200	0.02083933
250	0.02113857
300	0.02120267
350	0.02139126
400	0.02197463
450	0.02219081
500	0.02237842

About normalization, I'm thinking of something like:
delta_metric = (1 + std::pow(std::abs(y_high_rank - y_low_rank), parameter))
If parameter = 0, delta_metric is constant and there is no normalization.
If parameter > 1, delta_metric overweights pairs with a larger difference of ranked labels, and the parameter itself controls the intensity of this overweighting.
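
A minimal sketch of that family (hypothetical helper built on the percentile ranks suggested earlier, not an XGBoost API):

#include <cmath>

// parameter == 0 -> constant weight (no label-based weighting);
// parameter > 1  -> pairs whose label ranks differ more get heavier weight,
//                   with parameter controlling the intensity.
double ParamDelta(double y_high_rank, double y_low_rank, double parameter) {
  return 1.0 + std::pow(std::abs(y_high_rank - y_low_rank), parameter);
}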
I'll review the code for NDCG and MAP and general weighting of queries by size and will propose a parametrization schema for your consideration, as a feature request for future versions.

Thank you again!

@jaguerrerod
Author

I've tested the PR in the 3.0.0 CPU build and it's fine, but when I try to use it in the build compiled with GPU support, it doesn't work.
Can you guide me on what is happening?

@trivialfis
Member

Could you please elaborate on "it doesn't work"?

@jaguerrerod
Author

3.0.0 with the PR compiled as CPU-only works fine and corrects the training stagnation.
3.0.0 with the PR compiled with the GPU option has the training-stagnation issue whether training on CPU or GPU.

@trivialfis
Member

I will try to reproduce the issue; I have been using the GPU build by default and haven't seen it yet.

@trivialfis
Member

trivialfis commented Feb 24, 2025

@jaguerrerod Could you please provide a snippet that I can run?

Note: disabling the parameter can hurt both training and validation results for MSLR.

@jaguerrerod
Author

The dataset is very big and not public. I'll prepare an obfuscated subset and share it in a Kaggle repository along with scripts.
MSLR is a dataset for ranking queries (documents), and you are probably using NDCG.
I suspect the main problem shows up when using whole-query optimization, for example using Spearman correlation by query as the metric and rank:pairwise as the objective.
I'll test MSLR using Spearman averaged by query with 1.7.8 and 3.0.0 next week, as this week I'll be traveling with low availability.

@trivialfis
Member

No worries. If some scripts cannot be made public, feel free to reach out with email.

@jaguerrerod
Author

jaguerrerod commented Mar 5, 2025

@trivialfis

I've uploaded a dataset to Kaggle Datasets: https://www.kaggle.com/datasets/blindape/noisy-data-for-xgboost-pairwise-testing
It's a gzipped CSV.
I've fitted the same model using 1.7.7.1 and 3.0.0.1 (release candidate) in R, both on GPU.
The stagnation problem persists in the release candidate. As the PR is named 'Optional normalization for the ranknet loss', is there a parameter I need to set?
EDIT:
I found the parameter lambdarank_diff_normalization here:
29373ea
I added it to the previous script, but it wasn't recognized by 3.0.0.1:

[21:54:14] WARNING: /workspace/src/learner.cc:738: 
Parameters: { "lambdarank_diff_normalization" } are not used.

Version 3.0.0.1:

rm(list = ls(all = TRUE)) 
require(data.table)
require(xgboost, lib.loc = "/usr/local/lib/R/site-library")
packageVersion("xgboost")

# Evaluation metric
spearman_by_qid <- function(preds, dataset) {
  labels <- getinfo(dataset, 'label')
  qids <- attr(dataset, 'qid')  
  pred_dt <- data.table(label = labels, pred = preds, qid = qids)
  spe_dt <- pred_dt[, .(spearman = cor(label, pred, method = 'spearman')), by = .(qid)]
  spe <- mean(spe_dt$spearman)
  return(list(metric = 'spearman_by_qid', value = spe))
}

# Dataset
dtb <- fread('noisy_data.csv')
setorder(dtb, qid)
train <- dtb[qid < 850]
hold <- dtb[qid >= 850]

qid_train <- train$qid
y_train <- train$target
qid_n_train <- train[, .N, by = qid]$N
m_train <- data.matrix(train[, paste0('V', 1:705), with = FALSE])
d_train <- xgb.DMatrix(m_train, label = y_train, group = qid_n_train)
attr(d_train, 'qid') <- qid_train

qid_hold <- hold$qid
y_hold <- hold$target
qid_n_hold <- hold[, .N, by = qid]$N
m_hold <- data.matrix(hold[, paste0('V', 1:705), with = FALSE])
d_hold <- xgb.DMatrix(m_hold, label = y_hold, group = qid_n_hold)
attr(d_hold, 'qid') <- qid_hold

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 30,
  lambdarank_normalization = FALSE,
  eta = 0.01, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)
  
watchlist <- list(train = d_train, hold = d_hold)
tic = proc.time()[3]
set.seed(6)
model_pw <- xgb.train(params, data = d_train, nround = 601, verbose = 1, evals = watchlist, custom_metric = spearman_by_qid, print_every_n = 25)
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')

Output

[1] ‘3.0.0.1’
[1]	train-spearman_by_qid:0.073514	hold-spearman_by_qid:0.007333 
[26]	train-spearman_by_qid:0.375031	hold-spearman_by_qid:0.014343 
[51]	train-spearman_by_qid:0.485319	hold-spearman_by_qid:0.015927 
[76]	train-spearman_by_qid:0.555559	hold-spearman_by_qid:0.015996 
[101]	train-spearman_by_qid:0.605441	hold-spearman_by_qid:0.015236 
[126]	train-spearman_by_qid:0.644135	hold-spearman_by_qid:0.014534 
[151]	train-spearman_by_qid:0.674899	hold-spearman_by_qid:0.014604 
[176]	train-spearman_by_qid:0.700418	hold-spearman_by_qid:0.014588 
[201]	train-spearman_by_qid:0.721630	hold-spearman_by_qid:0.014388 
[226]	train-spearman_by_qid:0.739233	hold-spearman_by_qid:0.013981 
[251]	train-spearman_by_qid:0.754917	hold-spearman_by_qid:0.014857 
[276]	train-spearman_by_qid:0.768738	hold-spearman_by_qid:0.014431 
[301]	train-spearman_by_qid:0.780884	hold-spearman_by_qid:0.014267 
[326]	train-spearman_by_qid:0.791535	hold-spearman_by_qid:0.014491 
[351]	train-spearman_by_qid:0.800943	hold-spearman_by_qid:0.015006 
[376]	train-spearman_by_qid:0.809436	hold-spearman_by_qid:0.014815 
[401]	train-spearman_by_qid:0.817371	hold-spearman_by_qid:0.014401 
[426]	train-spearman_by_qid:0.824311	hold-spearman_by_qid:0.014395 
[451]	train-spearman_by_qid:0.830833	hold-spearman_by_qid:0.014555 
[476]	train-spearman_by_qid:0.836697	hold-spearman_by_qid:0.014357 
[501]	train-spearman_by_qid:0.842376	hold-spearman_by_qid:0.014240 
[526]	train-spearman_by_qid:0.847554	hold-spearman_by_qid:0.014465 
[551]	train-spearman_by_qid:0.852082	hold-spearman_by_qid:0.014273 
[576]	train-spearman_by_qid:0.856488	hold-spearman_by_qid:0.014169 
[601]	train-spearman_by_qid:0.860698	hold-spearman_by_qid:0.014131 
14.58  mins 

Training quickly overfits the training data and doesn't improve on the hold data.

Version 1.7.8.1:

rm(list = ls(all = TRUE)) 
require(data.table)
require(xgboost)
packageVersion("xgboost")

# Evaluation metric
spearman_by_qid <- function(preds, dataset) {
  labels <- getinfo(dataset, 'label')
  qids <- attr(dataset, 'qid')  
  pred_dt <- data.table(label = labels, pred = preds, qid = qids)
  spe_dt <- pred_dt[, .(spearman = cor(label, pred, method = 'spearman')), by = .(qid)]
  spe <- mean(spe_dt$spearman)
  return(list(metric = 'spearman_by_qid', value = spe))
}

# Dataset
dtb <- fread('noisy_data.csv')
setorder(dtb, qid)
train <- dtb[qid < 850]
hold <- dtb[qid >= 850]

qid_train <- train$qid
y_train <- train$target
qid_n_train <- train[, .N, by = qid]$N
m_train <- data.matrix(train[, paste0('V', 1:705), with = FALSE])
d_train <- xgb.DMatrix(m_train, label = y_train, group = qid_n_train)
attr(d_train, 'qid') <- qid_train

qid_hold <- hold$qid
y_hold <- hold$target
qid_n_hold <- hold[, .N, by = qid]$N
m_hold <- data.matrix(hold[, paste0('V', 1:705), with = FALSE])
d_hold <- xgb.DMatrix(m_hold, label = y_hold, group = qid_n_hold)
attr(d_hold, 'qid') <- qid_hold

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'gpu_hist',
  num_pairsample = 30,
  eta = 0.01, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)
  
watchlist <- list(train = d_train, hold = d_hold)
tic = proc.time()[3]
set.seed(6)
model_pw <- xgb.train(params, data = d_train, nround = 601, verbose = 1, watchlist = watchlist, feval = spearman_by_qid, print_every_n = 25)
cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n')

Output

[1]	train-spearman_by_qid:0.043151	hold-spearman_by_qid:0.005277 
[26]	train-spearman_by_qid:0.124997	hold-spearman_by_qid:0.017733 
[51]	train-spearman_by_qid:0.140326	hold-spearman_by_qid:0.020667 
[76]	train-spearman_by_qid:0.149807	hold-spearman_by_qid:0.021959 
[101]	train-spearman_by_qid:0.154875	hold-spearman_by_qid:0.022085 
[126]	train-spearman_by_qid:0.159220	hold-spearman_by_qid:0.022700 
[151]	train-spearman_by_qid:0.162426	hold-spearman_by_qid:0.022514 
[176]	train-spearman_by_qid:0.165519	hold-spearman_by_qid:0.022922 
[201]	train-spearman_by_qid:0.168994	hold-spearman_by_qid:0.023324 
[226]	train-spearman_by_qid:0.172227	hold-spearman_by_qid:0.023435 
[251]	train-spearman_by_qid:0.175303	hold-spearman_by_qid:0.023645 
[276]	train-spearman_by_qid:0.177972	hold-spearman_by_qid:0.023886 
[301]	train-spearman_by_qid:0.180951	hold-spearman_by_qid:0.024151 
[326]	train-spearman_by_qid:0.184194	hold-spearman_by_qid:0.024269 
[351]	train-spearman_by_qid:0.187312	hold-spearman_by_qid:0.024332 
[376]	train-spearman_by_qid:0.190016	hold-spearman_by_qid:0.024483 
[401]	train-spearman_by_qid:0.192474	hold-spearman_by_qid:0.024565 
[426]	train-spearman_by_qid:0.195147	hold-spearman_by_qid:0.024703 
[451]	train-spearman_by_qid:0.197832	hold-spearman_by_qid:0.024765 
[476]	train-spearman_by_qid:0.200240	hold-spearman_by_qid:0.024885 
[501]	train-spearman_by_qid:0.203286	hold-spearman_by_qid:0.025002 
[526]	train-spearman_by_qid:0.205804	hold-spearman_by_qid:0.025319 
[551]	train-spearman_by_qid:0.208359	hold-spearman_by_qid:0.025422 
[576]	train-spearman_by_qid:0.210802	hold-spearman_by_qid:0.025574 
[601]	train-spearman_by_qid:0.213375	hold-spearman_by_qid:0.025716
13.70 mins

Training doesn't overfit the training data and keeps improving on the hold data.

@jaguerrerod
Author

jaguerrerod commented Mar 5, 2025

The problem is that the new parameter (lambdarank_diff_normalization) is named lambdarank_score_normalization in xgb.train:

xgb.params <- function(
  objective = NULL,
  verbosity = NULL,
  nthread = NULL,
  seed = NULL,
  booster = NULL,
  eta = NULL,
  learning_rate = NULL,
  gamma = NULL,
  min_split_loss = NULL,
  max_depth = NULL,
  min_child_weight = NULL,
  max_delta_step = NULL,
  subsample = NULL,
  sampling_method = NULL,
  colsample_bytree = NULL,
  colsample_bylevel = NULL,
  colsample_bynode = NULL,
  lambda = NULL,
  reg_lambda = NULL,
  alpha = NULL,
  reg_alpha = NULL,
  tree_method = NULL,
  scale_pos_weight = NULL,
  updater = NULL,
  refresh_leaf = NULL,
  grow_policy = NULL,
  max_leaves = NULL,
  max_bin = NULL,
  num_parallel_tree = NULL,
  monotone_constraints = NULL,
  interaction_constraints = NULL,
  multi_strategy = NULL,
  base_score = NULL,
  eval_metric = NULL,
  seed_per_iteration = NULL,
  device = NULL,
  disable_default_eval_metric = NULL,
  use_rmm = NULL,
  max_cached_hist_node = NULL,
  extmem_single_page = NULL,
  max_cat_to_onehot = NULL,
  max_cat_threshold = NULL,
  sample_type = NULL,
  normalize_type = NULL,
  rate_drop = NULL,
  one_drop = NULL,
  skip_drop = NULL,
  feature_selector = NULL,
  top_k = NULL,
  num_class = NULL,
  tweedie_variance_power = NULL,
  huber_slope = NULL,
  quantile_alpha = NULL,
  aft_loss_distribution = NULL,
  lambdarank_pair_method = NULL,
  lambdarank_num_pair_per_sample = NULL,
  lambdarank_normalization = NULL,
  lambdarank_score_normalization = NULL,
  lambdarank_unbiased = NULL,
  lambdarank_bias_norm = NULL,
  ndcg_exp_gain = NULL
) {

Setting this parameter to FALSE:

[1]	train-spearman_by_qid:0.073514	hold-spearman_by_qid:0.007333 
[26]	train-spearman_by_qid:0.290789	hold-spearman_by_qid:0.019192 
[51]	train-spearman_by_qid:0.341823	hold-spearman_by_qid:0.021591 
[76]	train-spearman_by_qid:0.368757	hold-spearman_by_qid:0.021607 
[101]	train-spearman_by_qid:0.386268	hold-spearman_by_qid:0.022857 
[126]	train-spearman_by_qid:0.399739	hold-spearman_by_qid:0.023421 
[151]	train-spearman_by_qid:0.410943	hold-spearman_by_qid:0.023859 
[176]	train-spearman_by_qid:0.420736	hold-spearman_by_qid:0.024347 
[201]	train-spearman_by_qid:0.429568	hold-spearman_by_qid:0.024503 
[226]	train-spearman_by_qid:0.438519	hold-spearman_by_qid:0.024862 
[251]	train-spearman_by_qid:0.446160	hold-spearman_by_qid:0.025318 
[276]	train-spearman_by_qid:0.453020	hold-spearman_by_qid:0.025542 
[301]	train-spearman_by_qid:0.460329	hold-spearman_by_qid:0.025785 
[326]	train-spearman_by_qid:0.466593	hold-spearman_by_qid:0.026227 
[351]	train-spearman_by_qid:0.472584	hold-spearman_by_qid:0.026642 
[376]	train-spearman_by_qid:0.478565	hold-spearman_by_qid:0.026876 
[401]	train-spearman_by_qid:0.484256	hold-spearman_by_qid:0.026940 
[426]	train-spearman_by_qid:0.489718	hold-spearman_by_qid:0.026943 
[451]	train-spearman_by_qid:0.494735	hold-spearman_by_qid:0.027256 
[476]	train-spearman_by_qid:0.499904	hold-spearman_by_qid:0.027414 
[501]	train-spearman_by_qid:0.504961	hold-spearman_by_qid:0.027704 
[526]	train-spearman_by_qid:0.509645	hold-spearman_by_qid:0.027860 
[551]	train-spearman_by_qid:0.513735	hold-spearman_by_qid:0.027812 
[576]	train-spearman_by_qid:0.517545	hold-spearman_by_qid:0.027951 
[601]	train-spearman_by_qid:0.521753	hold-spearman_by_qid:0.028242
14.22  mins 

The problem of stagnation is fixed.
The training is faster on both the train and hold datasets.
@trivialfis Do you have an idea of why the learning is faster? In my experience, when a model overfits quickly on train, it stops improving on hold earlier.
I will check a full model in both 1.7.8 and 3.0.0.1.
EDIT:
There is definitely a significant loss in performance and fitting time with 3.0.0.1 compared to 1.7.8. I will post the logs and give details in a while.

@trivialfis
Member

Do you have an idea of why the learning is faster? In my experience, when a model overfits quickly on train, it stops improving on hold earlier.

This really depends on the dataset and the specific parameters.

There is definitely a significant loss in performance and fitting time with 3.0.0.1 compared to 1.7.8.

Thank you for sharing. Is this true even with the documented set of parameters for reproducing 1.7?

@jaguerrerod
Author

1.7.8.1 is faster (13K trees in 5h 30min vs 9K trees with 3.0.0.1), and Spearman is 0.0285 with 3.0.0.1 vs 0.0324 with 1.7.8.1.
I think the key is the fast learning in 3.0.0.1: the train dataset gets overfitted.
It's as if the gradient scale were higher and the learning rate weren't small enough:
const bst_float w = pair.weight * scale;
I'll revise the code.

Changes with respect to the previous scripts: I leave a 5-query gap, and I'm using optimal parameters for this kind of noisy data.

# Dataset
dtb <- fread('noisy_data.csv')
setorder(dtb, qid)
train <- dtb[qid < 845]
hold <- dtb[qid >= 850]

3.0.0.1

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 200,
  lambdarank_normalization = FALSE,
  lambdarank_score_normalization = FALSE,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)
[1]	train-spearman_by_qid:0.077000	hold-spearman_by_qid:0.006653 
[26]	train-spearman_by_qid:0.317920	hold-spearman_by_qid:0.017047 
[51]	train-spearman_by_qid:0.369777	hold-spearman_by_qid:0.017682 
[76]	train-spearman_by_qid:0.395939	hold-spearman_by_qid:0.018190 
[101]	train-spearman_by_qid:0.411933	hold-spearman_by_qid:0.018947 
[126]	train-spearman_by_qid:0.423789	hold-spearman_by_qid:0.019412 
[151]	train-spearman_by_qid:0.432964	hold-spearman_by_qid:0.020026 
[176]	train-spearman_by_qid:0.441123	hold-spearman_by_qid:0.020745 
[201]	train-spearman_by_qid:0.447934	hold-spearman_by_qid:0.021128 
[226]	train-spearman_by_qid:0.454823	hold-spearman_by_qid:0.021494 
[251]	train-spearman_by_qid:0.459981	hold-spearman_by_qid:0.022007 
[276]	train-spearman_by_qid:0.465183	hold-spearman_by_qid:0.022325 
[301]	train-spearman_by_qid:0.470380	hold-spearman_by_qid:0.022218 
[326]	train-spearman_by_qid:0.474393	hold-spearman_by_qid:0.022580 
[351]	train-spearman_by_qid:0.479016	hold-spearman_by_qid:0.022960 
[376]	train-spearman_by_qid:0.483284	hold-spearman_by_qid:0.023101 
[401]	train-spearman_by_qid:0.487224	hold-spearman_by_qid:0.023350 
[426]	train-spearman_by_qid:0.490592	hold-spearman_by_qid:0.023412 
[451]	train-spearman_by_qid:0.494293	hold-spearman_by_qid:0.023642 
[476]	train-spearman_by_qid:0.498165	hold-spearman_by_qid:0.023668 
[501]	train-spearman_by_qid:0.501887	hold-spearman_by_qid:0.023899 
[526]	train-spearman_by_qid:0.505008	hold-spearman_by_qid:0.024080 
[551]	train-spearman_by_qid:0.508136	hold-spearman_by_qid:0.024140 
[576]	train-spearman_by_qid:0.511079	hold-spearman_by_qid:0.024370 
[601]	train-spearman_by_qid:0.514356	hold-spearman_by_qid:0.024312 
[626]	train-spearman_by_qid:0.516890	hold-spearman_by_qid:0.024461 
[651]	train-spearman_by_qid:0.519754	hold-spearman_by_qid:0.024627 
[676]	train-spearman_by_qid:0.522652	hold-spearman_by_qid:0.024655 
[701]	train-spearman_by_qid:0.525542	hold-spearman_by_qid:0.024638 
[726]	train-spearman_by_qid:0.528469	hold-spearman_by_qid:0.024642 
[751]	train-spearman_by_qid:0.531251	hold-spearman_by_qid:0.024713 
[776]	train-spearman_by_qid:0.534037	hold-spearman_by_qid:0.024830 
[801]	train-spearman_by_qid:0.536650	hold-spearman_by_qid:0.024878 
[826]	train-spearman_by_qid:0.539215	hold-spearman_by_qid:0.024842 
[851]	train-spearman_by_qid:0.541909	hold-spearman_by_qid:0.025063 
[876]	train-spearman_by_qid:0.544397	hold-spearman_by_qid:0.025070 
[901]	train-spearman_by_qid:0.546962	hold-spearman_by_qid:0.025207 
[926]	train-spearman_by_qid:0.549332	hold-spearman_by_qid:0.025151 
[951]	train-spearman_by_qid:0.551627	hold-spearman_by_qid:0.025318 
[976]	train-spearman_by_qid:0.553829	hold-spearman_by_qid:0.025522 
[1001]	train-spearman_by_qid:0.556202	hold-spearman_by_qid:0.025480 

[8001]	train-spearman_by_qid:0.766351	hold-spearman_by_qid:0.028490 
[8026]	train-spearman_by_qid:0.766649	hold-spearman_by_qid:0.028493 
[8051]	train-spearman_by_qid:0.766943	hold-spearman_by_qid:0.028493 
[8076]	train-spearman_by_qid:0.767232	hold-spearman_by_qid:0.028439 
[8101]	train-spearman_by_qid:0.767505	hold-spearman_by_qid:0.028422 
[8126]	train-spearman_by_qid:0.767770	hold-spearman_by_qid:0.028388 
[8151]	train-spearman_by_qid:0.768076	hold-spearman_by_qid:0.028405 
[8176]	train-spearman_by_qid:0.768363	hold-spearman_by_qid:0.028412 
[8201]	train-spearman_by_qid:0.768644	hold-spearman_by_qid:0.028411 
[8226]	train-spearman_by_qid:0.768939	hold-spearman_by_qid:0.028404 
[8251]	train-spearman_by_qid:0.769198	hold-spearman_by_qid:0.028388 
[8276]	train-spearman_by_qid:0.769496	hold-spearman_by_qid:0.028357 
[8301]	train-spearman_by_qid:0.769782	hold-spearman_by_qid:0.028379 
[8326]	train-spearman_by_qid:0.770040	hold-spearman_by_qid:0.028356 
[8351]	train-spearman_by_qid:0.770298	hold-spearman_by_qid:0.028354 
[8376]	train-spearman_by_qid:0.770570	hold-spearman_by_qid:0.028349 
[8401]	train-spearman_by_qid:0.770854	hold-spearman_by_qid:0.028355 
[8426]	train-spearman_by_qid:0.771118	hold-spearman_by_qid:0.028360 
[8451]	train-spearman_by_qid:0.771392	hold-spearman_by_qid:0.028359 
[8476]	train-spearman_by_qid:0.771658	hold-spearman_by_qid:0.028391 
[8501]	train-spearman_by_qid:0.771900	hold-spearman_by_qid:0.028388 
[8526]	train-spearman_by_qid:0.772166	hold-spearman_by_qid:0.028376 
[8551]	train-spearman_by_qid:0.772428	hold-spearman_by_qid:0.028342 
[8576]	train-spearman_by_qid:0.772707	hold-spearman_by_qid:0.028322 
[8601]	train-spearman_by_qid:0.772989	hold-spearman_by_qid:0.028307 
[8626]	train-spearman_by_qid:0.773235	hold-spearman_by_qid:0.028340 
[8651]	train-spearman_by_qid:0.773508	hold-spearman_by_qid:0.028346 
[8676]	train-spearman_by_qid:0.773789	hold-spearman_by_qid:0.028340 
[8701]	train-spearman_by_qid:0.774046	hold-spearman_by_qid:0.028333 
[8726]	train-spearman_by_qid:0.774297	hold-spearman_by_qid:0.028335 
[8751]	train-spearman_by_qid:0.774542	hold-spearman_by_qid:0.028360 
[8776]	train-spearman_by_qid:0.774812	hold-spearman_by_qid:0.028348 
[8801]	train-spearman_by_qid:0.775052	hold-spearman_by_qid:0.028359 
[8826]	train-spearman_by_qid:0.775321	hold-spearman_by_qid:0.028339 
[8851]	train-spearman_by_qid:0.775584	hold-spearman_by_qid:0.028315 
[8876]	train-spearman_by_qid:0.775849	hold-spearman_by_qid:0.028322 
[8901]	train-spearman_by_qid:0.776123	hold-spearman_by_qid:0.028289 
[8926]	train-spearman_by_qid:0.776394	hold-spearman_by_qid:0.028297 
[8951]	train-spearman_by_qid:0.776634	hold-spearman_by_qid:0.028302 
[8976]	train-spearman_by_qid:0.776907	hold-spearman_by_qid:0.028307 
[9001]	train-spearman_by_qid:0.777155	hold-spearman_by_qid:0.028303 
5h 24min

1.7.8.1

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'gpu_hist',
  num_pairsample = 200,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)
[551]	train-spearman_by_qid:0.181432	hold-spearman_by_qid:0.023316 
[576]	train-spearman_by_qid:0.182923	hold-spearman_by_qid:0.023393 
[601]	train-spearman_by_qid:0.184379	hold-spearman_by_qid:0.023452 
[626]	train-spearman_by_qid:0.185805	hold-spearman_by_qid:0.023531 
[651]	train-spearman_by_qid:0.187222	hold-spearman_by_qid:0.023546 
[676]	train-spearman_by_qid:0.188684	hold-spearman_by_qid:0.023705 
[701]	train-spearman_by_qid:0.190019	hold-spearman_by_qid:0.023857 
[726]	train-spearman_by_qid:0.191269	hold-spearman_by_qid:0.023928 
[751]	train-spearman_by_qid:0.192713	hold-spearman_by_qid:0.024069 
[776]	train-spearman_by_qid:0.194331	hold-spearman_by_qid:0.024194 
[801]	train-spearman_by_qid:0.195660	hold-spearman_by_qid:0.024304 
[826]	train-spearman_by_qid:0.197160	hold-spearman_by_qid:0.024349 
[851]	train-spearman_by_qid:0.198436	hold-spearman_by_qid:0.024395 
[876]	train-spearman_by_qid:0.199822	hold-spearman_by_qid:0.024400 
[901]	train-spearman_by_qid:0.201211	hold-spearman_by_qid:0.024428 
[926]	train-spearman_by_qid:0.202475	hold-spearman_by_qid:0.024506 
[951]	train-spearman_by_qid:0.203855	hold-spearman_by_qid:0.024585 
[976]	train-spearman_by_qid:0.205255	hold-spearman_by_qid:0.024664 
[1001]	train-spearman_by_qid:0.206406	hold-spearman_by_qid:0.024764 
[1026]	train-spearman_by_qid:0.207651	hold-spearman_by_qid:0.024837 
[1051]	train-spearman_by_qid:0.208835	hold-spearman_by_qid:0.024876 
[1076]	train-spearman_by_qid:0.210165	hold-spearman_by_qid:0.024953 
[1101]	train-spearman_by_qid:0.211485	hold-spearman_by_qid:0.024965 
[1126]	train-spearman_by_qid:0.212584	hold-spearman_by_qid:0.025032 
[1151]	train-spearman_by_qid:0.213627	hold-spearman_by_qid:0.025043 
[1176]	train-spearman_by_qid:0.214862	hold-spearman_by_qid:0.025168 
[1201]	train-spearman_by_qid:0.216096	hold-spearman_by_qid:0.025225 
[1226]	train-spearman_by_qid:0.217243	hold-spearman_by_qid:0.025284 
[1251]	train-spearman_by_qid:0.218465	hold-spearman_by_qid:0.025368 
[1276]	train-spearman_by_qid:0.219647	hold-spearman_by_qid:0.025410 
[1301]	train-spearman_by_qid:0.220761	hold-spearman_by_qid:0.025455 
[1326]	train-spearman_by_qid:0.222037	hold-spearman_by_qid:0.025488 
[1351]	train-spearman_by_qid:0.223212	hold-spearman_by_qid:0.025498 
[1376]	train-spearman_by_qid:0.224397	hold-spearman_by_qid:0.025518 
[1401]	train-spearman_by_qid:0.225676	hold-spearman_by_qid:0.025558 
[1426]	train-spearman_by_qid:0.226861	hold-spearman_by_qid:0.025692 
[1451]	train-spearman_by_qid:0.227993	hold-spearman_by_qid:0.025736 
[1476]	train-spearman_by_qid:0.229108	hold-spearman_by_qid:0.025775 
[1501]	train-spearman_by_qid:0.230267	hold-spearman_by_qid:0.025846 

[12001]	train-spearman_by_qid:0.459669	hold-spearman_by_qid:0.032351 
[12026]	train-spearman_by_qid:0.459975	hold-spearman_by_qid:0.032351 
[12051]	train-spearman_by_qid:0.460293	hold-spearman_by_qid:0.032361 
[12076]	train-spearman_by_qid:0.460601	hold-spearman_by_qid:0.032364 
[12101]	train-spearman_by_qid:0.460916	hold-spearman_by_qid:0.032354 
[12126]	train-spearman_by_qid:0.461211	hold-spearman_by_qid:0.032354 
[12151]	train-spearman_by_qid:0.461525	hold-spearman_by_qid:0.032351 
[12176]	train-spearman_by_qid:0.461867	hold-spearman_by_qid:0.032354 
[12201]	train-spearman_by_qid:0.462162	hold-spearman_by_qid:0.032357 
[12226]	train-spearman_by_qid:0.462451	hold-spearman_by_qid:0.032349 
[12251]	train-spearman_by_qid:0.462772	hold-spearman_by_qid:0.032339 
[12276]	train-spearman_by_qid:0.463075	hold-spearman_by_qid:0.032347 
[12301]	train-spearman_by_qid:0.463368	hold-spearman_by_qid:0.032359 
[12326]	train-spearman_by_qid:0.463645	hold-spearman_by_qid:0.032360 
[12351]	train-spearman_by_qid:0.463944	hold-spearman_by_qid:0.032365 
[12376]	train-spearman_by_qid:0.464271	hold-spearman_by_qid:0.032375 
[12401]	train-spearman_by_qid:0.464564	hold-spearman_by_qid:0.032379 
[12426]	train-spearman_by_qid:0.464870	hold-spearman_by_qid:0.032381 
[12451]	train-spearman_by_qid:0.465164	hold-spearman_by_qid:0.032384 
[12476]	train-spearman_by_qid:0.465476	hold-spearman_by_qid:0.032386 
[12501]	train-spearman_by_qid:0.465787	hold-spearman_by_qid:0.032388 
[12526]	train-spearman_by_qid:0.466060	hold-spearman_by_qid:0.032382 
[12551]	train-spearman_by_qid:0.466380	hold-spearman_by_qid:0.032389 
[12576]	train-spearman_by_qid:0.466699	hold-spearman_by_qid:0.032400 
[12601]	train-spearman_by_qid:0.467011	hold-spearman_by_qid:0.032399 
[12626]	train-spearman_by_qid:0.467322	hold-spearman_by_qid:0.032394 
[12651]	train-spearman_by_qid:0.467624	hold-spearman_by_qid:0.032387 
[12676]	train-spearman_by_qid:0.467952	hold-spearman_by_qid:0.032383 
[12701]	train-spearman_by_qid:0.468256	hold-spearman_by_qid:0.032374 
[12726]	train-spearman_by_qid:0.468547	hold-spearman_by_qid:0.032349 
[12751]	train-spearman_by_qid:0.468838	hold-spearman_by_qid:0.032351 
[12776]	train-spearman_by_qid:0.469147	hold-spearman_by_qid:0.032361 
[12801]	train-spearman_by_qid:0.469431	hold-spearman_by_qid:0.032333 
[12826]	train-spearman_by_qid:0.469703	hold-spearman_by_qid:0.032324 
[12851]	train-spearman_by_qid:0.470013	hold-spearman_by_qid:0.032304 
[12876]	train-spearman_by_qid:0.470292	hold-spearman_by_qid:0.032317 
[12901]	train-spearman_by_qid:0.470569	hold-spearman_by_qid:0.032313 
[12926]	train-spearman_by_qid:0.470848	hold-spearman_by_qid:0.032324 
[12951]	train-spearman_by_qid:0.471158	hold-spearman_by_qid:0.032336 
[12976]	train-spearman_by_qid:0.471459	hold-spearman_by_qid:0.032331 
[13001]	train-spearman_by_qid:0.471737	hold-spearman_by_qid:0.032347 
5h 37min

@jaguerrerod
Author

jaguerrerod commented Mar 7, 2025

@trivialfis
It's hard for me to follow the code, as my knowledge of C++ is limited. Sorry in advance if I say something wrong about the code.
It's a bit easier in 1.7 than in 3.0... for me.
Some things raise doubts.
Here (in 1.7):

xgboost/src/objective/rank_obj.cu

Lines 875 to 893 in 36eb41c

 // get lambda weight for the pairs 
 LambdaWeightComputerT::GetLambdaWeight(lst, &pairs); 
 // rescale each gradient and hessian so that the lst have constant weighted 
 float scale = 1.0f / param_.num_pairsample; 
 if (param_.fix_list_weight != 0.0f) { 
   scale *= param_.fix_list_weight / (gptr[k + 1] - gptr[k]); 
 } 
 for (auto & pair : pairs) { 
   const ListEntry &pos = lst[pair.pos_index]; 
   const ListEntry &neg = lst[pair.neg_index]; 
   const bst_float w = pair.weight * scale; 
   const float eps = 1e-16f; 
   bst_float p = common::Sigmoid(pos.pred - neg.pred); 
   bst_float g = p - 1.0f; 
   bst_float h = std::max(p * (1.0f - p), eps); 
   // accumulate gradient and hessian in both pid, and nid 
   gpair[pos.rindex] += GradientPair(g * w, 2.0f*w*h); 
   gpair[neg.rindex] += GradientPair(-g * w, 2.0f*w*h); 
 } 

There are two normalizations: one by num_pairsample and another optional one (depending on the fix_list_weight parameter?!). Together they give scale.
Then w = pair.weight * scale (I don't know what pair.weight is in rank:pairwise; I assume it is 1).
w is then used in the gradients and hessians:

// accumulate gradient and hessian in both pid, and nid
gpair[pos.rindex] += GradientPair(g * w, 2.0f*w*h);
gpair[neg.rindex] += GradientPair(-g * w, 2.0f*w*h);

I think this isn't enough, and the normalization of gradients should be done in another way.
We sample num_pairsample pairs for each observation. When we accumulate each pair in pos.rindex and neg.rindex, we will have on average 2 * num_pairsample pair terms per original observation, but the distribution may be very uneven. For example, if num_pairsample is 10, some observations will be present in 15 pairs and others in 25 pairs. If the constant 10 is used for scaling, the gradients and hessians will be wrong. I didn't find where in the code this (the different number of pairs each observation belongs to) is taken into account. It only requires a counter per observation, incremented each time the observation is used in the pos or neg position of a pair, and then normalizing the gradient as sum(gradients) / number of pairs in which the observation is present (see the sketch below).
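
A minimal sketch of that counting scheme (hypothetical code, not XGBoost's actual implementation; Pair and the flat gradient/hessian vectors are assumptions):

#include <cstddef>
#include <vector>

struct Pair { std::size_t pos, neg; };  // row indices within one query

// Divide each observation's accumulated gradient/hessian by the number of
// pairs it actually appeared in (either side), instead of by a constant
// num_pairsample.
void NormalizeByPairCount(std::vector<Pair> const& pairs,
                          std::vector<double>* grad,
                          std::vector<double>* hess) {
  std::vector<std::size_t> n_pairs(grad->size(), 0);
  for (auto const& p : pairs) {
    ++n_pairs[p.pos];
    ++n_pairs[p.neg];
  }
  for (std::size_t i = 0; i < grad->size(); ++i) {
    if (n_pairs[i] > 0) {
      (*grad)[i] /= static_cast<double>(n_pairs[i]);
      (*hess)[i] /= static_cast<double>(n_pairs[i]);
    }
  }
}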

In 3.0.0, I didn't find where the normalization by num_pairsample is done (like computing scale = 1.0f / param_.num_pairsample in 1.7).
If no normalization by the number of pairs is done, the gradients will grow as the number of pairs increases, and this can explain the super-fast learning that reaches a suboptimal solution.

The other issue: 3.0.0 takes significantly more time to fit the model. I read that in some cases you need to do two passes to compute gradients, and that it is related to some regularizations.
I think in pairwise without regularization (lambdarank_score_normalization = FALSE) this isn't necessary (I think you already take this into account).
For pair sampling, you compute the subsets of observations with the same label in each query at each iteration (I'm not sure, but that is what I understand from the code). This can take a lot of time for big datasets with thousands of queries.
LightGBM uses another approach: sample pairs and discard pairs with the same label (see the sketch below). I think in most cases it is faster, and in the future it would permit a parameter min_label_difference to set the minimal absolute difference in labels for a pair to be selected (discarding pairs with very close labels). Another approach is to precompute the subsets of labels for each query only once and reuse them at each iteration.
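
A minimal sketch of the sample-and-discard scheme described above (hypothetical code; labels holds one query's labels, emit receives a (high, low) index pair, and a min_label_difference > 0 also discards equal labels):

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Randomly sample candidate pairs within a query, keeping only pairs whose
// labels differ by at least min_label_difference.
template <typename EmitPair>
void SamplePairs(std::vector<float> const& labels, int n_samples_per_row,
                 float min_label_difference, std::mt19937* rng,
                 EmitPair emit) {
  if (labels.size() < 2) return;
  std::uniform_int_distribution<std::size_t> pick(0, labels.size() - 1);
  for (std::size_t i = 0; i < labels.size(); ++i) {
    for (int k = 0; k < n_samples_per_row; ++k) {
      std::size_t j = pick(*rng);
      float const diff = labels[i] - labels[j];
      if (std::abs(diff) < min_label_difference) continue;  // discard pair
      emit(diff > 0 ? i : j, diff > 0 ? j : i);             // (high, low)
    }
  }
}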

@trivialfis
Member

The other issue: 3.0.0 takes significantly more time to fit the model.

This is likely caused by XGBoost building deeper trees at the beginning due to the new optimal intercept. This has been observed in the past #9452 (comment) . We can't compare the speed difference when training different models with different starting points.

I read that in some cases you need to do two passes to compute gradients, and that it is related to some regularizations.

The cost of calculating the objective is relatively small. It would be surprising if it contributed more than 5% of overall time, don't worry about it.

I think this isn't enough, and the normalization of gradients should be done in another way.

Normalizing by the number of pairs is possible in theory. We have another normalization that uses the gradient instead of the discrete num pairs:

double norm = std::log2(1.0 + sum_lambda) / sum_lambda;
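
For a sense of scale (a worked check on the line above, not taken from the source): with sum_lambda = 1000 this gives norm = log2(1001) / 1000 ≈ 0.01, so the query's accumulated gradients are shrunk by roughly two orders of magnitude.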

LightGBM uses another approach: sample pairs and discard pairs with the same label.

I think lgbm uses top-k instead of random sampling, but I could be wrong. Do you have a reference? We also discard the same label.

if (g_label(g_rank[i]) == g_label(g_rank[j])) {

if (g_label(g_rank[rank_high]) == g_label(g_rank[rank_low])) {

@trivialfis
Member

Looking again at your example parameter:

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 200,
  lambdarank_normalization = FALSE,
  lambdarank_score_normalization = FALSE,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)

Could you please consider setting the base_score to 0.5 to disable the intercept estimation?

@jaguerrerod
Author

jaguerrerod commented Mar 7, 2025

I think we should focus on the main issue (model accuracy) first, and on the rest (fitting time, options...) afterwards.
I'll answer your questions and then return to the main issue:

Right, LightGBM doesn't sample, only top_k, but they enumerate all pairs and skip pairs with the same label here:

https://github.com/microsoft/LightGBM/blob/6437645c4a0c17046be59e4f57d09952e2e0185f/src/objective/rank_objective.hpp#L212-L213

If you use:

if (g_label(g_rank[i]) == g_label(g_rank[j])) {

if (g_label(g_rank[rank_high]) == g_label(g_rank[rank_low])) {

why is it necessary to find the bucket boundary for each label, each query, and each iteration? I think it may take a lot of time.

for (std::size_t i = 0; i < cnt;) {
  std::size_t j = i + 1;
  // find the bucket boundary
  while (j < cnt && rev_it[i] == rev_it[j]) {
    ++j;
  }
  // Bucket [i,j), construct n_samples pairs for each sample inside the bucket with
  // another sample outside the bucket.
  //
  // n elements left to the bucket, and n elements right to the bucket
  std::size_t n_lefts = i, n_rights = static_cast<std::size_t>(cnt - j);
  if (n_lefts + n_rights == 0) {
    i = j;
    continue;
  }

I think precomputing the buckets per query, or random sampling plus checking that the pair has different labels, is more efficient, but that isn't the major issue right now.

About the normalization in:

double norm = std::log2(1.0 + sum_lambda) / sum_lambda;

The justification is in the LightGBM discussion referenced in the code:
microsoft/LightGBM#2331 (comment)

I think that precisely in LTR this doesn't make sense, as the outputs are not averages of labels; in fact, in LTR the leaf values are invariant under monotone transformations of the labels, at least with the pairwise objective.
I don't know exactly what this is doing:

if (sum_lambda > 0.0 && param_.lambdarank_normalization) {
  double norm = std::log2(1.0 + sum_lambda) / sum_lambda;
  std::transform(g_gpair.Values().data(), g_gpair.Values().data() + g_gpair.Size(),
                 g_gpair.Values().data(), [norm](GradientPair const& g) { return g * norm; });
}

but in my example lambdarank_normalization = FALSE, so it doesn't apply.

I've tried setting base_score to 0.5:

[2501]	train-spearman_by_qid:0.647817	hold-spearman_by_qid:0.027990 
[2551]	train-spearman_by_qid:0.649925	hold-spearman_by_qid:0.028033 
[2601]	train-spearman_by_qid:0.651988	hold-spearman_by_qid:0.028019 
[2651]	train-spearman_by_qid:0.654017	hold-spearman_by_qid:0.028124 
[2701]	train-spearman_by_qid:0.656058	hold-spearman_by_qid:0.028189 
[2751]	train-spearman_by_qid:0.657965	hold-spearman_by_qid:0.028232 
[2801]	train-spearman_by_qid:0.659933	hold-spearman_by_qid:0.028222 
[2851]	train-spearman_by_qid:0.661833	hold-spearman_by_qid:0.028182 
[2901]	train-spearman_by_qid:0.663652	hold-spearman_by_qid:0.028179 
[2951]	train-spearman_by_qid:0.665462	hold-spearman_by_qid:0.028228 
[3001]	train-spearman_by_qid:0.667271	hold-spearman_by_qid:0.028174

versus without setting it:

[2501]	train-spearman_by_qid:0.647564	hold-spearman_by_qid:0.027745 
[2526]	train-spearman_by_qid:0.648628	hold-spearman_by_qid:0.027827 
[2551]	train-spearman_by_qid:0.649688	hold-spearman_by_qid:0.027831 
[2576]	train-spearman_by_qid:0.650723	hold-spearman_by_qid:0.027831 
[2601]	train-spearman_by_qid:0.651724	hold-spearman_by_qid:0.027901 
[2626]	train-spearman_by_qid:0.652783	hold-spearman_by_qid:0.027916 
[2651]	train-spearman_by_qid:0.653798	hold-spearman_by_qid:0.028014 
[2676]	train-spearman_by_qid:0.654851	hold-spearman_by_qid:0.028055 
[2701]	train-spearman_by_qid:0.655841	hold-spearman_by_qid:0.028100 
[2726]	train-spearman_by_qid:0.656771	hold-spearman_by_qid:0.028091 
[2751]	train-spearman_by_qid:0.657731	hold-spearman_by_qid:0.028069 
[2776]	train-spearman_by_qid:0.658775	hold-spearman_by_qid:0.028063 
[2801]	train-spearman_by_qid:0.659695	hold-spearman_by_qid:0.028121 
[2826]	train-spearman_by_qid:0.660685	hold-spearman_by_qid:0.028088 
[2851]	train-spearman_by_qid:0.661604	hold-spearman_by_qid:0.028138 
[2876]	train-spearman_by_qid:0.662544	hold-spearman_by_qid:0.028115 
[2901]	train-spearman_by_qid:0.663415	hold-spearman_by_qid:0.028131 
[2926]	train-spearman_by_qid:0.664318	hold-spearman_by_qid:0.028168 
[2951]	train-spearman_by_qid:0.665256	hold-spearman_by_qid:0.028156 
[2976]	train-spearman_by_qid:0.666171	hold-spearman_by_qid:0.028207 
[3001]	train-spearman_by_qid:0.667080	hold-spearman_by_qid:0.028183

It has no impact on accuracy or fitting time.

@jaguerrerod
Author

jaguerrerod commented Mar 7, 2025

@trivialfis I've detected the problem

Fitting the model using this parameters (lambdarank_num_pair_per_sample = 1):

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 1,
  lambdarank_normalization = FALSE,
  lambdarank_score_normalization = FALSE,
  base_score = 0.5,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)

The output

[1]	train-spearman_by_qid:0.037416	hold-spearman_by_qid:0.007597 
[51]	train-spearman_by_qid:0.136896	hold-spearman_by_qid:0.019145 
[101]	train-spearman_by_qid:0.144917	hold-spearman_by_qid:0.020341 
2.11  mins 

Then I changed lambdarank_num_pair_per_sample to 30:

[1]	train-spearman_by_qid:0.075092	hold-spearman_by_qid:0.007691 
[51]	train-spearman_by_qid:0.339578	hold-spearman_by_qid:0.018594 
[101]	train-spearman_by_qid:0.377105	hold-spearman_by_qid:0.020517 
2.39  mins 

See the train Spearman in both: with 30 pairs the model is overfitting the training set.
Let's look at the dump of the first tree of each model:

For lambdarank_num_pair_per_sample = 1:

booster[0]
0:[V641<4] yes=1,no=2,missing=2,gain=89.2688675,cover=2255480
	1:[V351<2] yes=3,no=4,missing=4,gain=65.1546021,cover=1354025.5
		3:[V587<5] yes=7,no=8,missing=8,gain=28.3486176,cover=317039
			7:[V583<2] yes=15,no=16,missing=16,gain=28.00634,cover=254954.5
				15:[V443<3] yes=31,no=32,missing=32,gain=26.714529,cover=52255.5

The cover of the root is 2255480, which is the number of observations in train.

For lambdarank_num_pair_per_sample = 30:

booster[0]
0:[V677<5] yes=1,no=2,missing=2,gain=2768.71338,cover=67664400
	1:[V351<2] yes=3,no=4,missing=4,gain=1709.44702,cover=53721736
		3:[V56<5] yes=7,no=8,missing=8,gain=931.140015,cover=12109082
			7:[V429<2] yes=15,no=16,missing=16,gain=751.497803,cover=10067561
				15:[V409<3] yes=31,no=32,missing=32,gain=497.40625,cover=4394222

The cover is now 30 times the number of observations (2255480 × 30 = 67664400, exactly the root cover above), and the gains are extremely high. Tree 0 has 1602 nodes, versus 240 nodes with 1 pair. This is the cause of the fast overfitting and the poor accuracy.

Now with 1.7.8 and 30 pairs:

[1]	train-spearman_by_qid:0.043578	hold-spearman_by_qid:0.005865 
[51]	train-spearman_by_qid:0.136368	hold-spearman_by_qid:0.019409 
[101]	train-spearman_by_qid:0.148734	hold-spearman_by_qid:0.020399 
2.48  mins 

See how similar it is to 3.0.0 with only 1 pair.
And the first tree dump:

booster[0]
0:[f488<4] yes=1,no=2,missing=2,gain=51.8500748,cover=2255479.5
	1:[f42<3] yes=3,no=4,missing=4,gain=42.0142517,cover=1354243.88
		3:[f278<3] yes=7,no=8,missing=8,gain=76.0427017,cover=670639.75
			7:[f488<3] yes=15,no=16,missing=16,gain=23.512207,cover=269437.031

The cover is again the number of observations in train, and the tree structure (number of nodes) is the same with 30 pairs or 1 pair.
The total cover in 1.7.8 is invariant to the number of pairs; in 3.0.0 it is scaled multiplicatively by the number of pairs.

This is the problem. When accumulating gradients and hessians in 3.0.0, it is necessary to divide by the number of pairs, as done in 1.7.8 here (line 878):

// get lambda weight for the pairs
LambdaWeightComputerT::GetLambdaWeight(lst, &pairs);
// rescale each gradient and hessian so that the lst have constant weighted
float scale = 1.0f / param_.num_pairsample;
if (param_.fix_list_weight != 0.0f) {
  scale *= param_.fix_list_weight / (gptr[k + 1] - gptr[k]);
}
for (auto & pair : pairs) {
  const ListEntry &pos = lst[pair.pos_index];
  const ListEntry &neg = lst[pair.neg_index];
  const bst_float w = pair.weight * scale;
  const float eps = 1e-16f;
  bst_float p = common::Sigmoid(pos.pred - neg.pred);
  bst_float g = p - 1.0f;
  bst_float h = std::max(p * (1.0f - p), eps);
  // accumulate gradient and hessian in both pid, and nid
  gpair[pos.rindex] += GradientPair(g * w, 2.0f*w*h);
  gpair[neg.rindex] += GradientPair(-g * w, 2.0f*w*h);
}

or, preferably, divide by the number of pairs in which each observation is present (all you need is a counter array incremented each time an observation appears in the pos or neg position of a pair).
Dividing by the exact number of pairs is better when num_pairs is small, as in that case the number of terms added to each gradient can vary a lot. In my case I use big numbers, like 200; probably the reason I found it better is related to how the normalization is done in 1.7.8 (dividing by a constant number of pairs without taking the variability into account).

@jaguerrerod
Author

jaguerrerod commented Mar 7, 2025

Confirmed.
I used a workaround to fix the problem: multiplying min_child_weight by lambdarank_num_pair_per_sample compensates for the artificial inflation of the gradients/hessians (min_child_weight thresholds the hessian sum in a node, and that sum grows roughly with the number of pairs).

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 200,
  lambdarank_normalization = FALSE,
  lambdarank_score_normalization = FALSE,
  base_score = 0.5,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000 * 200,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)

With 3.0.0 and 200 pairs, the max Spearman was 0.0283 at 9K trees (in 5h 24min).

With the problem fixed by this workaround, the Spearman is similar to 1.7.8 on both the train and hold datasets:

[1]	train-spearman_by_qid:0.041397	hold-spearman_by_qid:0.007486 
[51]	train-spearman_by_qid:0.135359	hold-spearman_by_qid:0.019823 
[101]	train-spearman_by_qid:0.144769	hold-spearman_by_qid:0.021577 
[151]	train-spearman_by_qid:0.150307	hold-spearman_by_qid:0.021562 
[201]	train-spearman_by_qid:0.155436	hold-spearman_by_qid:0.022253 
[251]	train-spearman_by_qid:0.159878	hold-spearman_by_qid:0.022673 
[301]	train-spearman_by_qid:0.163573	hold-spearman_by_qid:0.022863 
[351]	train-spearman_by_qid:0.166534	hold-spearman_by_qid:0.023030 
[401]	train-spearman_by_qid:0.170095	hold-spearman_by_qid:0.023127 
[451]	train-spearman_by_qid:0.173500	hold-spearman_by_qid:0.023220 
[501]	train-spearman_by_qid:0.177232	hold-spearman_by_qid:0.023443 

[5001]	train-spearman_by_qid:0.341905	hold-spearman_by_qid:0.030768 
[5051]	train-spearman_by_qid:0.343039	hold-spearman_by_qid:0.030800 
[5101]	train-spearman_by_qid:0.344219	hold-spearman_by_qid:0.030856 
[5151]	train-spearman_by_qid:0.345337	hold-spearman_by_qid:0.030906 
[5201]	train-spearman_by_qid:0.346483	hold-spearman_by_qid:0.030947 
[5251]	train-spearman_by_qid:0.347593	hold-spearman_by_qid:0.030980 
[5301]	train-spearman_by_qid:0.348610	hold-spearman_by_qid:0.031035 
[5351]	train-spearman_by_qid:0.349752	hold-spearman_by_qid:0.031091 
[5401]	train-spearman_by_qid:0.350878	hold-spearman_by_qid:0.031074 
[5451]	train-spearman_by_qid:0.351970	hold-spearman_by_qid:0.031129 
[5501]	train-spearman_by_qid:0.353115	hold-spearman_by_qid:0.031157 
2h 55min

Time is still slower than 1.7.8: 13K trees in 5h 37min (38.6 trees per min).
In 3.0.0 with the workaround: 5.5K trees in 2h 55min (31.4 trees per min).

@trivialfis
Member

Okay, got it. Thank you for the very detailed diagnosis. I will look into it:

  • Normalization with the number of pairs.
  • Look into eliminating one of the bucket skipping routines when random sampling is used. (The duplication doesn't happen when top-k is used).

As for the two-pass reduction in the objective, it's done to make GPU computation deterministic (parallel floating point summation). I added the comment about upper bounds in the code as a potential optimization for the future.

@trivialfis
Member

Could you please help take a look at #11322?

It's not normalized by per-sample number of pairs yet. It's just restoring the old behaviour from 1.7. We can experiment with other approaches later.

@jaguerrerod
Author

Sure, I'll do it this evening.

@trivialfis
Member

Thank you! The PR replaces the gradient-based normalization you mentioned with the n_pairs-based normalization for mean pair method.

@trivialfis
Member

Feel free to suggest changes.

@jaguerrerod
Author

@trivialfis it seems the problem (fast learning due to the artificial inflation of gradients and hessians) persists with 3.1.0.0

1.7.8.1

# Dataset
dtb <- fread('noisy_data.csv')
setorder(dtb, qid)
train <- dtb[qid < 845]
hold <- dtb[qid >= 850]
# Evaluation metric
spearman_by_qid <- function(preds, dataset) {
  labels <- getinfo(dataset, 'label')
  qids <- attr(dataset, 'qid')  
  pred_dt <- data.table(label = labels, pred = preds, qid = qids)
  spe_dt <- pred_dt[, .(spearman = cor(label, pred, method = 'spearman')), by = .(qid)]
  spe <- mean(spe_dt$spearman)
  return(list(metric = 'spearman_by_qid', value = spe))
}
params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'gpu_hist',
  num_pairsample = 200,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  base_score = 0.5,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)
[1]	train-spearman_by_qid:0.043348	hold-spearman_by_qid:0.005994 
[26]	train-spearman_by_qid:0.123439	hold-spearman_by_qid:0.018658 
[51]	train-spearman_by_qid:0.137564	hold-spearman_by_qid:0.020584 
[76]	train-spearman_by_qid:0.144548	hold-spearman_by_qid:0.021188 
[101]	train-spearman_by_qid:0.149301	hold-spearman_by_qid:0.020846 
[126]	train-spearman_by_qid:0.152239	hold-spearman_by_qid:0.021507 
[151]	train-spearman_by_qid:0.154758	hold-spearman_by_qid:0.021513 
[176]	train-spearman_by_qid:0.156103	hold-spearman_by_qid:0.021470 
[201]	train-spearman_by_qid:0.157840	hold-spearman_by_qid:0.021557 
[226]	train-spearman_by_qid:0.159671	hold-spearman_by_qid:0.021662 
[251]	train-spearman_by_qid:0.161381	hold-spearman_by_qid:0.021932 
[276]	train-spearman_by_qid:0.163246	hold-spearman_by_qid:0.021966 
[301]	train-spearman_by_qid:0.165045	hold-spearman_by_qid:0.022225 
[326]	train-spearman_by_qid:0.167323	hold-spearman_by_qid:0.022248 
[351]	train-spearman_by_qid:0.169210	hold-spearman_by_qid:0.022335 
[376]	train-spearman_by_qid:0.170575	hold-spearman_by_qid:0.022385 
[401]	train-spearman_by_qid:0.172408	hold-spearman_by_qid:0.022630 
[426]	train-spearman_by_qid:0.173795	hold-spearman_by_qid:0.022727 
[451]	train-spearman_by_qid:0.175481	hold-spearman_by_qid:0.022811 
[476]	train-spearman_by_qid:0.176903	hold-spearman_by_qid:0.022933 
[501]	train-spearman_by_qid:0.178572	hold-spearman_by_qid:0.023106 
11.8  mins 
booster[0]
0:[f488<4] yes=1,no=2,missing=2,gain=51.7038651,cover=2255480
	1:[f42<3] yes=3,no=4,missing=4,gain=42.7140656,cover=1354297.75
		3:[f278<3] yes=7,no=8,missing=8,gain=75.9448242,cover=670654.125
			7:[f292<2] yes=15,no=16,missing=16,gain=23.0349464,cover=269448.219

3.1.0.0

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 200,
  lambdarank_normalization = FALSE,
  lambdarank_score_normalization = FALSE,
  base_score = 0.5,
  eta = 0.005, 
  max_leaves = 2^10,
  min_child_weight = 8000,
  nthread = 20,
  max_depth = 10,
  colsample_bytree = 0.1,
  subsample = 1,
  colsample_bynode = 1,
  gamma = 0,
  lambda = 1,
  alpha = 0)
[1]	train-spearman_by_qid:0.078018	hold-spearman_by_qid:0.010093 
[26]	train-spearman_by_qid:0.318855	hold-spearman_by_qid:0.015975 
[51]	train-spearman_by_qid:0.371434	hold-spearman_by_qid:0.017787 
[76]	train-spearman_by_qid:0.396481	hold-spearman_by_qid:0.018538 
[101]	train-spearman_by_qid:0.411969	hold-spearman_by_qid:0.019783 
[126]	train-spearman_by_qid:0.424177	hold-spearman_by_qid:0.019852 
[151]	train-spearman_by_qid:0.432930	hold-spearman_by_qid:0.020203 
[176]	train-spearman_by_qid:0.441516	hold-spearman_by_qid:0.020611 
[201]	train-spearman_by_qid:0.448603	hold-spearman_by_qid:0.021235 
[226]	train-spearman_by_qid:0.455518	hold-spearman_by_qid:0.021740 
[251]	train-spearman_by_qid:0.460993	hold-spearman_by_qid:0.021890 
[276]	train-spearman_by_qid:0.465959	hold-spearman_by_qid:0.022131 
[301]	train-spearman_by_qid:0.471070	hold-spearman_by_qid:0.022310 
[326]	train-spearman_by_qid:0.475097	hold-spearman_by_qid:0.022762 
[351]	train-spearman_by_qid:0.479642	hold-spearman_by_qid:0.023085 
[376]	train-spearman_by_qid:0.483829	hold-spearman_by_qid:0.023233 
[401]	train-spearman_by_qid:0.487830	hold-spearman_by_qid:0.023441 
[426]	train-spearman_by_qid:0.491379	hold-spearman_by_qid:0.023426 
[451]	train-spearman_by_qid:0.495067	hold-spearman_by_qid:0.023686 
[476]	train-spearman_by_qid:0.498842	hold-spearman_by_qid:0.023799 
[501]	train-spearman_by_qid:0.502463	hold-spearman_by_qid:0.023968 
14.52  mins
booster[0]
0:[V677<5] yes=1,no=2,missing=2,gain=18452.502,cover=451096000
	1:[V351<2] yes=3,no=4,missing=4,gain=11562.1973,cover=358153472
		3:[V56<5] yes=7,no=8,missing=8,gain=6683.22607,cover=80740000
			7:[V429<2] yes=15,no=16,missing=16,gain=4797.55762,cover=67125424

@trivialfis
Member

lambdarank_normalization needs to be true after the PR, as described previously.

@jaguerrerod
Author

jaguerrerod commented Mar 10, 2025

I didn't realize lambdarank_normalization must now be TRUE.
The problem is fixed. Thanks a lot!
I can't give you time references now, as I'm fitting a big model on the GPU right now.
I'll check it tomorrow.
EDIT:
3.1.0.0: time 15.33 min, 32.7 trees per min.
1.7.8.1: time 11.8 min, 42.5 trees per min.

[1]	train-spearman_by_qid:0.042527	hold-spearman_by_qid:0.008583 
[26]	train-spearman_by_qid:0.123254	hold-spearman_by_qid:0.018884 
[51]	train-spearman_by_qid:0.135860	hold-spearman_by_qid:0.020468 
[76]	train-spearman_by_qid:0.141668	hold-spearman_by_qid:0.020506 
[101]	train-spearman_by_qid:0.145516	hold-spearman_by_qid:0.021272 
[126]	train-spearman_by_qid:0.148338	hold-spearman_by_qid:0.021371 
[151]	train-spearman_by_qid:0.150906	hold-spearman_by_qid:0.021286 
[176]	train-spearman_by_qid:0.153391	hold-spearman_by_qid:0.021628 
[201]	train-spearman_by_qid:0.155721	hold-spearman_by_qid:0.021821 
[226]	train-spearman_by_qid:0.157931	hold-spearman_by_qid:0.022130 
[251]	train-spearman_by_qid:0.159943	hold-spearman_by_qid:0.022414 
[276]	train-spearman_by_qid:0.161713	hold-spearman_by_qid:0.022636 
[301]	train-spearman_by_qid:0.163331	hold-spearman_by_qid:0.022609 
[326]	train-spearman_by_qid:0.164823	hold-spearman_by_qid:0.022789 
[351]	train-spearman_by_qid:0.166643	hold-spearman_by_qid:0.022793 
[376]	train-spearman_by_qid:0.168541	hold-spearman_by_qid:0.022912 
[401]	train-spearman_by_qid:0.170151	hold-spearman_by_qid:0.022948 
[426]	train-spearman_by_qid:0.171735	hold-spearman_by_qid:0.022837 
[451]	train-spearman_by_qid:0.173516	hold-spearman_by_qid:0.022897 
[476]	train-spearman_by_qid:0.175324	hold-spearman_by_qid:0.022887 
[501]	train-spearman_by_qid:0.177249	hold-spearman_by_qid:0.022990
booster[0]
0:[V677<5] yes=1,no=2,missing=2,gain=92.2623367,cover=2255480
	1:[V351<2] yes=3,no=4,missing=4,gain=57.8109131,cover=1790767.38
		3:[V56<5] yes=7,no=8,missing=8,gain=33.4158859,cover=403699.969
			7:[V429<2] yes=15,no=16,missing=16,gain=23.9874687,cover=335627.094

@trivialfis
Member

Glad that it's fixed!
