[BUG] Incorrect scores for evaluation #746

Open
haraldvschweiger opened this issue Sep 20, 2023 · 0 comments
Labels
bug Something isn't working status/needs-triage

Bug Description

A significant issue appears to affect XLNet trained with CLM, and potentially other models. When using the trainer's
evaluation method, the NDCG and MRR scores are near-perfect after just one training epoch.
Inspecting the evaluation process suggests that the model is able to predict the held-out item_id,
most likely because of information leakage.

This bug affects the trainer.evaluate method and, consequently, every evaluation run triggered by eval_steps during training,
so the automatic best-model saving procedure selects checkpoints based on incorrect scores.

Steps/Code to Reproduce the Bug

To replicate this issue, you can refer to the code provided here,
which is based on the Yoochoose e-commerce dataset example.

In 01-ETL-with-NVTabular,
the dataset is randomly split into a training and a validation set. The validation set is then duplicated and
transformed into a test set that contains the same entries as the validation set, but with the last item removed from each sequence.
The transformation has been simplified since the item_ids are the only input feature for the transformer being trained.
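
The split and truncation described above can be sketched in plain Python. This is only an illustration, not the NVTabular workflow from the notebook: the function names `split_sessions` and `make_test_set`, the `valid_fraction` parameter, and the representation of sessions as lists of item_ids are all assumptions made for the example.

```python
import random


def split_sessions(sessions, valid_fraction=0.2, seed=0):
    """Randomly split a list of item_id sequences into (train, validation)."""
    rng = random.Random(seed)
    shuffled = list(sessions)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def make_test_set(valid_sessions):
    """Copy the validation sessions and drop the last item of each one.

    Returns the truncated sequences together with the removed last items,
    so the removed items can later serve as ground truth. Sessions with
    fewer than two items are skipped because nothing would remain as input.
    """
    inputs, targets = [], []
    for seq in valid_sessions:
        if len(seq) < 2:
            continue
        inputs.append(list(seq[:-1]))
        targets.append(seq[-1])
    return inputs, targets
```

Keeping the removed items alongside the truncated sequences makes it possible to score predictions on the test set independently of the trainer's built-in evaluation.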

In 02-End-to-End-Session-Based-with-Evaluation,
an XLNet model is trained for a next-item prediction task.
According to your PR,
the last item in the sequence is the one to be predicted for evaluation.
After training and running the evaluation method, the results show exceptionally high accuracy scores (MRR > 0.95).

To rule out the possibility that the validation scores are inflated because of similar or identical entries,
the trainer class is used to make predictions on the test set (which, as previously stated, contains the same sequences as the validation set with the last item removed).
Calculating the MRR from these predictions gives a more plausible score of MRR ≈ 0.2.
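
For reference, a minimal sketch of how MRR and NDCG can be computed by hand from ranked predictions and the removed last items. It assumes each prediction is a list of item_ids ordered from most to least likely and that there is exactly one relevant item per sequence; the function names and the cutoff `k=20` are hypothetical and are not taken from the Transformers4Rec API.

```python
import math


def mrr_at_k(predictions, targets, k=20):
    """Mean reciprocal rank of the target within the top-k predictions."""
    scores = []
    for ranked, target in zip(predictions, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            scores.append(1.0 / (top_k.index(target) + 1))
        else:
            scores.append(0.0)  # target missing from top-k contributes zero
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(predictions, targets, k=20):
    """NDCG with a single relevant item per sequence.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) when the target is in the top-k, else 0.
    """
    scores = []
    for ranked, target in zip(predictions, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            rank = top_k.index(target) + 1
            scores.append(1.0 / math.log2(rank + 1))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Comparing these hand-computed values with the scores reported by the trainer's evaluation on the same sequences is one way to check whether the two disagree.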

Environment Details

This bug persists across different versions and has been observed in the following environment:

  • Transformers4Rec version: 23.8.0
  • Platform: Ubuntu 22
  • Python version: 3.10
  • Hugging Face Transformers version: 4.28
  • PyTorch version (GPU): 2.0.1+cu118
  • TensorFlow version (GPU): N/A

Additional Context

This issue may well be caused by a coding mistake on my part.
However, if it turns out to be a genuine bug, I recommend treating it as high priority.
Thank you for your amazing support!
