Reduce atol values in test_dynamic_cache_exportability #39667
Conversation
It looks like the changed tests are failing in CI. I will investigate.
Hey @st81, that is probably related to hardware and the diffs are actually higher when running the test on our runners :)
Force-pushed from 78f16f6 to 32a1b02
@zucchini-nlp Thank you for your feedback! The change in #39412 slightly increases the diff, so I've adjusted the atol values accordingly.

Also, there's an unrelated test failure that passed in the first commit; it looks like it might be flaky:

```
FAILED tests/trainer/test_trainer.py::TrainerIntegrationWithHubTester::test_push_to_hub_with_revision - huggingface_hub.errors.BadRequestError: (Request ID: Root=1-68841c4d-227c4b7537d5ffb65b67ac7a;0d116056-5371-4b08-9a6a-4d21fd440c2a)
Bad request for commit endpoint:
Your push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file:
- model.safetensors
- training_args.bin
```

This test passed in the first commit: https://app.circleci.com/pipelines/github/huggingface/transformers/139418/workflows/4fa288a0-deab-4516-9c1e-a4b2eb655880/jobs/1847114/steps

```
[gw0] [ 29%] PASSED tests/trainer/test_trainer.py::TrainerIntegrationWithHubTester::test_push_to_hub_with_revision
```
```diff
             use_cache=True,
         )
-        self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-5))
+        self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-7))
```
Export creates the attention mask as fake tensors, so we fall into different paths in SDPA attention: if attention_mask is None, GQA is enabled inside SDPA; if it is not None, the KV heads are repeated outside. See the logic in #39412.
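(Added for context, not part of the original thread: a minimal Python sketch of the two paths described above. The function name and group count are made up for illustration, and the `enable_gqa` flag assumes a recent PyTorch release that supports GQA in SDPA.)

```python
import torch.nn.functional as F

def sdpa_two_paths(q, k, v, attention_mask=None, num_kv_groups=4):
    """Illustrative only: q has num_kv_groups times as many heads as k/v."""
    if attention_mask is None:
        # Path 1: let the fused SDPA kernel expand the grouped KV heads internally.
        return F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
    # Path 2: materialize the KV repetition before the call, then pass the mask.
    # The tiny numerical diffs come from this materialized path vs. the kernel path.
    k = k.repeat_interleave(num_kv_groups, dim=1)
    v = v.repeat_interleave(num_kv_groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
```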
But did you observe that a larger atol was necessary when you worked on that PR? I am testing on our CI runners and it works with 1e-7. I am just wondering what the motivation was for increasing it to 1e-5 (I saw in the commit history that you even changed it to 1e-3 at one point).
@st81 I personally would not be bothered by keeping it at 1e-5, however.
@ydshieh Thank you for the feedback! I agree that keeping it at 1e-5 wouldn't be problematic.
The main motivation for reducing it to 1e-7 is to make the test more sensitive to future changes - this way we can quickly detect if any modifications increase the difference between the original and exported models.
I should also mention that the tests were passing locally on my machine even without any atol changes (except logits), which initially led me to believe that PR #39412 introduced minimal differences in the exported model. (I discovered the hardware-dependent differences after creating this PR and running it on CI runners.)
So if you prefer to keep it at 1e-5 for stability across different hardware environments, I'm completely fine with that approach as well.
Understood. Just a relevant thread if you are interested in reading: pytorch/pytorch#157274 (comment). That is one reason I might keep it at 1e-5 for now.
Possibly relevant #39412 (comment)
The difference comes from the sdpa path either taking the enable_gqa (internal sdpa kernel) or manually materializing the kv repetitions. The diffs are understandably very low due to materializing vs kernel.
IIRC, I also only needed atol=1e-7 (not touching rtol). Not sure if the fp precision issue is really the same here, but I find both options reasonable (1e-5 as a cushion or 1e-7 to be strict).
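(A small illustration I'm adding, not code from the PR: `torch.allclose` passes elementwise when |actual - expected| <= atol + rtol * |expected|; rtol is zeroed below to isolate the effect of the atol choice on a small hardware-dependent drift.)

```python
import torch

expected = torch.tensor([1.0, -2.0, 0.5])
actual = expected + 5e-6  # pretend hardware-dependent drift

# With rtol=0, only the absolute tolerance decides the outcome.
print(torch.allclose(actual, expected, rtol=0, atol=1e-5))  # True: the cushion absorbs the drift
print(torch.allclose(actual, expected, rtol=0, atol=1e-7))  # False: the strict tolerance flags it
```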
After the discussion, let's keep 1e-5 though. Thank you @st81 nonetheless for taking care of the details in the codebase.
What does this PR do?
The `atol` values were increased in #39412, but that change doesn't actually increase the difference between the original and exported model key/value layers. The logits `atol` was also set too high at 1e-5 when 1e-7 is sufficient. After this change, the test still passes:
```
$ python3 -m pytest tests/utils/test_cache_utils.py -k test_dynamic_cache_exportability
=========================================== test session starts ===========================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/shutotakahashi/projects/transformers-uv/transformers
configfile: pyproject.toml
collected 61 items / 59 deselected / 2 selected

tests/utils/test_cache_utils.py::CacheExportIntegrationTest::test_dynamic_cache_exportability PASSED [ 50%]
tests/utils/test_cache_utils.py::CacheExportIntegrationTest::test_dynamic_cache_exportability_multiple_run PASSED
```