
Conversation

@st81 (Contributor) commented Jul 25, 2025

What does this PR do?

The atol values were increased in #39412, but that change doesn't actually increase the difference between the original and exported models' key/value layers. The logits atol was also set too high, at 1e-5, when 1e-7 is sufficient.

After this change, the test still passes.

$ python3 -m pytest tests/utils/test_cache_utils.py  -k test_dynamic_cache_exportability
======================================================================================= test session starts ========================================================================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/shutotakahashi/projects/transformers-uv/transformers
configfile: pyproject.toml
collected 61 items / 59 deselected / 2 selected                                                                                                                                                    

tests/utils/test_cache_utils.py::CacheExportIntegrationTest::test_dynamic_cache_exportability PASSED                                                                                         [ 50%]
tests/utils/test_cache_utils.py::CacheExportIntegrationTest::test_dynamic_cache_exportability_multiple_run PASSED       
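
For reference, the check under discussion is a plain torch.allclose comparison between eager and exported outputs. Below is a minimal sketch of that pattern with made-up tensors (not the real test fixtures), just to illustrate what tightening atol means here.

import torch

# Simulated eager vs. exported logits with a tiny discrepancy; the PR tightens
# the logits check from atol=1e-5 to atol=1e-7 (rtol keeps torch.allclose's
# default of 1e-5 throughout).
logits_eager = torch.randn(1, 8, 32)
logits_exported = logits_eager + 1e-8 * torch.randn_like(logits_eager)

assert torch.allclose(logits_exported, logits_eager, atol=1e-7)
assert torch.allclose(logits_exported, logits_eager, atol=1e-5)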

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@st81 (Contributor, Author) commented Jul 25, 2025

It looks like the changed tests are failing in CI. I will investigate.

@st81 st81 marked this pull request as draft July 25, 2025 14:24
@zucchini-nlp (Member) commented:
Hey @st81, that is probably related to hardware; the diffs are actually higher when running the test on our runners :)

@st81 st81 force-pushed the reduce_atol_values_in_test_dynamic_cache_exportability branch from 78f16f6 to 32a1b02 on July 26, 2025 00:03
@st81 (Contributor, Author) commented Jul 26, 2025

@zucchini-nlp Thank you for your feedback! The change in #39412 only slightly increases the diff, so atol=1e-7 should still be sufficient, and the tests pass with this setting in CI. Could you review the changes?

Also, there's an unrelated test failure; the same test passed on the first commit, so it looks like it might be flaky.
Failed test: https://app.circleci.com/pipelines/github/huggingface/transformers/139547/workflows/e779ed32-f3c9-49ea-b850-8ff49e19986c/jobs/1848760/steps

FAILED tests/trainer/test_trainer.py::TrainerIntegrationWithHubTester::test_push_to_hub_with_revision - huggingface_hub.errors.BadRequestError: (Request ID: Root=1-68841c4d-227c4b7537d5ffb65b67ac7a;0d116056-5371-4b08-9a6a-4d21fd440c2a)

Bad request for commit endpoint:
Your push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - model.safetensors - training_args.bin

This test passed on the first commit: https://app.circleci.com/pipelines/github/huggingface/transformers/139418/workflows/4fa288a0-deab-4516-9c1e-a4b2eb655880/jobs/1847114/steps

[gw0] [ 29%] PASSED tests/trainer/test_trainer.py::TrainerIntegrationWithHubTester::test_push_to_hub_with_revision 

@st81 st81 marked this pull request as ready for review July 26, 2025 01:11
@github-actions github-actions bot requested a review from ydshieh July 26, 2025 01:12
      use_cache=True,
  )
- self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-5))
+ self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-7))
Collaborator commented:
Running on the CI runners, it passes.

I would like to hear from @sywangyi why the atol was increased in #39412; maybe some other test cases need this larger atol.

Contributor commented:

Export creates the attention mask as fake tensors, so we fall into a different path in SDPA attention: if attention_mask is None, GQA is enabled inside SDPA; if it is not None, the KV heads are repeated outside. See the logic in #39412.
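
To make that branching concrete, here is a simplified sketch of the two paths (not the actual transformers code; enable_gqa assumes PyTorch >= 2.5, and the shapes and names are illustrative):

import torch
import torch.nn.functional as F

def sdpa_paths_sketch(query, key, value, attention_mask=None):
    # query: (B, H_q, L, D); key/value: (B, H_kv, L, D), with H_q divisible by H_kv.
    if attention_mask is None:
        # Mask-free path: the SDPA kernel expands the KV heads internally (GQA).
        return F.scaled_dot_product_attention(query, key, value, enable_gqa=True)
    # Masked path: materialize the KV repetitions outside before calling SDPA.
    n_rep = query.shape[1] // key.shape[1]
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)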

Collaborator commented:

But did you observe that a larger atol was necessary when you worked on that PR? I am testing on our CI runners and it works with 1e-7. I am just wondering what the motivation was for increasing it to 1e-5 (I saw from the commit history that you even changed it to 1e-3 at one point).

@st81 I personally would not be bothered by keeping it at 1e-5, however.

Contributor (Author) commented:

@ydshieh Thank you for the feedback! I agree that keeping it at 1e-5 wouldn't be problematic.

The main motivation for reducing it to 1e-7 is to make the test more sensitive to future changes - this way we can quickly detect if any modifications increase the difference between the original and exported models.

I should also mention that the tests were passing locally on my machine even without any atol changes (except logits), which initially led me to believe that PR #39412 introduced minimal differences in the exported model. (I discovered the hardware-dependent differences after creating this PR and running it on CI runners.)

So if you prefer to keep it at 1e-5 for stability across different hardware environments, I'm completely fine with that approach as well.

Collaborator commented:

Understood. Just a relevant thread, if you are interested in reading:

pytorch/pytorch#157274 (comment)

That is one reason I might keep it at 1e-5 for now.

Contributor commented:

Possibly relevant: #39412 (comment)

The difference comes from the SDPA path either taking enable_gqa (the internal SDPA kernel) or manually materializing the KV repetitions. The diffs are understandably very low, given it's materializing vs. the kernel.

IIRC, I also only needed atol=1e-7 (not touching rtol). I'm not sure the fp precision issue is really the same here, but I find both options reasonable (1e-5 as a cushion or 1e-7 to be strict).
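
A quick way to see how small that materializing-vs-kernel gap typically is (a sketch assuming PyTorch >= 2.5 for enable_gqa; the exact numbers are hardware- and backend-dependent, and the shapes are arbitrary rather than the test's):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H_q, H_kv, L, D = 1, 8, 2, 16, 64
q = torch.randn(B, H_q, L, D)
k = torch.randn(B, H_kv, L, D)
v = torch.randn(B, H_kv, L, D)

# Path 1: GQA expansion handled inside the SDPA kernel.
out_kernel = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)

# Path 2: KV heads repeated ("materialized") by hand before the call.
n_rep = H_q // H_kv
out_materialized = F.scaled_dot_product_attention(
    q, k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)
)

print("max abs diff:", (out_kernel - out_materialized).abs().max().item())
print("allclose at atol=1e-7:", torch.allclose(out_kernel, out_materialized, atol=1e-7))
print("allclose at atol=1e-5:", torch.allclose(out_kernel, out_materialized, atol=1e-5))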

@ydshieh (Collaborator) commented Aug 1, 2025

After the discussion, let's keep 1e-5 though.

Thank you nonetheless, @st81, for taking care of the details in the codebase.

@ydshieh ydshieh closed this Aug 1, 2025