
Conversation

@st81 (Contributor) commented Jul 25, 2025

What does this PR do?

The atol values were increased in #39412, but that change doesn't actually increase the difference between the original and exported models' key/value layers. The logits atol was also set too high, at 1e-5, when 1e-7 is sufficient.

After this change, the test still passes.

$ python3 -m pytest tests/utils/test_cache_utils.py  -k test_dynamic_cache_exportability
======================================================================================= test session starts ========================================================================================
platform linux -- Python 3.12.11, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/shutotakahashi/projects/transformers-uv/transformers
configfile: pyproject.toml
collected 61 items / 59 deselected / 2 selected                                                                                                                                                    

tests/utils/test_cache_utils.py::CacheExportIntegrationTest::test_dynamic_cache_exportability PASSED                                                                                         [ 50%]
tests/utils/test_cache_utils.py::CacheExportIntegrationTest::test_dynamic_cache_exportability_multiple_run PASSED       
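
For reference, the check under discussion is a plain torch.allclose comparison between eager and exported outputs. Below is a minimal sketch of that pattern with made-up tensors (not the real test fixtures), just to illustrate what tightening atol means here.

import torch

# Simulated eager vs. exported logits with a tiny discrepancy; the PR tightens
# the logits check from atol=1e-5 to atol=1e-7 (rtol keeps torch.allclose's
# default of 1e-5 throughout).
logits_eager = torch.randn(1, 8, 32)
logits_exported = logits_eager + 1e-8 * torch.randn_like(logits_eager)

assert torch.allclose(logits_exported, logits_eager, atol=1e-7)
assert torch.allclose(logits_exported, logits_eager, atol=1e-5)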

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@st81 (Contributor, Author) commented Jul 25, 2025

It looks like the changed tests are failing in CI. I will investigate.

@st81 st81 marked this pull request as draft July 25, 2025 14:24
@zucchini-nlp (Member) commented:
Hey @st81, that is probably related to hardware; the diffs are actually higher when running the test on our runners :)

@st81 st81 force-pushed the reduce_atol_values_in_test_dynamic_cache_exportability branch from 78f16f6 to 32a1b02 on July 26, 2025 00:03
@st81 (Contributor, Author) commented Jul 26, 2025

@zucchini-nlp Thank you for your feedback! The change in #39412 only slightly increases the diff, so atol=1e-7 should still be sufficient, and the tests pass with this setting in CI. Could you review the changes?

Also, there's an unrelated test failure; the same test passed on the first commit, so it looks like it might be flaky.
Failed test: https://app.circleci.com/pipelines/github/huggingface/transformers/139547/workflows/e779ed32-f3c9-49ea-b850-8ff49e19986c/jobs/1848760/steps

FAILED tests/trainer/test_trainer.py::TrainerIntegrationWithHubTester::test_push_to_hub_with_revision - huggingface_hub.errors.BadRequestError: (Request ID: Root=1-68841c4d-227c4b7537d5ffb65b67ac7a;0d116056-5371-4b08-9a6a-4d21fd440c2a)

Bad request for commit endpoint:
Your push was rejected because an LFS pointer pointed to a file that does not exist. For instance, this can happen if you used git push --no-verify to push your changes. Offending file: - model.safetensors - training_args.bin

This test passed on the first commit: https://app.circleci.com/pipelines/github/huggingface/transformers/139418/workflows/4fa288a0-deab-4516-9c1e-a4b2eb655880/jobs/1847114/steps

[gw0] [ 29%] PASSED tests/trainer/test_trainer.py::TrainerIntegrationWithHubTester::test_push_to_hub_with_revision 

@st81 st81 marked this pull request as ready for review July 26, 2025 01:11
@github-actions github-actions bot requested a review from ydshieh July 26, 2025 01:12
      use_cache=True,
  )
- self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-5))
+ self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-7))
Collaborator commented:
Running on the CI runners, it passes.

I would like to hear from @sywangyi why the atol was increased in #39412; maybe some other test cases need this larger atol.

Contributor commented:

Export creates the attention mask as fake tensors, so we fall into a different path in SDPA attention: if attention_mask is None, GQA is enabled inside SDPA; if it is not None, the KV heads are repeated outside. See the logic in #39412.
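
To make that branching concrete, here is a simplified sketch of the two paths (not the actual transformers code; enable_gqa assumes PyTorch >= 2.5, and the shapes and names are illustrative):

import torch
import torch.nn.functional as F

def sdpa_paths_sketch(query, key, value, attention_mask=None):
    # query: (B, H_q, L, D); key/value: (B, H_kv, L, D), with H_q divisible by H_kv.
    if attention_mask is None:
        # Mask-free path: the SDPA kernel expands the KV heads internally (GQA).
        return F.scaled_dot_product_attention(query, key, value, enable_gqa=True)
    # Masked path: materialize the KV repetitions outside before calling SDPA.
    n_rep = query.shape[1] // key.shape[1]
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)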

Collaborator commented:

But did you observe that a larger atol was necessary when you worked on that PR? I am testing on our CI runners and it works with 1e-7. I am just wondering what the motivation was for increasing it to 1e-5 (I saw from the commit history that you even changed it to 1e-3 at one point).

@st81 I personally would not be bothered by keeping it at 1e-5, however.

Contributor (Author) commented:

@ydshieh Thank you for the feedback! I agree that keeping it at 1e-5 wouldn't be problematic.

The main motivation for reducing it to 1e-7 is to make the test more sensitive to future changes - this way we can quickly detect if any modifications increase the difference between the original and exported models.

I should also mention that the tests were passing locally on my machine even without any atol changes (except logits), which initially led me to believe that PR #39412 introduced minimal differences in the exported model. (I discovered the hardware-dependent differences after creating this PR and running it on CI runners.)

So if you prefer to keep it at 1e-5 for stability across different hardware environments, I'm completely fine with that approach as well.

Collaborator commented:

Understood. Just a relevant thread, if you are interested in reading:

pytorch/pytorch#157274 (comment)

That is one reason I might keep it at 1e-5 for now.

Contributor commented:

Possibly relevant: #39412 (comment)

The difference comes from the SDPA path either taking enable_gqa (the internal SDPA kernel) or manually materializing the KV repetitions. The diffs are understandably very low, given it's materializing vs. the kernel.

IIRC, I also only needed atol=1e-7 (not touching rtol). I'm not sure the fp precision issue is really the same here, but I find both options reasonable (1e-5 as a cushion or 1e-7 to be strict).
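
A quick way to see how small that materializing-vs-kernel gap typically is (a sketch assuming PyTorch >= 2.5 for enable_gqa; the exact numbers are hardware- and backend-dependent, and the shapes are arbitrary rather than the test's):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H_q, H_kv, L, D = 1, 8, 2, 16, 64
q = torch.randn(B, H_q, L, D)
k = torch.randn(B, H_kv, L, D)
v = torch.randn(B, H_kv, L, D)

# Path 1: GQA expansion handled inside the SDPA kernel.
out_kernel = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)

# Path 2: KV heads repeated ("materialized") by hand before the call.
n_rep = H_q // H_kv
out_materialized = F.scaled_dot_product_attention(
    q, k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)
)

print("max abs diff:", (out_kernel - out_materialized).abs().max().item())
print("allclose at atol=1e-7:", torch.allclose(out_kernel, out_materialized, atol=1e-7))
print("allclose at atol=1e-5:", torch.allclose(out_kernel, out_materialized, atol=1e-5))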

@ydshieh (Collaborator) commented Aug 1, 2025

After the discussion, let's keep 1e-5 though.

Thank you nonetheless, @st81, for taking care of the details in the codebase.

@ydshieh ydshieh closed this Aug 1, 2025