10 changes: 5 additions & 5 deletions tests/utils/test_cache_utils.py
@@ -643,10 +643,10 @@ def test_dynamic_cache_exportability(self):
         past_key_values=past_key_values_eager,
         use_cache=True,
     )
-    self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-5))
+    self.assertTrue(torch.allclose(res.logits, res_eager.logits, atol=1e-7))
Collaborator:
Running this on CI runners, it passes.

I would like to hear from @sywangyi why the atol was increased in #39412; maybe some other test cases need this larger atol.

Contributor:

Export creates the attention mask as fake tensors, so we fall into a different path in SDPA attention: if attention_mask is None, GQA is enabled inside SDPA; if it is not None, the kv heads are repeated outside. See the logic in #39412.

Collaborator:

But did you observe that a larger atol was necessary when you worked on that PR? I am testing on our CI runners and it works with 1e-7. I am just wondering what the motivation was for increasing it to 1e-5 (I saw you even changed it to 1e-3 once, from the commit history).

@st81 I personally would not be bothered by keeping it at 1e-5, however.

Contributor Author:

@ydshieh Thank you for the feedback! I agree that keeping it at 1e-5 wouldn't be problematic.

The main motivation for reducing it to 1e-7 is to make the test more sensitive to future changes - this way we can quickly detect if any modifications increase the difference between the original and exported models.

I should also mention that the tests were passing locally on my machine even without any atol changes (except logits), which initially led me to believe that PR #39412 introduced minimal differences in the exported model. (I discovered the hardware-dependent differences after creating this PR and running it on CI runners.)

So if you prefer to keep it at 1e-5 for stability across different hardware environments, I'm completely fine with that approach as well.

Collaborator:

Understood. Just a relevant thread, if you are interested in reading:

pytorch/pytorch#157274 (comment)

That is one reason I might keep it at 1e-5 for now.

Contributor:

Possibly relevant: #39412 (comment)

The difference comes from the SDPA path either taking the enable_gqa route (internal SDPA kernel) or manually materializing the kv repetitions. The diffs are understandably very low, since it is materializing vs. the kernel.

IIRC, I also only needed atol=1e-7 (not touching rtol). I am not sure the fp precision issue really is the same here, but I find both options reasonable (1e-5 as cushion, or 1e-7 to be strict).
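A side note on what these tolerances actually check: `torch.allclose(a, b)` passes when `|a - b| <= atol + rtol * |b|` elementwise, and `rtol` defaults to 1e-5. So with the default `rtol` untouched, tightening `atol` mostly affects values near zero (such as attention outputs that are close to zero). A small illustration with made-up values:

```python
import torch

# Two tensors differing by 1e-6 in each element; one element sits near zero
x = torch.tensor([1.0, 0.0])
y = torch.tensor([1.0 + 1e-6, 1e-6])

# allclose passes when |x - y| <= atol + rtol * |y|, with rtol defaulting to 1e-5.
# For the element near 1.0, the rtol term (~1e-5) absorbs the 1e-6 difference
# regardless of atol. For the element near 0.0, only atol matters.
print(torch.allclose(x, y, atol=1e-5))  # True: atol covers the near-zero element
print(torch.allclose(x, y, atol=1e-7))  # False: the near-zero element exceeds atol
```

This is why a change from atol=1e-5 to atol=1e-7, with rtol left at its default, primarily tightens the comparison for near-zero entries of the logits and kv caches.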

     for l1, l2 in zip(res.past_key_values.layers, res_eager.past_key_values.layers):
-        self.assertTrue(torch.allclose(l1.keys, l2.keys, atol=1e-5))
-        self.assertTrue(torch.allclose(l1.values, l2.values, atol=1e-5))
+        self.assertTrue(torch.allclose(l1.keys, l2.keys, atol=1e-7))
+        self.assertTrue(torch.allclose(l1.values, l2.values, atol=1e-7))

def test_dynamic_cache_exportability_multiple_run(self):
# When exporting with DynamicCache, you should export two graphs:
@@ -739,8 +739,8 @@ def test_dynamic_cache_exportability_multiple_run(self):
)

     for l1, l2 in zip(res_export_2.past_key_values.layers, res_eager_2.past_key_values.layers):
-        self.assertTrue(torch.allclose(l1.keys, l2.keys, atol=1e-5))
-        self.assertTrue(torch.allclose(l1.values, l2.values, atol=1e-5))
+        self.assertTrue(torch.allclose(l1.keys, l2.keys, atol=1e-7))
+        self.assertTrue(torch.allclose(l1.values, l2.values, atol=1e-7))

@unittest.skip("Runs on my machine locally, passed, no idea why it does not online")
def test_static_cache_exportability(self):