Reduce atol values in test_dynamic_cache_exportability #39667
Closed
st81 wants to merge 2 commits into huggingface:main from st81:reduce_atol_values_in_test_dynamic_cache_exportability
I tried running it on our CI runners, and it passes. I would like to hear from @sywangyi why the atol was increased in #39412; maybe some other test cases need the larger atol.
Export creates the attention mask as fake tensors, and thus we fall into a different path in SDPA attention: if attention_mask is None, GQA is enabled inside SDPA; if it is not None, the KV heads are repeated outside the kernel. See the logic in #39412.
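For context, a minimal sketch of the branching described above. This is an illustration under assumptions, not the actual transformers attention code: `repeat_kv` here is a stand-in for the library helper, and `enable_gqa` requires a recent PyTorch.

```python
import torch
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_kv_heads, seq, head_dim) -> (batch, num_kv_heads * n_rep, seq, head_dim)
    b, h, s, d = x.shape
    return x[:, :, None].expand(b, h, n_rep, s, d).reshape(b, h * n_rep, s, d)

def sdpa_attention(q, k, v, attention_mask=None):
    n_rep = q.shape[1] // k.shape[1]
    if attention_mask is None:
        # no mask: let the fused SDPA kernel handle the grouped-query heads itself
        return F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
    # with a mask: materialize the repeated KV heads first, then call SDPA
    k, v = repeat_kv(k, n_rep), repeat_kv(v, n_rep)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
```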
But did you observe that the larger atol was necessary when you worked on that PR? I am testing on our CI runners and it works with 1e-7. I am just wondering what the motivation for increasing it to 1e-5 was (from the commit history I saw you even changed it to 1e-3 at one point).
@st81 I personally would not be bothered by keeping it at 1e-5, however.
@ydshieh Thank you for the feedback! I agree that keeping it at 1e-5 wouldn't be problematic.
The main motivation for reducing it to 1e-7 is to make the test more sensitive to future changes: this way we can quickly detect whether any modification increases the difference between the original and exported models.
I should also mention that the tests were passing locally on my machine even without any atol changes (except logits), which initially led me to believe that PR #39412 introduced minimal differences in the exported model. (I discovered the hardware-dependent differences after creating this PR and running it on CI runners.)
So if you prefer to keep it at 1e-5 for stability across different hardware environments, I'm completely fine with that approach as well.
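For illustration only, a rough sketch of the kind of eager-vs-exported comparison being tightened; the function and argument names (and the rtol value) are assumptions, not the actual test code.

```python
import torch

def check_export_matches_eager(model, example_args, atol=1e-7, rtol=1e-5):
    # Export the model, run both the eager and exported versions on the same
    # inputs, and require the outputs to agree within a tight tolerance.
    exported = torch.export.export(model, example_args)
    eager_out = model(*example_args)
    exported_out = exported.module()(*example_args)
    torch.testing.assert_close(exported_out, eager_out, atol=atol, rtol=rtol)
```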
Understood. Just a relevant thread if you are interested in reading it:
pytorch/pytorch#157274 (comment)
That is one reason I might keep it at 1e-5 for now.
Possibly relevant: #39412 (comment)
The difference comes from the SDPA path either taking enable_gqa (the internal SDPA kernel) or manually materializing the KV repetitions. The diffs are understandably very low, since it is materializing vs. the kernel. IIRC, I also only needed atol=1e-7 (not touching rtol). I'm not sure whether the fp precision issue is really the same here, but I find both options reasonable (1e-5 as a cushion or 1e-7 to be strict).
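For anyone curious, a rough standalone illustration of the two paths being compared (arbitrary shapes, assuming a PyTorch version with enable_gqa support); the residual difference is just floating-point rounding noise.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, n_heads, n_kv, seq, dim = 1, 8, 2, 16, 64
q = torch.randn(b, n_heads, seq, dim)
k = torch.randn(b, n_kv, seq, dim)
v = torch.randn(b, n_kv, seq, dim)

# Path 1: GQA handled inside the fused SDPA kernel
fused = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)

# Path 2: materialize the KV repetitions, then call SDPA normally
n_rep = n_heads // n_kv
materialized = F.scaled_dot_product_attention(
    q, k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)
)

print((fused - materialized).abs().max())  # tiny, on the order of fp rounding error
```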