Threshold fix wvSplitk for occasional CI fails#34013
Threshold fix wvSplitk for occasional CI fails#34013tjtanaa merged 3 commits intovllm-project:mainfrom
Conversation
…er init, which were casuing fails depending on rand seed. Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
There was a problem hiding this comment.
Code Review
This pull request increases the tolerance in a numerical test for the wvSplitk FP8 kernel on ROCm to fix CI failures. While this may resolve the flakiness, the new relative tolerance of 5% is quite high and could potentially mask future numerical regressions. I've added a comment to highlight this concern and suggest improvements.
| if xnorm: | ||
| assert torch.allclose(out, ref_out, atol=1e-3, rtol=1e-8) | ||
| elif k >= 32 * 1024: | ||
| assert torch.allclose(out, ref_out, 0.05) # wider thresh for large-K & no xnorm |
There was a problem hiding this comment.
A relative tolerance of 5% (rtol=0.05) is very high for a numerical test and may hide future regressions in the wvSplitKQ kernel. This significantly weakens the test's ability to catch numerical precision issues.
If this high tolerance is unavoidable due to the nature of FP8 arithmetic over large dimensions, please consider adding a more detailed comment explaining the necessity of this value, for example by referencing the specific conditions and CI failures that prompted this change. This will be very helpful for future maintenance.
For clarity, it is also better to use the keyword argument rtol explicitly.
| assert torch.allclose(out, ref_out, 0.05) # wider thresh for large-K & no xnorm | |
| assert torch.allclose(out, ref_out, rtol=0.05) # wider thresh for large-K & no xnorm |
|
@amd-hhashemi can you share your test results? I triggered it. The tests seems to still fail https://buildkite.com/vllm/amd-ci/builds/4355/steps/canvas?jid=019c35ff-16ab-4fab-80e6-16a065daa2c9&tab=output |
Interesting. I noticed that that CI is running on a mi325, while I've been running tests on a mi350. |
| assert torch.allclose(out, ref_out, atol=1e-3, rtol=1e-8) | ||
| elif k >= 32 * 1024: | ||
| # wider pytrch thresh for large-K & no xnorm | ||
| assert torch.allclose(out, ref_out, atol=0.07, rtol=5e-2) |
There was a problem hiding this comment.
@amd-hhashemi Can you please use testing.torch.assert_close instead?
656f211 to
df82839
Compare
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
df82839 to
a2bfdc5
Compare
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com> Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.