
Threshold fix wvSplitk for occasional CI fails #34013

Merged
tjtanaa merged 3 commits into vllm-project:main from amd-hhashemi:wvSplitKthrshFix
Feb 11, 2026

Conversation

@amd-hhashemi
Contributor

amd-hhashemi commented Feb 6, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

amd-hhashemi and others added 2 commits February 6, 2026 19:04
…er init,

which were causing fails depending on rand seed.

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@amd-hhashemi amd-hhashemi requested a review from tjtanaa as a code owner February 6, 2026 19:12
@mergify mergify bot added the rocm Related to AMD ROCm label Feb 6, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 6, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request increases the tolerance in a numerical test for the wvSplitk FP8 kernel on ROCm to fix CI failures. While this may resolve the flakiness, the new relative tolerance of 5% is quite high and could potentially mask future numerical regressions. I've added a comment to highlight this concern and suggest improvements.

if xnorm:
    assert torch.allclose(out, ref_out, atol=1e-3, rtol=1e-8)
elif k >= 32 * 1024:
    assert torch.allclose(out, ref_out, 0.05)  # wider thresh for large-K & no xnorm
Contributor


Severity: high

A relative tolerance of 5% (rtol=0.05) is very high for a numerical test and may hide future regressions in the wvSplitKQ kernel. This significantly weakens the test's ability to catch numerical precision issues.

If this high tolerance is unavoidable due to the nature of FP8 arithmetic over large dimensions, please consider adding a more detailed comment explaining the necessity of this value, for example by referencing the specific conditions and CI failures that prompted this change. This will be very helpful for future maintenance.

For clarity, it is also better to use the keyword argument rtol explicitly.

Suggested change
-    assert torch.allclose(out, ref_out, 0.05)  # wider thresh for large-K & no xnorm
+    assert torch.allclose(out, ref_out, rtol=0.05)  # wider thresh for large-K & no xnorm
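The reviewer's point can be illustrated with the elementwise check torch.allclose documents, |out − ref| ≤ atol + rtol·|ref|, and its signature (input, other, rtol=1e-05, atol=1e-08, ...). The sketch below is a pure-Python stand-in, not vLLM or PyTorch code; the names `allclose`, `out`, and `ref` are illustrative:

```python
# Pure-Python stand-in for torch.allclose, which checks elementwise:
#   |out - ref| <= atol + rtol * |ref|
# torch.allclose's signature is (input, other, rtol=1e-05, atol=1e-08, ...),
# so a bare third argument such as 0.05 binds to rtol; naming it makes that clear.

def allclose(out, ref, rtol=1e-05, atol=1e-08):
    """Illustrative mimic of torch.allclose on flat lists of floats."""
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))

ref = [100.0, -50.0, 2.0]
out = [104.0, -52.0, 2.08]  # roughly 4% relative error in each element

print(allclose(out, ref))        # False under the default rtol=1e-05
print(allclose(out, ref, 0.05))  # True: the positional 0.05 is rtol
```

A 4% error passes under rtol=0.05 but fails under the default, which is why spelling out `rtol=` in the assert documents exactly which knob was loosened.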

@tjtanaa
Collaborator

tjtanaa commented Feb 7, 2026

@amd-hhashemi can you share your test results?

I triggered it. The tests seem to still fail https://buildkite.com/vllm/amd-ci/builds/4355/steps/canvas?jid=019c35ff-16ab-4fab-80e6-16a065daa2c9&tab=output

@amd-hhashemi
Contributor Author

amd-hhashemi commented Feb 7, 2026

@amd-hhashemi can you share your test results?

I triggered it. The tests seem to still fail https://buildkite.com/vllm/amd-ci/builds/4355/steps/canvas?jid=019c35ff-16ab-4fab-80e6-16a065daa2c9&tab=output

Interesting. I noticed that CI is running on an MI325, while I've been running tests on an MI350.
I snagged an MI325 just now, and sure enough those new big-K tests with no xnorm still fail there.
It looks like with these big Ks on the MI325 we need to set exactly the same threshold PyTorch uses (as loose as that seems). No idea how the MI350 gets by with a tighter threshold. MI325 logs from before and after the last check-in are attached (again, the MI350 runs were all passing before and after).
test_rocm_skinny_gemms_mi325_before.log
test_rocm_skinny_gemms_mi325_after.log
test_rocm_skinny_gemms_mi350_before.log

    assert torch.allclose(out, ref_out, atol=1e-3, rtol=1e-8)
elif k >= 32 * 1024:
    # wider pytorch thresh for large-K & no xnorm
    assert torch.allclose(out, ref_out, atol=0.07, rtol=5e-2)
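For context on how loose the new thresholds are: both torch.allclose and torch.testing.assert_close enforce the elementwise bound atol + rtol·|ref|, so the two terms add. A quick pure-Python sketch (illustrative, not vLLM code; `max_allowed_error` is a hypothetical helper) of the slack atol=0.07, rtol=5e-2 grants at a few reference magnitudes:

```python
# Elementwise tolerance enforced by torch.allclose / torch.testing.assert_close:
#   |out - ref| <= atol + rtol * |ref|
# Evaluate that bound for the thresholds adopted here (atol=0.07, rtol=5e-2).

def max_allowed_error(ref, atol=0.07, rtol=5e-2):
    """Largest |out - ref| that still passes for a given reference value."""
    return atol + rtol * abs(ref)

for ref in (0.0, 0.5, 1.0, 10.0):
    print(f"ref={ref:5.1f}  allowed error <= {max_allowed_error(ref):.3f}")
```

Near zero the atol term dominates (errors up to 0.07 pass outright), which is the kind of slack the review comment above flags as potentially masking regressions.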
Contributor


@amd-hhashemi Can you please use torch.testing.assert_close instead?

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 11, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) February 11, 2026 03:01
@tjtanaa tjtanaa merged commit 1b3540e into vllm-project:main Feb 11, 2026
16 of 17 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 11, 2026
samutamm pushed a commit to samutamm/vllm that referenced this pull request Feb 11, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

Labels

ci/build · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants