
Threshold fix wvSplitk for occasional CI fails #34013

Merged
tjtanaa merged 3 commits into vllm-project:main from amd-hhashemi:wvSplitKthrshFix
Feb 11, 2026

Conversation

@amd-hhashemi
Contributor

amd-hhashemi commented Feb 6, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

amd-hhashemi and others added 2 commits February 6, 2026 19:04
…er init,

which were causing fails depending on rand seed.

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@amd-hhashemi amd-hhashemi requested a review from tjtanaa as a code owner February 6, 2026 19:12
@mergify mergify bot added the rocm Related to AMD ROCm label Feb 6, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 6, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request increases the tolerance in a numerical test for the wvSplitk FP8 kernel on ROCm to fix CI failures. While this may resolve the flakiness, the new relative tolerance of 5% is quite high and could potentially mask future numerical regressions. I've added a comment to highlight this concern and suggest improvements.

if xnorm:
    assert torch.allclose(out, ref_out, atol=1e-3, rtol=1e-8)
elif k >= 32 * 1024:
    assert torch.allclose(out, ref_out, 0.05)  # wider thresh for large-K & no xnorm
Contributor


Severity: high

A relative tolerance of 5% (rtol=0.05) is very high for a numerical test and may hide future regressions in the wvSplitKQ kernel. This significantly weakens the test's ability to catch numerical precision issues.

If this high tolerance is unavoidable due to the nature of FP8 arithmetic over large dimensions, please consider adding a more detailed comment explaining the necessity of this value, for example by referencing the specific conditions and CI failures that prompted this change. This will be very helpful for future maintenance.

For clarity, it is also better to use the keyword argument rtol explicitly.

Suggested change
-    assert torch.allclose(out, ref_out, 0.05)  # wider thresh for large-K & no xnorm
+    assert torch.allclose(out, ref_out, rtol=0.05)  # wider thresh for large-K & no xnorm
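The reviewer's point can be illustrated with the elementwise check torch.allclose documents, |out − ref| ≤ atol + rtol·|ref|, and its signature (input, other, rtol=1e-05, atol=1e-08, ...). The sketch below is a pure-Python stand-in, not vLLM or PyTorch code; the names `allclose`, `out`, and `ref` are illustrative:

```python
# Pure-Python stand-in for torch.allclose, which checks elementwise:
#   |out - ref| <= atol + rtol * |ref|
# torch.allclose's signature is (input, other, rtol=1e-05, atol=1e-08, ...),
# so a bare third argument such as 0.05 binds to rtol; naming it makes that clear.

def allclose(out, ref, rtol=1e-05, atol=1e-08):
    """Illustrative mimic of torch.allclose on flat lists of floats."""
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))

ref = [100.0, -50.0, 2.0]
out = [104.0, -52.0, 2.08]  # roughly 4% relative error in each element

print(allclose(out, ref))        # False under the default rtol=1e-05
print(allclose(out, ref, 0.05))  # True: the positional 0.05 is rtol
```

A 4% error passes under rtol=0.05 but fails under the default, which is why spelling out `rtol=` in the assert documents exactly which knob was loosened.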

@tjtanaa
Collaborator

tjtanaa commented Feb 7, 2026

@amd-hhashemi can you share your test results?

I triggered it. The tests seem to still fail https://buildkite.com/vllm/amd-ci/builds/4355/steps/canvas?jid=019c35ff-16ab-4fab-80e6-16a065daa2c9&tab=output

@amd-hhashemi
Contributor Author

amd-hhashemi commented Feb 7, 2026

@amd-hhashemi can you share your test results?

I triggered it. The tests seem to still fail https://buildkite.com/vllm/amd-ci/builds/4355/steps/canvas?jid=019c35ff-16ab-4fab-80e6-16a065daa2c9&tab=output

Interesting. I noticed that CI is running on an MI325, while I've been running tests on an MI350.
I snagged an MI325 just now, and sure enough those new big-K tests with no xnorm still fail there.
It looks like with these big Ks on the MI325 we need to set exactly the same threshold PyTorch uses (as loose as that seems). No idea how the MI350 gets by with a tighter threshold. MI325 logs from before and after the last check-in are attached (again, the MI350 runs were all passing before and after).
test_rocm_skinny_gemms_mi325_before.log
test_rocm_skinny_gemms_mi325_after.log
test_rocm_skinny_gemms_mi350_before.log

    assert torch.allclose(out, ref_out, atol=1e-3, rtol=1e-8)
elif k >= 32 * 1024:
    # wider pytorch thresh for large-K & no xnorm
    assert torch.allclose(out, ref_out, atol=0.07, rtol=5e-2)
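For context on how loose the new thresholds are: both torch.allclose and torch.testing.assert_close enforce the elementwise bound atol + rtol·|ref|, so the two terms add. A quick pure-Python sketch (illustrative, not vLLM code; `max_allowed_error` is a hypothetical helper) of the slack atol=0.07, rtol=5e-2 grants at a few reference magnitudes:

```python
# Elementwise tolerance enforced by torch.allclose / torch.testing.assert_close:
#   |out - ref| <= atol + rtol * |ref|
# Evaluate that bound for the thresholds adopted here (atol=0.07, rtol=5e-2).

def max_allowed_error(ref, atol=0.07, rtol=5e-2):
    """Largest |out - ref| that still passes for a given reference value."""
    return atol + rtol * abs(ref)

for ref in (0.0, 0.5, 1.0, 10.0):
    print(f"ref={ref:5.1f}  allowed error <= {max_allowed_error(ref):.3f}")
```

Near zero the atol term dominates (errors up to 0.07 pass outright), which is the kind of slack the review comment above flags as potentially masking regressions.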
Contributor


@amd-hhashemi Can you please use torch.testing.assert_close instead?

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 11, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) February 11, 2026 03:01
@tjtanaa tjtanaa merged commit 1b3540e into vllm-project:main Feb 11, 2026
16 of 17 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 11, 2026
samutamm pushed a commit to samutamm/vllm that referenced this pull request Feb 11, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
eldarkurtic pushed a commit to eldarkurtic/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Eldar Kurtic <research@neuralmagic.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

Labels

ci/build · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants