
[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness#36442

Merged
tjtanaa merged 4 commits into vllm-project:main from ROCm:akaratza_stabilize_v1_e2e
Mar 16, 2026

Conversation

@AndreasKaratzas
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 9, 2026

On ROCm there is non-determinism when the batch size is greater than 1, and the preemption test exercises exactly that case. We therefore add a retry so that a transient failure of this test is met with a second attempt. Furthermore, we disable skinny GEMMs, which is particularly useful for the MI355 test deployment.
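The retry described above can be sketched as a small decorator (a hypothetical helper for illustration; the actual PR may instead use pytest's rerun machinery):

```python
import functools


def retry_on_failure(max_attempts: int = 2):
    """Re-run a flaky test up to max_attempts times; re-raise the last failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except AssertionError as exc:
                    last_exc = exc  # transient batch-variance failure: try again
            raise last_exc
        return wrapper
    return decorator
```

With `max_attempts=2`, a test that fails once due to batch variance still passes on the second attempt, while a genuinely broken test still fails.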

cc @kenroche

…akiness

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify mergify bot added the rocm (Related to AMD ROCm) and v1 labels Mar 9, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 9, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several changes to improve CI stability on ROCm platforms. The main changes include adding a retry mechanism for a flaky test (test_with_ngram_gpu_spec_decoding), disabling skinny GEMM via an environment variable to mitigate non-determinism, and disabling prefix caching for the asynchronous scheduling tests. These changes are well-contained and only affect ROCm environments. The implementation appears correct and should contribute to more reliable CI runs.

Note: Security Review is unavailable for this PR.

@AndreasKaratzas AndreasKaratzas added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 9, 2026
@AndreasKaratzas
Collaborator Author

Adding ready label for tests to start.

@gshtras
Collaborator

gshtras commented Mar 9, 2026

Disabling skinny GEMMs for tests is not a good idea. This is a production feature that is meant to always be on. Without it, the tests cover a synthetic use case that doesn't match production reality.
If you believe the GEMMs are bugged, we need to disable them until they are fixed. If not, the tests need to accommodate them.
If there is a specific batch invariance test that is meant to exercise a deterministic run, it may fit there, but that needs to be clearly stated in the docs.

@AndreasKaratzas
Collaborator Author

Disabling skinny GEMMs for tests is not a good idea. This is a production feature that is meant to always be on. Without it, the tests cover a synthetic use case that doesn't match production reality. If you believe the GEMMs are bugged, we need to disable them until they are fixed. If not, the tests need to accommodate them. If there is a specific batch invariance test that is meant to exercise a deterministic run, it may fit there, but that needs to be clearly stated in the docs.

@gshtras Skinny GEMMs are bugged and I am working with @amd-hhashemi to resolve this. However, I think that having a skinny GEMM failure affect all tests is not a good principle. I am working on introducing a skinny GEMM test, amongst others, here: #35183

That way, the skinny GEMM test will fail but others will remain intact.

@mergify

mergify bot commented Mar 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 13, 2026
Collaborator

@tjtanaa tjtanaa left a comment


LGTM

@tjtanaa tjtanaa merged commit a2956a0 into vllm-project:main Mar 16, 2026
18 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 16, 2026
@noooop
Collaborator

noooop commented Mar 16, 2026

I couldn't help but laugh when I saw this headline.

Based on my understanding, batch variance eliminates all randomness.

batch variance? flakiness? retrying?

This sounds like a bug.

@AndreasKaratzas
Collaborator Author

I couldn't help but laugh when I saw this headline.

Based on my understanding, batch variance eliminates all randomness.

batch variance? flakiness? retrying?

This sounds like a bug.

@noooop So, the flakiness is likely due to batch variance (not batch invariance). ROCm does not yet have batch invariance like CUDA does. At the same time, I am still exploring why this test fails sometimes (rarely, but still). I will confess, however, that the retry might have been overkill 😅
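The batch-variance effect described here comes down to floating-point reduction order: kernels tile a reduction differently depending on batch size, and float addition is not associative, so the same mathematical sum can produce different results. A minimal float32 illustration (an independent sketch, not code from this PR):

```python
import numpy as np

# Three values whose true sum is 1.0.
big, one, neg_big = np.float32(1e8), np.float32(1.0), np.float32(-1e8)

# Order A: 1e8 + 1.0 rounds back to 1e8 (float32 spacing near 1e8 is 8.0),
# so the 1.0 is absorbed and lost.
order_a = (big + one) + neg_big   # -> 0.0

# Order B: the large terms cancel first, so the 1.0 survives.
order_b = (big + neg_big) + one   # -> 1.0
```

When batch size changes the tiling, the accumulation order changes, and greedy decoding can flip between near-tied tokens; hence a retry (or batch-invariant kernels) rather than exact-match assertions.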

@AndreasKaratzas AndreasKaratzas deleted the akaratza_stabilize_v1_e2e branch March 16, 2026 15:56
@noooop
Collaborator

noooop commented Mar 16, 2026

Sorry, I misread that. I thought it was batch invariance.

Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…akiness (vllm-project#36442)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…akiness (vllm-project#36442)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…akiness (vllm-project#36442)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
