
[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness#36442

Merged
tjtanaa merged 4 commits into vllm-project:main from ROCm:akaratza_stabilize_v1_e2e
Mar 16, 2026

Conversation

@AndreasKaratzas
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 9, 2026

On ROCm there is non-determinism when the batch size is greater than 1, and the preemption test exercises exactly that case. We therefore add a retry so that a transient failure of this test is met with a second attempt. Furthermore, we disable skinny GEMMs, which is particularly useful for the MI355 test deployment.
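The retry described above can be sketched as a small decorator (a hypothetical helper for illustration; the actual PR may instead use pytest's rerun machinery):

```python
import functools


def retry_on_failure(max_attempts: int = 2):
    """Re-run a flaky test up to max_attempts times; re-raise the last failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for _ in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except AssertionError as exc:
                    last_exc = exc  # transient batch-variance failure: try again
            raise last_exc
        return wrapper
    return decorator
```

With `max_attempts=2`, a test that fails once due to batch variance still passes on the second attempt, while a genuinely broken test still fails.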

cc @kenroche

…akiness

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify mergify bot added the rocm (Related to AMD ROCm) and v1 labels Mar 9, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 9, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several changes to improve CI stability on ROCm platforms. The main changes include adding a retry mechanism for a flaky test (test_with_ngram_gpu_spec_decoding), disabling skinny GEMM via an environment variable to mitigate non-determinism, and disabling prefix caching for the asynchronous scheduling tests. These changes are well-contained and only affect ROCm environments. The implementation appears correct and should contribute to more reliable CI runs.

Note: Security Review is unavailable for this PR.

@AndreasKaratzas AndreasKaratzas added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 9, 2026
@AndreasKaratzas
Collaborator Author

Adding ready label for tests to start.

@gshtras
Collaborator

gshtras commented Mar 9, 2026

Disabling skinny GEMMs for tests is not a good idea. This is a production feature that is meant to always be on. Without it, the tests cover a synthetic use case that doesn't match production reality.
If you believe the GEMMs are bugged, we need to disable them until they are fixed. If not, the tests need to accommodate them.
If there is a specific batch invariance test that is meant to exercise a deterministic run, it may fit there, but that needs to be clearly stated in the docs.

@AndreasKaratzas
Collaborator Author

Disabling skinny GEMMs for tests is not a good idea. This is a production feature that is meant to always be on. Without it, the tests cover a synthetic use case that doesn't match production reality. If you believe the GEMMs are bugged, we need to disable them until they are fixed. If not, the tests need to accommodate them. If there is a specific batch invariance test that is meant to exercise a deterministic run, it may fit there, but that needs to be clearly stated in the docs.

@gshtras Skinny GEMMs are bugged and I am working with @amd-hhashemi to resolve this. However, I think that having a skinny GEMM failure affect all tests is not a good principle. I am working on introducing a skinny GEMM test, amongst others, here: #35183

That way, the skinny GEMM test will fail but others will remain intact.

@mergify

mergify bot commented Mar 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 13, 2026
Collaborator

@tjtanaa tjtanaa left a comment


LGTM

@tjtanaa tjtanaa merged commit a2956a0 into vllm-project:main Mar 16, 2026
18 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 16, 2026
@noooop
Collaborator

noooop commented Mar 16, 2026

I couldn't help but laugh when I saw this headline.

Based on my understanding, batch variance eliminates all randomness.

batch variance? flakiness? retrying?

This sounds like a bug.

@AndreasKaratzas
Collaborator Author

I couldn't help but laugh when I saw this headline.

Based on my understanding, batch variance eliminates all randomness.

batch variance? flakiness? retrying?

This sounds like a bug.

@noooop So, the flakiness is likely due to batch variance (not batch invariance). ROCm does not yet have batch invariance like CUDA does. At the same time, I am still exploring why this test fails sometimes (rarely, but still). I will confess, however, that the retry might have been overkill 😅
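The batch-variance effect described here comes down to floating-point reduction order: kernels tile a reduction differently depending on batch size, and float addition is not associative, so the same mathematical sum can produce different results. A minimal float32 illustration (an independent sketch, not code from this PR):

```python
import numpy as np

# Three values whose true sum is 1.0.
big, one, neg_big = np.float32(1e8), np.float32(1.0), np.float32(-1e8)

# Order A: 1e8 + 1.0 rounds back to 1e8 (float32 spacing near 1e8 is 8.0),
# so the 1.0 is absorbed and lost.
order_a = (big + one) + neg_big   # -> 0.0

# Order B: the large terms cancel first, so the 1.0 survives.
order_b = (big + neg_big) + one   # -> 1.0
```

When batch size changes the tiling, the accumulation order changes, and greedy decoding can flip between near-tied tokens; hence a retry (or batch-invariant kernels) rather than exact-match assertions.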

@AndreasKaratzas AndreasKaratzas deleted the akaratza_stabilize_v1_e2e branch March 16, 2026 15:56
@noooop
Collaborator

noooop commented Mar 16, 2026

Sorry, I misread that. I thought it was batch invariance.

Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…akiness (vllm-project#36442)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…akiness (vllm-project#36442)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…akiness (vllm-project#36442)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
