[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness #36442
tjtanaa merged 4 commits into vllm-project:main
…akiness Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces several changes to improve CI stability on ROCm platforms. The main changes include adding a retry mechanism for a flaky test (test_with_ngram_gpu_spec_decoding), disabling skinny GEMM via an environment variable to mitigate non-determinism, and disabling prefix caching for the asynchronous scheduling tests. These changes are well-contained and only affect ROCm environments. The implementation appears correct and should contribute to more reliable CI runs.
Note: Security Review is unavailable for this PR.
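For readers unfamiliar with the environment-variable approach mentioned in the review, here is a minimal sketch of scoping such a setting to a single test. The helper `temporary_env` is hypothetical, and the variable name `VLLM_ROCM_USE_SKINNY_GEMM` is an assumption on my part; the conversation only says "an environment variable" and does not name it.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Hypothetical helper: set an environment variable for one block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state so other tests are unaffected.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Assumed variable name; the PR text does not spell it out.
SKINNY_GEMM_VAR = "VLLM_ROCM_USE_SKINNY_GEMM"

with temporary_env(SKINNY_GEMM_VAR, "0"):
    # A test body would run here with skinny GEMMs requested off.
    seen_inside = os.environ[SKINNY_GEMM_VAR]
```

In a real test the variable would presumably need to be in place before the engine process reads its configuration, so where the `with` block wraps matters.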
Disabling skinny GEMMs for tests is not a good idea. This is a production feature that is meant to always be on. Without it, the tests are testing a synthetic use case that doesn't match production reality.
@gshtras Skinny GEMMs are bugged and I am working with @amd-hhashemi to resolve this. However, I think that having a skinny GEMM failure affect all tests is not a good principle. I am working on introducing a skinny GEMM test, amongst others, here: #35183. That way, the skinny GEMM test will fail but the others will remain intact.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
I couldn't help but laugh when I saw this headline. Based on my understanding, batch variance eliminates all randomness. Batch variance? Flakiness? Retrying? This sounds like a bug.
@noooop So, the flakiness is likely due to batch variance (not batch invariance). ROCm does not yet have batch invariance like CUDA does. At the same time, I am still exploring why this test fails sometimes (rarely, but still). I will confess, however, that a retry might have been overkill 😅
Sorry, I misread that. I thought it was batch invariance.
…akiness (vllm-project#36442) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
On ROCm, outputs are non-deterministic when the batch size is greater than 1, and the preemption test exercises exactly that case. We therefore add a retry, so that a failure of this test is followed by a second attempt. We also disable skinny GEMMs, which is particularly useful for the MI355 test deployment.
cc @kenroche
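To illustrate the retry idea from the description, here is a minimal sketch, not the code from this PR. The decorator name `retry_on_rocm`, its `is_rocm` flag, and the stand-in test `flaky_check` are all hypothetical; the actual change may use a different mechanism entirely.

```python
import functools


def retry_on_rocm(is_rocm: bool, attempts: int = 2):
    """Hypothetical decorator: rerun a test on AssertionError, but only on ROCm."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            # Off ROCm, run exactly once so genuine regressions are not masked.
            max_attempts = attempts if is_rocm else 1
            for attempt in range(1, max_attempts + 1):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the failure
        return wrapper
    return decorator


# Stand-in for a test whose first run hits a batch-dependent mismatch.
calls = {"n": 0}


@retry_on_rocm(is_rocm=True)
def flaky_check():
    calls["n"] += 1
    assert calls["n"] >= 2, "outputs differed on the first attempt"
    return "passed"


result = flaky_check()
```

Gating on the platform keeps the retry from hiding failures elsewhere, which matches the review's observation that the changes only affect ROCm environments.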