[ROCm][CI] Attempt to fix the failures under a subgroup of the e2e test group #29358
DarkLight1337 merged 11 commits into vllm-project:main
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request addresses a CI failure on ROCm within test_async_scheduling. The root cause was that the test forced the FLEX_ATTENTION backend, which is unsupported on ROCm, leading to a runtime error. The proposed fix introduces a platform-specific check to set a supported attention backend. For ROCm, it now correctly uses TRITON_ATTN, while for other platforms, it preserves the existing behavior of using FLEX_ATTENTION. The change is well-targeted, correct, and should resolve the CI failure.
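A minimal sketch of the platform-specific selection the review describes. The helper name and plain-boolean parameter are illustrative; the actual test queries vLLM's `current_platform` to make this decision:

```python
def pick_attn_backend(is_rocm: bool) -> str:
    """Illustrative helper: FLEX_ATTENTION is unsupported on ROCm,
    so the test falls back to TRITON_ATTN there and keeps the
    existing FLEX_ATTENTION behavior everywhere else."""
    return "TRITON_ATTN" if is_rocm else "FLEX_ATTENTION"

print(pick_attn_backend(True))   # ROCm -> TRITON_ATTN
print(pick_attn_backend(False))  # other platforms -> FLEX_ATTENTION
```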
I synced your branch with main and tried running the tests. I am getting 2 failure cases; are they intended to be failing and handled in another PR? Summary of the tests run:
… without affecting other platforms Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
I built the upstream dockers (both base and test) and ran the tests. For the newly introduced modifications, here are some details:

- Platform-specific attention backends
- Platform-specific dtype
- Relaxed tolerances for ROCm
- Reduced test configurations on ROCm
- Skip strict logprobs check for ROCm spec-decoding

I think I'm still respecting the purpose of this particular test with these new changes. I am open to feedback. Even for NVIDIA, the only attention backend that works is FLEX_ATTENTION.
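The platform-specific dtype choice, for example, boils down to the following (mirroring the snippet in this PR's diff, with the platform check reduced to a plain boolean for illustration):

```python
def select_dtype(is_rocm: bool, is_testing_with_spec_decoding: bool) -> str:
    """Illustrative version of the PR's dtype selection: ROCm uses float16,
    except for the spec-decoding subgroup, which keeps float32."""
    if is_rocm and not is_testing_with_spec_decoding:
        return "float16"
    return "float32"

print(select_dtype(True, False))   # ROCm, no spec decoding
print(select_dtype(False, False))  # other platforms
```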
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Btw, commands for test execution:

```shell
docker build -f docker/Dockerfile.rocm_base -t rocm/vllm-dev:base . && \
  docker build --no-cache -f docker/Dockerfile.rocm --target test -t vllm-rocm-test:latest .
docker run --rm -it --device /dev/kfd --device /dev/dri --network=host --shm-size=16gb \
  --group-add video -w /vllm-workspace/tests -e HF_TOKEN=<YOUR_HF_TOKEN> \
  -e PYTHONPATH=/vllm-workspace:$PYTHONPATH -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  vllm-rocm-test:latest bash -c "pytest -v -s v1/e2e/test_async_scheduling.py"
```
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```diff
 # Data processing
-xgrammar==0.1.27
+xgrammar @ git+https://github.com/divakar-amd/xgrammar@3272f7c520564858056a60480d5afdf69ae79c84
```
Is this intentional? Can you please explain why you are changing this to @divakar-amd's fork?
Hi Sage, yes this is intentional. The problem is that xgrammar has a hard-coded warp size:
divakar-amd/xgrammar@41a849f
So neither the pip package nor the upstream xgrammar repo works correctly on ROCm. There is an open PR on xgrammar for this:
mlc-ai/xgrammar#476
But as you can see, this PR has been open for more than 2 weeks, so for the test to work for now, we are going with this solution. I am in contact with @divakar-amd, and as soon as his PR gets merged, we will change the requirements to at least be in parity with CUDA.
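For context on why a hard-coded warp size breaks ROCm: NVIDIA warps are 32 lanes, while AMD wavefronts are 64, so any full-warp lane mask computed for 32 lanes is wrong on ROCm. A toy illustration (not xgrammar's actual code):

```python
def full_lane_mask(warp_size: int) -> int:
    """Bitmask with one bit per lane in a warp/wavefront.
    Hard-coding warp_size=32 yields a mask that covers only
    half of a 64-lane AMD wavefront."""
    return (1 << warp_size) - 1

print(hex(full_lane_mask(32)))  # NVIDIA warp: 0xffffffff
print(hex(full_lane_mask(64)))  # AMD wavefront: 0xffffffffffffffff
```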
```python
if current_platform.is_rocm() and not is_testing_with_spec_decoding:
    dtype = "float16"
else:
    dtype = "float32"
```
Why not use float32 for everything? Would you still have to update tolerances for ROCm if everything was float32?
TL;DR
ROCM_AITER_FA, which is the only backend that can satisfy that particular subgroup of the test on ROCm, does not support fp32.
I think that, even for NVIDIA, the only attention backend accurate enough for this test is FLEX_ATTENTION; all other backends inject more numerical error on any platform. Unfortunately, the FLEX_ATTENTION backend is not fully supported on ROCm yet. Other than that, our backends, especially the AITER ones, are tested on float16, bfloat16, fp8, and in some cases fp4. So for now, on ROCm, we are going with float16 precision. In the future we will introduce a more accurate backend to better serve fp32 workloads.
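As background on why float16 forces relaxed tolerances: IEEE 754 half precision carries only about 3 decimal digits, so element-wise comparisons need a much looser threshold than fp32. A minimal, self-contained illustration (not vLLM code), using Python's standard `struct` half-precision format:

```python
import math
import struct

def to_fp16(x: float) -> float:
    # Round-trip a double through IEEE 754 half precision
    # ('e' is the binary16 format code in the struct module).
    return struct.unpack('e', struct.pack('e', x))[0]

# pi survives the round trip only to ~3 decimal digits, so any
# tolerance tighter than ~1e-3 would spuriously fail in fp16.
err = abs(to_fp16(math.pi) - math.pi)
print(f"fp16 round-trip error for pi: {err:.2e}")
```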
SageMoore left a comment
This is a fine temporary solution, but let's push on mlc-ai/xgrammar#476.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…e test group (vllm-project#29358) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
…e test group (vllm-project#29358) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
This PR ensures that during `test_async_scheduling`, we utilize the `TRITON_ATTN` backend, which is the default attention backend for ROCm.

Test used to verify functionality on ROCm:

```shell
pytest -v -s tests/v1/e2e/test_async_scheduling.py
```