
[ROCm][CI] Fix test_cudagraph_mode failure in AMD CI#29367

Merged
tjtanaa merged 4 commits into vllm-project:main from ROCm:micah/CI_cudagraph_test
Nov 25, 2025
Conversation

@micah-wil
Contributor

@micah-wil micah-wil commented Nov 25, 2025

We are seeing failures in tests/v1/cudagraph/test_cudagraph_mode.py in AMD CI after #26980 was merged. The test fails with the error "V0 attention backends have been removed. Set VLLM_USE_V1=1 to select a supported backend" because it tries to use the FlashAttn backend, which is not supported on ROCm. I updated the test to exercise ROCm attention backends when current_platform.is_rocm().
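The platform-aware selection described above can be sketched roughly as follows. This is illustrative only: the helper name `select_backends_for_test` and the exact backend strings are assumptions, not the merged diff; the real test gates on `current_platform.is_rocm()` from vllm.platforms.

```python
def select_backends_for_test(is_rocm: bool) -> list[str]:
    """Pick attention backends for test_cudagraph_mode to exercise.

    Sketch only: the helper name and the backend strings here are
    assumptions; the actual test checks current_platform.is_rocm().
    """
    if is_rocm:
        # ROCm has no FlashAttn backend, so test ROCm-supported ones
        # instead (names hypothetical).
        return ["TRITON_ATTN", "ROCM_ATTN"]
    return ["FLASH_ATTN"]


# On ROCm, FlashAttn must not be selected, or the test hits the
# "V0 attention backends have been removed" error quoted above.
assert "FLASH_ATTN" not in select_backends_for_test(is_rocm=True)
```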

After this PR, we see:

pytest -v -s v1/cudagraph/test_cudagraph_mode.py:

=================================================== warnings summary ====================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================= 12 passed, 4 skipped, 2 warnings in 150.51s (0:02:30) =================================

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil requested a review from tjtanaa as a code owner November 25, 2025 03:19
@mergify mergify bot added the nvidia and rocm (Related to AMD ROCm) labels Nov 25, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request resolves a CI failure on ROCm by implementing a fallback to the Triton attention backend when an unsupported backend is selected, instead of raising a RuntimeError. This is a sensible approach to make the system more robust. My review includes a suggestion to refine the warning message to be more specific about the reason for the fallback, which will improve clarity and aid in future debugging efforts.
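The fallback behavior this review describes (which the author later reverted in favor of fixing the test itself) would look roughly like the sketch below. The function name `resolve_backend`, the logger name, the backend strings, and the warning text are all illustrative assumptions, not the actual vLLM selector code.

```python
import logging

# Logger name is an assumption for illustration.
logger = logging.getLogger("vllm.attention.selector")


def resolve_backend(requested: str, supported: frozenset[str]) -> str:
    """Fall back to the Triton attention backend on ROCm instead of
    raising a RuntimeError for an unsupported backend.

    Sketch of the approach discussed in this review; not vLLM code.
    """
    if requested in supported:
        return requested
    # Per the review suggestion, the warning states *why* the fallback
    # happened, which aids later debugging.
    logger.warning(
        "Attention backend %s is not supported on ROCm; "
        "falling back to TRITON_ATTN. Supported backends: %s",
        requested,
        sorted(supported),
    )
    return "TRITON_ATTN"
```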


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Collaborator

@tjtanaa tjtanaa left a comment


Thank you for the fix.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 25, 2025
@tjtanaa tjtanaa enabled auto-merge (squash) November 25, 2025 04:10
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 25, 2025
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
auto-merge was automatically disabled November 25, 2025 05:25

Head branch was pushed to by a user without write access

@mergify mergify bot added the v1 label Nov 25, 2025
@micah-wil
Contributor Author

micah-wil commented Nov 25, 2025

Hey @tjtanaa, I have updated the test_cudagraph_mode test itself to exercise ROCm attention backends. I also reverted the change that defaulted to TritonAttn when an invalid attention backend is requested. Could you please take another look? Thanks!

cc @ProExpertProg

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@micah-wil micah-wil changed the title [ROCm][CI] Fall back to Triton Unified Attention on ROCm to resolve test_cudagraph_mode failure [ROCm][CI] Fix test_cudagraph_mode failure in AMD CI Nov 25, 2025
@tjtanaa tjtanaa enabled auto-merge (squash) November 25, 2025 06:10
@tjtanaa tjtanaa merged commit ef1f703 into vllm-project:main Nov 25, 2025
49 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 25, 2025
@micah-wil micah-wil deleted the micah/CI_cudagraph_test branch November 25, 2025 13:35
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm) · v1

Projects

Status: Done

2 participants