[ROCm][CI] Remove deepep DBO tests on gfx90a#37614

Merged
DarkLight1337 merged 1 commit into vllm-project:main from ROCm:akaratza_remove_incompatible_test on Mar 20, 2026

Conversation

@AndreasKaratzas
Collaborator

Follow-up for:

Removes the DBO test from gfx90a, since DeepEP is not compatible with the gfx90a arch. Addresses the failure in mi250_2: Distributed Tests (2 GPUs)(H100-MI250)
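
As a rough illustration of the idea (not the actual CI change, which edits the pipeline YAML), the filtering can be sketched in Python. `INCOMPATIBLE`, `select_tests`, and the test names are hypothetical; the only fact taken from this PR is that DeepEP does not work on gfx90a.

```python
# Hypothetical table: test name -> architectures it cannot run on.
# Only the DeepEP/gfx90a incompatibility comes from this PR.
INCOMPATIBLE = {"deepep_dbo": {"gfx90a"}}


def select_tests(tests, arch):
    """Keep only the tests that are not marked incompatible with `arch`."""
    return [t for t in tests if arch not in INCOMPATIBLE.get(t, set())]


# Illustrative test names, not the real job contents.
tests = ["deepep_dbo", "basic_distributed"]
gfx90a_tests = select_tests(tests, "gfx90a")
gfx942_tests = select_tests(tests, "gfx942")
print(gfx90a_tests)
print(gfx942_tests)
```

Under these assumptions the DBO test is dropped for gfx90a and kept for gfx942.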

Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1a2e-4d29-91e7-9eb765bc4904&tab=output

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify mergify bot added ci/build rocm Related to AMD ROCm labels Mar 19, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 19, 2026
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request removes the DeepEP DBO tests from running on the gfx90a architecture, which is correct as it's not compatible. It achieves this by moving a CI job from mi250 to mi325 hardware and removing the DBO test from it. While this works, it introduces inconsistencies in the CI configuration. The job's label and mirror_hardwares list are no longer accurate, which can be misleading. I've added a comment to suggest updating them for clarity and maintainability.

  timeout_in_minutes: 180
  mirror_hardwares: [amdexperimental, amdproduction, amdgfx90anightly, amdmi250]
- agent_pool: mi250_2
+ agent_pool: mi325_2
high

With the agent_pool changed to mi325_2, the job label on line 1402 ("Distributed Tests (2 GPUs)(H100-MI250) # TBD") is now misleading. Please update it to reflect the new hardware (e.g., MI325).

Additionally, the mirror_hardwares on line 1404 might need to be updated. Another job on mi325_2 (starting on line 2594) uses [amdexperimental, amdproduction, amdgfx942nightly, amdmi325]. Consider aligning this for consistency.
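
A mismatch like this could in principle be caught mechanically. The following is a minimal sketch only, not part of this PR or of the vLLM CI tooling: `POOL_TAGS` and `check_step` are hypothetical names, and the pool-to-tag mapping is assumed from the two `mirror_hardwares` lists quoted in this thread.

```python
# Hypothetical lint for a CI step: flag mirror_hardwares tags that belong
# to a different agent pool. The mapping below is an assumption based on
# the two lists quoted in this review, not on the real pipeline tooling.
POOL_TAGS = {
    "mi250_2": {"amdgfx90anightly", "amdmi250"},
    "mi325_2": {"amdgfx942nightly", "amdmi325"},
}


def check_step(step: dict) -> list[str]:
    """Return the tags in step['mirror_hardwares'] that are tied to another pool."""
    pool = step["agent_pool"]
    foreign = set()
    for other_pool, tags in POOL_TAGS.items():
        if other_pool != pool:
            foreign |= tags
    return sorted(foreign & set(step["mirror_hardwares"]))


# The step as quoted in the diff after the pool change.
step = {
    "agent_pool": "mi325_2",
    "mirror_hardwares": ["amdexperimental", "amdproduction", "amdgfx90anightly", "amdmi250"],
}
mismatched = check_step(step)
print(mismatched)
```

With the values quoted above, the two gfx90a/MI250 tags are reported for the `mi325_2` step; the same list on `mi250_2` yields no findings.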

Collaborator Author

@AndreasKaratzas AndreasKaratzas Mar 19, 2026

Not applicable. The intent here is to deduplicate tests and keep only hardware-specific tests on each platform, to save CI infra time.

@AndreasKaratzas
Collaborator Author

Testing MI250 to see if issue is resolved (added ready label).

@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 20, 2026 05:46
@DarkLight1337 DarkLight1337 merged commit 37cd9fc into vllm-project:main Mar 20, 2026
15 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 20, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_remove_incompatible_test branch March 20, 2026 15:16
chooper26 pushed a commit to intellistream/vllm-hust that referenced this pull request Mar 21, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done
