[ROCm][CI] Added MI325 mirrors #34923

Merged
simon-mo merged 9 commits into vllm-project:main from ROCm:akaratza_gating_amdci_stage_b
Feb 24, 2026

Conversation

@AndreasKaratzas
Collaborator

@AndreasKaratzas commented Feb 19, 2026

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify bot added the ci/build and rocm (Related to AMD ROCm) labels Feb 19, 2026
@github-project-automation bot moved this to Todo in AMD Feb 19, 2026
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request adds CI test mirrors for the new MI325 ROCm hardware, expanding test coverage to this platform. The changes primarily involve modifications to Buildkite YAML configuration files to add new test steps or mirror existing ones. Additionally, a performance fix for ROCm is included in the Dockerfile by setting MIOpen environment variables to mitigate a known performance regression in 3D convolution kernels. My review identifies a couple of areas for improvement in the CI configuration to avoid redundant test execution and improve maintainability. Overall, the changes are a good step towards ensuring stability on the new hardware.

Comment on lines +36 to +38
commands:
- pytest -v -s v1/e2e
- pytest -v -s v1/engine
Contributor

Priority: high

The commands in this mirror are redundant. They are functionally equivalent to the commands in the parent step. To improve maintainability and consistency with other mirrors in this PR, you can remove the commands section from this mirror and let it inherit the commands from the parent step.

Comment on lines +51 to +53
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal/processing
Contributor

Priority: high

The commands in this mirror lead to redundant test execution. The pytest -v -s models/multimodal/processing command runs all tests in the directory, which includes tests already covered by the mirror for the Multi-Modal Processor Test (CPU) step. This is inefficient.

To fix this, the mirror should run the same command as the parent step (pytest -v -s models/multimodal/processing/test_tensor_schema.py). The simplest way is to remove the commands section from the mirror, allowing it to inherit the commands from the parent step. The pip install is also redundant.

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas
Collaborator Author

cc @kenroche

@AndreasKaratzas
Collaborator Author

Another PR needs to be merged before we merge this one:

@AndreasKaratzas added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 23, 2026
@AndreasKaratzas
Collaborator Author

All critical PRs have been merged. I just rebased this branch. Added the ready label to start evaluation.

@AndreasKaratzas
Collaborator Author

I identified some mistakes in the .buildkite/scripts/hardware_ci/run-amd-test.sh script (#34839), but it seems I should include the fixes here. I'll push the updates soon to trigger the tests again.

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas
Collaborator Author

The recent change in the moriio test was inspired by: #35164
cc @rasmith

librdmacm1 \
libibverbs1 \
ibverbs-providers \
ibverbs-utils \
Contributor

ibverbs-utils is the only package the test needs in order to skip, since it ships a useful command-line utility. The MORI library already uses libibverbs successfully as it is.

Collaborator Author

If tests pass, I'll probably do a follow-up here. Otherwise I'll integrate it here, but let's hope things pass here 😅

Collaborator Author

The RIXL wheel bundles UCX built with --with-verbs, but UCX's verbs transport dynamically loads the system RDMA libraries at runtime:

  • Without the ibverbs-providers plugins (e.g. libmlx5.so), ibv_devinfo would probably return "No IB devices found" even on RDMA-capable hosts, so the skip guard would always skip. The package is technically a transitive dep of ibverbs-utils, but it is worth being explicit.
  • UCX's connection manager needs librdmacm.so.1, and librdmacm1 is not a transitive dep of ibverbs-utils. test_moriio_handshake_returns_metadata calls connector.register_kv_caches(), which goes through real MORI/UCX init; I think the transport layer would fail without rdmacm.
  • libibverbs1 is a transitive dep of ibverbs-utils, so this one could be dropped from the explicit list.

Happy to drop libibverbs1 from the explicit install, but librdmacm1 and ibverbs-providers are needed. All in all, I think we can keep them there for completeness, but I'm happy to hear your thoughts on that as well.
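For readers unfamiliar with the skip guard discussed here, the following is a minimal sketch of how such a probe could look. It is not the code from this PR: the function names are hypothetical, and the assumption that a detected device shows up as an "hca_id:" line in ibv_devinfo output is mine. The only strings taken from the thread are the tool name and the "No IB devices found" message.

```python
import shutil
import subprocess


def output_lists_rdma_device(text: str) -> bool:
    """Return True if ibv_devinfo-style output names at least one device.

    Assumption: each detected device is reported on a line starting with
    "hca_id:". The "No IB devices found" message is quoted from the thread.
    """
    if "No IB devices found" in text:
        return False
    return any(line.strip().startswith("hca_id:") for line in text.splitlines())


def detect_rdma(timeout_s: float = 10.0) -> bool:
    """Best-effort probe: False if the tool is missing, fails, or lists nothing."""
    exe = shutil.which("ibv_devinfo")  # shipped by ibverbs-utils
    if exe is None:
        return False
    try:
        proc = subprocess.run(
            [exe], capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0 and output_lists_rdma_device(proc.stdout + proc.stderr)
```

A probe like this only answers "is a device visible to the verbs tooling", which is why the provider plugins matter: without them the tool can report no devices on a host that actually has RDMA hardware.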

@pytest.mark.skipif(
not aiter_available, reason="Requires aiter package for ROCm FlashAttention backend"
)
@pytest.mark.skipif(not rdma_available, reason="No RDMA devices available")
Contributor

You can skip at module level instead of marking each test individually:

if not rdma_available:
    pytest.skip("No RDMA devices available", allow_module_level=True)

Collaborator Author

I'm planning to do a follow-up PR based on your feedback. Regarding this one: right now, only two tests (test_register_kv_caches and test_moriio_handshake_returns_metadata) require RDMA and aiter. If we put the skip at the top of the file, wouldn't that skip all tests, even the ones that currently pass?
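The trade-off raised here can be illustrated with a small sketch. A module-level pytest.skip(..., allow_module_level=True) drops every test in the file, whereas reusable skipif markers gate only the tests that need the hardware. The flag values, marker names, and test names below are hypothetical placeholders, not the PR's actual test file.

```python
import pytest

# Hypothetical availability flags; a real test module would compute these
# from environment probes instead of hard-coding them.
rdma_available = False
aiter_available = False

# Reusable markers: define the condition once, apply it only where needed.
requires_rdma = pytest.mark.skipif(
    not rdma_available, reason="No RDMA devices available"
)
requires_aiter = pytest.mark.skipif(
    not aiter_available,
    reason="Requires aiter package for ROCm FlashAttention backend",
)


@requires_aiter
@requires_rdma
def test_needs_hardware():
    # Placeholder body; only collected as runnable when both flags are True.
    assert rdma_available and aiter_available


def test_pure_python_logic():
    # Carries no marker, so it still runs on hosts without RDMA or aiter.
    assert 1 + 1 == 2
```

With this layout the unmarked test keeps running everywhere, which matches the concern that a module-level skip would also hide tests that currently pass.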

- vllm/
- tests/entrypoints/openai/responses
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
Collaborator

Why is this needed?

Collaborator Author

I haven't tried running it without it, honestly. I just know we always use it on ROCm. It may already be set during execution, in which case I can delete it later on.

Collaborator

There is already a mechanism that selects this when needed, at

def _maybe_force_spawn():

If certain tests fail to use it, please make sure to address those tests in a follow-up.
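As a rough illustration of what "selecting this when needed" could mean, the sketch below sets the VLLM_WORKER_MULTIPROC_METHOD variable to spawn only when a caller-supplied check says forking would be unsafe. This is not vLLM's _maybe_force_spawn; the function name, the injected predicate, and the choice to respect an explicit value are all assumptions made for the example.

```python
import os
from typing import Callable, MutableMapping, Optional

ENV_KEY = "VLLM_WORKER_MULTIPROC_METHOD"  # variable name taken from the diff above


def maybe_force_spawn_sketch(
    accelerator_initialized: Callable[[], bool],
    env: MutableMapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical sketch: pick 'spawn' only when forking would be unsafe.

    Assumption: an explicit user setting is left untouched, and 'spawn' is
    forced only if the accelerator runtime is already initialized.
    """
    explicit = env.get(ENV_KEY)
    if explicit:
        return explicit
    if accelerator_initialized():
        env[ENV_KEY] = "spawn"
        return "spawn"
    return None  # leave the platform default in place
```

If the library already makes this decision at runtime, exporting the variable in every CI step is redundant, which is the point the reviewer is making.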

Collaborator Author

This was actually something I forgot to remove after I dropped some of the mirrors. I am removing it; sorry for overlooking this.


wait_for_clean_gpus() {
echo "--- Waiting for clean GPU state"
while true; do
Collaborator

Is there a timeout on this? Can it get stuck forever?

Collaborator Author

Hopefully this version resolves the concern. The loop was like that in the original script, by the way, and for some reason we never observed it getting stuck. However, I did add a generous timeout there; let me know if it looks good now :)
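The general pattern behind bounding a `while true` polling loop can be sketched as follows. The sketch is in Python rather than the script's shell, and it is not the code added in this PR: the helper name, parameters, and defaults are hypothetical.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll predicate until it returns True or timeout_s elapses.

    Returns True on success and False on timeout, so the caller decides
    whether a timeout should fail the job or only log a warning.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline stable if the wall clock is adjusted, and returning a boolean instead of raising leaves the failure policy to the caller.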

@simon-mo merged commit 067c5d9 into vllm-project:main Feb 24, 2026
107 of 111 checks passed
@github-project-automation bot moved this from Todo to Done in AMD Feb 24, 2026
@AndreasKaratzas deleted the akaratza_gating_amdci_stage_b branch February 24, 2026 21:38
tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Labels

ci/build, kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm), v1

Projects

Status: Done

5 participants