[ROCm][CI] Added MI325 mirrors by AndreasKaratzas · Pull Request #34923 · vllm-project/vllm

AndreasKaratzas · 2026-02-19T23:18:01Z

In this PR, we are gating more tests. This PR is dependent on:

[ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 #34570
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends #34599
[CI] Fix ColBERT HF comparison tests on AMD CI + refactor #34567
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure #33949
[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout #34922

I've put this PR up to start the evaluation of these tests and get more gating tests ready for merging as soon as issues are confirmed to be resolved.

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

gemini-code-assist

Code Review

This pull request adds CI test mirrors for the new MI325 ROCm hardware, expanding test coverage to this platform. The changes primarily involve modifications to Buildkite YAML configuration files to add new test steps or mirror existing ones. Additionally, a performance fix for ROCm is included in the Dockerfile by setting MIOpen environment variables to mitigate a known performance regression in 3D convolution kernels. My review identifies a couple of areas for improvement in the CI configuration to avoid redundant test execution and improve maintainability. Overall, the changes are a good step towards ensuring stability on the new hardware.

gemini-code-assist · 2026-02-19T23:20:16Z

.buildkite/test_areas/engine.yaml

+      commands:
+      - pytest -v -s v1/e2e
+      - pytest -v -s v1/engine


The commands in this mirror are redundant. They are functionally equivalent to the commands in the parent step. To improve maintainability and consistency with other mirrors in this PR, you can remove the commands section from this mirror and let it inherit the commands from the parent step.
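The inheritance this comment relies on can be pictured with a small sketch. It is not the pipeline generator vLLM uses: the `resolve_commands` helper and the dict layout are hypothetical, and the only assumption is that a mirror with no `commands` of its own falls back to the parent step's list. The parent commands shown are illustrative.

```python
from typing import Optional


def resolve_commands(parent: dict, mirror: Optional[dict]) -> list[str]:
    """Return the commands a mirror would run.

    Hypothetical rule: a mirror that defines no `commands` key reuses
    the parent step's commands unchanged.
    """
    if mirror and mirror.get("commands"):
        return list(mirror["commands"])
    return list(parent["commands"])


# Illustrative parent step; the real step may word its commands differently.
parent_step = {
    "label": "parent step (illustrative)",
    "commands": ["pytest -v -s v1/e2e", "pytest -v -s v1/engine"],
}

# A mirror that repeats the parent's commands verbatim ...
explicit_mirror = {"commands": ["pytest -v -s v1/e2e", "pytest -v -s v1/engine"]}
# ... resolves to the same list as a mirror that omits them.
inheriting_mirror = {}

same = resolve_commands(parent_step, explicit_mirror) == resolve_commands(
    parent_step, inheriting_mirror
)
print(same)
```

Under that assumption, dropping the duplicated `commands` leaves a single place to edit when the test paths change.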

gemini-code-assist · 2026-02-19T23:20:16Z

.buildkite/test_areas/models_multimodal.yaml

+      commands:
+      - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
+      - pytest -v -s models/multimodal/processing


The commands in this mirror lead to redundant test execution. The pytest -v -s models/multimodal/processing command runs all tests in the directory, which includes tests already covered by the mirror for the Multi-Modal Processor Test (CPU) step. This is inefficient.

To fix this, the mirror should run the same command as the parent step (pytest -v -s models/multimodal/processing/test_tensor_schema.py). The simplest way is to remove the commands section from the mirror, allowing it to inherit the commands from the parent step. The pip install is also redundant.
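The overlap described here can be checked mechanically. The sketch below is a simplification that treats pytest targets as plain paths and assumes a directory target collects every test file beneath it; the `covers` helper is hypothetical and not part of the CI tooling.

```python
from pathlib import PurePosixPath


def covers(broad: str, narrow: str) -> bool:
    """True if pytest target `broad` would also collect `narrow`."""
    broad_path, narrow_path = PurePosixPath(broad), PurePosixPath(narrow)
    return narrow_path == broad_path or broad_path in narrow_path.parents


mirror_target = "models/multimodal/processing"
parent_target = "models/multimodal/processing/test_tensor_schema.py"

# The directory target includes the single file the parent step runs.
print(covers(mirror_target, parent_target))
# The single file does not include the rest of the directory.
print(covers(parent_target, mirror_target))
```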

AndreasKaratzas · 2026-02-20T00:20:36Z

cc @kenroche

AndreasKaratzas · 2026-02-22T04:14:39Z

There is another PR that must be merged before we merge this one:

[ROCm][CI] Fix spec decode profile assertion and logprob test determinism #35043

AndreasKaratzas · 2026-02-23T18:24:01Z

All critical PRs have been merged. I just rebased this branch. Added the ready label to start evaluation.

AndreasKaratzas · 2026-02-23T21:49:39Z

I identified some mistakes in the .buildkite/scripts/hardware_ci/run-amd-test.sh script (#34839), but it seems I should include the fixes here. I will push the updates soon to trigger the tests again.

AndreasKaratzas · 2026-02-24T02:30:25Z

The recent change in the moriio test was inspired by: #35164
cc @rasmith

rasmith · 2026-02-24T02:32:29Z

docker/Dockerfile.rocm

+    librdmacm1 \
+    libibverbs1 \
+    ibverbs-providers \
+    ibverbs-utils \


This is the only package the test needs in order to skip, since ibverbs-utils provides a useful command-line utility. The MORI library already uses libibverbs successfully as it is.

If tests pass, I'll probably do a follow-up here. Otherwise I'll integrate it here, but let's hope things pass here 😅

The RIXL wheel bundles UCX built with --with-verbs, but UCX's verbs transport dynamically loads system RDMA libraries at runtime:

ibverbs-providers: without the provider plugins (e.g. libmlx5.so), ibv_devinfo would probably return "No IB devices found" even on RDMA-capable hosts, so the skip guard would skip permanently. Technically this is a transitive dep of ibverbs-utils, but it is worth being explicit.

librdmacm1: UCX's connection manager needs librdmacm.so.1, and librdmacm1 is not a transitive dep of ibverbs-utils. test_moriio_handshake_returns_metadata calls connector.register_kv_caches(), which goes through real MORI/UCX init, so I think the transport layer would fail without rdmacm.

libibverbs1: this is a transitive dep of ibverbs-utils, so it could be dropped from the explicit list.

Happy to drop libibverbs1 from the explicit install, but librdmacm1 and ibverbs-providers are needed. All in all, I think we can keep them there for completeness, but I am happy to hear your thoughts on that as well.
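The dependency reasoning above suggests what a skip guard has to observe before it can trust its answer. The following is a minimal sketch, not the guard used in the test file: the helper names are hypothetical, and it assumes that `ibv_devinfo` exits with zero and does not print "No IB devices found" when a device is visible.

```python
import ctypes.util
import shutil
import subprocess


def devinfo_reports_devices(returncode: int, output: str) -> bool:
    """Interpret an `ibv_devinfo` run: success and no 'No IB devices found'."""
    return returncode == 0 and "No IB devices found" not in output


def missing_rdma_libraries(names=("ibverbs", "rdmacm")) -> list[str]:
    """Return the libraries from `names` that the dynamic linker cannot locate."""
    return [name for name in names if ctypes.util.find_library(name) is None]


def rdma_available() -> bool:
    """Hypothetical guard: needs the CLI, the libraries, and a visible device."""
    if shutil.which("ibv_devinfo") is None or missing_rdma_libraries():
        return False
    result = subprocess.run(
        ["ibv_devinfo"], capture_output=True, text=True, check=False
    )
    return devinfo_reports_devices(result.returncode, result.stdout + result.stderr)


print("RDMA usable:", rdma_available())
```

Provider plugins are not probed directly here. If they are missing, the effect would show up only indirectly, as an empty device list from `ibv_devinfo`.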

rasmith · 2026-02-24T04:25:25Z

tests/v1/kv_connector/unit/test_moriio_connector.py

 @pytest.mark.skipif(
    not aiter_available, reason="Requires aiter package for ROCm FlashAttention backend"
 )
+@pytest.mark.skipif(not rdma_available, reason="No RDMA devices available")


You can just skip at module level instead of skipping each individually:

if not rdma_available:
    pytest.skip("No RDMA devices available", allow_module_level=True)

I'm planning to do a follow-up PR based on your feedback. Regarding this one: right now, only two tests (test_register_kv_caches and test_moriio_handshake_returns_metadata) require RDMA and aiter. If we put the skip at the top of the file, wouldn't that skip all tests, including the ones that currently pass?
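The difference between the two approaches is easier to see side by side. This is a sketch under stated assumptions: the two flags are hard-coded stand-ins for values the real module computes, the test bodies are placeholders, and `test_pure_python_logic` is a hypothetical ungated test. It requires pytest to be installed.

```python
import pytest

# Stand-in flags; the real test module computes these from the environment.
aiter_available = False
rdma_available = False

needs_aiter = pytest.mark.skipif(
    not aiter_available,
    reason="Requires aiter package for ROCm FlashAttention backend",
)
needs_rdma = pytest.mark.skipif(
    not rdma_available, reason="No RDMA devices available"
)


@needs_aiter
@needs_rdma
def test_register_kv_caches():
    """Placeholder body; only this test and the next one are gated."""


@needs_aiter
@needs_rdma
def test_moriio_handshake_returns_metadata():
    """Placeholder body."""


def test_pure_python_logic():
    """Hypothetical ungated test: it still runs on hosts without RDMA."""
    assert 1 + 1 == 2
```

With per-test markers, only the two gated tests are skipped on a host without RDMA. A module-level `pytest.skip(..., allow_module_level=True)` would skip `test_pure_python_logic` as well.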

khluu · 2026-02-24T06:25:47Z

.buildkite/test_areas/entrypoints.yaml

  - vllm/
  - tests/entrypoints/openai/responses
  commands:
+  - export VLLM_WORKER_MULTIPROC_METHOD=spawn


why is this needed?

I haven't attempted to run it without it, honestly. I just know we always use it in ROCm. Maybe it is already set during execution, in which case I can delete it later on.

There is a mechanism to select this based on need in vllm/vllm/utils/system_utils.py (line 114 in 60da0e1):

def _maybe_force_spawn():

If certain tests fail to utilize it, please make sure to address the tests in a follow-up.

This was actually something that I forgot there after I removed some of the mirrors. I am removing it, sorry for overlooking this.
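The idea of selecting the start method only when needed, rather than exporting it in every CI step, can be sketched as follows. This is not the body of vLLM's `_maybe_force_spawn`; the `maybe_force_spawn` helper, its `accelerator_initialized` parameter, and the rule it applies are assumptions made for illustration.

```python
def maybe_force_spawn(env: dict, accelerator_initialized: bool) -> None:
    """Set VLLM_WORKER_MULTIPROC_METHOD=spawn only when needed.

    Hypothetical logic: an explicit user setting wins, and spawn is
    forced only when the accelerator runtime is already initialized
    in this process (forking after that point is unsafe).
    """
    if "VLLM_WORKER_MULTIPROC_METHOD" in env:
        return
    if accelerator_initialized:
        env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


# A plain dict stands in for os.environ so the sketch has no side effects.
env = {}
maybe_force_spawn(env, accelerator_initialized=True)
print(env)
```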

gshtras · 2026-02-24T16:14:46Z

.buildkite/scripts/hardware_ci/run-amd-test.sh

+
+wait_for_clean_gpus() {
+  echo "--- Waiting for clean GPU state"
+  while true; do


Is there a timeout on this? Can it get stuck forever?

Hopefully this version resolves the concern. By the way, it was like that in the original script, and for some reason we never observed it getting stuck. However, I did add a generous timeout there. Let me know if it looks good now :)
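The concern and the fix come down to bounding a polling loop with a deadline. The sketch below shows that pattern in Python rather than in the shell script under review; `wait_until` and the stand-in predicates are hypothetical, and the timeout values are arbitrary.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool], timeout_s: float, interval_s: float = 1.0
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Illustrative stand-in for "GPUs are clean": becomes True on the third poll.
polls = iter([False, False, True])
print(wait_until(lambda: next(polls), timeout_s=5.0, interval_s=0.01))

# A condition that never clears now returns False instead of hanging.
print(wait_until(lambda: False, timeout_s=0.05, interval_s=0.01))
```

The caller decides what a `False` result means, for example failing the CI job with a clear message instead of leaving the runner blocked.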

[ROCm][CI] Added MI325 mirrors

8328201

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas requested review from gshtras and tjtanaa as code owners February 19, 2026 23:18

mergify bot added the ci/build and rocm labels Feb 19, 2026

github-project-automation bot added this to AMD Feb 19, 2026

github-project-automation bot moved this to Todo in AMD Feb 19, 2026

gemini-code-assist bot reviewed Feb 19, 2026

Reduced proposed list

8f84291

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_gating_amdci…

21c4f4e

…_stage_b

Merge remote-tracking branch 'origin/main' into akaratza_gating_amdci…

c777f44

…_stage_b

AndreasKaratzas added the ready label Feb 23, 2026

AndreasKaratzas added 2 commits February 23, 2026 15:51

sed unquoted markers to re-wraps them

d0608fc

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

[ROCm][CI] Added skip if RDMA devices are not found during moriio test

1d009e2

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas requested review from ApostaC and orozery as code owners February 24, 2026 02:13

mergify bot added v1 kv-connector labels Feb 24, 2026

Merge remote-tracking branch 'origin/main' into akaratza_gating_amdci…

2cf2887

…_stage_b

rasmith reviewed Feb 24, 2026

rasmith mentioned this pull request Feb 24, 2026

[CI][AMD][BugFix][P/D] Skip test_moriio_connector.py tests if IB verbs is not available #35164

Closed

2 tasks

AndreasKaratzas mentioned this pull request Feb 24, 2026

[ROCm][CI] Adding infiniband mappings for moriio tests #35170

Merged

khluu approved these changes Feb 24, 2026

gshtras reviewed Feb 24, 2026

AndreasKaratzas added 2 commits February 24, 2026 11:34

[ROCm][CI] Resolving possible CI runner hanging and purging left-over…

fc5e6d6

… property in yaml Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Merge remote-tracking branch 'origin/main' into akaratza_gating_amdci…

86a3aed

…_stage_b

gshtras approved these changes Feb 24, 2026

simon-mo merged commit 067c5d9 into vllm-project:main Feb 24, 2026
107 of 111 checks passed

github-project-automation bot moved this from Todo to Done in AMD Feb 24, 2026

AndreasKaratzas deleted the akaratza_gating_amdci_stage_b branch February 24, 2026 21:38

AndreasKaratzas mentioned this pull request Feb 25, 2026

[ROCm][CI] Amending deletion of AMD mirror #35322

Merged

tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026

[ROCm][CI] Added MI325 mirrors (vllm-project#34923)

15961e1

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas mentioned this pull request Feb 28, 2026

[CI] Defining extended V1 e2e + engine tests #35580

Merged

llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026

[ROCm][CI] Added MI325 mirrors (vllm-project#34923)

9d73073

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026

[ROCm][CI] Added MI325 mirrors (vllm-project#34923)

ae18848

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026

[ROCm][CI] Added MI325 mirrors (vllm-project#34923)

63e6034

Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>

Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

[ROCm][CI] Added MI325 mirrors (vllm-project#34923)

b2ae76d

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

5 participants
