[ROCm][CI] Add rocm support for run-multi-node-test.sh #31922

Merged
tjtanaa merged 9 commits into vllm-project:main from ROCm:amd/enable_2node_test
Jan 8, 2026

@charlifu charlifu commented Jan 7, 2026

Currently, run-multi-node-test.sh works only for CUDA. This PR adds ROCm support by:

  • checking whether the script is running on ROCm
  • using different flags on ROCm to control which GPU devices are used inside the CI image.
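The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `is_rocm` and `gpu_flags`, the `KFD_PATH` override, and the detection heuristic (presence of `/dev/kfd` or `rocm-smi`) are all assumptions. The flags shown are the common Docker conventions for each platform (`--gpus` for CUDA, device passthrough plus `HIP_VISIBLE_DEVICES` for ROCm); the script in the PR may select devices differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and heuristics are illustrative, not from the PR.

# Return 0 when the host looks like a ROCm machine.
# Assumption: the AMD compute device node (/dev/kfd) or the rocm-smi tool
# being present is a good enough signal. KFD_PATH exists only to ease testing.
is_rocm() {
    local kfd_path="${KFD_PATH:-/dev/kfd}"
    [ -e "$kfd_path" ] || command -v rocm-smi >/dev/null 2>&1
}

# Print the `docker run` GPU flags, one argument per line, for a
# comma-separated device list such as "0,1".
gpu_flags() {
    local platform="$1" devices="$2"
    if [ "$platform" = "rocm" ]; then
        # ROCm containers receive the device nodes directly; the visible
        # GPUs are then narrowed with an environment variable.
        printf '%s\n' --device /dev/kfd --device /dev/dri \
            -e "HIP_VISIBLE_DEVICES=${devices}"
    else
        # CUDA containers use Docker's --gpus flag; the inner quotes keep
        # the comma-separated list together as one value.
        printf '%s\n' --gpus "\"device=${devices}\""
    fi
}

# Example use (not executed here):
#   if is_rocm; then platform=rocm; else platform=cuda; fi
#   mapfile -t flags < <(gpu_flags "$platform" "0,1")
#   docker run "${flags[@]}" <image> <command>
```

Emitting one argument per line lets the caller read the flags into an array, which avoids word-splitting problems when they are passed on to `docker run`.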

Signed-off-by: charlifu <charlifu@amd.com>
@mergify mergify bot added ci/build rocm Related to AMD ROCm labels Jan 7, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for ROCm to the multi-node test script. The changes include logic to detect a ROCm environment and to use the appropriate Docker flags for either ROCm or CUDA. The overall approach is sound, but I've found a critical syntax error in the shell script that will cause it to fail. Please see the specific comment for details.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>

@tjtanaa tjtanaa left a comment

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 8, 2026
@tjtanaa tjtanaa enabled auto-merge (squash) January 8, 2026 03:11

tjtanaa commented Jan 8, 2026

@charlifu we will also need to update the ci-infra repo, because this bash script is not called at all on AMD CI. Everything is run by run-amd-test.sh right now.

https://buildkite.com/vllm/amd-ci/builds/2429/steps/canvas?sid=019b9440-620f-4e65-9204-d1f4b5939080

For CUDA, the template has a render_cuda_config macro that sends tests with num_nodes >= 2 to run-multi-node-test.sh:
https://github.com/vllm-project/ci-infra/blob/46c6cc39549c3cbecc827983bf002a1ed23d426c/buildkite/test-template-ci.j2#L52

In the AMD CI test template, however, this render_cuda_config macro is never used:
https://github.com/vllm-project/ci-infra/blob/46c6cc39549c3cbecc827983bf002a1ed23d426c/buildkite/test-template-amd.j2#L97

CC @AndreasKaratzas
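
The routing described in this comment amounts to a choice based on a step's node count. A minimal sketch of that decision in shell, assuming a hypothetical helper `pick_runner`; the actual logic lives in the Jinja templates linked above and is not reproduced here:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ci-infra: pick the runner script from a
# test step's node count, mirroring the num_nodes >= 2 rule described above.
pick_runner() {
    local num_nodes="${1:-1}"   # treat a missing count as a single-node step
    if [ "$num_nodes" -ge 2 ]; then
        echo "run-multi-node-test.sh"
    else
        echo "run-amd-test.sh"
    fi
}
```

For example, `pick_runner 2` prints `run-multi-node-test.sh`, while `pick_runner 1` prints `run-amd-test.sh`.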

@tjtanaa tjtanaa merged commit cddbc2b into vllm-project:main Jan 8, 2026
19 of 20 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…31922)

Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…31922)

Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…31922)

Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…31922)

Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)