[ROCm][CI] Add rocm support for run-multi-node-test.sh #31922
tjtanaa merged 9 commits into vllm-project:main
Signed-off-by: charlifu <charlifu@amd.com>
Code Review
This pull request adds support for ROCm to the multi-node test script. The changes include logic to detect a ROCm environment and to use the appropriate Docker flags for either ROCm or CUDA. The overall approach is sound, but I've found a critical syntax error in the shell script that will cause it to fail. Please see the specific comment for details.
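As a rough illustration of the kind of logic the review describes, the sketch below detects a ROCm host and picks GPU-related `docker run` flags accordingly. It is not the code from this PR: the function names `detect_platform` and `gpu_docker_flags` are hypothetical, and the detection heuristic (presence of `rocm-smi` or `/dev/kfd`) is an assumption. The flags themselves (`--device /dev/kfd`, `--device /dev/dri`, `HIP_VISIBLE_DEVICES`, `--gpus`) are standard Docker and ROCm options.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; names are not taken from the PR.

# Assumed heuristic: treat the host as ROCm if rocm-smi is on PATH or the
# AMD compute device node exists; otherwise fall back to CUDA.
detect_platform() {
    if command -v rocm-smi >/dev/null 2>&1 || [ -e /dev/kfd ]; then
        echo "rocm"
    else
        echo "cuda"
    fi
}

# Print GPU-related `docker run` arguments, one per line.
# $1: platform ("rocm" or "cuda"); $2: comma-separated GPU indices.
gpu_docker_flags() {
    local platform="$1" gpus="$2"
    case "$platform" in
        rocm)
            # ROCm containers need the kfd and dri device nodes passed through.
            printf '%s\n' --device /dev/kfd --device /dev/dri \
                -e "HIP_VISIBLE_DEVICES=${gpus}"
            ;;
        cuda)
            # Docker expects literal double quotes around a multi-GPU device list.
            printf '%s\n' --gpus "\"device=${gpus}\""
            ;;
        *)
            echo "unknown platform: ${platform}" >&2
            return 1
            ;;
    esac
}
```

One possible way to consume the output is to read it into an array, for example `mapfile -t flags < <(gpu_docker_flags "$(detect_platform)" 0,1)`, and then pass `"${flags[@]}"` to `docker run`, which avoids word-splitting problems with the quoted device list.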
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
@charlifu we will also need to update the AMD CI test template; see https://buildkite.com/vllm/amd-ci/builds/2429/steps/canvas?sid=019b9440-620f-4e65-9204-d1f4b5939080. For CUDA, there is a render_cuda_config macro that triggers tests with num_nodes >= 2 to be run by run-multi-node-test.sh. In the AMD CI test template, however, this render_cuda_config macro is never used.
…31922) Signed-off-by: charlifu <charlifu@amd.com> Signed-off-by: Charlie Fu <Charlie.Fu@amd.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…31922) Signed-off-by: charlifu <charlifu@amd.com> Signed-off-by: Charlie Fu <Charlie.Fu@amd.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Currently, run-multi-node-test.sh only works for CUDA. This PR adds support for ROCm by: