Adding the test-amd.yaml for test definitions for the AMD backend. (alternative PR) #26718
Conversation
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Code Review
This pull request adds a new test definition file for the AMD backend. However, the file appears to be a direct copy from a configuration for an NVIDIA backend. It contains numerous references to NVIDIA-specific technologies, commands, and GPU architectures such as CUDA, CUTLASS, NCCL, nvidia-smi, A100, H200, and Blackwell. These are incorrect for an AMD environment and will likely cause test failures or incorrect test execution. The file needs a thorough review to replace all NVIDIA-specific elements with their AMD equivalents (e.g., using rocm-smi instead of nvidia-smi, ROCR_VISIBLE_DEVICES instead of CUDA_VISIBLE_DEVICES, and targeting AMD GPUs).
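To make the suggested substitutions concrete, here is a sketch of how one basic step might look after AMD adaptation. This is illustrative only: the label and exact command layout are assumptions, though `rocm-smi` and `ROCR_VISIBLE_DEVICES` are standard ROCm tooling and the `mi325_8` agent pool appears elsewhere in this diff.

```yaml
# Illustrative sketch, not content from the PR: an NVIDIA-style step
# rewritten with AMD equivalents.
- label: Basic Correctness Test
  agent_pool: mi325_8                  # AMD MI325 pool referenced later in this file
  commands:
  - rocm-smi                           # AMD equivalent of nvidia-smi
  - export ROCR_VISIBLE_DEVICES=0,1    # replaces CUDA_VISIBLE_DEVICES
  - pytest -v -s basic_correctness/test_cumem.py
```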
```yaml
  - tests/basic_correctness/test_cumem.py
  commands:
  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
  - pytest -v -s basic_correctness/test_cumem.py
  - pytest -v -s v1/engine/test_engine_core_client.py::test_kv_cache_events_dp
  - pytest -v -s distributed/test_utils.py
  - pytest -v -s compile/test_basic_correctness.py
  - pytest -v -s distributed/test_pynccl.py
  - python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
  - python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
```
```yaml
- label: Platform Tests (CUDA) # 4min
  agent_pool: mi325_8
  # grade: Blocking
  source_file_dependencies:
  - csrc/quantization/cutlass_w8a8/moe/
  # since torchao nightly is only compatible with torch nightly currently
  # https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
  # we can only upgrade after this is resolved
  - pip install --pre torchao==0.13.0.dev20250814 --index-url https://download.pytorch.org/whl/nightly/cu128
```
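If this step is kept for AMD, the `cu128` wheel index would also need to change. A hypothetical ROCm variant is sketched below; the exact nightly index path and whether a matching torchao dev wheel is published for ROCm are assumptions, not verified facts.

```yaml
  # Hypothetical AMD counterpart of the torchao install step.
  # The rocm6.2 index path and nightly wheel availability are assumptions.
  - pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/rocm6.2
```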
```yaml
- label: Blackwell Test # 38 min
  timeout_in_minutes: 60
  working_dir: "/vllm-workspace/"
  gpu: b200
  # optional: true
  source_file_dependencies:
  - csrc/quantization/fp4/
  - csrc/attention/mla/
  - csrc/quantization/cutlass_w8a8/moe/
  - vllm/model_executor/layers/fused_moe/cutlass_moe.py
  - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
  - vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
  - vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
  - vllm/v1/attention/backends/flashinfer.py
  - vllm/compilation/fusion.py
  - vllm/compilation/fusion_attn.py
  commands:
  - nvidia-smi
  - python3 examples/offline_inference/basic/chat.py
  # Attention
  # num_heads2 broken by https://github.com/flashinfer-ai/flashinfer/issues/1353
  - pytest -v -s tests/kernels/attention/test_flashinfer.py -k 'not num_heads2'
  - pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_attention.py
  - pytest -v -s tests/kernels/attention/test_cutlass_mla_decode.py
  - pytest -v -s tests/kernels/attention/test_flashinfer_mla_decode.py
  # Quantization
  - pytest -v -s tests/kernels/quantization/test_cutlass_scaled_mm.py -k 'fp8'
  - pytest -v -s tests/kernels/quantization/test_nvfp4_quant.py
  - pytest -v -s tests/kernels/quantization/test_silu_mul_nvfp4_quant.py
  - pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
  - pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
  - pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
  - pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
  - pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
  # Fusion
  - pytest -v -s tests/compile/test_fusion_all_reduce.py
  - pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
  - pytest -v -s tests/kernels/moe/test_flashinfer.py
  - pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
```
This entire test step is labeled Blackwell Test and configured to run on a b200 GPU, both of which are NVIDIA architecture and hardware. It also invokes nvidia-smi and exercises NVIDIA-specific features such as CUTLASS and TRTLLM. The whole block is irrelevant for an AMD backend and should be removed or replaced with AMD-equivalent tests.
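If an AMD counterpart were wanted rather than outright deletion, it might target an MI-series accelerator and drop the CUTLASS/TRTLLM/FlashInfer-specific suites. A rough sketch follows; the label, pool name, and which of these kernels actually have ROCm ports are assumptions.

```yaml
# Hypothetical AMD replacement for the Blackwell step (names are assumptions):
- label: MI325 Test
  timeout_in_minutes: 60
  working_dir: "/vllm-workspace/"
  agent_pool: mi325_8
  commands:
  - rocm-smi                     # instead of nvidia-smi
  - python3 examples/offline_inference/basic/chat.py
  # NVIDIA-only suites (CUTLASS, TRTLLM, FlashInfer, nvfp4) omitted;
  # keep only tests with ROCm kernel support.
```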
```yaml
  - pytest -v -s ./compile/test_wrapper.py
  - VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
  - pytest -v -s distributed/test_sequence_parallel.py
  - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
```
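For reference, the same shutdown step with the device mask translated to ROCm might read as follows. This is a sketch: `ROCR_VISIBLE_DEVICES` is the ROCm runtime's device mask, and `HIP_VISIBLE_DEVICES` exists at the HIP layer as an alternative.

```yaml
  # Sketch: ROCm device masking instead of CUDA_VISIBLE_DEVICES
  - ROCR_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
```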
```yaml
##### A100 test #####

- label: Distributed Tests (A100) # optional
  gpu: a100
  optional: true
  num_gpus: 4
  source_file_dependencies:
  - vllm/
  commands:
  # NOTE: don't test llama model here, it seems hf implementation is buggy
  # see https://github.com/vllm-project/vllm/pull/5689 for details
  - pytest -v -s distributed/test_custom_all_reduce.py
  - torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
  - TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
  - pytest -v -s -x lora/test_mixtral.py
```
```yaml
- label: LM Eval Large Models # optional
  gpu: a100
  optional: true
  num_gpus: 4
  working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
  source_file_dependencies:
  - csrc/
  - vllm/model_executor/layers/quantization
  commands:
  - export VLLM_WORKER_MULTIPROC_METHOD=spawn
  - pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
```
```yaml
##### H200 test #####
- label: Distributed Tests (H200) # optional
  gpu: h200
  optional: true
  working_dir: "/vllm-workspace/"
  num_gpus: 2
  commands:
  - pytest -v -s tests/distributed/test_context_parallel.py
  - CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
```
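Swapping the device mask alone would not make this H200 step AMD-ready: DeepEP and DeepGEMM are, as far as I know, CUDA-only stacks, so those backend flags would have to be dropped or replaced with whatever all2all backend vLLM supports on ROCm. A minimal sketch keeping only the portable pieces (backend availability on ROCm is an open question):

```yaml
  # Sketch: data-parallel command with NVIDIA-only backend flags removed
  - ROCR_VISIBLE_DEVICES=1,2 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
```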
```yaml
##### B200 test #####
- label: Distributed Tests (B200) # optional
  gpu: b200
  optional: true
  working_dir: "/vllm-workspace/"
  num_gpus: 2
  commands:
  - pytest -v -s tests/distributed/test_context_parallel.py
  - pytest -v -s tests/distributed/test_nccl_symm_mem_allreduce.py
```