9 changes: 1 addition & 8 deletions .buildkite/test-amd.yaml
@@ -1402,7 +1402,7 @@ steps:
 - label: Distributed Tests (2 GPUs)(H100-MI250) # TBD
   timeout_in_minutes: 180
   mirror_hardwares: [amdexperimental, amdproduction, amdgfx90anightly, amdmi250]
-  agent_pool: mi250_2
+  agent_pool: mi325_2
Contributor

high
With the agent_pool changed to mi325_2, the job label on line 1402 ("Distributed Tests (2 GPUs)(H100-MI250) # TBD") is now misleading. Please update it to reflect the new hardware (e.g., MI325).

Additionally, the mirror_hardwares on line 1404 might need to be updated. Another job on mi325_2 (starting on line 2594) uses [amdexperimental, amdproduction, amdgfx942nightly, amdmi325]. Consider aligning this for consistency.
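A sketch of what the suggested alignment could look like (the label text is illustrative, and the timeout and num_gpus values are simply carried over from the existing job; the actual naming is the maintainers' call):

```yaml
# Hypothetical: reviewer-suggested alignment for the mi325_2 job.
# Not part of the merged change.
- label: Distributed Tests (2 GPUs)(MI325)
  timeout_in_minutes: 180
  mirror_hardwares: [amdexperimental, amdproduction, amdgfx942nightly, amdmi325]
  agent_pool: mi325_2
  num_gpus: 2
```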

Collaborator (Author)

@AndreasKaratzas Mar 19, 2026
Not applicable. The intent here is to deduplicate tests and keep only hardware-specific tests on each platform, to save CI infra time.

   num_gpus: 2
   working_dir: "/vllm-workspace/"
   source_file_dependencies:
@@ -1412,15 +1412,13 @@ steps:
   - vllm/v1/attention/backends/
   - vllm/v1/attention/selector.py
   - tests/distributed/test_context_parallel.py
-  - tests/v1/distributed/test_dbo.py
   - examples/offline_inference/data_parallel.py
   - vllm/_aiter_ops.py
   - vllm/platforms/rocm.py
   commands:
   - export TORCH_NCCL_BLOCKING_WAIT=1
   - pytest -v -s tests/distributed/test_context_parallel.py
   - VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=allgather_reducescatter --disable-nccl-for-dp-synchronization
-  - pytest -v -s tests/v1/distributed/test_dbo.py


#####################################################################################################################################
@@ -2596,21 +2594,16 @@ steps:
   mirror_hardwares: [amdexperimental, amdproduction, amdgfx942nightly, amdmi325]
   agent_pool: mi325_2
   num_gpus: 2
-  optional: true
   working_dir: "/vllm-workspace/"
   source_file_dependencies:
   - vllm/distributed/
   - vllm/v1/distributed/
   - vllm/model_executor/layers/fused_moe/
   - tests/distributed/test_context_parallel.py
-  - tests/v1/distributed/test_dbo.py
   - examples/offline_inference/data_parallel.py
   - vllm/_aiter_ops.py
   - vllm/platforms/rocm.py
   commands:
   - export TORCH_NCCL_BLOCKING_WAIT=1
   - pytest -v -s tests/distributed/test_context_parallel.py
   - VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
-  - pytest -v -s tests/v1/distributed/test_dbo.py

