[ROCm][CI] Revamping AMD mirrors#35897

Open
AndreasKaratzas wants to merge 8 commits into vllm-project:main from ROCm:akaratza_amd_ci_rework

AndreasKaratzas (Collaborator) commented Mar 3, 2026

This PR refines the gating tests so that they also run on MI250 agents. I've opened it now to start evaluating these tests and to have them ready to merge as soon as the outstanding issues are confirmed to be resolved.

Mirrored AMD Tests

| File | Test Label | AMD Device |
| --- | --- | --- |
| basic_correctness.yaml | Basic Correctness | mi250_1 |
| distributed.yaml | Distributed (2 GPUs) | mi250_2 |
| distributed.yaml | Distributed Torchrun + Examples (4 GPUs) | mi250_4 |
| distributed.yaml | Distributed Tests (2 GPUs)(H100-MI325) | mi250_2 |
| engine.yaml | Engine (1 GPU) | mi325_1 |
| engine.yaml | e2e Scheduling (1 GPU) | mi325_1 |
| engine.yaml | e2e Core (1 GPU) | mi325_1 |
| engine.yaml | V1 e2e (2 GPUs) | mi250_2 |
| engine.yaml | V1 e2e (4 GPUs) | mi325_4 |
| entrypoints.yaml | Entrypoints Integration (API Server 1) | mi325_1 |
| entrypoints.yaml | Entrypoints Integration (API Server 2) | mi250_1 |
| entrypoints.yaml | Entrypoints V1 | mi250_1 |
| lora.yaml | LoRA %N | mi250_1 |
| misc.yaml | V1 Others | mi325_1 |
| misc.yaml | Examples | mi250_1 |
| models_basic.yaml | Basic Models Tests (Other) | mi250_1 |
| models_language.yaml | Language Models Tests (Standard) | mi325_1 |
| models_language.yaml | Language Models Tests (Extra Standard) %N | mi250_1 |
| models_language.yaml | Language Models Test (Extended Generation) | mi325_1 |
| models_language.yaml | Language Models Test (Extended Pooling) | mi250_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 1: qwen2 | mi325_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 2: qwen3 + gemma | mi325_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 3: llava + qwen2_vl | mi325_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 4: other + whisper | mi325_1 |
| models_multimodal.yaml | Multi-Modal Processor | mi325_1 |
| models_multimodal.yaml | Multi-Modal Models (Extended) 1 | mi325_1 |
| models_multimodal.yaml | Multi-Modal Models (Extended) 2 | mi250_1 |
| plugins.yaml | Plugin Tests (2 GPUs) | mi250_2 |
| quantization.yaml | Quantized Models Test | mi355_1 |
| samplers.yaml | Samplers Test | mi250_1 |

Total mirrored AMD tests: 30

Device Breakdown

| Device | Count |
| --- | --- |
| mi325_1 | 13 |
| mi250_1 | 10 |
| mi250_2 | 4 |
| mi325_4 | 1 |
| mi250_4 | 1 |
| mi355_1 | 1 |
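
The per-device counts can be recomputed from the "AMD Device" column of the mirrored-tests table. The following is a minimal sketch, not part of the PR: `tally_devices` is a hypothetical helper name, and the `devices` array was copied by hand from the table (one entry per row, grouped by YAML file).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): count how many entries name each device.
# Prints "<device> <count>", highest count first, ties ordered by device name.
tally_devices() {
    printf '%s\n' "$@" | sort | uniq -c | sort -k1,1nr -k2,2 | awk '{ print $2, $1 }'
}

# Device column of the mirrored-tests table, transcribed by hand.
devices=(
    mi250_1                                  # basic_correctness.yaml
    mi250_2 mi250_4 mi250_2                  # distributed.yaml
    mi325_1 mi325_1 mi325_1 mi250_2 mi325_4  # engine.yaml
    mi325_1 mi250_1 mi250_1                  # entrypoints.yaml
    mi250_1                                  # lora.yaml
    mi325_1 mi250_1                          # misc.yaml
    mi250_1                                  # models_basic.yaml
    mi325_1 mi250_1 mi325_1 mi250_1          # models_language.yaml
    mi325_1 mi325_1 mi325_1 mi325_1          # models_multimodal.yaml (Standard 1-4)
    mi325_1 mi325_1 mi250_1                  # models_multimodal.yaml (Processor, Extended 1-2)
    mi250_2                                  # plugins.yaml
    mi355_1                                  # quantization.yaml
    mi250_1                                  # samplers.yaml
)

tally_devices "${devices[@]}"
echo "total ${#devices[@]}"
```

Re-running the tally whenever a row is added or moved to another device keeps the breakdown and the total in step with the table.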

Dependencies:

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify bot added the ci/build label Mar 3, 2026
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request attempts to fix a quotation issue by stripping surrounding double quotes from a variable. However, the implementation introduces code duplication in two places within the re_quote_pytest_markers function. My review includes a suggestion to refactor this duplicated logic into a reusable shell function to improve code maintainability.

Comment on lines +211 to +214

```bash
    if [[ "$marker_buf" == '"'*'"' ]]; then
        marker_buf="${marker_buf#\"}"
        marker_buf="${marker_buf%\"}"
    fi
```

high

This logic for stripping surrounding quotes is duplicated later in the script (lines 253-256). To improve maintainability and avoid code duplication, consider extracting this logic into a reusable shell function.

For example, you could define a function at the start of re_quote_pytest_markers:

```bash
_strip_surrounding_quotes() {
    local s=$1
    if [[ "$s" == '"'*'"' ]]; then
        s="${s#\"}"
        s="${s%\"}"
    fi
    echo "$s"
}
```

Then you can replace this block and the duplicated one with a call to this function:

```bash
marker_buf=$(_strip_surrounding_quotes "$marker_buf")
```
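
To make the behavior of the proposed helper concrete, here is a small standalone sketch that exercises it on a few inputs. The sample strings are invented for illustration and do not come from the CI script. One deliberate difference from the suggestion above: this variant prints with `printf` instead of `echo`, an assumption made here so that a value such as `-n` is printed literally rather than being taken as an `echo` option.

```bash
#!/usr/bin/env bash
# Sketch of the reviewer's suggested helper, using printf instead of echo
# (a change made for this example only; the PR itself keeps the logic inline).
_strip_surrounding_quotes() {
    local s=$1
    # Strip only when the value both starts and ends with a double quote.
    if [[ "$s" == '"'*'"' ]]; then
        s="${s#\"}"   # drop the leading quote
        s="${s%\"}"   # drop the trailing quote
    fi
    printf '%s\n' "$s"
}

# Illustrative inputs (hypothetical, not taken from the CI script).
for raw in '"not slow"' 'not slow' '"unbalanced' '""' '-n'; do
    printf '[%s] -> [%s]\n' "$raw" "$(_strip_surrounding_quotes "$raw")"
done
```

Only values wrapped in a matching pair of double quotes are changed; a lone leading quote is left untouched. Because the result is captured with `$(...)`, any trailing newlines in the value are dropped by the command substitution.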

@AndreasKaratzas (Collaborator, Author) replied:

DRYing the code at this point is pointless. This script will, and has to, change heavily. We are simply putting some patches in place to accelerate the gating test process.

AndreasKaratzas added the ready and rocm labels and removed the ready label Mar 3, 2026
github-project-automation bot moved this to Todo in AMD Mar 3, 2026
AndreasKaratzas changed the title from "[DO NOT MERGE] Attempting to fix quotation" to "[ROCm][CI] Added MI325 mirrors (stage D)" Mar 5, 2026
AndreasKaratzas removed the ready label Mar 5, 2026
@AndreasKaratzas (Collaborator, Author) commented:

Manually cancelled the runs to save resources. Will re-enable ready label and CI runs when the aforementioned dependencies are resolved.

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify bot commented Mar 15, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 15, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
mergify bot removed the needs-rebase label Mar 15, 2026
AndreasKaratzas added the ready label Mar 15, 2026
AndreasKaratzas changed the title from "[ROCm][CI] Added MI325 mirrors (stage D)" to "[ROCm][CI] Revamping AMD mirrors" Mar 15, 2026
@AndreasKaratzas (Collaborator, Author) commented:

Observed 2 failures:

The above comments were made based on: https://buildkite.com/vllm/ci/builds/56267/steps/canvas

@AndreasKaratzas (Collaborator, Author) commented:

Update on AMD: Language Models Test (Extended Pooling) (mi250_1)

Looks like it was an intermittent failure: https://buildkite.com/vllm/amd-ci/builds/6559/steps/canvas?sid=019cf53b-06a0-4731-aa2a-420b95f27459&tab=output

I will keep monitoring this in the upcoming nightlies and in the upcoming re-evaluations of this PR.

@AndreasKaratzas (Collaborator, Author) commented Mar 16, 2026

I also put a PR up to stabilize the RemoteOpenAIServer class:

mergify bot commented Mar 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 18, 2026