Skip to content

[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers#38165

Merged
gshtras merged 4 commits intovllm-project:mainfrom
ROCm:akaratza_fix_quark_arch_ci_target
Mar 26, 2026
Merged

[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers#38165
gshtras merged 4 commits intovllm-project:mainfrom
ROCm:akaratza_fix_quark_arch_ci_target

Conversation

@AndreasKaratzas
Copy link
Copy Markdown
Collaborator

The base Docker image sets PYTORCH_ROCM_ARCH to all supported architectures for build-time compilation. This value persists as a runtime env var. When Quark's JIT kernel compilation runs during tests, it picks up this broad list and compiles for all architectures instead of just the GPU present on the machine.

Quark has auto-detection logic (set_rocm_user_architecture) but it only activates when PYTORCH_ROCM_ARCH is unset. Since the Docker image always sets it, auto-detection is skipped.

Change

In run-amd-test.sh, detect the host GPU arch using rocm_agent_enumerator (filtering out gfx000 which represents the CPU) and override PYTORCH_ROCM_ARCH via docker run -e. Falls back to gfx90a;gfx942;gfx950 if detection fails.

cc @kenroche

…containers

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas added ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm labels Mar 26, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 26, 2026
@mergify mergify bot added the ci/build label Mar 26, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces dynamic detection of the GPU architecture using rocm_agent_enumerator to set the PYTORCH_ROCM_ARCH environment variable for Docker containers, optimizing JIT compilation. A potential issue was identified where the grep -v command in the architecture detection pipeline could cause the script to exit prematurely under certain shell configurations, and a more robust approach was suggested.

# Detect the actual GPU architecture on the host so that runtime JIT
# compilation (e.g. Quark kernels) only targets the present hardware
# instead of every architecture baked into the Docker image.
RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | grep -v 'gfx000' | sort -u | paste -sd ';')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This command pipeline can cause the script to exit prematurely if set -e and set -o pipefail are enabled. grep -v exits with status 1 if no non-matching lines are found (e.g., if rocm_agent_enumerator only returns gfx000 or is empty). With pipefail, this would cause the entire pipeline to fail, and set -e would terminate the script.

To make this more robust, you can ensure the grep command doesn't cause the pipeline to fail on 'no match'. This allows the subsequent if statement to handle an empty architecture list gracefully.

Suggested change
RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | grep -v 'gfx000' | sort -u | paste -sd ';')
RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | (grep -v 'gfx000' || true) | sort -u | paste -sd ';')

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 26, 2026 09:06
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@AndreasKaratzas
Copy link
Copy Markdown
Collaborator Author

Build is looking healthy.

@gshtras
Copy link
Copy Markdown
Collaborator

gshtras commented Mar 26, 2026

Why don't you just unset the PYTORCH_ROCM_ARCH for the tests if it'll make quark behave? It's only used for the build really I think.

@gshtras gshtras enabled auto-merge (squash) March 26, 2026 17:44
@gshtras gshtras merged commit bff9a1c into vllm-project:main Mar 26, 2026
13 of 14 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 26, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_fix_quark_arch_ci_target branch March 26, 2026 18:37
asrvastava pushed a commit to asrvastava/vllm that referenced this pull request Mar 26, 2026
…containers (vllm-project#38165)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: iamvastava <iamvastava@gmail.com>
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
…containers (vllm-project#38165)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…containers (vllm-project#38165)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…containers (vllm-project#38165)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@mergify mergify bot added the intel-gpu Related to Intel GPU label Mar 31, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…containers (vllm-project#38165)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…containers (vllm-project#38165)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build intel-gpu Related to Intel GPU ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants