[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers by AndreasKaratzas · Pull Request #38165 · vllm-project/vllm

AndreasKaratzas · 2026-03-26T00:11:46Z

The base Docker image sets PYTORCH_ROCM_ARCH to all supported architectures for build-time compilation. This value persists as a runtime env var. When Quark's JIT kernel compilation runs during tests, it picks up this broad list and compiles for all architectures instead of just the GPU present on the machine.

Quark has auto-detection logic (set_rocm_user_architecture) but it only activates when PYTORCH_ROCM_ARCH is unset. Since the Docker image always sets it, auto-detection is skipped.

Change

In run-amd-test.sh, detect the host GPU arch using rocm_agent_enumerator (filtering out gfx000, which represents the CPU) and override PYTORCH_ROCM_ARCH via docker run -e. The script falls back to gfx90a;gfx942;gfx950 if detection fails.
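A minimal sketch of the detection and fallback described above, not the script's actual code. The function names normalize_arch_list and detect_rocm_arch and the FALLBACK_ROCM_ARCH variable are hypothetical; the enumerator path, the gfx000 filter and the fallback list come from this PR.

```shell
#!/usr/bin/env bash

# Fallback list taken from the PR description.
FALLBACK_ROCM_ARCH='gfx90a;gfx942;gfx950'

# Hypothetical helper: read one agent name per line on stdin and emit a
# deduplicated, semicolon-separated list without the gfx000 (CPU) entry.
normalize_arch_list() {
  # '|| true' keeps grep's "no lines selected" status from failing the pipeline.
  { grep -v 'gfx000' || true; } | sort -u | paste -sd ';' -
}

# Hypothetical helper: print the detected arch list, or the fallback when the
# enumerator is missing, fails, or reports no GPU agents.
detect_rocm_arch() {
  local enumerator="${1:-/opt/rocm/bin/rocm_agent_enumerator}"
  local arch=''
  if [ -x "$enumerator" ]; then
    arch=$("$enumerator" 2>/dev/null | normalize_arch_list) || arch=''
  fi
  printf '%s\n' "${arch:-$FALLBACK_ROCM_ARCH}"
}
```

The result would then be passed to the container with something like docker run -e "PYTORCH_ROCM_ARCH=$(detect_rocm_arch)" followed by the image and test command, both of which are left out here.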

cc @kenroche

gemini-code-assist

Code Review

This pull request introduces dynamic detection of the GPU architecture using rocm_agent_enumerator to set the PYTORCH_ROCM_ARCH environment variable for Docker containers, optimizing JIT compilation. A potential issue was identified where the grep -v command in the architecture detection pipeline could cause the script to exit prematurely under certain shell configurations, and a more robust approach was suggested.

gemini-code-assist · 2026-03-26T00:14:59Z

+  # Detect the actual GPU architecture on the host so that runtime JIT
+  # compilation (e.g. Quark kernels) only targets the present hardware
+  # instead of every architecture baked into the Docker image.
+  RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | grep -v 'gfx000' | sort -u | paste -sd ';')


This command pipeline can cause the script to exit prematurely if set -e and set -o pipefail are enabled. grep -v exits with status 1 when it selects no lines (e.g., if rocm_agent_enumerator returns only gfx000 or produces no output). With pipefail, this would cause the entire pipeline to fail, and set -e would terminate the script.

To make this more robust, you can ensure the grep command doesn't cause the pipeline to fail on 'no match'. This allows the subsequent if statement to handle an empty architecture list gracefully.

Suggested change

-  RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | grep -v 'gfx000' | sort -u | paste -sd ';')
+  RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | (grep -v 'gfx000' || true) | sort -u | paste -sd ';')
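The exit-status difference between the two pipelines can be reproduced without ROCm installed. In the sketch below, cpu_only is a hypothetical stand-in for an enumerator that reports only the CPU agent, and set -e is deliberately left off so that both cases can be recorded in one run.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical stand-in for an enumerator that reports only the CPU agent.
cpu_only() { printf 'gfx000\n'; }

# Unguarded: grep -v selects no lines and exits 1; pipefail propagates that
# status to the whole pipeline even though sort and paste succeed.
unguarded_status=0
unguarded=$(cpu_only | grep -v 'gfx000' | sort -u | paste -sd ';' -) || unguarded_status=$?

# Guarded: the subshell turns grep's status into 0, so the pipeline succeeds
# and the caller can handle the empty result itself.
guarded_status=0
guarded=$(cpu_only | (grep -v 'gfx000' || true) | sort -u | paste -sd ';' -) || guarded_status=$?

echo "unguarded: status=$unguarded_status value='$unguarded'"
echo "guarded:   status=$guarded_status value='$guarded'"
```

With set -e enabled and no trailing || handler, the unguarded assignment would end the script at that line, which is the failure mode the review describes.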

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

AndreasKaratzas · 2026-03-26T09:06:20Z

Build is looking healthy.

gshtras · 2026-03-26T16:48:10Z

Why don't you just unset the PYTORCH_ROCM_ARCH for the tests if it'll make quark behave? It's only used for the build really I think.
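One way the unsetting suggested here could look, as a sketch only: the thread does not show how the merged commit does it. The wrapper name run_without_rocm_arch is hypothetical, and the exported value only simulates what the image would provide.

```shell
#!/usr/bin/env bash

# Simulate the value inherited from the image's build-time settings.
export PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950'

# Hypothetical wrapper: run a command with the variable removed from its
# environment, leaving the calling shell untouched.
run_without_rocm_arch() {
  env -u PYTORCH_ROCM_ARCH "$@"
}

# The child process sees the variable as unset; the parent still has it.
child_view=$(run_without_rocm_arch sh -c 'echo "${PYTORCH_ROCM_ARCH-<unset>}"')
echo "child:  $child_view"
echo "parent: $PYTORCH_ROCM_ARCH"
```

Inside a container, running unset PYTORCH_ROCM_ARCH at the start of the test command would have a comparable effect. An empty value is not the same as an unset variable, and the thread does not say whether Quark's auto-detection treats the two alike.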

…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: iamvastava <iamvastava@gmail.com>

…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>

…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Rishi Puri <riship@nvidia.com>

[CI][ROCm] Override PYTORCH_ROCM_ARCH with detected GPU arch in test …

7a57ff6

…containers Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas added the ready and rocm labels Mar 26, 2026

github-project-automation bot added this to AMD Mar 26, 2026

github-project-automation bot moved this to Todo in AMD Mar 26, 2026

mergify bot added the ci/build label Mar 26, 2026

gemini-code-assist bot reviewed Mar 26, 2026

Detect host GPU arch and pass PYTORCH_ROCM_ARCH to test container

603e23e

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

AndreasKaratzas marked this pull request as ready for review March 26, 2026 09:06

claude bot reviewed Mar 26, 2026

AndreasKaratzas requested review from DarkLight1337 and tjtanaa March 26, 2026 09:12

AndreasKaratzas added 2 commits March 26, 2026 12:03

Merge remote-tracking branch 'origin/main' into akaratza_fix_quark_ar…

da9d78c

…ch_ci_target

Unsetting arch instead

2997d7b

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

gshtras approved these changes Mar 26, 2026

gshtras enabled auto-merge (squash) March 26, 2026 17:44

gshtras merged commit bff9a1c into vllm-project:main Mar 26, 2026
13 of 14 checks passed

github-project-automation bot moved this from Todo to Done in AMD Mar 26, 2026

AndreasKaratzas deleted the akaratza_fix_quark_arch_ci_target branch March 26, 2026 18:37

RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026

[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test …

59f3fd0

…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026

[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test …

839a2b3

…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

mergify bot added the intel-gpu label Mar 31, 2026

mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026

[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test …

4ed07a9

…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

gshtras merged 4 commits into vllm-project:main from ROCm:akaratza_fix_quark_arch_ci_target
