[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers #38165
…containers Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces dynamic detection of the GPU architecture using rocm_agent_enumerator to set the PYTORCH_ROCM_ARCH environment variable for Docker containers, optimizing JIT compilation. A potential issue was identified where the grep -v command in the architecture detection pipeline could cause the script to exit prematurely under certain shell configurations, and a more robust approach was suggested.
# Detect the actual GPU architecture on the host so that runtime JIT
# compilation (e.g. Quark kernels) only targets the present hardware
# instead of every architecture baked into the Docker image.
RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | grep -v 'gfx000' | sort -u | paste -sd ';')
This command pipeline can cause the script to exit prematurely if set -e and set -o pipefail are enabled. grep -v exits with status 1 when it selects no lines (e.g., if rocm_agent_enumerator prints only gfx000 or prints nothing). With pipefail, that non-zero status becomes the status of the entire pipeline, and set -e then terminates the script at the assignment.
To make this more robust, ensure that grep's "no lines selected" status does not fail the pipeline. The subsequent if statement can then handle an empty architecture list gracefully.
Suggested change:
- RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | grep -v 'gfx000' | sort -u | paste -sd ';')
+ RUNTIME_ROCM_ARCH=$(/opt/rocm/bin/rocm_agent_enumerator 2>/dev/null | (grep -v 'gfx000' || true) | sort -u | paste -sd ';')
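The failure mode described in this review comment can be reproduced without a GPU. The sketch below is illustrative only: `fake_enumerator` is a hypothetical stand-in for `rocm_agent_enumerator` on a host that reports only the CPU agent, and the `if` wrapper exists solely to capture the exit status without letting `set -e` end the demonstration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for rocm_agent_enumerator: reports only the CPU agent.
fake_enumerator() { printf 'gfx000\n'; }

# Unguarded pipeline: grep -v selects no lines and exits 1; pipefail makes
# that the status of the whole pipeline. Running it as an `if` condition
# lets us record the status instead of aborting under set -e.
if unguarded=$(fake_enumerator | grep -v 'gfx000' | sort -u | paste -sd ';'); then
  unguarded_status=0
else
  unguarded_status=$?
fi

# Guarded pipeline: the subshell turns grep's "no lines selected" into success.
guarded=$(fake_enumerator | (grep -v 'gfx000' || true) | sort -u | paste -sd ';')
guarded_status=$?

echo "unguarded status: ${unguarded_status}"
echo "guarded status: ${guarded_status}, value: '${guarded}'"
```

One trade-off of `|| true` is that it also hides genuine grep errors (exit status 2), which seems acceptable here because an empty result is handled by the later check.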
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Build is looking healthy.

Why don't you just unset the PYTORCH_ROCM_ARCH for the tests if it'll make quark behave? It's only used for the build really I think.
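For comparison, the alternative raised in this comment could look roughly like the following. This is a sketch under assumptions, not what the PR does: it uses `env -u` to drop the variable for a single child process, and the exported value merely simulates the one inherited from the image. Whether Quark's auto-detection then picks the right architecture is not verified here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Simulate the build-time value that the Docker image leaves in the environment.
export PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950'

# env -u removes the variable for the child process only; the child prints
# a placeholder when the variable is absent from its environment.
seen_by_child=$(env -u PYTORCH_ROCM_ARCH sh -c 'echo "${PYTORCH_ROCM_ARCH-<unset>}"')

echo "child sees: ${seen_by_child}"
echo "parent still has: ${PYTORCH_ROCM_ARCH}"
```

In a test container the same idea would mean prefixing the test command with `env -u PYTORCH_ROCM_ARCH`, leaving the image itself unchanged.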
…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: iamvastava <iamvastava@gmail.com>
…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
…containers (vllm-project#38165) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
The base Docker image sets PYTORCH_ROCM_ARCH to all supported architectures for build-time compilation. This value persists as a runtime env var. When Quark's JIT kernel compilation runs during tests, it picks up this broad list and compiles for all architectures instead of just the GPU present on the machine.

Quark has auto-detection logic (set_rocm_user_architecture) but it only activates when PYTORCH_ROCM_ARCH is unset. Since the Docker image always sets it, auto-detection is skipped.

Change

In run-amd-test.sh, detect the host GPU arch using rocm_agent_enumerator (filtering out gfx000, which represents the CPU) and override PYTORCH_ROCM_ARCH via docker run -e. Falls back to gfx90a;gfx942;gfx950 if detection fails.

cc @kenroche
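The detect-then-fall-back flow described under Change can be sketched as follows. This is a minimal illustration, not the code in run-amd-test.sh: `detect_rocm_arch` and `FALLBACK_ROCM_ARCH` are hypothetical names, and the enumerator path is made a parameter only so the logic can be exercised on a machine without ROCm.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fallback list taken from the PR description.
FALLBACK_ROCM_ARCH='gfx90a;gfx942;gfx950'

# Hypothetical helper (not the function in run-amd-test.sh).
# $1: enumerator command; defaults to the path used in the PR.
detect_rocm_arch() {
  local enumerator="${1:-/opt/rocm/bin/rocm_agent_enumerator}"
  local archs
  # The inner "|| true" covers grep selecting no lines; the outer one covers
  # a missing or failing enumerator under set -e / pipefail.
  archs=$("$enumerator" 2>/dev/null | (grep -v 'gfx000' || true) | sort -u | paste -sd ';') || true
  if [ -z "$archs" ]; then
    archs="$FALLBACK_ROCM_ARCH"
  fi
  printf '%s\n' "$archs"
}

RUNTIME_ROCM_ARCH=$(detect_rocm_arch)
echo "PYTORCH_ROCM_ARCH=${RUNTIME_ROCM_ARCH}"
# The value would then be handed to the container, for example:
#   docker run -e "PYTORCH_ROCM_ARCH=${RUNTIME_ROCM_ARCH}" ...
```

Declaring `archs` with `local` on its own line, rather than `local archs=$(...)`, keeps the assignment's exit status visible instead of being masked by `local`.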