Skip to content
6 changes: 5 additions & 1 deletion .github/configs/amd-master.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -338,7 +338,7 @@ kimik2.5-int4-mi325x-vllm:
- { tp: 8, conc-start: 4, conc-end: 64 }

kimik2.5-fp4-mi355x-vllm:
image: vllm/vllm-openai-rocm:v0.16.0
image: vllm/vllm-openai-rocm:v0.18.0
model: amd/Kimi-K2.5-MXFP4
model-prefix: kimik2.5
runner: mi355x
Expand All @@ -350,14 +350,18 @@ kimik2.5-fp4-mi355x-vllm:
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 64 }
- { tp: 4, conc-start: 4, conc-end: 64 }
- isl: 1024
osl: 8192
search-space:
- { tp: 8, conc-start: 4, conc-end: 64 }
- { tp: 4, conc-start: 4, conc-end: 64 }
- { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 64 }
- { tp: 4, conc-start: 4, conc-end: 64 }

minimaxm2.5-fp8-mi355x-vllm:
image: vllm/vllm-openai-rocm:v0.18.0
Expand Down
33 changes: 27 additions & 6 deletions benchmarks/single_node/kimik2.5_fp4_mi355x.sh
Original file line number Diff line number Diff line change
Expand Up @@ -31,9 +31,29 @@ fi
SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# do not enable aiter due to Aiter MLA not currently supporting num_heads=8
# https://github.com/vllm-project/vllm/issues/35641
# export VLLM_ROCM_USE_AITER=1
# If the machine runs a MEC FW older than 177, RCCL
# cannot reclaim some memory.
# Disable that features to avoid crashes.
# This is related to the changes in the driver at:
# https://rocm.docs.amd.com/en/docs-6.4.3/about/release-notes.html#amdgpu-driver-updates
version=`rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}'`
if [[ "$version" == "" || $version -lt 177 ]]; then
export HSA_NO_SCRATCH_RECLAIM=1
fi
Comment on lines +39 to +42
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The MEC firmware version check at line 40 fails to set HSA_NO_SCRATCH_RECLAIM=1 when rocm-smi returns a non-numeric string (e.g., N/A): the empty-string guard is false, and bash arithmetic comparison -lt on a non-integer operand inside [[ ]] returns false, so the short-circuit OR is false overall and the safety flag is never set. Since HSA_NO_SCRATCH_RECLAIM=1 prevents crashes on older firmware, the safe default for an indeterminate version should be to set it; fix by adding a regex guard: [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]].

Extended reasoning...

Bug Analysis

The newly added MEC firmware version check at benchmarks/single_node/kimik2.5_fp4_mi355x.sh:39-42 reads:

version=$(rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}')
if [[ "$version" == "" || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

How the Bug Manifests

The condition uses short-circuit OR with two branches: "$version" == "" handles the empty case, and $version -lt 177 handles the numeric comparison. A third case is unhandled: when version is a non-empty, non-numeric string such as N/A or unknown — common sentinels in ROCm tooling.

Inside bash [[ ]], the -lt operator performs arithmetic comparison. When the left-hand operand is a non-integer string, bash either silently coerces it to 0 or produces an error and returns false. The verifiers confirmed that for non-integer strings the comparison returns false in practice.

The Specific Code Path

  1. rocm-smi --showfw outputs a firmware table; grep MEC | head -n 1 | awk '{print $NF}' extracts the last field of the first MEC line.
  2. If that field is N/A, version is set to N/A.
  3. [[ "N/A" == "" ]] evaluates to false (not empty).
  4. [[ N/A -lt 177 ]] — bash arithmetic on a non-integer returns false.
  5. Short-circuit OR: false || falsefalse — the if body is skipped.
  6. HSA_NO_SCRATCH_RECLAIM is never exported.

Why Existing Code Does Not Prevent It

The empty-string guard correctly covers the case where awk produces no output. It does not cover non-empty, non-numeric values. There is no type check or regex validation before using -lt on $version.

Impact

On systems where rocm-smi --showfw returns a non-numeric firmware version field, HSA_NO_SCRATCH_RECLAIM=1 will silently not be set. Per the script comment, this flag is needed on MEC firmware older than version 177 to prevent RCCL scratch-memory reclaim crashes. A machine with old firmware but a non-numeric version string in rocm-smi output would be left in the crash-prone configuration without any warning. The practical risk is low since rocm-smi output is normally numeric on production hardware, but the violated fail-safe assumption is a real defect.

Step-by-Step Proof

  1. rocm-smi --showfw on a specific machine outputs a MEC line whose last field is N/A.
  2. version is assigned the string N/A.
  3. [[ "N/A" == "" ]] evaluates to false.
  4. [[ N/A -lt 177 ]] — bash arithmetic comparison with non-integer — evaluates to false.
  5. Entire OR condition is false; if body is not entered.
  6. HSA_NO_SCRATCH_RECLAIM remains unset, leaving potential for crash on older firmware.

Recommended Fix

if [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

The added ! "$version" =~ ^[0-9]+$ clause ensures that any non-numeric (indeterminate) version string causes the safety flag to be set, correctly implementing the fail-safe default.

Comment on lines +40 to +42
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The MEC firmware version check at line 41 fails silently when rocm-smi returns a non-numeric string (e.g., N/A): the arithmetic comparison $version -lt 177 errors in bash [[ ]] context, causing the overall OR condition to return false and leaving HSA_NO_SCRATCH_RECLAIM unset. Since the comment explains this flag prevents crashes on older/unknown firmware, the safe default for indeterminate versions should be to set the flag — add a regex guard like ! "$version" =~ ^[0-9]+$ as the middle condition.

Extended reasoning...

The firmware version check is intended to protect against crashes on older MEC firmware by setting HSA_NO_SCRATCH_RECLAIM=1. The check reads:

version=`rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}'`
if [[ "$version" == "" || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

How the bug manifests: If rocm-smi --showfw outputs a non-numeric last field for the MEC line — for example N/A, unknown, or a version string with a non-numeric suffix — the empty-string guard passes (version is not empty), but the second operand $version -lt 177 attempts a bash integer arithmetic comparison on a non-integer. Inside [[ ]], this silently returns false (bash evaluates it as an error/false rather than raising a fatal error). Due to short-circuit OR semantics, the overall condition is false, so HSA_NO_SCRATCH_RECLAIM is never exported.

Why existing code doesn't prevent it: The empty string check "$version" == "" only handles the case where awk produces no output. It does not guard against non-integer strings like N/A that awk could reasonably extract as the last field of rocm-smi output. Bash integer comparison $var -lt N inside [[ ]] does NOT raise an error that propagates — it simply evaluates to false, making the entire OR condition false.

Concrete proof: Suppose rocm-smi --showfw outputs a line like ... MEC ... N/A. After awk '{print $NF}', version=N/A. Then:

  1. [[ "N/A" == "" ]] → false (short-circuit does NOT skip second condition)
  2. [[ N/A -lt 177 ]] → bash arithmetic error, condition evaluates to false
  3. Overall [[ false || false ]] → false
  4. HSA_NO_SCRATCH_RECLAIM=1 is NOT set

The machine may then experience the RCCL memory reclaim crashes the flag was meant to prevent.

Impact: When firmware version is indeterminate (non-numeric), the workaround is silently skipped. The safe/fail-safe behavior should be to apply the workaround when the version cannot be confirmed to be ≥ 177.

Fix: Add a numeric guard as the middle condition:

if [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

This ensures: empty string → set flag, non-numeric string → set flag, numeric < 177 → set flag, numeric ≥ 177 → do not set flag.


export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

Comment thread
seungrokj marked this conversation as resolved.
# Disable AITER RMSNorm for TP < 8 due to accuracy issues
if [ "${TP}" -lt 8 ]; then
export VLLM_ROCM_USE_AITER_RMSNORM=0
fi

if [ "${EP_SIZE:-0}" -gt 1 ]; then
EP=" --enable-expert-parallel"
else
EP=" "
fi

# following AMD andy luo's recipe
# https://x.com/linluo77/status/2017024513595301985
Expand All @@ -44,10 +64,11 @@ start_gpu_monitor
set -x
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
--gpu-memory-utilization 0.95 \
$EP \
--gpu-memory-utilization 0.90 \
--max-model-len $MAX_MODEL_LEN \
--block-size=64 \
--disable-log-requests \
--block-size=1 \
--no-enable-prefix-caching \
--trust-remote-code \
--mm-encoder-tp-mode data > $SERVER_LOG 2>&1 &
Comment thread
cquil11 marked this conversation as resolved.

Expand Down
9 changes: 9 additions & 0 deletions perf-changelog.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1068,3 +1068,12 @@
- "dsr1-fp8-h200-sglang: v0.5.9-cu129-amd64 → v0.5.9-cu130"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/943

- config-keys:
- kimik2.5-fp4-mi355x-vllm
description:
- "Upgrade vLLM ROCm image from v0.16.0 to v0.18.0"
- "Enable AITER with INT4 quick reduce; disable AITER RMSNorm for TP < 8 (accuracy)"
- "Add expert parallel, TP4, and TP4/EP4 search spaces"
- "Switch block-size 64 to 1 gpu-memory-utilization 0.95 to 0.90"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/936

Loading