[AMD] Optimize Kimi-K2.5-MXFP4 on MI355X: Enable AITER, Expert Parallel, and update to vLLM v0.18.0 #936
Changes from all commits
87f3c74
d26c7c9
26b127b
8e2adec
4dfed36
a720974
a2d37a7
5e221bc
366731e
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,9 +31,29 @@ fi | |
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
|
|
||
| # do not enable aiter due to Aiter MLA not currently supporting num_heads=8 | ||
| # https://github.com/vllm-project/vllm/issues/35641 | ||
| # export VLLM_ROCM_USE_AITER=1 | ||
| # If the machine runs a MEC FW older than 177, RCCL | ||
| # cannot reclaim some memory. | ||
| # Disable that features to avoid crashes. | ||
| # This is related to the changes in the driver at: | ||
| # https://rocm.docs.amd.com/en/docs-6.4.3/about/release-notes.html#amdgpu-driver-updates | ||
| version=`rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}'` | ||
| if [[ "$version" == "" || $version -lt 177 ]]; then | ||
| export HSA_NO_SCRATCH_RECLAIM=1 | ||
| fi | ||
|
Comment on lines +40 to +42
Contributor
🟡 The MEC firmware version check at line 41 fails silently when rocm-smi reports a non-numeric version string.
Extended reasoning...
The firmware version check is intended to protect against crashes on older MEC firmware by setting HSA_NO_SCRATCH_RECLAIM=1:
version=`rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}'`
if [[ "$version" == "" || $version -lt 177 ]]; then
export HSA_NO_SCRATCH_RECLAIM=1
fi
How the bug manifests: if the extracted field is a non-empty, non-numeric string such as N/A, the -lt comparison returns false and the if body is skipped.
Why existing code doesn't prevent it: the empty-string check covers only the case where no version is extracted at all.
Concrete proof: suppose version is N/A. [[ "N/A" == "" ]] is false and [[ N/A -lt 177 ]] is false, so HSA_NO_SCRATCH_RECLAIM is never exported.
The machine may then experience the RCCL memory reclaim crashes the flag was meant to prevent.
Impact: when the firmware version is indeterminate (non-numeric), the workaround is silently skipped. The fail-safe behavior should be to apply the workaround whenever the version cannot be confirmed to be ≥ 177.
Fix: add a numeric guard as the middle condition:
if [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]]; then
export HSA_NO_SCRATCH_RECLAIM=1
fi
This ensures: empty string → set flag, non-numeric string → set flag, numeric < 177 → set flag, numeric ≥ 177 → do not set flag.
||
|
|
||
| export VLLM_ROCM_USE_AITER=1 | ||
| export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 | ||
|
|
||
|
seungrokj marked this conversation as resolved.
|
||
| # Disable AITER RMSNorm for TP < 8 due to accuracy issues | ||
| if [ "${TP}" -lt 8 ]; then | ||
| export VLLM_ROCM_USE_AITER_RMSNORM=0 | ||
| fi | ||
|
|
||
| if [ "${EP_SIZE:-0}" -gt 1 ]; then | ||
| EP=" --enable-expert-parallel" | ||
| else | ||
| EP=" " | ||
| fi | ||
|
|
||
| # following AMD andy luo's recipe | ||
| # https://x.com/linluo77/status/2017024513595301985 | ||
|
|
@@ -44,10 +64,11 @@ start_gpu_monitor | |
| set -x | ||
| vllm serve $MODEL --port $PORT \ | ||
| --tensor-parallel-size=$TP \ | ||
| --gpu-memory-utilization 0.95 \ | ||
| $EP \ | ||
| --gpu-memory-utilization 0.90 \ | ||
| --max-model-len $MAX_MODEL_LEN \ | ||
| --block-size=64 \ | ||
| --disable-log-requests \ | ||
| --block-size=1 \ | ||
| --no-enable-prefix-caching \ | ||
| --trust-remote-code \ | ||
| --mm-encoder-tp-mode data > $SERVER_LOG 2>&1 & | ||
|
cquil11 marked this conversation as resolved.
🟡 The MEC firmware version check at line 40 fails to set HSA_NO_SCRATCH_RECLAIM=1 when rocm-smi returns a non-numeric string (e.g., N/A): the empty-string guard is false, and bash arithmetic comparison -lt on a non-integer operand inside [[ ]] returns false, so the short-circuit OR is false overall and the safety flag is never set. Since HSA_NO_SCRATCH_RECLAIM=1 prevents crashes on older firmware, the safe default for an indeterminate version should be to set it; fix by adding a regex guard: [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]].
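The behavior described above can be reproduced without rocm-smi. The following is a minimal sketch: original_check is a hypothetical helper that wraps the condition from the script, and the sample version strings are assumptions chosen for illustration, not captured rocm-smi output.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): wraps the original condition so it
# can be probed with sample values. Returns 0 when the flag would be exported.
original_check() {
    local version=$1
    # 2>/dev/null hides the arithmetic error bash prints for inputs like N/A.
    [[ "$version" == "" || $version -lt 177 ]] 2>/dev/null
}

# Sample inputs are assumptions: empty, old, threshold, new, and two
# non-numeric strings.
for v in "" 150 177 200 "N/A" "1.2.3"; do
    if original_check "$v"; then
        echo "'$v': HSA_NO_SCRATCH_RECLAIM would be set"
    else
        echo "'$v': HSA_NO_SCRATCH_RECLAIM would NOT be set"
    fi
done
```

With bash, N/A and 1.2.3 land in the same branch as 177 and 200: the comparison errors out, returns false, and the workaround is skipped even though the firmware version was never confirmed.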
Extended reasoning...
Bug Analysis
The newly added MEC firmware version check at benchmarks/single_node/kimik2.5_fp4_mi355x.sh:39-42 uses the condition [[ "$version" == "" || $version -lt 177 ]].
How the Bug Manifests
The condition uses short-circuit OR with two branches: "$version" == "" handles the empty case, and $version -lt 177 handles the numeric comparison. A third case is unhandled: when version is a non-empty, non-numeric string such as N/A, a common sentinel in ROCm tooling.
Inside bash [[ ]], the -lt operator evaluates both operands as arithmetic expressions. A non-integer operand is therefore handled in one of two ways. A bare word such as unknown is read as an unset variable name and coerced to 0, which makes 0 -lt 177 true, so the flag happens to be set. A value such as N/A (parsed as N divided by A, a division by zero) or a dotted version string raises an arithmetic error, and the comparison returns false. The verifiers confirmed that the comparison returns false in practice for values such as N/A.
The Specific Code Path
1. rocm-smi --showfw outputs a firmware table; grep MEC | head -n 1 | awk '{print $NF}' extracts the last field of the first MEC line.
2. If that field is N/A, version is set to N/A.
3. [[ "N/A" == "" ]] evaluates to false (not empty).
4. [[ N/A -lt 177 ]] evaluates to false, because bash arithmetic fails on the non-integer operand.
5. false || false → false, so the if body is skipped.
6. HSA_NO_SCRATCH_RECLAIM is never exported.
Why Existing Code Does Not Prevent It
The empty-string guard correctly covers the case where awk produces no output. It does not cover non-empty, non-numeric values. There is no type check or regex validation before using -lt on $version.
Impact
On systems where rocm-smi --showfw returns a non-numeric firmware version field, HSA_NO_SCRATCH_RECLAIM=1 will silently not be set. Per the script comment, this flag is needed on MEC firmware older than version 177 to prevent RCCL scratch-memory reclaim crashes. A machine with old firmware but a non-numeric version string in rocm-smi output would be left in the crash-prone configuration without any warning. The practical risk is low since rocm-smi output is normally numeric on production hardware, but the violated fail-safe assumption is a real defect.
Step-by-Step Proof
1. rocm-smi --showfw on a specific machine outputs a MEC line whose last field is N/A.
2. version is assigned the string N/A.
3. [[ "N/A" == "" ]] evaluates to false.
4. [[ N/A -lt 177 ]], a bash arithmetic comparison with a non-integer operand, evaluates to false.
5. The if body is not entered.
6. HSA_NO_SCRATCH_RECLAIM remains unset, leaving potential for a crash on older firmware.
Recommended Fix
Change the condition to [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]]. The added ! "$version" =~ ^[0-9]+$ clause ensures that any non-numeric (indeterminate) version string causes the safety flag to be set, correctly implementing the fail-safe default.
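One way to package the fail-safe check is sketched below. The function name needs_scratch_workaround and the min_fw variable are hypothetical and not part of the PR; the threshold of 177 and the rocm-smi pipeline are taken from the script under review.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): returns 0 when
# HSA_NO_SCRATCH_RECLAIM=1 should be exported, 1 otherwise.
needs_scratch_workaround() {
    local version=$1
    local min_fw=177   # threshold from the script comment

    # Empty or non-numeric: the version cannot be confirmed, so fail safe.
    # An empty string does not match ^[0-9]+$, so it is covered here as well.
    if [[ ! "$version" =~ ^[0-9]+$ ]]; then
        return 0
    fi

    # Numeric: force base 10 so a leading zero is not read as octal.
    (( 10#$version < min_fw ))
}

# Usage, mirroring the pipeline from the script. If rocm-smi is missing or
# prints nothing, version is empty and the workaround is applied.
version=$(rocm-smi --showfw 2>/dev/null | grep MEC | head -n 1 | awk '{print $NF}')
if needs_scratch_workaround "$version"; then
    export HSA_NO_SCRATCH_RECLAIM=1
fi
```

Validating the string before any arithmetic keeps the decision independent of how bash evaluates non-integer operands, so every indeterminate value takes the same fail-safe path.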