[AMD] Add GLM-5.1-FP8 nightly accuracy and performance benchmarks for MI30x and MI35x#22336
Conversation
Add `bench_one_batch` perf tests and GSM8K accuracy tests for GLM-5.1-FP8 (MoE, 754B) with the NSA attention backend on both MI30x and MI35x. GLM-5.1 uses the `glm_moe_dsa` architecture requiring TP=8 + EP=8, matching the MiniMax-M2.5 expert parallelism pattern.

- New: `test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py` (suite: `nightly-amd-accuracy-8-gpu-glm51`)
- New: `test/registered/amd/accuracy/mi35x/test_glm51_eval_mi35x.py` (suite: `nightly-amd-8-gpu-mi35x-glm51`)
- New: `test/registered/amd/perf/mi30x/test_glm51_perf_amd.py` (suite: `nightly-perf-8-gpu-glm51`)
- New: `test/registered/amd/perf/mi35x/test_glm51_perf_mi35x.py` (suite: `nightly-perf-8-gpu-mi35x-glm51`)
- Modified: `nightly-test-amd.yml` and `nightly-test-amd-rocm720.yml` with GLM-5.1 jobs (accuracy + perf in the same job)

Server config:
- Model: `zai-org/GLM-5.1-FP8` with `--tp 8 --ep-size 8`
- NSA: `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang`
- Parsers: `--reasoning-parser=glm45 --tool-call-parser=glm47`
- Perf: `--kv-cache-dtype fp8_e4m3`, `--mem-fraction-static 0.85`
- MI35x perf adds `SGLANG_ROCM_FUSED_DECODE_MLA=0`, `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`, `SAFETENSORS_FAST_GPU=1`

Workflow: accuracy has no `continue-on-error` (failure skips perf); perf has `continue-on-error: true` (perf failures don't block CI).
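The flags above assemble into roughly the following launch command (a sketch inferred from the listed config; the actual tests drive the server through the registered test harness, so the exact invocation may differ):

```shell
# Sketch of the perf-test server launch implied by the config above;
# the registered tests wrap this rather than invoking the CLI directly.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5.1-FP8 \
  --tp 8 --ep-size 8 \
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang \
  --reasoning-parser glm45 --tool-call-parser glm47 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85
```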
Code Review
This pull request introduces accuracy and performance evaluation tests for the GLM-5.1 model on AMD MI30x and MI35x hardware. The review feedback identifies several improvement opportunities, including the removal of hardcoded environment-specific paths and the correction of PEP 8 import order violations in the MI35x test scripts. Additionally, the reviewer pointed out potential division-by-zero errors in the performance metrics calculation and noted configuration inconsistencies between the accuracy and performance benchmarks for the MI35x variant.
```python
os.environ.setdefault("HF_HOME", "/data2/models/huggingface")
os.environ.setdefault("HF_HUB_CACHE", "/data2/models/huggingface/hub")
```
Hardcoding environment-specific paths like /data2/models/huggingface reduces the portability of the test script. Additionally, placing these statements between imports violates PEP 8. Consider moving these settings to a configuration file or environment variables set outside the script, or at least moving them after all imports.
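A minimal sketch of the suggested cleanup: resolve the cache root from a variable set outside the script, with the default applied only as a fallback, and do it after all imports. The `SGLANG_TEST_HF_CACHE` variable name is a hypothetical example, not something the PR defines.

```python
import os


def resolve_hf_cache(default: str = "~/.cache/huggingface") -> str:
    """Resolve the Hugging Face cache root for test scripts.

    SGLANG_TEST_HF_CACHE is a hypothetical override variable used here for
    illustration; setdefault keeps any HF_HOME/HF_HUB_CACHE already set by
    the CI environment.
    """
    root = os.path.expanduser(os.environ.get("SGLANG_TEST_HF_CACHE", default))
    os.environ.setdefault("HF_HOME", root)
    os.environ.setdefault("HF_HUB_CACHE", os.path.join(root, "hub"))
    return root
```

With this shape, a runner can point all nightly jobs at a shared model cache without editing the test files.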
```python
        '{"enable_multithread_load": true}',
        "--watchdog-timeout",
        "1200",
    ],
    env_vars={},
```
The configuration for MI35x accuracy is missing several environment variables and parameters specified in the PR description's server config table (e.g., num_threads: 8 and SGLANG_USE_AITER). This inconsistency might lead to suboptimal performance or different behavior compared to the performance benchmarks.
```python
        '{"enable_multithread_load": true, "num_threads": 8}',
        "--watchdog-timeout",
        "1200",
    ],
    env_vars={
        "SGLANG_USE_AITER": "1",
        "SGLANG_ROCM_FUSED_DECODE_MLA": "0",
        "ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
        "SAFETENSORS_FAST_GPU": "1",
    },
)
```
```python
for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000
```
Potential ZeroDivisionError if result.output_throughput is zero. It is safer to check for a non-zero value before performing the division.
```diff
-itl = 1 / (result.output_throughput / result.batch_size) * 1000
+itl = (result.batch_size / result.output_throughput * 1000) if result.output_throughput > 0 else 0
```
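The guarded form can be exercised in isolation. A small sketch with a stand-in result object (the `Result` class here is illustrative, not the benchmark harness's actual type):

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Stand-in for the benchmark result record; illustrative only.
    batch_size: int
    output_throughput: float  # tokens/s across the whole batch


def inter_token_latency_ms(result: Result) -> float:
    # ITL (ms) = batch_size / output_throughput * 1000,
    # guarded so a zero-throughput (failed/empty) run reports 0 instead
    # of raising ZeroDivisionError.
    if result.output_throughput > 0:
        return result.batch_size / result.output_throughput * 1000
    return 0.0
```

For example, a batch of 8 at 4000 tokens/s yields 2.0 ms per output token, while a failed run with zero throughput is reported as 0 rather than crashing the report step.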
```python
for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000
```
Potential ZeroDivisionError if result.output_throughput is zero. It is safer to check for a non-zero value before performing the division.
```diff
-itl = 1 / (result.output_throughput / result.batch_size) * 1000
+itl = (result.batch_size / result.output_throughput * 1000) if result.output_throughput > 0 else 0
```
CI Validation - All 4 GLM-5.1 jobs passed ✅

- Nightly Test (AMD) — run 24122988145
- Nightly Test (AMD ROCm 7.2) — run 24122989432

All accuracy tests (GSM8K, threshold 0.93) and performance tests passed.
Drop the nightly-8-gpu-mi35x-glm47-fp8-rocm720 job, its job_select dropdown entry, and its nightly-check dependency. GLM-4.7 is superseded by GLM-5 and GLM-5.1 benchmarks.
The GLM-5 performance test on MI35x crashes with a GPU memory access fault (write to read-only page) during the first large prefill batch. Root cause: the fused_append_shared_experts Triton kernel triggers a gfx950 codegen issue when shared expert fusion is active with FP8 KV cache under TP-only (no EP) mode. GLM-5.1 (which uses EP=8 and thus bypasses shared expert fusion) is unaffected and keeps its perf test. MI30x GLM-5 perf test also stays since gfx942 is not affected. Keep the GLM-5 MI35x accuracy test which passes reliably.
Followed the model configs for GLM-5-FP8 and GLM-5.1-FP8 — they have identical architecture (same hidden size, layer count) and are both MoE models (256 routed experts, top-8). Why does GLM-5.1 add `--ep-size 8` while GLM-5 runs with pure TP? If EP is beneficial here, should the GLM-5 benchmark be updated to match?
Confirmed via config.json diff: GLM-5-FP8 and GLM-5.1-FP8 have identical architecture (GlmMoeDsaForCausalLM, 256 routed experts, top-8, same hidden_size/layers). The only diff is transformers_version. GLM-5 benchmarks run with pure TP (no EP), and --ep-size without --moe-a2a-backend is either a no-op or an assertion hazard (FP8 Cutlass MoE and Triton kernel MoE both require ep_size == 1). Align GLM-5.1 config with GLM-5 for consistency.
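The constraint described above can be expressed as a fail-fast config guard. This is a hedged sketch, not actual SGLang code; `ep_size` and `moe_a2a_backend` mirror the CLI flag names, and the backend string in the usage note is only an example value.

```python
from typing import Optional


def check_moe_parallel_config(ep_size: int, moe_a2a_backend: Optional[str]) -> None:
    """Reject EP settings the FP8 MoE kernels cannot honor (illustrative guard).

    Without an all-to-all backend, --ep-size != 1 is at best a no-op and at
    worst trips a kernel-side assertion (FP8 Cutlass and Triton MoE both
    require ep_size == 1), so fail at config time instead of deep in a kernel.
    """
    if moe_a2a_backend is None and ep_size != 1:
        raise ValueError(
            f"ep_size={ep_size} requires --moe-a2a-backend; "
            "FP8 Cutlass/Triton MoE kernels assert ep_size == 1"
        )
```

Pure TP (`ep_size=1`, no backend) passes, as does EP with an a2a backend configured; EP without one is rejected up front.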
updated.
Summary
Add GLM-5.1-FP8 nightly accuracy + perf benchmarks (`bench_one_batch`) for MI30x and MI35x. Both GLM-5-FP8 and GLM-5.1-FP8 share identical architecture (`GlmMoeDsaForCausalLM`, 256 routed experts, top-8, hidden_size=6144, 78 layers). GLM-5.1 test config mirrors GLM-5 (pure TP=8, NSA tilelang backend).

Based on the GLM-5 test pattern from #21710.
Changes

- `test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py` (suite: `nightly-amd-accuracy-8-gpu-glm51`)
- `test/registered/amd/accuracy/mi35x/test_glm51_eval_mi35x.py` (suite: `nightly-amd-8-gpu-mi35x-glm51`)
- `test/registered/amd/perf/mi30x/test_glm51_perf_amd.py` (suite: `nightly-perf-8-gpu-glm51`)
- `test/registered/amd/perf/mi35x/test_glm51_perf_mi35x.py` (suite: `nightly-perf-8-gpu-mi35x-glm51`)
- `nightly-test-amd.yml` — add MI30x + MI35x GLM-5.1 jobs
- `nightly-test-amd-rocm720.yml` — add MI30x + MI35x GLM-5.1 jobs; remove GLM-4.7-FP8 job (superseded)

Server config
Same config as GLM-5 — pure TP=8, no EP (EP without `--moe-a2a-backend` is a no-op, and FP8 Cutlass/Triton MoE kernels assert `ep_size == 1`).

- Accuracy model-loader config: `{"enable_multithread_load": true}`
- Perf model-loader config: `{"enable_multithread_load": true, "num_threads": 8}`
- Parsers: `--reasoning-parser=glm45 --tool-call-parser=glm47`
- Env: `SGLANG_USE_AITER=1`; MI35x perf adds `SGLANG_ROCM_FUSED_DECODE_MLA=0`, `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`, `SAFETENSORS_FAST_GPU=1`

Workflow behavior
- Accuracy: no `continue-on-error` — if it fails, perf is skipped and the job fails
- Perf: `continue-on-error: true` — perf failures don't block CI

CI validation
- `nightly-8-gpu-glm51`
- `nightly-8-gpu-mi35x-glm51` (perf step has `continue-on-error: true`)
- `nightly-8-gpu-glm51-rocm720`
- `nightly-8-gpu-mi35x-glm51-rocm720`

All 4 accuracy tests passed. 3/4 perf tests passed; MI35x default perf was cancelled by the runner (not a test failure).
Test plan
- `job_filter=nightly-8-gpu-glm51,nightly-8-gpu-mi35x-glm51`
- `job_filter=nightly-8-gpu-glm51-rocm720,nightly-8-gpu-mi35x-glm51-rocm720`