
[AMD] Add GLM-5.1-FP8 nightly accuracy and performance benchmarks for MI30x and MI35x#22336

Merged
HaiShaw merged 4 commits into `main` from `add-glm51-nightly-perf-test` on Apr 9, 2026
Conversation


@michaelzhang-ai (Collaborator) commented on Apr 8, 2026

Summary

Add GLM-5.1-FP8 nightly accuracy + perf benchmarks (bench_one_batch) for MI30x and MI35x. Both GLM-5-FP8 and GLM-5.1-FP8 share identical architecture (GlmMoeDsaForCausalLM, 256 routed experts, top-8, hidden_size=6144, 78 layers). GLM-5.1 test config mirrors GLM-5 (pure TP=8, NSA tilelang backend).

Based on the GLM-5 test pattern from #21710.

Changes

  • New: test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py (suite: nightly-amd-accuracy-8-gpu-glm51)
  • New: test/registered/amd/accuracy/mi35x/test_glm51_eval_mi35x.py (suite: nightly-amd-8-gpu-mi35x-glm51)
  • New: test/registered/amd/perf/mi30x/test_glm51_perf_amd.py (suite: nightly-perf-8-gpu-glm51)
  • New: test/registered/amd/perf/mi35x/test_glm51_perf_mi35x.py (suite: nightly-perf-8-gpu-mi35x-glm51)
  • Modified: nightly-test-amd.yml — add MI30x + MI35x GLM-5.1 jobs
  • Modified: nightly-test-amd-rocm720.yml — add MI30x + MI35x GLM-5.1 jobs; remove GLM-4.7-FP8 job (superseded)

Server config

Same config as GLM-5 — pure TP=8, no EP (EP without --moe-a2a-backend is a no-op, and FP8 Cutlass/Triton MoE kernels assert ep_size == 1).

| | MI30x | MI35x |
|---|---|---|
| `--tp` | 8 | 8 |
| `--kv-cache-dtype` | `fp8_e4m3` (perf) | `fp8_e4m3` (perf) |
| `--mem-fraction-static` | 0.85 (perf) / 0.80 (accuracy) | 0.85 (perf) / 0.80 (accuracy) |
| `--model-loader-extra-config` | `{"enable_multithread_load": true}` | `{"enable_multithread_load": true, "num_threads": 8}` |
| Parsers | `--reasoning-parser=glm45 --tool-call-parser=glm47` | Same |
| Env | `SGLANG_USE_AITER=1` | `SGLANG_USE_AITER=1`, `SGLANG_ROCM_FUSED_DECODE_MLA=0`, `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`, `SAFETENSORS_FAST_GPU=1` |
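Taken together, the MI30x perf column corresponds roughly to a launch command like the following sketch (assembled from the table and the commit message below; the exact entrypoint invocation inside the test scripts may differ):

```shell
SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path zai-org/GLM-5.1-FP8 \
  --tp 8 \
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --reasoning-parser glm45 \
  --tool-call-parser glm47
```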

Workflow behavior

  • Accuracy step has no `continue-on-error`: if it fails, perf is skipped and the job fails
  • Perf step has `continue-on-error: true`: perf failures don't block CI
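A minimal workflow-step sketch of this gating (step names and commands are illustrative, not copied from `nightly-test-amd.yml`):

```yaml
# Sketch only: illustrates the gating described above, not the actual job definitions.
- name: GLM-5.1 accuracy (gates the job)
  # no continue-on-error: a failure here fails the job and the perf step never runs
  run: python3 test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py

- name: GLM-5.1 perf (non-blocking)
  continue-on-error: true  # perf failures are reported but don't fail CI
  run: python3 test/registered/amd/perf/mi30x/test_glm51_perf_amd.py
```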

CI validation

| Job | GPU | ROCm | Accuracy | Perf | Notes |
|---|---|---|---|---|---|
| nightly-8-gpu-glm51 | MI30x | Default | ✅ | ✅ | 35m1s |
| nightly-8-gpu-mi35x-glm51 | MI35x | Default | ✅ | ⚠️ Cancelled | Perf cancelled by runner (accuracy passed); `continue-on-error: true` |
| nightly-8-gpu-glm51-rocm720 | MI30x | 7.2 | ✅ | ✅ | 47m3s |
| nightly-8-gpu-mi35x-glm51-rocm720 | MI35x | 7.2 | ✅ | ✅ | 1h17m56s |

All 4 accuracy tests passed. 3/4 perf tests passed; MI35x default perf was cancelled by runner (not a test failure).

Test plan

  • Trigger default ROCm workflow with job_filter=nightly-8-gpu-glm51,nightly-8-gpu-mi35x-glm51
  • Trigger ROCm 7.2 workflow with job_filter=nightly-8-gpu-glm51-rocm720,nightly-8-gpu-mi35x-glm51-rocm720
  • Verify accuracy tests pass GSM8K threshold (0.93) — all 4 passed
  • Verify perf results are reported in GitHub step summary — 3/4 passed, 1 cancelled by runner

… MI30x and MI35x

Add bench_one_batch perf tests and GSM8K accuracy tests for GLM-5.1-FP8
(MoE, 754B) with NSA attention backend on both MI30x and MI35x. GLM-5.1
uses glm_moe_dsa architecture requiring TP=8 + EP=8, matching the
MiniMax-M2.5 expert parallelism pattern.

- New: test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py
  (suite: nightly-amd-accuracy-8-gpu-glm51)
- New: test/registered/amd/accuracy/mi35x/test_glm51_eval_mi35x.py
  (suite: nightly-amd-8-gpu-mi35x-glm51)
- New: test/registered/amd/perf/mi30x/test_glm51_perf_amd.py
  (suite: nightly-perf-8-gpu-glm51)
- New: test/registered/amd/perf/mi35x/test_glm51_perf_mi35x.py
  (suite: nightly-perf-8-gpu-mi35x-glm51)
- Modified: nightly-test-amd.yml and nightly-test-amd-rocm720.yml
  with GLM-5.1 jobs (accuracy + perf in same job)

Server config:
- Model: zai-org/GLM-5.1-FP8 with --tp 8 --ep-size 8
- NSA: --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
- Parsers: --reasoning-parser=glm45 --tool-call-parser=glm47
- Perf: --kv-cache-dtype fp8_e4m3, --mem-fraction-static 0.85
- MI35x perf adds SGLANG_ROCM_FUSED_DECODE_MLA=0,
  ROCM_QUICK_REDUCE_QUANTIZATION=INT4, SAFETENSORS_FAST_GPU=1

Workflow: accuracy has no continue-on-error (failure skips perf),
perf has continue-on-error: true (perf failures don't block CI)

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces accuracy and performance evaluation tests for the GLM-5.1 model on AMD MI30x and MI35x hardware. The review feedback identifies several improvement opportunities, including the removal of hardcoded environment-specific paths and the correction of PEP 8 import order violations in the MI35x test scripts. Additionally, the reviewer pointed out potential division-by-zero errors in the performance metrics calculation and noted configuration inconsistencies between the accuracy and performance benchmarks for the MI35x variant.

Comment on lines +12 to +13
```python
os.environ.setdefault("HF_HOME", "/data2/models/huggingface")
os.environ.setdefault("HF_HUB_CACHE", "/data2/models/huggingface/hub")
```

medium

Hardcoding environment-specific paths like /data2/models/huggingface reduces the portability of the test script. Additionally, placing these statements between imports violates PEP 8. Consider moving these settings to a configuration file or environment variables set outside the script, or at least moving them after all imports.
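One portable pattern along the lines the reviewer suggests is to resolve the cache paths from the environment with an overridable fallback, after all imports. This is a sketch, not the reviewed code; `SGLANG_TEST_HF_ROOT` and `resolve_hf_paths` are hypothetical names introduced here for illustration:

```python
import os


def resolve_hf_paths(env: dict, default_root: str = "/tmp/hf-cache") -> dict:
    """Resolve HF cache locations from an environment mapping.

    Hypothetical helper: prefers explicit HF_HOME / HF_HUB_CACHE settings,
    then a CI-provided root (SGLANG_TEST_HF_ROOT), then a portable default,
    instead of a hardcoded host path like /data2/models/huggingface.
    """
    root = env.get("SGLANG_TEST_HF_ROOT", default_root)
    return {
        "HF_HOME": env.get("HF_HOME", root),
        "HF_HUB_CACHE": env.get("HF_HUB_CACHE", os.path.join(root, "hub")),
    }


# Apply without clobbering anything the runner already set.
for key, value in resolve_hf_paths(dict(os.environ)).items():
    os.environ.setdefault(key, value)
```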

Comment on lines +83 to +87
```python
            '{"enable_multithread_load": true}',
            "--watchdog-timeout",
            "1200",
        ],
        env_vars={},
```

medium

The configuration for MI35x accuracy is missing several environment variables and parameters specified in the PR description's server config table (e.g., num_threads: 8 and SGLANG_USE_AITER). This inconsistency might lead to suboptimal performance or different behavior compared to the performance benchmarks.

```python
            '{"enable_multithread_load": true, "num_threads": 8}',
            "--watchdog-timeout",
            "1200",
        ],
        env_vars={
            "SGLANG_USE_AITER": "1",
            "SGLANG_ROCM_FUSED_DECODE_MLA": "0",
            "ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
            "SAFETENSORS_FAST_GPU": "1",
        },
```

```python
)

for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000
```

medium

Potential ZeroDivisionError if result.output_throughput is zero. It is safer to check for a non-zero value before performing the division.

Suggested change:

```diff
- itl = 1 / (result.output_throughput / result.batch_size) * 1000
+ itl = (result.batch_size / result.output_throughput * 1000) if result.output_throughput > 0 else 0
```
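The guarded metric can also be isolated into a small helper, which keeps the guard testable. This is a sketch; `ReportResult` here is a hypothetical stand-in for whatever record type the perf script actually iterates over:

```python
from dataclasses import dataclass


@dataclass
class ReportResult:
    """Hypothetical stand-in for the perf script's per-batch result record."""
    batch_size: int
    output_throughput: float  # tokens/s; may be 0 on a failed or empty run


def inter_token_latency_ms(result: ReportResult) -> float:
    """Mean inter-token latency in ms, guarding against zero throughput."""
    if result.output_throughput <= 0:
        return 0.0
    return result.batch_size / result.output_throughput * 1000
```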

Comment on lines +11 to +12
```python
os.environ.setdefault("HF_HOME", "/data2/models/huggingface")
os.environ.setdefault("HF_HUB_CACHE", "/data2/models/huggingface/hub")
```

medium

Hardcoding environment-specific paths like /data2/models/huggingface reduces the portability of the test script. Additionally, placing these statements between imports violates PEP 8.

```python
)

for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000
```

medium

Potential ZeroDivisionError if result.output_throughput is zero. It is safer to check for a non-zero value before performing the division.

Suggested change:

```diff
- itl = 1 / (result.output_throughput / result.batch_size) * 1000
+ itl = (result.batch_size / result.output_throughput * 1000) if result.output_throughput > 0 else 0
```

@michaelzhang-ai (Collaborator, Author) commented:

CI Validation - All 4 GLM-5.1 jobs passed ✅

Nightly Test (AMD) — run 24122988145

| Job | GPU | Status | Duration |
|---|---|---|---|
| nightly-8-gpu-glm51 | MI30x | ✅ Passed | 37m40s |
| nightly-8-gpu-mi35x-glm51 | MI35x | ✅ Passed | 1h8m5s |

Nightly Test (AMD ROCm 7.2) — run 24122989432

| Job | GPU | Status | Duration |
|---|---|---|---|
| nightly-8-gpu-glm51-rocm720 | MI30x | ✅ Passed | 31m4s |
| nightly-8-gpu-mi35x-glm51-rocm720 | MI35x | ✅ Passed | 37m54s |

All accuracy tests (GSM8K, threshold 0.93) and performance tests (bench_one_batch) passed on both MI30x and MI35x across both default ROCm and ROCm 7.2 workflows.

@michaelzhang-ai michaelzhang-ai changed the title [AMD] Add GLM-5.1-FP8 nightly performance benchmarks for MI30x and MI35x [AMD] Add GLM-5.1-FP8 nightly accuracy and performance benchmarks for MI30x and MI35x Apr 9, 2026
Drop the nightly-8-gpu-mi35x-glm47-fp8-rocm720 job, its job_select
dropdown entry, and its nightly-check dependency. GLM-4.7 is superseded
by GLM-5 and GLM-5.1 benchmarks.
The GLM-5 performance test on MI35x crashes with a GPU memory access
fault (write to read-only page) during the first large prefill batch.
Root cause: the fused_append_shared_experts Triton kernel triggers a
gfx950 codegen issue when shared expert fusion is active with FP8 KV
cache under TP-only (no EP) mode.

GLM-5.1 (which uses EP=8 and thus bypasses shared expert fusion) is
unaffected and keeps its perf test. MI30x GLM-5 perf test also stays
since gfx942 is not affected.

Keep the GLM-5 MI35x accuracy test which passes reliably.
@michaelzhang-ai michaelzhang-ai changed the title [AMD] Add GLM-5.1-FP8 nightly accuracy and performance benchmarks for MI30x and MI35x [AMD] Add GLM-5.1-FP8 nightly benchmarks + drop GLM-5 MI35x perf test Apr 9, 2026
@michaelzhang-ai michaelzhang-ai changed the title [AMD] Add GLM-5.1-FP8 nightly benchmarks + drop GLM-5 MI35x perf test [AMD] Add GLM-5.1-FP8 nightly accuracy and performance benchmarks for MI30x and MI35x Apr 9, 2026
@1am9trash (Collaborator) commented:

Followed the model configs for GLM-5-FP8 and GLM-5.1-FP8 — they have identical architecture (same hidden size, layer count) and are both MoE models (256 routed experts, top-8).

Why does GLM-5.1 add --ep-size 8 while GLM-5 runs with pure TP? If EP is beneficial here, should the GLM-5 benchmark be updated to match?

Confirmed via config.json diff: GLM-5-FP8 and GLM-5.1-FP8 have
identical architecture (GlmMoeDsaForCausalLM, 256 routed experts,
top-8, same hidden_size/layers). The only diff is transformers_version.

GLM-5 benchmarks run with pure TP (no EP), and --ep-size without
--moe-a2a-backend is either a no-op or an assertion hazard (FP8
Cutlass MoE and Triton kernel MoE both require ep_size == 1).
Align GLM-5.1 config with GLM-5 for consistency.
@michaelzhang-ai (Collaborator, Author) commented:

Followed the model configs for GLM-5-FP8 and GLM-5.1-FP8 — they have identical architecture (same hidden size, layer count) and are both MoE models (256 routed experts, top-8).

Why does GLM-5.1 add --ep-size 8 while GLM-5 runs with pure TP? If EP is beneficial here, should the GLM-5 benchmark be updated to match?

updated.


@1am9trash (Collaborator) left a comment


LGTM

@HaiShaw HaiShaw merged commit ef6bfc1 into main Apr 9, 2026
132 of 139 checks passed
@HaiShaw HaiShaw deleted the add-glm51-nightly-perf-test branch April 9, 2026 05:57