[AMD] Add GLM-5.1-FP8 nightly accuracy and performance benchmarks for MI30x and MI35x#22336
Conversation
Add `bench_one_batch` perf tests and GSM8K accuracy tests for GLM-5.1-FP8 (MoE, 754B) with the NSA attention backend on both MI30x and MI35x. GLM-5.1 uses the `glm_moe_dsa` architecture requiring TP=8 + EP=8, matching the MiniMax-M2.5 expert parallelism pattern.

- New: `test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py` (suite: `nightly-amd-accuracy-8-gpu-glm51`)
- New: `test/registered/amd/accuracy/mi35x/test_glm51_eval_mi35x.py` (suite: `nightly-amd-8-gpu-mi35x-glm51`)
- New: `test/registered/amd/perf/mi30x/test_glm51_perf_amd.py` (suite: `nightly-perf-8-gpu-glm51`)
- New: `test/registered/amd/perf/mi35x/test_glm51_perf_mi35x.py` (suite: `nightly-perf-8-gpu-mi35x-glm51`)
- Modified: `nightly-test-amd.yml` and `nightly-test-amd-rocm720.yml` with GLM-5.1 jobs (accuracy + perf in the same job)

Server config:
- Model: `zai-org/GLM-5.1-FP8` with `--tp 8 --ep-size 8`
- NSA: `--nsa-prefill-backend tilelang --nsa-decode-backend tilelang`
- Parsers: `--reasoning-parser=glm45 --tool-call-parser=glm47`
- Perf: `--kv-cache-dtype fp8_e4m3`, `--mem-fraction-static 0.85`
- MI35x perf adds `SGLANG_ROCM_FUSED_DECODE_MLA=0`, `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`, `SAFETENSORS_FAST_GPU=1`

Workflow: accuracy has no `continue-on-error` (failure skips perf); perf has `continue-on-error: true` (perf failures don't block CI).
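The flags above assemble into roughly the following launch command (a sketch inferred from the listed config; the actual tests drive the server through the registered test harness, so the exact invocation may differ):

```shell
# Sketch of the perf-test server launch implied by the config above;
# the registered tests wrap this rather than invoking the CLI directly.
python -m sglang.launch_server \
  --model-path zai-org/GLM-5.1-FP8 \
  --tp 8 --ep-size 8 \
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang \
  --reasoning-parser glm45 --tool-call-parser glm47 \
  --kv-cache-dtype fp8_e4m3 \
  --mem-fraction-static 0.85
```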
Code Review
This pull request introduces accuracy and performance evaluation tests for the GLM-5.1 model on AMD MI30x and MI35x hardware. The review feedback identifies several improvement opportunities, including the removal of hardcoded environment-specific paths and the correction of PEP 8 import order violations in the MI35x test scripts. Additionally, the reviewer pointed out potential division-by-zero errors in the performance metrics calculation and noted configuration inconsistencies between the accuracy and performance benchmarks for the MI35x variant.
```python
os.environ.setdefault("HF_HOME", "/data2/models/huggingface")
os.environ.setdefault("HF_HUB_CACHE", "/data2/models/huggingface/hub")
```
Hardcoding environment-specific paths like /data2/models/huggingface reduces the portability of the test script. Additionally, placing these statements between imports violates PEP 8. Consider moving these settings to a configuration file or environment variables set outside the script, or at least moving them after all imports.
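A minimal sketch of the suggested cleanup: resolve the cache root from a variable set outside the script, with the default applied only as a fallback, and do it after all imports. The `SGLANG_TEST_HF_CACHE` variable name is a hypothetical example, not something the PR defines.

```python
import os


def resolve_hf_cache(default: str = "~/.cache/huggingface") -> str:
    """Resolve the Hugging Face cache root for test scripts.

    SGLANG_TEST_HF_CACHE is a hypothetical override variable used here for
    illustration; setdefault keeps any HF_HOME/HF_HUB_CACHE already set by
    the CI environment.
    """
    root = os.path.expanduser(os.environ.get("SGLANG_TEST_HF_CACHE", default))
    os.environ.setdefault("HF_HOME", root)
    os.environ.setdefault("HF_HUB_CACHE", os.path.join(root, "hub"))
    return root
```

With this shape, a runner can point all nightly jobs at a shared model cache without editing the test files.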
```python
        '{"enable_multithread_load": true}',
        "--watchdog-timeout",
        "1200",
    ],
    env_vars={},
```
The configuration for MI35x accuracy is missing several environment variables and parameters specified in the PR description's server config table (e.g., num_threads: 8 and SGLANG_USE_AITER). This inconsistency might lead to suboptimal performance or different behavior compared to the performance benchmarks.
```python
        '{"enable_multithread_load": true, "num_threads": 8}',
        "--watchdog-timeout",
        "1200",
    ],
    env_vars={
        "SGLANG_USE_AITER": "1",
        "SGLANG_ROCM_FUSED_DECODE_MLA": "0",
        "ROCM_QUICK_REDUCE_QUANTIZATION": "INT4",
        "SAFETENSORS_FAST_GPU": "1",
    },
)
```
```python
for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000
```
Potential ZeroDivisionError if result.output_throughput is zero. It is safer to check for a non-zero value before performing the division.
```diff
-itl = 1 / (result.output_throughput / result.batch_size) * 1000
+itl = (result.batch_size / result.output_throughput * 1000) if result.output_throughput > 0 else 0
```
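The guarded form can be exercised in isolation. A small sketch with a stand-in result object (the `Result` class here is illustrative, not the benchmark harness's actual type):

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Stand-in for the benchmark result record; illustrative only.
    batch_size: int
    output_throughput: float  # tokens/s across the whole batch


def inter_token_latency_ms(result: Result) -> float:
    # ITL (ms) = batch_size / output_throughput * 1000,
    # guarded so a zero-throughput (failed/empty) run reports 0 instead
    # of raising ZeroDivisionError.
    if result.output_throughput > 0:
        return result.batch_size / result.output_throughput * 1000
    return 0.0
```

For example, a batch of 8 at 4000 tokens/s yields 2.0 ms per output token, while a failed run with zero throughput is reported as 0 rather than crashing the report step.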
```python
for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000
```
Potential ZeroDivisionError if result.output_throughput is zero. It is safer to check for a non-zero value before performing the division.
```diff
-itl = 1 / (result.output_throughput / result.batch_size) * 1000
+itl = (result.batch_size / result.output_throughput * 1000) if result.output_throughput > 0 else 0
```
CI Validation - All 4 GLM-5.1 jobs passed ✅

- Nightly Test (AMD) — run 24122988145
- Nightly Test (AMD ROCm 7.2) — run 24122989432

All accuracy tests (GSM8K, threshold 0.93) and performance tests passed.
Drop the nightly-8-gpu-mi35x-glm47-fp8-rocm720 job, its job_select dropdown entry, and its nightly-check dependency. GLM-4.7 is superseded by GLM-5 and GLM-5.1 benchmarks.
The GLM-5 performance test on MI35x crashes with a GPU memory access fault (write to read-only page) during the first large prefill batch. Root cause: the fused_append_shared_experts Triton kernel triggers a gfx950 codegen issue when shared expert fusion is active with FP8 KV cache under TP-only (no EP) mode. GLM-5.1 (which uses EP=8 and thus bypasses shared expert fusion) is unaffected and keeps its perf test. MI30x GLM-5 perf test also stays since gfx942 is not affected. Keep the GLM-5 MI35x accuracy test which passes reliably.
Followed the model configs for GLM-5-FP8 and GLM-5.1-FP8 — they have identical architecture (same hidden size, layer count) and are both MoE models (256 routed experts, top-8). Why does GLM-5.1 add `--ep-size 8` while GLM-5 runs with pure TP? If EP is beneficial here, should the GLM-5 benchmark be updated to match?
Confirmed via config.json diff: GLM-5-FP8 and GLM-5.1-FP8 have identical architecture (GlmMoeDsaForCausalLM, 256 routed experts, top-8, same hidden_size/layers). The only diff is transformers_version. GLM-5 benchmarks run with pure TP (no EP), and --ep-size without --moe-a2a-backend is either a no-op or an assertion hazard (FP8 Cutlass MoE and Triton kernel MoE both require ep_size == 1). Align GLM-5.1 config with GLM-5 for consistency.
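The constraint described above can be expressed as a fail-fast config guard. This is a hedged sketch, not actual SGLang code; `ep_size` and `moe_a2a_backend` mirror the CLI flag names, and the backend string in the usage note is only an example value.

```python
from typing import Optional


def check_moe_parallel_config(ep_size: int, moe_a2a_backend: Optional[str]) -> None:
    """Reject EP settings the FP8 MoE kernels cannot honor (illustrative guard).

    Without an all-to-all backend, --ep-size != 1 is at best a no-op and at
    worst trips a kernel-side assertion (FP8 Cutlass and Triton MoE both
    require ep_size == 1), so fail at config time instead of deep in a kernel.
    """
    if moe_a2a_backend is None and ep_size != 1:
        raise ValueError(
            f"ep_size={ep_size} requires --moe-a2a-backend; "
            "FP8 Cutlass/Triton MoE kernels assert ep_size == 1"
        )
```

Pure TP (`ep_size=1`, no backend) passes, as does EP with an a2a backend configured; EP without one is rejected up front.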
updated.
Summary
Add GLM-5.1-FP8 nightly accuracy + perf benchmarks (`bench_one_batch`) for MI30x and MI35x. Both GLM-5-FP8 and GLM-5.1-FP8 share identical architecture (`GlmMoeDsaForCausalLM`, 256 routed experts, top-8, hidden_size=6144, 78 layers). GLM-5.1 test config mirrors GLM-5 (pure TP=8, NSA tilelang backend).

Based on the GLM-5 test pattern from #21710.
Changes

- `test/registered/amd/accuracy/mi30x/test_glm51_eval_amd.py` (suite: `nightly-amd-accuracy-8-gpu-glm51`)
- `test/registered/amd/accuracy/mi35x/test_glm51_eval_mi35x.py` (suite: `nightly-amd-8-gpu-mi35x-glm51`)
- `test/registered/amd/perf/mi30x/test_glm51_perf_amd.py` (suite: `nightly-perf-8-gpu-glm51`)
- `test/registered/amd/perf/mi35x/test_glm51_perf_mi35x.py` (suite: `nightly-perf-8-gpu-mi35x-glm51`)
- `nightly-test-amd.yml` — add MI30x + MI35x GLM-5.1 jobs
- `nightly-test-amd-rocm720.yml` — add MI30x + MI35x GLM-5.1 jobs; remove GLM-4.7-FP8 job (superseded)

Server config
Same config as GLM-5 — pure TP=8, no EP (EP without `--moe-a2a-backend` is a no-op, and FP8 Cutlass/Triton MoE kernels assert `ep_size == 1`).

- Accuracy model-loader config: `{"enable_multithread_load": true}`
- Perf model-loader config: `{"enable_multithread_load": true, "num_threads": 8}`
- Parsers: `--reasoning-parser=glm45 --tool-call-parser=glm47`
- Env: `SGLANG_USE_AITER=1`; MI35x perf adds `SGLANG_ROCM_FUSED_DECODE_MLA=0`, `ROCM_QUICK_REDUCE_QUANTIZATION=INT4`, `SAFETENSORS_FAST_GPU=1`

Workflow behavior
- Accuracy: no `continue-on-error` — if it fails, perf is skipped and the job fails
- Perf: `continue-on-error: true` — perf failures don't block CI

CI validation
- `nightly-8-gpu-glm51`
- `nightly-8-gpu-mi35x-glm51` (perf step has `continue-on-error: true`)
- `nightly-8-gpu-glm51-rocm720`
- `nightly-8-gpu-mi35x-glm51-rocm720`

All 4 accuracy tests passed. 3/4 perf tests passed; MI35x default perf was cancelled by the runner (not a test failure).
Test plan
- `job_filter=nightly-8-gpu-glm51,nightly-8-gpu-mi35x-glm51`
- `job_filter=nightly-8-gpu-glm51-rocm720,nightly-8-gpu-mi35x-glm51-rocm720`