[AMD] Add GLM-5-FP8 nightly performance benchmarks for MI30x and MI35x#21710
Code Review
This pull request introduces nightly performance benchmarks for the GLM-5 model on AMD MI30x and MI35x platforms. The reviewer identified several areas for improvement, including refactoring duplicated report generation logic into a shared utility, fixing potential division-by-zero errors in throughput calculations, and enhancing test portability by avoiding hardcoded local paths. Additionally, it was noted that certain environment variables should be consistently applied across different GPU configurations to ensure optimal performance.
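The division-by-zero concern raised in the review can be illustrated with a small guard. This is only a sketch: `safe_throughput` is a hypothetical helper name, not a function from this PR or from SGLang, and the numbers in the usage lines are made up.

```python
from typing import Optional


def safe_throughput(num_tokens: int, latency_s: Optional[float]) -> float:
    """Return tokens per second, or 0.0 when latency is missing or non-positive.

    Hypothetical helper: guards the division so a failed or zero-latency
    benchmark run yields 0.0 instead of raising ZeroDivisionError.
    """
    if latency_s is None or latency_s <= 0:
        return 0.0
    return num_tokens / latency_s


# Illustrative values only, not measurements from this PR.
print(safe_throughput(4096, 2.0))
print(safe_throughput(4096, 0.0))
```

Returning 0.0 keeps report generation running when a single batch fails; raising a descriptive error instead would be an equally reasonable choice if silent zeros are undesirable in the report.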
    "--watchdog-timeout",
    "1200",
    ],
    "env_vars": {},
Force-pushed from 4db8b49 to 9141485.
Maybe we can add
Force-pushed from 50c89f2 to f64c7af.
Force-pushed from 8a360a8 to b9ec6b9.
…MI35x

Add bench_one_batch perf tests for GLM-5-FP8 with NSA attention backend, running after accuracy tests in the same CI job. Perf failures do not block CI when accuracy passes (continue-on-error: true).

- Use zai-org/GLM-5-FP8 for both accuracy and perf tests
- Add --reasoning-parser=glm45 --tool-call-parser=glm47 for consistency with NV tests and InferenceX benchmarks
- Enable --kv-cache-dtype fp8_e4m3 in perf tests for FP8 KV cache
- MI35x perf uses env tuning from InferenceX and PR #21511
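As a rough sketch of how the flags named in this commit message could be collected for a perf test, the snippet below builds an argument list. `build_server_args` and `MODEL_PATH` are hypothetical names, and `--tp 8` is an assumption standing in for however the tests actually set TP=8; the remaining flags and values are taken from this PR's text.

```python
from typing import List

# Model used for both accuracy and perf tests, per the commit message.
MODEL_PATH = "zai-org/GLM-5-FP8"


def build_server_args(watchdog_timeout_s: int = 1200) -> List[str]:
    """Assemble server flags mentioned in this PR (hypothetical helper)."""
    return [
        "--tp", "8",  # assumption: tensor parallelism of 8 is passed this way
        "--reasoning-parser=glm45",
        "--tool-call-parser=glm47",
        "--kv-cache-dtype", "fp8_e4m3",  # FP8 KV cache for perf runs
        "--watchdog-timeout", str(watchdog_timeout_s),
    ]


args = build_server_args()
print(MODEL_PATH, " ".join(args))
```

Keeping the flags in one function would let the MI30x and MI35x tests share a base list and differ only in their environment variables, which is in the spirit of the reviewer's request to reduce duplication.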
Force-pushed from 2a78aa5 to 3815bea.
Summary
Add GLM-5-FP8 nightly perf benchmarks (bench_one_batch) for MI30x and MI35x. Both accuracy and perf use zai-org/GLM-5-FP8 with NSA tilelang backend, TP=8, FP8 KV cache, and --reasoning-parser=glm45 --tool-call-parser=glm47 (matching NV/InferenceX configs).

Changes
- test/registered/amd/perf/mi30x/test_glm5_perf_amd.py (suite: nightly-perf-8-gpu-glm5)
- test/registered/amd/perf/mi35x/test_glm5_perf_mi35x.py (suite: nightly-perf-8-gpu-mi35x-glm5)
- mi30x/ and mi35x/: switch model to zai-org/GLM-5-FP8, add parser flags
- nightly-test-amd.yml and nightly-test-amd-rocm720.yml: add perf step after accuracy in each GLM-5 job

Workflow behavior
- Accuracy step, no continue-on-error: if it fails, perf is skipped and the job fails
- Perf step, continue-on-error: true: perf failures don't block CI

Dependencies
--kv-cache-dtype fp8_e4m3 on MI30x)

Server config
- --kv-cache-dtype
- --mem-fraction-static
- --model-loader-extra-config: {"enable_multithread_load": true}, {"enable_multithread_load": true, "num_threads": 8}
- SGLANG_USE_AITER=1
- SGLANG_ROCM_FUSED_DECODE_MLA=0, ROCM_QUICK_REDUCE_QUANTIZATION=INT4, SAFETENSORS_FAST_GPU=1

CI validation
MI30x perf results (earlier run, without FP8 KV cache)
From AMD run / ROCm 7.2 run: