[AMD] Add Qwen3.5-397B FP8 nightly perf benchmarks for MI30x and MI35x #21669
…I35x
Add bench_one_batch performance tests for Qwen3.5-397B-A17B-FP8 on both MI325/MI300X and MI35x GPUs. Perf steps run after existing accuracy tests with continue-on-error so perf failures don't block CI when accuracy passes.
- New test files using triton attention backend, TP=8, mem-fraction 0.8
- Perf steps added to both default ROCm and ROCm 7.2 nightly workflows
- Suite names: nightly-perf-8-gpu-qwen35-fp8, nightly-perf-8-gpu-mi35x-qwen35-fp8
Code Review
This pull request introduces nightly performance benchmarks for the Qwen3.5-397B-A17B FP8 model on AMD MI30x and MI35x platforms. The reviewer identified significant code duplication between the benchmark scripts and suggested refactoring shared logic into a common module. Other feedback includes fixing a potential division-by-zero error in the ITL calculation and replacing hardcoded model paths with more portable environment-based configurations.
    for result in report_results:
        itl = 1 / (result.output_throughput / result.batch_size) * 1000
The calculation for itl could lead to a ZeroDivisionError if result.output_throughput is zero. It's safer to check for this case to prevent the test from crashing during report generation. Rewriting the expression also improves readability.
Suggested change:
    - itl = 1 / (result.output_throughput / result.batch_size) * 1000
    + itl = (result.batch_size / result.output_throughput) * 1000 if result.output_throughput > 0 else 0
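The guarded expression can also be pulled into a small helper so both benchmark scripts share it. The sketch below is illustrative only: the name compute_itl_ms is hypothetical, output_throughput is assumed to be tokens per second summed over the batch, and returning 0.0 for a non-positive throughput follows the reviewer's suggestion, not code confirmed in this PR.

```python
def compute_itl_ms(batch_size: int, output_throughput: float) -> float:
    """Return inter-token latency in milliseconds.

    Hypothetical helper. Returns 0.0 when throughput is not positive,
    so report generation does not raise ZeroDivisionError.
    """
    if output_throughput <= 0:
        return 0.0
    # Each request receives output_throughput / batch_size tokens per second,
    # so the time per token is the reciprocal, converted to milliseconds.
    return batch_size / output_throughput * 1000


print(compute_itl_ms(4, 512.0))  # 7.8125
print(compute_itl_ms(8, 0.0))    # 0.0
```

Returning 0.0 keeps the report loop running, but a sentinel such as float("nan") may be a clearer signal that a run produced no tokens.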
    os.environ.setdefault("HF_HOME", "/data2/models/huggingface")
    os.environ.setdefault("HF_HUB_CACHE", "/data2/models/huggingface/hub")
Hardcoding paths like /data2/models/huggingface makes the test less portable and dependent on a specific machine's setup. It's better to rely on the CI environment to set these environment variables, or use a more generic default that works in different environments (e.g., a path relative to the user's home directory).
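One way to act on this comment is to let the CI environment remain the source of truth and fall back to a home-relative default. This is a minimal sketch, assuming ~/.cache/huggingface as the fallback (the Hugging Face default cache location); the function name configure_hf_cache and its env parameter are hypothetical and not part of this PR.

```python
import os
from pathlib import Path


def configure_hf_cache(env=os.environ):
    """Set HF_HOME and HF_HUB_CACHE only if they are not already set.

    Hypothetical helper: values provided by the CI environment always win;
    the fallback is a path under the current user's home directory.
    """
    default_home = Path.home() / ".cache" / "huggingface"
    hf_home = env.setdefault("HF_HOME", str(default_home))
    # Derive the hub cache from whichever HF_HOME ended up in effect.
    env.setdefault("HF_HUB_CACHE", str(Path(hf_home) / "hub"))
    return env["HF_HOME"], env["HF_HUB_CACHE"]
```

A runner that needs /data2/models/huggingface can then export HF_HOME in its workflow environment instead of having the path baked into the test file.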
    for result in report_results:
        itl = 1 / (result.output_throughput / result.batch_size) * 1000
The calculation for itl could lead to a ZeroDivisionError if result.output_throughput is zero. It's safer to check for this case to prevent the test from crashing during report generation. Rewriting the expression also improves readability.
Suggested change:
    - itl = 1 / (result.output_throughput / result.batch_size) * 1000
    + itl = (result.batch_size / result.output_throughput) * 1000 if result.output_throughput > 0 else 0
Override test_lm_eval in the Qwen3.5 accuracy tests to write a markdown results table to GITHUB_STEP_SUMMARY, matching the pattern used by the MXFP4 combined tests. No common code changed.
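GitHub Actions exposes a GITHUB_STEP_SUMMARY environment variable that points to a file; Markdown appended to that file is rendered on the job summary page. The sketch below shows the general pattern only. The function name write_step_summary and the (task, metric, value) row shape are assumptions for illustration and are not taken from the test_lm_eval override in this PR.

```python
import os


def write_step_summary(title, rows, summary_path=None):
    """Append a Markdown table of (task, metric, value) rows to the job summary.

    Hypothetical helper. Returns the Markdown text, and writes nothing when
    no summary file is configured (for example, when run outside CI).
    """
    lines = [
        f"### {title}",
        "",
        "| Task | Metric | Value |",
        "| --- | --- | --- |",
    ]
    for task, metric, value in rows:
        lines.append(f"| {task} | {metric} | {value:.4f} |")
    markdown = "\n".join(lines) + "\n"

    path = summary_path or os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        # Append rather than overwrite: other steps may share the same file.
        with open(path, "a", encoding="utf-8") as f:
            f.write(markdown)
    return markdown
```

Keeping the writer local to each test file is consistent with the "No common code changed" constraint, at the cost of some duplication.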
Jackycheng0808 left a comment
We should use --attention-backend aiter instead.
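Switching the backend becomes a one-argument change if the launch flags are built in one place. The sketch below assembles the flags listed in this PR's description. The helper name build_server_args and the choice to pass the backend as a parameter are assumptions for illustration, not the PR's code.

```python
import json


def build_server_args(attention_backend="aiter", tp=8, mem_fraction_static=0.8):
    """Return the launch flags from the PR description as an argv-style list.

    Hypothetical helper; only the flag names and values come from the PR text.
    """
    return [
        "--attention-backend", attention_backend,
        "--tp", str(tp),
        "--mem-fraction-static", str(mem_fraction_static),
        "--model-loader-extra-config", json.dumps({"enable_multithread_load": True}),
        "--watchdog-timeout", "1200",
    ]


print(build_server_args()[:2])  # ['--attention-backend', 'aiter']
```

Passing "triton" as the first argument reproduces the configuration used by the earlier runs.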
Summary
- Add bench_one_batch performance tests for Qwen3.5-397B-A17B-FP8 on both MI325/MI300X and MI35x GPUs
- Perf steps run after existing accuracy tests with continue-on-error: true so perf failures don't block CI when accuracy passes
Changes
New test files
- test/registered/amd/perf/mi30x/test_qwen35_fp8_perf_amd.py (suite: nightly-perf-8-gpu-qwen35-fp8)
- test/registered/amd/perf/mi35x/test_qwen35_fp8_perf_mi35x.py (suite: nightly-perf-8-gpu-mi35x-qwen35-fp8)
Server configuration (matches InferenceX benchmarks)
- Qwen/Qwen3.5-397B-A17B-FP8 (pre-quantized FP8 checkpoint)
- --attention-backend aiter
- --tp 8, --mem-fraction-static 0.8
- --model-loader-extra-config '{"enable_multithread_load": true}'
- --watchdog-timeout 1200
- SGLANG_USE_AITER=1
Workflow updates
- .github/workflows/nightly-test-amd.yml: Added perf steps to nightly-8-gpu-qwen35 and nightly-8-gpu-mi35x-qwen35 jobs
- .github/workflows/nightly-test-amd-rocm720.yml: Added perf steps to nightly-8-gpu-qwen35-rocm720 and nightly-8-gpu-mi35x-qwen35-rocm720 jobs
Accuracy summary fix
- Override test_lm_eval in test_qwen35_eval_amd.py and test_qwen35_eval_mi35x.py to write the lm-eval results table to GITHUB_STEP_SUMMARY (same pattern as test_qwen3_instruct_mxfp4.py). No common code changed.
Behavior
- … continue-on-error on accuracy step)
- … continue-on-error: true on perf step)
CI validation
Run 3 — aiter attention backend (latest)
Run 2 — triton attention backend + accuracy summary fix ✅
Run 1 — initial (perf only, triton backend)
Test plan
- … yaml.safe_load)
- black, ruff, isort checks pass on all new/modified test files
- … register_amd_ci() calls and run_suite.py invocations