[AMD] Add Qwen3.5-397B FP8 nightly perf benchmarks for MI30x and MI35x #21669

Merged
HaiShaw merged 5 commits into main from amd/qwen35-fp8-perf-test
Apr 7, 2026
@michaelzhang-ai (Collaborator) commented Mar 30, 2026

Summary

  • Add bench_one_batch performance tests for Qwen3.5-397B-A17B-FP8 on both MI325/MI300X and MI35x GPUs
  • Perf steps run after existing Qwen3.5 accuracy tests in the same CI job, with continue-on-error: true so perf failures don't block CI when accuracy passes
  • Updated all 4 workflow locations: (default ROCm, ROCm 7.2) × (MI30x, MI35x)
  • Write Qwen3.5 lm-eval accuracy results to GitHub step summary (same pattern as MXFP4 tests)

Changes

New test files

| GPU | File | Suite |
| --- | --- | --- |
| MI30x | test/registered/amd/perf/mi30x/test_qwen35_fp8_perf_amd.py | nightly-perf-8-gpu-qwen35-fp8 |
| MI35x | test/registered/amd/perf/mi35x/test_qwen35_fp8_perf_mi35x.py | nightly-perf-8-gpu-mi35x-qwen35-fp8 |

Server configuration (matches InferenceX benchmarks)

  • Model: Qwen/Qwen3.5-397B-A17B-FP8 (pre-quantized FP8 checkpoint)
  • --attention-backend aiter
  • --tp 8, --mem-fraction-static 0.8
  • --model-loader-extra-config '{"enable_multithread_load": true}'
  • --watchdog-timeout 1200
  • SGLANG_USE_AITER=1
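
As a rough illustration, the settings above could be assembled into a launch command like this. It is a minimal sketch, not the code in the new test files: the flag values come from the list above, `python3 -m sglang.launch_server` with `--model-path` is the usual SGLang server entry point, and the helper name `build_launch_command` is hypothetical.

```python
import json
import os
import shlex

MODEL = "Qwen/Qwen3.5-397B-A17B-FP8"


def build_launch_command(model=MODEL):
    """Return (argv, env) for the server configuration listed above.

    Hypothetical helper for illustration; the actual test files may build
    these arguments differently.
    """
    argv = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model,
        "--attention-backend", "aiter",
        "--tp", "8",
        "--mem-fraction-static", "0.8",
        # json.dumps yields the same JSON string as in the bullet list.
        "--model-loader-extra-config", json.dumps({"enable_multithread_load": True}),
        "--watchdog-timeout", "1200",
    ]
    # Copy the current environment and enable the aiter code path.
    env = dict(os.environ, SGLANG_USE_AITER="1")
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch_command()
    print(shlex.join(argv))
```

Building the JSON argument with `json.dumps` avoids hand-quoting the `--model-loader-extra-config` value when the command is passed as a list rather than through a shell.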

Workflow updates

  • .github/workflows/nightly-test-amd.yml: Added perf steps to nightly-8-gpu-qwen35 and nightly-8-gpu-mi35x-qwen35 jobs
  • .github/workflows/nightly-test-amd-rocm720.yml: Added perf steps to nightly-8-gpu-qwen35-rocm720 and nightly-8-gpu-mi35x-qwen35-rocm720 jobs

Accuracy summary fix

  • Override test_lm_eval in test_qwen35_eval_amd.py and test_qwen35_eval_mi35x.py to write lm-eval results table to GITHUB_STEP_SUMMARY (same pattern as test_qwen3_instruct_mxfp4.py). No common code changed.
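
For readers unfamiliar with the step-summary mechanism, here is a minimal sketch of the general pattern, assuming results arrive as `(task, metric, value)` tuples. The function name `write_summary_table` and the row shape are hypothetical; the actual override in the test files may format its table differently.

```python
import os


def write_summary_table(rows, summary_path=None):
    """Append a markdown table of (task, metric, value) rows to the step summary.

    Does nothing on disk when no path is given and GITHUB_STEP_SUMMARY is
    unset (for example in local runs). Returns the markdown text.
    """
    lines = ["| Task | Metric | Value |", "| --- | --- | --- |"]
    for task, metric, value in rows:
        lines.append(f"| {task} | {metric} | {value:.4f} |")
    markdown = "\n".join(lines) + "\n"

    path = summary_path or os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        # GitHub renders whatever is appended to this file on the job page.
        with open(path, "a", encoding="utf-8") as f:
            f.write(markdown)
    return markdown


if __name__ == "__main__":
    # Placeholder row for illustration only, not a measured result.
    print(write_summary_table([("example_task", "example_metric", 0.5)]))
```

Opening the file in append mode matters here, since other steps in the same job may already have written to the summary.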

Behavior

  • If accuracy fails → perf step is skipped (no continue-on-error on accuracy step)
  • If accuracy passes but perf fails → job still passes (continue-on-error: true on perf step)
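
The two rules above amount to a small truth table. The sketch below models it in Python purely for illustration. It assumes default GitHub Actions step semantics (a failed step without `continue-on-error` fails the job and skips later steps), and `job_outcome` is a hypothetical name, not part of the workflows.

```python
def job_outcome(accuracy_ok, perf_ok):
    """Model the two-step job: accuracy (blocking), then perf (continue-on-error).

    Returns (perf_step_ran, job_passed).
    """
    if not accuracy_ok:
        # The accuracy step has no continue-on-error, so its failure fails
        # the job and the perf step after it is skipped.
        return False, False
    # The perf step runs; with continue-on-error: true its result does not
    # change the job conclusion, so perf_ok is deliberately ignored.
    return True, True


for accuracy_ok in (True, False):
    for perf_ok in (True, False):
        ran, passed = job_outcome(accuracy_ok, perf_ok)
        print(f"accuracy_ok={accuracy_ok} perf_ok={perf_ok} -> perf_ran={ran} job_passed={passed}")
```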

CI validation

Run 3 — aiter attention backend (latest)

Run 2 — triton attention backend + accuracy summary fix ✅

| Job | Duration | Accuracy | Performance |
| --- | --- | --- | --- |
| MI30x default ROCm | 59m | | |
| MI35x default ROCm | 47m | | |
| MI30x ROCm 7.2 | 60m | | |
| MI35x ROCm 7.2 | 47m | | |

Run 1 — initial (perf only, triton backend)

Test plan

  • Verify YAML syntax is valid (done locally via yaml.safe_load)
  • Verify black, ruff, isort checks pass on all new/modified test files
  • Suite names match between register_amd_ci() calls and run_suite.py invocations
  • Run nightly on MI325 and MI35x — default ROCm ✅
  • Run nightly on MI325 and MI35x — ROCm 7.2 ✅
  • Verify accuracy results appear in step summary ✅
  • Verify aiter attention backend passes (Run 3)

…I35x

Add bench_one_batch performance tests for Qwen3.5-397B-A17B-FP8 on both
MI325/MI300X and MI35x GPUs. Perf steps run after existing accuracy tests
with continue-on-error so perf failures don't block CI when accuracy passes.

- New test files using triton attention backend, TP=8, mem-fraction 0.8
- Perf steps added to both default ROCm and ROCm 7.2 nightly workflows
- Suite names: nightly-perf-8-gpu-qwen35-fp8, nightly-perf-8-gpu-mi35x-qwen35-fp8

@gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces nightly performance benchmarks for the Qwen3.5-397B-A17B FP8 model on AMD MI30x and MI35x platforms. The reviewer identified significant code duplication between the benchmark scripts and suggested refactoring shared logic into a common module. Other feedback includes fixing a potential division-by-zero error in the ITL calculation and replacing hardcoded model paths with more portable environment-based configurations.

Comment thread test/registered/amd/perf/mi30x/test_qwen35_fp8_perf_amd.py
)

for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000

medium

The calculation for itl could lead to a ZeroDivisionError if result.output_throughput is zero. It's safer to check for this case to prevent the test from crashing during report generation. Rewriting the expression also improves readability.

Suggested change
- itl = 1 / (result.output_throughput / result.batch_size) * 1000
+ itl = (result.batch_size / result.output_throughput) * 1000 if result.output_throughput > 0 else 0
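
To make the arithmetic concrete: `output_throughput` is in tokens per second summed over the batch, so each request receives `output_throughput / batch_size` tokens per second, and the inter-token latency in milliseconds is the reciprocal times 1000. Below is a minimal standalone sketch of the guarded form; the function name `inter_token_latency_ms` is hypothetical and the parameters mirror the `result` fields in the snippet.

```python
def inter_token_latency_ms(output_throughput, batch_size):
    """Per-request inter-token latency in milliseconds.

    output_throughput: tokens/s summed over the whole batch.
    Returns 0.0 when throughput is not positive, matching the suggested change.
    """
    if output_throughput <= 0:
        return 0.0
    return batch_size / output_throughput * 1000.0


if __name__ == "__main__":
    # Illustrative inputs only, not benchmark results.
    print(inter_token_latency_ms(400.0, 8))
    print(inter_token_latency_ms(0.0, 8))
```

With a batch of 8 at 400 tokens/s in total, each request gets 50 tokens/s, which is about 20 ms between tokens. Returning 0 for a failed run keeps report generation from crashing, though a reader could mistake it for a real measurement; reporting `float("nan")` instead would be one alternative, which the review does not discuss.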

Comment on lines +11 to +12
os.environ.setdefault("HF_HOME", "/data2/models/huggingface")
os.environ.setdefault("HF_HUB_CACHE", "/data2/models/huggingface/hub")

medium

Hardcoding paths like /data2/models/huggingface makes the test less portable and dependent on a specific machine's setup. It's better to rely on the CI environment to set these environment variables, or use a more generic default that works in different environments (e.g., a path relative to the user's home directory).
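
One way to act on this suggestion is sketched below, assuming the CI runner exports `HF_HOME` itself when models live under a path such as /data2. The helper name `configure_hf_cache` is hypothetical. The fallbacks, `~/.cache/huggingface` and `$HF_HOME/hub`, are the Hugging Face defaults, so the function only makes those defaults explicit when nothing is set.

```python
import os


def configure_hf_cache(env=os.environ):
    """Set HF cache variables only if the environment has not set them.

    Returns the effective (HF_HOME, HF_HUB_CACHE) pair.
    """
    # setdefault leaves any value exported by the CI runner untouched.
    hf_home = env.setdefault("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    # Derive the hub cache from HF_HOME so the two stay consistent.
    env.setdefault("HF_HUB_CACHE", os.path.join(hf_home, "hub"))
    return env["HF_HOME"], env["HF_HUB_CACHE"]


if __name__ == "__main__":
    print(configure_hf_cache(dict(os.environ)))
```

Deriving `HF_HUB_CACHE` from `HF_HOME` means a runner only needs to set one variable to relocate the whole cache.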

)

for result in report_results:
    itl = 1 / (result.output_throughput / result.batch_size) * 1000

medium

The calculation for itl could lead to a ZeroDivisionError if result.output_throughput is zero. It's safer to check for this case to prevent the test from crashing during report generation. Rewriting the expression also improves readability.

Suggested change
- itl = 1 / (result.output_throughput / result.batch_size) * 1000
+ itl = (result.batch_size / result.output_throughput) * 1000 if result.output_throughput > 0 else 0

Override test_lm_eval in the Qwen3.5 accuracy tests to write a
markdown results table to GITHUB_STEP_SUMMARY, matching the pattern
used by the MXFP4 combined tests. No common code changed.
@michaelzhang-ai michaelzhang-ai requested a review from yichiche April 7, 2026 02:56
@michaelzhang-ai michaelzhang-ai changed the title from "[AMD CI] Add Qwen3.5-397B FP8 nightly perf benchmarks for MI30x and MI35x" to "[AMD] Add Qwen3.5-397B FP8 nightly perf benchmarks for MI30x and MI35x" Apr 7, 2026

@Jackycheng0808 Jackycheng0808 left a comment

We should use --attention-backend aiter instead.

Collaborator

@yichiche yichiche left a comment

LGTM

@HaiShaw HaiShaw merged commit ba78f6e into main Apr 7, 2026
58 of 64 checks passed
@HaiShaw HaiShaw deleted the amd/qwen35-fp8-perf-test branch April 7, 2026 06:46