[AMD] Add MI35x nightly CI tests #16588
- test_mi35x_basic_1gpu.py: 1-GPU basic model tests
- test_mi35x_eval_2gpu.py: 2-GPU evaluation tests (TP=2)
- test_mi35x_large_8gpu.py: 8-GPU large model tests (TP=8)

Uses runners: linux-mi35x-gpu-1, linux-mi35x-gpu-2, linux-mi35x-gpu-8
- Added new nightly test jobs for MI35x GPUs, including 2-GPU and VLM evaluation tests.
- Updated existing nightly test configurations to include MI35x jobs.
- Introduced new test files for GSM8K completion and evaluation, along with VLM MMMU evaluation tests.
- Removed outdated 1-GPU and 2-GPU MI35x test files.

This update improves coverage for AMD's MI35x architecture in the nightly CI pipeline.
- Introduced new nightly test jobs for MI35x 8-GPU configurations, including tests for GPT-OSS, GROK, and DeepSeek models.
- Updated the run suite to include the new MI35x 8-GPU suite.
- Added a new test file for GSM8K completion evaluation specific to MI35x models.

This enhances the testing framework for AMD's MI35x architecture, ensuring comprehensive coverage in the nightly CI pipeline.
- Consolidated DeepSeek-R1 tests into a single job with combined DP and TC configurations.
- Introduced new performance benchmark for DeepSeek-R1-MXFP4 model on MI35x.
- Updated model configurations to include basic and MTP variants for MI35x.
- Enhanced test descriptions for clarity and accuracy in nightly evaluation.

This update streamlines the testing process and improves coverage for DeepSeek-R1 models in the nightly CI pipeline.
- Refactored model path configuration to prioritize environment variable, local path, and HuggingFace model ID.
- Introduced a new function to determine the effective model path based on availability.
- Updated test classes to utilize the new model path logic, improving flexibility and clarity in model sourcing.

This update streamlines the model path management in the nightly performance benchmarks for DeepSeek-R1-MXFP4 on MI35x.
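The lookup order described in this commit message (environment variable, then local path, then HuggingFace model ID) might look roughly like the sketch below. This is an illustration only, not the code from the PR: the environment variable name, the local directory, and the function name are all assumptions, and only the model ID `amd/DeepSeek-R1-MXFP4-Preview` appears elsewhere on this page.

```python
import os
from pathlib import Path

# Hypothetical names: the PR page does not show the real variable or path.
MODEL_PATH_ENV = "DEEPSEEK_R1_MXFP4_MODEL_PATH"  # assumed env var name
LOCAL_MODEL_PATH = "/data/models/DeepSeek-R1-MXFP4-Preview"  # assumed local cache dir
HF_MODEL_ID = "amd/DeepSeek-R1-MXFP4-Preview"  # model ID listed in the PR description


def get_effective_model_path(env=None, local_path=LOCAL_MODEL_PATH, hf_model_id=HF_MODEL_ID):
    """Return the first available model source, in priority order."""
    env = os.environ if env is None else env

    # 1. An explicit override from the environment always wins.
    override = env.get(MODEL_PATH_ENV)
    if override:
        return override

    # 2. Otherwise use a local copy if the directory exists on the runner.
    if Path(local_path).is_dir():
        return local_path

    # 3. Fall back to the HuggingFace model ID, to be downloaded on demand.
    return hf_model_id
```

Passing `env` and `local_path` as parameters keeps the helper easy to exercise in unit tests without touching the real environment or filesystem.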
- Removed the pull_request trigger from the nightly test workflow for AMD.
- Enhanced code readability by formatting multi-line function calls and string concatenations in the DeepSeek-R1-MXFP4 performance test.
- Cleaned up trailing whitespace in several test files.

This update streamlines the nightly testing process and improves code clarity across the AMD test suite.
… and MI35x

- Added pull_request trigger to the nightly test workflow for AMD.
- Consolidated DeepSeek-R1 tests into a single job with all variants (basic, MTP, DP, TC) for MI35x.
- Updated model configurations to reflect the new naming and structure, ensuring consistency across tests.
- Enhanced logging to include variant names in test summaries for better clarity.

This update improves the nightly testing process and ensures comprehensive coverage for DeepSeek-R1 models.
- Changed MI35x accuracy tests from deepseek-r1-all to deepseek-r1
- Only runs basic and MTP variants (DP/TC cause timeout with full model)
- DeepSeek-R1-0528 (~91GB/GPU) too large for DP initialization on MI35x
- MXFP4 still used for perf tests
- Reduced timeout from 300 to 180 minutes
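One way to picture the switch from `deepseek-r1-all` to `deepseek-r1` is as a mapping from model group to the variants it runs. The sketch below is hypothetical: the group names, variant names, and the 180-minute timeout come from the commit messages on this page, while the constant and function names are invented for illustration and are not taken from the PR.

```python
# Hypothetical sketch; only the group names, variant names, and timeout
# value are taken from the commit messages.
ALL_VARIANTS = ("basic", "mtp", "dp", "tc")

MODEL_GROUP_VARIANTS = {
    # Full set: basic, MTP, DP attention, torch compile.
    "deepseek-r1-all": ALL_VARIANTS,
    # Reduced set used for MI35x accuracy tests, since DP/TC time out
    # with the full model.
    "deepseek-r1": ("basic", "mtp"),
}

JOB_TIMEOUT_MINUTES = 180  # reduced from 300


def variants_for_group(group: str) -> tuple[str, ...]:
    """Return the variant names a model group runs, or raise on unknown groups."""
    try:
        return MODEL_GROUP_VARIANTS[group]
    except KeyError:
        raise ValueError(f"unknown model group: {group!r}") from None


if __name__ == "__main__":
    for name in variants_for_group("deepseek-r1"):
        print(name)
```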
CI all green: https://github.com/sgl-project/sglang/actions/runs/20770209451. Ready for review and merge. @bingxche @yctseng0211 @HaiShaw
These suites were migrated to test/registered/amd/nightly/ and are now managed by test/run_suite.py using the registry system.
PR tests passed before merging upstream: https://github.com/sgl-project/sglang/actions/runs/20773365925. cc: @HaiShaw @yctseng0211 @bingxche. The merge of #15712 caused the AMD CI job stage-a-test-1-amd (linux-mi325-gpu-1) to fail. @Fridge003
These three failures will be fixed by #16675
All AMD PR tests pass: https://github.com/sgl-project/sglang/actions/runs/20837677036. Ready to merge, @HaiShaw. Thanks!
yctseng0211
left a comment
LGTM, changed amd only

Motivation
Add nightly CI tests for the new MI35x cluster, enabling comprehensive testing of SGLang on AMD's MI35x architecture alongside existing MI300X tests.
MI35x Coverage: +32 model/tests total — 17 TP1/TP2 models, 5 VLMs, 2 GPT-OSS, 3 GROK, 2 DeepSeek-R1-MXFP4 variants (basic, MTP), and 3 perf benchmarks (2 GROK + 1 DeepSeek).
CI all green: https://github.com/sgl-project/sglang/actions/runs/20770209451
Please help to review. @yctseng0211 @bingxche @HaiShaw
Modifications
New MI35x test jobs (linux-mi35x-gpu-2, linux-mi35x-gpu-8 runners):
- nightly-test-2-gpu-mi35x: 2-GPU evaluation tests
- nightly-test-2-gpu-vlm-mi35x: 2-GPU VLM MMMU tests
- nightly-test-8-gpu-mi35x-gpt-oss: GPT-OSS models (openai/* paths)
- nightly-test-8-gpu-mi35x-grok: GROK models
- nightly-test-8-gpu-mi35x-deepseek-r1: DeepSeek-R1 (basic + MTP)
- nightly-perf-8-gpu-mi35x-grok: GROK performance benchmarks
- nightly-perf-8-gpu-mi35x-deepseek-r1-mxfp4: DeepSeek-R1-MXFP4 performance

MI300X improvements:
- deepseek-r1 job to run basic + MTP variants + DP attention + torch compile tests
- test/srt/nightly/ to test/registered/amd/nightly/ per CI reorg roadmap

Accuracy Tests
TestNightlyGsm8KEval (TP=2)
Model Group: grok
MI35x Model Group: deepseek-r1
Details in CI: https://github.com/sgl-project/sglang/actions/runs/20770209451
Benchmarking and Profiling
TestNightlyGrokPerformance
amd/grok-1-W4A8KV8 (grok1)
xai-org/grok-2 (grok2)
TestNightlyDeepseekR1MXFP4Performance
amd/DeepSeek-R1-MXFP4-Preview (basic)