[AMD] CI - Add MI35x nightly/PR tests for kv-cache-fp8 and allreduce-fusion (DeepSeek)#19834
…uce-fusion variants
…he test lasts very short compared to runner init time, will cause longer queue time for 8-gpu-runner
@bingxche @michaelzhang-ai please help review, thanks!
PR Test:
Please double-check the test args. The accuracy tests and the perf tests launch the server with different arguments.

Accuracy test:
other_args=[
    "--attention-backend", "aiter",
    "--chunked-prefill-size", "131072",
    "--disable-radix-cache",
    "--mem-fraction-static", "0.85",
    "--trust-remote-code",
    "--kv-cache-dtype", "fp8_e4m3",
],
env_vars={"SGLANG_USE_AITER": "1"},

Perf test:
"other_args": [
    "--trust-remote-code",
    "--tp", "8",
    "--chunked-prefill-size", "131072",
    "--disable-radix-cache",
    "--mem-fraction-static", "0.85",
    "--kv-cache-dtype", "fp8_e4m3",
],

This means the perf tests benchmark a different server configuration than the one the accuracy tests validate. If the aiter backend is needed for the MXFP4 model on MI35x, the perf numbers won't reflect production behavior. The same issue exists for the allreduce-fusion variant.
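To make the mismatch above easier to spot, here is a minimal sketch that diffs the two argument lists quoted in this comment. The helper names `parse_flags` and `diff_flags` are hypothetical and are not part of the SGLang test suite. The sketch also assumes every flag starts with `--` and takes at most one value.

```python
def parse_flags(args):
    """Group a flat CLI argument list into {flag: value}; bare flags map to True."""
    flags = {}
    i = 0
    while i < len(args):
        name = args[i]
        # A flag has a value if the next token exists and is not itself a flag.
        has_value = i + 1 < len(args) and not args[i + 1].startswith("--")
        flags[name] = args[i + 1] if has_value else True
        i += 2 if has_value else 1
    return flags


def diff_flags(a, b):
    """Return {flag: (value_in_a, value_in_b)} for every flag that differs."""
    fa, fb = parse_flags(a), parse_flags(b)
    return {
        name: (fa.get(name), fb.get(name))
        for name in sorted(fa.keys() | fb.keys())
        if fa.get(name) != fb.get(name)
    }


# Argument lists copied from the review comment above.
accuracy_args = [
    "--attention-backend", "aiter",
    "--chunked-prefill-size", "131072",
    "--disable-radix-cache",
    "--mem-fraction-static", "0.85",
    "--trust-remote-code",
    "--kv-cache-dtype", "fp8_e4m3",
]
perf_args = [
    "--trust-remote-code",
    "--tp", "8",
    "--chunked-prefill-size", "131072",
    "--disable-radix-cache",
    "--mem-fraction-static", "0.85",
    "--kv-cache-dtype", "fp8_e4m3",
]

for flag, (acc, perf) in diff_flags(accuracy_args, perf_args).items():
    print(f"{flag}: accuracy={acc!r} perf={perf!r}")
```

On these two lists the only differing flags are `--attention-backend` and `--tp`. The `SGLANG_USE_AITER` environment variable is set outside `other_args`, so this diff does not cover it and it would need a separate check.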
@michaelzhang-ai Thanks for your review. The configs are based on the existing AMD nightly test.
…fusion (DeepSeek) (sgl-project#19834) Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Motivation
Track accuracy and performance regression for two new DeepSeek-R1-MXFP4 server configurations on MI35x:
--kv-cache-dtype fp8_e4m3
--enable-aiter-allreduce-fusion
Update work_flow dispatch mechanism to enable multi-job triggering.
Add PD/D test to pr-test-amd-rocm720.yml
Modifications
test/registered/amd/nightly-test-amd.yml and nightly-test-amd-rocm720.yml
Accuracy Tests
Nightly Test:
PR Test:
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci