[rebase]Deepseek_v4 support w4(mxfp4)a16 on hopper #24986
/tag-and-rerun-ci

/rerun-stage stage-c-test-dsv4-8-gpu-h200

🚀 Triggered

55d32cd to 91d1df1

/rerun-stage stage-c-test-dsv4-8-gpu-h200

🚀 Triggered

@shiyu7 Please take a look at this CI failure: https://github.com/sgl-project/sglang/actions/runs/25769201102/job/75688469859

Thanks @Fridge003 @yhyang201. I found that when an nvshmem error occurs, the watchdog fires and tries to generate a Python dump. I believe these two issues are related. Since this PR does not modify any DeepEP-related code, I've slightly increased the watchdog timeout to 900s. Could you please rerun the CI?

/rerun-stage stage-c-test-dsv4-8-gpu-h200

🚀 Triggered

/rerun-stage stage-c-test-dsv4-8-gpu-h200

🚀 Triggered

Motivation
Per @zhangxiaolei123456 in #23686.
Modifications
Rebase the MXFP4 support from the deepseek_v4 branch onto the main branch.
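For readers unfamiliar with the "w4(mxfp4)a16" naming: weights are stored as 4-bit MXFP4 values and expanded to a 16-bit type before the matmul, while activations stay in 16 bits. The sketch below is a plain-Python illustration of decoding one MXFP4 block (32 E2M1 elements sharing one E8M0 scale, as defined by the OCP Microscaling spec). It is not the kernel from this PR; the function names and the low-nibble-first packing order are assumptions made for the example, and the result is returned as Python floats rather than fp16/bf16.

```python
# E2M1 magnitudes, indexed by the low 3 bits of a 4-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # MXFP4 elements that share one E8M0 scale


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_mxfp4_block(packed: bytes, scale_e8m0: int) -> list[float]:
    """Expand one packed MXFP4 block to floats (hypothetical helper).

    packed: 16 bytes, two 4-bit codes per byte. Low nibble first is an
            assumption; real kernels may use a different layout.
    scale_e8m0: shared 8-bit exponent with bias 127; 0xFF encodes NaN.
    """
    if len(packed) != BLOCK_SIZE // 2:
        raise ValueError("expected 16 packed bytes per MXFP4 block")
    if scale_e8m0 == 0xFF:
        return [float("nan")] * BLOCK_SIZE
    scale = 2.0 ** (scale_e8m0 - 127)
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F) * scale)  # low nibble
        values.append(decode_e2m1(byte >> 4) * scale)    # high nibble
    return values
```

In a real w4a16 path this expansion happens inside the GEMM kernel, so the weights only ever occupy 4 bits per element (plus one scale byte per 32 elements) in memory.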
Accuracy Tests
V4 Flash
DeepSeek V4 Flash run command:
MMLU
GSM8K
GPQA
LongBench
V4 Pro
I have added the accuracy tests for V4 Pro. Please let me know if any other tests are required.
Note: PR #24952 is required.
Run command
GSM8K
Speed Tests and Profiling