Deepseek_v4 support w4(mxfp4)a16 on hopper #23686

Merged
Fridge003 merged 30 commits into sgl-project:deepseek_v4 from zhangxiaolei123456:deepseek_v4_w4a16
Apr 28, 2026

@zhangxiaolei123456
Contributor

@zhangxiaolei123456 zhangxiaolei123456 commented Apr 25, 2026

Motivation

Co-authored-by: shiyu7

Modifications

Accuracy Tests

Flash

SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Flash --tp 8 --cuda-graph-max-bs 256 --max-running-requests 256 --enable-metrics --host 0.0.0.0 --port 30300 --mem-fraction-static 0.8 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --moe-runner-backend marlin --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
curl -X POST http://localhost:30300/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-v4-fp8",
    "messages": [
        {
            "role": "user",
            "content": "做一份北京出行攻略"
        }
    ],
    "max_tokens": 500,
    "temperature": 0.6
}'
{"id":"86101dde6e494c11a9b5c5f34b073115","object":"chat.completion","created":1777086102,"model":"deepseek-v4-fp8","choices":[{"index":0,"message":{"role":"assistant","content":"北京作为中国的首都,是一座拥有悠久历史和丰富文化的城市。以下是一份北京出行攻略,帮助您规划一次愉快的北京之旅。\n\n### 1. 行前准备\n\n-   **交通**:北京交通便利,建议选择地铁、公交或打车。下载“北京通”或“亿通行”APP,方便乘坐地铁。\n-   **住宿**:建议选择市中心或交通枢纽附近的酒店,如王府井、国贸、中关村等区域。\n-   **最佳时间**:春秋季(4月-6月、9月-11月)气候宜人,秋高气爽。夏季(7月-8月)炎热,冬季(12月-2月)寒冷,需注意保暖。\n\n### 2. 必游景点\n\n- **故宫**:游览“世界五大宫殿”之一的故宫,感受皇家宫殿的宏伟。\n- **天坛**:参观祈年殿、回廊等,了解古代祭祀文化。\n- **颐和园**:游览皇家园林,欣赏昆明湖和长廊。\n- **长城**:八达岭长城、慕田峪长城,感受“不到长城非好汉”的豪情。\n- **天安门**:参观天安门广场,感受政治文化中心的庄重。\n- **圆明园**:参观皇家园林遗址,了解历史变迁。\n\n### 3. 美食推荐\n\n- **北京烤鸭**:全聚德、大董等老字号,皮脆肉嫩。\n- **涮羊肉**:东来顺、海底捞等,肉质鲜嫩。\n- **炸酱面**:面条筋道,酱料浓郁。\n- **糖葫芦**:老北京小吃,酸甜可口。\n- **北京小吃**:豆汁儿、焦圈、卤煮等,体验地道风味。\n\n### 4. 购物推荐\n\n- **王府井大街**:老字号购物街,购买传统工艺品。\n- **国贸**:现代商业区,购物、餐饮、娱乐一体。\n- **三里屯**:时尚潮流聚集地,酒吧、餐厅众多。\n- **大栅栏**:传统商业街,购买北京特色小吃和手工艺品。\n\n### 5. 交通指南\n\n- **市内交通**:地铁、公交、出租车、网约车等。建议使用“北京交通”APP,查询实时路况。\n- **长途交通**:北京","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":9,"total_tokens":509,"completion_tokens":500,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
(English gloss: the prompt is Chinese for "Make a travel guide for Beijing." The reply is a Chinese-language guide with sections on pre-trip preparation, must-see sights, food, shopping, and transport. It stops mid-sentence because it reached the 500-token max_tokens limit, hence finish_reason "length".)

GSM8K

python3 bench_sglang.py --host http://localhost  --port 30300 --data-path /data00 --num-questions 5000 --parallel 100
100%|████████████████████████████| 1319/1319 [01:49<00:00, 12.06it/s]
Accuracy: 0.955
Invalid: 0.000
Latency: 109.555 s
Output throughput: 1083.640 token/s
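
The GSM8K figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers copied from the log; the variable names are illustrative, and it assumes "Output throughput" means total generated tokens divided by the reported latency. Note that the run covered 1319 questions even though --num-questions was 5000, presumably because the test split is smaller than the requested count.

```python
# Figures copied from the GSM8K log above.
num_questions = 1319          # total shown in the progress bar
latency_s = 109.555           # "Latency"
throughput_tok_s = 1083.640   # "Output throughput"

# Assumption: throughput = total output tokens / latency.
total_output_tokens = throughput_tok_s * latency_s
tokens_per_question = total_output_tokens / num_questions
questions_per_second = num_questions / latency_s

print(f"total output tokens  ~ {total_output_tokens:,.0f}")
print(f"tokens per question  ~ {tokens_per_question:.1f}")
print(f"questions per second ~ {questions_per_second:.2f}")
```

The questions-per-second value lands close to the 12.06it/s shown in the progress bar, which is what one would expect if both numbers describe the same run.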

MMLU

python3 bench_sglang.py --parallel 128 --backend srt --host http://127.0.0.1 --port 30300 --data_dir /data00/mmlu
100%|██████████████████████████| 14042/14042 [08:05<00:00, 28.92it/s]
subject: abstract_algebra, #q:100, acc: 0.880
subject: anatomy, #q:135, acc: 0.889
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.850
subject: clinical_knowledge, #q:265, acc: 0.891
subject: college_biology, #q:144, acc: 0.979
subject: college_chemistry, #q:100, acc: 0.690
subject: college_computer_science, #q:100, acc: 0.910
subject: college_mathematics, #q:100, acc: 0.860
subject: college_medicine, #q:173, acc: 0.850
subject: college_physics, #q:102, acc: 0.951
subject: computer_security, #q:100, acc: 0.840
subject: conceptual_physics, #q:235, acc: 0.966
subject: econometrics, #q:114, acc: 0.816
subject: electrical_engineering, #q:145, acc: 0.890
subject: elementary_mathematics, #q:378, acc: 0.960
subject: formal_logic, #q:126, acc: 0.770
subject: global_facts, #q:100, acc: 0.750
subject: high_school_biology, #q:310, acc: 0.965
subject: high_school_chemistry, #q:203, acc: 0.882
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.897
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.995
subject: high_school_macroeconomics, #q:390, acc: 0.933
subject: high_school_mathematics, #q:270, acc: 0.807
subject: high_school_microeconomics, #q:238, acc: 0.975
subject: high_school_physics, #q:151, acc: 0.868
subject: high_school_psychology, #q:545, acc: 0.961
subject: high_school_statistics, #q:216, acc: 0.917
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.932
subject: human_aging, #q:223, acc: 0.870
subject: human_sexuality, #q:131, acc: 0.931
subject: international_law, #q:121, acc: 0.942
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.914
subject: machine_learning, #q:112, acc: 0.839
subject: management, #q:103, acc: 0.922
subject: marketing, #q:234, acc: 0.970
subject: medical_genetics, #q:100, acc: 0.980
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.867
subject: moral_scenarios, #q:895, acc: 0.793
subject: nutrition, #q:306, acc: 0.938
subject: philosophy, #q:311, acc: 0.913
subject: prehistory, #q:324, acc: 0.941
subject: professional_accounting, #q:282, acc: 0.858
subject: professional_law, #q:1534, acc: 0.731
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.930
subject: public_relations, #q:110, acc: 0.818
subject: security_studies, #q:245, acc: 0.873
subject: sociology, #q:201, acc: 0.970
subject: us_foreign_policy, #q:100, acc: 0.960
subject: virology, #q:166, acc: 0.590
subject: world_religions, #q:171, acc: 0.930
Total latency: 485.510
Average accuracy: 0.885
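
The per-subject rows can be aggregated in two ways: weighted by question count, or as a plain mean over subjects. The sketch below parses rows in the log format above and computes both; the helper names are hypothetical and only three rows are pasted in to keep it short. Applied to all 57 rows, the question-weighted mean comes out at roughly 0.885, which suggests (but does not prove) that the reported "Average accuracy" is weighted by #q.

```python
import re

# Three rows copied from the log above; paste all 57 to reproduce the total.
LOG = """\
subject: abstract_algebra, #q:100, acc: 0.880
subject: professional_law, #q:1534, acc: 0.731
subject: virology, #q:166, acc: 0.590
"""

ROW = re.compile(r"subject: (\w+), #q:(\d+), acc: ([\d.]+)")


def parse_rows(log: str) -> list[tuple[str, int, float]]:
    """Return (subject, question_count, accuracy) for each matching line."""
    return [(m[1], int(m[2]), float(m[3])) for m in ROW.finditer(log)]


def weighted_accuracy(rows) -> float:
    """Mean accuracy weighted by the number of questions per subject."""
    total = sum(n for _, n, _ in rows)
    return sum(n * acc for _, n, acc in rows) / total


def unweighted_accuracy(rows) -> float:
    """Plain mean over subjects, ignoring question counts."""
    return sum(acc for _, _, acc in rows) / len(rows)


rows = parse_rows(LOG)
print(f"weighted:   {weighted_accuracy(rows):.3f}")
print(f"unweighted: {unweighted_accuracy(rows):.3f}")
```

Large subjects such as professional_law (1534 questions) pull the weighted figure toward their own accuracy, so the two averages can differ noticeably.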

GPQA

python -m sglang.test.run_eval --port 30300 --eval-name gpqa --num-examples 32 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
====================
Repeat: 8, mean: 0.906
Scores: ['0.906', '0.938', '0.938', '0.875', '0.875', '0.906', '0.906', '0.906']
====================
[METRIC] gpqa_mean_score=0.90625 labels={"model": "/data00/models/DeepSeek-V4-Flash", "eval": "gpqa", "repeat": 8}
Writing report to /tmp/gpqa__data00_models_DeepSeek-V4-Flash.html
{'chars': np.float64(1353.125), 'chars:std': np.float64(481.00173791266076), 'score:std': np.float64(0.2914805954090255), 'scores': ['0.906', '0.938', '0.938', '0.875', '0.875', '0.906', '0.906', '0.906'], 'mean_score': np.float64(0.90625)}
Writing results to /tmp/gpqa__data00_models_DeepSeek-V4-Flash.json
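
The reported score:std (about 0.29) is much larger than the spread between the eight repeat scores. The sketch below, using only numbers from the log, shows that it is consistent with the standard deviation of per-question 0/1 scores at an accuracy of 0.90625. That reading is an assumption about how the eval tool computes the field, not something stated in the thread.

```python
import math
import statistics

# Per-repeat scores and --num-examples, copied from the log above.
scores = [0.906, 0.938, 0.938, 0.875, 0.875, 0.906, 0.906, 0.906]
num_examples = 32

# Recover the integer number of correct answers in each repeat.
correct = [round(s * num_examples) for s in scores]

# Mean over all repeats and examples.
mean_score = sum(correct) / (num_examples * len(correct))

# Standard deviation of a 0/1 outcome with success rate mean_score.
per_question_std = math.sqrt(mean_score * (1 - mean_score))

# Spread between repeats, for comparison.
across_repeat_std = statistics.pstdev([c / num_examples for c in correct])

print(correct)
print(mean_score, per_question_std, across_repeat_std)
```

The between-repeat spread is roughly 0.02, so it is the more useful figure when comparing runs.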

LongBench v2

python result.py
file='DeepSeek-V4-Flash.jsonl' (easy_acc + hard_acc) / len(pred_data)=0.6923076923076923
['Model\tOverall\tEasy\tHard\tShort\tMedium\tLong', 'DeepSeek-V4-Flash\t69.2\t77.1\t63.8\t72.1\t68.5\t65.0']
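
The result above is printed as a raw Python list of tab-separated strings. A minimal sketch for turning it into named fields follows; the two strings are copied from the output, and everything else is illustrative.

```python
# The two strings printed by result.py, copied from the output above.
rows = [
    "Model\tOverall\tEasy\tHard\tShort\tMedium\tLong",
    "DeepSeek-V4-Flash\t69.2\t77.1\t63.8\t72.1\t68.5\t65.0",
]

# Split the header and the data rows on tabs.
header, *data = (row.split("\t") for row in rows)
table = [dict(zip(header, row)) for row in data]

for record in table:
    for name, value in record.items():
        print(f"{name:8s} {value}")

# The "Overall" column is the printed fraction expressed as a percentage.
overall_fraction = 0.6923076923076923
print(round(overall_fraction * 100, 1))
```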

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@zhangxiaolei123456 zhangxiaolei123456 changed the title from "Deepseek_v4 support w4a16" to "Deepseek_v4 support w4a16 on hopper" Apr 25, 2026
@zhangxiaolei123456 zhangxiaolei123456 mentioned this pull request Apr 25, 2026
34 tasks
@zhangxiaolei123456 zhangxiaolei123456 changed the title from "Deepseek_v4 support w4a16 on hopper" to "Deepseek_v4 support w4(mxfp4)a16 on hopper" Apr 27, 2026
  size_k=K,
  is_k_full=is_k_full,
- use_atomic_add=use_atomic_add,
+ use_atomic_add=False,
Collaborator

Protect this with an if branch

@Fridge003
Collaborator

Simple smoke test:

# Launch
sglang serve --trust-remote-code --model-path deepseek-ai/DeepSeek-V4-Pro --tp 8 --cuda-graph-max-bs 256 --max-running-requests 256 --enable-metrics --host 0.0.0.0 --port 30300 --mem-fraction-static 0.8 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --moe-runner-backend marlin --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

# Send-one
python3 -m sglang.test.send_one --port 30300
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.290    |  512   |   2.926    |     155.62      |
+-------------+--------+------------+-----------------+
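
The columns in the send_one table relate to each other by simple arithmetic. The sketch below uses the numbers from the table and assumes "Acc Length" is the average number of tokens accepted per speculative-decoding verification step; that interpretation is an assumption, not something stated in the thread.

```python
# Values copied from the send_one table above.
latency_s = 3.290
tokens = 512
acc_length = 2.926  # assumed: average tokens accepted per verification step

# Speed is tokens divided by latency.
speed_tok_s = tokens / latency_s

# Under the assumption above, the number of verification steps and the
# average wall-clock time per step follow directly.
verify_steps = tokens / acc_length
ms_per_step = latency_s / verify_steps * 1000

print(f"speed        ~ {speed_tok_s:.2f} token/s")
print(f"verify steps ~ {verify_steps:.0f}")
print(f"time/step    ~ {ms_per_step:.1f} ms")
```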

@Fridge003
Collaborator

Longbench result on flash model with marlin:

file='DeepSeek-V4-Flash.jsonl' (easy_acc + hard_acc) / len(pred_data)=0.6923076923076923
['Model\tOverall\tEasy\tHard\tShort\tMedium\tLong', 'DeepSeek-V4-Flash\t69.2\t77.1\t63.8\t72.1\t68.5\t65.0']

@EanWang211123
Contributor

Hey, I tried using your PR on H20. I noticed that you modified part of the sgl-kernel code, but I ran into some errors during compilation, for example:

      Cloning into 'repo-deepgemm-src'...
      fatal: reference is not a tree: 54f99a8af537b3c6eb4819b69907ccbe2b600792
      CMake Error at repo-deepgemm-subbuild/repo-deepgemm-populate-prefix/tmp/repo-deepgemm-populate-gitclone.cmake:61 (message):
        Failed to checkout tag: '54f99a8af537b3c6eb4819b69907ccbe2b600792'

Could you share how you temporarily compiled sgl-kernel? That would be very helpful to me.

@zhangxiaolei123456
Contributor Author

You need to change 54f99a8af537b3c6eb4819b69907ccbe2b600792 to ffe2b6b97420a9f8c58268ca55755168e6e2f360.
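
A minimal sketch of that substitution, for anyone scripting the workaround. The two hashes come from this thread; the helper name and the sample CMake text are hypothetical, and the thread does not say which file in sgl-kernel holds the pin.

```python
# Hashes taken from the comments above.
OLD_COMMIT = "54f99a8af537b3c6eb4819b69907ccbe2b600792"
NEW_COMMIT = "ffe2b6b97420a9f8c58268ca55755168e6e2f360"


def repin(text: str, old: str = OLD_COMMIT, new: str = NEW_COMMIT) -> str:
    """Replace the pinned commit hash, failing loudly if it is absent."""
    if old not in text:
        raise ValueError(f"pinned commit {old} not found")
    return text.replace(old, new)


# Hypothetical stand-in for the build file that pins the dependency.
sample = (
    "FetchContent_Declare(\n"
    "  repo-deepgemm\n"
    f"  GIT_TAG {OLD_COMMIT}\n"
    ")\n"
)

print(repin(sample))
```

In practice one would read the real build file, pass its contents through repin, write the result back, and rebuild sgl-kernel.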

@EanWang211123
Contributor

> you need modify the 54f99a8af537b3c6eb4819b69907ccbe2b600792 to ffe2b6b97420a9f8c58268ca55755168e6e2f360

Thanks! Is there anything else I should pay attention to, or do I only need to run make install?

@Fridge003
Collaborator

Fridge003 commented Apr 28, 2026

AIME25 test:

# Server launch
 SGLANG_ENABLE_THINKING=1 SGLANG_REASONING_EFFORT=max sglang serve --trust-remote-code --model-path deepseek-ai/DeepSeek-V4-Pro --tp 8 --cuda-graph-max-bs 256 --max-running-requests 256 --enable-metrics --host 0.0.0.0 --port 30300 --mem-fraction-static 0.8 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --moe-runner-backend marlin --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4


# Result
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 17393      | 19133       | 96.25% ± 2.95%   | 2.29%    
nemo-run_1/0 majority@16       | 30          | 17393      | 19133       | 100.00%          | 0.00%    
nemo-run_1/0 pass@16           | 30          | 17393      | 19133       | 100.00%          | 0.00%    

# After regrading (bypass the boxed limitation)
----------------------------------------- aime25 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 17393      | 19133       | 98.33% ± 2.11%   | 0.21%    
majority@16       | 30          | 17393      | 19133       | 100.00%          | 0.00%    
pass@16           | 30          | 17393      | 19133       | 100.00%          | 0.00%    
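
The percentages in the two tables can be converted back into sample counts, which makes the effect of regrading easier to see. The sketch below assumes pass@1[avg-of-16] and no_answer are fractions of all 30 x 16 = 480 generated samples; the helper name is illustrative.

```python
# 30 problems, 16 samples each, as shown in the tables above.
num_problems = 30
samples_per_problem = 16
total_samples = num_problems * samples_per_problem


def to_count(percent: float) -> int:
    """Convert a reported percentage into a whole number of samples."""
    return round(percent / 100 * total_samples)


# pass@1[avg-of-16] row, before and after regrading.
correct_before = to_count(96.25)
no_answer_before = to_count(2.29)
correct_after = to_count(98.33)
no_answer_after = to_count(0.21)

recovered = no_answer_before - no_answer_after
gained = correct_after - correct_before

print(total_samples, correct_before, no_answer_before)
print(correct_after, no_answer_after, recovered, gained)
```

Under that assumption, regrading recovers 10 of the 11 samples that had no extracted answer, and the number of correct samples rises by the same 10.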

@Fridge003
Collaborator

@zhangxiaolei123456 Thanks for your contribution!

@Fridge003 Fridge003 merged commit 2ef8310 into sgl-project:deepseek_v4 Apr 28, 2026
1 check passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 28, 2026


The Hopper w4a16 PR (sgl-project#23686) restructured the FP4 expert weight
processing branches in a way that blocks the default deepgemm/auto
backend path with a NotImplementedError. This restores the original
logic and treats marlin as the special-case addition.