Deepseek_v4 support w4(mxfp4)a16 on hopper by zhangxiaolei123456 · Pull Request #23686 · sgl-project/sglang

zhangxiaolei123456 · 2026-04-25T03:17:16Z

Motivation

Co-authored-by: shiyu7

Modifications

Accuracy Tests

Flash

SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Flash --tp 8 --cuda-graph-max-bs 256 --max-running-requests 256 --enable-metrics --host 0.0.0.0 --port 30300 --mem-fraction-static 0.8 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --moe-runner-backend marlin --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

curl -X POST http://localhost:30300/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-v4-fp8",
    "messages": [
        {
            "role": "user",
            "content": "做一份北京出行攻略"
        }
    ],
    "max_tokens": 500,
    "temperature": 0.6
}'
{"id":"86101dde6e494c11a9b5c5f34b073115","object":"chat.completion","created":1777086102,"model":"deepseek-v4-fp8","choices":[{"index":0,"message":{"role":"assistant","content":"北京作为中国的首都，是一座拥有悠久历史和丰富文化的城市。以下是一份北京出行攻略，帮助您规划一次愉快的北京之旅。\n\n### 1. 行前准备\n\n-   **交通**：北京交通便利，建议选择地铁、公交或打车。下载“北京通”或“亿通行”APP，方便乘坐地铁。\n-   **住宿**：建议选择市中心或交通枢纽附近的酒店，如王府井、国贸、中关村等区域。\n-   **最佳时间**：春秋季（4月-6月、9月-11月）气候宜人，秋高气爽。夏季（7月-8月）炎热，冬季（12月-2月）寒冷，需注意保暖。\n\n### 2. 必游景点\n\n- **故宫**：游览“世界五大宫殿”之一的故宫，感受皇家宫殿的宏伟。\n- **天坛**：参观祈年殿、回廊等，了解古代祭祀文化。\n- **颐和园**：游览皇家园林，欣赏昆明湖和长廊。\n- **长城**：八达岭长城、慕田峪长城，感受“不到长城非好汉”的豪情。\n- **天安门**：参观天安门广场，感受政治文化中心的庄重。\n- **圆明园**：参观皇家园林遗址，了解历史变迁。\n\n### 3. 美食推荐\n\n- **北京烤鸭**：全聚德、大董等老字号，皮脆肉嫩。\n- **涮羊肉**：东来顺、海底捞等，肉质鲜嫩。\n- **炸酱面**：面条筋道，酱料浓郁。\n- **糖葫芦**：老北京小吃，酸甜可口。\n- **北京小吃**：豆汁儿、焦圈、卤煮等，体验地道风味。\n\n### 4. 购物推荐\n\n- **王府井大街**：老字号购物街，购买传统工艺品。\n- **国贸**：现代商业区，购物、餐饮、娱乐一体。\n- **三里屯**：时尚潮流聚集地，酒吧、餐厅众多。\n- **大栅栏**：传统商业街，购买北京特色小吃和手工艺品。\n\n### 5. 交通指南\n\n- **市内交通**：地铁、公交、出租车、网约车等。建议使用“北京交通”APP，查询实时路况。\n- **长途交通**：北京","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":9,"total_tokens":509,"completion_tokens":500,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

GSM8K

python3 bench_sglang.py --host http://localhost  --port 30300 --data-path /data00 --num-questions 5000 --parallel 100
100%|████████████████████████████| 1319/1319 [01:49<00:00, 12.06it/s]
Accuracy: 0.955
Invalid: 0.000
Latency: 109.555 s
Output throughput: 1083.640 token/s

MMLU

python3 bench_sglang.py --parallel 128 --backend srt --host http://127.0.0.1 --port 30300 --data_dir /data00/mmlu
100%|██████████████████████████| 14042/14042 [08:05<00:00, 28.92it/s]
subject: abstract_algebra, #q:100, acc: 0.880
subject: anatomy, #q:135, acc: 0.889
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.850
subject: clinical_knowledge, #q:265, acc: 0.891
subject: college_biology, #q:144, acc: 0.979
subject: college_chemistry, #q:100, acc: 0.690
subject: college_computer_science, #q:100, acc: 0.910
subject: college_mathematics, #q:100, acc: 0.860
subject: college_medicine, #q:173, acc: 0.850
subject: college_physics, #q:102, acc: 0.951
subject: computer_security, #q:100, acc: 0.840
subject: conceptual_physics, #q:235, acc: 0.966
subject: econometrics, #q:114, acc: 0.816
subject: electrical_engineering, #q:145, acc: 0.890
subject: elementary_mathematics, #q:378, acc: 0.960
subject: formal_logic, #q:126, acc: 0.770
subject: global_facts, #q:100, acc: 0.750
subject: high_school_biology, #q:310, acc: 0.965
subject: high_school_chemistry, #q:203, acc: 0.882
subject: high_school_computer_science, #q:100, acc: 0.950
subject: high_school_european_history, #q:165, acc: 0.897
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.995
subject: high_school_macroeconomics, #q:390, acc: 0.933
subject: high_school_mathematics, #q:270, acc: 0.807
subject: high_school_microeconomics, #q:238, acc: 0.975
subject: high_school_physics, #q:151, acc: 0.868
subject: high_school_psychology, #q:545, acc: 0.961
subject: high_school_statistics, #q:216, acc: 0.917
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.932
subject: human_aging, #q:223, acc: 0.870
subject: human_sexuality, #q:131, acc: 0.931
subject: international_law, #q:121, acc: 0.942
subject: jurisprudence, #q:108, acc: 0.917
subject: logical_fallacies, #q:163, acc: 0.914
subject: machine_learning, #q:112, acc: 0.839
subject: management, #q:103, acc: 0.922
subject: marketing, #q:234, acc: 0.970
subject: medical_genetics, #q:100, acc: 0.980
subject: miscellaneous, #q:783, acc: 0.957
subject: moral_disputes, #q:346, acc: 0.867
subject: moral_scenarios, #q:895, acc: 0.793
subject: nutrition, #q:306, acc: 0.938
subject: philosophy, #q:311, acc: 0.913
subject: prehistory, #q:324, acc: 0.941
subject: professional_accounting, #q:282, acc: 0.858
subject: professional_law, #q:1534, acc: 0.731
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.930
subject: public_relations, #q:110, acc: 0.818
subject: security_studies, #q:245, acc: 0.873
subject: sociology, #q:201, acc: 0.970
subject: us_foreign_policy, #q:100, acc: 0.960
subject: virology, #q:166, acc: 0.590
subject: world_religions, #q:171, acc: 0.930
Total latency: 485.510
Average accuracy: 0.885
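
The per-subject lines above can be aggregated in two common ways: an unweighted mean over subjects, or a mean weighted by the number of questions. The log does not say which one "Average accuracy" uses, so the sketch below computes both. It assumes the exact `subject: ..., #q:..., acc: ...` line format shown above; the function names are illustrative, and the sample uses only the first three subjects from the log.

```python
import re

# Matches lines such as "subject: anatomy, #q:135, acc: 0.889".
LINE_RE = re.compile(r"subject: (\w+), #q:(\d+), acc: ([\d.]+)")


def parse_mmlu_log(text):
    """Return a list of (subject, num_questions, accuracy) tuples."""
    return [(m.group(1), int(m.group(2)), float(m.group(3)))
            for m in LINE_RE.finditer(text)]


def macro_average(rows):
    """Unweighted mean of the per-subject accuracies."""
    return sum(acc for _, _, acc in rows) / len(rows)


def micro_average(rows):
    """Mean accuracy weighted by each subject's question count."""
    total_questions = sum(n for _, n, _ in rows)
    return sum(n * acc for _, n, acc in rows) / total_questions


# First three subjects from the log above, used as a small sample.
sample_log = """\
subject: abstract_algebra, #q:100, acc: 0.880
subject: anatomy, #q:135, acc: 0.889
subject: astronomy, #q:152, acc: 0.941
"""

rows = parse_mmlu_log(sample_log)
print(f"macro={macro_average(rows):.3f} micro={micro_average(rows):.3f}")
```

Feeding the full log through `parse_mmlu_log` gives both aggregates for comparison with the reported figure.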

gpqa

python -m sglang.test.run_eval --port 30300 --eval-name gpqa --num-examples 32 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
====================
Repeat: 8, mean: 0.906
Scores: ['0.906', '0.938', '0.938', '0.875', '0.875', '0.906', '0.906', '0.906']
====================
[METRIC] gpqa_mean_score=0.90625 labels={"model": "/data00/models/DeepSeek-V4-Flash", "eval": "gpqa", "repeat": 8}
Writing report to /tmp/gpqa__data00_models_DeepSeek-V4-Flash.html
{'chars': np.float64(1353.125), 'chars:std': np.float64(481.00173791266076), 'score:std': np.float64(0.2914805954090255), 'scores': ['0.906', '0.938', '0.938', '0.875', '0.875', '0.906', '0.906', '0.906'], 'mean_score': np.float64(0.90625)}
Writing results to /tmp/gpqa__data00_models_DeepSeek-V4-Flash.json
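
As a sanity check on the GPQA numbers, the per-repeat scores can be mapped back to correct-answer counts. The sketch below assumes each repeat score is the fraction correct out of the 32 examples requested with `--num-examples 32`; under that assumption the pooled mean matches the reported `gpqa_mean_score`.

```python
NUM_EXAMPLES = 32  # from --num-examples 32 in the command above

# Per-repeat scores as printed in the log (rounded to three decimals).
scores = [0.906, 0.938, 0.938, 0.875, 0.875, 0.906, 0.906, 0.906]


def to_correct_counts(rounded_scores, num_examples):
    """Recover integer correct counts from rounded per-repeat scores."""
    return [round(score * num_examples) for score in rounded_scores]


counts = to_correct_counts(scores, NUM_EXAMPLES)
mean_score = sum(counts) / (NUM_EXAMPLES * len(counts))

print(counts)      # [29, 30, 30, 28, 28, 29, 29, 29]
print(mean_score)  # 0.90625
```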

longbench_v2

python result.py
file='DeepSeek-V4-Flash.jsonl' (easy_acc + hard_acc) / len(pred_data)=0.6923076923076923
['Model\tOverall\tEasy\tHard\tShort\tMedium\tLong', 'DeepSeek-V4-Flash\t69.2\t77.1\t63.8\t72.1\t68.5\t65.0']

Speed Tests and Profiling

Checklist

Format your code according to the "Format code with pre-commit" guide.
Add unit tests according to the "Run and add unit tests" guide.
Update documentation according to the "Write documentations" guide.
Provide accuracy and speed benchmark results according to the "Test the accuracy" and "Benchmark the speed" guides.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-04-25T03:17:20Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Fridge003 · 2026-04-27T23:50:08Z

        size_k=K,
        is_k_full=is_k_full,
-        use_atomic_add=use_atomic_add,
+        use_atomic_add=False,


Protect this with an if branch
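
One way to read this review comment is to keep the caller's `use_atomic_add` value by default and force it to `False` only on the new path. The sketch below is illustrative only: `is_mxfp4` is a hypothetical flag standing in for whatever condition identifies the MXFP4 case, and `resolve_use_atomic_add` is not a function from the PR.

```python
def resolve_use_atomic_add(use_atomic_add: bool, is_mxfp4: bool) -> bool:
    """Pick the value to pass as use_atomic_add= to the kernel call.

    is_mxfp4 is a hypothetical flag; the real condition in the PR may differ.
    """
    if is_mxfp4:
        # New path: atomic add is disabled, as in the reviewed diff.
        return False
    # Every other path keeps its previous behavior.
    return use_atomic_add
```

At the call site this would replace the hard-coded value, for example `use_atomic_add=resolve_use_atomic_add(use_atomic_add, is_mxfp4)`.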

Fridge003 · 2026-04-28T00:44:30Z

Simple smoke test:

# Launch
sglang serve --trust-remote-code --model-path deepseek-ai/DeepSeek-V4-Pro --tp 8 --cuda-graph-max-bs 256 --max-running-requests 256 --enable-metrics --host 0.0.0.0 --port 30300 --mem-fraction-static 0.8 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --moe-runner-backend marlin --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

# Send-one
python3 -m sglang.test.send_one --port 30300
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    3.290    |  512   |   2.926    |     155.62      |
+-------------+--------+------------+-----------------+
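
The Speed column is consistent with tokens divided by latency. A quick check, assuming that definition (the function name is illustrative):

```python
def decode_speed(tokens: int, latency_s: float) -> float:
    """Tokens generated per second of end-to-end latency."""
    return tokens / latency_s


speed = decode_speed(512, 3.290)
print(f"{speed:.2f}")  # 155.62
```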

Fridge003 · 2026-04-28T00:49:57Z

Longbench result on flash model with marlin:

file='DeepSeek-V4-Flash.jsonl' (easy_acc + hard_acc) / len(pred_data)=0.6923076923076923
['Model\tOverall\tEasy\tHard\tShort\tMedium\tLong', 'DeepSeek-V4-Flash\t69.2\t77.1\t63.8\t72.1\t68.5\t65.0']

EanWang211123 · 2026-04-28T02:59:16Z

Hey, I tried using your PR on H20. I noticed that you modified part of the sgl-kernel code, but I ran into some errors during compilation, for example:

      Cloning into 'repo-deepgemm-src'...
      fatal: reference is not a tree: 54f99a8af537b3c6eb4819b69907ccbe2b600792
      CMake Error at repo-deepgemm-subbuild/repo-deepgemm-populate-prefix/tmp/repo-deepgemm-populate-gitclone.cmake:61 (message):
        Failed to checkout tag: '54f99a8af537b3c6eb4819b69907ccbe2b600792'

Could you share how you temporarily compiled sgl-kernel? That would be very helpful to me.

zhangxiaolei123456 · 2026-04-28T03:01:28Z

You need to change the pinned DeepGEMM commit from 54f99a8af537b3c6eb4819b69907ccbe2b600792 to ffe2b6b97420a9f8c58268ca55755168e6e2f360.
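
The swap can also be scripted. The sketch below works on the text of whichever build file pins the commit; the thread does not name that file, so no path is assumed here, and the `GIT_TAG` sample line is a made-up stand-in rather than a quote from the repository.

```python
OLD_SHA = "54f99a8af537b3c6eb4819b69907ccbe2b600792"
NEW_SHA = "ffe2b6b97420a9f8c58268ca55755168e6e2f360"


def repin_commit(build_file_text: str, old_sha: str, new_sha: str) -> str:
    """Return the text with every occurrence of old_sha replaced by new_sha."""
    if old_sha not in build_file_text:
        # Fail loudly instead of silently leaving the file unchanged.
        raise ValueError(f"commit {old_sha} not found in build file text")
    return build_file_text.replace(old_sha, new_sha)


# Made-up one-line stand-in for the real build file contents.
sample_text = f"GIT_TAG {OLD_SHA}\n"
patched_text = repin_commit(sample_text, OLD_SHA, NEW_SHA)
print(patched_text, end="")
```

To patch a real file, read it with `pathlib.Path(...).read_text()`, pass the text through `repin_commit`, and write the result back.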

EanWang211123 · 2026-04-28T03:07:59Z

> you need modify the 54f99a8af537b3c6eb4819b69907ccbe2b600792 to ffe2b6b97420a9f8c58268ca55755168e6e2f360

Thanks! Is there anything else I should pay attention to, or do I only need to run make install?

Fridge003 · 2026-04-28T07:04:35Z

Aime25 test:

# Server launch
SGLANG_ENABLE_THINKING=1 SGLANG_REASONING_EFFORT=max sglang serve --trust-remote-code --model-path deepseek-ai/DeepSeek-V4-Pro --tp 8 --cuda-graph-max-bs 256 --max-running-requests 256 --enable-metrics --host 0.0.0.0 --port 30300 --mem-fraction-static 0.8 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4 --moe-runner-backend marlin --speculative-algo EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4


# Result
nemo-run_1/0 ----------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-16] | 30          | 17393      | 19133       | 96.25% ± 2.95%   | 2.29%    
nemo-run_1/0 majority@16       | 30          | 17393      | 19133       | 100.00%          | 0.00%    
nemo-run_1/0 pass@16           | 30          | 17393      | 19133       | 100.00%          | 0.00%    

# After regrading (bypass the boxed limitation)
----------------------------------------- aime25 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 17393      | 19133       | 98.33% ± 2.11%   | 0.21%    
majority@16       | 30          | 17393      | 19133       | 100.00%          | 0.00%    
pass@16           | 30          | 17393      | 19133       | 100.00%          | 0.00%
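
For readers unfamiliar with the three evaluation modes in the tables above, the sketch below shows one common way to compute them from k sampled answers per problem. It is not the evaluator's implementation: the function names are illustrative, answers are compared by plain string equality, and the toy data is made up rather than taken from the AIME25 run.

```python
from collections import Counter


def pass_at_1_avg_of_k(answers, references):
    """Mean per-sample accuracy, averaged over all problems."""
    per_problem = [
        sum(answer == ref for answer in samples) / len(samples)
        for samples, ref in zip(answers, references)
    ]
    return sum(per_problem) / len(per_problem)


def majority_at_k(answers, references):
    """Fraction of problems whose most frequent answer is correct."""
    hits = 0
    for samples, ref in zip(answers, references):
        top_answer, _ = Counter(samples).most_common(1)[0]
        hits += top_answer == ref
    return hits / len(references)


def pass_at_k(answers, references):
    """Fraction of problems with at least one correct sample."""
    hits = sum(ref in samples for samples, ref in zip(answers, references))
    return hits / len(references)


# Made-up toy data: 2 problems, 4 samples each.
toy_answers = [["70", "70", "12", "70"], ["5", "6", "6", "7"]]
toy_references = ["70", "5"]

print(pass_at_1_avg_of_k(toy_answers, toy_references))  # 0.5
print(majority_at_k(toy_answers, toy_references))       # 0.5
print(pass_at_k(toy_answers, toy_references))           # 1.0
```

In the toy data the second problem is solved by only one of four samples, so it counts toward pass@k but not toward majority@k.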

Fridge003 · 2026-04-28T07:04:46Z

@zhangxiaolei123456 Thanks for your contribution!

From the follow-up fix #23948: the Hopper w4a16 PR (sgl-project#23686) restructured the FP4 expert weight processing branches in a way that blocks the default deepgemm/auto backend path with a NotImplementedError. The fix restores the original logic and treats marlin as the special-case addition.

zhangxiaolei123456 added 7 commits April 25, 2026 10:52

Create marlin_utils_fp4.py

f7c007a

Update marlin_template.h

e6d566d

Update server_args.py

2e9f5ad

Update mxfp4_deepseek.py

c20d661

Update fused_marlin_moe.py

30cd6a2

Update marlin.py

fffc4fa

Update fp8.py

b40572f

zhangxiaolei123456 requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock, merrymercy and yizhang2077 as code owners April 25, 2026 03:17

github-actions Bot added the deepseek and sgl-kernel labels Apr 25, 2026

zhangxiaolei123456 changed the title from "Deepseek_v4 support w4a16" to "Deepseek_v4 support w4a16 on hopper" Apr 25, 2026

zhangxiaolei123456 added 2 commits April 25, 2026 13:48

Update marlin.py

ab45dd7

Update fused_marlin_moe.py

af9b12f

zhangxiaolei123456 mentioned this pull request Apr 25, 2026

DeepSeek V4 Roadmap #23602

Open

34 tasks

zhangxiaolei123456 added 5 commits April 25, 2026 23:02

Update fp8.py

0e58b9d

Update marlin.py

c196b1f

Update marlin_utils_fp4.py

bbc0856

Update mxfp4_deepseek.py

1d34449

Update fused_marlin_moe.py

c10de75

zhangxiaolei123456 added 3 commits April 26, 2026 14:26

Update marlin_utils_fp4.py

0c4d0ea

Update marlin_template.h

c06086b

Update ops.cu

2c3ab62

zhangxiaolei123456 changed the title from "Deepseek_v4 support w4a16 on hopper" to "Deepseek_v4 support w4(mxfp4)a16 on hopper" Apr 27, 2026

Fridge003 assigned Fridge003 and yhyang201 Apr 27, 2026

Merge branch 'deepseek_v4' into deepseek_v4_w4a16

1b4f142

Fridge003 reviewed Apr 28, 2026

small fix

25159a3

Fridge003 approved these changes Apr 28, 2026

Merge branch 'deepseek_v4' into deepseek_v4_w4a16

0177192

Fridge003 added the high priority label Apr 28, 2026

Merge branch 'deepseek_v4' into deepseek_v4_w4a16

ab3ae45

zhangxiaolei123456 mentioned this pull request Apr 28, 2026

Deepseek-v4-Pro share expert tp1 on H20 #23911

Closed

5 tasks

Fridge003 merged commit 2ef8310 into sgl-project:deepseek_v4 Apr 28, 2026
1 check passed

yhyang201 mentioned this pull request Apr 28, 2026

fix: restore FP4 deepgemm path for Blackwell broken by #23686 #23948

Merged

Fridge003 pushed a commit that referenced this pull request Apr 28, 2026

fix: restore FP4 deepgemm path for Blackwell broken by #23686 (#23948)

79281e8

parrot18 mentioned this pull request Apr 29, 2026

feat: port SGLANG_JIT_DEEPGEMM_FAST_WARMUP to deepseek_v4 branch #23756

Merged

5 tasks

yhyang201 mentioned this pull request May 6, 2026

Port MXFP4 Marlin MoE support to JIT kernel path #24490

Merged

4 tasks

yiakwy-xpu-ml-framework-team mentioned this pull request May 10, 2026

Add FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DeepSeek-V4 #24816

Merged

shiyu7 mentioned this pull request May 11, 2026

[rebase]Deepseek_v4 support w4(mxfp4)a16 on hopper #24986

Merged

5 tasks

Fridge003 merged 30 commits into sgl-project:deepseek_v4 from zhangxiaolei123456:deepseek_v4_w4a16

4 participants
