
[New Model] Support DeepseekV4 #40760

Closed
zyongye wants to merge 10 commits into vllm-project:main from zyongye:dsv4

Conversation

@zyongye
Member

@zyongye zyongye commented Apr 24, 2026

Congratulations to Deepseek-ai on releasing the model. Thanks to all Inferact members for their effort in supporting this.

Note: This model implementation is highly optimized. All of the components are tightly coupled, and there are many manually fused kernels. Please consult @WoosukKwon @zyongye @ivanium before making any changes.

Please see https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro for recipes

Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: yasong.wang <yasong.wang@inferact.ai>
Signed-off-by: Zhewen Li <zhewenli@inferact.ai>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
@Windswithyou

Any cookbook? And how can I run it on Hopper?

Please see https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro for recipes

I am trying to serve deepseek-ai/DeepSeek-V4-Pro using vLLM across 2 nodes, each equipped with 8x H100 80GB GPUs (16 GPUs in total). I am following the instructions from the vLLM recipes page (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro).
However, when initializing the vLLM engine, it consistently crashes with an out-of-memory (OOM) error. Even after setting max-model-len to 32k and adjusting max-num-seqs, gpu-memory-utilization, etc., nothing worked.
Do you have any best practices for a 2-node H100 setup?
The 4-node setup works fine.
Just to add, using speculative_config causes issues as well.

@Yang1032 Hi, may I ask how you managed to run V4-Pro on 4 nodes? I still hit CUDA OOM on 4 nodes with the default DP size = number of cards setup. Thank you very much!

@wxsms For H100 80G and V4-Pro, 2-node TP4/DP4 may work, or increase TP; 4-node DP32 works well. One other note: streaming function calling has some problems.

You can follow my fork; just edit v32_parser, and then you can use streaming function calls.
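
For anyone debugging the OOM reports above, here is a minimal offline sketch of how the memory-related knobs map onto vLLM's Python API. The model name comes from the recipe link; the specific values are illustrative only, not a verified 2-node configuration, and data parallelism is normally set via --data-parallel-size when launching vllm serve:

    from vllm import LLM, SamplingParams

    # Illustrative values only; lowering max_model_len, max_num_seqs, and
    # gpu_memory_utilization are the usual levers when the engine OOMs
    # during initialization. Not a verified DeepSeek-V4-Pro recipe.
    llm = LLM(
        model="deepseek-ai/DeepSeek-V4-Pro",
        tensor_parallel_size=4,
        max_model_len=32768,
        max_num_seqs=64,
        gpu_memory_utilization=0.90,
    )
    out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(out[0].outputs[0].text)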

junpuf added a commit to aws/deep-learning-containers that referenced this pull request Apr 24, 2026
Pin vLLM source to zyongye/vllm@bc34b25e (dsv4 branch) from
vllm-project/vllm#40760 which adds
[New Model] Support DeepseekV4.

Changes:
- Add docker/vllm/versions.env with custom VLLM_REPO/VLLM_REF
- Update image configs to point to the custom commit
- Add EXTRA_BUILD_ARGS forwarding in build_image.sh
- Add SETUPTOOLS_SCM_PRETEND_VERSION build-arg in Dockerfile
- Update workflows to source versions.env and include vllm-ref-short in tags
Comment on lines +15 to +16
tool_call_start_token: str = "<|DSML|tool_calls>"
tool_call_end_token: str = "</|DSML|tool_calls>"
Collaborator


These class attributes would not take effect; you can do something like this instead:

  def __init__(self, tokenizer, tools=None):
      super().__init__(tokenizer, tools)
      self.tool_call_start_token = "<|DSML|tool_calls>"
      self.tool_call_end_token = "</|DSML|tool_calls>"
      # Escape the delimiters: "|" is a regex metacharacter, so the raw
      # tokens cannot be dropped into the pattern unescaped.
      self.tool_call_complete_regex = re.compile(
          re.escape(self.tool_call_start_token)
          + r"(.*?)"
          + re.escape(self.tool_call_end_token),
          re.DOTALL,
      )
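
A quick standalone check, outside the parser, of why the delimiters need escaping (hypothetical example text): with an unescaped pattern, "|" acts as regex alternation rather than a literal character.

    import re

    start, end = "<|DSML|tool_calls>", "</|DSML|tool_calls>"
    # Escape the delimiters so "|" is matched literally instead of acting as alternation.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)

    text = '<|DSML|tool_calls>[{"name": "get_weather"}]</|DSML|tool_calls>'
    print(pattern.search(text).group(1))  # [{"name": "get_weather"}]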

hash_indices_table: torch.Tensor | None = None,
routed_scaling_factor: float = 1.0,
) -> tuple[torch.Tensor, ...]:
ops.topk_hash_softplus_sqrt(
Member


When using DeepEP, this crashes with "expected scalar type Long but found Int"

The CUDA kernel in topk_softplus_sqrt_kernels.cu dispatches input_tokens and hash_indices_table data_ptr based on topk_indices.scalar_type(). DeepEP sets topk_indices_dtype to int64, but input_tokens and hash_indices_table are int32.

We can detect and handle this case:

Suggested change

    - ops.topk_hash_softplus_sqrt(
    + idx_dtype = topk_indices.dtype
    + if input_tokens is not None and input_tokens.dtype != idx_dtype:
    +     input_tokens = input_tokens.to(idx_dtype)
    + if hash_indices_table is not None and hash_indices_table.dtype != idx_dtype:
    +     hash_indices_table = hash_indices_table.to(idx_dtype)
    + ops.topk_hash_softplus_sqrt(
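
For context, a tiny standalone illustration of the dtype mismatch and the cast (shapes and values are made up, not the real kernel inputs):

    import torch

    # Standalone illustration only. DeepEP hands back int64 topk indices,
    # while the token/hash tables are built as int32.
    topk_indices = torch.zeros(8, 4, dtype=torch.int64)
    hash_indices_table = torch.arange(16, dtype=torch.int32)

    # Without this cast, a kernel dispatching on topk_indices.scalar_type()
    # would hit "expected scalar type Long but found Int".
    if hash_indices_table.dtype != topk_indices.dtype:
        hash_indices_table = hash_indices_table.to(topk_indices.dtype)
    assert hash_indices_table.dtype == torch.int64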

Member Author

@zyongye zyongye Apr 25, 2026


Is that a DeepEP-specific constraint? I think all the other all-to-all implementations assume topk_ids to be int32. Can we change the payload on the DeepEP side instead? (BTW, v2 just came out; I don't know whether it has this capability.)

import torch

from vllm import _custom_ops as ops
from vllm.model_executor.layers.deepseek_v4_attention import (
Collaborator

@tjtanaa tjtanaa Apr 25, 2026


The path seems to have been changed to:

from vllm.v1.attention.ops.deepseek_v4_ops import (
    quantize_and_insert_k_cache,
)

Member Author


oh great catch!

@ivanium ivanium mentioned this pull request Apr 25, 2026
@ChuanLi1101 ChuanLi1101 self-assigned this Apr 25, 2026
}
// Compute per-thread scale (using warp reduction when renormalizing).
if (renormalize) {
selected_sum = warpReduceSum(selected_sum);
Collaborator

@tjtanaa tjtanaa Apr 25, 2026


cuda_compat.h has a helper, VLLM_SHFL_XOR_SYNC_WIDTH, which can be used to handle both CUDA and ROCm differences.

How about writing it this way:

#pragma unroll
      for (int mask = THREADS_PER_ROW / 2; mask > 0; mask /= 2) {
        selected_sum +=
            VLLM_SHFL_XOR_SYNC_WIDTH(selected_sum, mask, THREADS_PER_ROW);
      }

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@mergify mergify Bot mentioned this pull request Apr 25, 2026
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@zyongye
Member Author

zyongye commented Apr 27, 2026

#40860

@ivanium
Contributor

ivanium commented Apr 27, 2026

When decode SWA token usage is full, a 'get_cpu_copy NotImplementedError' is raised:

Scheduler hit an exception: Traceback (most recent call last):
   File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3041, in run_scheduler_process
     scheduler.event_loop_overlap_disagg_decode()
   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
     return func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 915, in event_loop_overlap_disagg_decode
     batch = self.get_next_disagg_decode_batch_to_run()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 977, in get_next_disagg_decode_batch_to_run
     self.running_batch = self.update_running_batch(self.running_batch)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2236, in update_running_batch
     retracted_reqs, new_token_ratio, reqs_to_abort = batch.retract_decode(
                                                      ^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1897, in retract_decode
     self.release_req(idx, len(sorted_indices), server_args)
   File "/workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1930, in release_req
     req.offload_kv_cache(
   File "/workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1117, in offload_kv_cache
     self.kv_cache_cpu = token_to_kv_pool_allocator.get_cpu_copy(token_indices)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/sglang/python/sglang/srt/mem_cache/swa_memory_pool.py", line 642, in get_cpu_copy
     return self._kvcache.get_cpu_copy(indices)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/workspace/sglang/python/sglang/srt/mem_cache/memory_pool.py", line 671, in get_cpu_copy
     raise NotImplementedError()
 NotImplementedError

Thanks for the report. From the stack trace, this appears to be coming from SGLang’s path rather than vLLM. We don’t expect this issue to occur with vLLM, so I’d recommend trying the same scenario with vLLM and letting us know if it reproduces there.
Support is included in our v0.20.0 release, and recipes can be found at https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro

@nskpro-cmd

nskpro-cmd commented Apr 30, 2026

Any cookbook? And how can I run it on Hopper?

Please see https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro for recipes

I am trying to serve deepseek-ai/DeepSeek-V4-Pro using vLLM across 2 nodes, each equipped with 8x H100 80GB GPUs (16 GPUs in total). I am following the instructions from the vLLM recipes page (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro).
However, when initializing the vLLM engine, it consistently crashes with an out-of-memory (OOM) error. Even after setting max-model-len to 32k and adjusting max-num-seqs, gpu-memory-utilization, etc., nothing worked.
Do you have any best practices for a 2-node H100 setup?
The 4-node setup works fine.
Just to add, using speculative_config causes issues as well.

@Yang1032 Hi, may I ask how you managed to run V4-Pro on 4 nodes? I still hit CUDA OOM on 4 nodes with the default DP size = number of cards setup. Thank you very much!

@wxsms For H100 80G and V4-Pro, 2-node TP4/DP4 may work, or increase TP; 4-node DP32 works well. One other note: streaming function calling has some problems.

Hi, I initially tried TP=8, DP=2, PP=1 and got the same out-of-memory errors. Then, as you mentioned, I switched to TP=4, DP=4, PP=1.
Still the same issue. How can we run on 2 H100 nodes without encountering these errors?

@wxsms, did you resolve the issue? If yes, please share the config. Thank you.


Labels

ci/build, deepseek (Related to DeepSeek models), documentation (Improvements or additions to documentation), gpt-oss (Related to GPT-OSS models), kv-connector, needs-rebase, new-model (Requests to new models), nvidia, performance (Performance-related issues), speculative-decoding, tool-calling, v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.