[New Model] Support DeepseekV4#40760
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: yasong.wang <yasong.wang@inferact.ai> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
You can follow my fork and just edit v32_parser; then you can use streaming function calls.
Pin vLLM source to zyongye/vllm@bc34b25e (dsv4 branch) from vllm-project/vllm#40760, which adds [New Model] Support DeepseekV4. Changes:
- Add docker/vllm/versions.env with custom VLLM_REPO/VLLM_REF
- Update image configs to point to the custom commit
- Add EXTRA_BUILD_ARGS forwarding in build_image.sh
- Add SETUPTOOLS_SCM_PRETEND_VERSION build-arg in Dockerfile
- Update workflows to source versions.env and include vllm-ref-short in tags
tool_call_start_token: str = "<|DSML|tool_calls>"
tool_call_end_token: str = "</|DSML|tool_calls>"
These class attributes would not take effect; you can do something like this instead:
# assumes `import re` at module level
def __init__(self, tokenizer, tools=None):
    super().__init__(tokenizer, tools)
    self.tool_call_start_token = "<|DSML|tool_calls>"
    self.tool_call_end_token = "</|DSML|tool_calls>"
    self.tool_call_complete_regex = re.compile(
        r"<\|DSML\|tool_calls>(.*?)</\|DSML\|tool_calls>", re.DOTALL
    )
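For context, here is a minimal standalone sketch of why the class-level attributes get shadowed; the BaseParser behavior shown is an assumption for illustration, not the actual vLLM parser code. If the parent __init__ assigns the same names as instance attributes, instance lookup wins, so the subclass has to reassign them after super().__init__():

# Standalone illustration -- BaseParser is hypothetical, not vLLM's ToolParser.
class BaseParser:
    def __init__(self):
        # The base class sets instance attributes with default tokens.
        self.tool_call_start_token = "<tool_call>"
        self.tool_call_end_token = "</tool_call>"

class DSMLParser(BaseParser):
    # Class attributes: shadowed by the instance attributes set in BaseParser.__init__.
    tool_call_start_token: str = "<|DSML|tool_calls>"
    tool_call_end_token: str = "</|DSML|tool_calls>"

print(DSMLParser().tool_call_start_token)  # prints "<tool_call>", not the DSML token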
hash_indices_table: torch.Tensor | None = None,
routed_scaling_factor: float = 1.0,
) -> tuple[torch.Tensor, ...]:
    ops.topk_hash_softplus_sqrt(
When using DeepEP, this crashes with "expected scalar type Long but found Int"
The CUDA kernel in topk_softplus_sqrt_kernels.cu dispatches the input_tokens and hash_indices_table data pointers based on topk_indices.scalar_type(). DeepEP sets topk_indices_dtype to int64, but input_tokens and hash_indices_table are int32.
We can detect and handle this case:
-    ops.topk_hash_softplus_sqrt(
+    idx_dtype = topk_indices.dtype
+    if input_tokens is not None and input_tokens.dtype != idx_dtype:
+        input_tokens = input_tokens.to(idx_dtype)
+    if hash_indices_table is not None and hash_indices_table.dtype != idx_dtype:
+        hash_indices_table = hash_indices_table.to(idx_dtype)
+    ops.topk_hash_softplus_sqrt(
Is that a DeepEP-specific constraint? I think all the other all-to-all backends assume topk_ids to be int32. Can we change the payload on the DeepEP side instead? (By the way, v2 just came out; I don't know whether it has this capability.)
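If adjusting the payload at the DeepEP boundary is acceptable, a minimal sketch of that alternative might look like the following; the helper name and call site are illustrative assumptions, not existing vLLM or DeepEP APIs:

import torch

def narrow_topk_ids(topk_ids: torch.Tensor) -> torch.Tensor:
    # Illustrative helper (not an existing API): align DeepEP's int64 indices
    # with the int32 layout the other all-to-all backends assume, instead of
    # widening input_tokens / hash_indices_table inside the kernel wrapper.
    if topk_ids.dtype == torch.int64:
        # Safe as long as the index values fit in int32, which token/expert counts do.
        return topk_ids.to(torch.int32)
    return topk_ids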
import torch

from vllm import _custom_ops as ops
from vllm.model_executor.layers.deepseek_v4_attention import (
The path seems to have been changed to:
from vllm.v1.attention.ops.deepseek_v4_ops import (
quantize_and_insert_k_cache,
)
}
// Compute per-thread scale (using warp reduction when renormalizing).
if (renormalize) {
  selected_sum = warpReduceSum(selected_sum);
cuda_compat.h has a helper macro, VLLM_SHFL_XOR_SYNC_WIDTH, which can be used to handle both the CUDA and ROCm differences.
How about defining it this way:
#pragma unroll
for (int mask = THREADS_PER_ROW / 2; mask > 0; mask /= 2) {
  // Butterfly reduction across the row via the compat shuffle macro.
  selected_sum +=
      VLLM_SHFL_XOR_SYNC_WIDTH(selected_sum, mask, THREADS_PER_ROW);
}
Thanks for the report. From the stack trace, this appears to be coming from SGLang's path rather than vLLM. We don't expect this issue to occur with vLLM, so I'd recommend trying the same scenario with vLLM and letting us know if it reproduces there.
Hi, I tried with TP-8 DP-2 PP-1 initially and got the same out-of-memory errors. Then, as you mentioned, I shifted to TP4 DP4 PP1. @wxsms, did you resolve the issue? If yes, please share the config. Thank you.
Congratulations to Deepseek-ai on releasing the model, and thanks to all the Inferact members for their effort in supporting it.
Note: This model implementation is highly optimized. All the components are tightly coupled, and there are many manually fused kernels. Please consult @WoosukKwon @zyongye @ivanium before making any changes.
Please see https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro for recipes