feat: add experimental DeepSeek-V4-Flash + V4-Pro vLLM agg recipes #8668
Merged
biswapanda merged 16 commits into main on Apr 25, 2026
Conversation
Force-pushed from e43bf1b to b1abd5c.
Adds a DeepSeek-V4-Flash serving recipe on top of the dsv4-preview vLLM image (vllm/vllm-openai:deepseekv4-cu130). Three pieces:

- container/Dockerfile.dsv4: overlays Dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm worker) onto the dsv4 vLLM image without touching the vLLM install. Parameterized via DYNAMO_SRC_IMAGE / DSV4_BASE_IMAGE.
- recipes/deepseek-v4-flash/vllm/agg/deploy.yaml: plain vLLM (no dynamo CRD) aggregated Deployment + Service, DP=4 + EP on a single B200 node, reading weights from the shared-model-cache PVC.
- recipes/deepseek-v4-flash/vllm/dsv4/vllm-dgd.yaml: DynamoGraphDeployment variant using the overlay image above (Frontend + decode), same DP=4 + EP shape, suitable for dynamo-operator-managed namespaces.

Namespace-specific debug/PV/PVC manifests are intentionally omitted so the recipe stays portable; consumers point at their own shared-model-cache PVC.
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
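A minimal bring-up sketch for the plain-vLLM variant, assuming the shared-model-cache PVC already exists in the target namespace (nothing beyond the file path is taken from the manifest):

```bash
# Apply the plain-vLLM aggregated Deployment + Service (no dynamo CRD).
# Assumes the shared-model-cache PVC is already provisioned.
kubectl apply -f recipes/deepseek-v4-flash/vllm/agg/deploy.yaml

# Watch the worker pods come up.
kubectl get pods -w
```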
Force-pushed from 155c017 to e3f28d4.
… not NGC

The dsv4 overlay pulls Dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm worker) from a source image. The flow is: build the Dynamo vLLM runtime from this repo first using <repo_root>/container/README.md, then build the dsv4 overlay on top of it. Users should not need any non-public NGC access.

- Dockerfile.dsv4: the default DYNAMO_SRC_IMAGE is now `dynamo:latest-vllm-runtime` (the local tag produced by `container/render.py --framework vllm --target runtime`). The published `nvcr.io/nvidia/ai-dynamo/vllm-runtime:<tag>` is now an override option documented via --build-arg, not the default.
- container/README.md: restructured into a two-step build flow. Step 1 links to the repo-root container/README.md with the exact render + docker build commands. Step 2 builds the dsv4 overlay. Drops the "access to the public Dynamo vLLM runtime image" / NGC-login prerequisite; replaces the troubleshooting entry with the local-tag pull-access case.
- Recipe README.md (Prerequisite 4): mirrors the two-step structure so users see the build order up front.
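Concretely, the two-step flow might look like the following sketch; the overlay output tag `dsv4-overlay:dev` is a placeholder, and the exact step-1 commands are the ones documented in the repo-root container/README.md:

```bash
# Step 1: build the Dynamo vLLM runtime from the repo root; the exact
# render + docker build commands live in <repo_root>/container/README.md
# and produce the local tag dynamo:latest-vllm-runtime.
container/render.py --framework vllm --target runtime

# Step 2: build the dsv4 overlay on top of that local runtime image.
docker build -f container/Dockerfile.dsv4 \
  --build-arg DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime \
  --build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130 \
  -t dsv4-overlay:dev .
```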
…ache PVC

Verified against https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main: 46 safetensors shards, one ~1.06 GB shard plus 45 at ~3.57-3.59 GB, totaling ~160 GB on disk in FP4+FP8 mixed form (not ~300 GB).

- model-cache/model-cache.yaml: 700Gi -> 400Gi (~2.5x headroom over the actual weights; still room for HF cache metadata plus one alternate revision without being wasteful).
- README.md Quick Start comment: update the size (~300GB -> ~160GB) and the download estimate (1-2 hours -> 30-60 min, a more typical figure for an xet-accelerated HF pull of this size).
- README.md Notes: add a "Model size" bullet that states the 46-shard layout and explains the PVC sizing rationale.
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
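The sizing claim checks out from the shard layout alone:

```bash
# 45 shards at ~3.58 GB plus one ~1.06 GB shard:
echo "45 * 3.58 + 1.06" | bc   # => 162.16, i.e. ~160 GB on disk
# A 400Gi PVC over ~160 GB of weights is the ~2.5x headroom cited above.
```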
Contributor
/ok to test ed842e1
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Adds the recipe directory referenced by recipes/README.md so its
documentation links resolve.
- recipes/deepseek-v4-pro/README.md -- model + deploy guide
- recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml -- TP=8 + EP, 8x B200
- recipes/deepseek-v4-pro/model-cache/{model-cache.yaml,model-download.yaml}
- recipes/deepseek-v4-pro/container/{Dockerfile.dsv4,README.md} --
shares the same dsv4 overlay image as the V4-Flash recipe.
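For orientation, a hedged bring-up order for the files above (assumes the `hf-token-secret` used by the download Job already exists in the target namespace):

```bash
# Provision the model cache, download weights, then deploy the DGD.
kubectl apply -f recipes/deepseek-v4-pro/model-cache/model-cache.yaml
kubectl apply -f recipes/deepseek-v4-pro/model-cache/model-download.yaml   # HF download Job
kubectl apply -f recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml            # TP=8 + EP DGD
```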
Contributor
Author
/ok to test 98d8494
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
dmitry-tokarev-nv approved these changes on Apr 25, 2026
VincyZhang pushed a commit to VincyZhang/dynamo that referenced this pull request on Apr 27, 2026
…i-dynamo#8668)

Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Signed-off-by: VincyZhang <wenxin.zhang@intel.com>
Overview
Adds two experimental aggregated vLLM serving recipes for the DeepSeek-V4 series, sharing a single dsv4 reference container:
- `recipes/deepseek-v4-flash/` — DeepSeek-V4-Flash (284B / 13B active), 4x B200, DP=4 + EP.
- `recipes/deepseek-v4-pro/` — DeepSeek-V4-Pro (1.6T / 49B active, 1M context), 8x B200, TP=8 + EP.

Both follow the existing recipe layout (`recipes/qwen3-235b-a22b-fp8`, `recipes/llama-3-70b/vllm/agg`) and are listed under Experimental Recipes in `recipes/README.md` — functional, not yet performance-tuned, not benchmarked, and dependent on a custom container build.

Smoke-tested on a 4x B200 node (Flash, namespace `bis-dsv4`) and an 8x B200 node (Pro). Reasoning (`reasoning_content`) and tool calling (`tool_calls`) verified end-to-end via `/v1/chat/completions` against both.

Recipe layout (per recipe)
Each recipe directory contains:
- `container/Dockerfile.dsv4` — reference container build that overlays the Dynamo runtime (wheels, nats, etcd, NIXL, UCX, `dynamo.vllm` worker) on top of `vllm/vllm-openai:deepseekv4-cu130`. The two Dockerfiles are duplicates so each recipe is self-contained; the produced image is identical and reusable across both recipes (the model is selected at runtime via `--model`).
- `container/README.md` — build flow, build args (`DYNAMO_SRC_IMAGE`, `DSV4_BASE_IMAGE`), troubleshooting.
- `model-cache/model-cache.yaml` — `ReadWriteMany` PVC. 400Gi for Flash, 1500Gi for Pro.
- `model-cache/model-download.yaml` — HuggingFace download Job, reads `hf-token-secret`.
- `vllm/agg/vllm-dgd.yaml` — `DynamoGraphDeployment`.
- `README.md` — prerequisites, Quick Start, recipe details with flag table, reasoning + tool calling verification curls, notes.
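Laid out as a tree, per the list above (Flash shown; Pro mirrors it):

```bash
tree recipes/deepseek-v4-flash
# recipes/deepseek-v4-flash
# ├── README.md
# ├── container
# │   ├── Dockerfile.dsv4
# │   └── README.md
# ├── model-cache
# │   ├── model-cache.yaml
# │   └── model-download.yaml
# └── vllm
#     └── agg
#         └── vllm-dgd.yaml
```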
DGD shapes

| Setting | Flash (4x B200) | Pro (8x B200) |
| --- | --- | --- |
| Parallelism | `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | `--tensor-parallel-size 8 --enable-expert-parallel` |
| KV cache | `--kv-cache-dtype fp8 --block-size 256` | same |
| Attention config | `--attention-config '{"use_fp4_indexer_cache":true}'` | same |
| Compilation config | `'{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | `'{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}'` (more conservative for the 1.6T model; matches the upstream V4-Pro example) |
| Parsers | `--tokenizer-mode deepseek_v4`, `--dyn-reasoning-parser deepseek_v4`, `--dyn-tool-call-parser deepseek_v4` | same |
| Startup probe | `failureThreshold: 360` | `failureThreshold: 540` — sized for the larger TP=8 weight load |
| Engine-ready timeout | `VLLM_ENGINE_READY_TIMEOUT_S` | `VLLM_ENGINE_READY_TIMEOUT_S` |
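Pulled together, the Flash column corresponds to a worker command along these lines. This is a sketch only: the `python -m dynamo.vllm` entrypoint, the model id, and the `--compilation-config` flag name are assumptions, and the authoritative args live in `vllm/agg/vllm-dgd.yaml`:

```bash
# Sketch of the Flash worker flags implied by the table above.
python -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel \
  --kv-cache-dtype fp8 --block-size 256 \
  --attention-config '{"use_fp4_indexer_cache":true}' \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --dyn-reasoning-parser deepseek_v4 \
  --dyn-tool-call-parser deepseek_v4
```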
Both worker pods run with `HF_HUB_OFFLINE=1` (weights served from the PVC, never contacting the HF Hub), and the manifests ship with the placeholder `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag` so users build the real image from their recipe's `container/Dockerfile.dsv4`.

Why TP=8 for Pro vs DP=4 for Flash?
Pro is ~5.5x larger than Flash on disk (~865 GB vs ~160 GB). With FP4+FP8 mixed weights, Pro does not fit in 4 ranks at typical batch shapes, so the upstream tested shape is TP=8 across all 8 GPUs of one node. Expert Parallel is still enabled on top of TP — TP shards the dense (attention/router/norm) weights, EP shards the experts.
Pro-specific reasoning modes
Per the model card, V4-Pro exposes three reasoning effort levels. The recipe surfaces them via `chat_template_kwargs`:

- `{}` — fast intuitive responses
- `{"thinking":true,"reasoning_effort":"high"}` — explicit chain-of-thought
- `{"thinking":true,"reasoning_effort":"max"}` — maximum reasoning depth (model card recommends `--max-model-len >= 393216`)
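For example, the high-effort mode can be requested per call; the endpoint host and model id below are illustrative:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
  }'
# The reasoning text comes back in message.reasoning_content,
# separate from message.content.
```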
Index update

`recipes/README.md` adds DeepSeek-V4-Flash and DeepSeek-V4-Pro rows under Experimental Recipes, both linking to their `container/Dockerfile.dsv4`.

Where should the reviewer start?
- `recipes/deepseek-v4-pro/README.md` and `recipes/deepseek-v4-flash/README.md` — user-facing flow.
- `recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml` — the larger, more conservative DGD; flag rationale in the README.
- `recipes/deepseek-v4-pro/container/Dockerfile.dsv4` (identical to Flash's) — reference image build.

Status / caveats
- No `perf.yaml`, no benchmark numbers.
- Both recipes depend on a custom container build from `recipes/<recipe>/container/Dockerfile.dsv4`.

Testing
Deployed both recipes via `kubectl apply -f vllm/agg/vllm-dgd.yaml`:

- `kubectl get dynamographdeployment` → `state: successful`, `All resources are ready` (both).
- `GET /v1/models` returns the expected model id.
- `POST /v1/chat/completions` with a reasoning prompt → `message.reasoning_content` populated, clean `message.content`.
- `POST /v1/chat/completions` with tool definitions → structured `message.tool_calls`, `finish_reason: "tool_calls"`.
- Pro reasoning modes exercised via `chat_template_kwargs`.
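A condensed version of that checklist (frontend URL and model id are illustrative):

```bash
kubectl get dynamographdeployment         # expect state: successful
curl -s http://localhost:8000/v1/models   # expect the served model id
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-ai/DeepSeek-V4-Flash","messages":[{"role":"user","content":"ping"}]}'
```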
Related

The `--dyn-reasoning-parser deepseek_v4` and `--dyn-tool-call-parser deepseek_v4` flags require that change.