feat: add experimental DeepSeek-V4-Flash + V4-Pro vLLM agg recipes #8668
Merged
biswapanda merged 16 commits into main on Apr 25, 2026
Conversation
Force-pushed from e43bf1b to b1abd5c.
Adds a DeepSeek-V4-Flash serving recipe on top of the dsv4-preview vLLM image (vllm/vllm-openai:deepseekv4-cu130). Three pieces:

- container/Dockerfile.dsv4: overlays Dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm worker) onto the dsv4 vLLM image without touching the vLLM install. Parameterized via DYNAMO_SRC_IMAGE / DSV4_BASE_IMAGE.
- recipes/deepseek-v4-flash/vllm/agg/deploy.yaml: plain vLLM (no dynamo CRD) aggregated Deployment + Service, DP=4 + EP on a single B200 node, reading weights from the shared-model-cache PVC.
- recipes/deepseek-v4-flash/vllm/dsv4/vllm-dgd.yaml: DynamoGraphDeployment variant using the overlay image above (Frontend + decode), same DP=4 + EP shape, suitable for dynamo-operator-managed namespaces.

Namespace-specific debug/PV/PVC manifests are intentionally omitted so the recipe stays portable; consumers point at their own shared-model-cache PVC.
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
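A minimal bring-up sketch for the plain-vLLM variant, assuming the shared-model-cache PVC already exists in the target namespace (nothing beyond the file path is taken from the manifest):

```bash
# Apply the plain-vLLM aggregated Deployment + Service (no dynamo CRD).
# Assumes the shared-model-cache PVC is already provisioned.
kubectl apply -f recipes/deepseek-v4-flash/vllm/agg/deploy.yaml

# Watch the worker pods come up.
kubectl get pods -w
```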
Force-pushed from 155c017 to e3f28d4.
… not NGC

The dsv4 overlay pulls Dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm worker) from a source image. The flow is: build the Dynamo vLLM runtime from this repo first using <repo_root>/container/README.md, then build the dsv4 overlay on top of it. Users should not need any non-public NGC access.

- Dockerfile.dsv4: the default DYNAMO_SRC_IMAGE is now `dynamo:latest-vllm-runtime` (the local tag produced by `container/render.py --framework vllm --target runtime`). The published `nvcr.io/nvidia/ai-dynamo/vllm-runtime:<tag>` is now an override option documented via --build-arg, not the default.
- container/README.md: restructured into a two-step build flow. Step 1 links to the repo-root container/README.md with the exact render + docker build commands. Step 2 builds the dsv4 overlay. Drops the "access to the public Dynamo vLLM runtime image" / NGC-login prerequisite; replaces the troubleshooting entry with the local-tag pull-access case.
- Recipe README.md (Prerequisite 4): mirrors the two-step structure so users see the build order up front.
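Concretely, the two-step flow might look like the following sketch; the overlay output tag `dsv4-overlay:dev` is a placeholder, and the exact step-1 commands are the ones documented in the repo-root container/README.md:

```bash
# Step 1: build the Dynamo vLLM runtime from the repo root; the exact
# render + docker build commands live in <repo_root>/container/README.md
# and produce the local tag dynamo:latest-vllm-runtime.
container/render.py --framework vllm --target runtime

# Step 2: build the dsv4 overlay on top of that local runtime image.
docker build -f container/Dockerfile.dsv4 \
  --build-arg DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime \
  --build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130 \
  -t dsv4-overlay:dev .
```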
…ache PVC

Verified against https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main: 46 safetensors shards, one ~1.06 GB shard plus 45 at ~3.57-3.59 GB, totaling ~160 GB on disk in FP4+FP8 mixed form (not ~300 GB).

- model-cache/model-cache.yaml: 700Gi -> 400Gi (~2.5x headroom over the actual weights; still room for HF cache metadata plus one alternate revision without being wasteful).
- README.md Quick Start comment: update the size (~300GB -> ~160GB) and the download estimate (1-2 hours -> 30-60 min, a more typical figure for an xet-accelerated HF pull of this size).
- README.md Notes: add a "Model size" bullet that states the 46-shard layout and explains the PVC sizing rationale.
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
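The sizing claim checks out from the shard layout alone:

```bash
# 45 shards at ~3.58 GB plus one ~1.06 GB shard:
echo "45 * 3.58 + 1.06" | bc   # => 162.16, i.e. ~160 GB on disk
# A 400Gi PVC over ~160 GB of weights is the ~2.5x headroom cited above.
```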
Contributor
/ok to test ed842e1
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Adds the recipe directory referenced by recipes/README.md so its
documentation links resolve.
- recipes/deepseek-v4-pro/README.md -- model + deploy guide
- recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml -- TP=8 + EP, 8x B200
- recipes/deepseek-v4-pro/model-cache/{model-cache.yaml,model-download.yaml}
- recipes/deepseek-v4-pro/container/{Dockerfile.dsv4,README.md} --
shares the same dsv4 overlay image as the V4-Flash recipe.
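For orientation, a hedged bring-up order for the files above (assumes the `hf-token-secret` used by the download Job already exists in the target namespace):

```bash
# Provision the model cache, download weights, then deploy the DGD.
kubectl apply -f recipes/deepseek-v4-pro/model-cache/model-cache.yaml
kubectl apply -f recipes/deepseek-v4-pro/model-cache/model-download.yaml   # HF download Job
kubectl apply -f recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml            # TP=8 + EP DGD
```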
Contributor
Author
/ok to test 98d8494
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
dmitry-tokarev-nv approved these changes on Apr 25, 2026
VincyZhang pushed a commit to VincyZhang/dynamo that referenced this pull request on Apr 27, 2026
…i-dynamo#8668)

Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Signed-off-by: VincyZhang <wenxin.zhang@intel.com>
Overview
Adds two experimental aggregated vLLM serving recipes for the DeepSeek-V4 series, sharing a single dsv4 reference container:
- `recipes/deepseek-v4-flash/` — DeepSeek-V4-Flash (284B / 13B active), 4x B200, DP=4 + EP.
- `recipes/deepseek-v4-pro/` — DeepSeek-V4-Pro (1.6T / 49B active, 1M context), 8x B200, TP=8 + EP.

Both follow the existing recipe layout (`recipes/qwen3-235b-a22b-fp8`, `recipes/llama-3-70b/vllm/agg`) and are listed under Experimental Recipes in `recipes/README.md` — functional, not yet performance-tuned, not benchmarked, and dependent on a custom container build.

Smoke-tested on a 4x B200 node (Flash, namespace `bis-dsv4`) and an 8x B200 node (Pro). Reasoning (`reasoning_content`) and tool calling (`tool_calls`) verified end-to-end via `/v1/chat/completions` against both.

Recipe layout (per recipe)
Each recipe directory contains:
- `container/Dockerfile.dsv4` — reference container build that overlays the Dynamo runtime (wheels, nats, etcd, NIXL, UCX, `dynamo.vllm` worker) on top of `vllm/vllm-openai:deepseekv4-cu130`. The two Dockerfiles are duplicates so each recipe is self-contained; the produced image is identical and reusable across both recipes (the model is selected at runtime via `--model`).
- `container/README.md` — build flow, build args (`DYNAMO_SRC_IMAGE`, `DSV4_BASE_IMAGE`), troubleshooting.
- `model-cache/model-cache.yaml` — `ReadWriteMany` PVC. 400Gi for Flash, 1500Gi for Pro.
- `model-cache/model-download.yaml` — HuggingFace download Job, reads `hf-token-secret`.
- `vllm/agg/vllm-dgd.yaml` — `DynamoGraphDeployment`.
- `README.md` — prerequisites, Quick Start, recipe details with flag table, reasoning + tool calling verification curls, notes.
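Laid out as a tree, per the list above (Flash shown; Pro mirrors it):

```bash
tree recipes/deepseek-v4-flash
# recipes/deepseek-v4-flash
# ├── README.md
# ├── container
# │   ├── Dockerfile.dsv4
# │   └── README.md
# ├── model-cache
# │   ├── model-cache.yaml
# │   └── model-download.yaml
# └── vllm
#     └── agg
#         └── vllm-dgd.yaml
```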
DGD shapes

| Setting | Flash (4x B200) | Pro (8x B200) |
| --- | --- | --- |
| Parallelism | `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | `--tensor-parallel-size 8 --enable-expert-parallel` |
| KV cache | `--kv-cache-dtype fp8 --block-size 256` | same |
| Attention config | `--attention-config '{"use_fp4_indexer_cache":true}'` | same |
| Compilation config | `'{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | `'{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}'` (more conservative for the 1.6T model; matches the upstream V4-Pro example) |
| Parsers | `--tokenizer-mode deepseek_v4`, `--dyn-reasoning-parser deepseek_v4`, `--dyn-tool-call-parser deepseek_v4` | same |
| Startup probe | `failureThreshold: 360` | `failureThreshold: 540` — sized for the larger TP=8 weight load |
| Engine-ready timeout | `VLLM_ENGINE_READY_TIMEOUT_S` | `VLLM_ENGINE_READY_TIMEOUT_S` |
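Pulled together, the Flash column corresponds to a worker command along these lines. This is a sketch only: the `python -m dynamo.vllm` entrypoint, the model id, and the `--compilation-config` flag name are assumptions, and the authoritative args live in `vllm/agg/vllm-dgd.yaml`:

```bash
# Sketch of the Flash worker flags implied by the table above.
python -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel \
  --kv-cache-dtype fp8 --block-size 256 \
  --attention-config '{"use_fp4_indexer_cache":true}' \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --dyn-reasoning-parser deepseek_v4 \
  --dyn-tool-call-parser deepseek_v4
```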
Both worker pods run with `HF_HUB_OFFLINE=1` (weights served from the PVC, never contacting the HF Hub), and the manifests ship with the placeholder `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag` so users build the real image from their recipe's `container/Dockerfile.dsv4`.

Why TP=8 for Pro vs DP=4 for Flash?
Pro is ~5.5x larger than Flash on disk (~865 GB vs ~160 GB). With FP4+FP8 mixed weights, Pro does not fit in 4 ranks at typical batch shapes, so the upstream tested shape is TP=8 across all 8 GPUs of one node. Expert Parallel is still enabled on top of TP — TP shards the dense (attention/router/norm) weights, EP shards the experts.
Pro-specific reasoning modes
Per the model card, V4-Pro exposes three reasoning effort levels. The recipe surfaces them via `chat_template_kwargs`:

- `{}` — fast intuitive responses
- `{"thinking":true,"reasoning_effort":"high"}` — explicit chain-of-thought
- `{"thinking":true,"reasoning_effort":"max"}` — maximum reasoning depth (model card recommends `--max-model-len >= 393216`)
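For example, the high-effort mode can be requested per call; the endpoint host and model id below are illustrative:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"}
  }'
# The reasoning text comes back in message.reasoning_content,
# separate from message.content.
```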
Index update

`recipes/README.md` adds DeepSeek-V4-Flash and DeepSeek-V4-Pro rows under Experimental Recipes, both linking to their `container/Dockerfile.dsv4`.

Where should the reviewer start?
- `recipes/deepseek-v4-pro/README.md` and `recipes/deepseek-v4-flash/README.md` — user-facing flow.
- `recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml` — the larger, more conservative DGD; flag rationale in the README.
- `recipes/deepseek-v4-pro/container/Dockerfile.dsv4` (identical to Flash's) — reference image build.

Status / caveats
- No `perf.yaml`, no benchmark numbers.
- Both recipes depend on a custom container build from `recipes/<recipe>/container/Dockerfile.dsv4`.

Testing
Deployed both recipes via `kubectl apply -f vllm/agg/vllm-dgd.yaml`:

- `kubectl get dynamographdeployment` → `state: successful`, `All resources are ready` (both).
- `GET /v1/models` returns the expected model id.
- `POST /v1/chat/completions` with a reasoning prompt → `message.reasoning_content` populated, clean `message.content`.
- `POST /v1/chat/completions` with tool definitions → structured `message.tool_calls`, `finish_reason: "tool_calls"`.
- Pro reasoning modes exercised via `chat_template_kwargs`.
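A condensed version of that checklist (frontend URL and model id are illustrative):

```bash
kubectl get dynamographdeployment         # expect state: successful
curl -s http://localhost:8000/v1/models   # expect the served model id
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-ai/DeepSeek-V4-Flash","messages":[{"role":"user","content":"ping"}]}'
```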
Related

The `--dyn-reasoning-parser deepseek_v4` and `--dyn-tool-call-parser deepseek_v4` flags require that change.