
feat: add experimental DeepSeek-V4-Flash + V4-Pro vLLM agg recipes #8668

Merged
biswapanda merged 16 commits into main from bis/vllm-test-1
Apr 25, 2026
Conversation

@biswapanda
Contributor

@biswapanda biswapanda commented Apr 24, 2026

Overview

Adds two experimental aggregated vLLM serving recipes for the DeepSeek-V4 series, sharing a single dsv4 reference container:

  • recipes/deepseek-v4-flash — DeepSeek-V4-Flash (284B / 13B active), 4x B200, DP=4 + EP.
  • recipes/deepseek-v4-pro — DeepSeek-V4-Pro (1.6T / 49B active, 1M context), 8x B200, TP=8 + EP.

Both follow the existing recipe layout (recipes/qwen3-235b-a22b-fp8, recipes/llama-3-70b/vllm/agg) and are listed under Experimental Recipes in recipes/README.md — functional, not yet performance-tuned, not benchmarked, and depend on a custom container build.

Smoke-tested on a 4x B200 node (Flash, namespace bis-dsv4) and an 8x B200 node (Pro). Reasoning (reasoning_content) and tool calling (tool_calls) verified end-to-end via /v1/chat/completions against both.

Recipe layout (per recipe)

Each recipe directory contains:

  • container/Dockerfile.dsv4 — reference container build that overlays the Dynamo runtime (wheels, nats, etcd, NIXL, UCX, dynamo.vllm worker) on top of vllm/vllm-openai:deepseekv4-cu130. The two Dockerfiles are duplicates so each recipe is self-contained; the produced image is identical and is reusable across both recipes (the model is selected at runtime via --model).
    docker build -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \
      -t <your-registry>/vllm-dsv4:<tag> .
    
  • container/README.md — build flow, build args (DYNAMO_SRC_IMAGE, DSV4_BASE_IMAGE), troubleshooting.
  • model-cache/model-cache.yaml — ReadWriteMany PVC. 700Gi for Flash, 1500Gi for Pro.
  • model-cache/model-download.yaml — HuggingFace download Job, reads hf-token-secret.
  • vllm/agg/vllm-dgd.yaml — DynamoGraphDeployment.
  • README.md — prerequisites, Quick Start, recipe details with flag table, reasoning + tool calling verification curls, notes.

DGD shapes

|  | Flash | Pro |
| --- | --- | --- |
| Parallelism | `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | `--tensor-parallel-size 8 --enable-expert-parallel` |
| GPUs | 4x B200 (single node, 4 of 8) | 8x B200 (fills one node) |
| KV cache | `--kv-cache-dtype fp8 --block-size 256` | same |
| Indexer | `--attention-config '{"use_fp4_indexer_cache":true}'` | same |
| Cudagraphs | `'{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | `'{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}'` (more conservative for the 1.6T model; matches the upstream V4-Pro example) |
| Tokenizer / parsers | `--tokenizer-mode deepseek_v4`, `--dyn-reasoning-parser deepseek_v4`, `--dyn-tool-call-parser deepseek_v4` | same |
| Startup probe | 60 min (`failureThreshold: 360`) | 90 min (`failureThreshold: 540`) — sized for the larger TP=8 weight load |
| `VLLM_ENGINE_READY_TIMEOUT_S` | 3600 | 5400 |

Both worker pods run with HF_HUB_OFFLINE=1 (weights served from the PVC, never contacting the HF Hub), and the manifests ship with the placeholder nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag so users build the real image from their recipe's container/Dockerfile.dsv4.

Why TP=8 for Pro vs DP=4 for Flash?

Pro is ~5.5x larger than Flash on disk (~865 GB vs ~160 GB). With FP4+FP8 mixed weights, Pro does not fit in 4 ranks at typical batch shapes, so the upstream tested shape is TP=8 across all 8 GPUs of one node. Expert Parallel is still enabled on top of TP — TP shards the dense (attention/router/norm) weights, EP shards the experts.
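
As a back-of-envelope sanity check on those numbers (the ~180 GB usable-HBM figure below is an assumption for illustration, not taken from the recipe):

```python
# Rough per-rank weight footprint under TP/EP sharding (all figures approximate).
# Assumption: ~180 GB of usable HBM per B200 after runtime overhead; real
# headroom also depends on KV cache, activations, and cudagraph pools.
USABLE_HBM_GB = 180
PRO_WEIGHTS_GB = 865     # FP4+FP8 mixed checkpoint, on-disk size
FLASH_WEIGHTS_GB = 160

def fits(total_weights_gb: float, ranks: int, usable_gb: float = USABLE_HBM_GB) -> bool:
    """Naive model: weights shard roughly evenly across ranks."""
    return total_weights_gb / ranks < usable_gb

print(fits(PRO_WEIGHTS_GB, 4))    # ~216 GB/rank -> False
print(fits(PRO_WEIGHTS_GB, 8))    # ~108 GB/rank -> True
print(fits(FLASH_WEIGHTS_GB, 4))  # ~40 GB/rank  -> True
```

Under DP the dense weights are replicated rather than sharded, so this naive model is crudest for Flash; it is only meant to show why Pro cannot land on 4 ranks.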

Pro-specific reasoning modes

Per the model card, V4-Pro exposes three reasoning effort levels. The recipe surfaces them via chat_template_kwargs:

  • Non-think: {} — fast intuitive responses
  • Think High: {"thinking":true,"reasoning_effort":"high"} — explicit chain-of-thought
  • Think Max: {"thinking":true,"reasoning_effort":"max"} — maximum reasoning depth (model card recommends --max-model-len >= 393216)
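
A sketch of how those modes map onto a /v1/chat/completions request body (the model id and prompt are placeholders; the chat_template_kwargs values come straight from the list above):

```python
import json

# chat_template_kwargs per V4-Pro reasoning mode, as listed above.
MODES = {
    "non-think": {},
    "think-high": {"thinking": True, "reasoning_effort": "high"},
    "think-max": {"thinking": True, "reasoning_effort": "max"},
}

def build_request(mode: str, prompt: str) -> dict:
    # "deepseek-ai/DeepSeek-V4-Pro" is a placeholder model id; use whatever
    # your deployment's GET /v1/models actually returns.
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": MODES[mode],
    }

body = build_request("think-max", "Prove sqrt(2) is irrational.")
print(json.dumps(body["chat_template_kwargs"]))
# -> {"thinking": true, "reasoning_effort": "max"}
```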

Index update

recipes/README.md adds DeepSeek-V4-Flash and DeepSeek-V4-Pro rows under Experimental Recipes, both linking to their container/Dockerfile.dsv4.

Where should the reviewer start?

  • recipes/deepseek-v4-pro/README.md and recipes/deepseek-v4-flash/README.md — user-facing flow.
  • recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml — the larger, more conservative DGD; flag rationale in the README.
  • recipes/deepseek-v4-pro/container/Dockerfile.dsv4 (identical to Flash's) — reference image build.

Status / caveats

  • Experimental. Not performance-tuned, no perf.yaml, no benchmark numbers.
  • Requires custom container build. DeepSeek-V4 is not in a stock vLLM release yet; users build from recipes/<recipe>/container/Dockerfile.dsv4.
  • Slow first launch. Up to ~60 min (Flash) / ~90 min (Pro) for weight load + FlashInfer autotune + cudagraph warmup; startup probes are sized accordingly.
  • Scope. Aggregated single-node only. No disagg/P-D variant yet. No multi-node Pro variant.
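
The startup-probe budgets quoted above are just failureThreshold x periodSeconds; the arithmetic assumes periodSeconds: 10 (an assumption for illustration — check the shipped manifests):

```python
# Startup probe budget = failureThreshold * periodSeconds.
# periodSeconds: 10 is assumed here, not read from the manifests.
PERIOD_S = 10

def probe_budget_minutes(failure_threshold: int, period_s: int = PERIOD_S) -> float:
    return failure_threshold * period_s / 60

print(probe_budget_minutes(360))  # Flash: 60.0 minutes
print(probe_budget_minutes(540))  # Pro:   90.0 minutes
```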

Testing

Deployed both recipes via kubectl apply -f vllm/agg/vllm-dgd.yaml:

  • kubectl get dynamographdeployment — state: successful, All resources are ready (both).
  • GET /v1/models returns the expected model id.
  • POST /v1/chat/completions with a reasoning prompt → message.reasoning_content populated, clean message.content.
  • POST /v1/chat/completions with tool definitions → structured message.tool_calls, finish_reason: "tool_calls".
  • For Pro: verified all three reasoning effort modes via chat_template_kwargs.
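
The response-shape checks above amount to assertions like the following (the response dict is a hand-written stub for illustration, not captured server output):

```python
# Stubbed /v1/chat/completions response illustrating the fields the smoke
# tests check: reasoning_content populated, structured tool_calls, and
# finish_reason == "tool_calls". Field names follow the OpenAI-style schema.
response = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "content": "",
            "reasoning_content": "User wants weather; call the tool.",
            "tool_calls": [{
                "type": "function",
                "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
            }],
        },
    }],
}

choice = response["choices"][0]
assert choice["finish_reason"] == "tool_calls"
assert choice["message"]["reasoning_content"]  # reasoning populated
assert choice["message"]["tool_calls"][0]["function"]["name"] == "get_weather"
print("response shape OK")
```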

Related

  • Depends on the DeepSeek-V4 parser PR (reasoning + tool-call support) — the --dyn-reasoning-parser deepseek_v4 and --dyn-tool-call-parser deepseek_v4 flags require that change.
  • Upstream vLLM dsv4 image: vllm-project/vllm#40760.

@biswapanda biswapanda requested review from a team as code owners April 24, 2026 06:00
@biswapanda biswapanda marked this pull request as draft April 24, 2026 15:21
@biswapanda biswapanda marked this pull request as ready for review April 24, 2026 18:05
biswapanda and others added 3 commits April 24, 2026 12:59
Adds a DeepSeek-V4-Flash serving recipe on top of the dsv4-preview vLLM
image (vllm/vllm-openai:deepseekv4-cu130). Three pieces:

- container/Dockerfile.dsv4: overlays dynamo artifacts (wheels, nats, etcd,
  NIXL, UCX, dynamo.vllm worker) onto the dsv4 vLLM image without touching
  the vLLM install. Parameterized via DYNAMO_SRC_IMAGE / DSV4_BASE_IMAGE.

- recipes/deepseek-v4-flash/vllm/agg/deploy.yaml: plain vLLM (no dynamo CRD)
  aggregated Deployment + Service, DP=4 + EP on a single B200 node, reading
  weights from the shared-model-cache PVC.

- recipes/deepseek-v4-flash/vllm/dsv4/vllm-dgd.yaml: DynamoGraphDeployment
  variant using the overlay image above (Frontend + decode), same DP=4 + EP
  shape, suitable for dynamo-operator-managed namespaces.

Namespace-specific debug/PV/PVC manifests are intentionally omitted so the
recipe is portable; consumers point at their own shared-model-cache PVC.
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
@biswapanda biswapanda changed the title from "feat: add vllm recipe for DSV4 recipe test" to "feat: add vllm recipe for DSV4 recipe" Apr 24, 2026
@biswapanda biswapanda changed the title from "feat: add vllm recipe for DSV4 recipe" to "recipes: production-quality DeepSeek-V4-Flash vLLM agg recipe" Apr 24, 2026
@github-actions Bot added labels: documentation (Improvements or additions to documentation), backend::vllm (Relates to the vllm backend), deployment::k8s (Relates to dynamo deployment in kubernetes), backend::trtllm (Relates to the trtllm backend), planner — Apr 24, 2026
… not NGC

The dsv4 overlay pulls Dynamo artifacts (wheels, nats, etcd, NIXL, UCX,
dynamo.vllm worker) from a source image. The flow is: build the Dynamo
vLLM runtime from this repo first using <repo_root>/container/README.md,
then build the dsv4 overlay on top of it. Users should not need any
non-public NGC access.

- Dockerfile.dsv4: default DYNAMO_SRC_IMAGE is now
  `dynamo:latest-vllm-runtime` (the local tag produced by
  `container/render.py --framework vllm --target runtime`). Published
  `nvcr.io/nvidia/ai-dynamo/vllm-runtime:<tag>` is now an override
  option documented via --build-arg, not the default.
- container/README.md: restructure into a two-step build flow. Step 1
  links to the repo-root container/README.md with the exact render +
  docker build commands. Step 2 builds the dsv4 overlay. Drop the
  "access to the public Dynamo vLLM runtime image" / NGC-login
  prerequisite; replace the troubleshooting entry with the local-tag
  pull-access case.
- Recipe README.md (Prerequisite 4): mirror the two-step structure so
  users see the build order up front.
…ache PVC

Verified against https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/tree/main:
46 safetensors shards, one ~1.06 GB shard + 45 at ~3.57-3.59 GB, totaling
~160 GB on disk in FP4+FP8 mixed form (NOT ~300 GB).

- model-cache/model-cache.yaml: 700Gi -> 400Gi (~2.5x headroom over the
  actual weights; still room for HF cache metadata + one alternate
  revision without being wasteful).
- README.md Quick Start comment: update size (~300GB -> ~160GB) and
  download estimate (1-2 hours -> 30-60 min, a more typical figure for
  an xet-accelerated HF pull of this size).
- README.md Notes: add a "Model size" bullet that states the 46-shard
  layout and explains the PVC sizing rationale.
@dmitry-tokarev-nv dmitry-tokarev-nv changed the title from "recipes: add experimental DeepSeek-V4-Flash vLLM agg recipe" to "feat: add experimental DeepSeek-V4-Flash vLLM agg recipe" Apr 24, 2026
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
@dmitry-tokarev-nv
Contributor

/ok to test ed842e1

Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Adds the recipe directory referenced by recipes/README.md so its
documentation links resolve.

- recipes/deepseek-v4-pro/README.md            -- model + deploy guide
- recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml -- TP=8 + EP, 8x B200
- recipes/deepseek-v4-pro/model-cache/{model-cache.yaml,model-download.yaml}
- recipes/deepseek-v4-pro/container/{Dockerfile.dsv4,README.md} --
  shares the same dsv4 overlay image as the V4-Flash recipe.
@biswapanda biswapanda changed the title from "feat: add experimental DeepSeek-V4-Flash vLLM agg recipe" to "feat: add experimental DeepSeek-V4-Flash + V4-Pro vLLM agg recipes" Apr 25, 2026
@biswapanda biswapanda enabled auto-merge (squash) April 25, 2026 01:37
@biswapanda
Contributor Author

/ok to test 98d8494

Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
@biswapanda biswapanda merged commit 9c07899 into main Apr 25, 2026
58 of 59 checks passed
@biswapanda biswapanda deleted the bis/vllm-test-1 branch April 25, 2026 04:22
VincyZhang pushed a commit to VincyZhang/dynamo that referenced this pull request Apr 27, 2026
…i-dynamo#8668)

Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com>
Co-authored-by: Dmitry Tokarev <dtokarev@nvidia.com>
Signed-off-by: VincyZhang <wenxin.zhang@intel.com>