2 changes: 2 additions & 0 deletions recipes/README.md
@@ -69,6 +69,8 @@ These recipes are under active development and may require additional setup step
|-------|-----------|------|------|------------|-------|
| **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). |
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. |
| **[DeepSeek-V4-Flash](deepseek-v4-flash/vllm/agg/)** | vLLM | Aggregated | 4x B200 | ✅ | Text only — MoE model (284B / 13B active), DP=4 + EP, FP8 KV cache, reasoning + tool calling. Requires [custom container build](deepseek-v4-flash/container/). |
| **[DeepSeek-V4-Pro](deepseek-v4-pro/vllm/agg/)** | vLLM | Aggregated | 8x B200 | ✅ | Text only — MoE model (1.6T / 49B active, 1M context), TP=8 + EP, FP4+FP8 mixed checkpoint, FP8 KV cache, CSA+HCA attention, three reasoning effort modes, tool calling. Requires [custom container build](deepseek-v4-pro/container/). |

## Recipe Structure

161 changes: 161 additions & 0 deletions recipes/deepseek-v4-flash/README.md
@@ -0,0 +1,161 @@
# DeepSeek-V4-Flash Recipe

Aggregated-serving recipe for **DeepSeek-V4-Flash** on vLLM with Dynamo.

| Variant | Model | Status | Modality | Manifest | GPUs |
|---------|-------|--------|----------|----------|------|
| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Flash` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 4x B200 |

Aggregated, single-replica deployment: one decode pod running DP=4 + Expert Parallel across 4 B200 GPUs (TP=1). Tested on 4 of the 8 GPUs of a single B200 node.

## Prerequisites

1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md).
2. **GPU cluster** with at least 4 B200 GPUs available on one node.
3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Flash`.
4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Flash is not in a stock vLLM release yet, so the image is built in two steps:

1. Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../container/README.md) (this produces the local tag `dynamo:latest-vllm-runtime`).
2. Build the DeepSeek-V4-Flash overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root:

```bash
docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
-t <your-registry>/vllm-dsv4:<tag> .
```

Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `<your-registry>/vllm-dsv4:<tag>`.

## Quick Start

```bash
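# Run from recipes/deepseek-v4-flash/ (the manifest paths below are relative to this recipe directory)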
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}

# Download model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster.
# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160GB on disk (46 safetensors shards,
# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s

# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build.

# Deploy
kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE}

# First launch of the decode worker takes up to ~60 minutes (weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
-l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
-n ${NAMESPACE} --timeout=3600s
```

## Test the Deployment

```bash
kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```

## Recipe Details

The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there:

| Flag | Purpose |
|------|---------|
| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
| `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | DP=4 + EP across the 4 GPUs (TP=1) |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Single-node DEP compilation config from the upstream recipe |
| `--max-num-seqs 256` | Concurrency cap |
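
Put together, the worker invocation looks roughly like the sketch below. This is for orientation only; the authoritative command lives in `vllm/agg/vllm-dgd.yaml`, and the `--model` flag here is an assumption (the standard vLLM argument passed through by `python3 -m dynamo.vllm`):

```bash
python3 -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tokenizer-mode deepseek_v4 \
  --dyn-reasoning-parser deepseek_v4 \
  --dyn-tool-call-parser deepseek_v4 \
  --attention-config '{"use_fp4_indexer_cache":true}' \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --tensor-parallel-size 1 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --max-num-seqs 256
```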

## Model Details

| | |
|---|---|
| **Model** | `deepseek-ai/DeepSeek-V4-Flash` (MoE, 284B total / 13B active) |
| **Checkpoint** | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
| **Parallelism** | TP=1, DP=4, Expert Parallel enabled |
| **KV cache** | FP8, block size 256 |
| **Attention** | Hybrid CSA + HCA with Blackwell FP4 indexer cache |

## Verifying Reasoning

```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
"max_tokens": 200
}' | python3 -m json.tool
```

Expected:

- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
- `choices[0].message.content` contains only the final answer.
- No raw `</think>` tags in either field.

If `reasoning_content` is `null` and `</think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
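
To spot-check just those two fields, a `jq` variant of the same request works (assuming `jq` is installed; `python3 -m json.tool` plus manual inspection is equivalent):

```bash
# Same request as above, reduced to the parsed reasoning and final answer
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | jq '{reasoning_content: .choices[0].message.reasoning_content,
            content: .choices[0].message.content}'
```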

## Verifying Tool Calling

```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"max_tokens": 300
}' | python3 -m json.tool
```

Expected:

- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
- `choices[0].finish_reason` is `"tool_calls"`.
- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.

If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
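
The same style of spot check applies here; assuming `jq` is installed, swap the `python3 -m json.tool` pipe in the curl above for:

```bash
# Reduces the response to the finish reason and the parsed function call
jq '{finish_reason: .choices[0].finish_reason,
     call: .choices[0].message.tool_calls[0].function}'
```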

## Notes

- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to frontend and worker pods.
- **Model size.** `deepseek-ai/DeepSeek-V4-Flash` is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision.
- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4.
- **First launch is slow.** The decode worker loads weights and warms CUDA graphs; the startup probe allows up to ~60 min (`failureThreshold: 360` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=3600` is set to match (see the probe sketch after this list).
- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
- **DP stability.** `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` and `VLLM_SKIP_P2P_CHECK=1` mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs on Blackwell.
- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
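
For reference, the probe numbers from the first-launch note above translate to roughly this shape on the decode worker. This is a sketch: the probe type, path, and port are assumptions, so defer to the actual fields in `vllm/agg/vllm-dgd.yaml`.

```yaml
startupProbe:
  httpGet:                  # assumed probe type; the manifest defines the real handler
    path: /health           # assumed endpoint
    port: 8000              # assumed port
  periodSeconds: 10
  failureThreshold: 360     # 360 probes x 10 s = up to ~60 min of warmup
env:
  - name: VLLM_ENGINE_READY_TIMEOUT_S
    value: "3600"           # matches the probe budget
```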
90 changes: 90 additions & 0 deletions recipes/deepseek-v4-flash/container/Dockerfile.dsv4
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image.
#
# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760
# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning
# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer.
#
# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm
# python worker) from a locally-built Dynamo vLLM runtime image (produced via
# <repo_root>/container/README.md) and layer them on top of the dsv4 vLLM image
# without touching the vLLM install.
#
# Build (run from the repo root):
# docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
# -t <your-registry>/vllm-dsv4:<tag> .
#
# See recipes/deepseek-v4-flash/container/README.md for build args and
# troubleshooting.
#
# Both base images must be Python 3.12 (verified).

# Default to the local tag produced by `container/render.py --framework vllm
# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override
# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead.
ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime
ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130

FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src

FROM ${DSV4_BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive

# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static
# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport).
RUN apt-get update && apt-get install -y --no-install-recommends \
libibverbs1 rdma-core ibverbs-utils libibumad3 \
libnuma1 librdmacm1 ibverbs-providers \
ca-certificates jq curl \
&& apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \
&& rm -rf /var/lib/apt/lists/*

# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer ---
# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c
# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image.
RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py

# --- static binaries ---
COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server
COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd
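# /usr/local/bin/etcd is copied as a directory (etcd plus etcdctl binaries), hence the PATH entry below.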
ENV PATH=/usr/local/bin/etcd:${PATH}

# --- UCX ---
COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx
ENV PATH=/usr/local/ucx/bin:${PATH}

# --- NIXL (C++ libs for KV transfer) ---
COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl
ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \
NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \
NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins
ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH}

# --- install dynamo python wheels into the dsv4 image's system python ---
# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages.
# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312.
COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse
RUN pip install --no-cache-dir \
/opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \
/opt/dynamo/wheelhouse/ai_dynamo*any.whl \
/opt/dynamo/wheelhouse/nixl/nixl*.whl

# --- dynamo python source (dynamo.vllm worker + common + mocker) ---
# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves.
COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo
ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-}

WORKDIR /workspace

# --- dynamo runtime env tweaks ---
# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here).
ENV VLLM_USE_FLASHINFER_SAMPLER=1

# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm`
# via the manifest command/args rather than the vLLM api_server entrypoint.
ENTRYPOINT []
CMD ["bash"]
88 changes: 88 additions & 0 deletions recipes/deepseek-v4-flash/container/README.md
@@ -0,0 +1,88 @@
# DeepSeek-V4-Flash Reference Container

DeepSeek-V4-Flash is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image.

- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer.
- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image.

Both layers use Python 3.12; no vLLM reinstall is performed.

## Build flow

Two Docker images are involved:

1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`<repo_root>/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker.
2. **DeepSeek-V4-Flash overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`).

## Step 1 — Build the Dynamo vLLM runtime

From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md):

```bash
# From <repo_root>
container/render.py --framework vllm --target runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
```

This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default.

## Step 2 — Build the DeepSeek-V4-Flash overlay

Still from the **repo root**:

```bash
docker build \
-f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```

The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward.

### Build args

Both can be overridden with `--build-arg`:

| Arg | Default | Purpose |
|-----|---------|---------|
| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. |
| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. The `cu129` tag is also available for CUDA 12.9 hosts. |

Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host:

```bash
docker build \
-f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
--build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2-cuda13 \
--build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```

## Push

```bash
docker push <your-registry>/vllm-dsv4:<tag>
```

## Wire into the recipe

Once the image is pushed, update the `image:` fields in
[`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `<your-registry>/vllm-dsv4:<tag>`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy.
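
The nesting below is a sketch (field names beyond the two service names are assumptions; match whatever your copy of the manifest already has):

```yaml
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          image: <your-registry>/vllm-dsv4:<tag>
    VllmDecodeWorker:
      extraPodSpec:
        mainContainer:
          image: <your-registry>/vllm-dsv4:<tag>
```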

## What the Dockerfile does

1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`).
2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image.
3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image.
4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime.
5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12.
6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves.
7. Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect.

## Troubleshooting

- **`pull access denied for dynamo:latest-vllm-runtime`** — Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=<your-image>`.
- **`no matching manifest for linux/amd64`** — the dsv4 base is amd64-only today; build on an x86_64 host.
- **CUDA version mismatch on the host** — use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9.
- **NIXL plugins not found at runtime** — confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec).
13 changes: 13 additions & 0 deletions recipes/deepseek-v4-flash/model-cache/model-cache.yaml
@@ -0,0 +1,13 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 400Gi
storageClassName: "your-storage-class-name"