2 changes: 2 additions & 0 deletions recipes/README.md
@@ -69,6 +69,8 @@ These recipes are under active development and may require additional setup step
|-------|-----------|------|------|------------|-------|
| **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). |
| **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. |
| **[DeepSeek-V4-Flash](deepseek-v4-flash/vllm/agg/)** | vLLM | Aggregated | 4x B200 | ✅ | Text only — MoE model (284B / 13B active), DP=4 + EP, FP8 KV cache, reasoning + tool calling. Requires [custom container build](deepseek-v4-flash/container/). |
| **[DeepSeek-V4-Pro](deepseek-v4-pro/vllm/agg/)** | vLLM | Aggregated | 8x B200 | ✅ | Text only — MoE model (1.6T / 49B active, 1M context), TP=8 + EP, FP4+FP8 mixed checkpoint, FP8 KV cache, CSA+HCA attention, three reasoning effort modes, tool calling. Requires [custom container build](deepseek-v4-pro/container/). |

## Recipe Structure

161 changes: 161 additions & 0 deletions recipes/deepseek-v4-flash/README.md
@@ -0,0 +1,161 @@
# DeepSeek-V4-Flash Recipe

Aggregated-serving recipe for **DeepSeek-V4-Flash** on vLLM with Dynamo.

| Variant | Model | Status | Modality | Manifest | GPUs |
|---------|-------|--------|----------|----------|------|
| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Flash` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 4x B200 |

Aggregated, single-replica deployment: one decode pod running DP=4 + Expert Parallel across 4 B200 GPUs (TP=1). Tested on 4 of the 8 GPUs of a single B200 node.

## Prerequisites

1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md).
2. **GPU cluster** with at least 4 B200 GPUs available on one node.
3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Flash`.
4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Flash is not in a stock vLLM release yet, so the image is built in two steps:

1. Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../container/README.md) (this produces the local tag `dynamo:latest-vllm-runtime`).
2. Build the DeepSeek-V4-Flash overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root:

```bash
docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
-t <your-registry>/vllm-dsv4:<tag> .
```

Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `<your-registry>/vllm-dsv4:<tag>`.

## Quick Start

```bash
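# Run from recipes/deepseek-v4-flash/ (the manifest paths below are relative to this recipe directory)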
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token-here" \
-n ${NAMESPACE}

# Download model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster.
# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160GB on disk (46 safetensors shards,
# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s

# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build.

# Deploy
kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE}

# First launch of the decode worker takes up to ~60 minutes (weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
-l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
-n ${NAMESPACE} --timeout=3600s
```

## Test the Deployment

```bash
kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```

## Recipe Details

The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there:

| Flag | Purpose |
|------|---------|
| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
| `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | DP=4 + EP across the 4 GPUs (TP=1) |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Single-node DEP compilation config from the upstream recipe |
| `--max-num-seqs 256` | Concurrency cap |
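
Put together, the worker invocation looks roughly like the sketch below. This is for orientation only; the authoritative command lives in `vllm/agg/vllm-dgd.yaml`, and the `--model` flag here is an assumption (the standard vLLM argument passed through by `python3 -m dynamo.vllm`):

```bash
python3 -m dynamo.vllm \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --tokenizer-mode deepseek_v4 \
  --dyn-reasoning-parser deepseek_v4 \
  --dyn-tool-call-parser deepseek_v4 \
  --attention-config '{"use_fp4_indexer_cache":true}' \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --tensor-parallel-size 1 \
  --data-parallel-size 4 \
  --enable-expert-parallel \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --max-num-seqs 256
```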

## Model Details

| | |
|---|---|
| **Model** | `deepseek-ai/DeepSeek-V4-Flash` (MoE, 284B total / 13B active) |
| **Checkpoint** | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
| **Parallelism** | TP=1, DP=4, Expert Parallel enabled |
| **KV cache** | FP8, block size 256 |
| **Attention** | Hybrid CSA + HCA with Blackwell FP4 indexer cache |

## Verifying Reasoning

```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
"max_tokens": 200
}' | python3 -m json.tool
```

Expected:

- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
- `choices[0].message.content` contains only the final answer.
- No raw `</think>` tags in either field.

If `reasoning_content` is `null` and `</think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
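
To spot-check just those two fields, a `jq` variant of the same request works (assuming `jq` is installed; `python3 -m json.tool` plus manual inspection is equivalent):

```bash
# Same request as above, reduced to the parsed reasoning and final answer
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | jq '{reasoning_content: .choices[0].message.reasoning_content,
            content: .choices[0].message.content}'
```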

## Verifying Tool Calling

```bash
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"max_tokens": 300
}' | python3 -m json.tool
```

Expected:

- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
- `choices[0].finish_reason` is `"tool_calls"`.
- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.

If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
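
The same style of spot check applies here; assuming `jq` is installed, swap the `python3 -m json.tool` pipe in the curl above for:

```bash
# Reduces the response to the finish reason and the parsed function call
jq '{finish_reason: .choices[0].finish_reason,
     call: .choices[0].message.tool_calls[0].function}'
```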

## Notes

- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to frontend and worker pods.
- **Model size.** `deepseek-ai/DeepSeek-V4-Flash` is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision.
- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4.
- **First launch is slow.** The decode worker loads weights and warms CUDA graphs; the startup probe allows up to ~60 min (`failureThreshold: 360` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=3600` is set to match (see the probe sketch after this list).
- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
- **DP stability.** `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` and `VLLM_SKIP_P2P_CHECK=1` mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs on Blackwell.
- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
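
For reference, the probe numbers from the first-launch note above translate to roughly this shape on the decode worker. This is a sketch: the probe type, path, and port are assumptions, so defer to the actual fields in `vllm/agg/vllm-dgd.yaml`.

```yaml
startupProbe:
  httpGet:                  # assumed probe type; the manifest defines the real handler
    path: /health           # assumed endpoint
    port: 8000              # assumed port
  periodSeconds: 10
  failureThreshold: 360     # 360 probes x 10 s = up to ~60 min of warmup
env:
  - name: VLLM_ENGINE_READY_TIMEOUT_S
    value: "3600"           # matches the probe budget
```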
90 changes: 90 additions & 0 deletions recipes/deepseek-v4-flash/container/Dockerfile.dsv4
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image.
#
# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760
# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning
# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer.
#
# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm
# python worker) from a locally-built Dynamo vLLM runtime image (produced via
# <repo_root>/container/README.md) and layer them on top of the dsv4 vLLM image
# without touching the vLLM install.
#
# Build (run from the repo root):
# docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
# -t <your-registry>/vllm-dsv4:<tag> .
#
# See recipes/deepseek-v4-flash/container/README.md for build args and
# troubleshooting.
#
# Both base images must be Python 3.12 (verified).

# Default to the local tag produced by `container/render.py --framework vllm
# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override
# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead.
ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime
ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130

FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src

FROM ${DSV4_BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive

# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static
# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport).
RUN apt-get update && apt-get install -y --no-install-recommends \
libibverbs1 rdma-core ibverbs-utils libibumad3 \
libnuma1 librdmacm1 ibverbs-providers \
ca-certificates jq curl \
&& apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \
&& rm -rf /var/lib/apt/lists/*

# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer ---
# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c
# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image.
RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \
/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py

# --- static binaries ---
COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server
COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd
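# /usr/local/bin/etcd is copied as a directory (etcd plus etcdctl binaries), hence the PATH entry below.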
ENV PATH=/usr/local/bin/etcd:${PATH}

# --- UCX ---
COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx
ENV PATH=/usr/local/ucx/bin:${PATH}

# --- NIXL (C++ libs for KV transfer) ---
COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl
ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \
NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \
NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins
ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH}

# --- install dynamo python wheels into the dsv4 image's system python ---
# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages.
# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312.
COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse
RUN pip install --no-cache-dir \
/opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \
/opt/dynamo/wheelhouse/ai_dynamo*any.whl \
/opt/dynamo/wheelhouse/nixl/nixl*.whl

# --- dynamo python source (dynamo.vllm worker + common + mocker) ---
# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves.
COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo
ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-}

WORKDIR /workspace

# --- dynamo runtime env tweaks ---
# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here).
ENV VLLM_USE_FLASHINFER_SAMPLER=1

# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm`
# via the manifest command/args rather than the vLLM api_server entrypoint.
ENTRYPOINT []
CMD ["bash"]
88 changes: 88 additions & 0 deletions recipes/deepseek-v4-flash/container/README.md
@@ -0,0 +1,88 @@
# DeepSeek-V4-Flash Reference Container

DeepSeek-V4-Flash is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image.

- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer.
- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image.

Both layers use Python 3.12; no vLLM reinstall is performed.

## Build flow

Two Docker images are involved:

1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`<repo_root>/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker.
2. **DeepSeek-V4-Flash overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`).

## Step 1 — Build the Dynamo vLLM runtime

From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md):

```bash
# From <repo_root>
container/render.py --framework vllm --target runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
```

This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default.

## Step 2 — Build the DeepSeek-V4-Flash overlay

Still from the **repo root**:

```bash
docker build \
-f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```

The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward.

### Build args

Both can be overridden with `--build-arg`:

| Arg | Default | Purpose |
|-----|---------|---------|
| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. |
| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. The `cu129` tag is also available for CUDA 12.9 hosts. |

Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host:

```bash
docker build \
-f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
--build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2-cuda13 \
--build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \
-t <your-registry>/vllm-dsv4:<tag> \
.
```

## Push

```bash
docker push <your-registry>/vllm-dsv4:<tag>
```

## Wire into the recipe

Once the image is pushed, update the `image:` fields in
[`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `<your-registry>/vllm-dsv4:<tag>`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy.
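
The nesting below is a sketch (field names beyond the two service names are assumptions; match whatever your copy of the manifest already has):

```yaml
spec:
  services:
    Frontend:
      extraPodSpec:
        mainContainer:
          image: <your-registry>/vllm-dsv4:<tag>
    VllmDecodeWorker:
      extraPodSpec:
        mainContainer:
          image: <your-registry>/vllm-dsv4:<tag>
```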

## What the Dockerfile does

1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`).
2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image.
3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image.
4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime.
5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12.
6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves.
7. Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect.

## Troubleshooting

- **`pull access denied for dynamo:latest-vllm-runtime`** — Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=<your-image>`.
- **`no matching manifest for linux/amd64`** — the dsv4 base is amd64-only today; build on an x86_64 host.
- **CUDA version mismatch on the host** — use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9.
- **NIXL plugins not found at runtime** — confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec).
13 changes: 13 additions & 0 deletions recipes/deepseek-v4-flash/model-cache/model-cache.yaml
@@ -0,0 +1,13 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 400Gi
storageClassName: "your-storage-class-name"