diff --git a/recipes/README.md b/recipes/README.md index 2a5cebff55ab..b4bfa470dc25 100644 --- a/recipes/README.md +++ b/recipes/README.md @@ -69,6 +69,8 @@ These recipes are under active development and may require additional setup step |-------|-----------|------|------|------------|-------| | **[GLM-5-NVFP4](glm-5-nvfp4/sglang/disagg/)** | SGLang | Disagg Prefill/Decode | 20x GB200 | ✅ | NVFP4, EAGLE speculative decoding, TP16 decode + TP4 prefill. Requires [custom container build](glm-5-nvfp4/). | | **[nvidia/Kimi-K2.5-NVFP4](kimi-k2.5/trtllm/agg/nvidia/)** | TensorRT-LLM | Aggregated | 8x B200 | ✅ | Text only — MoE model, TP8×EP8, reasoning + tool calling. Vision input not yet functional. | +| **[DeepSeek-V4-Flash](deepseek-v4-flash/vllm/agg/)** | vLLM | Aggregated | 4x B200 | ✅ | Text only — MoE model (284B / 13B active), DP=4 + EP, FP8 KV cache, reasoning + tool calling. Requires [custom container build](deepseek-v4-flash/container/). | +| **[DeepSeek-V4-Pro](deepseek-v4-pro/vllm/agg/)** | vLLM | Aggregated | 8x B200 | ✅ | Text only — MoE model (1.6T / 49B active, 1M context), TP=8 + EP, FP4+FP8 mixed checkpoint, FP8 KV cache, CSA+HCA attention, three reasoning effort modes, tool calling. Requires [custom container build](deepseek-v4-pro/container/). | ## Recipe Structure diff --git a/recipes/deepseek-v4-flash/README.md b/recipes/deepseek-v4-flash/README.md new file mode 100644 index 000000000000..6e2f0070252d --- /dev/null +++ b/recipes/deepseek-v4-flash/README.md @@ -0,0 +1,161 @@ +# DeepSeek-V4-Flash Recipe + +Aggregated-serving recipe for **DeepSeek-V4-Flash** on vLLM with Dynamo. + +| Variant | Model | Status | Modality | Manifest | GPUs | +|---------|-------|--------|----------|----------|------| +| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Flash` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 4x B200 | + +Aggregated, single-replica: 1 decode pod running DP=4 + Expert Parallel on 4 B200 GPUs (TP=1). Tested on 4 of 8 GPUs per B200 node. + +## Prerequisites + +1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md). +2. **GPU cluster** with at least 4 B200 GPUs available on one node. +3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Flash`. +4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Flash is not in a stock vLLM release yet. It is built in two steps: + + 1. Build the Dynamo vLLM runtime image locally per [`/container/README.md`](../../container/README.md) (this produces the local tag `dynamo:latest-vllm-runtime`). + 2. Build the DeepSeek-V4-Flash overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root: + + ```bash + docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \ + -t /vllm-dsv4: . + ``` + + Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `/vllm-dsv4:`. + +## Quick Start + +```bash +export NAMESPACE=dynamo-demo +kubectl create namespace ${NAMESPACE} + +# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker) +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN="your-token-here" \ + -n ${NAMESPACE} + +# Download model into the model-cache PVC. +# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster. 
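+# (`kubectl get storageclass` lists the classes available in your cluster;
+# pick one whose provisioner supports ReadWriteMany.)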
+# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160GB on disk (46 safetensors shards,
+# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
+kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
+kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
+kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s
+
+# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build.
+
+# Deploy
+kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE}
+
+# First launch of the decode worker takes up to ~60 minutes (weight load +
+# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
+kubectl wait --for=condition=Ready pod \
+  -l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
+  -n ${NAMESPACE} --timeout=3600s
+```
+
+## Test the Deployment
+
+```bash
+kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}
+
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-V4-Flash",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 100
+  }'
+```
+
+## Recipe Details
+
+The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there:
+
+| Flag | Purpose |
+|------|---------|
+| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
+| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
+| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
+| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
+| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
+| `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | DP=4 + EP across the 4 GPUs (TP=1) |
+| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Single-node DEP compilation config from the upstream recipe |
+| `--max-num-seqs 256` | Concurrency cap |
+
+## Model Details
+
+| | |
+|---|---|
+| **Model** | `deepseek-ai/DeepSeek-V4-Flash` (MoE, 284B total / 13B active) |
+| **Checkpoint** | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
+| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
+| **Parallelism** | TP=1, DP=4, Expert Parallel enabled |
+| **KV cache** | FP8, block size 256 |
+| **Attention** | Hybrid CSA + HCA with Blackwell FP4 indexer cache |
+
+## Verifying Reasoning
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-V4-Flash",
+    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
+    "max_tokens": 200
+  }' | python3 -m json.tool
+```
+
+Expected:
+
+- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
+- `choices[0].message.content` contains only the final answer.
+- No raw `<think>` tags in either field.
+
+If `reasoning_content` is `null` and `<think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
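+
+For a quicker field-level check, the same request can be piped through `jq` (assumed installed on your workstation; the field paths follow the response shape above):
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-V4-Flash",
+    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
+    "max_tokens": 200
+  }' | jq '{reasoning: .choices[0].message.reasoning_content,
+            answer: .choices[0].message.content}'
+```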
+ +## Verifying Tool Calling + +```bash +curl -s http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek-ai/DeepSeek-V4-Flash", + "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}], + "tools": [{ + "type": "function", + "function": { + "name": "get_weather", + "description": "Get the current weather for a location", + "parameters": { + "type": "object", + "properties": { + "location": {"type": "string", "description": "City name"} + }, + "required": ["location"] + } + } + }], + "max_tokens": 300 + }' | python3 -m json.tool +``` + +Expected: + +- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`. +- `choices[0].finish_reason` is `"tool_calls"`. +- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection. + +If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command. + +## Notes + +- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to frontend and worker pods. +- **Model size.** `deepseek-ai/DeepSeek-V4-Flash` is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision. +- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4. +- **First launch is slow.** The decode worker loads weights and warms CUDA graphs; the startup probe allows up to ~60 min (`failureThreshold: 360` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=3600` is set to match. +- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer. +- **DP stability.** `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` and `VLLM_SKIP_P2P_CHECK=1` mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs on Blackwell. +- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed. diff --git a/recipes/deepseek-v4-flash/container/Dockerfile.dsv4 b/recipes/deepseek-v4-flash/container/Dockerfile.dsv4 new file mode 100644 index 000000000000..e3a09ce479d3 --- /dev/null +++ b/recipes/deepseek-v4-flash/container/Dockerfile.dsv4 @@ -0,0 +1,90 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image. +# +# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760 +# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning +# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer. +# +# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm +# python worker) from a locally-built Dynamo vLLM runtime image (produced via +# /container/README.md) and layer them on top of the dsv4 vLLM image +# without touching the vLLM install. 
+# +# Build (run from the repo root): +# docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \ +# -t /vllm-dsv4: . +# +# See recipes/deepseek-v4-flash/container/README.md for build args and +# troubleshooting. +# +# Both base images must be Python 3.12 (verified). + +# Default to the local tag produced by `container/render.py --framework vllm +# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override +# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead. +ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime +ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130 + +FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src + +FROM ${DSV4_BASE_IMAGE} + +ENV DEBIAN_FRONTEND=noninteractive + +# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static +# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport). +RUN apt-get update && apt-get install -y --no-install-recommends \ + libibverbs1 rdma-core ibverbs-utils libibumad3 \ + libnuma1 librdmacm1 ibverbs-providers \ + ca-certificates jq curl \ + && apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \ + && rm -rf /var/lib/apt/lists/* + +# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer --- +# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c +# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image. +RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \ + /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py + +# --- static binaries --- +COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server +COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd +ENV PATH=/usr/local/bin/etcd:${PATH} + +# --- UCX --- +COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx +ENV PATH=/usr/local/ucx/bin:${PATH} + +# --- NIXL (C++ libs for KV transfer) --- +COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl +ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \ + NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \ + NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins +ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH} + +# --- install dynamo python wheels into the dsv4 image's system python --- +# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages. +# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312. +COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse +RUN pip install --no-cache-dir \ + /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \ + /opt/dynamo/wheelhouse/ai_dynamo*any.whl \ + /opt/dynamo/wheelhouse/nixl/nixl*.whl + +# --- dynamo python source (dynamo.vllm worker + common + mocker) --- +# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves. +COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo +ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-} + +WORKDIR /workspace + +# --- dynamo runtime env tweaks --- +# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here). +ENV VLLM_USE_FLASHINFER_SAMPLER=1 + +# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm` +# via the manifest command/args rather than the vLLM api_server entrypoint. 
+ENTRYPOINT [] +CMD ["bash"] diff --git a/recipes/deepseek-v4-flash/container/README.md b/recipes/deepseek-v4-flash/container/README.md new file mode 100644 index 000000000000..9f5b4be83ce4 --- /dev/null +++ b/recipes/deepseek-v4-flash/container/README.md @@ -0,0 +1,88 @@ +# DeepSeek-V4-Flash Reference Container + +DeepSeek-V4-Flash is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image. + +- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer. +- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image. + +Both layers use Python 3.12; no vLLM reinstall is performed. + +## Build flow + +Two Docker images are involved: + +1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker. +2. **DeepSeek-V4-Flash overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`). + +## Step 1 — Build the Dynamo vLLM runtime + +From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md): + +```bash +# From +container/render.py --framework vllm --target runtime --output-short-filename +docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile . +``` + +This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default. + +## Step 2 — Build the DeepSeek-V4-Flash overlay + +Still from the **repo root**: + +```bash +docker build \ + -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \ + -t /vllm-dsv4: \ + . +``` + +The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward. + +### Build args + +Both can be overridden with `--build-arg`: + +| Arg | Default | Purpose | +|-----|---------|---------| +| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. | +| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. The `cu129` tag is also available for CUDA 12.9 hosts. | + +Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host: + +```bash +docker build \ + -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \ + --build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2-cuda13 \ + --build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \ + -t /vllm-dsv4: \ + . 
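+
+# Optional smoke test of the resulting image (hypothetical placeholder tag;
+# add --gpus all if the import initializes CUDA on your setup):
+#   docker run --rm <your-registry>/vllm-dsv4:<tag> python3 -c "import dynamo.vllm"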
+``` + +## Push + +```bash +docker push /vllm-dsv4: +``` + +## Wire into the recipe + +Once the image is pushed, update the `image:` fields in +[`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `/vllm-dsv4:`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy. + +## What the Dockerfile does + +1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`). +2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image. +3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image. +4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime. +5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12. +6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves. +7. Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect. + +## Troubleshooting + +- **`pull access denied for dynamo:latest-vllm-runtime`** — Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=`. +- **`no matching manifest for linux/amd64`** — the dsv4 base is amd64-only today; build on an x86_64 host. +- **CUDA version mismatch on the host** — use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9. +- **NIXL plugins not found at runtime** — confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec). diff --git a/recipes/deepseek-v4-flash/model-cache/model-cache.yaml b/recipes/deepseek-v4-flash/model-cache/model-cache.yaml new file mode 100644 index 000000000000..244c1f5eda51 --- /dev/null +++ b/recipes/deepseek-v4-flash/model-cache/model-cache.yaml @@ -0,0 +1,13 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: model-cache +spec: + accessModes: + - ReadWriteMany + resources: + requests: + storage: 400Gi + storageClassName: "your-storage-class-name" diff --git a/recipes/deepseek-v4-flash/model-cache/model-download.yaml b/recipes/deepseek-v4-flash/model-cache/model-download.yaml new file mode 100644 index 000000000000..a612c799d597 --- /dev/null +++ b/recipes/deepseek-v4-flash/model-cache/model-download.yaml @@ -0,0 +1,42 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +apiVersion: batch/v1 +kind: Job +metadata: + name: model-download +spec: + backoffLimit: 3 + completions: 1 + parallelism: 1 + template: + metadata: + labels: + app: model-download + spec: + restartPolicy: Never + containers: + - name: model-download + image: python:3.10-slim + command: ["sh", "-c"] + envFrom: + - secretRef: + name: hf-token-secret + env: + - name: MODEL_NAME + value: deepseek-ai/DeepSeek-V4-Flash + - name: HF_HOME + value: /model-store + - name: HF_XET_HIGH_PERFORMANCE + value: "1" + args: + - | + set -eux + pip install --no-cache-dir huggingface_hub==1.11.0 + hf download $MODEL_NAME + volumeMounts: + - name: model-cache + mountPath: /model-store + volumes: + - name: model-cache + persistentVolumeClaim: + claimName: model-cache diff --git a/recipes/deepseek-v4-flash/vllm/agg/vllm-dgd.yaml b/recipes/deepseek-v4-flash/vllm/agg/vllm-dgd.yaml new file mode 100644 index 000000000000..2ccc0bb867fc --- /dev/null +++ b/recipes/deepseek-v4-flash/vllm/agg/vllm-dgd.yaml @@ -0,0 +1,112 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# DynamoGraphDeployment for deepseek-ai/DeepSeek-V4-Flash on vLLM, +# aggregated serving (no prefill/decode disaggregation). +# +# Upstream vLLM recipe: +# https://github.com/vllm-project/recipes/blob/main/models/deepseek-ai/DeepSeek-V4-Flash.yaml +# +# Shape: 1 replica x 4 B200 GPUs, DP=4 + Expert Parallel, TP=1. +# Tested on 4 of 8 GPUs per B200 node. +# +# Image: replace the `:my-tag` placeholder with a Dynamo + vLLM image that +# includes the DeepSeek-V4 stack. See `../../container/README.md` +# for the reference build -- it overlays dynamo on +# vllm/vllm-openai:deepseekv4-cu130. +# +# Weights: served from the `model-cache` PVC populated by +# `../../model-cache/model-download.yaml`. +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: dsv4-flash-agg +spec: + backendFramework: vllm + pvcs: + - name: model-cache + create: false + services: + Frontend: + componentType: frontend + replicas: 1 + volumeMounts: + - name: model-cache + mountPoint: /opt/models + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag + workingDir: /workspace/examples/backends/vllm + env: + - name: HF_HOME + value: /opt/models + - name: HF_HUB_OFFLINE + value: "1" + VllmDecodeWorker: + componentType: worker + subComponentType: decode + envFromSecret: hf-token-secret + volumeMounts: + - name: model-cache + mountPoint: /opt/models + sharedMemory: + size: 200Gi + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag + workingDir: /workspace/examples/backends/vllm + # Up to ~60 min for first launch: weight load + FlashInfer autotune + + # cudagraph warmup. periodSeconds * failureThreshold = 10 * 360 = 3600s. + startupProbe: + httpGet: + path: /health + port: 9090 + periodSeconds: 10 + timeoutSeconds: 10 + failureThreshold: 360 + env: + - name: SERVED_MODEL_NAME + value: deepseek-ai/DeepSeek-V4-Flash + - name: MODEL_PATH + value: deepseek-ai/DeepSeek-V4-Flash + - name: HF_HOME + value: /opt/models + # Read weights from the PVC only; do not hit the HF Hub at startup. + - name: HF_HUB_OFFLINE + value: "1" + # Give the engine room to finish first-launch init. + - name: VLLM_ENGINE_READY_TIMEOUT_S + value: "3600" + # Stabilize DP dummy inputs (matches the DeepSeek-R1 vLLM recipe). 
+ - name: VLLM_RANDOMIZE_DP_DUMMY_INPUTS + value: "1" + - name: VLLM_SKIP_P2P_CHECK + value: "1" + - name: NCCL_CUMEM_ENABLE + value: "1" + command: + - /bin/sh + - -c + args: + - | + python3 -m dynamo.vllm \ + --model "${MODEL_PATH}" \ + --served-model-name "${SERVED_MODEL_NAME}" \ + --trust-remote-code \ + --kv-cache-dtype fp8 \ + --block-size 256 \ + --tensor-parallel-size 1 \ + --data-parallel-size 4 \ + --enable-expert-parallel \ + --tokenizer-mode deepseek_v4 \ + --dyn-reasoning-parser deepseek_v4 \ + --dyn-tool-call-parser deepseek_v4 \ + --attention-config '{"use_fp4_indexer_cache":true}' \ + --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \ + --max-num-seqs 256 + replicas: 1 + resources: + limits: + gpu: "4" + requests: + gpu: "4" diff --git a/recipes/deepseek-v4-pro/README.md b/recipes/deepseek-v4-pro/README.md new file mode 100644 index 000000000000..4fe2fd3cd8ee --- /dev/null +++ b/recipes/deepseek-v4-pro/README.md @@ -0,0 +1,179 @@ +# DeepSeek-V4-Pro Recipe + +Aggregated-serving recipe for **DeepSeek-V4-Pro** on vLLM with Dynamo. + +| Variant | Model | Status | Modality | Manifest | GPUs | +|---------|-------|--------|----------|----------|------| +| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Pro` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 8x B200 | + +Aggregated, single-replica: 1 decode pod running TP=8 + Expert Parallel on all 8 GPUs of one node. + +## Prerequisites + +1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md). +2. **GPU cluster** with at least 8 B200 GPUs available on one node (TP=8 fills an 8-GPU box). +3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Pro`. +4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Pro is not in a stock vLLM release yet. It is built in two steps: + + 1. Build the Dynamo vLLM runtime image locally per [`/container/README.md`](../../container/README.md) (this produces the local tag `dynamo:latest-vllm-runtime`). + 2. Build the DeepSeek-V4-Pro overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root: + + ```bash + docker build -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \ + -t /vllm-dsv4: . + ``` + + Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `/vllm-dsv4:`. + + > The Pro and Flash recipes share the same dsv4 image. If you've already built it for [deepseek-v4-flash](../deepseek-v4-flash/), reuse the tag here — model selection happens at runtime via `--model`. + +## Quick Start + +```bash +export NAMESPACE=dynamo-demo +kubectl create namespace ${NAMESPACE} + +# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker) +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN="your-token-here" \ + -n ${NAMESPACE} + +# Download model into the model-cache PVC. +# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster. +# The PVC requests 1500Gi; DeepSeek-V4-Pro is ~865 GB on disk (64 safetensors shards, +# FP4+FP8 mixed) and typically takes 1.5-3 hours to download on first apply. 
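+# (After the Job is applied below, tail download progress with:
+#   kubectl logs -f job/model-download -n ${NAMESPACE})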
+kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE} +kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE} +kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=14400s + +# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build. + +# Deploy +kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE} + +# First launch of the decode worker takes up to ~90 minutes (TP=8 weight load + +# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this. +kubectl wait --for=condition=Ready pod \ + -l nvidia.com/dynamo-graph-deployment-name=dsv4-pro-agg \ + -n ${NAMESPACE} --timeout=5400s +``` + +## Test the Deployment + +```bash +kubectl port-forward svc/dsv4-pro-agg-frontend 8000:8000 -n ${NAMESPACE} + +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek-ai/DeepSeek-V4-Pro", + "messages": [{"role": "user", "content": "Hello!"}], + "max_tokens": 100 + }' +``` + +## Recipe Details + +The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there: + +| Flag | Purpose | +|------|---------| +| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer | +| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` | +| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` | +| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention | +| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe | +| `--tensor-parallel-size 8 --enable-expert-parallel` | TP=8 across 8 GPUs of one node, with EP enabled for the MoE experts | +| `--compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}'` | Conservative cudagraph mode appropriate for the larger Pro model (matches upstream V4-Pro example) | +| `--max-num-seqs 256` | Concurrency cap | + +### Why TP=8 (not DP=4 like Flash)? + +DeepSeek-V4-Pro is ~5.5x larger than Flash on disk (~865 GB vs. ~160 GB). With FP4+FP8 mixed weights it does not fit in 4 ranks at typical batch shapes, so the upstream tested shape for Pro is **TP=8 across all 8 GPUs of one node**. Expert Parallel is still enabled on top of TP — TP shards the dense (attention/router/norm) weights, EP shards the experts. + +## Model Details + +Sourced from the [`deepseek-ai/DeepSeek-V4-Pro` model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) (preview release): + +| | | +|---|---| +| **Model** | `deepseek-ai/DeepSeek-V4-Pro` (MoE, 1.6T total / 49B active per token) | +| **Context length** | 1M tokens | +| **Checkpoint** | Mixed precision — MoE expert weights in FP4; most other parameters in FP8 | +| **Attention** | Hybrid Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA). Recipe enables the Blackwell FP4 indexer cache via `--attention-config '{"use_fp4_indexer_cache":true}'` | +| **Residual path** | Manifold-Constrained Hyper-Connections (mHC) | +| **Reasoning modes** | Three effort levels exposed via `chat_template_kwargs`: `{}` (Non-think), `{"thinking":true,"reasoning_effort":"high"}` (Think High), `{"thinking":true,"reasoning_effort":"max"}` (Think Max — needs `--max-model-len >= 393216`) | +| **Long-context efficiency** | Per the model card, ~27% of the per-token inference FLOPs and ~10% of the KV cache vs. 
DeepSeek-V3.2 at 1M context |
+| **License** | MIT |
+
+Recipe-level (not model-card) settings in this deployment:
+
+| | |
+|---|---|
+| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
+| **Parallelism** | TP=8, Expert Parallel enabled |
+| **KV cache** | FP8, block size 256 |
+
+## Verifying Reasoning
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-V4-Pro",
+    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
+    "max_tokens": 200
+  }' | python3 -m json.tool
+```
+
+Expected:
+
+- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
+- `choices[0].message.content` contains only the final answer.
+- No raw `<think>` tags in either field.
+
+If `reasoning_content` is `null` and `<think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
+
+## Verifying Tool Calling
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-V4-Pro",
+    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
+    "tools": [{
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get the current weather for a location",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "location": {"type": "string", "description": "City name"}
+          },
+          "required": ["location"]
+        }
+      }
+    }],
+    "max_tokens": 300
+  }' | python3 -m json.tool
+```
+
+Expected:
+
+- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
+- `choices[0].finish_reason` is `"tool_calls"`.
+- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.
+
+If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
+
+## Notes
+
+- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to frontend and worker pods.
+- **Model size.** `deepseek-ai/DeepSeek-V4-Pro` is ~865 GB on disk (64 safetensors shards in FP4+FP8 mixed form). The 1500Gi PVC leaves ~1.7x headroom for HF cache metadata and one alternate revision.
+- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4.
+- **First launch is slow.** The decode worker loads weights across 8 TP ranks and warms CUDA graphs; the startup probe allows up to ~90 min (`failureThreshold: 540` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=5400` is set to match.
+- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
+- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
+- **Sibling recipe.** [DeepSeek-V4-Flash](../deepseek-v4-flash/) is the smaller sibling (284B / 13B active, DP=4 + EP on 4 B200 GPUs) and uses the same dsv4 container image.
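+
+## Selecting a Reasoning Mode
+
+The three effort levels from Model Details are selected per request via `chat_template_kwargs`. A sketch, assuming the frontend forwards `chat_template_kwargs` through to the chat template the way stock vLLM's OpenAI server does:
+
+```bash
+curl -s http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-ai/DeepSeek-V4-Pro",
+    "messages": [{"role": "user", "content": "Prove there are infinitely many primes."}],
+    "chat_template_kwargs": {"thinking": true, "reasoning_effort": "high"},
+    "max_tokens": 2000
+  }' | python3 -m json.tool
+```
+
+Omit `chat_template_kwargs` entirely for Non-think mode. Per the model card, `"reasoning_effort": "max"` additionally requires the worker to run with `--max-model-len >= 393216`.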
diff --git a/recipes/deepseek-v4-pro/container/Dockerfile.dsv4 b/recipes/deepseek-v4-pro/container/Dockerfile.dsv4 new file mode 100644 index 000000000000..81a5f267f059 --- /dev/null +++ b/recipes/deepseek-v4-pro/container/Dockerfile.dsv4 @@ -0,0 +1,91 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image. +# Shared image for all DeepSeek-V4 recipes (Flash, Pro, ...). +# +# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760 +# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning +# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer. +# +# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm +# python worker) from a locally-built Dynamo vLLM runtime image (produced via +# /container/README.md) and layer them on top of the dsv4 vLLM image +# without touching the vLLM install. +# +# Build (run from the repo root): +# docker build -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \ +# -t /vllm-dsv4: . +# +# See recipes/deepseek-v4-pro/container/README.md for build args and +# troubleshooting. +# +# Both base images must be Python 3.12 (verified). + +# Default to the local tag produced by `container/render.py --framework vllm +# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override +# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead. +ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime +ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130 + +FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src + +FROM ${DSV4_BASE_IMAGE} + +ENV DEBIAN_FRONTEND=noninteractive + +# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static +# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport). +RUN apt-get update && apt-get install -y --no-install-recommends \ + libibverbs1 rdma-core ibverbs-utils libibumad3 \ + libnuma1 librdmacm1 ibverbs-providers \ + ca-certificates jq curl \ + && apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \ + && rm -rf /var/lib/apt/lists/* + +# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer --- +# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c +# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image. 
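+# To verify the patch in a built image:
+#   grep -n '(512, 2048)' /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py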
+RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \ + /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py + +# --- static binaries --- +COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server +COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd +ENV PATH=/usr/local/bin/etcd:${PATH} + +# --- UCX --- +COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx +ENV PATH=/usr/local/ucx/bin:${PATH} + +# --- NIXL (C++ libs for KV transfer) --- +COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl +ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \ + NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \ + NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins +ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH} + +# --- install dynamo python wheels into the dsv4 image's system python --- +# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages. +# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312. +COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse +RUN pip install --no-cache-dir \ + /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \ + /opt/dynamo/wheelhouse/ai_dynamo*any.whl \ + /opt/dynamo/wheelhouse/nixl/nixl*.whl + +# --- dynamo python source (dynamo.vllm worker + common + mocker) --- +# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves. +COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo +ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-} + +WORKDIR /workspace + +# --- dynamo runtime env tweaks --- +# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here). +ENV VLLM_USE_FLASHINFER_SAMPLER=1 + +# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm` +# via the manifest command/args rather than the vLLM api_server entrypoint. +ENTRYPOINT [] +CMD ["bash"] diff --git a/recipes/deepseek-v4-pro/container/README.md b/recipes/deepseek-v4-pro/container/README.md new file mode 100644 index 000000000000..2391f4ae3dab --- /dev/null +++ b/recipes/deepseek-v4-pro/container/README.md @@ -0,0 +1,90 @@ +# DeepSeek-V4-Pro Reference Container + +DeepSeek-V4-Pro is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image. The image is the same one the V4-Flash recipe uses — DeepSeek-V4-Flash and DeepSeek-V4-Pro share the same vLLM dsv4 stack — but is duplicated here so each recipe is self-contained. + +- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer. +- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image. + +Both layers use Python 3.12; no vLLM reinstall is performed. + +## Build flow + +Two Docker images are involved: + +1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker. +2. 
**DeepSeek-V4-Pro overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`). + +## Step 1 — Build the Dynamo vLLM runtime + +From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md): + +```bash +# From +container/render.py --framework vllm --target runtime --output-short-filename +docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile . +``` + +This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default. + +## Step 2 — Build the DeepSeek-V4-Pro overlay + +Still from the **repo root**: + +```bash +docker build \ + -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \ + -t /vllm-dsv4: \ + . +``` + +The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward. + +> If you have already built the dsv4 overlay for the V4-Flash recipe, you can reuse the same image tag here — there is nothing model-specific in the container. The model is selected at runtime via `--model`. + +### Build args + +Both can be overridden with `--build-arg`: + +| Arg | Default | Purpose | +|-----|---------|---------| +| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. | +| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. The `cu129` tag is also available for CUDA 12.9 hosts. | + +Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host: + +```bash +docker build \ + -f recipes/deepseek-v4-pro/container/Dockerfile.dsv4 \ + --build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2-cuda13 \ + --build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \ + -t /vllm-dsv4: \ + . +``` + +## Push + +```bash +docker push /vllm-dsv4: +``` + +## Wire into the recipe + +Once the image is pushed, update the `image:` fields in +[`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `/vllm-dsv4:`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy. + +## What the Dockerfile does + +1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`). +2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image. +3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image. +4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime. +5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12. +6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves. +7. 
Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect. + +## Troubleshooting + +- **`pull access denied for dynamo:latest-vllm-runtime`** — Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=`. +- **`no matching manifest for linux/amd64`** — the dsv4 base is amd64-only today; build on an x86_64 host. +- **CUDA version mismatch on the host** — use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9. +- **NIXL plugins not found at runtime** — confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec). diff --git a/recipes/deepseek-v4-pro/model-cache/model-cache.yaml b/recipes/deepseek-v4-pro/model-cache/model-cache.yaml new file mode 100644 index 000000000000..53618b0bfbe9 --- /dev/null +++ b/recipes/deepseek-v4-pro/model-cache/model-cache.yaml @@ -0,0 +1,13 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: model-cache +spec: + accessModes: + - ReadWriteMany + resources: + requests: + storage: 1500Gi + storageClassName: "your-storage-class-name" diff --git a/recipes/deepseek-v4-pro/model-cache/model-download.yaml b/recipes/deepseek-v4-pro/model-cache/model-download.yaml new file mode 100644 index 000000000000..f1c4a1726e03 --- /dev/null +++ b/recipes/deepseek-v4-pro/model-cache/model-download.yaml @@ -0,0 +1,42 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +apiVersion: batch/v1 +kind: Job +metadata: + name: model-download +spec: + backoffLimit: 3 + completions: 1 + parallelism: 1 + template: + metadata: + labels: + app: model-download + spec: + restartPolicy: Never + containers: + - name: model-download + image: python:3.10-slim + command: ["sh", "-c"] + envFrom: + - secretRef: + name: hf-token-secret + env: + - name: MODEL_NAME + value: deepseek-ai/DeepSeek-V4-Pro + - name: HF_HOME + value: /model-store + - name: HF_XET_HIGH_PERFORMANCE + value: "1" + args: + - | + set -eux + pip install --no-cache-dir huggingface_hub==1.11.0 + hf download $MODEL_NAME + volumeMounts: + - name: model-cache + mountPath: /model-store + volumes: + - name: model-cache + persistentVolumeClaim: + claimName: model-cache diff --git a/recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml b/recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml new file mode 100644 index 000000000000..dd88474d5eb9 --- /dev/null +++ b/recipes/deepseek-v4-pro/vllm/agg/vllm-dgd.yaml @@ -0,0 +1,117 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# DynamoGraphDeployment for deepseek-ai/DeepSeek-V4-Pro on vLLM, +# aggregated serving (no prefill/decode disaggregation). +# +# Upstream reference command: +# docker run --gpus all ... 
vllm/vllm-openai:deepseekv4-cu130 \ +# deepseek-ai/DeepSeek-V4-Pro \ +# --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \ +# --enable-expert-parallel --tensor-parallel-size 8 \ +# --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \ +# --attention_config.use_fp4_indexer_cache=True \ +# --tokenizer-mode deepseek_v4 \ +# --tool-call-parser deepseek_v4 --enable-auto-tool-choice \ +# --reasoning-parser deepseek_v4 +# +# Shape: 1 replica x 8 GPUs, TP=8 + Expert Parallel. Fills a single 8-GPU node. +# +# Image: replace the `:my-tag` placeholder with a Dynamo + vLLM image that +# includes the DeepSeek-V4 stack. See `../../container/README.md` +# for the reference build -- it overlays dynamo on +# vllm/vllm-openai:deepseekv4-cu130. +# +# Weights: served from the `model-cache` PVC populated by +# `../../model-cache/model-download.yaml`. +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: dsv4-pro-agg +spec: + backendFramework: vllm + pvcs: + - name: model-cache + create: false + services: + Frontend: + componentType: frontend + replicas: 1 + volumeMounts: + - name: model-cache + mountPoint: /opt/models + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag + workingDir: /workspace/examples/backends/vllm + env: + - name: HF_HOME + value: /opt/models + - name: HF_HUB_OFFLINE + value: "1" + VllmDecodeWorker: + componentType: worker + subComponentType: decode + envFromSecret: hf-token-secret + volumeMounts: + - name: model-cache + mountPoint: /opt/models + sharedMemory: + size: 200Gi + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag + workingDir: /workspace/examples/backends/vllm + # DeepSeek-V4-Pro (1.6T params) is large; first launch loads weights + # over TP=8 ranks plus FlashInfer autotune + cudagraph warmup. Allow + # ~90 min: periodSeconds * failureThreshold = 10 * 540 = 5400s. + startupProbe: + httpGet: + path: /health + port: 9090 + periodSeconds: 10 + timeoutSeconds: 10 + failureThreshold: 540 + env: + - name: SERVED_MODEL_NAME + value: deepseek-ai/DeepSeek-V4-Pro + - name: MODEL_PATH + value: deepseek-ai/DeepSeek-V4-Pro + - name: HF_HOME + value: /opt/models + # Read weights from the PVC only; do not hit the HF Hub at startup. + - name: HF_HUB_OFFLINE + value: "1" + # Give the engine room to finish first-launch init. + - name: VLLM_ENGINE_READY_TIMEOUT_S + value: "5400" + # Stabilize TP/EP all-reduces and skip the IPC P2P probe. + - name: VLLM_SKIP_P2P_CHECK + value: "1" + - name: NCCL_CUMEM_ENABLE + value: "1" + command: + - /bin/sh + - -c + args: + - | + python3 -m dynamo.vllm \ + --model "${MODEL_PATH}" \ + --served-model-name "${SERVED_MODEL_NAME}" \ + --trust-remote-code \ + --kv-cache-dtype fp8 \ + --block-size 256 \ + --tensor-parallel-size 8 \ + --enable-expert-parallel \ + --tokenizer-mode deepseek_v4 \ + --dyn-reasoning-parser deepseek_v4 \ + --dyn-tool-call-parser deepseek_v4 \ + --attention-config '{"use_fp4_indexer_cache":true}' \ + --compilation-config '{"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}' \ + --max-num-seqs 256 + replicas: 1 + resources: + limits: + gpu: "8" + requests: + gpu: "8"
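+# After the pod is Ready, smoke-test through the frontend Service
+# (see ../../README.md#test-the-deployment):
+#   kubectl port-forward svc/dsv4-pro-agg-frontend 8000:8000 -n <namespace>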