feat: add experimental DeepSeek-V4-Flash + V4-Pro vLLM agg recipes #8668
Merged
16 commits:

- `bca0274` feat: add vllm recipe for DeepSeek-V4-Flash (biswapanda)
- `5652531` wip (biswapanda)
- `26f63e1` Apply suggestion from @dmitry-tokarev-nv (dmitry-tokarev-nv)
- `e3f28d4` recipes: DeepSeek-V4-Flash vLLM agg recipe (biswapanda)
- `b1b92d9` recipes(deepseek-v4-flash): move Dockerfile.dsv4 into recipe dir (biswapanda)
- `63ccf99` recipes(deepseek-v4-flash): add container/ subdir with Dockerfile + b… (biswapanda)
- `c4def86` recipes(deepseek-v4-flash): use public vllm-runtime image as Dockerfi… (biswapanda)
- `3761a07` recipes(deepseek-v4-flash): build overlay from locally-built runtime,… (biswapanda)
- `4b0205e` recipes(deepseek-v4-flash): correct model size and right-size model-c… (biswapanda)
- `ed842e1` formatting (dmitry-tokarev-nv)
- `db86e01` Merge branch 'main' into bis/vllm-test-1 (dmitry-tokarev-nv)
- `e7f294c` Debian packages upgrades (dmitry-tokarev-nv)
- `fbb9e96` recipes: add experimental DeepSeek-V4-Pro vLLM agg recipe (biswapanda)
- `5e72ad6` recipes: add experimental DeepSeek-V4-Pro vLLM agg recipe (biswapanda)
- `98d8494` address coderabbit comment (biswapanda)
- `b7873df` fixed etcd path (dmitry-tokarev-nv)
# DeepSeek-V4-Flash Recipe

Aggregated-serving recipe for **DeepSeek-V4-Flash** on vLLM with Dynamo.

| Variant | Model | Status | Modality | Manifest | GPUs |
|---------|-------|--------|----------|----------|------|
| **vllm-agg** | `deepseek-ai/DeepSeek-V4-Flash` | Experimental | Text only | [`vllm/agg/vllm-dgd.yaml`](vllm/agg/vllm-dgd.yaml) | 4x B200 |

Aggregated, single-replica: one decode pod running DP=4 with Expert Parallel on 4 B200 GPUs (TP=1). Tested on 4 of the 8 GPUs on a B200 node.

## Prerequisites

1. **Dynamo Platform installed** — see the [Kubernetes Deployment Guide](../../docs/kubernetes/README.md).
2. **GPU cluster** with at least 4 B200 GPUs available on one node.
3. **HuggingFace token** with access to `deepseek-ai/DeepSeek-V4-Flash`.
4. **Dynamo + vLLM image with the DeepSeek-V4 stack.** DeepSeek-V4-Flash is not in a stock vLLM release yet, so the image is built in two steps:

   1. Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../container/README.md); this produces the local tag `dynamo:latest-vllm-runtime`.
   2. Build the DeepSeek-V4-Flash overlay on top of it using [`container/Dockerfile.dsv4`](container/Dockerfile.dsv4). See [`container/README.md`](container/README.md) for build args and troubleshooting. From the repo root:

      ```bash
      docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
        -t <your-registry>/vllm-dsv4:<tag> .
      ```

   Then set the `image:` fields in `vllm/agg/vllm-dgd.yaml` (both the frontend and decode workers) to `<your-registry>/vllm-dsv4:<tag>`.

## Quick Start

```bash
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download the model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to an RWX class in your cluster.
# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160GB on disk (46 safetensors shards,
# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s

# Update the `image:` fields in vllm/agg/vllm-dgd.yaml to your Dynamo + vLLM build.

# Deploy
kubectl apply -f vllm/agg/vllm-dgd.yaml -n ${NAMESPACE}

# First launch of the decode worker takes up to ~60 minutes (weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
  -n ${NAMESPACE} --timeout=3600s
```

## Test the Deployment

```bash
kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

## Recipe Details

The worker command lives in `vllm/agg/vllm-dgd.yaml`. Key flags and why they're there:

| Flag | Purpose |
|------|---------|
| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
| `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | DP=4 + EP across the 4 GPUs (TP=1) |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Single-node DEP compilation config from the upstream recipe |
| `--max-num-seqs 256` | Concurrency cap |
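Assembled into one command line, the flags in the table combine roughly as below. This is a sketch for reading, not a substitute for the manifest: the authoritative command (including the model path and any flags omitted here) is the one in `vllm/agg/vllm-dgd.yaml`.

```shell
# Sketch only: mirrors the flag table above; flags not listed there are omitted.
ARGS=(
  --tokenizer-mode deepseek_v4
  --dyn-reasoning-parser deepseek_v4
  --dyn-tool-call-parser deepseek_v4
  --attention-config '{"use_fp4_indexer_cache":true}'
  --kv-cache-dtype fp8
  --block-size 256
  --tensor-parallel-size 1
  --data-parallel-size 4
  --enable-expert-parallel
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'
  --max-num-seqs 256
)
# Print the assembled worker invocation instead of running it.
echo "python3 -m dynamo.vllm ${ARGS[*]}"
```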
## Model Details

| | |
|---|---|
| **Model** | `deepseek-ai/DeepSeek-V4-Flash` (MoE, 284B total / 13B active) |
| **Checkpoint** | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
| **Backend** | vLLM with the DeepSeek-V4 stack (`vllm/vllm-openai:deepseekv4-cu130`) |
| **Parallelism** | TP=1, DP=4, Expert Parallel enabled |
| **KV cache** | FP8, block size 256 |
| **Attention** | Hybrid CSA + HCA with Blackwell FP4 indexer cache |

## Verifying Reasoning

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | python3 -m json.tool
```

Expected:

- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
- `choices[0].message.content` contains only the final answer.
- No raw `</think>` tags in either field.

If `reasoning_content` is `null` and `</think>` appears in `content`, the reasoning parser isn't wired up — confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
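The expectations above can also be checked mechanically. A small sketch, using an illustrative response body (the shape described above, not captured output from a live deployment):

```python
import json

# Illustrative response matching the expected shape above (not real output).
response = json.loads("""
{
  "choices": [{
    "message": {
      "reasoning_content": "The user asks for 2+2, which is 4.",
      "content": "4"
    }
  }]
}
""")

msg = response["choices"][0]["message"]
# Reasoning must be extracted into its own field...
assert msg.get("reasoning_content"), "reasoning parser not wired up?"
# ...and no raw think tags may leak into the final answer.
assert "</think>" not in (msg.get("content") or ""), "raw </think> in content"
print("reasoning_content present, content clean")
```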
## Verifying Tool Calling

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 300
  }' | python3 -m json.tool
```

Expected:

- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
- `choices[0].finish_reason` is `"tool_calls"`.
- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.

If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
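The same structural checks in script form, against an illustrative response body (again the expected shape, not captured output). Note that per the OpenAI schema, `function.arguments` is a JSON-encoded string, so it needs a second decode:

```python
import json

# Illustrative response for a successful structured tool call (not real output).
response = json.loads("""
{
  "choices": [{
    "finish_reason": "tool_calls",
    "message": {
      "content": null,
      "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": "{\\"location\\": \\"San Francisco\\"}"}
      }]
    }
  }]
}
""")

choice = response["choices"][0]
calls = choice["message"].get("tool_calls") or []
assert choice["finish_reason"] == "tool_calls"
assert calls and calls[0]["function"]["name"] == "get_weather"
# arguments is a JSON string, not an object: decode it separately.
args = json.loads(calls[0]["function"]["arguments"])
print(args["location"])  # -> San Francisco
```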
## Notes

- **Storage class.** Update `storageClassName` in `model-cache/model-cache.yaml` to an RWX class that can serve the PVC to both frontend and worker pods.
- **Model size.** `deepseek-ai/DeepSeek-V4-Flash` is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision.
- **Image tag.** The manifest ships with `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`. Replace it with your built Dynamo + vLLM (DeepSeek-V4) image — see Prerequisite 4.
- **First launch is slow.** The decode worker loads weights and warms CUDA graphs; the startup probe allows up to ~60 min (`failureThreshold: 360` at `periodSeconds: 10`) and `VLLM_ENGINE_READY_TIMEOUT_S=3600` is set to match.
- **Parser flags.** Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). vLLM's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
- **DP stability.** `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` and `VLLM_SKIP_P2P_CHECK=1` mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs on Blackwell.
- **Offline model cache.** The worker runs with `HF_HUB_OFFLINE=1` so vLLM reads the cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
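Two of the sizes quoted in the notes can be sanity-checked with quick arithmetic (the 160 GB figure and probe settings are taken from this README; the Gi-vs-GB conversion is the standard binary/decimal one):

```python
# PVC headroom: 400Gi is binary gibibytes; the ~160 GB on-disk size is decimal.
pvc_bytes = 400 * 2**30
model_bytes = 160 * 10**9
headroom_gb = (pvc_bytes - model_bytes) / 10**9
print(f"PVC = {pvc_bytes / 10**9:.1f} GB decimal, headroom = {headroom_gb:.0f} GB")

# Startup probe budget: failureThreshold * periodSeconds.
probe_budget_s = 360 * 10
assert probe_budget_s == 3600  # matches VLLM_ENGINE_READY_TIMEOUT_S
print(f"startup probe allows {probe_budget_s // 60} min")
```

So the 400Gi PVC really does leave well over 250 GB of decimal-gigabyte headroom, and the probe budget lines up exactly with the engine-ready timeout.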
# `container/Dockerfile.dsv4`

```dockerfile
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Dynamo vLLM runtime overlaid on the official DeepSeek-V4 vLLM image.
#
# Base: vllm/vllm-openai:deepseekv4-cu130 — ships vLLM from PR #40760
# (zyongye/vllm:dsv4) with the DeepSeek-V4 kernels, tokenizer_mode, tool+reasoning
# parsers, hybrid CSA+HCA attention, MTP speculative decoding, and FP4 indexer.
#
# We take pre-built dynamo artifacts (wheels, nats, etcd, NIXL, UCX, dynamo.vllm
# python worker) from a locally-built Dynamo vLLM runtime image (produced via
# <repo_root>/container/README.md) and layer them on top of the dsv4 vLLM image
# without touching the vLLM install.
#
# Build (run from the repo root):
#   docker build -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
#     -t <your-registry>/vllm-dsv4:<tag> .
#
# See recipes/deepseek-v4-flash/container/README.md for build args and
# troubleshooting.
#
# Both base images must be Python 3.12 (verified).

# Default to the local tag produced by `container/render.py --framework vllm
# --target runtime` + `docker build -t dynamo:latest-vllm-runtime ...`. Override
# with --build-arg DYNAMO_SRC_IMAGE=... to use a published release tag instead.
ARG DYNAMO_SRC_IMAGE=dynamo:latest-vllm-runtime
ARG DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu130

FROM ${DYNAMO_SRC_IMAGE} AS dynamo_src

FROM ${DSV4_BASE_IMAGE}

ENV DEBIAN_FRONTEND=noninteractive

# Runtime deps dynamo needs that aren't in the vLLM image (etcd/nats are static
# binaries we COPY; libibverbs/rdma-core are needed for NIXL's UCX transport).
RUN apt-get update && apt-get install -y --no-install-recommends \
        libibverbs1 rdma-core ibverbs-utils libibumad3 \
        libnuma1 librdmacm1 ibverbs-providers \
        ca-certificates jq curl \
    && apt list --upgradable 2>/dev/null | tail -n +2 | grep 'jammy-' | awk -F/ '{print $1}' | xargs -r apt-get install -y --only-upgrade \
    && rm -rf /var/lib/apt/lists/*

# --- patch vLLM: drop unsupported topk=1024 from sparse attn indexer ---
# from https://github.com/vllm-project/vllm/pull/40760/changes/3602f14f0e146b234be911d916e381b4e6a4dc0c
# TODO: remove once https://github.com/vllm-project/vllm/pull/40760 lands in the base image.
RUN sed -i 's/(512, 1024, 2048)/(512, 2048)/' \
    /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py

# --- static binaries ---
COPY --from=dynamo_src /usr/bin/nats-server /usr/bin/nats-server
COPY --from=dynamo_src /usr/local/bin/etcd /usr/local/bin/etcd
ENV PATH=/usr/local/bin/etcd:${PATH}

# --- UCX ---
COPY --from=dynamo_src /usr/local/ucx /usr/local/ucx
ENV PATH=/usr/local/ucx/bin:${PATH}

# --- NIXL (C++ libs for KV transfer) ---
COPY --from=dynamo_src /opt/nvidia/nvda_nixl /opt/nvidia/nvda_nixl
ENV NIXL_PREFIX=/opt/nvidia/nvda_nixl \
    NIXL_LIB_DIR=/opt/nvidia/nvda_nixl/lib64 \
    NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib64/plugins
ENV LD_LIBRARY_PATH=${NIXL_LIB_DIR}:${NIXL_PLUGIN_DIR}:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx:${LD_LIBRARY_PATH}

# --- install dynamo python wheels into the dsv4 image's system python ---
# The dsv4 image uses system python3.12 with pip at /usr/local/lib/python3.12/dist-packages.
# ai_dynamo_runtime is abi3 (cp310+), compatible with cp312.
COPY --from=dynamo_src /opt/dynamo/wheelhouse /opt/dynamo/wheelhouse
RUN pip install --no-cache-dir \
    /opt/dynamo/wheelhouse/ai_dynamo_runtime*.whl \
    /opt/dynamo/wheelhouse/ai_dynamo*any.whl \
    /opt/dynamo/wheelhouse/nixl/nixl*.whl

# --- dynamo python source (dynamo.vllm worker + common + mocker) ---
# Bring the worker entrypoint tree so `python -m dynamo.vllm` resolves.
COPY --from=dynamo_src /workspace/components/src/dynamo /workspace/components/src/dynamo
ENV PYTHONPATH=/workspace/components/src:${PYTHONPATH:-}

WORKDIR /workspace

# --- dynamo runtime env tweaks ---
# Keep vLLM's flashinfer sampler (enabled by default in 0.20+ but explicit here).
ENV VLLM_USE_FLASHINFER_SAMPLER=1

# Default to bash so the Dynamo CRD operator can exec `python3 -m dynamo.vllm`
# via the manifest command/args rather than the vLLM api_server entrypoint.
ENTRYPOINT []
CMD ["bash"]
```
# DeepSeek-V4-Flash Reference Container

DeepSeek-V4-Flash is not in a stock vLLM release yet, so the recipe ships with its own reference Dockerfile that overlays the Dynamo runtime on top of the upstream dsv4 vLLM image.

- **Base:** [`vllm/vllm-openai:deepseekv4-cu130`](https://hub.docker.com/r/vllm/vllm-openai/tags) — vLLM from PR [#40760](https://github.com/vllm-project/vllm/pull/40760) (`zyongye/vllm:dsv4`) with the DeepSeek-V4 kernels, `tokenizer_mode`, tool + reasoning parsers, hybrid CSA + HCA attention, MTP speculative decoding, and the FP4 indexer.
- **Overlay:** pre-built Dynamo artifacts (wheels, static `nats`/`etcd` binaries, NIXL, UCX, the `dynamo.vllm` Python worker) copied from a locally-built Dynamo vLLM runtime image.

Both layers use Python 3.12; no vLLM reinstall is performed.

## Build flow

Two Docker images are involved:

1. **Dynamo vLLM runtime** — built from this repo using the instructions in [`<repo_root>/container/README.md`](../../../container/README.md). This image contains the Dynamo Rust runtime, wheels, and the `dynamo.vllm` worker.
2. **DeepSeek-V4-Flash overlay** — built here, using the image from step 1 as the source stage (`DYNAMO_SRC_IMAGE`) and the upstream dsv4 vLLM image as the final base (`DSV4_BASE_IMAGE`).

## Step 1 — Build the Dynamo vLLM runtime

From the **repo root**, render and build the runtime image per [`container/README.md`](../../../container/README.md):

```bash
# From <repo_root>
container/render.py --framework vllm --target runtime --output-short-filename
docker build -t dynamo:latest-vllm-runtime -f container/rendered.Dockerfile .
```

This produces the local tag `dynamo:latest-vllm-runtime`, which is what Step 2 expects by default.

## Step 2 — Build the DeepSeek-V4-Flash overlay

Still from the **repo root**:

```bash
docker build \
  -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
  -t <your-registry>/vllm-dsv4:<tag> \
  .
```

The Dockerfile takes no files from the build context (everything comes from `FROM` / `COPY --from=`), so any context directory works — using the repo root keeps the `-f` path straightforward.

### Build args

Both can be overridden with `--build-arg`:

| Arg | Default | Purpose |
|-----|---------|---------|
| `DYNAMO_SRC_IMAGE` | `dynamo:latest-vllm-runtime` | Source image for the Dynamo overlay. The default matches the tag produced by Step 1. Override with a pinned released tag (e.g. `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2`) for reproducible builds without rebuilding locally. |
| `DSV4_BASE_IMAGE` | `vllm/vllm-openai:deepseekv4-cu130` | The dsv4 vLLM base. The `cu129` tag is also available for CUDA 12.9 hosts. |

Example — pin the overlay source to a released Dynamo tag on a CUDA 12.9 host:

```bash
docker build \
  -f recipes/deepseek-v4-flash/container/Dockerfile.dsv4 \
  --build-arg DYNAMO_SRC_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.2-cuda13 \
  --build-arg DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129 \
  -t <your-registry>/vllm-dsv4:<tag> \
  .
```

## Push

```bash
docker push <your-registry>/vllm-dsv4:<tag>
```

## Wire into the recipe

Once the image is pushed, update the `image:` fields in [`../vllm/agg/vllm-dgd.yaml`](../vllm/agg/vllm-dgd.yaml) (both the Frontend and the `VllmDecodeWorker`) to point at `<your-registry>/vllm-dsv4:<tag>`, then follow the recipe's [Quick Start](../README.md#quick-start) to deploy.

## What the Dockerfile does

1. Installs the RDMA / UCX runtime deps on top of the dsv4 vLLM image (`libibverbs1`, `rdma-core`, `ibverbs-utils`, `libibumad3`, `libnuma1`, `librdmacm1`, `ibverbs-providers`, plus `ca-certificates`, `jq`, `curl`).
2. Applies a small upstream vLLM patch to the sparse attention indexer (drops the unsupported `topk=1024`). Remove once [vLLM PR #40760](https://github.com/vllm-project/vllm/pull/40760) lands in the base image.
3. Copies the static `nats-server` and `etcd` binaries from the Dynamo runtime image.
4. Copies UCX into `/usr/local/ucx` and NIXL into `/opt/nvidia/nvda_nixl`, with `LD_LIBRARY_PATH` set so NIXL's plugins resolve at runtime.
5. Installs the Dynamo Python wheels (`ai_dynamo_runtime`, `ai_dynamo`, NIXL Python bindings) into the dsv4 image's system Python 3.12.
6. Copies the `dynamo` Python package tree into `/workspace/components/src/dynamo` and puts it on `PYTHONPATH` so `python3 -m dynamo.vllm` resolves.
7. Keeps vLLM's FlashInfer sampler enabled (`VLLM_USE_FLASHINFER_SAMPLER=1`) and clears `ENTRYPOINT` so the Dynamo CRD operator's `command` / `args` take effect.

## Troubleshooting

- **`pull access denied for dynamo:latest-vllm-runtime`** — Step 1 has not been run (or produced a different tag). Build the Dynamo vLLM runtime image locally per [`<repo_root>/container/README.md`](../../../container/README.md), or override `--build-arg DYNAMO_SRC_IMAGE=<your-image>`.
- **`no matching manifest for linux/amd64`** — the dsv4 base is amd64-only today; build on an x86_64 host.
- **CUDA version mismatch on the host** — use `DSV4_BASE_IMAGE=vllm/vllm-openai:deepseekv4-cu129` if your node is still on CUDA 12.9.
- **NIXL plugins not found at runtime** — confirm `LD_LIBRARY_PATH` includes `/opt/nvidia/nvda_nixl/lib64/plugins` (set in the Dockerfile; don't unset it in the pod spec).
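For the last item, one way to inspect the library path is shown below. Inside a real worker pod you would skip the assignment and check the inherited environment; here the variable is pre-set to the value the Dockerfile exports so the check itself is runnable as a sketch:

```shell
# In a pod: omit this assignment and inspect the inherited LD_LIBRARY_PATH.
LD_LIBRARY_PATH="/opt/nvidia/nvda_nixl/lib64:/opt/nvidia/nvda_nixl/lib64/plugins:/usr/local/ucx/lib:/usr/local/ucx/lib/ucx"
# Split on ':' and require an exact-match entry for the plugin dir.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -x "/opt/nvidia/nvda_nixl/lib64/plugins"
```

A nonzero exit status from the `grep` means the plugin directory was dropped, typically by an `env:` override in the pod spec.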
# `model-cache/model-cache.yaml`

```yaml
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 400Gi
  storageClassName: "your-storage-class-name"
```