vllm-project · loveysuby · Feb 10, 2026 · Apr 6, 2026 · Apr 6, 2026 · Apr 6, 2026
@@ -94,6 +94,10 @@ Note: Pre-built wheels are currently available for vLLM-Omni 0.11.0rc1, 0.12.0rc
 
 ### Build your own docker image
 
+=== "NVIDIA CUDA"
+
+    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-docker"
+
 === "AMD ROCm"
 
     --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-docker"

@@ -110,3 +110,53 @@ docker run --runtime nvidia --gpus 2 \
     The CUDA image does not define a default entrypoint, so include `vllm serve ... --omni` after the image name.
 
 # --8<-- [end:pre-built-images]
+
+# --8<-- [start:build-docker]
+
+#### Build docker image
+
+```bash
+DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.cuda -t vllm-omni-cuda .
+```
+
+To build against a specific base vLLM image, pass the `BASE_IMAGE` build argument:
+
+```bash
+DOCKER_BUILDKIT=1 docker build \
+  -f docker/Dockerfile.cuda \
+  --build-arg BASE_IMAGE=vllm/vllm-openai:v0.21.0 \
+  -t vllm-omni-cuda .
+```
+
+#### Launch the docker image
+
+##### Launch with OpenAI API Server
+
+!!! note
+    The model `Qwen/Qwen3-Omni-30B-A3B-Instruct` requires significant GPU memory. The example below has been verified on 2 x H100 GPUs.
+
+```bash
+docker run --runtime nvidia --gpus 2 \
+  -v ${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface \
+  --env "HF_TOKEN=$HF_TOKEN" \
+  -p 8091:8091 \
+  --ipc=host \
+  vllm-omni-cuda \
+  vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
+```
+
+By default, this mounts the host's `$HOME/.cache/huggingface` into the container as the model cache directory. To use a different host location, set the `HF_HOME` environment variable before running the command (e.g., `export HF_HOME=/data/models`).
+
+##### Launch with interactive session for development
+
+```bash
+docker run --runtime nvidia --gpus all -it --rm \
+  -v ${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface \
+  --env "HF_TOKEN=$HF_TOKEN" \
+  -p 8091:8091 \
+  --ipc=host \
+  --entrypoint bash \
+  vllm-omni-cuda
+```
+
+# --8<-- [end:build-docker]
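
The `-v ${HF_HOME:-$HOME/.cache/huggingface}:/root/.cache/huggingface` mount in the `docker run` commands above relies on shell parameter expansion to pick the host cache directory. A minimal sketch of that resolution follows; `resolve_hf_cache` is a hypothetical helper introduced only for illustration and is not part of vLLM-Omni or this PR.

```shell
# Hypothetical helper (an assumption for illustration, not part of vLLM-Omni):
# prints the host directory that the docker run commands would mount.
resolve_hf_cache() {
  # ${VAR:-default} falls back to the default when HF_HOME is unset or empty.
  printf '%s\n' "${HF_HOME:-$HOME/.cache/huggingface}"
}

# Show which host directory would be mounted in the current shell.
resolve_hf_cache
```

Running this before `docker run` is one way to confirm which host path ends up at `/root/.cache/huggingface` inside the container, for example after `export HF_HOME=/data/models`.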