diff --git a/configs/install-nixl-dsv4.sh b/configs/install-nixl-dsv4.sh
new file mode 100644
index 00000000..25d992ed
--- /dev/null
+++ b/configs/install-nixl-dsv4.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+# Install SGLang with the DeepSeek-V4 NIXL state-buffer transport fix.
+# Remove this once https://github.com/sgl-project/sglang/pull/23773 is merged upstream.
+
+set -euo pipefail
+
+SGLANG_DIR="${SGLANG_DIR:-/sgl-workspace/sglang}"
+SGLANG_REMOTE="${SGLANG_REMOTE:-https://github.com/sgl-project/sglang.git}"
+SGLANG_PR_NUMBER="${SGLANG_PR_NUMBER:-23773}"
+SGLANG_PR_REF="refs/pull/${SGLANG_PR_NUMBER}/head"
+SGLANG_LOCAL_BRANCH="${SGLANG_LOCAL_BRANCH:-nixl-dsv4-pr-${SGLANG_PR_NUMBER}}"
+
+echo "=== Installing SGLang NIXL DSV4 fix from PR #${SGLANG_PR_NUMBER} ==="
+
+if command -v flock >/dev/null 2>&1; then
+  mkdir -p /tmp/srt-slurm-locks
+  exec 9>/tmp/srt-slurm-locks/install-nixl-dsv4.lock
+  flock 9
+fi
+
+mkdir -p "$(dirname "$SGLANG_DIR")"
+
+if [ ! -d "$SGLANG_DIR/.git" ]; then
+  echo "Recreating $SGLANG_DIR from $SGLANG_REMOTE"
+  rm -rf "$SGLANG_DIR"
+  git clone --depth 1 "$SGLANG_REMOTE" "$SGLANG_DIR"
+fi
+
+cd "$SGLANG_DIR"
+
+git config --global --add safe.directory "$SGLANG_DIR" || true
+
+if git remote get-url origin >/dev/null 2>&1; then
+  git remote set-url origin "$SGLANG_REMOTE"
+else
+  git remote add origin "$SGLANG_REMOTE"
+fi
+
+git fetch --depth 1 origin "$SGLANG_PR_REF"
+git checkout -f -B "$SGLANG_LOCAL_BRANCH" FETCH_HEAD
+
+INSTALLED_COMMIT="$(git rev-parse HEAD)"
+echo "Checked out SGLang PR #${SGLANG_PR_NUMBER} at ${INSTALLED_COMMIT}"
+
+NIXL_CONN="python/sglang/srt/disaggregation/nixl/conn.py"
+if ! grep -q "send_state" "$NIXL_CONN" || ! grep -q "state_data_ptrs" "$NIXL_CONN"; then
+  echo "ERROR: expected NIXL state-buffer transport changes were not found in $NIXL_CONN" >&2
+  exit 1
+fi
+
+echo "=== SGLang NIXL DSV4 fix installed ==="
diff --git a/recipes/dsv4-pro/README.md b/recipes/dsv4-pro/README.md
new file mode 100644
index 00000000..e535ad4c
--- /dev/null
+++ b/recipes/dsv4-pro/README.md
@@ -0,0 +1,111 @@
+# DeepSeek-V4-Pro (1.6T MoE, MXFP4) — 1k/1k on GB300
+
+This directory contains NVIDIA-verified SGLang recipes for **DeepSeek-V4-Pro**
+(1.6T-parameter MoE with MXFP4 MoE weights + FP8 KV, UE8M0 scales) on **GB300**
+(ARM64 Grace + Blackwell, 4 GPUs per node), 1024 input / 1024 output workload.
+Both **aggregated** (a single SGLang server on one or two nodes) and
+**disaggregated** (dynamo + NIXL, 1P1D up to 2P2D) serving modes are covered.
+
+## Container
+
+All recipes reference the `dsv4-grace-blackwell` alias defined in
+`srtslurm.yaml.example`. Pull + convert:
+
+```bash
+enroot import --output sglang-deepseek-v4-grace-blackwell.sqsh \
+  docker://lmsysorg/sglang:deepseek-v4-grace-blackwell
+```
+
+(Use the `deepseek-v4-blackwell` image for B200 x86_64, or `deepseek-v4-hopper` for H200.)
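The recipes then resolve the `dsv4-grace-blackwell` alias through your `srtslurm.yaml`. A minimal sketch of what that alias entry might look like, assuming the layout of `srtslurm.yaml.example` and an illustrative storage path (both are assumptions, check the example file for the authoritative key names):

```yaml
# Hypothetical srtslurm.yaml fragment; the .sqsh path is an example,
# not a required location.
containers:
  dsv4-grace-blackwell: /shared/containers/sglang-deepseek-v4-grace-blackwell.sqsh
```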
+
+## Model checkpoint
+
+```bash
+hf download deepseek-ai/DeepSeek-V4-Pro --local-dir /shared/models/deepseek/DeepSeek-V4-Pro
+```
+
+## Recipes
+
+### Aggregated (single SGLang server)
+
+| file | parallelism | MTP | target | notes |
+|---|---|---|---|---|
+| `agg-low-latency.yaml` | TP=4 | EAGLE 3/4 | minimum TPOT / best per-user latency | GB300 1 node |
+| `agg-nomtp.yaml` | TP=4 | — | baseline throughput, no spec decoding | GB300 1 node |
+| `agg-balanced-tep.yaml` | TP=4 + DP=4 + DP-attn + DeepEP | EAGLE 1/2 | Pareto mid-curve | GB300 1 node |
+| `agg-max-tpt-tep.yaml` | TP=4 + DP=4 + DP-attn + DeepEP | — | maximum TPS/GPU | GB300 1 node |
+| `agg-2n-low-latency.yaml` | TP=8 | EAGLE 3/4 | low-latency, 2× memory headroom | GB300 2 nodes |
+| `agg-2n-nomtp.yaml` | TP=8 | — | throughput, 2× memory headroom | GB300 2 nodes |
+
+### Disaggregated (dynamo frontend, NIXL KV transfer)
+
+> ⚠️ **Required SGLang patch (upstreaming in flight).** All disagg
+> recipes below depend on a fix to `python/sglang/srt/disaggregation/nixl/conn.py`
+> that registers and transfers the model's auxiliary state buffers
+> (SWA / NSA / Mamba) alongside the KV cache. Without this patch the NIXL
+> backend silently drops the state buffer, causing decode-side accuracy
+> to collapse on DSv4-Pro (GSM8K ≈ 0.13 vs 1.00 with the patch) even
+> though throughput numbers look healthy. The fix mirrors what the
+> Mooncake backend already does; an upstream sglang PR is being prepared
+> separately. Until it lands, point your `dsv4-grace-blackwell` container
+> at a build with the patch applied (mounting the patched
+> `python/sglang/srt/disaggregation/nixl/` over the container path is
+> sufficient). The recipes themselves intentionally do **not** declare
+> any local mounts — pick up the patch via your container build process.
+>
+> Performance numbers in the table further down were measured against a
+> patched build; they should reproduce on any build that includes the
+> equivalent fix.
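The overlay mount described in the warning above can be expressed as a single pyxis-style bind. A minimal sketch, assuming a patched checkout at `/shared/builds/sglang-patched` (an illustrative path, not part of the recipes) and the container's usual `/sgl-workspace/sglang` layout:

```bash
# Build the bind-mount argument that drops a patched NIXL backend over
# the container's copy. Both paths below are illustrative assumptions.
PATCHED_NIXL=/shared/builds/sglang-patched/python/sglang/srt/disaggregation/nixl
CONTAINER_NIXL=/sgl-workspace/sglang/python/sglang/srt/disaggregation/nixl
MOUNT_ARG="${PATCHED_NIXL}:${CONTAINER_NIXL}"
# Pass this to your launcher, e.g. srun --container-mounts="${MOUNT_ARG}" ...
echo "--container-mounts=${MOUNT_ARG}"
```

Only `conn.py` and its siblings in that directory are touched by the patch, so a directory-level bind is enough; no image rebuild is required for quick experiments.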
+
+`XPYD` in the table below denotes **X prefill nodes + Y decode nodes**
+(one SGLang worker per role per node, NOT per-instance counts). Each
+GB300 node has 4 GPUs, so e.g. 2P2D DEP=8 = 16 GPUs total.
+
+| file | topology | parallelism | MoE backend | target | notes |
+|---|---|---|---|---|---|
+| `disagg-1p1d-tp4-mxfp4.yaml` | 1P+1D (2 nodes / 8 GPU) | both TP=4 | flashinfer_mxfp4 | low-latency, low/medium concurrency | TP-only baseline |
+| `disagg-1p1d-dep4-mega-moe.yaml` | 1P+1D (2 nodes / 8 GPU) | both TP=4 + DP=4 + DeepEP | mega_moe (DeepGEMM) | DEP throughput Pareto reference | TEP topology, mirrors `agg-max-tpt-tep.yaml` split across 2 nodes |
+| `disagg-1p2d-dep4-to-dep8-mega-moe.yaml` | 1P+2D (3 nodes / 12 GPU) | P: TP=4+DP=4; D: TP=8+DP=8 + DeepEP | mega_moe (DeepGEMM) | **best per-GPU efficiency** for decode-heavy 1k/1k | asymmetric — decode EP domain doubled |
+| `disagg-2p2d-dep8-mega-moe.yaml` | 2P+2D (4 nodes / 16 GPU) | both TP=8 + DP=8 + DeepEP | mega_moe (DeepGEMM) | largest DEP throughput config | symmetric counterpart to the 1P2D recipe |
+| `disagg-2p2d-tp8-mxfp4.yaml` | 2P+2D (4 nodes / 16 GPU) | both TP=8 | flashinfer_mxfp4 | TP-only 4-node baseline | quantifies the DEP+DeepEP uplift on GB300 |
+
+Multi-node decode recipes intentionally do NOT set
+`SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2`: CAR_V2 is single-node only and
+silently corrupts results when used across nodes.
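The node/GPU arithmetic above can be sketched as a one-liner (names are ours, for illustration only):

```python
def xpyd_gpus(prefill_nodes: int, decode_nodes: int, gpus_per_node: int = 4) -> int:
    """Total GPUs for an XPYD layout with one worker per role per node."""
    return (prefill_nodes + decode_nodes) * gpus_per_node

print(xpyd_gpus(2, 2))  # 2P2D on GB300 -> 16
print(xpyd_gpus(1, 2))  # 1P2D on GB300 -> 12
```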
+
+#### Verified throughput (sa-bench, isl=osl=1024, random_range_ratio=0.8)
+
+Peak Total TPS / GPU at the saturation point of each curve (lower-conc
+points trade throughput for latency; full Pareto curves available on
+request):
+
+| recipe | GPUs | peak conc | Output TPS | Total TPS / GPU | Mean TTFT | Mean TPOT |
+|---|---:|---:|---:|---:|---:|---:|
+| `disagg-1p1d-tp4-mxfp4.yaml` | 8 | 128 | 3,349 | 838 | 1.05 s | 36.1 ms |
+| `disagg-1p1d-dep4-mega-moe.yaml` | 8 | 128 | 3,293 | 824 | 0.88 s | 36.8 ms |
+| `disagg-2p2d-tp8-mxfp4.yaml` | 16 | 512 | 6,863 | 857 | 2.26 s | 70.2 ms |
+| `disagg-2p2d-dep8-mega-moe.yaml` | 16 | 2,048 | 32,840 | 4,104 | 2.12 s | 58.2 ms |
+| `disagg-1p2d-dep4-to-dep8-mega-moe.yaml` | 12 | 2,048 | 33,442 | **5,572** | 4.26 s | 53.8 ms |
+
+Headline: the asymmetric 1P2D DEP4→DEP8 config delivers the highest
+**per-GPU** total throughput: at 1k/1k the workload is decode-heavy, so
+doubling the decode EP domain (4 → 8 GPUs, halving the experts resident
+on each GPU) buys far more than scaling prefill would.
+
+## Key flags (derived from the SGLang DSv4 cookbook)
+
+- `moe-runner-backend: flashinfer_mxfp4` — MXFP4 MoE kernels (Blackwell only).
+- `chunked-prefill-size: 4096` + `disable-flashinfer-autotune: true` — cookbook recipe.
+- `disable-radix-cache: true` — synthetic benchmark best practice; also
+  reduces contiguous-allocator fragmentation at weight-reorder time.
+- `mem-fraction-static: 0.78` (agg recipes) — leaves headroom for the MXFP4
+  `reorder_w1w3_to_w3w1` path (0.82 intermittently OOMs on GB300).
+- TEP recipes: `enable-dp-attention + moe-a2a-backend: deepep` plus
+  `deepep-config num_sms=96` (DeepEP `DEEPEP_LARGE_SMS_FLAG` for single-node
+  Blackwell per cookbook).
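The Total TPS / GPU column in the verified-throughput table above can be cross-checked from Output TPS: with isl == osl, total token throughput (input + output) is roughly twice the output throughput. This relationship is our reading of the sa-bench metric, so treat the sketch as an approximation rather than the benchmark's exact formula:

```python
def total_tps_per_gpu(output_tps: float, gpus: int, isl: int = 1024, osl: int = 1024) -> float:
    # Each request moves isl + osl tokens but only osl are output tokens,
    # so scale output TPS by (isl + osl) / osl, then normalize per GPU.
    return output_tps * (isl + osl) / osl / gpus

# 1P2D row: 33,442 output TPS on 12 GPUs -> ~5574, close to the
# measured 5,572.
print(round(total_tps_per_gpu(33_442, 12)))
```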
+ +## References + +- [SGLang cookbook: `docs/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx`](https://github.com/sgl-project/sglang/blob/main/docs/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx) +- [DeepSeek-V4-Pro model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) +- Upstream SGLang PR: sgl-project/sglang#23600 diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-2n-low-latency.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-2n-low-latency.yaml similarity index 100% rename from recipes/gb300-fp4/1k1k-dsv4/agg-2n-low-latency.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-2n-low-latency.yaml diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-balanced-tep.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-balanced-tep.yaml similarity index 100% rename from recipes/gb300-fp4/1k1k-dsv4/agg-balanced-tep.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-balanced-tep.yaml diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-low-latency-chat.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-low-latency-chat.yaml similarity index 100% rename from recipes/gb300-fp4/1k1k-dsv4/agg-low-latency-chat.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-low-latency-chat.yaml diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-low-latency.yaml similarity index 100% rename from recipes/gb300-fp4/1k1k-dsv4/agg-low-latency.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/mtp/agg-low-latency.yaml diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-2n-nomtp.yaml similarity index 94% rename from recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-2n-nomtp.yaml index 0b19b8ae..e171be06 100644 --- a/recipes/gb300-fp4/1k1k-dsv4/agg-2n-nomtp.yaml +++ b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-2n-nomtp.yaml @@ -1,4 +1,4 @@ -# 
DeepSeek-V4-Pro aggregated on GB300 2 nodes (TP=8) - MTP enabled +# DeepSeek-V4-Pro aggregated on GB300 2 nodes (TP=8) name: "dsv4-pro-gb300-2n-agg-nomtp-1k1k-official" slurm: diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-max-tpt-tep.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-max-tpt-tep.yaml similarity index 100% rename from recipes/gb300-fp4/1k1k-dsv4/agg-max-tpt-tep.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-max-tpt-tep.yaml diff --git a/recipes/gb300-fp4/1k1k-dsv4/agg-nomtp.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-nomtp.yaml similarity index 100% rename from recipes/gb300-fp4/1k1k-dsv4/agg-nomtp.yaml rename to recipes/dsv4-pro/sglang/gb300-fp4/1k1k/agg/stp/agg-nomtp.yaml diff --git a/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p1d-dep4-mega-moe.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p1d-dep4-mega-moe.yaml new file mode 100644 index 00000000..00ea8c23 --- /dev/null +++ b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p1d-dep4-mega-moe.yaml @@ -0,0 +1,126 @@ +# DeepSeek-V4-Pro disaggregated on GB300 (1P1D, DEP=4, mega_moe) — dynamo frontend. +# +# 1 prefill node + 1 decode node, each TP=4 + DP=4 + DP-attention + DeepEP +# (the same TEP topology as the agg-balanced-tep / agg-max-tpt-tep recipes +# in this directory, but split across two nodes). Frontend is dynamo with +# the nginx fan-in container; KV transfer over NIXL. +# +# Throughput-oriented disagg counterpart to disagg-1p1d-tp4-mxfp4.yaml: +# DEP4 + mega_moe (DeepEP-based MoE all-to-all) gives roughly the same +# peak throughput at conc=128 as the pure-TP4 disagg recipe, but holds +# that throughput further into the saturation regime because DeepEP keeps +# MoE compute well-balanced. 
+name: "dsv4-pro-gb300-disagg-1p1d-dep4-mega-moe-1k1k" + +slurm: + partition: gb300 + time_limit: "4:00:00" + +frontend: + type: dynamo + nginx_container: nginx + +model: + path: "dsv4-pro" + container: "dsv4-grace-blackwell" + precision: "mxfp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 1 + prefill_workers: 1 + decode_workers: 1 + gpus_per_node: 4 + +backend: + type: sglang + + prefill_environment: + PYTHONUNBUFFERED: "1" + SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1" + SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1" + SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "4096" + SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1" + SGLANG_OPT_FIX_HASH_MEGA_MOE: "1" + SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1" + SGLANG_OPT_USE_FAST_MASK_EP: "1" + SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" + SGLANG_OPT_USE_JIT_NORM: "1" + SGLANG_OPT_USE_TOPK_V2: "1" + SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0" + + decode_environment: + PYTHONUNBUFFERED: "1" + SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1" + SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1" + SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "4096" + SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1" + SGLANG_OPT_FIX_HASH_MEGA_MOE: "1" + SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1" + SGLANG_OPT_USE_FAST_MASK_EP: "1" + SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" + SGLANG_OPT_USE_JIT_NORM: "1" + SGLANG_OPT_USE_TOPK_V2: "1" + SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + model-path: "/model/" + trust-remote-code: true + + # DEP4: TP=4 + DP=4 + DP-attention + DeepEP (mega_moe path) + tensor-parallel-size: 4 + data-parallel-size: 4 + 
enable-dp-attention: true + moe-a2a-backend: "deepep" + deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' + + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + mem-fraction-static: 0.90 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + chunked-prefill-size: 32768 + disable-radix-cache: true + + decode: + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + model-path: "/model/" + trust-remote-code: true + + tensor-parallel-size: 4 + data-parallel-size: 4 + enable-dp-attention: true + moe-a2a-backend: "deepep" + deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' + + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + mem-fraction-static: 0.90 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + chunked-prefill-size: 32768 + disable-radix-cache: true + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + random_range_ratio: 0.8 + concurrencies: "4x8x16x32x64x128x256x512x1024x1536x2048" + req_rate: "inf" + use_chat_template: false diff --git a/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p1d-tp4-mxfp4.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p1d-tp4-mxfp4.yaml new file mode 100644 index 00000000..2e3695a1 --- /dev/null +++ b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p1d-tp4-mxfp4.yaml @@ -0,0 +1,82 @@ +# DeepSeek-V4-Pro disaggregated on GB300 (1P1D, TP=4, MXFP4) — dynamo frontend. +# +# 1 prefill node + 1 decode node, each TP=4 on a single GB300 (4 GPUs). +# Frontend is dynamo with an nginx fan-in container; KV transfer over NIXL. +# Companion to the agg-* recipes in this directory: same model + workload, same +# MXFP4 MoE kernels, but split prefill / decode across two nodes for steady +# decode TPOT under high concurrency. 
+name: "dsv4-pro-gb300-disagg-1p1d-tp4-mxfp4-1k1k" + +slurm: + partition: gb300 + time_limit: "4:00:00" + +frontend: + type: dynamo + # `nginx` resolves to nginx:latest from Docker Hub via enroot — no local alias + # required. dynamo uses it as the fan-in / health-check entrypoint. + nginx_container: nginx + +model: + path: "dsv4-pro" + container: "dsv4-grace-blackwell" + precision: "mxfp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 1 + prefill_workers: 1 + decode_workers: 1 + gpus_per_node: 4 + +backend: + type: sglang + + prefill_environment: + SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" + + decode_environment: + SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" + + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + model-path: "/model/" + trust-remote-code: true + tensor-parallel-size: 4 + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + moe-runner-backend: "flashinfer_mxfp4" + disable-flashinfer-autotune: true + mem-fraction-static: 0.90 + max-running-requests: 128 + cuda-graph-max-bs: 128 + chunked-prefill-size: 8192 + disable-radix-cache: true + + decode: + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + model-path: "/model/" + trust-remote-code: true + tensor-parallel-size: 4 + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + moe-runner-backend: "flashinfer_mxfp4" + disable-flashinfer-autotune: true + mem-fraction-static: 0.90 + max-running-requests: 128 + cuda-graph-max-bs: 128 + chunked-prefill-size: 8192 + disable-radix-cache: true + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + random_range_ratio: 0.8 + # Low-latency band only — TP4 1P1D saturates near conc=128 on GB300; + # for high-concurrency Pareto use the DEP variants. 
+ concurrencies: "4x8x16x32x64x128" + req_rate: "inf" + use_chat_template: false diff --git a/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p2d-dep4-to-dep8-mega-moe.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p2d-dep4-to-dep8-mega-moe.yaml new file mode 100644 index 00000000..b8c18a31 --- /dev/null +++ b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/disagg-1p2d-dep4-to-dep8-mega-moe.yaml @@ -0,0 +1,135 @@ +# DeepSeek-V4-Pro disaggregated on GB300 (1P2D, asymmetric DEP4 -> DEP8) — dynamo frontend. +# +# 1 prefill node (TP=4 + DP=4 + DP-attention + DeepEP) + 2 decode nodes +# (TP=8 + DP=8 + DP-attention + DeepEP). The decode side scales out the +# EP domain to 8 GPUs while the prefill side stays at 4 GPUs — the +# asymmetric topology favors decode-bound 1k/1k workloads (each prefill +# step produces ~1024 decode steps). +# +# Why this beats symmetric 2P2D on per-GPU efficiency: when the workload +# is decode-heavy, increasing decode EP from 4 to 8 roughly doubles MoE +# throughput (each rank holds half as many experts → smaller MoE +# GroupedGEMM K-dim → memory-bound path more friendly), while the prefill +# bottleneck is comparatively cheap at 1k isl. +# +# Decode side drops `SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2` because CAR_V2 +# is single-node only — multi-node EP must use the default NCCL all-reduce. 
+name: "dsv4-pro-gb300-disagg-1p2d-dep4-to-dep8-mega-moe-1k1k" + +slurm: + partition: gb300 + time_limit: "4:00:00" + +frontend: + type: dynamo + nginx_container: nginx + +model: + path: "dsv4-pro" + container: "dsv4-grace-blackwell" + precision: "mxfp4" + +resources: + gpu_type: "gb300" + prefill_nodes: 1 + decode_nodes: 2 + prefill_workers: 1 + decode_workers: 1 + gpus_per_node: 4 + +backend: + type: sglang + + prefill_environment: + PYTHONUNBUFFERED: "1" + SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "1" + SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1" + SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "4096" + SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1" + SGLANG_OPT_FIX_HASH_MEGA_MOE: "1" + SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1" + SGLANG_OPT_USE_FAST_MASK_EP: "1" + SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" + SGLANG_OPT_USE_JIT_NORM: "1" + SGLANG_OPT_USE_TOPK_V2: "1" + SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0" + + decode_environment: + PYTHONUNBUFFERED: "1" + SGLANG_JIT_DEEPGEMM_PRECOMPILE: "0" + SGLANG_JIT_DEEPGEMM_FAST_WARMUP: "1" + NCCL_MNNVL_ENABLE: "1" + NCCL_CUMEM_ENABLE: "1" + SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: "1" + SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: "4096" + SGLANG_OPT_FIX_MEGA_MOE_MEMORY: "1" + SGLANG_OPT_FIX_HASH_MEGA_MOE: "1" + SGLANG_OPT_FIX_NEXTN_MEGA_MOE: "1" + SGLANG_OPT_USE_FAST_MASK_EP: "1" + SGLANG_OPT_USE_JIT_INDEXER_METADATA: "1" + SGLANG_OPT_USE_JIT_NORM: "1" + SGLANG_OPT_USE_TOPK_V2: "1" + SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: "1" + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "0" + # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 + # is single-node only and corrupts results in 2-node decode setups. 
+ + sglang_config: + prefill: + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + model-path: "/model/" + trust-remote-code: true + + # Prefill: DEP4 (1 node) + tensor-parallel-size: 4 + data-parallel-size: 4 + enable-dp-attention: true + moe-a2a-backend: "deepep" + deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' + + disaggregation-mode: "prefill" + disaggregation-transfer-backend: nixl + + mem-fraction-static: 0.90 + max-running-requests: 1024 + cuda-graph-max-bs: 1024 + chunked-prefill-size: 32768 + disable-radix-cache: true + + decode: + served-model-name: "deepseek-ai/DeepSeek-V4-Pro" + model-path: "/model/" + trust-remote-code: true + + # Decode: DEP8 (2 nodes) + tensor-parallel-size: 8 + data-parallel-size: 8 + enable-dp-attention: true + moe-a2a-backend: "deepep" + deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' + + disaggregation-mode: "decode" + disaggregation-transfer-backend: nixl + + # Lower mfs on decode: DEP8 weight memory + KV pool both grow ~2x + # vs DEP4, so 0.83 leaves enough headroom for cuda-graph capture + # at cgmb=2048. 
+ mem-fraction-static: 0.83 + max-running-requests: 2048 + cuda-graph-max-bs: 2048 + chunked-prefill-size: 32768 + disable-radix-cache: true + +benchmark: + type: "sa-bench" + isl: 1024 + osl: 1024 + random_range_ratio: 0.8 + concurrencies: "4x8x16x32x64x128x256x512x1024x1536x2048" + req_rate: "inf" + use_chat_template: false diff --git a/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/max-tpt.yaml b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/max-tpt.yaml new file mode 100644 index 00000000..f59594a0 --- /dev/null +++ b/recipes/dsv4-pro/sglang/gb300-fp4/1k1k/disagg/stp/max-tpt.yaml @@ -0,0 +1,175 @@ +name: dsv4-pro-gb300-fp4_8k1k_hightpt_0 + +slurm: + partition: hpc-mid + time_limit: 03:00:00 + +sbatch_directives: + cpus-per-task: '144' + mem: '0' + +dynamo: + hash: 9d3c913d300eb368cda28b3f98a23a5762621e0d + +frontend: + type: dynamo + enable_multiple_frontends: true + num_additional_frontends: 8 + nginx_container: /mnt/home/yangminl/containers/nginx-1.27.4.sqsh + +model: + path: dsv4-pro + container: dsv4-grace-blackwell + precision: fp4 + +resources: + gpu_type: gb300 + gpus_per_node: 4 + prefill_nodes: 7 + prefill_workers: 7 + decode_nodes: 2 + decode_workers: 1 + +extra_mount: +- /mnt/home/yangminl/sglang-patched/sglang:/sgl-workspace/sglang +- /mnt/home/yangminl/sglang-patched/sglang:/workspace/sglang + +backend: + type: sglang + + prefill_environment: + # SGLANG_HACK_PRINT_REQ_LIFECYCLE: "1" # TODO temp debug + SGLANG_DG_CACHE_DIR: /configs/deepgemm_cache # NOTE hack for quick tests + PYTHONUNBUFFERED: '1' + SGLANG_JIT_DEEPGEMM_PRECOMPILE: '0' + SGLANG_ENABLE_THINKING: '1' + SGLANG_REASONING_EFFORT: max + SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: '1' + SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: '1' + SGLANG_OPT_USE_JIT_NORM: '1' + SGLANG_OPT_USE_JIT_INDEXER_METADATA: '1' + SGLANG_OPT_USE_TOPK_V2: '1' + SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: '1' + SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: '1' + SGLANG_OPT_FIX_HASH_MEGA_MOE: '1' + SGLANG_OPT_USE_FAST_MASK_EP: '1' + 
SGLANG_OPT_FIX_MEGA_MOE_MEMORY: '1' + SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: '9216' + SGLANG_OPT_FIX_NEXTN_MEGA_MOE: '1' + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '0' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + MC_FORCE_MNNVL: '1' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_LOG_FORWARD_ITERS: '1' + SGLANG_LOG_MS: '1' + SGLANG_REQUEST_STATE_WAIT_TIMEOUT: '60' + + decode_environment: + # SGLANG_HACK_PRINT_REQ_LIFECYCLE: "1" # TODO temp debug + SGLANG_DG_CACHE_DIR: /configs/deepgemm_cache # NOTE hack for quick tests + PYTHONUNBUFFERED: '1' + SGLANG_JIT_DEEPGEMM_PRECOMPILE: '0' + SGLANG_ENABLE_THINKING: '1' + SGLANG_REASONING_EFFORT: max + SGLANG_OPT_SWA_SPLIT_LEAF_ON_INSERT: '1' + SGLANG_OPT_SWA_EVICT_DROP_PAGE_MARGIN: '1' + SGLANG_OPT_USE_JIT_NORM: '1' + SGLANG_OPT_USE_JIT_INDEXER_METADATA: '1' + SGLANG_OPT_USE_TOPK_V2: '1' + SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE: '1' + SGLANG_OPT_FIX_HASH_MEGA_MOE: '1' + SGLANG_OPT_USE_FAST_MASK_EP: '1' + SGLANG_OPT_FIX_MEGA_MOE_MEMORY: '1' + SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK: '1152' + SGLANG_OPT_FIX_NEXTN_MEGA_MOE: '1' + SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '0' + NCCL_MNNVL_ENABLE: '1' + NCCL_CUMEM_ENABLE: '1' + SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' + MC_FORCE_MNNVL: '1' + SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' + SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' + SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: '1' + DYN_SKIP_SGLANG_LOG_FORMATTING: '1' + SGLANG_LOG_FORWARD_ITERS: '1' + SGLANG_LOG_MS: '1' + SGLANG_REQUEST_STATE_WAIT_TIMEOUT: '60' + # SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2 intentionally NOT set: CAR_V2 + # is single-node only and corrupts results in 2-node decode setups. 
+ + sglang_config: + prefill: + served-model-name: deepseek-ai/DeepSeek-V4-Pro + model-path: /model/ + trust-remote-code: true + watchdog-timeout: 86400 + skip-tokenizer-init: true + stream-interval: 30 # pr50 sets it, let's do it + # tokenizer-worker-num: 16 # need this if we run tokenizer + + # Parallel + tensor-parallel-size: 4 + data-parallel-size: 4 + expert-parallel-size: 4 + + enable-dp-attention: true + moe-a2a-backend: deepep + deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' + + disaggregation-mode: prefill + disaggregation-transfer-backend: mooncake + + mem-fraction-static: 0.9 + max-running-requests: 512 + cuda-graph-max-bs: 512 + chunked-prefill-size: 32768 + # disable-radix-cache: true # NOTE try to enable radix cache + + decode: + served-model-name: deepseek-ai/DeepSeek-V4-Pro + model-path: /model/ + trust-remote-code: true + watchdog-timeout: 86400 + skip-tokenizer-init: true + stream-interval: 30 # pr50 sets it, let's do it + # disable-radix-cache: true # NOTE try to enable radix cache + + disaggregation-mode: decode + disaggregation-transfer-backend: mooncake + + mem-fraction-static: 0.94 + swa-full-tokens-ratio: 0.15 + context-length: 16384 + tensor-parallel-size: 8 + data-parallel-size: 8 + expert-parallel-size: 8 + enable-dp-attention: true + enable-dp-lm-head: true + moe-a2a-backend: deepep + deepep-config: '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' + max-running-requests: 9216 + cuda-graph-max-bs: 1152 + +benchmark: + type: custom + command: | + set -e + REPO=/configs/upstream-sa-bench/InferenceX + [ -d "$REPO" ] || git clone https://github.com/fzyzcjy/InferenceX.git "$REPO" + cd "$REPO/utils/bench_serving" + python3 benchmark_serving.py \ + --backend vllm --model deepseek-ai/DeepSeek-V4-Pro --tokenizer /model \ + --host 127.0.0.1 --port 8000 --endpoint /v1/completions \ + --dataset-name random \ + --random-input-len 8192 --random-output-len 1024 --random-range-ratio 0.8 \ + 
--random-num-workers 96 \ + --num-prompts 40960 --max-concurrency 4096 --request-rate 48 \ + --num-warmups 512 \ + --ignore-eos --trust-remote-code \ + --percentile-metrics ttft,tpot,itl,e2el \ + --save-result --result-dir /logs --result-filename results.json \ No newline at end of file diff --git a/recipes/gb300-fp4/1k1k-dsv4/README.md b/recipes/gb300-fp4/1k1k-dsv4/README.md deleted file mode 100644 index 36fb3ebb..00000000 --- a/recipes/gb300-fp4/1k1k-dsv4/README.md +++ /dev/null @@ -1,53 +0,0 @@ -# DeepSeek-V4-Pro (1.6T MoE, MXFP4) — 1k/1k aggregated on GB300 - -This directory contains NVIDIA-verified SGLang recipes for **DeepSeek-V4-Pro** -(1.6T-parameter MoE with MXFP4 MoE weights + FP8 KV, UE8M0 scales) on **GB300** -(ARM64 Grace + Blackwell, 4 GPU per node), aggregated serving mode, 1024 input / -1024 output workload. - -## Container - -All recipes reference the `dsv4-grace-blackwell` alias defined in -`srtslurm.yaml.example`. Pull + convert: - -```bash -enroot import --output sglang-deepseek-v4-grace-blackwell.sqsh \ - docker://lmsysorg/sglang:deepseek-v4-grace-blackwell -``` - -(Use the `deepseek-v4-blackwell` image for B200 x86_64, or `deepseek-v4-hopper` for H200.) 
- -## Model checkpoint - -```bash -hf download deepseek-ai/DeepSeek-V4-Pro --local-dir /shared/models/deepseek/DeepSeek-V4-Pro -``` - -## Recipes - -| file | parallelism | MTP | target | notes | -|---|---|---|---|---| -| `agg-low-latency.yaml` | TP=4 | EAGLE 3/4 | minimum TPOT / best per-user latency | GB300 1 node | -| `agg-nomtp.yaml` | TP=4 | — | baseline throughput, no spec decoding | GB300 1 node | -| `agg-balanced-tep.yaml` | TP=4 + DP=4 + DP-attn + DeepEP | EAGLE 1/2 | Pareto mid-curve | GB300 1 node | -| `agg-max-tpt-tep.yaml` | TP=4 + DP=4 + DP-attn + DeepEP | — | maximum TPS/GPU | GB300 1 node | -| `agg-2n-low-latency.yaml` | TP=8 | EAGLE 3/4 | low-latency, 2× memory headroom | GB300 2 nodes | -| `agg-2n-nomtp.yaml` | TP=8 | — | throughput, 2× memory headroom | GB300 2 nodes | - -## Key flags (derived from the SGLang DSv4 cookbook) - -- `moe-runner-backend: flashinfer_mxfp4` — MXFP4 MoE kernels (Blackwell only). -- `chunked-prefill-size: 4096` + `disable-flashinfer-autotune: true` — cookbook recipe. -- `disable-radix-cache: true` — synthetic benchmark best practice; also - reduces contiguous-allocator fragmentation at weight-reorder time. -- `mem-fraction-static: 0.78` — leaves headroom for the MXFP4 - `reorder_w1w3_to_w3w1` path (0.82 intermittently OOMs on GB300). -- TEP recipes: `enable-dp-attention + moe-a2a-backend: deepep` plus - `deepep-config num_sms=96` (DeepEP `DEEPEP_LARGE_SMS_FLAG` for single-node - Blackwell per cookbook). - -## References - -- [SGLang cookbook: `docs/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx`](https://github.com/sgl-project/sglang/blob/main/docs/cookbook/autoregressive/DeepSeek/DeepSeek-V4.mdx) -- [DeepSeek-V4-Pro model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) -- Upstream SGLang PR: sgl-project/sglang#23600