Add low-latency FP8 disagg with MTP for GB200 #116
Conversation
📝 Walkthrough

Adds a new YAML configuration for a low-latency MTP deployment targeting gb200 with fp8 precision, specifying the model, resources (1 prefill node, 4 decode nodes, gpus_per_node 4), backend environment tuning for prefill/decode, sglang_config stages, and benchmark settings.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Around line 61-62: Update the deployment docs and the YAML so the use of
`trust-remote-code: true` is accompanied by explicit security controls: pin the
served-model-name `deepseek-ai/DeepSeek-R1` to an immutable revision (not a
floating tag), reference or switch to an internal mirror URL for that exact
revision, and add a note that env/runtime policy disallows pulling model code
from external sources; apply these changes to both the prefill block (where
`trust-remote-code` and `kv-cache-dtype` appear) and the decode block, and add a
short section describing the vetting evidence for the pinned revision.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml (1)
16-55: Consider YAML anchors to dedupe env blocks.
Both env blocks are nearly identical; anchors reduce drift and make overrides obvious.

♻️ Example refactor using YAML anchors

```diff
 backend:
-  prefill_environment:
+  prefill_environment: &common_env
     TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
     PYTHONUNBUFFERED: "1"
     DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     SGLANG_ENABLE_JIT_DEEPGEMM: "false"
     SGLANG_ENABLE_FLASHINFER_GEMM: "1"
     SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
     SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
     SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
     SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
     SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
     MC_TE_METRIC: "true"
     MC_FORCE_MNNVL: "1"
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
     SGLANG_ENABLE_SPEC_V2: "1"
   decode_environment:
+    <<: *common_env
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_ENABLE_FLASHINFER_GEMM: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_ENABLE_SPEC_V2: "1"
```
```yaml
trust-remote-code: true
kv-cache-dtype: "fp8_e4m3"
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate and examine the specific file
find . -name "low-latency-mtp.yaml" -type f
```

Repository: ishandhanani/srt-slurm
Length of output: 111

🏁 Script executed:

```shell
# Read the full YAML file to understand context
cat "recipes/gb200-fp8/1k1k/low-latency-mtp.yaml"
```

Length of output: 3850

🏁 Script executed:

```shell
# Search for any documentation or README files in the recipes directory
find recipes/gb200-fp8 -type f \( -name "*.md" -o -name "README*" -o -name "*.txt" \)
```

Length of output: 48

🏁 Script executed:

```shell
# Look for any comments or documentation about trust-remote-code or model vetting
rg -i "trust-remote-code|model.*vet|internal.*mirror|immutable" recipes/gb200-fp8/ -A 2 -B 2
```

Length of output: 5154
`trust-remote-code: true` is a security posture risk that requires documentation.
The `deepseek-ai/DeepSeek-R1` model source is referenced by `served-model-name` without evidence that the model revision is pinned to an immutable commit or that an internal mirror is in place. Confirm that:
- The model code has been vetted and the specific revision is hardcoded (not latest)
- An internal mirror is used to prevent upstream code injection
- The deployment environment prevents pulling model code from external sources
This applies to both prefill (lines 61-62) and decode (lines 92-93) configurations.
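As a sketch only, hardening along these lines might look as follows. The `revision` and `model-source` keys below are illustrative assumptions, not confirmed fields of this recipe schema:

```yaml
# Illustrative hardening sketch -- key names here are assumptions,
# not a confirmed schema for these recipes.
served-model-name: deepseek-ai/DeepSeek-R1
revision: "<immutable-commit-sha>"   # pin a vetted commit, never a floating tag
model-source: "https://models.internal.example/deepseek-ai/DeepSeek-R1"  # vetted internal mirror
trust-remote-code: true              # acceptable only alongside the pin + mirror above
```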
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Around line 1-13: The config name "gb200-fp8-1p-2d-low-latency-mtp" does not
match the decode node count (decode_nodes: 4); update either the name or the
value so they align—either change decode_nodes to 2 to match "1p-2d" or rename
the "name" field to reflect "1p-4d" (or equivalent) so scheduling/selection
logic using the name and decode_nodes stays consistent; locate the top-level
name field and the decode_nodes key in this file to make the change.
```yaml
name: "gb200-fp8-1p-2d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "0.5.8"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 4
  prefill_workers: 1
  decode_workers: 2
```
Align the config name with the decode node count.
The name says `1p-2d` but `decode_nodes` is 4. This will confuse scheduling/selection. Either rename the config or adjust `decode_nodes` to 2.
✏️ Proposed fix (if 4 decode nodes are intended)

```diff
-name: "gb200-fp8-1p-2d-low-latency-mtp"
+name: "gb200-fp8-1p-4d-low-latency-mtp"
```

📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
name: "gb200-fp8-1p-4d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "0.5.8"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 4
  prefill_workers: 1
  decode_workers: 2
```
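A name/topology mismatch like the one flagged above can be caught mechanically before submitting. The sketch below is a hypothetical helper, not part of this repo; the flat `name:` / `prefill_nodes:` / `decode_nodes:` layout is assumed from the snippet quoted in this review:

```shell
# Hypothetical pre-submit check: verify that the "<P>p-<D>d" token in a
# recipe's name matches its prefill_nodes/decode_nodes values.
check_recipe() {
  local f="$1" name p d
  name=$(sed -n 's/^name: "\(.*\)"/\1/p' "$f")
  p=$(sed -n 's/^ *prefill_nodes: \([0-9]*\).*/\1/p' "$f")
  d=$(sed -n 's/^ *decode_nodes: \([0-9]*\).*/\1/p' "$f")
  case "$name" in
    *"${p}p-${d}d"*) echo "OK: $name matches ${p}p-${d}d" ;;
    *)               echo "MISMATCH: $name vs ${p}p-${d}d" ;;
  esac
}

# Demonstrate against the fields quoted in this review
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
name: "gb200-fp8-1p-2d-low-latency-mtp"
resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 4
EOF
result=$(check_recipe "$tmp")
echo "$result"   # MISMATCH: gb200-fp8-1p-2d-low-latency-mtp vs 1p-4d
```

Wiring something like this into CI would keep the name and the scheduling-relevant counts from drifting apart silently.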
Based on `recipes/gb200-fp8/1k1k/low-latency.yaml` but with MTP enabled. I had to move the decode workers from 4xTP4 to 2xTP8 due to memory limitations (2 workers × TP8 = 16 GPUs, matching 4 decode nodes × 4 GPUs per node).