
Add low-latency FP8 disagg with MTP for GB200 #116

Merged
trevor-m merged 2 commits into main from trevor-m/fp8-ll-mtp
Feb 3, 2026

Conversation

@trevor-m
Collaborator

@trevor-m trevor-m commented Jan 28, 2026

Based on recipes/gb200-fp8/1k1k/low-latency.yaml but with MTP enabled. I had to move the decode workers from 4xTP4 -> 2xTP8 due to memory limitations.

diff recipes/gb200-fp8/1k1k/low-latency.yaml recipes/gb200-fp8/1k1k/low-latency-mtp.yaml 
1c1
< name: "gb200-fp8-1p-4d-low-latency"
---
> name: "gb200-fp8-1p-2d-low-latency-mtp"
5c5
<   container: "0.5.5.post2"
---
>   container: "0.5.8"
13c13
<   decode_workers: 4
---
>   decode_workers: 2
33a34
>     SGLANG_ENABLE_SPEC_V2: "1"
53a55
>     SGLANG_ENABLE_SPEC_V2: "1"
57a60
>       model-path: "/model/"
68c71
<       mem-fraction-static: 0.95
---
>       mem-fraction-static: 0.90
80a84,87
>       speculative-algorithm: "EAGLE"
>       speculative-num-steps: 2
>       speculative-eagle-topk: 1
>       speculative-num-draft-tokens: 3
83a91
>       model-path: "/model/"
94c102
<       mem-fraction-static: 0.95
---
>       mem-fraction-static: 0.90
96c104
<       cuda-graph-max-bs: 128
---
>       cuda-graph-max-bs: 256
99d106
<       enable-flashinfer-allreduce-fusion: true
103c110
<       tensor-parallel-size: 4
---
>       tensor-parallel-size: 8
105a113,116
>       speculative-algorithm: "EAGLE"
>       speculative-num-steps: 2
>       speculative-eagle-topk: 1
>       speculative-num-draft-tokens: 3
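As a sanity check on the new speculative settings: with chain-style (topk = 1) EAGLE drafting, the usual rule of thumb is draft tokens = steps × topk + 1, which matches the recipe's 2 steps / 3 draft tokens. A minimal sketch of that check (the rule-of-thumb formula, not SGLang's internal accounting):

```python
def draft_token_budget(num_steps: int, eagle_topk: int) -> int:
    """Rule-of-thumb draft-token budget for chain-style EAGLE speculation:
    one drafted token per step (times top-k) plus one slot for the token
    the target model emits itself. A sketch, not SGLang's internal formula."""
    return num_steps * eagle_topk + 1

# Values from this recipe: speculative-num-steps=2, speculative-eagle-topk=1
assert draft_token_budget(2, 1) == 3
```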

Summary by CodeRabbit

  • Chores
    • Added a new low-latency multi-token prediction configuration for GB200 FP8. Includes staged prefill/decode setup, tuned resource allocation across stages, extensive environment-level performance and compatibility tuning, and benchmark parameters to improve throughput and latency for FP8 models.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Contributor

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

Adds a new YAML configuration for a low-latency MTP deployment targeting gb200 with fp8 precision, specifying model, resources (1 prefill node, 4 decode nodes, gpus_per_node 4), backend environment tuning for prefill/decode, sglang_config stages, and benchmark settings.

Changes

Cohort / File(s) Summary
Configuration
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml
New 123-line YAML defining low-latency MTP: model spec (dsfp8, container 0.5.8, precision fp8), resources (gb200 GPUs, prefill_nodes 1, decode_nodes 4, gpus_per_node 4, worker counts), backend env vars (SGLANG/NCCL/timeouts/mempool/logging/compat flags), detailed sglang_config for prefill and decode stages (metadata, kv-cache-dtype fp8_e4m3, attention/quantization/moe/disaggregation settings, parallelism/speculative configs), and a benchmark section (sa-bench, isl/osl 1024, concurrencies list, req_rate inf).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • ishandhanani

Poem

🐰 In configs I hop, with fp8 delight,
Prefill and decode, tuned through the night.
Nodes hum in chorus, latency small,
Parameters set—I'll celebrate the call! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately describes the main change: adding a low-latency FP8 configuration with disaggregation and MTP (multi-token prediction) for GB200 hardware.
  • Docstring Coverage (✅ Passed): no functions found in the changed files to evaluate docstring coverage; check skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Around line 61-62: Update the deployment docs and the YAML so the use of
trust-remote-code: true is accompanied by explicit security controls: pin the
served-model-name deepseek-ai/DeepSeek-R1 to an immutable revision (not a
floating tag), reference or switch to an internal mirror URL for that exact
revision, and add a note that env/runtime policy disallows pulling model code
from external sources; apply these changes for both the prefill block (where
trust-remote-code and kv-cache-dtype appear) and the decode block, and add a
short section describing the vetting evidence for the pinned revision.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml (1)

16-55: Consider YAML anchors to dedupe env blocks.
Both env blocks are nearly identical; anchors reduce drift and make overrides obvious.

♻️ Example refactor using YAML anchors
 backend:
-  prefill_environment:
+  prefill_environment: &common_env
     TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
     PYTHONUNBUFFERED: "1"
     DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     SGLANG_ENABLE_JIT_DEEPGEMM: "false"
     SGLANG_ENABLE_FLASHINFER_GEMM: "1"
     SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
     SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
     SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
     SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
     SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
     MC_TE_METRIC: "true"
     MC_FORCE_MNNVL: "1"
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
     SGLANG_ENABLE_SPEC_V2: "1"

   decode_environment:
+    <<: *common_env
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_ENABLE_FLASHINFER_GEMM: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_ENABLE_SPEC_V2: "1"
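If this refactor is adopted, it's worth confirming that whatever loads these recipes resolves YAML 1.1 merge keys. A quick check with PyYAML on a trimmed-down document (a sketch; assumes the recipes are parsed by a merge-key-aware loader such as PyYAML's SafeLoader):

```python
import yaml  # PyYAML; its SafeLoader resolves YAML 1.1 merge keys ("<<:")

doc = """
prefill_environment: &common_env
  PYTHONUNBUFFERED: "1"
  SGLANG_ENABLE_SPEC_V2: "1"

decode_environment:
  <<: *common_env
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
"""

cfg = yaml.safe_load(doc)
# Anchored keys are inherited; keys declared locally are kept as-is.
assert cfg["decode_environment"]["PYTHONUNBUFFERED"] == "1"
assert cfg["decode_environment"]["SGLANG_ENABLE_SPEC_V2"] == "1"
assert cfg["decode_environment"]["SGLANG_DECODE_BOOTSTRAP_TIMEOUT"] == "1000"
```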

Comment on lines +61 to +62
trust-remote-code: true
kv-cache-dtype: "fp8_e4m3"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the specific file
find . -name "low-latency-mtp.yaml" -type f

Repository: ishandhanani/srt-slurm



🏁 Script executed:

# Read the full YAML file to understand context
cat "recipes/gb200-fp8/1k1k/low-latency-mtp.yaml"



🏁 Script executed:

# Search for any documentation or README files in the recipes directory
find recipes/gb200-fp8 -type f \( -name "*.md" -o -name "README*" -o -name "*.txt" \)



🏁 Script executed:

# Look for any comments or documentation about trust-remote-code or model vetting
rg -i "trust-remote-code|model.*vet|internal.*mirror|immutable" recipes/gb200-fp8/ -A 2 -B 2



trust-remote-code: true is a security posture risk that requires documentation.

The deepseek-ai/DeepSeek-R1 model source is referenced by served-model-name without evidence that the model revision is pinned to an immutable commit or that an internal mirror is in place. Confirm that:

  1. The model code has been vetted and the specific revision is hardcoded (not latest)
  2. An internal mirror is used to prevent upstream code injection
  3. The deployment environment prevents pulling model code from external sources

This applies to both prefill (lines 61-62) and decode (lines 92-93) configurations.

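One lightweight way to address this in the recipe itself (a sketch; the provenance fields below are hypothetical and not part of this repo's schema; note the recipe already loads a local snapshot via path: "dsfp8", which avoids live pulls, so the remaining gap is recording which upstream revision that snapshot was vetted against):

```yaml
model:
  path: "dsfp8"                                 # local, pre-vetted snapshot; no live pull at deploy time
  # Hypothetical provenance metadata (not a real schema field in this repo):
  # record the upstream repo and the immutable commit the snapshot mirrors,
  # so the vetted revision stays auditable.
  upstream_repo: "deepseek-ai/DeepSeek-R1"
  upstream_revision: "<immutable-commit-sha>"   # placeholder, not a real SHA
```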

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Around line 1-13: The config name "gb200-fp8-1p-2d-low-latency-mtp" does not
match the decode node count (decode_nodes: 4); update either the name or the
value so they align—either change decode_nodes to 2 to match "1p-2d" or rename
the "name" field to reflect "1p-4d" (or equivalent) so scheduling/selection
logic using the name and decode_nodes stays consistent; locate the top-level
name field and the decode_nodes key in this file to make the change.

Comment on lines +1 to +13
name: "gb200-fp8-1p-2d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "0.5.8"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 4
  prefill_workers: 1
  decode_workers: 2

⚠️ Potential issue | 🟡 Minor

Align the config name with the decode node count.

The name says 1p-2d but decode_nodes is 4. This will confuse scheduling/selection. Either rename the config or adjust decode_nodes to 2.

✏️ Proposed fix (if 4 decode nodes are intended)
-name: "gb200-fp8-1p-2d-low-latency-mtp"
+name: "gb200-fp8-1p-4d-low-latency-mtp"
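This kind of name/topology drift is easy to catch mechanically before it confuses scheduling. A minimal sketch (the "<N>p-<M>d" token convention is inferred from the recipe names in this PR, and the helper itself is hypothetical, not part of this repo):

```python
import re

def check_topology_name(config: dict) -> list:
    """Flag mismatches between the '<N>p-<M>d' token in a recipe name and
    the declared node counts. Hypothetical helper; the naming convention
    is inferred from this repo's recipes."""
    m = re.search(r"(\d+)p-(\d+)d", config["name"])
    if not m:
        return ["name carries no '<N>p-<M>d' topology token"]
    named_p, named_d = int(m.group(1)), int(m.group(2))
    res = config["resources"]
    problems = []
    if named_p != res["prefill_nodes"]:
        problems.append(f"name says {named_p} prefill node(s), config has {res['prefill_nodes']}")
    if named_d != res["decode_nodes"]:
        problems.append(f"name says {named_d} decode node(s), config has {res['decode_nodes']}")
    return problems

cfg = {
    "name": "gb200-fp8-1p-2d-low-latency-mtp",
    "resources": {"prefill_nodes": 1, "decode_nodes": 4},
}
print(check_topology_name(cfg))  # flags the 2d-vs-4 mismatch
```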

@trevor-m trevor-m merged commit bfbf352 into main Feb 3, 2026
5 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Apr 3, 2026