
Add low-latency FP8 disagg with MTP for GB200 #116

Merged
trevor-m merged 2 commits into main from trevor-m/fp8-ll-mtp
Feb 3, 2026

Conversation

@trevor-m
Collaborator

@trevor-m trevor-m commented Jan 28, 2026

Based on recipes/gb200-fp8/1k1k/low-latency.yaml but with MTP enabled. I had to move the decode workers from 4xTP4 -> 2xTP8 due to memory limitations.

diff recipes/gb200-fp8/1k1k/low-latency.yaml recipes/gb200-fp8/1k1k/low-latency-mtp.yaml 
1c1
< name: "gb200-fp8-1p-4d-low-latency"
---
> name: "gb200-fp8-1p-2d-low-latency-mtp"
5c5
<   container: "0.5.5.post2"
---
>   container: "0.5.8"
13c13
<   decode_workers: 4
---
>   decode_workers: 2
33a34
>     SGLANG_ENABLE_SPEC_V2: "1"
53a55
>     SGLANG_ENABLE_SPEC_V2: "1"
57a60
>       model-path: "/model/"
68c71
<       mem-fraction-static: 0.95
---
>       mem-fraction-static: 0.90
80a84,87
>       speculative-algorithm: "EAGLE"
>       speculative-num-steps: 2
>       speculative-eagle-topk: 1
>       speculative-num-draft-tokens: 3
83a91
>       model-path: "/model/"
94c102
<       mem-fraction-static: 0.95
---
>       mem-fraction-static: 0.90
96c104
<       cuda-graph-max-bs: 128
---
>       cuda-graph-max-bs: 256
99d106
<       enable-flashinfer-allreduce-fusion: true
103c110
<       tensor-parallel-size: 4
---
>       tensor-parallel-size: 8
105a113,116
>       speculative-algorithm: "EAGLE"
>       speculative-num-steps: 2
>       speculative-eagle-topk: 1
>       speculative-num-draft-tokens: 3
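As a sanity check on the new speculative settings: with chain-style (topk = 1) EAGLE drafting, the usual rule of thumb is draft tokens = steps × topk + 1, which matches the recipe's 2 steps / 3 draft tokens. A minimal sketch of that check (the rule-of-thumb formula, not SGLang's internal accounting):

```python
def draft_token_budget(num_steps: int, eagle_topk: int) -> int:
    """Rule-of-thumb draft-token budget for chain-style EAGLE speculation:
    one drafted token per step (times top-k) plus one slot for the token
    the target model emits itself. A sketch, not SGLang's internal formula."""
    return num_steps * eagle_topk + 1

# Values from this recipe: speculative-num-steps=2, speculative-eagle-topk=1
assert draft_token_budget(2, 1) == 3
```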

Summary by CodeRabbit

  • Chores
    • Added a new low-latency multi-token prediction configuration for GB200 FP8. Includes staged prefill/decode setup, tuned resource allocation across stages, extensive environment-level performance and compatibility tuning, and benchmark parameters to improve throughput and latency for FP8 models.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Contributor

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

Adds a new YAML configuration for a low-latency MTP deployment targeting gb200 with fp8 precision, specifying model, resources (1 prefill node, 4 decode nodes, gpus_per_node 4), backend environment tuning for prefill/decode, sglang_config stages, and benchmark settings.

Changes

Cohort / File(s) Summary
Configuration
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml
New 123-line YAML defining low-latency MTP: model spec (dsfp8, container 0.5.8, precision fp8), resources (gb200 GPUs, prefill_nodes 1, decode_nodes 4, gpus_per_node 4, worker counts), backend env vars (SGLANG/NCCL/timeouts/mempool/logging/compat flags), detailed sglang_config for prefill and decode stages (metadata, kv-cache-dtype fp8_e4m3, attention/quantization/moe/disaggregation settings, parallelism/speculative configs), and a benchmark section (sa-bench, isl/osl 1024, concurrencies list, req_rate inf).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • ishandhanani

Poem

🐰 In configs I hop, with fp8 delight,
Prefill and decode, tuned through the night.
Nodes hum in chorus, latency small,
Parameters set—I'll celebrate the call! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately describes the main change: adding a low-latency FP8 configuration with disaggregation and MTP (multi-token prediction) for GB200 hardware.
  • Docstring Coverage (✅ Passed): no functions found in the changed files to evaluate docstring coverage; check skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Around line 61-62: Update the deployment docs and the YAML so the use of
trust-remote-code: true is accompanied by explicit security controls: pin the
served-model-name deepseek-ai/DeepSeek-R1 to an immutable revision (not a
floating tag), reference or switch to an internal mirror URL for that exact
revision, and add a note that env/runtime policy disallows pulling model code
from external sources; apply these changes for both the prefill block (where
trust-remote-code and kv-cache-dtype appear) and the decode block, and add a
short section describing the vetting evidence for the pinned revision.
🧹 Nitpick comments (1)
recipes/gb200-fp8/1k1k/low-latency-mtp.yaml (1)

16-55: Consider YAML anchors to dedupe env blocks.
Both env blocks are nearly identical; anchors reduce drift and make overrides obvious.

♻️ Example refactor using YAML anchors
 backend:
-  prefill_environment:
+  prefill_environment: &common_env
     TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
     PYTHONUNBUFFERED: "1"
     DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
     SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
     SGLANG_ENABLE_JIT_DEEPGEMM: "false"
     SGLANG_ENABLE_FLASHINFER_GEMM: "1"
     SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
     SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
     SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
     SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
     SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
     MC_TE_METRIC: "true"
     MC_FORCE_MNNVL: "1"
     NCCL_MNNVL_ENABLE: "1"
     NCCL_CUMEM_ENABLE: "1"
     SGLANG_ENABLE_SPEC_V2: "1"

   decode_environment:
+    <<: *common_env
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_DG_CACHE_DIR: "/configs/dg-10212025"
-    SGLANG_ENABLE_JIT_DEEPGEMM: "false"
-    SGLANG_ENABLE_FLASHINFER_GEMM: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
     SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
     SGLANG_HACK_SEQ_BOOTSTRAP_ROOM: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_ENABLE_SPEC_V2: "1"
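If this refactor is adopted, it's worth confirming that whatever loads these recipes resolves YAML 1.1 merge keys. A quick check with PyYAML on a trimmed-down document (a sketch; assumes the recipes are parsed by a merge-key-aware loader such as PyYAML's SafeLoader):

```python
import yaml  # PyYAML; its SafeLoader resolves YAML 1.1 merge keys ("<<:")

doc = """
prefill_environment: &common_env
  PYTHONUNBUFFERED: "1"
  SGLANG_ENABLE_SPEC_V2: "1"

decode_environment:
  <<: *common_env
  SGLANG_DECODE_BOOTSTRAP_TIMEOUT: "1000"
"""

cfg = yaml.safe_load(doc)
# Anchored keys are inherited; keys declared locally are kept as-is.
assert cfg["decode_environment"]["PYTHONUNBUFFERED"] == "1"
assert cfg["decode_environment"]["SGLANG_ENABLE_SPEC_V2"] == "1"
assert cfg["decode_environment"]["SGLANG_DECODE_BOOTSTRAP_TIMEOUT"] == "1000"
```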

Comment on lines +61 to +62
trust-remote-code: true
kv-cache-dtype: "fp8_e4m3"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the specific file
find . -name "low-latency-mtp.yaml" -type f

Repository: ishandhanani/srt-slurm



🏁 Script executed:

# Read the full YAML file to understand context
cat "recipes/gb200-fp8/1k1k/low-latency-mtp.yaml"



🏁 Script executed:

# Search for any documentation or README files in the recipes directory
find recipes/gb200-fp8 -type f \( -name "*.md" -o -name "README*" -o -name "*.txt" \)



🏁 Script executed:

# Look for any comments or documentation about trust-remote-code or model vetting
rg -i "trust-remote-code|model.*vet|internal.*mirror|immutable" recipes/gb200-fp8/ -A 2 -B 2



trust-remote-code: true is a security posture risk that requires documentation.

The deepseek-ai/DeepSeek-R1 model source is referenced by served-model-name without evidence that the model revision is pinned to an immutable commit or that an internal mirror is in place. Confirm that:

  1. The model code has been vetted and the specific revision is hardcoded (not latest)
  2. An internal mirror is used to prevent upstream code injection
  3. The deployment environment prevents pulling model code from external sources

This applies to both prefill (lines 61-62) and decode (lines 92-93) configurations.

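One lightweight way to address this in the recipe itself (a sketch; the provenance fields below are hypothetical and not part of this repo's schema; note the recipe already loads a local snapshot via path: "dsfp8", which avoids live pulls, so the remaining gap is recording which upstream revision that snapshot was vetted against):

```yaml
model:
  path: "dsfp8"                                 # local, pre-vetted snapshot; no live pull at deploy time
  # Hypothetical provenance metadata (not a real schema field in this repo):
  # record the upstream repo and the immutable commit the snapshot mirrors,
  # so the vetted revision stays auditable.
  upstream_repo: "deepseek-ai/DeepSeek-R1"
  upstream_revision: "<immutable-commit-sha>"   # placeholder, not a real SHA
```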

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@recipes/gb200-fp8/1k1k/low-latency-mtp.yaml`:
- Around line 1-13: The config name "gb200-fp8-1p-2d-low-latency-mtp" does not
match the decode node count (decode_nodes: 4); update either the name or the
value so they align—either change decode_nodes to 2 to match "1p-2d" or rename
the "name" field to reflect "1p-4d" (or equivalent) so scheduling/selection
logic using the name and decode_nodes stays consistent; locate the top-level
name field and the decode_nodes key in this file to make the change.

Comment on lines +1 to +13
name: "gb200-fp8-1p-2d-low-latency-mtp"

model:
  path: "dsfp8"
  container: "0.5.8"
  precision: "fp8"

resources:
  gpu_type: "gb200"
  prefill_nodes: 1
  decode_nodes: 4
  prefill_workers: 1
  decode_workers: 2

⚠️ Potential issue | 🟡 Minor

Align the config name with the decode node count.

The name says 1p-2d but decode_nodes is 4. This will confuse scheduling/selection. Either rename the config or adjust decode_nodes to 2.

✏️ Proposed fix (if 4 decode nodes are intended)
-name: "gb200-fp8-1p-2d-low-latency-mtp"
+name: "gb200-fp8-1p-4d-low-latency-mtp"
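This kind of name/topology drift is easy to catch mechanically before it confuses scheduling. A minimal sketch (the "<N>p-<M>d" token convention is inferred from the recipe names in this PR, and the helper itself is hypothetical, not part of this repo):

```python
import re

def check_topology_name(config: dict) -> list:
    """Flag mismatches between the '<N>p-<M>d' token in a recipe name and
    the declared node counts. Hypothetical helper; the naming convention
    is inferred from this repo's recipes."""
    m = re.search(r"(\d+)p-(\d+)d", config["name"])
    if not m:
        return ["name carries no '<N>p-<M>d' topology token"]
    named_p, named_d = int(m.group(1)), int(m.group(2))
    res = config["resources"]
    problems = []
    if named_p != res["prefill_nodes"]:
        problems.append(f"name says {named_p} prefill node(s), config has {res['prefill_nodes']}")
    if named_d != res["decode_nodes"]:
        problems.append(f"name says {named_d} decode node(s), config has {res['decode_nodes']}")
    return problems

cfg = {
    "name": "gb200-fp8-1p-2d-low-latency-mtp",
    "resources": {"prefill_nodes": 1, "decode_nodes": 4},
}
print(check_topology_name(cfg))  # flags the 2d-vs-4 mismatch
```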

@trevor-m trevor-m merged commit bfbf352 into main Feb 3, 2026
5 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Apr 3, 2026