Conversation
📝 Walkthrough

Adds three new top-level YAML recipes for GB200 FP4 disaggregated inference (low-latency, max-throughput, mid-curve), each defining multi-stage prefill/decode configurations, environment flags, SGLang tuning (quantization, TP/DP/EP, CUDA graph), resource topology, and SA-bench benchmarks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Prefill_Node
    participant Decode_Node
    participant Model_Storage
    Client->>Prefill_Node: Send prefill request (tokens, context)
    Prefill_Node->>Model_Storage: Load model shard / weights
    Prefill_Node->>Prefill_Node: Populate KV-cache (prefill mode)
    Prefill_Node->>Decode_Node: Forward KV-cache + request metadata
    Decode_Node->>Model_Storage: Load decode-stage shards
    Decode_Node->>Decode_Node: Run decode loop (cuda-graph / TP/DP/EP)
    Decode_Node->>Client: Return tokens / responses
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 79-81: Replace the long-form parallelism keys with the short-form
ones used elsewhere: change data-parallel-size to dp-size, tensor-parallel-size
to tp-size, and expert-parallel-size to ep-size (also update any other
occurrences of the long names, such as those in the second block of the same
file) so the config uses dp-size, tp-size, and ep-size consistently with the
rest of the PR.
- Around line 69-72: The config sets max-total-tokens to 8192 which undercuts
the benchmark requirement (isl=1024 + osl=8192 = 9216) and conflicts with
context-length=9200; update the max-total-tokens entry to at least 9216 so
requests aren't truncated or rejected, and verify related fields like
context-length and chunked-prefill-size (currently 9200 and 8192) remain
consistent with the new total token budget.
In `@recipies/gb200-fp4/1k8k/mid-curve.yaml`:
- Around line 1-2: The top-of-file comment says "Does not use single batch
overlap" but the prefill section currently sets enable-single-batch-overlap to
true; update either the comment or the setting so they match: either set
enable-single-batch-overlap to false (to match the comment) or change the
header comment to state that single batch overlap is enabled, so the file
header and the prefill flag are consistent.
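If the header comment is taken as authoritative, the top of mid-curve.yaml would look like this (a minimal sketch; all other keys omitted):

```yaml
# Does not use single batch overlap
prefill:
  enable-single-batch-overlap: false  # was true; changed to match the header comment
```

Alternatively, keep the flag at true and reword the comment; either choice resolves the mismatch.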
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 83-105: Add the missing data-parallel-size setting to the decode
block and normalize parallelism keys to the short forms used elsewhere: replace
tensor-parallel-size with tp-size, expert-parallel-size with ep-size, and
data-parallel-size with dp-size (set dp-size to 1 to match prefill). Update the
decode mapping (the decode section that contains served-model-name, model-path,
etc.) to use dp-size, tp-size, and ep-size instead of the long-form names so it
matches max-tpt.yaml and other gb*-files.
In `@recipies/gb200-fp4/1k8k/max-tpt.yaml`:
- Line 124: Replace the mistyped backend string "flashinfer_cutedsl" with the
correct "flashinfer_cutlass" so the decode section uses the same
moe-runner-backend value as the prefill section; locate the moe-runner-backend
entry in the decode block (and any other occurrences) and update the value to
"flashinfer_cutlass" to avoid runtime backend lookup failures.
♻️ Duplicate comments (1)

recipies/gb200-fp4/1k8k/low-latency.yaml (1)

69-81: Issues already flagged in previous review. The `max-total-tokens` value (line 71) and parallelism key naming (lines 79-81) have been addressed in prior review comments.
```yaml
decode:
  disaggregation-mode: "decode"
  served-model-name: "deepseek-ai/DeepSeek-R1"
  model-path: "/model/"
  prefill-round-robin-balance: true
  trust-remote-code: true
  disable-radix-cache: true
  kv-cache-dtype: "fp8_e4m3"
  attention-backend: "trtllm_mla"
  quantization: "modelopt_fp4"
  moe-runner-backend: "flashinfer_trtllm"
  disaggregation-bootstrap-port: 30001
  stream-interval: 10
  watchdog-timeout: 1000000
  context-length: 9200
  mem-fraction-static: 0.95
  chunked-prefill-size: 8192
  cuda-graph-max-bs: 256
  scheduler-recv-interval: 10
  moe-dense-tp-size: 1
  tensor-parallel-size: 4
  expert-parallel-size: 1
```
🧩 Analysis chain

🏁 Script executed:

```shell
rg -n "dp-size|data-parallel-size" --type yaml -A 2 -B 2
```

Repository: ishandhanani/srt-slurm
Length of output: 13848

🏁 Script executed:

```shell
# Also check the full context of the file in question
cat -n recipies/gb200-fp4/1k8k/low-latency.yaml | head -120
```

Repository: ishandhanani/srt-slurm
Length of output: 4229

🏁 Script executed:

```shell
# Search for other decode sections to see patterns
rg -n "^\s+decode:" --type yaml -A 20
```

Repository: ishandhanani/srt-slurm
Length of output: 45703
Add `data-parallel-size` to the decode section and align parallelism parameter naming.

The decode section omits `data-parallel-size`, while the prefill section explicitly sets it to 1. All other parallelism configs in the codebase (decode sections in max-tpt.yaml, gb300-fp4, and gb200-fp8 files) include this parameter. Additionally, to maintain consistency with other throughput-optimized configs (max-tpt.yaml variants), convert the parallelism parameters from long form to short form: data-parallel-size → dp-size, tensor-parallel-size → tp-size, expert-parallel-size → ep-size.
🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml` around lines 83 - 105, Add the
missing data-parallel-size setting to the decode block and normalize parallelism
keys to the short forms used elsewhere: replace tensor-parallel-size with
tp-size, expert-parallel-size with ep-size, and data-parallel-size with dp-size
(set dp-size to 1 to match prefill). Update the decode mapping (the decode
section that contains served-model-name, model-path, etc.) to use dp-size,
tp-size, and ep-size instead of the long-form names so it matches max-tpt.yaml
and other gb*-files.
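Applying this prompt to the decode block quoted above would yield something like the following sketch (unrelated keys elided; dp-size of 1 is the reviewer's suggested assumption, carried over from prefill):

```yaml
decode:
  # ...other keys unchanged...
  moe-dense-tp-size: 1
  dp-size: 1   # added; assumption carried over from the prefill section
  tp-size: 4   # was tensor-parallel-size
  ep-size: 1   # was expert-parallel-size
```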
```yaml
# Quantization
quantization: "modelopt_fp4"
moe-runner-backend: "flashinfer_cutedsl"
```
Possible typo: `flashinfer_cutedsl` should likely be `flashinfer_cutlass`.

The prefill section (line 74) uses `flashinfer_cutlass`, but the decode section uses `flashinfer_cutedsl`. This appears to be a typo that could cause a runtime failure if the backend name is not recognized.
Proposed fix:

```diff
- moe-runner-backend: "flashinfer_cutedsl"
+ moe-runner-backend: "flashinfer_cutlass"
```
🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/max-tpt.yaml` at line 124, replace the mistyped backend
string "flashinfer_cutedsl" with the correct "flashinfer_cutlass" so the decode
section uses the same moe-runner-backend value as the prefill section; locate
the moe-runner-backend entry in the decode block (and any other occurrences) and
update the value to "flashinfer_cutlass" to avoid runtime backend lookup
failures.
These are preliminary 1k/8k recipes for GB200 DSR1-FP4. The recipes were modified from the 1k/1k versions to get preliminary results.
Summary by CodeRabbit