
Add GB200 DSR1-FP4 1k/8k recipies#85

Merged
ishandhanani merged 3 commits into main from kylliang/dsr1_fp4_1k8k
Jan 21, 2026

Conversation

@kyleliang-nv
Collaborator

@kyleliang-nv kyleliang-nv commented Jan 21, 2026

This PR adds preliminary 1k/8k recipes for GB200 DSR1-FP4. The recipes were modified from the 1k/1k versions to get preliminary results.

Summary by CodeRabbit

  • New Features
    • Added three GB200 FP4 configuration profiles (low-latency, max-throughput, balanced) for multi-stage inference. Each profile provides prefill/decode staging, resource and GPU/node allocation, environment and model-serving tuning, memory/token limits, parallelism/optimization options, and benchmark settings for performance validation.


@coderabbitai
Contributor

coderabbitai bot commented Jan 21, 2026

📝 Walkthrough

Walkthrough

Adds three new top-level YAML recipes for GB200 FP4 disaggregated inference (low-latency, max-throughput, mid-curve), each defining multi-stage prefill/decode configurations, environment flags, SGLang tuning (quantization, TP/DP/EP, CUDA-graph), resource topology, and SA-bench benchmarks.

Changes

Cohort / File(s) Summary
GB200 FP4 Disaggregated Recipes
recipies/gb200-fp4/1k8k/low-latency.yaml, recipies/gb200-fp4/1k8k/max-tpt.yaml, recipies/gb200-fp4/1k8k/mid-curve.yaml
Three new configuration artifacts introducing disaggregated prefill/decode pipelines. Each file specifies model/container/precision, resource layouts (prefill/decode nodes & GPUs), separate environment blocks, detailed sglang_config (prefill & decode subsections with quantization, KV/cache, TP/DP/EP sizing, CUDA-graph, memory/token limits, timeouts), and benchmark (sa-bench) parameters. Differences are tuning-focused per profile (latency vs throughput vs mid-curve).

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Prefill_Node
  participant Decode_Node
  participant Model_Storage

  Client->>Prefill_Node: Send prefill request (tokens, context)
  Prefill_Node->>Model_Storage: Load model shard / weights
  Prefill_Node->>Prefill_Node: Populate KV-cache (prefill mode)
  Prefill_Node->>Decode_Node: Forward KV-cache + request metadata
  Decode_Node->>Model_Storage: Load decode-stage shards
  Decode_Node->>Decode_Node: Run decode loop (cuda-graph / TP/DP/EP)
  Decode_Node->>Client: Return tokens / responses

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped in to tune and play,
Prefill then decode — hip hooray!
FP4 whispers through the mesh,
GB200 hums in vibrant flesh,
Small configs, big hops today. 🥕✨

🚥 Pre-merge checks | ✅ 3 passed

- Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
- Title Check: ✅ Passed. The title accurately summarizes the main change: adding GB200 DSR1-FP4 1k/8k configuration recipes, which matches the three new YAML files being introduced.
- Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.



Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@kyleliang-nv
Collaborator Author

The container field needs to be set to ToT SGLang; however, I don't know how to specify that. @ishandhanani

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 79-81: Replace the long-form parallelism keys with the short-form
ones used elsewhere: change data-parallel-size to dp-size, tensor-parallel-size
to tp-size, and expert-parallel-size to ep-size (also update any other
occurrences of the long names such as the ones around the second block using the
same file) so the config uses dp-size, tp-size, and ep-size consistently with
the rest of the PR.
- Around line 69-72: The config sets max-total-tokens to 8192 which undercuts
the benchmark requirement (isl=1024 + osl=8192 = 9216) and conflicts with
context-length=9200; update the max-total-tokens entry to at least 9216 so
requests aren't truncated or rejected, and verify related fields like
context-length and chunked-prefill-size (currently 9200 and 8192) remain
consistent with the new total token budget.
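The token-budget arithmetic behind this fix can be sketched as a small check (a sketch, not project code; the dictionary keys mirror the recipe field names, and the flagged values are taken from the review comment above):

```python
# Sanity check for the token budget in low-latency.yaml: the benchmark's
# input sequence length (isl) plus output sequence length (osl) must fit
# within max-total-tokens, and context-length must cover it as well.
isl = 1024            # benchmark input sequence length
osl = 8192            # benchmark output sequence length
required = isl + osl  # 9216 tokens per request

# Values currently flagged in the review.
config = {
    "max-total-tokens": 8192,
    "context-length": 9200,
    "chunked-prefill-size": 8192,
}

def budget_errors(cfg):
    """Return a list of human-readable token-budget violations."""
    errors = []
    if cfg["max-total-tokens"] < required:
        errors.append(
            f"max-total-tokens={cfg['max-total-tokens']} < isl+osl={required}"
        )
    if cfg["context-length"] < required:
        errors.append(
            f"context-length={cfg['context-length']} < isl+osl={required}"
        )
    return errors

# The flagged config trips both checks (8192 and 9200 are below 9216);
# raising both fields to at least 9216 clears them.
print(budget_errors(config))
```

Note that context-length: 9200 is also below the 9216-token requirement, which is why the review asks to verify the related fields rather than only bumping max-total-tokens.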

In `@recipies/gb200-fp4/1k8k/mid-curve.yaml`:
- Around line 1-2: The top-of-file comment says "Does not use single batch
overlap" but the prefill section currently sets enable-single-batch-overlap to
true; update either the comment or the setting so they match: locate the prefill
section and the enable-single-batch-overlap flag and either set
enable-single-batch-overlap to false (to match the comment) or change the header
comment to state that single batch overlap is enabled; ensure the decision is
reflected consistently in both the file header and the prefill
enable-single-batch-overlap value.
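If the header comment is kept and the flag changed to match it, the relevant part of mid-curve.yaml would look roughly like this (a sketch only; the exact nesting under sglang_config is assumed from the walkthrough's description of the file layout):

```yaml
# Mid-curve profile. Does not use single batch overlap.
sglang_config:
  prefill:
    enable-single-batch-overlap: false  # now consistent with the header comment
```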

Comment thread recipies/gb200-fp4/1k8k/low-latency.yaml
Comment thread recipies/gb200-fp4/1k8k/low-latency.yaml
Comment thread recipies/gb200-fp4/1k8k/mid-curve.yaml
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 83-105: Add the missing data-parallel-size setting to the decode
block and normalize parallelism keys to the short forms used elsewhere: replace
tensor-parallel-size with tp-size, expert-parallel-size with ep-size, and
data-parallel-size with dp-size (set dp-size to 1 to match prefill). Update the
decode mapping (the decode section that contains served-model-name, model-path,
etc.) to use dp-size, tp-size, and ep-size instead of the long-form names so it
matches max-tpt.yaml and other gb*-files.

In `@recipies/gb200-fp4/1k8k/max-tpt.yaml`:
- Line 124: Replace the misspelled backend string "flashinfer_cutedsl" with the
correct "flashinfer_cutlass" so the decode section uses the same
moe-runner-backend value as the prefill section; locate the moe-runner-backend
entry in the decode block (and any other occurrences) and update the value to
"flashinfer_cutlass" to avoid runtime backend lookup failures.
♻️ Duplicate comments (1)
recipies/gb200-fp4/1k8k/low-latency.yaml (1)

69-81: Issues already flagged in previous review.

The max-total-tokens value (line 71) and parallelism key naming (lines 79-81) have been addressed in prior review comments.

Comment on lines +83 to +105
  decode:
    disaggregation-mode: "decode"
    served-model-name: "deepseek-ai/DeepSeek-R1"
    model-path: "/model/"
    prefill-round-robin-balance: true
    trust-remote-code: true
    disable-radix-cache: true
    kv-cache-dtype: "fp8_e4m3"
    attention-backend: "trtllm_mla"
    quantization: "modelopt_fp4"
    moe-runner-backend: "flashinfer_trtllm"
    disaggregation-bootstrap-port: 30001
    stream-interval: 10
    watchdog-timeout: 1000000
    context-length: 9200
    mem-fraction-static: 0.95
    chunked-prefill-size: 8192
    cuda-graph-max-bs: 256
    scheduler-recv-interval: 10
    moe-dense-tp-size: 1
    tensor-parallel-size: 4
    expert-parallel-size: 1



⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "dp-size|data-parallel-size" --type yaml -A 2 -B 2

Repository: ishandhanani/srt-slurm

Length of output: 13848


🏁 Script executed:

# Also check the full context of the file in question
cat -n recipies/gb200-fp4/1k8k/low-latency.yaml | head -120

Repository: ishandhanani/srt-slurm

Length of output: 4229


🏁 Script executed:

# Search for other decode sections to see patterns
rg -n "^\s+decode:" --type yaml -A 20

Repository: ishandhanani/srt-slurm

Length of output: 45703


Add data-parallel-size to decode section and align parallelism parameter naming.

The decode section omits data-parallel-size, while the prefill section explicitly sets it to 1. All other parallelism configs in the codebase (decode sections in max-tpt.yaml, gb300-fp4, and gb200-fp8 files) include this parameter. Additionally, to maintain consistency with other throughput-optimized configs (max-tpt.yaml variants), convert the parallelism parameters from long form to short form: data-parallel-size → dp-size, tensor-parallel-size → tp-size, expert-parallel-size → ep-size.

🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml` around lines 83 - 105, Add the
missing data-parallel-size setting to the decode block and normalize parallelism
keys to the short forms used elsewhere: replace tensor-parallel-size with
tp-size, expert-parallel-size with ep-size, and data-parallel-size with dp-size
(set dp-size to 1 to match prefill). Update the decode mapping (the decode
section that contains served-model-name, model-path, etc.) to use dp-size,
tp-size, and ep-size instead of the long-form names so it matches max-tpt.yaml
and other gb*-files.
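Applied to the decode block quoted above, the rename would look roughly like this (a sketch; only the parallelism keys change, and the other decode fields stay as in the original):

```yaml
  decode:
    # ... other fields unchanged ...
    moe-dense-tp-size: 1
    dp-size: 1   # added; matches the prefill section
    tp-size: 4   # was tensor-parallel-size
    ep-size: 1   # was expert-parallel-size
```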


      # Quantization
      quantization: "modelopt_fp4"
      moe-runner-backend: "flashinfer_cutedsl"


⚠️ Potential issue | 🟠 Major

Possible typo: flashinfer_cutedsl should likely be flashinfer_cutlass.

The prefill section (line 74) uses flashinfer_cutlass, but the decode section uses flashinfer_cutedsl. This appears to be a typo that could cause a runtime failure if the backend name is not recognized.

Proposed fix
-      moe-runner-backend: "flashinfer_cutedsl"
+      moe-runner-backend: "flashinfer_cutlass"
🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/max-tpt.yaml` at line 124, replace the misspelled backend
string "flashinfer_cutedsl" with the correct "flashinfer_cutlass" so the decode
section uses the same moe-runner-backend value as the prefill section; locate
the moe-runner-backend entry in the decode block (and any other occurrences) and
update the value to "flashinfer_cutlass" to avoid runtime backend lookup
failures.

