Conversation
📝 Walkthrough

Adds three new top-level YAML recipes for GB200 FP4 disaggregated inference (low-latency, max-throughput, mid-curve), each defining multi-stage prefill/decode configurations, environment flags, SGLang tuning (quantization, TP/DP/EP, CUDA graph), resource topology, and SA-bench benchmarks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Prefill_Node
    participant Decode_Node
    participant Model_Storage
    Client->>Prefill_Node: Send prefill request (tokens, context)
    Prefill_Node->>Model_Storage: Load model shard / weights
    Prefill_Node->>Prefill_Node: Populate KV-cache (prefill mode)
    Prefill_Node->>Decode_Node: Forward KV-cache + request metadata
    Decode_Node->>Model_Storage: Load decode-stage shards
    Decode_Node->>Decode_Node: Run decode loop (cuda-graph / TP/DP/EP)
    Decode_Node->>Client: Return tokens / responses
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 79-81: Replace the long-form parallelism keys with the short-form
ones used elsewhere: change data-parallel-size to dp-size, tensor-parallel-size
to tp-size, and expert-parallel-size to ep-size (also update any other
occurrences of the long names, such as those in the second block of the same
file) so the config uses dp-size, tp-size, and ep-size consistently with the
rest of the PR.
- Around line 69-72: The config sets max-total-tokens to 8192 which undercuts
the benchmark requirement (isl=1024 + osl=8192 = 9216) and conflicts with
context-length=9200; update the max-total-tokens entry to at least 9216 so
requests aren't truncated or rejected, and verify related fields like
context-length and chunked-prefill-size (currently 9200 and 8192) remain
consistent with the new total token budget.
In `@recipies/gb200-fp4/1k8k/mid-curve.yaml`:
- Around line 1-2: The top-of-file comment says "Does not use single batch
overlap" but the prefill section currently sets enable-single-batch-overlap to
true; update either the comment or the setting so they match: either set
enable-single-batch-overlap to false (to match the comment) or change the
header comment to state that single batch overlap is enabled, so the file
header and the prefill flag are consistent.
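If the header comment is taken as authoritative, the top of mid-curve.yaml would look like this (a minimal sketch; all other keys omitted):

```yaml
# Does not use single batch overlap
prefill:
  enable-single-batch-overlap: false  # was true; changed to match the header comment
```

Alternatively, keep the flag at true and reword the comment; either choice resolves the mismatch.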
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml`:
- Around line 83-105: Add the missing data-parallel-size setting to the decode
block and normalize parallelism keys to the short forms used elsewhere: replace
tensor-parallel-size with tp-size, expert-parallel-size with ep-size, and
data-parallel-size with dp-size (set dp-size to 1 to match prefill). Update the
decode mapping (the decode section that contains served-model-name, model-path,
etc.) to use dp-size, tp-size, and ep-size instead of the long-form names so it
matches max-tpt.yaml and other gb*-files.
In `@recipies/gb200-fp4/1k8k/max-tpt.yaml`:
- Line 124: Replace the mistyped backend string "flashinfer_cutedsl" with the
correct "flashinfer_cutlass" so the decode section uses the same
moe-runner-backend value as the prefill section; locate the moe-runner-backend
entry in the decode block (and any other occurrences) and update the value to
"flashinfer_cutlass" to avoid runtime backend lookup failures.
♻️ Duplicate comments (1)

recipies/gb200-fp4/1k8k/low-latency.yaml (1)

69-81: Issues already flagged in previous review. The `max-total-tokens` value (line 71) and parallelism key naming (lines 79-81) have been addressed in prior review comments.
```yaml
decode:
  disaggregation-mode: "decode"
  served-model-name: "deepseek-ai/DeepSeek-R1"
  model-path: "/model/"
  prefill-round-robin-balance: true
  trust-remote-code: true
  disable-radix-cache: true
  kv-cache-dtype: "fp8_e4m3"
  attention-backend: "trtllm_mla"
  quantization: "modelopt_fp4"
  moe-runner-backend: "flashinfer_trtllm"
  disaggregation-bootstrap-port: 30001
  stream-interval: 10
  watchdog-timeout: 1000000
  context-length: 9200
  mem-fraction-static: 0.95
  chunked-prefill-size: 8192
  cuda-graph-max-bs: 256
  scheduler-recv-interval: 10
  moe-dense-tp-size: 1
  tensor-parallel-size: 4
  expert-parallel-size: 1
```
🧩 Analysis chain

🏁 Script executed:

```shell
rg -n "dp-size|data-parallel-size" --type yaml -A 2 -B 2
```

Repository: ishandhanani/srt-slurm
Length of output: 13848

🏁 Script executed:

```shell
# Also check the full context of the file in question
cat -n recipies/gb200-fp4/1k8k/low-latency.yaml | head -120
```

Repository: ishandhanani/srt-slurm
Length of output: 4229

🏁 Script executed:

```shell
# Search for other decode sections to see patterns
rg -n "^\s+decode:" --type yaml -A 20
```

Repository: ishandhanani/srt-slurm
Length of output: 45703
Add `data-parallel-size` to the decode section and align parallelism parameter naming.

The decode section omits `data-parallel-size`, while the prefill section explicitly sets it to 1. All other parallelism configs in the codebase (decode sections in max-tpt.yaml, gb300-fp4, and gb200-fp8 files) include this parameter. Additionally, to maintain consistency with other throughput-optimized configs (max-tpt.yaml variants), convert the parallelism parameters from long form to short form: data-parallel-size → dp-size, tensor-parallel-size → tp-size, expert-parallel-size → ep-size.
🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/low-latency.yaml` around lines 83 - 105, Add the
missing data-parallel-size setting to the decode block and normalize parallelism
keys to the short forms used elsewhere: replace tensor-parallel-size with
tp-size, expert-parallel-size with ep-size, and data-parallel-size with dp-size
(set dp-size to 1 to match prefill). Update the decode mapping (the decode
section that contains served-model-name, model-path, etc.) to use dp-size,
tp-size, and ep-size instead of the long-form names so it matches max-tpt.yaml
and other gb*-files.
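Applying this prompt to the decode block quoted above would yield something like the following sketch (unrelated keys elided; dp-size of 1 is the reviewer's suggested assumption, carried over from prefill):

```yaml
decode:
  # ...other keys unchanged...
  moe-dense-tp-size: 1
  dp-size: 1   # added; assumption carried over from the prefill section
  tp-size: 4   # was tensor-parallel-size
  ep-size: 1   # was expert-parallel-size
```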
```yaml
# Quantization
quantization: "modelopt_fp4"
moe-runner-backend: "flashinfer_cutedsl"
```
Possible typo: `flashinfer_cutedsl` should likely be `flashinfer_cutlass`.

The prefill section (line 74) uses `flashinfer_cutlass`, but the decode section uses `flashinfer_cutedsl`. This appears to be a typo that could cause a runtime failure if the backend name is not recognized.
Proposed fix:

```diff
- moe-runner-backend: "flashinfer_cutedsl"
+ moe-runner-backend: "flashinfer_cutlass"
```
🤖 Prompt for AI Agents
In `@recipies/gb200-fp4/1k8k/max-tpt.yaml` at line 124, replace the mistyped backend
string "flashinfer_cutedsl" with the correct "flashinfer_cutlass" so the decode
section uses the same moe-runner-backend value as the prefill section; locate
the moe-runner-backend entry in the decode block (and any other occurrences) and
update the value to "flashinfer_cutlass" to avoid runtime backend lookup
failures.
These are preliminary 1k/8k recipes for GB200 DSR1-FP4. The recipes were modified from the 1k/1k versions to get preliminary results.
Summary by CodeRabbit