📝 Walkthrough

Adds nine new YAML deployment manifests for gb300-fp4 across three sequence-length scenarios (1k1k, 1k8k, 8k1k) and three optimization modes (low-latency, max-tpt, mid-curve), specifying frontend, model, resources, backend environments, sglang prefill/decode settings, and benchmarks.

Changes

Sequence Diagram(s): omitted — changes are configuration additions without new multi-component control-flow logic.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp4/1k1k/low_latency.yaml`:
- Around line 8-9: The inline comment for the num_additional_frontends setting
is truncated; update the comment on the num_additional_frontends line (near
enable_multiple_frontends and num_additional_frontends) to complete the sentence
and clarify the total routers formula (e.g., "# Additional routers (total = 1 +
num_additional_frontends)"), ensuring the comment fully explains that the total
routers equals the base frontend plus the additional count.
In `@recipes/gb300-fp4/1k1k/mid_curve.yaml`:
- Around line 8-9: The comment for num_additional_frontends is truncated; update
the inline comment next to the num_additional_frontends field to complete the
sentence (e.g., clarify total routers calculation like "# Additional routers
(total = 1 + num_additional_frontends)") so it accurately documents how the
total number of frontends is computed; locate the keys enable_multiple_frontends
and num_additional_frontends in the YAML and fix the comment text.
In `@recipes/gb300-fp4/1k8k/low-latency.yaml`:
- Around line 16-17: The resources.gpu_type value is inconsistent (set to
"gb200") for this gb300-fp4 recipe; update the resources block's gpu_type key
from "gb200" to "gb300" (or confirm and document if "gb200" was intentional) so
the gpu_type in this file matches the other gb300-fp4 configs.
In `@recipes/gb300-fp4/8k1k/low_latency.yaml`:
- Around line 8-9: The inline comment for num_additional_frontends is truncated;
update the comment for the num_additional_frontends field (next to
enable_multiple_frontends) to complete the sentence or remove it—e.g., change `#
Additional routers (total = 1 + t` to `# Additional routers (total = 1 +
num_additional_frontends)` or simply `# Number of additional routers` so the
intent is clear; ensure the symbol names enable_multiple_frontends and
num_additional_frontends remain unchanged.
🧹 Nitpick comments (4)
recipes/gb300-fp4/1k8k/max-tpt.yaml (1)
111-113: Consider using consistent parallelism key naming across config files.

This file uses short-form keys (`tp-size`, `dp-size`, `ep-size`) while other files like `1k1k/low_latency.yaml` and `8k1k/low_latency.yaml` use long-form keys (`tensor-parallel-size`, `data-parallel-size`, `expert-parallel-size`). Consider standardizing across all configs for maintainability.

recipes/gb300-fp4/1k1k/max_tpt.yaml (2)
26-63: Consider YAML anchors to keep prefill/decode env blocks in sync.

There’s a lot of duplication between `prefill_environment` and `decode_environment`, which risks drift on future edits. A YAML anchor/merge can reduce that.

♻️ Example refactor using anchors (if your loader supports merge keys):

```diff
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+  common_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+  prefill_environment:
+    <<: *common_environment
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    <<: *common_environment
     SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
     SGLANG_MOE_NVFP4_DISPATCH: "1"
```
65-115: Ensure deep_gemm_precompile uses the same flags as this prefill config.

If your pipeline runs `sglang.deep_gemm_precompile`, it must mirror the runtime flags here (e.g., served-model-name, model path, quantization, backends). Please verify the precompile invocation is sourcing this recipe’s exact flags.

Based on learnings: In SGLang backend (src/srtctl/backends/sglang.py), the sglang.deep_gemm_precompile module requires all the same flags as the actual serving run (including --served-model-name, --model-path, etc.) because it needs to mirror the runtime configuration to precompile the correct kernels.
recipes/gb300-fp4/1k8k/mid-curve.yaml (1)

25-62: Consider anchoring the shared env vars to prevent drift.

These env blocks are nearly identical; a YAML anchor/merge would keep them aligned and reduce copy/paste edits.
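A minimal sketch of the anchor/merge pattern this suggestion refers to (assuming the config loader supports YAML 1.1 merge keys; the variable names shown are abbreviated for illustration, not the full env list):

```yaml
backend:
  common_environment: &common_environment
    PYTHONUNBUFFERED: "1"
    NCCL_MNNVL_ENABLE: "1"
    # ...remaining shared vars...
  prefill_environment:
    <<: *common_environment
  decode_environment:
    <<: *common_environment
    SGLANG_MOE_NVFP4_DISPATCH: "1"  # decode-only additions stay local
```

Note that merge keys (`<<:`) are a YAML 1.1 feature; loaders targeting YAML 1.2 or strict schema validators may reject them, so verify support before adopting this.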
```yaml
enable_multiple_frontends: true  # Enable nginx + multiple routers
num_additional_frontends: 4      # Additional routers (total = 1 + t
```
Truncated comment.

Same as other files — comment on line 9 is incomplete: `# Additional routers (total = 1 + t`.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/1k1k/low_latency.yaml` around lines 8 - 9, The inline
comment for the num_additional_frontends setting is truncated; update the
comment on the num_additional_frontends line (near enable_multiple_frontends and
num_additional_frontends) to complete the sentence and clarify the total routers
formula (e.g., "# Additional routers (total = 1 + num_additional_frontends)"),
ensuring the comment fully explains that the total routers equals the base
frontend plus the additional count.
```yaml
enable_multiple_frontends: true  # Enable nginx + multiple routers
num_additional_frontends: 9      # Additional routers (total = 1 + t
```
Truncated comment.

Line 9 has an incomplete comment: `# Additional routers (total = 1 + t`.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/1k1k/mid_curve.yaml` around lines 8 - 9, The comment for
num_additional_frontends is truncated; update the inline comment next to the
num_additional_frontends field to complete the sentence (e.g., clarify total
routers calculation like "# Additional routers (total = 1 +
num_additional_frontends)") so it accurately documents how the total number of
frontends is computed; locate the keys enable_multiple_frontends and
num_additional_frontends in the YAML and fix the comment text.
```yaml
resources:
  gpu_type: "gb200"
```
Inconsistent gpu_type: "gb200" in a gb300-fp4 config.
This file specifies gpu_type: "gb200" but is located in the recipes/gb300-fp4/ directory. All other files in this PR use gpu_type: "gb300". Please verify this is intentional or correct to "gb300" for consistency.
Suggested fix if this should be gb300:

```diff
 resources:
-  gpu_type: "gb200"
+  gpu_type: "gb300"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
resources:
  gpu_type: "gb300"
```
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/1k8k/low-latency.yaml` around lines 16 - 17, The
resources.gpu_type value is inconsistent (set to "gb200") for this gb300-fp4
recipe; update the resources block's gpu_type key from "gb200" to "gb300" (or
confirm and document if "gb200" was intentional) so the gpu_type in this file
matches the other gb300-fp4 configs.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp4/8k1k/low_latency.yaml`:
- Line 1: The YAML name field currently has a duplicated "8k1k" ("name:
\"gb300-8k1k-fp4-low-latency-8k1k\""); edit the name value to remove the
redundancy and match the repository pattern (e.g., change it to
"gb300-fp4-8k1k-low-latency"), updating the name field in this file so it
follows the same naming convention as other configs.
🧹 Nitpick comments (1)
recipes/gb300-fp4/1k8k/max-tpt.yaml (1)

1-248: Configuration is well-structured.

The configuration correctly sets up a 1k8k max-throughput deployment with appropriate parallelism settings (tp/dp/ep-size: 48, matching 12 decode nodes × 4 GPUs).
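The GPU-count arithmetic behind that parallelism claim can be stated mechanically (the node and per-node counts below are taken from the review note above, not read from the config itself):

```python
# Sanity check: the parallelism degree should equal the total decode GPU
# count implied by the topology (assumed: 12 decode nodes x 4 GPUs each).
decode_nodes = 12
gpus_per_node = 4
total_decode_gpus = decode_nodes * gpus_per_node

parallel_size = 48  # tp-size / dp-size / ep-size in the recipe
assert total_decode_gpus == parallel_size
```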
Note: The filename uses hyphens (`max-tpt.yaml`) while files in the `1k1k/` and `8k1k/` directories use underscores (`max_tpt.yaml`). Consider aligning filename conventions across all directories for consistency.
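To make the convention check concrete, a hypothetical helper (the function name and file list are illustrative, not part of this PR) that classifies each recipe's separator style and flags a mixed tree:

```python
# Hypothetical consistency check: classify the word separator used in
# each recipe basename and flag the tree if styles are mixed.
RECIPES = [
    "1k1k/max_tpt.yaml",  # underscore style
    "1k8k/max-tpt.yaml",  # hyphen style
    "8k1k/max_tpt.yaml",  # underscore style
]

def separator_style(path: str) -> str:
    """Return 'hyphen', 'underscore', or 'mixed' for a recipe basename."""
    base = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    if "-" in base and "_" in base:
        return "mixed"
    return "hyphen" if "-" in base else "underscore"

# More than one style across directories means the naming is inconsistent.
styles = {separator_style(p) for p in RECIPES}
inconsistent = len(styles) > 1
```

In a real check the list would come from globbing `recipes/` with `pathlib`, but the classification logic is the same.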
```diff
@@ -0,0 +1,121 @@
+name: "gb300-8k1k-fp4-low-latency-8k1k"
```
Redundant "8k1k" in configuration name.
The name "gb300-8k1k-fp4-low-latency-8k1k" contains "8k1k" twice. Other configs in this PR follow patterns like "gb300-fp4-8k1k-max-tpt".
Suggested fix:

```diff
-name: "gb300-8k1k-fp4-low-latency-8k1k"
+name: "gb300-fp4-8k1k-low-latency"
```

📝 Committable suggestion
```yaml
name: "gb300-fp4-8k1k-low-latency"
```
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/8k1k/low_latency.yaml` at line 1, The YAML name field
currently has a duplicated "8k1k" ("name: \"gb300-8k1k-fp4-low-latency-8k1k\"");
edit the name value to remove the redundancy and match the repository pattern
(e.g., change it to "gb300-fp4-8k1k-low-latency"), updating the name field in
this file so it follows the same naming convention as other configs.
* Update GB300-FP4 configs
* Add nginx to GB300-FP4 configs

---------

Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Summary by CodeRabbit