📝 Walkthrough

Adds nine new YAML deployment manifests for gb300-fp4 across three sequence-length scenarios (1k1k, 1k8k, 8k1k) and three optimization modes (low-latency, max-tpt, mid-curve), specifying frontend, model, resources, backend environments, sglang prefill/decode settings, and benchmarks.

Changes

Sequence Diagram(s): omitted — changes are configuration additions without new multi-component control-flow logic.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp4/1k1k/low_latency.yaml`:
- Around line 8-9: The inline comment for the num_additional_frontends setting
is truncated; update the comment on the num_additional_frontends line (near
enable_multiple_frontends and num_additional_frontends) to complete the sentence
and clarify the total routers formula (e.g., "# Additional routers (total = 1 +
num_additional_frontends)"), ensuring the comment fully explains that the total
routers equals the base frontend plus the additional count.
In `@recipes/gb300-fp4/1k1k/mid_curve.yaml`:
- Around line 8-9: The comment for num_additional_frontends is truncated; update
the inline comment next to the num_additional_frontends field to complete the
sentence (e.g., clarify total routers calculation like "# Additional routers
(total = 1 + num_additional_frontends)") so it accurately documents how the
total number of frontends is computed; locate the keys enable_multiple_frontends
and num_additional_frontends in the YAML and fix the comment text.
In `@recipes/gb300-fp4/1k8k/low-latency.yaml`:
- Around line 16-17: The resources.gpu_type value is inconsistent (set to
"gb200") for this gb300-fp4 recipe; update the resources block's gpu_type key
from "gb200" to "gb300" (or confirm and document if "gb200" was intentional) so
the gpu_type in this file matches the other gb300-fp4 configs.
In `@recipes/gb300-fp4/8k1k/low_latency.yaml`:
- Around line 8-9: The inline comment for num_additional_frontends is truncated;
update the comment for the num_additional_frontends field (next to
enable_multiple_frontends) to complete the sentence or remove it—e.g., change `#
Additional routers (total = 1 + t` to `# Additional routers (total = 1 +
num_additional_frontends)` or simply `# Number of additional routers` so the
intent is clear; ensure the symbol names enable_multiple_frontends and
num_additional_frontends remain unchanged.
🧹 Nitpick comments (4)
recipes/gb300-fp4/1k8k/max-tpt.yaml (1)
111-113: Consider using consistent parallelism key naming across config files.

This file uses short-form keys (`tp-size`, `dp-size`, `ep-size`) while other files like `1k1k/low_latency.yaml` and `8k1k/low_latency.yaml` use long-form keys (`tensor-parallel-size`, `data-parallel-size`, `expert-parallel-size`). Consider standardizing across all configs for maintainability.

recipes/gb300-fp4/1k1k/max_tpt.yaml (2)
26-63: Consider YAML anchors to keep prefill/decode env blocks in sync.

There’s a lot of duplication between `prefill_environment` and `decode_environment`, which risks drift on future edits. A YAML anchor/merge can reduce that.

♻️ Example refactor using anchors (if your loader supports merge keys):

```diff
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+  common_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+  prefill_environment:
+    <<: *common_environment
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    <<: *common_environment
     SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
     SGLANG_MOE_NVFP4_DISPATCH: "1"
```
65-115: Ensure deep_gemm_precompile uses the same flags as this prefill config.

If your pipeline runs `sglang.deep_gemm_precompile`, it must mirror the runtime flags here (e.g., served-model-name, model path, quantization, backends). Please verify the precompile invocation is sourcing this recipe’s exact flags.

Based on learnings: In SGLang backend (src/srtctl/backends/sglang.py), the sglang.deep_gemm_precompile module requires all the same flags as the actual serving run (including --served-model-name, --model-path, etc.) because it needs to mirror the runtime configuration to precompile the correct kernels.
recipes/gb300-fp4/1k8k/mid-curve.yaml (1)

25-62: Consider anchoring the shared env vars to prevent drift.

These env blocks are nearly identical; a YAML anchor/merge would keep them aligned and reduce copy/paste edits.
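A minimal sketch of the anchor/merge pattern this suggestion refers to (assuming the config loader supports YAML 1.1 merge keys; the variable names shown are abbreviated for illustration, not the full env list):

```yaml
backend:
  common_environment: &common_environment
    PYTHONUNBUFFERED: "1"
    NCCL_MNNVL_ENABLE: "1"
    # ...remaining shared vars...
  prefill_environment:
    <<: *common_environment
  decode_environment:
    <<: *common_environment
    SGLANG_MOE_NVFP4_DISPATCH: "1"  # decode-only additions stay local
```

Note that merge keys (`<<:`) are a YAML 1.1 feature; loaders targeting YAML 1.2 or strict schema validators may reject them, so verify support before adopting this.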
```yaml
enable_multiple_frontends: true  # Enable nginx + multiple routers
num_additional_frontends: 4      # Additional routers (total = 1 + t
```
Truncated comment.

Same as other files — comment on line 9 is incomplete: `# Additional routers (total = 1 + t`.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/1k1k/low_latency.yaml` around lines 8 - 9, The inline
comment for the num_additional_frontends setting is truncated; update the
comment on the num_additional_frontends line (near enable_multiple_frontends and
num_additional_frontends) to complete the sentence and clarify the total routers
formula (e.g., "# Additional routers (total = 1 + num_additional_frontends)"),
ensuring the comment fully explains that the total routers equals the base
frontend plus the additional count.
```yaml
enable_multiple_frontends: true  # Enable nginx + multiple routers
num_additional_frontends: 9      # Additional routers (total = 1 + t
```
Truncated comment.

Line 9 has an incomplete comment: `# Additional routers (total = 1 + t`.
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/1k1k/mid_curve.yaml` around lines 8 - 9, The comment for
num_additional_frontends is truncated; update the inline comment next to the
num_additional_frontends field to complete the sentence (e.g., clarify total
routers calculation like "# Additional routers (total = 1 +
num_additional_frontends)") so it accurately documents how the total number of
frontends is computed; locate the keys enable_multiple_frontends and
num_additional_frontends in the YAML and fix the comment text.
```yaml
resources:
  gpu_type: "gb200"
```
Inconsistent gpu_type: "gb200" in a gb300-fp4 config.
This file specifies gpu_type: "gb200" but is located in the recipes/gb300-fp4/ directory. All other files in this PR use gpu_type: "gb300". Please verify this is intentional or correct to "gb300" for consistency.
Suggested fix if this should be gb300:

```diff
 resources:
-  gpu_type: "gb200"
+  gpu_type: "gb300"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
resources:
  gpu_type: "gb300"
```
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/1k8k/low-latency.yaml` around lines 16 - 17, The
resources.gpu_type value is inconsistent (set to "gb200") for this gb300-fp4
recipe; update the resources block's gpu_type key from "gb200" to "gb300" (or
confirm and document if "gb200" was intentional) so the gpu_type in this file
matches the other gb300-fp4 configs.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@recipes/gb300-fp4/8k1k/low_latency.yaml`:
- Line 1: The YAML name field currently has a duplicated "8k1k" ("name:
\"gb300-8k1k-fp4-low-latency-8k1k\""); edit the name value to remove the
redundancy and match the repository pattern (e.g., change it to
"gb300-fp4-8k1k-low-latency"), updating the name field in this file so it
follows the same naming convention as other configs.
🧹 Nitpick comments (1)
recipes/gb300-fp4/1k8k/max-tpt.yaml (1)

1-248: Configuration is well-structured.

The configuration correctly sets up a 1k8k max-throughput deployment with appropriate parallelism settings (tp/dp/ep-size: 48, matching 12 decode nodes × 4 GPUs).
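The GPU-count arithmetic behind that parallelism claim can be stated mechanically (the node and per-node counts below are taken from the review note above, not read from the config itself):

```python
# Sanity check: the parallelism degree should equal the total decode GPU
# count implied by the topology (assumed: 12 decode nodes x 4 GPUs each).
decode_nodes = 12
gpus_per_node = 4
total_decode_gpus = decode_nodes * gpus_per_node

parallel_size = 48  # tp-size / dp-size / ep-size in the recipe
assert total_decode_gpus == parallel_size
```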
Note: The filename uses hyphens (`max-tpt.yaml`) while files in the `1k1k/` and `8k1k/` directories use underscores (`max_tpt.yaml`). Consider aligning filename conventions across all directories for consistency.
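To make the convention check concrete, a hypothetical helper (the function name and file list are illustrative, not part of this PR) that classifies each recipe's separator style and flags a mixed tree:

```python
# Hypothetical consistency check: classify the word separator used in
# each recipe basename and flag the tree if styles are mixed.
RECIPES = [
    "1k1k/max_tpt.yaml",  # underscore style
    "1k8k/max-tpt.yaml",  # hyphen style
    "8k1k/max_tpt.yaml",  # underscore style
]

def separator_style(path: str) -> str:
    """Return 'hyphen', 'underscore', or 'mixed' for a recipe basename."""
    base = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    if "-" in base and "_" in base:
        return "mixed"
    return "hyphen" if "-" in base else "underscore"

# More than one style across directories means the naming is inconsistent.
styles = {separator_style(p) for p in RECIPES}
inconsistent = len(styles) > 1
```

In a real check the list would come from globbing `recipes/` with `pathlib`, but the classification logic is the same.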
```diff
@@ -0,0 +1,121 @@
+name: "gb300-8k1k-fp4-low-latency-8k1k"
```
Redundant "8k1k" in configuration name.
The name "gb300-8k1k-fp4-low-latency-8k1k" contains "8k1k" twice. Other configs in this PR follow patterns like "gb300-fp4-8k1k-max-tpt".
Suggested fix:

```diff
-name: "gb300-8k1k-fp4-low-latency-8k1k"
+name: "gb300-fp4-8k1k-low-latency"
```

📝 Committable suggestion
```yaml
name: "gb300-fp4-8k1k-low-latency"
```
🤖 Prompt for AI Agents
In `@recipes/gb300-fp4/8k1k/low_latency.yaml` at line 1, The YAML name field
currently has a duplicated "8k1k" ("name: \"gb300-8k1k-fp4-low-latency-8k1k\"");
edit the name value to remove the redundancy and match the repository pattern
(e.g., change it to "gb300-fp4-8k1k-low-latency"), updating the name field in
this file so it follows the same naming convention as other configs.
* Update GB300-FP4 configs
* Add nginx to GB300-FP4 configs

---------

Co-authored-by: ishandhanani <ishandhanani@gmail.com>
Summary by CodeRabbit