This repository was archived by the owner on Apr 20, 2026. It is now read-only.

Update GB300-FP4 configs #131

Merged
ishandhanani merged 3 commits into main from kylliang/update_gb300_configs
Feb 4, 2026

Conversation

@kyleliang-nv
Collaborator

@kyleliang-nv kyleliang-nv commented Feb 3, 2026

Summary by CodeRabbit

  • New Features
    • Added multiple FP4 deployment profiles for GB300 GPU clusters covering 1k1k, 1k8k, and 8k1k sequence-length combinations.
    • Introduced three optimization profiles: low-latency, max-throughput, and mid-curve for each target scenario.
    • Enabled multi-frontend routing and integrated end-to-end benchmarking presets for performance validation.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

Adds nine new YAML deployment manifests for gb300-fp4 across three sequence-length scenarios (1k1k, 1k8k, 8k1k) and three optimization modes (low-latency, max-tpt, mid-curve), specifying frontend, model, resources, backend environments, sglang prefill/decode settings, and benchmarks.

Changes

Cohort / File(s) Summary
GB300 FP4 1k1k
recipes/gb300-fp4/1k1k/low_latency.yaml, recipes/gb300-fp4/1k1k/max_tpt.yaml, recipes/gb300-fp4/1k1k/mid_curve.yaml
Added three full deployment manifests for 1k1k experiments. Each defines dynamo frontend options (multi-frontend/nginx), model (path/container/FP4), resource counts, backend envs (prefill/decode), detailed sglang_config for prefill and decode, and sa-bench benchmark blocks.
GB300 FP4 1k8k
recipes/gb300-fp4/1k8k/low-latency.yaml, recipes/gb300-fp4/1k8k/max-tpt.yaml, recipes/gb300-fp4/1k8k/mid-curve.yaml
Added three manifests for 1k8k setups with adjusted GPU/resource counts and tuned sglang prefill/decode parameters, backend env variables, and benchmark settings for longer output scenarios.
GB300 FP4 8k1k
recipes/gb300-fp4/8k1k/low_latency.yaml, recipes/gb300-fp4/8k1k/max_tpt.yaml, recipes/gb300-fp4/8k1k/mid_curve.yaml
Added three manifests for 8k1k experiments. Each contains multi-frontend dynamo settings, FP4 model specs, resource allocations, extensive backend env tunables, sglang_config (prefill/decode) including DeepEP/CUDA-graph/parallelism lists, and sa-bench configs.
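
Taken together, the nine manifests share a common shape. The skeleton below is an illustrative sketch assembled from the field names in this summary; the key names mirror those discussed in the review, but the nesting and values are assumptions rather than a copy of any actual recipe:

```yaml
# Illustrative skeleton only; real recipes carry many more tuned parameters.
name: "gb300-fp4-1k1k-low-latency"
dynamo:
  enable_multiple_frontends: true   # nginx + multiple routers
  num_additional_frontends: 4       # total routers = 1 + num_additional_frontends
model:
  # FP4 checkpoint path and serving container (placeholders)
  path: "<model-path>"
  container: "<container-image>"
resources:
  gpu_type: "gb300"
backend:
  prefill_environment: {}           # env tunables for prefill workers
  decode_environment: {}           # env tunables for decode workers
sglang_config:
  prefill: {}                       # prefill engine flags
  decode: {}                        # decode engine flags
benchmark:
  sa-bench: {}                      # end-to-end benchmarking preset
```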

Sequence Diagram(s)

(omitted — changes are configuration additions without new multi-component control-flow logic)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • pass in frontend args #64: Implements frontend configuration (enable_multiple_frontends, num_additional_frontends, frontend args) that these recipes consume.
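
For context, the frontend keys these recipes consume are quoted in the review threads on this PR; the block below completes the truncated inline comment, with the surrounding `dynamo:` nesting assumed and the value varying per recipe (4 in low_latency, 9 in mid_curve):

```yaml
dynamo:
  enable_multiple_frontends: true  # Enable nginx + multiple routers
  num_additional_frontends: 4      # Additional routers (total = 1 + num_additional_frontends)
```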

Suggested reviewers

  • ishandhanani

Poem

🐰 I hopped through YAML fields today,
FP4 flags all tucked away,
Nine new recipes in a row,
For latency, throughput, and the mid-curve flow,
May GPUs hum and benchmarks play! 🎋

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'Update GB300-FP4 configs' is directly related to the main change, adding and updating multiple YAML configuration files for GB300-FP4 deployments across different scenarios (low_latency, max_tpt, mid_curve) and sequence lengths (1k1k, 1k8k, 8k1k).
  • Docstring Coverage ✅ Passed: no functions found in the changed files to evaluate; docstring coverage check skipped.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@recipes/gb300-fp4/1k1k/low_latency.yaml`:
- Around line 8-9: The inline comment for the num_additional_frontends setting
is truncated; update the comment on the num_additional_frontends line (near
enable_multiple_frontends and num_additional_frontends) to complete the sentence
and clarify the total routers formula (e.g., "# Additional routers (total = 1 +
num_additional_frontends)"), ensuring the comment fully explains that the total
routers equals the base frontend plus the additional count.

In `@recipes/gb300-fp4/1k1k/mid_curve.yaml`:
- Around line 8-9: The comment for num_additional_frontends is truncated; update
the inline comment next to the num_additional_frontends field to complete the
sentence (e.g., clarify total routers calculation like "# Additional routers
(total = 1 + num_additional_frontends)") so it accurately documents how the
total number of frontends is computed; locate the keys enable_multiple_frontends
and num_additional_frontends in the YAML and fix the comment text.

In `@recipes/gb300-fp4/1k8k/low-latency.yaml`:
- Around line 16-17: The resources.gpu_type value is inconsistent (set to
"gb200") for this gb300-fp4 recipe; update the resources block's gpu_type key
from "gb200" to "gb300" (or confirm and document if "gb200" was intentional) so
the gpu_type in this file matches the other gb300-fp4 configs.

In `@recipes/gb300-fp4/8k1k/low_latency.yaml`:
- Around line 8-9: The inline comment for num_additional_frontends is truncated;
update the comment for the num_additional_frontends field (next to
enable_multiple_frontends) to complete the sentence or remove it—e.g., change `#
Additional routers (total = 1 + t` to `# Additional routers (total = 1 +
num_additional_frontends)` or simply `# Number of additional routers` so the
intent is clear; ensure the symbol names enable_multiple_frontends and
num_additional_frontends remain unchanged.
🧹 Nitpick comments (4)
recipes/gb300-fp4/1k8k/max-tpt.yaml (1)

111-113: Consider using consistent parallelism key naming across config files.

This file uses short-form keys (tp-size, dp-size, ep-size) while other files like 1k1k/low_latency.yaml and 8k1k/low_latency.yaml use long-form keys (tensor-parallel-size, data-parallel-size, expert-parallel-size). Consider standardizing across all configs for maintainability.
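
A side-by-side sketch of the two styles; only the key spellings and file attributions are taken from the review, and the values are illustrative:

```yaml
# Style A: short-form keys (as in 1k8k/max-tpt.yaml)
decode:
  tp-size: 48
  dp-size: 48
  ep-size: 48
---
# Style B: long-form keys (as in 1k1k/low_latency.yaml and 8k1k/low_latency.yaml)
decode:
  tensor-parallel-size: 48
  data-parallel-size: 48
  expert-parallel-size: 48
```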

recipes/gb300-fp4/1k1k/max_tpt.yaml (2)

26-63: Consider YAML anchors to keep prefill/decode env blocks in sync.

There’s a lot of duplication between prefill_environment and decode_environment, which risks drift on future edits. A YAML anchor/merge can reduce that.

♻️ Example refactor using anchors (if your loader supports merge keys)
 backend:
-  prefill_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+  common_environment: &common_environment
+    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+    PYTHONUNBUFFERED: "1"
+    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
+    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
+    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+    MC_TE_METRIC: "true"
+    MC_FORCE_MNNVL: "1"
+    NCCL_MNNVL_ENABLE: "1"
+    NCCL_CUMEM_ENABLE: "1"
+    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+  prefill_environment:
+    <<: *common_environment
   decode_environment:
-    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
-    PYTHONUNBUFFERED: "1"
-    DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
-    SGLANG_NVFP4_CKPT_FP8_GEMM_IN_ATTN: "1"
-    SGLANG_PER_TOKEN_GROUP_QUANT_8BIT_V2: "1"
-    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
-    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
-    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
-    MC_TE_METRIC: "true"
-    MC_FORCE_MNNVL: "1"
-    NCCL_MNNVL_ENABLE: "1"
-    NCCL_CUMEM_ENABLE: "1"
-    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
-    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
-    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+    <<: *common_environment
     SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "1024"
     SGLANG_MOE_NVFP4_DISPATCH: "1"

65-115: Ensure deep_gemm_precompile uses the same flags as this prefill config.

If your pipeline runs sglang.deep_gemm_precompile, it must mirror the runtime flags here (e.g., served-model-name, model path, quantization, backends). Please verify the precompile invocation is sourcing this recipe’s exact flags.

Based on learnings: In SGLang backend (src/srtctl/backends/sglang.py), the sglang.deep_gemm_precompile module requires all the same flags as the actual serving run (including --served-model-name, --model-path, etc.) because it needs to mirror the runtime configuration to precompile the correct kernels.

recipes/gb300-fp4/1k8k/mid-curve.yaml (1)

25-62: Consider anchoring the shared env vars to prevent drift.

These env blocks are nearly identical; a YAML anchor/merge would keep them aligned and reduce copy/paste edits.

Comment thread recipes/gb300-fp4/1k1k/low_latency.yaml Outdated
Comment on lines +8 to +9
enable_multiple_frontends: true # Enable nginx + multiple routers
num_additional_frontends: 4 # Additional routers (total = 1 + t
Contributor


⚠️ Potential issue | 🟡 Minor

Truncated comment.

Same as other files - comment on line 9 is incomplete: # Additional routers (total = 1 + t.


Comment thread recipes/gb300-fp4/1k1k/mid_curve.yaml Outdated
Comment on lines +8 to +9
enable_multiple_frontends: true # Enable nginx + multiple routers
num_additional_frontends: 9 # Additional routers (total = 1 + t
Contributor


⚠️ Potential issue | 🟡 Minor

Truncated comment.

Line 9 has an incomplete comment: # Additional routers (total = 1 + t.


Comment on lines +16 to +17
resources:
gpu_type: "gb200"
Contributor


⚠️ Potential issue | 🟡 Minor

Inconsistent gpu_type: "gb200" in a gb300-fp4 config.

This file specifies gpu_type: "gb200" but is located in the recipes/gb300-fp4/ directory. All other files in this PR use gpu_type: "gb300". Please verify this is intentional or correct to "gb300" for consistency.

Suggested fix if this should be gb300
 resources:
-  gpu_type: "gb200"
+  gpu_type: "gb300"

Comment thread recipes/gb300-fp4/8k1k/low_latency.yaml Outdated
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@recipes/gb300-fp4/8k1k/low_latency.yaml`:
- Line 1: The YAML name field currently has a duplicated "8k1k" ("name:
\"gb300-8k1k-fp4-low-latency-8k1k\""); edit the name value to remove the
redundancy and match the repository pattern (e.g., change it to
"gb300-fp4-8k1k-low-latency"), updating the name field in this file so it
follows the same naming convention as other configs.
🧹 Nitpick comments (1)
recipes/gb300-fp4/1k8k/max-tpt.yaml (1)

1-248: Configuration is well-structured.

The configuration correctly sets up a 1k8k max-throughput deployment with appropriate parallelism settings (tp/dp/ep-size: 48 matching 12 decode nodes × 4 GPUs).

Note: The filename uses hyphens (max-tpt.yaml) while files in 1k1k/ and 8k1k/ directories use underscores (max_tpt.yaml). Consider aligning filename conventions across all directories for consistency.

@@ -0,0 +1,121 @@
name: "gb300-8k1k-fp4-low-latency-8k1k"
Contributor


⚠️ Potential issue | 🟡 Minor

Redundant "8k1k" in configuration name.

The name "gb300-8k1k-fp4-low-latency-8k1k" contains "8k1k" twice. Other configs in this PR follow patterns like "gb300-fp4-8k1k-max-tpt".

Suggested fix
-name: "gb300-8k1k-fp4-low-latency-8k1k"
+name: "gb300-fp4-8k1k-low-latency"

@ishandhanani ishandhanani merged commit 1310a2d into main Feb 4, 2026
4 of 5 checks passed
ishandhanani added a commit that referenced this pull request Feb 4, 2026
* Update GB300-FP4 configs

* Add nginx to GB300-FP4 configs

---------

Co-authored-by: ishandhanani <ishandhanani@gmail.com>
