fix: dashboard looks at err files #51

Merged

ishandhanani merged 20 commits into main from gb300-work on Dec 10, 2025

Conversation

@ishandhanani
Owner

@ishandhanani ishandhanani commented Dec 5, 2025

Summary by CodeRabbit

  • New Features

    • Extended log file parsing to support both .err and .out formats for comprehensive log analysis
    • Added support for pipeline parallelism configuration pattern detection
  • Updates

    • Updated user-facing messages and warnings to reflect support for both .err and .out log file formats
    • Enhanced cache invalidation mechanism to ensure accurate parsing after system updates


@coderabbitai
Contributor

coderabbitai bot commented Dec 5, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

The changes extend log file parsing across the analysis module to support both .err and .out file formats instead of exclusively .err files. This includes updating cache patterns, filename parsing logic, validation messages, and introducing support for a new pipeline parallelism pattern (PP#) in log line parsing.

Changes

  • Dashboard cache invalidation (analysis/dashboard/components.py): Added a _cache_version parameter (default: 2) to the load_node_metrics() signature to enable explicit cache invalidation after parser changes. Updated the docstring to document the new parameter and to note that metrics are loaded from both .err and .out files.
  • Dashboard messaging updates (analysis/dashboard/node_metrics_tab.py): Updated the warning message from "No log files (.err) found" to "No log files found". Expanded the informational message to list both .err and .out file patterns (prefill.err/out, decode.err/out).
  • Log parsing configuration (analysis/srtlog/config_reader.py): Extended parse_command_line_from_err() to parse both .err and .out files. Updated cache source patterns to include "*.out", adjusted filename and node/service extraction logic, replaced the err_files_found counter with log_files_found, and updated validation messages and logging to reflect the combined .err/.out format.
  • Log parsing logic (analysis/srtlog/log_parser.py): Broadened file handling to parse both .err and .out files. Updated cache source patterns to include "*.out", adjusted runtime file enumeration conditions, replaced .err references with "log files" in logging statements, added support for the new pipeline parallelism pattern (PP#), and expanded DP/TP/EP tag parsing to handle three formats (full, simple TP, and pipeline PP).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • analysis/srtlog/config_reader.py: Multiple cascading changes to filename parsing, cache patterns, and variable naming require verification of consistency
  • analysis/srtlog/log_parser.py: New PP# pattern support and expanded DP/TP/EP parsing logic warrant careful validation of parsing correctness and edge cases
  • Cache invalidation strategy: Verify that the _cache_version parameter aligns with all cache-dependent functions and that no stale cache conflicts arise

Poem

🐰 Two files where once was one, now parsed with care,
Logs in .out and .err floating through the air,
Pipeline patterns dancing (PP#'s new delight),
Cache invalidation shines both clear and bright! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title 'gb300-moe' is vague and lacks descriptive detail about the actual changes made across multiple files. Resolution: use a more descriptive title that summarizes the main changes, such as 'Update gb300 fp4 configurations for MOE optimization' or 'Configure gb300-fp4 recipes with MOE backends and disaggregation'.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39e9189 and b8750e3.

📒 Files selected for processing (1)
  • analysis/dashboard/components.py (1 hunks)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)

60-60: Inconsistent environment variable key with sibling recipe file.

Line 60 in this file uses the old key SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1", while the related recipe file recipies/gb300-fp4/cutlass-max-tpt.yaml was updated to use the new key SGLANG_MOE_NVFP4_DISPATCH: "true" on lines 26 and 63. Both recipes are for the same compute type (gb300-fp4) and should have consistent environment variable names.

Update line 60 to match the renamed key and boolean format:

-    SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1"
+    SGLANG_MOE_NVFP4_DISPATCH: "true"

Also verify that the system/application code reading these environment variables has been updated to use the new key name across all affected recipe files.

recipies/gb300-fp4/cutlass-max-tpt.yaml (1)

1-5: Update header comment to reflect renamed environment variable.

The comment on line 5 still references the old environment variable name SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH, but the actual configuration now uses SGLANG_MOE_NVFP4_DISPATCH. Keep the header comments in sync with the actual configuration.

Update line 5 in the header comment:

-#   - setting SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: 'true'
+#   - setting SGLANG_MOE_NVFP4_DISPATCH: 'true'
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36a3965 and a690d2d.

📒 Files selected for processing (2)
  • recipies/gb300-fp4/cutedsl-max-tpt.yaml (1 hunks)
  • recipies/gb300-fp4/cutlass-max-tpt.yaml (3 hunks)
🔇 Additional comments (1)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)

7-7: Verify container image format is valid and resolvable.

The container image reference was simplified from a fully-qualified Docker URL (docker://ishandhanani/sglang:rt-cu13-14483) to a bare tag (rt-cu13-14483). Without an explicit registry prefix or organization namespace, image resolution may fail depending on how the parsing/registry logic interprets this format.

Confirm:

  • Does the image resolution logic support bare tags without registry prefixes?
  • Is rt-cu13-14483 available in the expected image registry (e.g., Docker Hub)?
  • Should this be docker://rt-cu13-14483 or ishandhanani/sglang:rt-cu13-14483 instead?

Consider restoring the full image reference unless you've verified the bare format works with your system.

Comment thread recipies/gb300-fp4/cutlass-max-tpt.yaml Outdated

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)

1-1: Update comment to reflect new decode node count.

The file description mentions "8 Decode Nodes," but this configuration now scales to 12 decode nodes (line 13). Update the comment to accurately reflect the current topology.

Apply this diff to fix the comment:

-# 4P1D, with 8 Decode Nodes. Better per-gpu throughput with worse latency.
+# 4P1D, with 12 Decode Nodes. Better per-gpu throughput with worse latency.

Also applies to: 13-13

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a690d2d and 6c1e7b6.

📒 Files selected for processing (2)
  • recipies/gb300-fp4/cutedsl-max-tpt.yaml (3 hunks)
  • recipies/gb300-fp4/max-tpt-no-sbo.yaml (5 hunks)
🔇 Additional comments (8)
recipies/gb300-fp4/max-tpt-no-sbo.yaml (4)

3-3: Clean GPU hardware migration from GB300 to GB200.

Name and gpu_type are now consistently aligned. Configuration reflects the target hardware change.

Also applies to: 11-11


7-7: Verify container version compatibility across configuration files.

This file uses container version 0.5.5.post2, while the related configuration file uses rt-cu13-14483. Confirm these are compatible or intended to be different versions for different deployment scenarios.


37-37: MOE NVFP4 dispatch environment variables added consistently.

Both prefill and decode environments now include SGLANG_MOE_NVFP4_DISPATCH: "1", aligning with the migration away from CUTEDSL-specific dispatch handling.

Also applies to: 59-59


125-125: Verify backend changes are compatible with container version.

The decode configuration switches to flashinfer_cutedsl for MOE kernel execution and uses deepep with low_latency mode for expert-to-expert communication. Confirm these backends are supported by container version 0.5.5.post2.

Also applies to: 152-153

recipies/gb300-fp4/cutedsl-max-tpt.yaml (4)

7-7: Verify container image reference format consistency.

This file uses container reference rt-cu13-14483 (tag format), while the related configuration uses 0.5.5.post2 (version format). Confirm both resolve to the same or appropriate images, or clarify if intentional differences exist for different deployment paths.


12-14: Verify resource scaling does not exceed cluster capacity.

The configuration scales aggressively: prefill nodes 1→4, decode nodes 1→12, prefill workers 1→4. Confirm these allocations are compatible with cluster resources and that the 12x decode node scaling aligns with the 12x parallelism increase (ep-size: 4→48, below).


22-22: MOE NVFP4 dispatch environment variables added consistently.

Both prefill and decode environments now include SGLANG_MOE_NVFP4_DISPATCH: "1", aligning with File 1 and the unified MOE dispatch strategy.

Also applies to: 61-61


172-174: Verify parallelism scaling aligns with resource topology.

The decode configuration scales parallelism 12x (tp/dp/ep: 4→48) to match decode node scaling. Confirm this allocation is mathematically compatible with the hardware topology (12 decode nodes × 4 GPUs/node = 48 units) and that memory/communication overhead is acceptable at this scale.

Comment thread recipies/gb300-fp4/cutedsl-max-tpt.yaml Outdated
precision: "fp4"

resources:
gpu_type: "gb300"


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n recipies/gb300-fp4/cutedsl-max-tpt.yaml | head -20

Repository: ishandhanani/srt-slurm

Length of output: 565


🏁 Script executed:

ls -la recipies/gb300-fp4/ | grep -E "\.yaml$"

Repository: ishandhanani/srt-slurm

Length of output: 263


🏁 Script executed:

cat -n recipies/gb300-fp4/cutedsl-max-tpt-no-sbo.yaml | head -20

Repository: ishandhanani/srt-slurm

Length of output: 144


🏁 Script executed:

cat -n recipies/gb300-fp4/max-tpt-no-sbo.yaml | head -20

Repository: ishandhanani/srt-slurm

Length of output: 562


🏁 Script executed:

cat -n recipies/gb300-fp4/1p2d.yaml | head -20

Repository: ishandhanani/srt-slurm

Length of output: 534


Fix GPU type to match configuration name.

The configuration is named gb200-fp4-max-tpt-2 but gpu_type is set to gb300. The related configuration file max-tpt-no-sbo.yaml has the same name with gpu_type: "gb200". Update to align:

resources:
-   gpu_type: "gb300"
+   gpu_type: "gb200"


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)

1-1: Comment/configuration mismatch: "8 Decode Nodes" vs. actual value 12.

Line 1 states "8 Decode Nodes" but Line 13 sets decode_nodes: 12. Update the comment to reflect the actual configuration.

-# 4P1D, with 8 Decode Nodes. Better per-gpu throughput with worse latency.
+# 4P1D, with 12 Decode Nodes. Better per-gpu throughput with worse latency.

Also applies to: 13-13

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a690d2d and 6c1e7b6.

📒 Files selected for processing (2)
  • recipies/gb300-fp4/cutedsl-max-tpt.yaml (3 hunks)
  • recipies/gb300-fp4/max-tpt-no-sbo.yaml (5 hunks)
🔇 Additional comments (6)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (3)

7-7: Verify container reference format.

The container reference was simplified from a full Docker URI to "rt-cu13-14483". Confirm this format is valid and resolves correctly in your deployment environment. Ensure this shortened reference maintains the same semantics as the original docker://ishandhanani/sglang:rt-cu13-14483 or update to the appropriate full reference if needed.


12-14: Verify parallelism settings scale correctly with resource allocation.

Resource scaling increases decode nodes from 1 to 12 (48 total GPUs: 12 nodes × 4 GPUs/node), while parallelism settings are updated to tp-size: 48, dp-size: 48, ep-size: 48 in both prefill (tp/dp/ep from 4) and decode sections. Confirm these parallelism values are correct and compatible with the new resource allocation. A tensor-parallelism of 48 would span all decode GPUs—verify this aligns with your distributed inference topology and MOE expert placement strategy.

Also applies to: 172-174


22-22: MOE dispatch environment variable standardization looks good.

The environment variable naming has been standardized to SGLANG_MOE_NVFP4_DISPATCH: "1" in both prefill (line 22) and decode (line 61) sections. This is a clear, consistent change for MOE dispatch configuration.

Also applies to: 61-61

recipies/gb300-fp4/max-tpt-no-sbo.yaml (3)

7-7: Verify container reference is complete and resolvable.

The container reference is set to "0.5.5.post2", which appears to be a version string without a registry or image name. Compare this to file 1 which uses a clearer image reference format. Ensure this incomplete reference resolves correctly in your deployment environment or update to a full, valid image reference.


37-37: MOE dispatch environment standardization looks good.

The SGLANG_MOE_NVFP4_DISPATCH: "1" variable has been added to both prefill and decode environments consistently with the changes in file 1, supporting proper MOE dispatch configuration.

Also applies to: 59-59


125-125: Backend configuration updates are clear.

The moe-runner-backend is set to "flashinfer_cutedsl" and DeepEP configuration has been added with moe-a2a-backend: "deepep" and deepep-mode: "low_latency". Verify these backend choices are compatible with your model and inference stack.

Also applies to: 152-153

Comment thread recipies/gb300-fp4/max-tpt-no-sbo.yaml

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
recipies/gb300-fp4/long-ctx.yaml (1)

39-39: Remove trailing whitespace from environment variable values.

Lines 39 and 56 have trailing whitespace after "true", which can cause YAML parsing or comparison issues. Ensure consistency.

     prefill_environment:
       ...
-      SGLANG_ENABLE_FLASHINFER_GEMM: "true" 
+      SGLANG_ENABLE_FLASHINFER_GEMM: "true"
 
   decode_environment:
       ...
-      SGLANG_ENABLE_FLASHINFER_GEMM: "true" 
+      SGLANG_ENABLE_FLASHINFER_GEMM: "true"

Also applies to: 56-56

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c752878 and 5dc502c.

📒 Files selected for processing (1)
  • recipies/gb300-fp4/long-ctx.yaml (1 hunks)

Comment thread recipies/gb300-fp4/long-ctx.yaml Outdated
Comment thread recipies/gb300-fp4/long-ctx.yaml Outdated
Comment thread recipies/gb300-fp4/long-ctx-low-latency.yaml Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
analysis/srtlog/log_parser.py (1)

65-77: Variable naming inconsistency: total_err_files should reflect both file types.

The variable total_err_files is now counting both .err and .out files but retains the old name. Consider renaming to total_log_files for consistency with the updated messaging and behavior.

-        total_err_files = 0
+        total_log_files = 0
         parsed_successfully = 0

         for file in os.listdir(run_path):
             if (file.endswith(".err") or file.endswith(".out")) and ("prefill" in file or "decode" in file):
-                total_err_files += 1
+                total_log_files += 1
                 filepath = os.path.join(run_path, file)
                 node = self.parse_single_log(filepath)
                 if node:
                     nodes.append(node)
                     parsed_successfully += 1

-        logger.info(f"Parsed {parsed_successfully}/{total_err_files} prefill/decode log files from {run_path}")
+        logger.info(f"Parsed {parsed_successfully}/{total_log_files} prefill/decode log files from {run_path}")

-        if total_err_files == 0:
+        if total_log_files == 0:
             logger.warning(f"No prefill/decode log files found in {run_path}")
analysis/srtlog/config_reader.py (1)

288-298: Consider renaming function for clarity.

The function parse_command_line_from_err now parses both .err and .out files. While the docstring is updated, consider renaming to parse_command_line_from_logs for consistency with the broadened scope. This is optional since it would require updating all call sites.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fafe952 and 52bdfb9.

📒 Files selected for processing (5)
  • analysis/dashboard/components.py (1 hunks)
  • analysis/dashboard/node_metrics_tab.py (1 hunks)
  • analysis/srtlog/config_reader.py (5 hunks)
  • analysis/srtlog/log_parser.py (11 hunks)
  • recipies/gb300-fp4/long-ctx-low-latency.yaml (3 hunks)
🔇 Additional comments (7)
analysis/srtlog/log_parser.py (2)

470-474: Verify PP-to-TP mapping is intentional.

The pipeline parallelism value (PP) is mapped to the TP slot, which seems semantically inconsistent since pipeline parallelism and tensor parallelism are different concepts. Ensure this mapping is intentional and won't cause confusion in downstream metric analysis or visualizations.
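The three tag formats mentioned in the walkthrough (full DP/TP/EP, simple TP, and pipeline PP) can be sketched with regexes. The exact log-line format is not shown on this page, so the patterns below are assumptions for illustration, including the review's observation that the PP value lands in the TP slot:

```python
import re

# Hypothetical tag shapes; the real SGLang log format may differ.
FULL = re.compile(r"DP(\d+)\s+TP(\d+)\s+EP(\d+)")
PIPELINE_PP = re.compile(r"PP(\d+)")
SIMPLE_TP = re.compile(r"TP(\d+)")

def parse_rank_tag(line):
    """Extract DP/TP/EP ranks from a log line; returns None if no tag found."""
    m = FULL.search(line)
    if m:
        return {"dp": int(m.group(1)), "tp": int(m.group(2)), "ep": int(m.group(3))}
    m = PIPELINE_PP.search(line)
    if m:
        # As flagged in the review: the PP rank is stored in the TP slot.
        return {"dp": 0, "tp": int(m.group(1)), "ep": 0}
    m = SIMPLE_TP.search(line)
    if m:
        return {"dp": 0, "tp": int(m.group(1)), "ep": 0}
    return None
```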


592-606: LGTM!

The filename extraction regex correctly handles both .err and .out extensions with proper greedy matching for node names containing underscores.
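The greedy-matching behavior praised here can be illustrated with a hypothetical pattern (the actual regex in log_parser.py is not quoted on this page): a greedy `.+` for the node name backtracks only as far as the final `_prefill`/`_decode` suffix, so underscores inside the node name survive.

```python
import re

# Hypothetical filename shapes, e.g. "node01_prefill.err" or "gpu_node_7_decode.out".
LOG_NAME = re.compile(r"^(?P<node>.+)_(?P<service>prefill|decode)\.(?:err|out)$")

def split_log_name(filename):
    """Split a log filename into (node, service), or None if it doesn't match."""
    m = LOG_NAME.match(filename)
    if not m:
        return None
    return m.group("node"), m.group("service")
```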

recipies/gb300-fp4/long-ctx-low-latency.yaml (2)

9-15: Verify prefill resource allocation.

With prefill_nodes=4, prefill_workers=4, and pipeline-parallel-size=4, ensure that each prefill worker has access to 4 GPUs for the pipeline stages. The configuration assumes 16 total prefill processes across 4 nodes.


131-136: LGTM!

The benchmark configuration aligns well with the context-length settings (isl=128000 + osl=8000 = 136000 context-length).

analysis/dashboard/node_metrics_tab.py (1)

81-84: LGTM!

The updated messages correctly inform users that both .err and .out log files are now supported, consistent with the parser changes.

analysis/dashboard/components.py (1)

94-104: LGTM!

The _cache_version parameter provides a clean mechanism for cache invalidation after parser changes. One caveat: in Streamlit, a leading underscore tells st.cache_data not to hash that parameter, so the version value itself is excluded from the cache key; the invalidation still works because changing the function's source (including the new default) causes Streamlit to discard the old cache.
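The version-bump idea can be shown in a Streamlit-free sketch (the dict cache and names below are illustrative; in the PR, st.cache_data handles the actual keying):

```python
_cache = {}

def load_node_metrics(run_path, cache_version=2):
    """Parse metrics once per (run_path, cache_version); bump the default to invalidate."""
    key = (run_path, cache_version)
    if key not in _cache:
        # Real code would parse the .err/.out logs here; a placeholder stands in.
        _cache[key] = {"run": run_path, "parser_version": cache_version}
    return _cache[key]
```

Raising the default from 2 to 3 after a parser change makes every existing entry unreachable, which is the invalidation effect the PR is after.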

analysis/srtlog/config_reader.py (1)

343-364: LGTM!

The file filtering and parsing logic correctly handles both .err and .out files with appropriate greedy matching for node names containing underscores.

Comment on lines +123 to +129
# parallelism
moe-dense-tp-size: 1
tensor-parallel-size: 4
expert-parallel-size: 1
data-parallel-size: 4
enable-dp-attention: true
expert-parallel-size: 4
pipeline-parallel-size: 1


⚠️ Potential issue | 🔴 Critical

Decode parallelism configuration exceeds available GPUs.

The decode parallelism settings require 64 GPU slots (TP=4 × DP=4 × EP=4 × PP=1 = 64), but only 4 GPUs are available (decode_nodes=1 × decode_workers=1 × gpus_per_node=4 = 4). This will cause a runtime failure.

Either reduce the parallelism to fit 4 GPUs:

       # parallelism
       moe-dense-tp-size: 1
-      tensor-parallel-size: 4
-      data-parallel-size: 4
+      tensor-parallel-size: 2
+      data-parallel-size: 2
       enable-dp-attention: true
-      expert-parallel-size: 4
+      expert-parallel-size: 1
       pipeline-parallel-size: 1

Or increase decode resources to match the parallelism requirements.

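The arithmetic in this comment can be sanity-checked with a small helper. It follows the review's accounting (TP × DP × EP × PP slots vs. nodes × workers × GPUs-per-node); whether EP actually multiplies the slot count is framework-dependent and an assumption here:

```python
def decode_gpu_check(decode_nodes, decode_workers, gpus_per_node, tp, dp, ep, pp):
    """Compare available decode GPU slots with the slots the parallel sizes imply."""
    available = decode_nodes * decode_workers * gpus_per_node
    required = tp * dp * ep * pp  # the review's accounting; framework-dependent
    return available, required, available >= required
```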


@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
recipies/gb300-fp4/long-ctx-max-tpt.yaml (2)

120-120: Resolve uncertainty in chunked-prefill-size configuration.

The comment # why disable chunked prefill if we have dp attn? signals unresolved design intent. The value is set to -1 (disable), but the question suggests this choice may warrant reconsideration.

Please clarify: Is disabling chunked prefill intentional, or should this be revisited? I can help refine this setting or update the comment if needed.


18-35: Consider consolidating duplicated environment variables.

Both prefill_environment and decode_environment define identical entries: FLASHINFER_WORKSPACE_BASE, SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE, SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT, SGLANG_DISAGGREGATION_WAITING_TIMEOUT, and others.

If these settings should always match, consider defining a shared environment block (e.g., via a YAML anchor) to reduce duplication and maintenance burden:

defaults: &disagg_env
  SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
  SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
  SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
  # ... other shared vars

prefill_environment:
  <<: *disagg_env
  # prefill-specific overrides here

Also applies to: 38-58

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52bdfb9 and 39e9189.

📒 Files selected for processing (1)
  • recipies/gb300-fp4/long-ctx-max-tpt.yaml (2 hunks)
🔇 Additional comments (1)
recipies/gb300-fp4/long-ctx-max-tpt.yaml (1)

92-92: Verify expert-parallel-size asymmetry between prefill and decode.

The prefill config sets expert-parallel-size: 1 (line 92), while decode sets expert-parallel-size: 32 (line 131). In disaggregated MOE setups, this disparity typically reflects different parallelization strategies, but the 1 vs. 32 gap is substantial.

Confirm this asymmetry is intentional. If prefill should use expert parallelism like decode, this may be a configuration oversight.

Also applies to: 131-131

concurrencies: "1x128x512x2048x4096x8192x12000x15000"
isl: 128000
osl: 8000
concurrencies: "16x32x64x128x256x512x644x768"


⚠️ Potential issue | 🔴 Critical

Verify benchmark concurrency value 644.

The concurrency sequence "16x32x64x128x256x512x644x768" contains an anomalous value: 644 breaks the expected progression between 512 and 768. This appears to be a typo.

Clarify the intended value. If following a doubling/step pattern, consider: 16x32x64x128x256x512x576x640x704x768 or similar.

- concurrencies: "16x32x64x128x256x512x644x768"
+ concurrencies: "16x32x64x128x256x512x576x640x704x768"  # or your intended values

@ishandhanani ishandhanani changed the title from gb300-moe to fix: dashboard looks at err files on Dec 10, 2025
@ishandhanani ishandhanani merged commit 18af539 into main Dec 10, 2025
1 check was pending