**Caution**: Review failed. The pull request is closed.

**Walkthrough**: The changes extend log file parsing across the analysis module to support both `.err` and `.out` log files.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
**Pre-merge checks and finishing touches**

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)

📜 **Recent review details**

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)
**60-60**: Inconsistent environment variable key with sibling recipe file.

Line 60 in this file uses the old key `SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1"`, while the related recipe file `recipies/gb300-fp4/cutlass-max-tpt.yaml` was updated to use the new key `SGLANG_MOE_NVFP4_DISPATCH: "true"` on lines 26 and 63. Both recipes are for the same compute type (gb300-fp4) and should have consistent environment variable names.

Update line 60 to match the renamed key and boolean format:

```diff
-  SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: "1"
+  SGLANG_MOE_NVFP4_DISPATCH: "true"
```

Also verify that the system/application code reading these environment variables has been updated to use the new key name across all affected recipe files.
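As a hedged illustration of the consistency check this comment asks for, the sketch below compares the dispatch keys of the two recipes. The key names and values come from the comment text above, not from parsing the actual YAML files:

```python
# Hypothetical sketch: compare the MOE dispatch keys declared by the two
# sibling recipes. An empty intersection means the files have diverged.
cutedsl_env = {"SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH": "1"}  # current cutedsl recipe
cutlass_env = {"SGLANG_MOE_NVFP4_DISPATCH": "true"}       # updated cutlass recipe

shared_keys = sorted(set(cutedsl_env) & set(cutlass_env))
print(shared_keys)  # [] -> no shared dispatch key, the recipes disagree
```

A real check would load both YAML files and compare their environment blocks, but the same set-intersection idea applies.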
recipies/gb300-fp4/cutlass-max-tpt.yaml (1)
**1-5**: Update header comment to reflect renamed environment variable.

The comment on line 5 still references the old environment variable name `SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH`, but the actual configuration now uses `SGLANG_MOE_NVFP4_DISPATCH`. Keep the header comments in sync with the actual configuration.

Update line 5 in the header comment:

```diff
-# - setting SGLANG_CUTEDSL_MOE_NVFP4_DISPATCH: 'true'
+# - setting SGLANG_MOE_NVFP4_DISPATCH: 'true'
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `recipies/gb300-fp4/cutedsl-max-tpt.yaml` (1 hunks)
- `recipies/gb300-fp4/cutlass-max-tpt.yaml` (3 hunks)
🔇 Additional comments (1)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)
**7-7**: Verify container image format is valid and resolvable.

The container image reference was simplified from a fully-qualified Docker URL (`docker://ishandhanani/sglang:rt-cu13-14483`) to a bare tag (`rt-cu13-14483`). Without an explicit registry prefix or organization namespace, image resolution may fail depending on how the parsing/registry logic interprets this format.

Confirm:

- Does the image resolution logic support bare tags without registry prefixes?
- Is `rt-cu13-14483` available in the expected image registry (e.g., Docker Hub)?
- Should this be `docker://rt-cu13-14483` or `ishandhanani/sglang:rt-cu13-14483` instead?

Consider restoring the full image reference unless you've verified the bare format works with your system.
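One way the bare-tag format could be made to resolve is a small normalization step before lookup. The helper below is purely hypothetical (the repo's actual resolution logic is not shown in this review) and assumes `ishandhanani/sglang` as the default repository:

```python
def normalize_image(ref: str, default_repo: str = "ishandhanani/sglang") -> str:
    """Expand a bare tag like 'rt-cu13-14483' into a full docker:// reference."""
    if ref.startswith("docker://"):
        return ref  # already fully qualified
    if "/" not in ref and ":" not in ref:
        # Bare tag: attach the assumed default repository.
        return f"docker://{default_repo}:{ref}"
    return ref  # repo:tag form, leave untouched

print(normalize_image("rt-cu13-14483"))
# docker://ishandhanani/sglang:rt-cu13-14483
```

Whether such a default-repo convention exists in this system is exactly what the comment above asks the author to verify.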
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)
**1-1**: Update comment to reflect new decode node count.

The file description mentions "8 Decode Nodes," but this configuration now scales to 12 decode nodes (line 13). Update the comment to accurately reflect the current topology.

Apply this diff to fix the comment:

```diff
-# 4P1D, with 8 Decode Nodes. Better per-gpu throughput with worse latency.
+# 4P1D, with 12 Decode Nodes. Better per-gpu throughput with worse latency.
```

Also applies to: 13-13
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `recipies/gb300-fp4/cutedsl-max-tpt.yaml` (3 hunks)
- `recipies/gb300-fp4/max-tpt-no-sbo.yaml` (5 hunks)
🔇 Additional comments (8)
recipies/gb300-fp4/max-tpt-no-sbo.yaml (4)
**3-3**: Clean GPU hardware migration from GB300 to GB200.

Name and gpu_type are now consistently aligned. Configuration reflects the target hardware change.
Also applies to: 11-11
**7-7**: Verify container version compatibility across configuration files.

This file uses container version `0.5.5.post2`, while the related configuration file uses `rt-cu13-14483`. Confirm these are compatible or intended to be different versions for different deployment scenarios.
**37-37**: MOE NVFP4 dispatch environment variables added consistently.

Both prefill and decode environments now include `SGLANG_MOE_NVFP4_DISPATCH: "1"`, aligning with the migration away from CUTEDSL-specific dispatch handling.

Also applies to: 59-59
**125-125**: Verify backend changes are compatible with container version.

The decode configuration switches to `flashinfer_cutedsl` for MOE kernel execution and uses `deepep` with `low_latency` mode for expert-to-expert communication. Confirm these backends are supported by container version `0.5.5.post2`.

Also applies to: 152-153
recipies/gb300-fp4/cutedsl-max-tpt.yaml (4)
**7-7**: Verify container image reference format consistency.

This file uses container reference `rt-cu13-14483` (tag format), while the related configuration uses `0.5.5.post2` (version format). Confirm both resolve to the same or appropriate images, or clarify if intentional differences exist for different deployment paths.
**12-14**: Verify resource scaling does not exceed cluster capacity.

The configuration scales aggressively: prefill nodes 1→4, decode nodes 1→12, prefill workers 1→4. Confirm these allocations are compatible with cluster resources and that the 12x decode node scaling aligns with the 12x parallelism increase (ep-size: 4→48, below).
**22-22**: MOE NVFP4 dispatch environment variables added consistently.

Both prefill and decode environments now include `SGLANG_MOE_NVFP4_DISPATCH: "1"`, aligning with File 1 and the unified MOE dispatch strategy.

Also applies to: 61-61
**172-174**: Verify parallelism scaling aligns with resource topology.

The decode configuration scales parallelism 12x (tp/dp/ep: 4→48) to match decode node scaling. Confirm this allocation is mathematically compatible with the hardware topology (12 decode nodes × 4 GPUs/node = 48 units) and that memory/communication overhead is acceptable at this scale.
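The arithmetic in this comment can be captured as a small sanity check. The numbers below are the ones stated above (12 decode nodes, 4 GPUs per node, parallel sizes of 48), not values read from the cluster:

```python
# Sanity check: each parallel size should span exactly the decode GPU pool.
decode_nodes = 12
gpus_per_node = 4
total_gpus = decode_nodes * gpus_per_node  # 12 x 4 = 48

parallel_sizes = {"tp-size": 48, "dp-size": 48, "ep-size": 48}
for name, size in parallel_sizes.items():
    assert size == total_gpus, f"{name}={size} does not match {total_gpus} GPUs"

print(total_gpus)  # 48
```

A check like this could live in CI to catch recipe edits where node counts and parallel sizes drift apart.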
```yaml
precision: "fp4"

resources:
  gpu_type: "gb300"
```
🧩 Analysis chain

🏁 Script executed:

```shell
cat -n recipies/gb300-fp4/cutedsl-max-tpt.yaml | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 565

🏁 Script executed:

```shell
ls -la recipies/gb300-fp4/ | grep -E "\.yaml$"
```

Repository: ishandhanani/srt-slurm
Length of output: 263

🏁 Script executed:

```shell
cat -n recipies/gb300-fp4/cutedsl-max-tpt-no-sbo.yaml | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 144

🏁 Script executed:

```shell
cat -n recipies/gb300-fp4/max-tpt-no-sbo.yaml | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 562

🏁 Script executed:

```shell
cat -n recipies/gb300-fp4/1p2d.yaml | head -20
```

Repository: ishandhanani/srt-slurm
Length of output: 534
**Fix GPU type to match configuration name.**

The configuration is named `gb200-fp4-max-tpt-2` but `gpu_type` is set to `gb300`. The related configuration file `max-tpt-no-sbo.yaml` has the same name with `gpu_type: "gb200"`. Update to align:

```diff
 resources:
-  gpu_type: "gb300"
+  gpu_type: "gb200"
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```yaml
  gpu_type: "gb200"
```
🤖 Prompt for AI Agents
```
In recipies/gb300-fp4/cutedsl-max-tpt.yaml around line 11, the gpu_type is
incorrectly set to "gb300" while the configuration name is gb200-fp4-max-tpt-2;
change gpu_type to "gb200" to match the configuration name and make it
consistent with max-tpt-no-sbo.yaml which uses gpu_type: "gb200".
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (1)
**1-1**: Comment/configuration mismatch: "8 Decode Nodes" vs. actual value 12.

Line 1 states "8 Decode Nodes" but line 13 sets `decode_nodes: 12`. Update the comment to reflect the actual configuration:

```diff
-# 4P1D, with 8 Decode Nodes. Better per-gpu throughput with worse latency.
+# 4P1D, with 12 Decode Nodes. Better per-gpu throughput with worse latency.
```

Also applies to: 13-13
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `recipies/gb300-fp4/cutedsl-max-tpt.yaml` (3 hunks)
- `recipies/gb300-fp4/max-tpt-no-sbo.yaml` (5 hunks)
🔇 Additional comments (6)
recipies/gb300-fp4/cutedsl-max-tpt.yaml (3)
**7-7**: Verify container reference format.

The container reference was simplified from a full Docker URI to `"rt-cu13-14483"`. Confirm this format is valid and resolves correctly in your deployment environment. Ensure this shortened reference maintains the same semantics as the original `docker://ishandhanani/sglang:rt-cu13-14483`, or update to the appropriate full reference if needed.
**12-14**: Verify parallelism settings scale correctly with resource allocation.

Resource scaling increases decode nodes from 1 to 12 (48 total GPUs: 12 nodes × 4 GPUs/node), while parallelism settings are updated to `tp-size: 48`, `dp-size: 48`, `ep-size: 48` in both prefill (tp/dp/ep from 4) and decode sections. Confirm these parallelism values are correct and compatible with the new resource allocation. A tensor-parallelism of 48 would span all decode GPUs; verify this aligns with your distributed inference topology and MOE expert placement strategy.

Also applies to: 172-174
**22-22**: MOE dispatch environment variable standardization looks good.

The environment variable naming has been standardized to `SGLANG_MOE_NVFP4_DISPATCH: "1"` in both prefill (line 22) and decode (line 61) sections. This is a clear, consistent change for MOE dispatch configuration.

Also applies to: 61-61
recipies/gb300-fp4/max-tpt-no-sbo.yaml (3)
**7-7**: Verify container reference is complete and resolvable.

The container reference is set to `"0.5.5.post2"`, which appears to be a version string without a registry or image name. Compare this to file 1, which uses a clearer image reference format. Ensure this incomplete reference resolves correctly in your deployment environment or update to a full, valid image reference.
**37-37**: MOE dispatch environment standardization looks good.

The `SGLANG_MOE_NVFP4_DISPATCH: "1"` variable has been added to both prefill and decode environments, consistent with the changes in file 1, supporting proper MOE dispatch configuration.

Also applies to: 59-59
**125-125**: Backend configuration updates are clear.

The `moe-runner-backend` is set to `"flashinfer_cutedsl"` and DeepEP configuration has been added with `moe-a2a-backend: "deepep"` and `deepep-mode: "low_latency"`. Verify these backend choices are compatible with your model and inference stack.

Also applies to: 152-153
Actionable comments posted: 3
🧹 Nitpick comments (1)
recipies/gb300-fp4/long-ctx.yaml (1)
**39-39**: Remove trailing whitespace from environment variable values.

Lines 39 and 56 have trailing whitespace after `"true"`, which can cause YAML parsing or comparison issues. Ensure consistency:

```diff
 prefill_environment:
   ...
-  SGLANG_ENABLE_FLASHINFER_GEMM: "true" 
+  SGLANG_ENABLE_FLASHINFER_GEMM: "true"
 decode_environment:
   ...
-  SGLANG_ENABLE_FLASHINFER_GEMM: "true" 
+  SGLANG_ENABLE_FLASHINFER_GEMM: "true"
```

Also applies to: 56-56
Actionable comments posted: 1
🧹 Nitpick comments (2)
`analysis/srtlog/log_parser.py` (1)

**65-77**: Variable naming inconsistency: `total_err_files` should reflect both file types.

The variable `total_err_files` now counts both `.err` and `.out` files but retains the old name. Consider renaming it to `total_log_files` for consistency with the updated messaging and behavior:

```diff
-        total_err_files = 0
+        total_log_files = 0
         parsed_successfully = 0
         for file in os.listdir(run_path):
             if (file.endswith(".err") or file.endswith(".out")) and ("prefill" in file or "decode" in file):
-                total_err_files += 1
+                total_log_files += 1
                 filepath = os.path.join(run_path, file)
                 node = self.parse_single_log(filepath)
                 if node:
                     nodes.append(node)
                     parsed_successfully += 1

-        logger.info(f"Parsed {parsed_successfully}/{total_err_files} prefill/decode log files from {run_path}")
+        logger.info(f"Parsed {parsed_successfully}/{total_log_files} prefill/decode log files from {run_path}")

-        if total_err_files == 0:
+        if total_log_files == 0:
             logger.warning(f"No prefill/decode log files found in {run_path}")
```

`analysis/srtlog/config_reader.py` (1)

**288-298**: Consider renaming function for clarity.

The function `parse_command_line_from_err` now parses both `.err` and `.out` files. While the docstring is updated, consider renaming it to `parse_command_line_from_logs` for consistency with the broadened scope. This is optional since it would require updating all call sites.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `analysis/dashboard/components.py` (1 hunks)
- `analysis/dashboard/node_metrics_tab.py` (1 hunks)
- `analysis/srtlog/config_reader.py` (5 hunks)
- `analysis/srtlog/log_parser.py` (11 hunks)
- `recipies/gb300-fp4/long-ctx-low-latency.yaml` (3 hunks)
🔇 Additional comments (7)
analysis/srtlog/log_parser.py (2)
**470-474**: Verify PP-to-TP mapping is intentional.

The pipeline parallelism value (`PP`) is mapped to the `TP` slot, which seems semantically inconsistent since pipeline parallelism and tensor parallelism are different concepts. Ensure this mapping is intentional and won't cause confusion in downstream metric analysis or visualizations.
**592-606**: LGTM!

The filename extraction regex correctly handles both `.err` and `.out` extensions with proper greedy matching for node names containing underscores.

`recipies/gb300-fp4/long-ctx-low-latency.yaml` (2)
**9-15**: Verify prefill resource allocation.

With `prefill_nodes=4`, `prefill_workers=4`, and `pipeline-parallel-size=4`, ensure that each prefill worker has access to 4 GPUs for the pipeline stages. The configuration assumes 16 total prefill processes across 4 nodes.
**131-136**: LGTM!

The benchmark configuration aligns well with the context-length settings (isl=128000 + osl=8000 = 136000 context-length).
analysis/dashboard/node_metrics_tab.py (1)
**81-84**: LGTM!

The updated messages correctly inform users that both `.err` and `.out` log files are now supported, consistent with the parser changes.

`analysis/dashboard/components.py` (1)
**94-104**: LGTM!

The `_cache_version` parameter provides a clean mechanism for cache invalidation after parser changes. Note that Streamlit excludes underscore-prefixed parameters from cache key computation, so verify that bumping `_cache_version` actually invalidates the cache as intended.

`analysis/srtlog/config_reader.py` (1)
**343-364**: LGTM!

The file filtering and parsing logic correctly handles both `.err` and `.out` files with appropriate greedy matching for node names containing underscores.
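The version-keyed caching idea behind `_cache_version` can be illustrated with plain `functools.lru_cache`. This is a minimal sketch of the general pattern, not Streamlit's implementation (and in Streamlit itself, underscore-prefixed parameters are excluded from hashing, which changes whether they contribute to the key):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def parse_run(run_path: str, cache_version: int) -> str:
    # cache_version is part of the cache key: bumping it after a parser
    # change forces a fresh parse instead of reusing a stale entry.
    return f"parsed {run_path} (v{cache_version})"

print(parse_run("/runs/001", 1))  # parsed /runs/001 (v1)
# Bumping the version creates a distinct cache entry:
print(parse_run("/runs/001", 2))  # parsed /runs/001 (v2)
```

The same technique works with any memoization layer: encode the parser version into the key, and stale entries simply stop being hit.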
```diff
 # parallelism
 moe-dense-tp-size: 1
 tensor-parallel-size: 4
-expert-parallel-size: 1
 data-parallel-size: 4
 enable-dp-attention: true
+expert-parallel-size: 4
 pipeline-parallel-size: 1
```
**Decode parallelism configuration exceeds available GPUs.**

The decode parallelism settings require 64 GPU slots (TP=4 × DP=4 × EP=4 × PP=1 = 64), but only 4 GPUs are available (decode_nodes=1 × decode_workers=1 × gpus_per_node=4 = 4). This will cause a runtime failure.

Either reduce the parallelism to fit 4 GPUs:

```diff
 # parallelism
 moe-dense-tp-size: 1
-tensor-parallel-size: 4
-data-parallel-size: 4
+tensor-parallel-size: 2
+data-parallel-size: 2
 enable-dp-attention: true
-expert-parallel-size: 4
+expert-parallel-size: 1
 pipeline-parallel-size: 1
```

Or increase decode resources to match the parallelism requirements.
📝 Committable suggestion
```yaml
# parallelism
moe-dense-tp-size: 1
tensor-parallel-size: 2
data-parallel-size: 2
enable-dp-attention: true
expert-parallel-size: 1
pipeline-parallel-size: 1
```
🤖 Prompt for AI Agents
```
In recipies/gb300-fp4/long-ctx-low-latency.yaml around lines 123-129, the
product TP×DP×EP×PP (4×4×4×1=64) exceeds the available GPUs
(decode_nodes×decode_workers×gpus_per_node = 1×1×4 = 4), causing runtime
failure; fix by either reducing parallelism so TP×DP×EP×PP ≤ 4 (e.g., set
tensor-parallel-size: 1, data-parallel-size: 1, expert-parallel-size: 4,
pipeline-parallel-size: 1 or other combination that multiplies to ≤4) or
increase decode resources to match 64 GPUs (e.g., adjust decode_nodes and/or
decode_workers so decode_nodes×decode_workers×gpus_per_node = 64, such as
decode_nodes: 4 and decode_workers: 4 with gpus_per_node: 4); ensure the final
configuration satisfies total GPU slots ≥ TP×DP×EP×PP.
```
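The reviewer's arithmetic can be expressed as a one-line budget check. The figures below are the ones quoted in this comment, and whether TP×DP×EP×PP is the right product for this inference stack is itself an assumption worth verifying:

```python
# GPU budget check using the values quoted in the review comment above.
tp, dp, ep, pp = 4, 4, 4, 1
decode_nodes, decode_workers, gpus_per_node = 1, 1, 4

slots = tp * dp * ep * pp                              # required parallel slots
gpus = decode_nodes * decode_workers * gpus_per_node   # available GPUs
print(slots, gpus, slots <= gpus)  # 64 4 False
```

Running such a check when loading a recipe would surface this class of misconfiguration before a job is ever submitted to the cluster.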
Actionable comments posted: 1
🧹 Nitpick comments (2)
recipies/gb300-fp4/long-ctx-max-tpt.yaml (2)
**120-120**: Resolve uncertainty in chunked-prefill-size configuration.

The comment `# why disable chunked prefill if we have dp attn?` signals unresolved design intent. The value is set to `-1` (disable), but the question suggests this choice may warrant reconsideration.

Please clarify: is disabling chunked prefill intentional, or should this be revisited? I can help refine this setting or update the comment if needed.
**18-35**: Consider consolidating duplicated environment variables.

Both `prefill_environment` and `decode_environment` define identical entries: `FLASHINFER_WORKSPACE_BASE`, `SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE`, `SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT`, `SGLANG_DISAGGREGATION_WAITING_TIMEOUT`, and others.

If these settings should always match, consider defining a shared environment block (e.g., via a YAML anchor) to reduce duplication and maintenance burden:

```yaml
defaults: &disagg_env
  SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
  SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
  SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
  # ... other shared vars

prefill_environment:
  <<: *disagg_env
  # prefill-specific overrides here
```

Also applies to: 38-58
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `recipies/gb300-fp4/long-ctx-max-tpt.yaml` (2 hunks)
🔇 Additional comments (1)
recipies/gb300-fp4/long-ctx-max-tpt.yaml (1)
**92-92**: Verify expert-parallel-size asymmetry between prefill and decode.

The prefill config sets `expert-parallel-size: 1` (line 92), while decode sets `expert-parallel-size: 32` (line 131). In disaggregated MOE setups, this disparity typically reflects different parallelization strategies, but the 1 vs. 32 gap is substantial.

Confirm this asymmetry is intentional. If prefill should use expert parallelism like decode, this may be a configuration oversight.
Also applies to: 131-131
```diff
-concurrencies: "1x128x512x2048x4096x8192x12000x15000"
 isl: 128000
 osl: 8000
+concurrencies: "16x32x64x128x256x512x644x768"
```
**Verify benchmark concurrency value 644.**

The concurrency sequence `"16x32x64x128x256x512x644x768"` contains an anomalous value: 644 breaks the expected progression between 512 and 768. This appears to be a typo.

Clarify the intended value. If following a doubling/step pattern, consider `16x32x64x128x256x512x576x640x704x768` or similar:

```diff
-  concurrencies: "16x32x64x128x256x512x644x768"
+  concurrencies: "16x32x64x128x256x512x576x640x704x768" # or your intended values
```

🤖 Prompt for AI Agents
```
In recipies/gb300-fp4/long-ctx-max-tpt.yaml around line 143, the concurrencies
sequence contains an anomalous value "644" that breaks the expected numeric
progression between 512 and 768; replace or correct that value to match the
intended progression (e.g., use a consistent stepping/doubling pattern such as
adding intermediate values like 576, 640, 704 or adjust to the intended single
value) so the concurrencies string reads a valid sequence (for example
"16x32x64x128x256x512x576x640x704x768" or your chosen correct sequence).
```
**Summary by CodeRabbit**

- New Features
- Updates