This repository was archived by the owner on Apr 20, 2026. It is now read-only.

support agg_logs and dashboard #193

Draft
YAMY1234 wants to merge 1 commit into ishandhanani:main from YAMY1234:agg_dashboard

Conversation

Collaborator

@YAMY1234 YAMY1234 commented Feb 26, 2026

Summary by CodeRabbit

  • New Features

    • Added support for aggregated log format parsing alongside existing formats
    • Added support for v2.0 run metadata format with enhanced backward compatibility
    • Improved cache loading to search multiple directory locations
  • Bug Fixes

    • Enhanced run identification for aggregated runs with distinct labeling
    • Improved dashboard legend labeling using folder-derived suffixes and job names
  • Documentation

    • Updated file pattern guidance in node metrics warnings to reflect supported file types

@YAMY1234 YAMY1234 marked this pull request as draft February 26, 2026 22:15
Contributor

coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

Walkthrough

The changes extend the analysis pipeline to support aggregated logs: the parser recognizes agg log formats and files; metadata extraction handles the v2.0 structure with agg_nodes; the run loader adapts cache patterns and run-ID generation for aggregated runs; and the UI displays appropriate labels and guidance text.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Aggregated Log Format Support**<br>`analysis/srtlog/log_parser.py` | Expanded parsing to recognize agg files alongside prefill/decode; added regex and timestamp handling for Agg/ANSI log format; updated docstrings to document four-format support. |
| **Metadata and Run Format Handling**<br>`analysis/srtlog/models.py` | Added support for v2.0 run format with agg_nodes extraction and fallback logic; unified handling for flat new-structure format to ensure consistent agg_nodes fallback across both paths. |
| **Run Data Loading and Caching**<br>`analysis/srtlog/run_loader.py` | Broadened cache source patterns to include logs subdirectories; modified run ID construction to use aggregated metadata with A suffix when is_aggregated is true. |
| **Dashboard Display Updates**<br>`analysis/dashboard/app.py`, `analysis/dashboard/node_metrics_tab.py` | Sidebar legend now derives folder_suffix from YAML filename with dash-based fallback to job_name; node metrics warning text updated to reference agg files alongside out files. |
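The Agg/ANSI handling described for `log_parser.py` comes down to stripping colour codes before matching a timestamped line. A minimal sketch of that technique, using an invented line format and regex (the actual patterns in `analysis/srtlog/log_parser.py` will differ):

```python
import re

# Hypothetical sketch: remove ANSI escape sequences, then match an
# assumed agg log line shape like
#   "[2026-02-26 22:15:03] agg throughput: 123.4 tok/s"
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
AGG_LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] agg throughput: (?P<tps>[\d.]+)"
)

def parse_agg_line(line: str):
    """Return (timestamp, tokens_per_sec) or None for non-matching lines."""
    clean = ANSI_RE.sub("", line)  # colour codes would otherwise break the match
    m = AGG_LINE_RE.search(clean)
    if m is None:
        return None
    return m.group("ts"), float(m.group("tps"))
```

Stripping ANSI first keeps the content regex simple; the alternative of baking escape sequences into every pattern tends to be fragile.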

Sequence Diagram

sequenceDiagram
    participant User
    participant Dashboard as Dashboard UI
    participant RunLoader as Run Loader
    participant LogParser as Log Parser
    participant Metadata as Metadata Handler
    
    User->>Dashboard: Request analysis view
    Dashboard->>RunLoader: Load run data
    RunLoader->>LogParser: Parse log files (prefill/decode/agg)
    LogParser->>LogParser: Extract metrics & timestamps<br/>(incl. Agg/ANSI format)
    LogParser-->>RunLoader: Parsed log data
    RunLoader->>Metadata: Extract run metadata<br/>(v2.0 or flat format)
    Metadata->>Metadata: Derive agg_nodes & mode<br/>(with fallback logic)
    Metadata-->>RunLoader: RunMetadata instance
    RunLoader->>RunLoader: Generate run ID<br/>(agg path if aggregated)
    RunLoader-->>Dashboard: Populated dataframe
    Dashboard->>Dashboard: Construct sidebar legend<br/>(folder_suffix from filename)
    Dashboard-->>User: Rendered dashboard
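The "Generate run ID (agg path if aggregated)" step in the diagram can be sketched as below. Only the "A" suffix for aggregated runs comes from the summary above; the field names and the non-aggregated ID format are assumptions for illustration, not the actual `RunMetadata` schema:

```python
from dataclasses import dataclass

@dataclass
class RunMeta:  # stand-in for the real RunMetadata; fields are assumed
    job_id: str
    is_aggregated: bool
    agg_nodes: int = 0
    prefill_nodes: int = 0
    decode_nodes: int = 0

def make_run_id(meta: RunMeta) -> str:
    if meta.is_aggregated:
        # aggregated runs get an "A" suffix, e.g. "12345-2A"
        return f"{meta.job_id}-{meta.agg_nodes}A"
    # hypothetical disaggregated format, e.g. "67890-4P8D"
    return f"{meta.job_id}-{meta.prefill_nodes}P{meta.decode_nodes}D"
```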

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested Reviewers

  • ishandhanani

Poem

🐰 Hop along the agg-gregate path,
Parsing formats, metadata bath,
From dash-named YAML to run IDs bright,
The dashboard now dances with aggregated might!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'support agg_logs and dashboard' is partially related to the changeset. It accurately reflects that the PR adds support for aggregated logs (agg) across multiple files, but it omits important implementation details about label construction changes and run metadata format handling that are significant parts of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |




@gracehonv gracehonv marked this pull request as ready for review February 26, 2026 23:12
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
analysis/srtlog/models.py (1)

112-141: Consider extracting shared logic to reduce duplication.

The flat structure handling (lines 112-141) duplicates nearly all the extraction logic from the v2.0 format handling (lines 51-84). Both paths perform identical field extraction and agg_nodes fallback computation.

Consider extracting the common logic into a private helper method to improve maintainability.

♻️ Suggested refactor to reduce duplication
+    @classmethod
+    def _from_resources_model(cls, json_data: dict, resources: dict, model: dict, run_path: str) -> "RunMetadata":
+        """Helper to create RunMetadata from resources/model structure."""
+        agg_workers = resources.get("agg_workers") or 0
+        mode = "aggregated" if agg_workers > 0 else "disaggregated"
+        
+        agg_nodes = (resources.get("agg_nodes") or 0) if agg_workers > 0 else 0
+        if agg_workers > 0 and agg_nodes == 0:
+            agg_nodes = agg_workers
+        
+        return cls(
+            job_id=json_data.get("job_id", ""),
+            path=run_path,
+            run_date=json_data.get("generated_at", ""),
+            container=model.get("container", ""),
+            prefill_nodes=resources.get("prefill_nodes") or 0,
+            decode_nodes=resources.get("decode_nodes") or 0,
+            prefill_workers=resources.get("prefill_workers") or 0,
+            decode_workers=resources.get("decode_workers") or 0,
+            mode=mode,
+            job_name=json_data.get("job_name", ""),
+            partition="",
+            model_dir=model.get("path", ""),
+            gpus_per_node=resources.get("gpus_per_node") or 0,
+            gpu_type=resources.get("gpu_type", ""),
+            enable_multiple_frontends=False,
+            num_additional_frontends=0,
+            agg_nodes=agg_nodes,
+            agg_workers=agg_workers,
+        )
+
     @classmethod
     def from_json(cls, json_data: dict, run_path: str) -> "RunMetadata":
         # ... docstring ...
         
         # Check if this is the new v2.0 format
         if json_data.get("version") == "2.0":
             resources = json_data.get("resources", {})
             model = json_data.get("model", {})
-            agg_workers = resources.get("agg_workers") or 0
-            # ... all the extraction logic ...
+            return cls._from_resources_model(json_data, resources, model, run_path)
         
         # Old format with run_metadata key
         if "run_metadata" in json_data:
             # ... existing old format handling ...
         
         # New format (flat structure, no version field)
         model_data = json_data.get("model", {})
         resources_data = json_data.get("resources", {})
-        agg_workers = resources_data.get("agg_workers") or 0
-        # ... all the extraction logic ...
+        return cls._from_resources_model(json_data, resources_data, model_data, run_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@analysis/srtlog/models.py` around lines 112 - 141, The flat-structure branch
duplicates field extraction and agg_nodes logic already used for the v2.0 path;
extract that shared logic into a private helper (e.g., _extract_run_fields or
_build_run_kwargs) which takes model_data, resources_data, json_data and
run_path, computes agg_workers/agg_nodes/mode and returns the common kwargs or a
dict, then call that helper from both the v2.0 and flat branches and pass its
result into cls(...) so both branches reuse the same extraction and agg_nodes
fallback logic.
analysis/dashboard/app.py (1)

307-314: Clarify comment about folder name vs YAML file.

The comment mentions "YAML file name" but the code extracts the suffix from the run directory name (run.metadata.path), not a YAML file. Consider updating the comment to accurately describe what's being extracted.

📝 Suggested comment clarification
-        # Use folder name suffix (YAML file name) instead of YAML's internal name field
+        # Use folder name suffix (portion after first dash) for additional context
         dirname = os.path.basename(run.metadata.path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@analysis/dashboard/app.py` around lines 307 - 314, The comment above the
suffix extraction is inaccurate: it says "YAML file name" but the code derives
dirname from run.metadata.path and extracts a folder suffix from the run
directory name (dirname -> folder_suffix) to append to label; update the comment
to clearly state that the code is using the run directory name
(run.metadata.path) and its suffix after the first "-" rather than any internal
YAML filename or YAML "name" field, and keep the existing logic in the block
that sets folder_suffix and falls back to run.metadata.job_name.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between acc6dd3 and 60ba478.

📒 Files selected for processing (5)
  • analysis/dashboard/app.py
  • analysis/dashboard/node_metrics_tab.py
  • analysis/srtlog/log_parser.py
  • analysis/srtlog/models.py
  • analysis/srtlog/run_loader.py

@gracehonv gracehonv marked this pull request as draft February 26, 2026 23:15