test(l1): add per-phase timing breakdown to multisync Slack notifications#6136
Conversation
in the multisync monitoring script (docker_monitor.py). The sync completion logs already contain per-phase completion markers (e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00") but this data was not surfaced in the Slack messages or run summaries. This adds a parse_phase_timings() function that reads saved container logs and extracts timing, count, and duration for all 8 snap sync phases: Block Headers, Account Ranges, Account Insertion, Storage Ranges, Storage Insertion, State Healing, Storage Healing, and Bytecodes. The breakdown is appended to both the Slack notification (as a code block per network instance) and the text-based run log (run_history.log and per-run summary.txt). When a phase did not complete (e.g. on a failed run), it is simply omitted from the breakdown.
Greptile OverviewGreptile SummaryAdds per-phase timing breakdown to multisync monitoring Slack notifications and run logs by parsing saved container logs for phase completion markers. Key changes:
Implementation quality:
The change provides immediate visibility into performance bottlenecks without requiring manual log inspection. Confidence Score: 5/5
|
| Filename | Overview |
|---|---|
| tooling/sync/docker_monitor.py | Adds phase timing breakdown parsing and display for multisync monitoring - clean implementation with proper error handling |
Sequence Diagram
sequenceDiagram
participant Main as main()
participant SaveLogs as save_all_logs()
participant LogResult as log_run_result()
participant SlackNotify as slack_notify()
participant ParsePhase as parse_phase_timings()
participant LogFile as Log Files
Note over Main: Run completes (success or failed)
Main->>SaveLogs: save_all_logs(instances, run_id, compose_file)
SaveLogs->>LogFile: Write container logs to multisync_logs/run_{run_id}/{container}.log
LogFile-->>SaveLogs: Logs saved
SaveLogs-->>Main: Complete
Main->>LogResult: log_run_result(run_id, run_count, instances, ...)
loop For each instance
LogResult->>ParsePhase: parse_phase_timings(run_id, container)
ParsePhase->>LogFile: Read multisync_logs/run_{run_id}/{container}.log
LogFile-->>ParsePhase: Log content
ParsePhase->>ParsePhase: Apply regex patterns for 8 phases
ParsePhase-->>LogResult: [(phase_name, count, duration), ...]
LogResult->>LogResult: Format and append to text log
end
LogResult->>LogFile: Append to run_history.log and summary.txt
LogResult-->>Main: Complete
Main->>SlackNotify: slack_notify(run_id, run_count, instances, ...)
loop For each instance
SlackNotify->>ParsePhase: parse_phase_timings(run_id, container)
ParsePhase->>LogFile: Read multisync_logs/run_{run_id}/{container}.log
LogFile-->>ParsePhase: Log content
ParsePhase->>ParsePhase: Apply regex patterns for 8 phases
ParsePhase-->>SlackNotify: [(phase_name, count, duration), ...]
SlackNotify->>SlackNotify: Format phase breakdown code block
end
SlackNotify->>SlackNotify: POST to Slack webhook with blocks
SlackNotify-->>Main: Complete
There was a problem hiding this comment.
Pull request overview
This PR adds per-phase timing breakdown to multisync Slack notifications and log files, making it easier to identify performance bottlenecks in sync operations without manually parsing container logs.
Changes:
- Added
PHASE_COMPLETION_PATTERNSdictionary with regex patterns for all 8 snap sync phases - Added
parse_phase_timings()function to extract phase timing data from saved container logs - Enhanced Slack notifications and text logs to display phase breakdowns for each network instance
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Add phase breakdown for each instance | ||
| for i in instances: | ||
| phases = parse_phase_timings(run_id, i.container) | ||
| if phases: | ||
| phase_lines = [f"📊 *Phase Breakdown — {i.name}*", "```"] | ||
| max_name_len = max(len(name) for name, _, _ in phases) | ||
| for name, count, duration in phases: | ||
| phase_lines.append(f"{name:<{max_name_len}} {duration} ({count})") | ||
| phase_lines.append("```") | ||
| blocks.append({ | ||
| "type": "section", | ||
| "text": {"type": "mrkdwn", "text": "\n".join(phase_lines)} | ||
| }) |
There was a problem hiding this comment.
The phase breakdown sections are added in a separate loop (lines 360-371) from the instance status sections (lines 342-357). This means in the Slack notification, all instance statuses will be shown first, followed by all phase breakdowns. This differs from the text log format (lines 456-479) where each phase breakdown immediately follows its instance status. For better readability and consistency, consider moving the phase breakdown logic inside the first loop (after line 357) so each instance's breakdown appears immediately after its status, matching the text log format and the example in the PR description.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Motivation
The multisync monitoring script (
docker_monitor.py) sends Slack notifications at the end of each sync run, but they only report the total sync time per network. When investigating performance regressions or comparing runs, we had to manually SSH into the server and parse raw container logs to figure out which phase was slow. This is time-consuming and error-prone.The sync logs already contain per-phase completion markers like:
This PR surfaces that data directly in the Slack notification, so performance bottlenecks are visible at a glance.
Description
Adds three things to
tooling/sync/docker_monitor.py:PHASE_COMPLETION_PATTERNSdict — Regex patterns for all 8 snap sync phases:parse_phase_timings(run_id, container)function — Reads saved container log files frommultisync_logs/run_{run_id}/{container}.logand extracts(phase_name, item_count, duration)for each completed phase. Returns an empty list if logs are missing or if a phase didn't complete (e.g., on a failed run), so the behavior is graceful.Phase breakdown in Slack and run logs — After the per-instance status line, a code block is appended showing the full phase timing table. The same breakdown is also written to
run_history.logand the per-runsummary.txt.Expected Slack output (successful run)
The Slack message will now include a section like this for each network instance:
Phase names are left-aligned with padding for readability. The count in parentheses corresponds to the number of items processed (headers, accounts, storage slots, etc.).
Expected Slack output (failed run with partial phases)
If a run fails mid-sync (e.g., timeout during storage healing), only the phases that completed are shown:
Phases that never completed (State Healing, Storage Healing, Bytecodes in this case) are simply omitted — no placeholder or "N/A" rows.
Expected text log output (
summary.txt/run_history.log)How it works
The flow is:
save_all_logs()saves container logs to disk (already existed, no changes)log_run_result()now callsparse_phase_timings()and appends breakdown to text logslack_notify()now callsparse_phase_timings()and appends code blocks to Slack payloadSince
save_all_logs()is called before bothlog_run_result()andslack_notify()(lines 721→725 in main loop), the saved log files are always available for parsing.Edge cases
Checklist
STORE_SCHEMA_VERSION(crates/storage/lib.rs) if the PR includes breaking changes to theStorerequiring a re-sync.N/A — This PR only modifies the Python monitoring script, no Rust code or storage changes.