feat(l1): fullsync metrics#6466
Conversation
🤖 Kimi Code ReviewThis PR adds comprehensive Prometheus metrics for full sync operations. The implementation is generally well-structured and follows Rust best practices. Here are my observations: Critical IssuesNone found. This is a metrics-only change that doesn't modify consensus-critical logic. Potential Issues1. Integer Overflow Risk (Low severity, technical debt)File: The // Line 229
self.target_block.set(block.cast_signed());Suggestion: Use saturating conversion or explicit bounds checking: self.target_block.set(block.try_into().unwrap_or(i64::MAX));2. Missing Block Hash Debug InfoFile: The new timing log replaces the previous log that included first/last block hashes. While cleaner, this removes valuable debugging context when execution fails. Suggestion: Keep block hashes in the log or log them separately at debug level: debug!(
"Batch range: {:#?} to {:#?}",
blocks.first().map(|b| b.hash()),
blocks.last().map(|b| b.hash())
);3. Magic Numbers for Sync StagesFile: Stage values (0=idle, 1=headers, etc.) are documented but not enforced via types. Suggestion: Define an enum for type safety (conversion to i64 is trivial): #[repr(i64)]
enum SyncStage { Idle = 0, DownloadingHeaders = 1, ... }Code Quality Suggestions4. Macro HygieneFile: The 5. Registry AllocationFile:
Suggestion: Consider registering metrics once at startup and storing the registry in the struct, or document that this is intentionally ephemeral to avoid global registry state. 6. Timestamp PrecisionFile: Using Positive Observations
SummaryThe PR is ready to merge after addressing the overflow risk (Item 1). The metrics provide valuable observability into sync performance bottlenecks (execution vs. merkleization vs. storage) which will help operators diagnose node performance issues. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
Lines of code reportTotal lines added: Detailed view |
🤖 Claude Code ReviewNow I have enough information for a thorough review. Here it is: Review:
|
🤖 Codex Code ReviewFindings
No consensus, EVM, trie, or validation logic changed in this PR, so I did not find blockchain-correctness or security issues beyond these metrics-path regressions. I could not run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
…6486) ## Summary - The Hive consume-engine Amsterdam tests for EIP-7778 and EIP-8037 were failing because ethrex's per-tx gas limit checks were incompatible with Amsterdam's new gas accounting rules. - **EIP-7778** uses pre-refund gas for block accounting, so cumulative pre-refund gas can exceed the block gas limit even when a block builder correctly included all transactions. - **EIP-8037** introduces 2D gas accounting (`block_gas = max(regular, state)`), meaning cumulative total gas (regular + state) can legally exceed the block gas limit. - The fix skips the per-tx cumulative gas check for Amsterdam and adds a **post-execution** block-level overflow check using `max(sum_regular, sum_state)` in all three execution paths (sequential, pipeline, parallel). ## Local test results - **200/201** EIP-7778 + EIP-8037 Hive consume-engine tests pass - **105/105** EIP-7778 + EIP-8037 EF blockchain tests pass (4 + 101) - The single remaining Hive failure (`test_block_regular_gas_limit[exceed=True]`) expects `TransactionException.GAS_ALLOWANCE_EXCEEDED` but we return `BlockException.GAS_USED_OVERFLOW` — the block is correctly rejected, just with a different error classification. ## Test plan - [x] All EIP-7778 EF blockchain tests pass locally - [x] All EIP-8037 EF blockchain tests pass locally - [x] 200/201 Hive consume-engine Amsterdam tests pass locally - [ ] Full CI Amsterdam Hive suite passes --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
🤖 Claude Code ReviewHere is my review of PR #6466: PR #6466 —
|
🤖 Kimi Code ReviewThis PR adds comprehensive metrics for full sync operations. The implementation is generally sound, but I have several specific observations: Issues Found1. Unused
|
🤖 Codex Code Review
No consensus, EVM, gas-accounting, or state-transition correctness issues stood out in the diff itself; the problems I found are observability/diagnostics. I wasn’t able to run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Greptile SummaryThis PR adds Prometheus metrics and a Grafana dashboard for the full sync process, tracking stage transitions, header/body download rates, block execution progress, and per-phase batch timing (EVM execution, merkle, storage).
Confidence Score: 4/5Safe to merge after fixing the body_failures counter — the rest are P2 improvements. One P1 defect: the crates/networking/p2p/sync/full.rs (body_failures never incremented), crates/blockchain/blockchain.rs (unconditional timing + missing transitions metric)
|
| Filename | Overview |
|---|---|
| crates/blockchain/metrics/fullsync.rs | New Prometheus metrics struct for full sync; comprehensive coverage of stage, progress, rates, and batch timings, but missing a batch_transitions_ms gauge that is measured in the calling code. |
| crates/blockchain/blockchain.rs | Adds per-phase timing instrumentation and metrics emission; timing variables and [FULLSYNC TIMING] log are unconditional (not gated by metrics feature), and transitions_ms is logged but never emitted as a Prometheus metric. |
| crates/networking/p2p/sync/full.rs | Wires up sync-cycle metrics (stage transitions, header/body rates, batch timing); body_failures counter is exposed but never incremented on body download errors. |
| crates/blockchain/metrics/api.rs | Registers METRICS_FULLSYNC in the Prometheus scrape endpoint; follows the same pattern as existing metrics modules. |
| crates/blockchain/metrics/mod.rs | Adds fullsync submodule gated by api or metrics feature, consistent with existing modules. |
| metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json | New Grafana dashboard with stage, progress, header/body/execution panels; has a hardcoded production instance in the template default value. |
Sequence Diagram
sequenceDiagram
participant SC as sync_cycle_full
participant AB as add_blocks_in_batch
participant BC as execute_blocks_batch
participant ME as METRICS_FULLSYNC
SC->>ME: reset_cycle(), set_stage(1), inc_cycles_started()
loop Header download batches
SC->>ME: set_target_block, set_lowest_header, inc_headers_downloaded
end
SC->>ME: set_blocks_total(), set_stage(2), set_execution_stage_start_now()
loop Execution batches
SC->>ME: set_batch_body_download_ms, set_bodies_per_second, set_stage(3)
SC->>AB: add_blocks_in_batch()
AB->>BC: execute_blocks_batch()
BC->>ME: set_batch_execution_ms (exec_ms)
Note over BC,ME: transitions_ms measured but NOT set in metrics
BC->>ME: set_batch_merkle_ms, set_batch_store_ms
AB->>ME: set_blocks_executed, set_blocks_per_second, set_batch_total_ms, set_batch_size
end
SC->>ME: set_stage(0), inc_cycles_completed()
Comments Outside Diff (1)
-
metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json, line 980-987 (link)Hardcoded production instance in dashboard template variable
The
currentfield of theinstancetemplate variable is hardcoded to"ethrex-mainnet-8:3701". Anyone importing this dashboard on a different environment will see that specific production target pre-selected until they manually change it.Prompt To Fix With AI
This is a comment left during a code review. Path: metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json Line: 980-987 Comment: **Hardcoded production instance in dashboard template variable** The `current` field of the `instance` template variable is hardcoded to `"ethrex-mainnet-8:3701"`. Anyone importing this dashboard on a different environment will see that specific production target pre-selected until they manually change it. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/full.rs
Line: 170-175
Comment:
**`body_failures` metric never incremented**
`inc_body_failures()` is defined, exposed via Prometheus, and shown in the Grafana dashboard, but it is never called anywhere in the PR. When `request_block_bodies` returns `None` or an error, the `?` propagates immediately and the counter stays at 0. Operators will see a perpetually-zero `fullsync_body_failures` gauge even during body download problems.
```suggestion
let bodies = peers
.request_block_bodies(header_batch)
.await
.inspect_err(|_| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()))
?
.ok_or_else(|| {
fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures());
SyncError::BodiesNotFound
})?;
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2190-2293
Comment:
**Timing instrumentation always compiled regardless of `metrics` feature**
All four timing `Instant`s (`exec_start`, `transitions_start`, `merkle_start`, `store_start`) and the `[FULLSYNC TIMING]` `info!` log are unconditional — they run even when the `metrics` feature is disabled. The existing `metrics!` macro is specifically designed to gate this overhead. Similarly, `body_download_start` and its `[FULLSYNC TIMING]` info log in `full.rs` fire unconditionally. On a busy node this adds one `info!` line per batch and per body-download round, which can be verbose in non-metrics deployments.
Consider wrapping both the timing variables and the log under `metrics!(...)` / `fullsync_metrics!(...)`, or demoting the log to `debug!`.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2235-2238
Comment:
**`transitions_ms` measured and logged but not exposed as a Prometheus metric**
`transitions_ms` (time spent in `vm.get_state_transitions()`) is included in the `[FULLSYNC TIMING]` log and clearly matters for profiling, but there is no corresponding `batch_transitions_ms` gauge in `MetricsFullSync`. The Grafana "Batch Time Breakdown" stacked-bar chart sums `body_download + execution + merkle + store`, so it will consistently under-report total batch time by the transitions duration. This gap will be invisible to operators using the dashboard.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json
Line: 980-987
Comment:
**Hardcoded production instance in dashboard template variable**
The `current` field of the `instance` template variable is hardcoded to `"ethrex-mainnet-8:3701"`. Anyone importing this dashboard on a different environment will see that specific production target pre-selected until they manually change it.
```suggestion
"current": {},
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "Merge branch 'main' into fullsync-metric..." | Re-trigger Greptile
| while !headers.is_empty() { | ||
| let header_batch = &headers[..min(MAX_BLOCK_BODIES_TO_REQUEST, headers.len())]; | ||
| let bodies = peers | ||
| .request_block_bodies(header_batch) | ||
| .await? | ||
| .ok_or(SyncError::BodiesNotFound)?; |
There was a problem hiding this comment.
body_failures metric never incremented
inc_body_failures() is defined, exposed via Prometheus, and shown in the Grafana dashboard, but it is never called anywhere in the PR. When request_block_bodies returns None or an error, the ? propagates immediately and the counter stays at 0. Operators will see a perpetually-zero fullsync_body_failures gauge even during body download problems.
| while !headers.is_empty() { | |
| let header_batch = &headers[..min(MAX_BLOCK_BODIES_TO_REQUEST, headers.len())]; | |
| let bodies = peers | |
| .request_block_bodies(header_batch) | |
| .await? | |
| .ok_or(SyncError::BodiesNotFound)?; | |
| let bodies = peers | |
| .request_block_bodies(header_batch) | |
| .await | |
| .inspect_err(|_| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures())) | |
| ? | |
| .ok_or_else(|| { | |
| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()); | |
| SyncError::BodiesNotFound | |
| })?; |
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/full.rs
Line: 170-175
Comment:
**`body_failures` metric never incremented**
`inc_body_failures()` is defined, exposed via Prometheus, and shown in the Grafana dashboard, but it is never called anywhere in the PR. When `request_block_bodies` returns `None` or an error, the `?` propagates immediately and the counter stays at 0. Operators will see a perpetually-zero `fullsync_body_failures` gauge even during body download problems.
```suggestion
let bodies = peers
.request_block_bodies(header_batch)
.await
.inspect_err(|_| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()))
?
.ok_or_else(|| {
fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures());
SyncError::BodiesNotFound
})?;
```
How can I resolve this? If you propose a fix, please make it concise.| @@ -2232,9 +2235,13 @@ impl Blockchain { | |||
| tokio::task::yield_now().await; | |||
| } | |||
|
|
|||
| let exec_ms = exec_start.elapsed().as_millis(); | |||
|
|
|||
| let transitions_start = std::time::Instant::now(); | |||
| let account_updates = vm | |||
| .get_state_transitions() | |||
| .map_err(|err| (ChainError::EvmError(err), None))?; | |||
| let transitions_ms = transitions_start.elapsed().as_millis(); | |||
|
|
|||
| let last_block = blocks | |||
| .last() | |||
| @@ -2244,11 +2251,13 @@ impl Blockchain { | |||
| let last_block_gas_limit = last_block.header.gas_limit; | |||
|
|
|||
| // Apply the account updates over all blocks and compute the new state root | |||
| let merkle_start = std::time::Instant::now(); | |||
| let account_updates_list = self | |||
| .storage | |||
| .apply_account_updates_batch(first_block_header.parent_hash, &account_updates) | |||
| .map_err(|e| (e.into(), None))? | |||
| .ok_or((ChainError::ParentStateNotFound, None))?; | |||
| let merkle_ms = merkle_start.elapsed().as_millis(); | |||
|
|
|||
| let new_state_root = account_updates_list.state_trie_hash; | |||
| let state_updates = account_updates_list.state_updates; | |||
| @@ -2258,6 +2267,7 @@ impl Blockchain { | |||
| // Check state root matches the one in block header | |||
| validate_state_root(&last_block.header, new_state_root).map_err(|e| (e, None))?; | |||
|
|
|||
| let store_start = std::time::Instant::now(); | |||
| let update_batch = UpdateBatch { | |||
| account_updates: state_updates, | |||
| storage_updates: accounts_updates, | |||
| @@ -2270,6 +2280,17 @@ impl Blockchain { | |||
| self.storage | |||
| .store_block_updates(update_batch) | |||
| .map_err(|e| (e.into(), None))?; | |||
| let store_ms = store_start.elapsed().as_millis(); | |||
|
|
|||
| info!( | |||
| "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms", | |||
| blocks_len, | |||
| exec_ms, | |||
| transitions_ms, | |||
| merkle_ms, | |||
| store_ms, | |||
| interval.elapsed().as_millis() | |||
| ); | |||
There was a problem hiding this comment.
Timing instrumentation always compiled regardless of
metrics feature
All four timing Instants (exec_start, transitions_start, merkle_start, store_start) and the [FULLSYNC TIMING] info! log are unconditional — they run even when the metrics feature is disabled. The existing metrics! macro is specifically designed to gate this overhead. Similarly, body_download_start and its [FULLSYNC TIMING] info log in full.rs fire unconditionally. On a busy node this adds one info! line per batch and per body-download round, which can be verbose in non-metrics deployments.
Consider wrapping both the timing variables and the log under metrics!(...) / fullsync_metrics!(...), or demoting the log to debug!.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2190-2293
Comment:
**Timing instrumentation always compiled regardless of `metrics` feature**
All four timing `Instant`s (`exec_start`, `transitions_start`, `merkle_start`, `store_start`) and the `[FULLSYNC TIMING]` `info!` log are unconditional — they run even when the `metrics` feature is disabled. The existing `metrics!` macro is specifically designed to gate this overhead. Similarly, `body_download_start` and its `[FULLSYNC TIMING]` info log in `full.rs` fire unconditionally. On a busy node this adds one `info!` line per batch and per body-download round, which can be verbose in non-metrics deployments.
Consider wrapping both the timing variables and the log under `metrics!(...)` / `fullsync_metrics!(...)`, or demoting the log to `debug!`.
How can I resolve this? If you propose a fix, please make it concise.| tokio::task::yield_now().await; | ||
| } | ||
|
|
||
| let exec_ms = exec_start.elapsed().as_millis(); |
There was a problem hiding this comment.
transitions_ms measured and logged but not exposed as a Prometheus metric
transitions_ms (time spent in vm.get_state_transitions()) is included in the [FULLSYNC TIMING] log and clearly matters for profiling, but there is no corresponding batch_transitions_ms gauge in MetricsFullSync. The Grafana "Batch Time Breakdown" stacked-bar chart sums body_download + execution + merkle + store, so it will consistently under-report total batch time by the transitions duration. This gap will be invisible to operators using the dashboard.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2235-2238
Comment:
**`transitions_ms` measured and logged but not exposed as a Prometheus metric**
`transitions_ms` (time spent in `vm.get_state_transitions()`) is included in the `[FULLSYNC TIMING]` log and clearly matters for profiling, but there is no corresponding `batch_transitions_ms` gauge in `MetricsFullSync`. The Grafana "Batch Time Breakdown" stacked-bar chart sums `body_download + execution + merkle + store`, so it will consistently under-report total batch time by the transitions duration. This gap will be invisible to operators using the dashboard.
How can I resolve this? If you propose a fix, please make it concise.
ElFantasma
left a comment
There was a problem hiding this comment.
Historically we approached metrics modules in a manual way, but we have done some work to fix this. Now there's a split in how Prometheus is wired:
-
register_*macro / default-registry pattern (2 modules):rpc.rs(introduced in #5335 by @rodrigo-o, 2025-11-24) andsync.rs(#6470, 2026-04-20). Each metric is created withregister_int_gauge!/register_counter_vec!/ etc., which registers it into Prometheus's default registry at construction time.api.rssurfaces them automatically viagather_default_metrics(). -
Manual
Registry::new()pattern (6 modules):blocks.rs,node.rs,p2p.rs,process.rs,transactions.rs,l2/metrics.rs. Each module owns a singleton struct of bareIntGauge/IntCountercollectors, exposes agather_metrics()method that builds a freshRegistry::new()per call, andapi.rshas to know about each module and concatenate their output.
We should start moving all the metrics to the first method, but that's tech debt to be addressed in some future effort.
As this PR is adding a new module from scratch, it's a free moment to land it in the better shape rather than join the manual-Registry side and grow the divergence. Concrete deltas vs the current fullsync.rs:
-
Use the
register_*macros (register_int_gauge!,register_int_counter!,register_int_counter_vec!) instead ofIntGauge::new()/IntCounter::new(). The macros register into the default registry at construction time. New metrics auto-surface viagather_default_metrics()with zeroapi.rschanges — the current PR builds a freshRegistry::new()on everygather_metrics()call and forcesapi.rsto know about this module, both of which the macro pattern avoids. -
Collapse parallel counters into label vectors (
IntCounterVecwith&["component"]or&["outcome"]). #6470 does this forpivot_updates,storage_requests,header_resolution. Here,header_failures+body_failuresare a natural pair that should be onefullsync_failures_total{component="header|body"}— Grafana queries becomesum(rate(...)[5m]) by (component)instead of two separate series. -
Naming convention. Every other metric in this codebase is
ethrex_*(ethrex_blocks_*,ethrex_p2p_*,ethrex_sync_*).fullsync_*will look orphaned in Grafana's metric picker — rename toethrex_fullsync_*. -
Drop
reset_cycle(). It exists because some of the "counters" here (cycles_started,cycles_completed) are gauges-pretending-to-be-counters. With properIntCounterregistration via the macros, monotonicity is enforced and reset becomes meaningless — and counter monotonicity is whatrate()queries assume. -
Drop
api.rs's explicitgather_metrics()call. Once metrics register into the default registry via macros,api.rsdoesn't need to know this module exists. See the comment inapi.rs:57-58from #6470: "METRICS_SYNC registers into the default Prometheus registry at init, so its metrics are already included ingather_default_metrics()above."
The 5 inline comments above stand independently of this — they're tactical fixes that apply either way.
| let mut transactions_count = 0; | ||
|
|
||
| let interval = Instant::now(); | ||
| let interval = std::time::Instant::now(); |
There was a problem hiding this comment.
These four timing Instants and the [FULLSYNC TIMING] log run regardless of the metrics feature. Cost is small but inconsistent with the rest of the metrics infra — wrap with #[cfg(feature = "metrics")] to match. The info! line below should also be feature-gated.
| "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms", | ||
| blocks_len, | ||
| exec_ms, | ||
| transitions_ms, |
There was a problem hiding this comment.
transitions_ms is logged here but never exposed as a Prometheus gauge — Grafana can't graph it. Add set_batch_transitions_ms on MetricsFullSync (mirroring batch_execution_ms etc.) and call it in the metrics block below. (Greptile flagged.)
|
|
||
| // Reliability | ||
| header_failures: IntCounter, | ||
| body_failures: IntCounter, |
There was a problem hiding this comment.
P1: inc_body_failures() is exposed and surfaced on the dashboard but never called anywhere in this PR — any panel reading this metric will permanently show zero, masking real body-fetch failures. Either wire it at the body-download error site in crates/networking/p2p/sync/full.rs (alongside inc_header_failures()), or drop the metric until both increment and dashboard panel land together. (Greptile flagged.)
| self.stage.set(stage); | ||
| } | ||
| pub fn set_target_block(&self, block: u64) { | ||
| self.target_block.set(block.cast_signed()); |
There was a problem hiding this comment.
block.cast_signed() is u64 as i64 — fine for any realistic block number, but if u64::MAX is ever passed as a sentinel "unknown target", this silently reports -1 and Grafana queries break. i64::try_from(block).unwrap_or(0) would be more defensive at zero perf cost. Same pattern in the next three setters.
| } | ||
|
|
||
| /// Reset gauges at the start of a new sync cycle so stale data doesn't persist | ||
| pub fn reset_cycle(&self) { |
There was a problem hiding this comment.
Worth a doc comment that this resets gauges only — the IntCounters (headers_downloaded, bodies_downloaded, *_failures, cycles_*) intentionally stay monotonic for rate() queries. Right now the asymmetry looks like an oversight.
Add metrics for full sync so that they can be seen in a grafana dashboard.