feat(l1): fullsync metrics by Arkenan · Pull Request #6466 · lambdaclass/ethrex

Arkenan · 2026-04-11T15:09:28Z

Add metrics for full sync so that they can be seen in a Grafana dashboard.

github-actions · 2026-04-11T15:10:38Z

🤖 Kimi Code Review

This PR adds comprehensive Prometheus metrics for full sync operations. The implementation is generally well-structured and follows Rust best practices. Here are my observations:

Critical Issues

None found. This is a metrics-only change that doesn't modify consensus-critical logic.

Potential Issues

1. Integer Overflow Risk (Low severity, technical debt)

File: crates/blockchain/metrics/fullsync.rs
Lines: 229, 232, 238, 241

The cast_signed() method converts u64 block numbers to i64 for Prometheus IntGauge. While current Ethereum block numbers (~20M) are far below i64::MAX (~9e18), this is a latent overflow risk.

// Line 229
self.target_block.set(block.cast_signed());

Suggestion: Use saturating conversion or explicit bounds checking:

self.target_block.set(block.try_into().unwrap_or(i64::MAX));
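
To make the difference concrete, here is a minimal sketch contrasting a wrapping cast with the saturating conversion suggested above. It assumes cast_signed() behaves like an `as i64` bit reinterpretation; `to_gauge_value` and `wrapping_gauge_value` are hypothetical helper names, not part of the PR.

```rust
/// Hypothetical helper (not in the PR): converts a block number to the i64
/// that an integer gauge expects, clamping at i64::MAX instead of wrapping.
fn to_gauge_value(block: u64) -> i64 {
    i64::try_from(block).unwrap_or(i64::MAX)
}

/// Wrapping conversion for comparison; assumed equivalent to cast_signed().
fn wrapping_gauge_value(block: u64) -> i64 {
    block as i64
}
```

With the saturating version an out-of-range value pins the gauge at its maximum, which is easier to spot on a dashboard than a sudden negative block number.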

2. Missing Block Hash Debug Info

File: crates/networking/p2p/sync/full.rs
Lines: 185-191

The new timing log replaces the previous log that included first/last block hashes. While cleaner, this removes valuable debugging context when execution fails.

Suggestion: Keep block hashes in the log or log them separately at debug level:

debug!(
    "Batch range: {:#?} to {:#?}",
    blocks.first().map(|b| b.hash()),
    blocks.last().map(|b| b.hash())
);

3. Magic Numbers for Sync Stages

File: crates/blockchain/metrics/fullsync.rs
Lines: 15-17, and usage in full.rs

Stage values (0=idle, 1=headers, etc.) are documented but not enforced via types.

Suggestion: Define an enum for type safety (conversion to i64 is trivial):

#[repr(i64)]
enum SyncStage { Idle = 0, DownloadingHeaders = 1, ... }
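
A fuller sketch of that enum might look like the following. Only Idle and DownloadingHeaders appear in the suggestion; the other two variant names and the `as_gauge` helper are assumptions made for illustration.

```rust
/// Sketch of a typed sync stage. Only Idle and DownloadingHeaders come from
/// the suggestion above; the remaining variant names are assumptions.
#[repr(i64)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SyncStage {
    Idle = 0,
    DownloadingHeaders = 1,
    DownloadingBodies = 2, // assumed name
    Executing = 3,         // assumed name
}

impl SyncStage {
    /// Value to hand to an i64-based gauge setter.
    fn as_gauge(self) -> i64 {
        self as i64
    }
}
```

A setter that accepts `SyncStage` instead of a raw i64 would then reject out-of-range stages at compile time.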

Code Quality Suggestions

4. Macro Hygiene

File: crates/networking/p2p/sync/full.rs
Lines: 14-21

The fullsync_metrics! macro is clean, but consider moving it to ethrex-metrics crate if other modules will need similar conditional instrumentation.

5. Registry Allocation

File: crates/blockchain/metrics/fullsync.rs
Lines: 302-327

gather_metrics() creates a new Registry and clones all metrics on every scrape. While acceptable for low-frequency scrapes, this allocates unnecessarily.

Suggestion: Consider registering metrics once at startup and storing the registry in the struct, or document that this is intentionally ephemeral to avoid global registry state.

6. Timestamp Precision

File: crates/blockchain/metrics/fullsync.rs
Lines: 260-270

Using unwrap_or_default() on SystemTime::duration_since will return 0 for times before Unix epoch. Document this behavior or use unwrap_or_else with a more appropriate fallback.
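
The fallback behaviour described here can be reproduced with the standard library alone. This is a minimal sketch; `unix_secs` is a hypothetical helper name, not a function from the PR.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: seconds since the Unix epoch, or 0 when `t` is
/// earlier than the epoch (duration_since returns Err in that case and
/// unwrap_or_default substitutes a zero Duration).
fn unix_secs(t: SystemTime) -> u64 {
    t.duration_since(UNIX_EPOCH).unwrap_or_default().as_secs()
}
```

A reported timestamp of 0 is therefore ambiguous between "exactly the epoch" and "clock error", which is why documenting the fallback is worthwhile.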

Positive Observations

Proper feature gating: All metrics code is correctly guarded with #[cfg(feature = "metrics")]
Division by zero protection: Good use of .max(0.001) in rate calculations (line 296 in full.rs)
Consistent naming: Metrics follow Prometheus conventions with fullsync_ prefix and _ms suffix for milliseconds
Modern Rust: Uses LazyLock (Rust 1.80+) for static initialization instead of lazy_static
Error handling: The gather_metrics method properly propagates errors rather than panicking
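
As an illustration of the division-by-zero guard mentioned in the list above, a rate helper with the same clamp could be sketched like this. `per_second` is a hypothetical name; the PR computes the rate inline.

```rust
use std::time::Duration;

/// Hypothetical helper: items per second with the elapsed time clamped to
/// at least 1 ms, so a zero-length interval cannot produce inf or NaN.
fn per_second(count: usize, elapsed: Duration) -> f64 {
    count as f64 / elapsed.as_secs_f64().max(0.001)
}
```

The clamp trades accuracy for safety: any batch that finishes in under a millisecond is reported as if it took exactly one.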

Summary

The PR is ready to merge after addressing the overflow risk (Item 1). The metrics provide valuable observability into sync performance bottlenecks (execution vs. merkleization vs. storage) which will help operators diagnose node performance issues.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-04-11T15:12:47Z

Lines of code report

Total lines added: 363
Total lines removed: 0
Total lines changed: 363

Detailed view

+----------------------------------------------+-------+------+
| File                                         | Lines | Diff |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs       | 2509  | +22  |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/api.rs      | 59    | +5   |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/fullsync.rs | 280   | +280 |
+----------------------------------------------+-------+------+
| ethrex/crates/blockchain/metrics/mod.rs      | 54    | +2   |
+----------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs    | 302   | +54  |
+----------------------------------------------+-------+------+

github-actions · 2026-04-11T15:14:15Z

🤖 Claude Code Review

Review: `feat(l1): fullsync metrics` (#6466)

Overall: A well-structured metrics addition that follows existing patterns in the codebase. There are two correctness issues worth addressing before merge.

Bugs

1. batch_total_ms is measured twice with inconsistent values

crates/networking/p2p/sync/full.rs:298-304

let execution_time: f64 = execution_start.elapsed().as_millis() as f64 / 1000.0;
let blocks_per_second = blocks_len as f64 / execution_time.max(0.001);

fullsync_metrics!(
    ...
    METRICS_FULLSYNC.set_batch_total_ms(execution_start.elapsed().as_millis() as f64);
    ...
);

execution_start.elapsed() is called a second time inside the fullsync_metrics! block. By the time it's called, the time elapsed includes the set_blocks_executed and set_blocks_per_second calls that precede it. The batch_total_ms gauge will always be slightly larger than the execution_time used for blocks_per_second, creating a subtle inconsistency in the dashboard.

Fix: capture elapsed once and reuse it:

let elapsed_ms = execution_start.elapsed().as_millis();
let execution_time: f64 = elapsed_ms as f64 / 1000.0;
let blocks_per_second = blocks_len as f64 / execution_time.max(0.001);

fullsync_metrics!(
    ...
    METRICS_FULLSYNC.set_batch_total_ms(elapsed_ms as f64);
    ...
);

2. transitions_ms is measured and logged but not exposed as a metric

crates/blockchain/blockchain.rs:2277

info!(
    "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms",
    blocks_len, exec_ms, transitions_ms, merkle_ms, store_ms, interval.elapsed().as_millis()
);

The [FULLSYNC TIMING] log line reports four phase durations, but MetricsFullSync only has setters for three (set_batch_execution_ms, set_batch_merkle_ms, set_batch_store_ms). There is no batch_transitions_ms field or setter. A Grafana user comparing the log output against the dashboard will find one phase missing. Either add a batch_transitions_ms gauge and wire it up in the #[cfg(feature = "metrics")] block in blockchain.rs, or drop transitions_ms from the log line.

Concerns

3. batch_total_ms semantics are not additive with the exposed parts

batch_total_ms is set from add_blocks_in_batch and covers the full execution round-trip including store.forkchoice_update. The three "part" timings (execution_ms, merkle_ms, store_ms) are set from blockchain.rs and cover only the inner phases. A Grafana panel summing the parts will always show less than batch_total_ms, and the gap will be unexplained. A comment in the metric description (e.g. "includes forkchoice update overhead not captured in individual phase metrics") would help dashboard authors.

4. Lost block hash at the start of execution

The pre-existing log:

"Executing {} blocks for full sync. First block hash: {:#?} Last block hash: {:#?}"

is replaced by a timing log. Block hashes are still present in the post-execution [SYNCING] Executed & stored… log, but removing the pre-execution hash log makes it harder to identify which batch was being processed if the sync hangs mid-execution. Worth keeping as a debug log or restoring it alongside the timing log.

Minor

5. set_stage uses raw magic numbers

Call sites pass 0/1/2/3 directly:

METRICS_FULLSYNC.set_stage(1);  // what does 1 mean here?

The semantics are documented only in the metric help string. Named constants (STAGE_IDLE, STAGE_HEADERS, STAGE_BODIES, STAGE_EXECUTING) or a small enum would make call sites self-documenting and prevent passing out-of-range values.

6. Mixed fullsync_metrics! and bare #[cfg] at the variable declaration

full.rs:72-73:

#[cfg(feature = "metrics")]
let mut header_batch_start = Instant::now();

This is technically required (variables declared inside the macro block wouldn't be visible later), but it creates a visual inconsistency with the macro used everywhere else. A comment explaining why this particular site must use the attribute directly would help future readers.

7. Gauge for timing instead of Histogram

The batch timing metrics (batch_execution_ms, batch_total_ms, etc.) only expose the most recent batch's value. Over a long sync, Prometheus/Grafana cannot compute percentiles or spot outliers. For a one-shot operation like full sync this is pragmatic, but a Histogram with a few buckets (e.g. 100ms, 500ms, 1s, 5s, 10s) would give much richer observability. Not blocking, but worth considering if sync latency becomes a focus of investigation.
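
To show what a histogram would retain that a gauge does not, here is a dependency-free sketch of cumulative bucket counting with the bucket bounds mentioned above. `BatchHistogram` is a hypothetical stand-in and not the prometheus crate's Histogram type.

```rust
/// Upper bounds in milliseconds, taken from the buckets suggested above.
const BUCKETS_MS: [f64; 5] = [100.0, 500.0, 1_000.0, 5_000.0, 10_000.0];

/// Hypothetical stand-in for a Prometheus-style histogram: each bucket
/// counts every observation less than or equal to its upper bound.
#[derive(Default)]
struct BatchHistogram {
    counts: [u64; 5],
    total: u64,
    sum_ms: f64,
}

impl BatchHistogram {
    fn observe(&mut self, ms: f64) {
        for (count, le) in self.counts.iter_mut().zip(BUCKETS_MS.iter()) {
            if ms <= *le {
                *count += 1;
            }
        }
        self.total += 1;
        self.sum_ms += ms;
    }
}
```

Because every batch is counted, a slow outlier stays visible after later batches complete, whereas a gauge would have been overwritten.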

What's good

The reset_cycle() method correctly resets only gauges and not IntCounters, respecting Prometheus's monotonicity invariant.
The fullsync_metrics! macro cleanly eliminates boilerplate #[cfg(feature = "metrics")] at call sites.
The .max(0.001) guard on elapsed time when computing rates (fixing a potential division-by-zero) is the correct approach.
The gather_metrics() pattern is consistent with the rest of the codebase.
The LazyLock global and the Default → new() delegation follow the existing METRICS_BLOCKS, METRICS_P2P pattern precisely.

Automated review by Claude (Anthropic) · sonnet · custom prompt

github-actions · 2026-04-11T15:16:36Z

🤖 Codex Code Review

Findings

full.rs, fullsync.rs: fullsync_stage is set to 1/2/3, but only reset to 0 on the happy path at the very end. Any early Ok(()) after header retry exhaustion, or any ? during body download / execution / cleanup, leaves the exporter reporting an active sync stage indefinitely. reset_cycle() also does not clear stage, so the stale state persists until the next sync attempt. This is misleading enough to break alerting and dashboards; stage reset should happen on every exit path, ideally via a guard.
full.rs, fullsync.rs: fullsync_body_failures is defined, exported, and never incremented. The two body-fetch failure paths are request_block_bodies(...).await? and .ok_or(SyncError::BodiesNotFound)?, both of which return immediately without touching the counter. That makes the new reliability metric blind to the main body-download failure mode.
full.rs: headers_per_second is measured from a timer that is only reset after successful batches. If a header request fails and the code sleeps 5 seconds before retrying, the next successful batch rate includes that backoff time and reports an artificially low throughput. If this metric is meant to represent peer/header download rate rather than end-to-end sync progress, the timer should also be reset on failed attempts.
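
The guard idea from the first finding can be sketched with a Drop implementation. This is a minimal, dependency-free illustration: the atomic `STAGE`, `StageGuard`, and `sync_cycle` are hypothetical stand-ins for the real gauge and sync function.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Stand-in for the fullsync_stage gauge (assumption: the real metric is a
/// Prometheus gauge; an atomic keeps this sketch dependency-free).
static STAGE: AtomicI64 = AtomicI64::new(0);

/// Sets the stage back to 0 (idle) when dropped, on every exit path.
struct StageGuard;

impl Drop for StageGuard {
    fn drop(&mut self) {
        STAGE.store(0, Ordering::Relaxed);
    }
}

/// Hypothetical sync cycle: `fail_headers` simulates an early return.
fn sync_cycle(fail_headers: bool) -> Result<(), String> {
    let _guard = StageGuard; // dropped on return, `?`, or panic unwind
    STAGE.store(1, Ordering::Relaxed); // downloading headers
    if fail_headers {
        return Err("header retries exhausted".to_string());
    }
    STAGE.store(3, Ordering::Relaxed); // executing
    Ok(())
}
```

Binding the guard to a named variable such as `_guard` matters: a bare `_` pattern would drop it immediately and reset the stage too early.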

No consensus, EVM, trie, or validation logic changed in this PR, so I did not find blockchain-correctness or security issues beyond these metrics-path regressions.

I could not run cargo check in this sandbox because dependency resolution is blocked by network/toolchain restrictions.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

…6486)

## Summary

- The Hive consume-engine Amsterdam tests for EIP-7778 and EIP-8037 were failing because ethrex's per-tx gas limit checks were incompatible with Amsterdam's new gas accounting rules.
- **EIP-7778** uses pre-refund gas for block accounting, so cumulative pre-refund gas can exceed the block gas limit even when a block builder correctly included all transactions.
- **EIP-8037** introduces 2D gas accounting (`block_gas = max(regular, state)`), meaning cumulative total gas (regular + state) can legally exceed the block gas limit.
- The fix skips the per-tx cumulative gas check for Amsterdam and adds a **post-execution** block-level overflow check using `max(sum_regular, sum_state)` in all three execution paths (sequential, pipeline, parallel).

## Local test results

- **200/201** EIP-7778 + EIP-8037 Hive consume-engine tests pass
- **105/105** EIP-7778 + EIP-8037 EF blockchain tests pass (4 + 101)
- The single remaining Hive failure (`test_block_regular_gas_limit[exceed=True]`) expects `TransactionException.GAS_ALLOWANCE_EXCEEDED` but we return `BlockException.GAS_USED_OVERFLOW`; the block is correctly rejected, just with a different error classification.

## Test plan

- [x] All EIP-7778 EF blockchain tests pass locally
- [x] All EIP-8037 EF blockchain tests pass locally
- [x] 200/201 Hive consume-engine Amsterdam tests pass locally
- [ ] Full CI Amsterdam Hive suite passes

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions · 2026-04-17T09:07:36Z

🤖 Claude Code Review

Here is my review of PR #6466:

PR #6466 — `feat(l1): fullsync metrics`

Overview

Adds a new MetricsFullSync struct with Prometheus gauges/counters tracking full sync phases (header download, body download, EVM execution, merkleization, storage), wires it into the sync cycle and block execution batch, and ships a Grafana dashboard JSON. The approach is consistent with the existing metrics pattern in the codebase.

Correctness Issues

transitions_ms measured but never exported

In crates/blockchain/blockchain.rs (around lines 2235–2238), transitions_ms is measured and included in the [FULLSYNC TIMING] log:

exec={}ms, transitions={}ms, merkle={}ms, store={}ms

However, MetricsFullSync has no batch_transitions_ms field and no corresponding setter. The metric is silently dropped — Grafana will show execution + merkle + store but the state-transition phase is invisible. Either add the field/setter and export it, or remove the measurement if it's intentional to omit it.

body_failures counter is declared but never incremented

MetricsFullSync has body_failures: IntCounter and inc_body_failures() is implemented (fullsync.rs:384), but there are no call sites in full.rs. When request_block_bodies returns None and the code returns SyncError::BodiesNotFound, the failure counter is not bumped. The "Failures per Minute" Grafana panel will always show zero body failures.

interval.elapsed() called twice for different purposes

In blockchain.rs (~line 2267 and ~line 2276):

info!("[FULLSYNC TIMING] ... total={}ms", interval.elapsed().as_millis());
// ...
let elapsed_seconds = interval.elapsed().as_secs_f64(); // second call

The second elapsed() measures a slightly later wall-clock time than the total in the log. More importantly, elapsed_seconds is used for throughput (gigagas/s) calculation — this was the original sole purpose of interval. The two values will diverge slightly. Consider capturing the elapsed value once into a variable before logging.
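
A minimal sketch of the single-capture approach, using only the standard library. `timing_views` is a hypothetical helper introduced for illustration; the PR would more likely bind the Duration to a local variable.

```rust
use std::time::Duration;

/// Hypothetical helper: derives both views from one captured Duration so the
/// logged total and the throughput denominator describe the same instant.
fn timing_views(elapsed: Duration) -> (u128, f64) {
    (elapsed.as_millis(), elapsed.as_secs_f64())
}
```

At the call site this would amount to calling interval.elapsed() once, then passing the result to both the log line and the throughput calculation.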

Design/Consistency Concerns

batch_total_ms measured at two different stack levels

batch_total_ms is set in full.rs:add_blocks_in_batch from execution_start.elapsed(), which wraps the entire call into blockchain.execute_blocks_batch (or equivalent). Meanwhile, the individual phase timings (exec_ms, transitions_ms, merkle_ms, store_ms) are measured inside that called function. The sum of the four internal phases will therefore not equal batch_total_ms — there is overhead between them. This is acceptable if the intent is "wall-clock outer vs inner phase breakdown," but the Grafana stacked bar chart ("Batch Time Breakdown") implies they should sum to the total, which they won't. A comment clarifying the discrepancy, or setting batch_total_ms as the sum of the four phases, would help.
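
The gap between the outer total and the inner phases can be made explicit with a small calculation. The struct below is a hypothetical sketch, not a type from the PR; its field names mirror the timings discussed above.

```rust
/// Hypothetical container for one batch's timings, all in milliseconds.
struct BatchTimings {
    exec_ms: u64,
    transitions_ms: u64,
    merkle_ms: u64,
    store_ms: u64,
    /// Outer wall-clock time measured around the whole batch call.
    total_ms: u64,
}

impl BatchTimings {
    /// Sum of the four inner phases.
    fn phases_sum_ms(&self) -> u64 {
        self.exec_ms + self.transitions_ms + self.merkle_ms + self.store_ms
    }

    /// Time inside total_ms that no inner phase accounts for.
    fn unaccounted_ms(&self) -> u64 {
        self.total_ms.saturating_sub(self.phases_sum_ms())
    }
}
```

Exporting the unaccounted remainder as its own series would be one way to let a stacked chart add up to the total.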

Mixed conditional compilation styles

Some sites use #[cfg(feature = "metrics")] directly:

#[cfg(feature = "metrics")]
let mut header_batch_start = Instant::now();

Others use the fullsync_metrics! macro. The macro is a good DRY solution — it would be cleaner to use it consistently throughout full.rs, including the variable declarations.

set_stage() takes a raw i64

Callers use magic numbers (set_stage(0), set_stage(1), etc.). A local enum (even repr(i64)) would prevent passing an invalid stage value and make call sites self-documenting. The numeric mapping is documented in the metric description string, but a type would enforce it.

Minor Issues

Hardcoded internal instance in Grafana dashboard

fullsync_dashboard.json (line ~984):

"current": { "text": "ethrex-mainnet-8:3701", "value": "ethrex-mainnet-8:3701" }

This commits a specific internal infrastructure host:port into the public repo. The current value in a Grafana template variable is just the "last saved selection" — setting it to "All" or an empty string would be cleaner for a public dashboard.

blocks_executed metric name is ambiguous

The description says "Highest block number executed so far" — so it's a gauge of the current tip, not a count. The name blocks_executed reads like a total count. last_executed_block_number or highest_executed_block would be clearer, though renaming changes the Prometheus metric name and any existing dashboards.

[FULLSYNC TIMING] log always emitted

The structured timing log in blockchain.rs fires regardless of the metrics feature flag. This is probably intentional (timing logs are useful without Prometheus), but it's worth making explicit in a comment if so, since the tag [FULLSYNC TIMING] implies it's metrics-related.

What's Working Well

The fullsync_metrics! macro avoids repetitive #[cfg] attribute spam and keeps the call sites readable.
reset_cycle() correctly resets only gauges and not counters — counters are intentionally monotonic across cycles.
The gather_metrics pattern (create registry per call, register clones, gather) is consistent with the rest of the metrics codebase and works correctly because prometheus types are Arc-backed.
Stage transition points (set_stage) are placed logically around the actual phase transitions in sync_cycle_full.
The Grafana ETA calculations use clamp_min and "and ... > 0" guards to avoid division-by-zero and displaying ETAs when there's no data — this is solid PromQL hygiene.

Summary of Actionable Items

Add batch_transitions_ms: Gauge to MetricsFullSync and export it, or remove the timing measurement if it's intentionally omitted.
Add an inc_body_failures() call site when body fetching fails, for example on the BodiesNotFound error path in full.rs.
Capture interval.elapsed() once before the log and reuse it for elapsed_seconds.
Clear the hardcoded current instance value from the Grafana dashboard JSON.

Automated review by Claude (Anthropic) · sonnet · custom prompt

github-actions · 2026-04-17T09:07:59Z

🤖 Kimi Code Review

This PR adds comprehensive metrics for full sync operations. The implementation is generally sound, but I have several specific observations:

Issues Found

1. Unused `body_failures` Counter (Minor)

File: crates/networking/p2p/sync/full.rs

The body_failures counter is defined in MetricsFullSync but never incremented. You increment header_failures when header fetches fail (line 89), but there's no corresponding increment for body fetch failures.

Suggestion: Add fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()); when peers.request_block_bodies() returns Ok(None) or when the body request errors.
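
A dependency-free sketch of counting both failure paths follows. The atomic counter stands in for the Prometheus counter, `bodies_or_count_failure` and the `Request` variant are hypothetical, and `Result::inspect_err` requires Rust 1.76 or later.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-in for the body_failures counter (assumption: an atomic replaces
/// the Prometheus counter to keep this sketch self-contained).
static BODY_FAILURES: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, PartialEq)]
enum SyncError {
    Request(String), // hypothetical variant for a failed request
    BodiesNotFound,
}

/// Counts a failure both when the request errors and when it yields None.
fn bodies_or_count_failure(
    response: Result<Option<Vec<u8>>, SyncError>,
) -> Result<Vec<u8>, SyncError> {
    response
        .inspect_err(|_| {
            BODY_FAILURES.fetch_add(1, Ordering::Relaxed);
        })?
        .ok_or_else(|| {
            BODY_FAILURES.fetch_add(1, Ordering::Relaxed);
            SyncError::BodiesNotFound
        })
}
```

Each failed call increments the counter exactly once, since the error path returns before the None check runs.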

2. Inconsistent `Instant` Types (Style)

File: crates/blockchain/blockchain.rs (lines 2193, 2194)

You mix tokio::time::Instant (imported at the top of the file as use tokio::time::Instant;) with std::time::Instant. While functionally similar, this is inconsistent. Since the existing interval variable uses tokio::time::Instant, consider using that consistently throughout.

Current:

let interval = std::time::Instant::now();  // Changed from tokio::time::Instant
let exec_start = std::time::Instant::now();

3. Potential Precision Loss in Cast (Minor)

File: crates/blockchain/metrics/fullsync.rs (lines 218-220)

The cast_signed() method (presumably converting u64 to i64 for Prometheus gauges) could theoretically overflow if Ethereum block numbers exceed i64::MAX (≈9×10¹⁸). While practically impossible for Ethereum mainnet, consider documenting this assumption or using saturating conversion.

4. Missing `batch_total_ms` Update in Blockchain (Minor)

File: crates/blockchain/blockchain.rs

You set batch_execution_ms, batch_merkle_ms, and batch_store_ms, but never set METRICS_FULLSYNC.set_batch_total_ms(). The interval.elapsed().as_millis() is calculated for the log message but not stored as a metric.

Suggestion: Add after line 2280:

METRICS_FULLSYNC.set_batch_total_ms(interval.elapsed().as_millis() as f64);

5. Registry Recreation Overhead (Performance)

File: crates/blockchain/metrics/fullsync.rs (lines 282-312)

The gather_metrics() method creates a new Registry and registers all metrics on every call. This follows the existing pattern in the codebase, but be aware this allocates on each metrics scrape. Consider registering once in LazyLock initialization if this becomes a bottleneck.

Positive Observations

Good use of the fullsync_metrics! macro in full.rs to reduce #[cfg(feature = "metrics")] boilerplate throughout the sync logic.
Proper counter/gauge distinction in reset_cycle(): You correctly reset gauges (like blocks_executed) but not counters (like headers_downloaded) which should accumulate across cycles.
Correct timestamp handling: Using SystemTime for Unix epoch timestamps and Instant for durations is the right approach.
Defensive division: Using .max(0.001) to avoid division by zero in rate calculations (lines 175, 297) is good practice.

Nitpicks

Line 2283 in blockchain.rs: The log message uses blocks_len as a pseudo-batch-ID, but this is just the count. Consider adding an actual batch sequence number if you need to correlate log lines across batches.
Line 136 in full.rs: header_batch_start is reset at the end of the metrics block, which correctly measures the inter-batch interval including processing overhead.

The Grafana dashboard JSON looks comprehensive and follows standard Prometheus conventions.

Overall: No critical bugs or security issues. The code is ready to merge after addressing the unused body_failures counter and potentially the missing batch_total_ms metric update.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-04-17T09:08:06Z

🤖 Codex Code Review

fullsync_body_failures is dead code right now. The counter is defined and exposed, but the body fetch path never increments it, even when request_block_bodies(...) exhausts retries and returns BodiesNotFound. That means the new reliability panel will report 0 body failures while full sync is actually failing on bodies. Increment it on the None -> BodiesNotFound path at minimum, and probably on recoverable body-fetch retries too if you want parity with header failures. (full.rs, fullsync.rs)
execution_stage_start_timestamp is started before body download begins, but the stage machine still reports stage 2 = downloading_bodies until later. As a result, the dashboard’s “Execution Stage Elapsed” panel measures body download + execution, not execution only. Either move set_execution_stage_start_now() to the same point as set_stage(3), or rename the metric/panel to reflect that it covers the whole post-header phase. (full.rs, fullsync_dashboard.json)
The batch timing breakdown is internally inconsistent. batch_execution_ms now stops before get_state_transitions(), while batch_total_ms is measured outside add_blocks() and also includes later work such as forkchoice_update(). The stacked “Body Download / Execution / Merkle / Storage” panel will therefore undercount the real batch total, which makes bottleneck analysis misleading. Either include transition extraction in execution, or export separate transitions/forkchoice timings so the dashboard sums to the same boundary as total. (blockchain.rs, full.rs, fullsync_dashboard.json)

No consensus, EVM, gas-accounting, or state-transition correctness issues stood out in the diff itself; the problems I found are observability/diagnostics. I wasn’t able to run cargo check in this sandbox because rustup tried to write outside the writable roots.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

greptile-apps · 2026-04-17T09:09:46Z

Greptile Summary

This PR adds Prometheus metrics and a Grafana dashboard for the full sync process, tracking stage transitions, header/body download rates, block execution progress, and per-phase batch timing (EVM execution, merkle, storage).

P1 — fullsync_body_failures always 0: inc_body_failures() is declared and exposed but never called; body download errors propagate via ? without updating the counter, making the metric misleading in the dashboard.
P2 — transitions_ms missing from metrics: state-transition extraction time is measured and logged but absent from both MetricsFullSync and the Grafana stacked-bar chart, so the bars will not sum to the total batch time.
P2 — unconditional timing overhead: timing Instants and [FULLSYNC TIMING] info! logs in blockchain.rs and full.rs are compiled and run regardless of the metrics feature, contrary to the pattern established by the metrics! macro.

Confidence Score: 4/5

Safe to merge after fixing the body_failures counter — the rest are P2 improvements.

One P1 defect: the body_failures Prometheus counter is declared, registered, and displayed in the Grafana dashboard, but inc_body_failures() is never called, so it will always report 0 regardless of actual failures. The remaining findings are P2 (log noise without metrics feature, missing transitions metric in chart, hardcoded dashboard instance).

crates/networking/p2p/sync/full.rs (body_failures never incremented), crates/blockchain/blockchain.rs (unconditional timing + missing transitions metric)

Important Files Changed

| Filename | Overview |
|---|---|
| crates/blockchain/metrics/fullsync.rs | New Prometheus metrics struct for full sync; comprehensive coverage of stage, progress, rates, and batch timings, but missing a `batch_transitions_ms` gauge that is measured in the calling code. |
| crates/blockchain/blockchain.rs | Adds per-phase timing instrumentation and metrics emission; timing variables and `[FULLSYNC TIMING]` log are unconditional (not gated by metrics feature), and `transitions_ms` is logged but never emitted as a Prometheus metric. |
| crates/networking/p2p/sync/full.rs | Wires up sync-cycle metrics (stage transitions, header/body rates, batch timing); `body_failures` counter is exposed but never incremented on body download errors. |
| crates/blockchain/metrics/api.rs | Registers METRICS_FULLSYNC in the Prometheus scrape endpoint; follows the same pattern as existing metrics modules. |
| crates/blockchain/metrics/mod.rs | Adds `fullsync` submodule gated by `api` or `metrics` feature, consistent with existing modules. |
| metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json | New Grafana dashboard with stage, progress, header/body/execution panels; has a hardcoded production instance in the template default value. |

Sequence Diagram

sequenceDiagram
    participant SC as sync_cycle_full
    participant AB as add_blocks_in_batch
    participant BC as execute_blocks_batch
    participant ME as METRICS_FULLSYNC

    SC->>ME: reset_cycle(), set_stage(1), inc_cycles_started()
    loop Header download batches
        SC->>ME: set_target_block, set_lowest_header, inc_headers_downloaded
    end
    SC->>ME: set_blocks_total(), set_stage(2), set_execution_stage_start_now()

    loop Execution batches
        SC->>ME: set_batch_body_download_ms, set_bodies_per_second, set_stage(3)
        SC->>AB: add_blocks_in_batch()
        AB->>BC: execute_blocks_batch()
        BC->>ME: set_batch_execution_ms (exec_ms)
        Note over BC,ME: transitions_ms measured but NOT set in metrics
        BC->>ME: set_batch_merkle_ms, set_batch_store_ms
        AB->>ME: set_blocks_executed, set_blocks_per_second, set_batch_total_ms, set_batch_size
    end

    SC->>ME: set_stage(0), inc_cycles_completed()

Comments Outside Diff (1)

metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json, line 980-987

Hardcoded production instance in dashboard template variable

The current field of the instance template variable is hardcoded to "ethrex-mainnet-8:3701". Anyone importing this dashboard on a different environment will see that specific production target pre-selected until they manually change it.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: crates/networking/p2p/sync/full.rs
Line: 170-175

Comment:
**`body_failures` metric never incremented**

`inc_body_failures()` is defined, exposed via Prometheus, and shown in the Grafana dashboard, but it is never called anywhere in the PR. When `request_block_bodies` returns `None` or an error, the `?` propagates immediately and the counter stays at 0. Operators will see a perpetually-zero `fullsync_body_failures` counter even during body download problems.

```suggestion
            let bodies = peers
                .request_block_bodies(header_batch)
                .await
                .inspect_err(|_| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()))
                ?
                .ok_or_else(|| {
                    fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures());
                    SyncError::BodiesNotFound
                })?;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2190-2293

Comment:
**Timing instrumentation always compiled regardless of `metrics` feature**

All four timing `Instant`s (`exec_start`, `transitions_start`, `merkle_start`, `store_start`) and the `[FULLSYNC TIMING]` `info!` log are unconditional — they run even when the `metrics` feature is disabled. The existing `metrics!` macro is specifically designed to gate this overhead. Similarly, `body_download_start` and its `[FULLSYNC TIMING]` info log in `full.rs` fire unconditionally. On a busy node this adds one `info!` line per batch and per body-download round, which can be verbose in non-metrics deployments.

Consider wrapping both the timing variables and the log under `metrics!(...)` / `fullsync_metrics!(...)`, or demoting the log to `debug!`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 2235-2238

Comment:
**`transitions_ms` measured and logged but not exposed as a Prometheus metric**

`transitions_ms` (time spent in `vm.get_state_transitions()`) is included in the `[FULLSYNC TIMING]` log and clearly matters for profiling, but there is no corresponding `batch_transitions_ms` gauge in `MetricsFullSync`. The Grafana "Batch Time Breakdown" stacked-bar chart sums `body_download + execution + merkle + store`, so it will consistently under-report total batch time by the transitions duration. This gap will be invisible to operators using the dashboard.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: metrics/provisioning/grafana/dashboards/common_dashboards/fullsync_dashboard.json
Line: 980-987

Comment:
**Hardcoded production instance in dashboard template variable**

The `current` field of the `instance` template variable is hardcoded to `"ethrex-mainnet-8:3701"`. Anyone importing this dashboard on a different environment will see that specific production target pre-selected until they manually change it.

```suggestion
        "current": {},
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Merge branch 'main' into fullsync-metric..."

greptile-apps · 2026-04-17T09:09:50Z

        while !headers.is_empty() {
            let header_batch = &headers[..min(MAX_BLOCK_BODIES_TO_REQUEST, headers.len())];
            let bodies = peers
                .request_block_bodies(header_batch)
                .await?
                .ok_or(SyncError::BodiesNotFound)?;


body_failures metric never incremented

inc_body_failures() is defined, exposed via Prometheus, and shown in the Grafana dashboard, but it is never called anywhere in the PR. When request_block_bodies returns None or an error, the ? propagates immediately and the counter stays at 0. Operators will see a perpetually-zero fullsync_body_failures counter even during body download problems.

Suggested change

-        while !headers.is_empty() {
-            let header_batch = &headers[..min(MAX_BLOCK_BODIES_TO_REQUEST, headers.len())];
-            let bodies = peers
-                .request_block_bodies(header_batch)
-                .await?
-                .ok_or(SyncError::BodiesNotFound)?;
+            let bodies = peers
+                .request_block_bodies(header_batch)
+                .await
+                .inspect_err(|_| fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures()))
+                ?
+                .ok_or_else(|| {
+                    fullsync_metrics!(METRICS_FULLSYNC.inc_body_failures());
+                    SyncError::BodiesNotFound
+                })?;
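
The two increments in the suggestion cover two different failure paths. A minimal std-only sketch of that shape follows. It assumes the reply looks like `Result<Option<T>, E>`, as the `.await?.ok_or(...)` chain suggests. `FetchError`, `fetch_bodies`, and the `AtomicU64` counter are hypothetical stand-ins, not ethrex names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, PartialEq)]
enum FetchError {
    Transport,      // the request itself failed (the `Err` path)
    BodiesNotFound, // the request succeeded but returned nothing (`None`)
}

/// Turns a `Result<Option<T>, E>`-shaped reply into `Result<T, E>`,
/// bumping `failures` on both the `Err` path and the `None` path.
fn fetch_bodies(
    reply: Result<Option<Vec<u8>>, FetchError>,
    failures: &AtomicU64, // stand-in for the counter behind inc_body_failures()
) -> Result<Vec<u8>, FetchError> {
    reply
        .inspect_err(|_| {
            failures.fetch_add(1, Ordering::Relaxed);
        })?
        .ok_or_else(|| {
            failures.fetch_add(1, Ordering::Relaxed);
            FetchError::BodiesNotFound
        })
}
```

With only the `ok_or_else` increment, a transport error would still leave the counter untouched, which is why the suggestion instruments both spots.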


greptile-apps · 2026-04-17T09:09:51Z

@@ -2232,9 +2235,13 @@ impl Blockchain {
            tokio::task::yield_now().await;
        }

+        let exec_ms = exec_start.elapsed().as_millis();
+
+        let transitions_start = std::time::Instant::now();
        let account_updates = vm
            .get_state_transitions()
            .map_err(|err| (ChainError::EvmError(err), None))?;
+        let transitions_ms = transitions_start.elapsed().as_millis();

        let last_block = blocks
            .last()
@@ -2244,11 +2251,13 @@ impl Blockchain {
        let last_block_gas_limit = last_block.header.gas_limit;

        // Apply the account updates over all blocks and compute the new state root
+        let merkle_start = std::time::Instant::now();
        let account_updates_list = self
            .storage
            .apply_account_updates_batch(first_block_header.parent_hash, &account_updates)
            .map_err(|e| (e.into(), None))?
            .ok_or((ChainError::ParentStateNotFound, None))?;
+        let merkle_ms = merkle_start.elapsed().as_millis();

        let new_state_root = account_updates_list.state_trie_hash;
        let state_updates = account_updates_list.state_updates;
@@ -2258,6 +2267,7 @@ impl Blockchain {
        // Check state root matches the one in block header
        validate_state_root(&last_block.header, new_state_root).map_err(|e| (e, None))?;

+        let store_start = std::time::Instant::now();
        let update_batch = UpdateBatch {
            account_updates: state_updates,
            storage_updates: accounts_updates,
@@ -2270,6 +2280,17 @@ impl Blockchain {
        self.storage
            .store_block_updates(update_batch)
            .map_err(|e| (e.into(), None))?;
+        let store_ms = store_start.elapsed().as_millis();
+
+        info!(
+            "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms",
+            blocks_len,
+            exec_ms,
+            transitions_ms,
+            merkle_ms,
+            store_ms,
+            interval.elapsed().as_millis()
+        );


Timing instrumentation always compiled regardless of metrics feature

All four timing Instants (exec_start, transitions_start, merkle_start, store_start) and the [FULLSYNC TIMING] info! log are unconditional — they run even when the metrics feature is disabled. The existing metrics! macro is specifically designed to gate this overhead. Similarly, body_download_start and its [FULLSYNC TIMING] info log in full.rs fire unconditionally. On a busy node this adds one info! line per batch and per body-download round, which can be verbose in non-metrics deployments.

Consider wrapping both the timing variables and the log under metrics!(...) / fullsync_metrics!(...), or demoting the log to debug!.
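
One way to read the gating suggestion is sketched below, under the assumption that the Cargo feature is named `metrics` as in the comment. `execute_batch`, `SAMPLES_RECORDED`, and `samples_recorded` are hypothetical names for illustration. The real `metrics!` / `fullsync_metrics!` macros are not reproduced here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
#[cfg(feature = "metrics")]
use std::time::Instant;

/// Number of timing samples recorded; stays at zero when the feature is off.
static SAMPLES_RECORDED: AtomicU64 = AtomicU64::new(0);

fn samples_recorded() -> u64 {
    SAMPLES_RECORDED.load(Ordering::Relaxed)
}

/// Sums the batch (a placeholder for block execution). The timer and the
/// recording step are only compiled in builds with `--features metrics`.
fn execute_batch(gas_used: &[u64]) -> u64 {
    #[cfg(feature = "metrics")]
    let exec_start = Instant::now();

    let total: u64 = gas_used.iter().sum();

    #[cfg(feature = "metrics")]
    {
        // A real build would hand this value to a gauge setter or a log line.
        let _exec_ms = exec_start.elapsed().as_millis();
        SAMPLES_RECORDED.fetch_add(1, Ordering::Relaxed);
    }

    total
}
```

Because both the `let` and the block carry the same `cfg`, a build without the feature contains neither the `Instant` nor the recording step.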


greptile-apps · 2026-04-17T09:09:52Z

            tokio::task::yield_now().await;
        }

+        let exec_ms = exec_start.elapsed().as_millis();


transitions_ms measured and logged but not exposed as a Prometheus metric

transitions_ms (time spent in vm.get_state_transitions()) is included in the [FULLSYNC TIMING] log and clearly matters for profiling, but there is no corresponding batch_transitions_ms gauge in MetricsFullSync. The Grafana "Batch Time Breakdown" stacked-bar chart sums body_download + execution + merkle + store, so it will consistently under-report total batch time by the transitions duration. This gap will be invisible to operators using the dashboard.
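
The under-reporting can be made concrete with a small std-only model. `BatchTimings` and its methods are hypothetical; `AtomicI64` stands in for an integer gauge, and `set_batch_transitions_ms` is the kind of setter the comment implies is missing.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Std-only stand-in for the per-batch timing gauges.
#[derive(Default)]
struct BatchTimings {
    body_download_ms: AtomicI64,
    execution_ms: AtomicI64,
    transitions_ms: AtomicI64, // logged today, but not exported
    merkle_ms: AtomicI64,
    store_ms: AtomicI64,
}

impl BatchTimings {
    /// Hypothetical setter for the missing phase.
    fn set_batch_transitions_ms(&self, ms: i64) {
        self.transitions_ms.store(ms, Ordering::Relaxed);
    }

    /// What the stacked chart adds up: four phases, no transitions.
    fn charted_sum_ms(&self) -> i64 {
        self.body_download_ms.load(Ordering::Relaxed)
            + self.execution_ms.load(Ordering::Relaxed)
            + self.merkle_ms.load(Ordering::Relaxed)
            + self.store_ms.load(Ordering::Relaxed)
    }

    /// The same sum once the transitions phase is part of the stack.
    fn full_sum_ms(&self) -> i64 {
        self.charted_sum_ms() + self.transitions_ms.load(Ordering::Relaxed)
    }
}
```

The gap between `full_sum_ms()` and `charted_sum_ms()` is exactly the transitions duration, which is the amount the dashboard would miss.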


ElFantasma

Historically we approached metrics modules in a manual way, but we have done some work to fix this. Now there's a split in how Prometheus is wired:

- register_* macro / default-registry pattern (2 modules): rpc.rs (introduced in #5335 by @rodrigo-o, 2025-11-24) and sync.rs (#6470, 2026-04-20). Each metric is created with register_int_gauge! / register_counter_vec! / etc., which registers it into Prometheus's default registry at construction time. api.rs surfaces them automatically via gather_default_metrics().
- Manual Registry::new() pattern (6 modules): blocks.rs, node.rs, p2p.rs, process.rs, transactions.rs, l2/metrics.rs. Each module owns a singleton struct of bare IntGauge / IntCounter collectors, exposes a gather_metrics() method that builds a fresh Registry::new() per call, and api.rs has to know about each module and concatenate their output.

We should start moving all the metrics to the first method, but that's tech debt to be addressed in some future effort.

As this PR is adding a new module from scratch, it's a free moment to land it in the better shape rather than join the manual-Registry side and grow the divergence. Concrete deltas vs the current fullsync.rs:

- Use the register_* macros (register_int_gauge!, register_int_counter!, register_int_counter_vec!) instead of IntGauge::new() / IntCounter::new(). The macros register into the default registry at construction time. New metrics auto-surface via gather_default_metrics() with zero api.rs changes; the current PR builds a fresh Registry::new() on every gather_metrics() call and forces api.rs to know about this module, both of which the macro pattern avoids.
- Collapse parallel counters into label vectors (IntCounterVec with &["component"] or &["outcome"]). #6470 does this for pivot_updates, storage_requests, header_resolution. Here, header_failures + body_failures are a natural pair that should be one fullsync_failures_total{component="header|body"}, so Grafana queries become sum(rate(...[5m])) by (component) instead of two separate series.
- Naming convention. Every other metric in this codebase is ethrex_* (ethrex_blocks_*, ethrex_p2p_*, ethrex_sync_*). fullsync_* will look orphaned in Grafana's metric picker; rename to ethrex_fullsync_*.
- Drop reset_cycle(). It exists because some of the "counters" here (cycles_started, cycles_completed) are gauges-pretending-to-be-counters. With proper IntCounter registration via the macros, monotonicity is enforced and reset becomes meaningless, and counter monotonicity is what rate() queries assume.
- Drop api.rs's explicit gather_metrics() call. Once metrics register into the default registry via macros, api.rs doesn't need to know this module exists. See the comment in api.rs:57-58 from #6470: "METRICS_SYNC registers into the default Prometheus registry at init, so its metrics are already included in gather_default_metrics() above."

The 5 inline comments above stand independently of this — they're tactical fixes that apply either way.
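
The label-vector idea from the review can be illustrated without any dependency. The sketch below is a std-only model, not the `prometheus` crate API. `LabeledCounter` is hypothetical, and the combined name `ethrex_fullsync_failures_total` used with it is an assumption that merges the rename and the label suggestions.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Std-only model of a counter family keyed by a single label.
struct LabeledCounter {
    name: &'static str,
    label: &'static str,
    values: Mutex<BTreeMap<&'static str, u64>>,
}

impl LabeledCounter {
    fn new(name: &'static str, label: &'static str) -> Self {
        Self { name, label, values: Mutex::new(BTreeMap::new()) }
    }

    /// Increments the series for one label value, creating it on first use.
    fn inc(&self, label_value: &'static str) {
        *self.values.lock().unwrap().entry(label_value).or_insert(0) += 1;
    }

    /// Renders one line per label value, in a text-exposition-like layout.
    fn render(&self) -> Vec<String> {
        self.values
            .lock()
            .unwrap()
            .iter()
            .map(|(value, count)| {
                format!("{}{{{}=\"{}\"}} {}", self.name, self.label, value, count)
            })
            .collect()
    }
}
```

One family with a `component` label yields a series per value under a single metric name, which is what lets a single query group by `component` instead of addressing two differently named counters.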

ElFantasma · 2026-04-29T18:07:22Z

        let mut transactions_count = 0;

-        let interval = Instant::now();
+        let interval = std::time::Instant::now();


These four timing Instants and the [FULLSYNC TIMING] log run regardless of the metrics feature. Cost is small but inconsistent with the rest of the metrics infra — wrap with #[cfg(feature = "metrics")] to match. The info! line below should also be feature-gated.

ElFantasma · 2026-04-29T18:07:22Z

+            "[FULLSYNC TIMING] Batch {}: exec={}ms, transitions={}ms, merkle={}ms, store={}ms, total={}ms",
+            blocks_len,
+            exec_ms,
+            transitions_ms,


transitions_ms is logged here but never exposed as a Prometheus gauge — Grafana can't graph it. Add set_batch_transitions_ms on MetricsFullSync (mirroring batch_execution_ms etc.) and call it in the metrics block below. (Greptile flagged.)

ElFantasma · 2026-04-29T18:07:22Z

+
+    // Reliability
+    header_failures: IntCounter,
+    body_failures: IntCounter,


P1: inc_body_failures() is exposed and surfaced on the dashboard but never called anywhere in this PR — any panel reading this metric will permanently show zero, masking real body-fetch failures. Either wire it at the body-download error site in crates/networking/p2p/sync/full.rs (alongside inc_header_failures()), or drop the metric until both increment and dashboard panel land together. (Greptile flagged.)

ElFantasma · 2026-04-29T18:07:22Z

+        self.stage.set(stage);
+    }
+    pub fn set_target_block(&self, block: u64) {
+        self.target_block.set(block.cast_signed());


block.cast_signed() is u64 as i64 — fine for any realistic block number, but if u64::MAX is ever passed as a sentinel "unknown target", this silently reports -1 and Grafana queries break. i64::try_from(block).unwrap_or(0) would be more defensive at zero perf cost. Same pattern in the next three setters.
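
A short sketch of the difference, using plain `as` since the comment equates it with `cast_signed()`. The three helper names are hypothetical; the `unwrap_or(0)` fallback is the one proposed here, and the `i64::MAX` variant is shown only as an alternative fallback.

```rust
/// Wrapping conversion: reinterprets the bits, like `u64 as i64`.
fn wrapping_to_gauge(block: u64) -> i64 {
    block as i64
}

/// Checked conversion that falls back to 0 when the value does not fit.
fn checked_to_gauge(block: u64) -> i64 {
    i64::try_from(block).unwrap_or(0)
}

/// Checked conversion that clamps to the largest representable value.
fn saturating_to_gauge(block: u64) -> i64 {
    i64::try_from(block).unwrap_or(i64::MAX)
}
```

For realistic block numbers all three agree; they only diverge above `i64::MAX`, where the wrapping version goes negative.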

ElFantasma · 2026-04-29T18:07:23Z

+    }
+
+    /// Reset gauges at the start of a new sync cycle so stale data doesn't persist
+    pub fn reset_cycle(&self) {


Worth a doc comment that this resets gauges only — the IntCounters (headers_downloaded, bodies_downloaded, *_failures, cycles_*) intentionally stay monotonic for rate() queries. Right now the asymmetry looks like an oversight.

Arkenan added 2 commits April 10, 2026 19:25

metrics

a94044c

moar metrics

88a37f2

github-actions Bot assigned Arkenan Apr 11, 2026

github-actions Bot added the L1 Ethereum client label Apr 11, 2026

github-project-automation Bot added this to ethrex_l1 Apr 11, 2026

Arkenan and others added 5 commits April 13, 2026 19:16

add grafana dasbhoard for full sync

bdfef9e

format

47eda94

Merge remote-tracking branch 'origin/main' into fullsync-metrics

23ff4dd

Merge branch 'main' into fullsync-metrics

24bfd57

Arkenan marked this pull request as ready for review April 17, 2026 09:05

Arkenan requested a review from a team as a code owner April 17, 2026 09:05

ethrex-project-sync Bot moved this to In Review in ethrex_l1 Apr 17, 2026

greptile-apps Bot reviewed Apr 17, 2026

lferrigno force-pushed the main branch from 28f3e58 to c921e50 Compare April 17, 2026 19:31

ElFantasma requested changes Apr 30, 2026

github-project-automation Bot moved this from In Review to In Progress in ethrex_l1 Apr 30, 2026

ElFantasma mentioned this pull request May 21, 2026

feat(l1): bal prometheus instruments, dashboard panels, and localnet fixture #6678

Open

5 tasks
