fix(l1): prevent snap sync crash on pivot update failure by ElFantasma · Pull Request #6475 · lambdaclass/ethrex

ElFantasma · 2026-04-14T22:55:07Z

Motivation

Snap sync crashes ~20% of the time on mainnet (and deterministically on small networks like hoodi) when update_pivot exhausts its 15-failure budget by repeatedly asking the same 1–2 peers. The crash (process::exit(2)) also corrupts the DB, requiring a full removedb and resync.

Root cause analysis and full details in #6474.

Description

Three targeted fixes that prevent the crash while leaving deeper peer selection improvements for a follow-up:

Reclassify PeerHandler and NoBlockHeaders errors as recoverable — instead of process::exit(2), the sync cycle retries. This is the key fix: even if peer selection is suboptimal, the node stays alive.
Add get_best_peer_excluding() — update_pivot now tracks which peers already failed and excludes them from selection. When all eligible peers have been tried, the exclude list clears and a new rotation starts. This ensures every peer gets a chance before any is retried.
Bump MAX_TOTAL_FAILURES from 15 to 100 and stop counting "no peers available" rotation resets as failures. The old budget of 15 was exhausted in ~2 minutes, with 40–73% of failures being passive waits rather than actual requests.
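The rotation described in the second fix can be pictured with a toy model. This is a minimal sketch, assuming peers are plain `u32` IDs with an integer score; `best_peer_excluding` and `next_peer` are hypothetical names used for illustration and are not the ethrex API:

```rust
/// Toy peer: (id, score). Real ethrex peers are keyed by H256 and carry
/// connection state; plain integers are an assumption of this sketch.
type Peer = (u32, i64);

/// Highest-scoring peer whose id is not in `excluded`.
fn best_peer_excluding(peers: &[Peer], excluded: &[u32]) -> Option<u32> {
    peers
        .iter()
        .filter(|(id, _)| !excluded.contains(id))
        .max_by_key(|(_, score)| *score)
        .map(|(id, _)| *id)
}

/// Picks the next peer to ask. When every peer has already failed in this
/// rotation, the exclude list is cleared and a new rotation starts.
fn next_peer(peers: &[Peer], excluded: &mut Vec<u32>) -> Option<u32> {
    if let Some(id) = best_peer_excluding(peers, excluded) {
        return Some(id);
    }
    if excluded.is_empty() {
        return None; // genuinely no peers at all
    }
    excluded.clear(); // rotation complete: everyone gets another chance
    best_peer_excluding(peers, excluded)
}
```

With three peers that all keep failing, the caller pushes each failed ID onto `excluded`, so every peer is asked once before the best-scoring one is asked again.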

Checklist

Blockchain / VM / EVM tests pass
cargo fmt clean
No new warnings

…or diagnostics

Add multi-layered observability for snap sync to diagnose intermittent pivot update failures on mainnet:

- Prometheus metrics: eligible peers, inflight requests, pivot age, phase tracking, pivot update/storage request/header resolution outcome counters
- Admin RPC: admin_peerScores (full peer table with scores, eligibility) and admin_syncStatus (phase, pivot, staleness, recent events)
- Log instrumentation: TRACE per-peer dialogue, DEBUG for staleness, pivot updates with peer pool state, error classification
- Docker monitor: polls new RPC endpoints, keeps rolling snapshot buffer, dumps on degradation detection with adaptive frequency
- REPL: admin.peerScores and admin.syncStatus commands

…n issues detected

The docker monitor bumps log level to TRACE via admin_setLogLevel RPC when degradation is detected, capturing detailed per-peer dialogue in container logs. Restores to DEBUG on recovery. Also sets default RUST_LOG in docker-compose to enable DEBUG-level sync events (staleness, pivot updates, error classification).

…shots

Read pivot age from current time and progress counters from METRICS atomics on each admin_syncStatus call, so the RPC always returns fresh values instead of stale data from the last phase transition.

When a JSON response contains an array of objects (like admin_peerScores peers list), render as an aligned table with headers instead of collapsing to a single line. Makes diagnostic endpoints readable in the REPL.

…ect healing phase

- on_failure: do a final RPC poll and bump log level before dumping
- _dump_snapshots: support force=True to re-dump even if already dumped
- Detect healing phase as degradation trigger to get 5s polling during the high-risk window where pivot updates happen
- Always capture peer state at time of failure for post-mortem analysis

The method now detects both actual degradation (low eligible peers, staleness) and high-risk phases (healing), so the name should reflect the broader scope.

…radation

The monitor was trying to dump peer snapshots to the run directory during degradation events, but the directory didn't exist yet (created at run end by save_all_logs). Now set_run_id() creates the directory immediately. Also adds peer_top.py live viewer script.

…ixes

- fix clippy redundant closure in formatter.rs
- fix unnecessary u64 casts in sync_manager.rs
- migrate MetricsSync to default prometheus registry (Pattern B) so new sync metrics register once at init instead of per-gather. Removes gather_metrics(); metrics are exported via gather_default_metrics() automatically
- fix admin_syncStatus reporting stale phase after recoverable error
- fix on_failure() poll bypass: add force param so final poll executes even when instance status is already "failed"
- fix peer_top.sh Python 3.12+ f-string syntax: use str.format()
- remove dead DEGRADATION_STALL_TIMEOUT constant

Rename hardcoded "healing" trigger to configurable WATCHED_PHASES set. Phases in this set get TRACE logging and fast polling — useful for investigating specific sync stages without editing the script. Default: healing (current investigation target). Override via --watched-phases flag, e.g. --watched-phases "healing,storage_insertion"

Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the observability PR, with improvements:

- Add progress gauges: headers, accounts, storage, healing, bytecodes (downloaded/inserted/total) + stage + pivot_block
- Push from METRICS atomics via push_sync_prometheus_metrics() in network.rs, called each polling cycle and on phase completion
- Grafana dashboard with 7 rows (overview, peer health, headers, accounts, storage, healing, bytecodes), with progress gauges, rate panels (using Grafana rate() instead of app-computed rates), and ETA
- All metrics use default Prometheus registry (register at init)
- New peer-health row with eligible peers, pivot age, inflight requests, and pivot update outcomes (not present in the original PR)

Supersedes #6468.

- Add ethrex_sync_phase_start_timestamp{phase} labeled gauge, set on each phase transition; persists for completed phases
- Grafana computes elapsed as time() - timestamp, per phase
- Pivot Age now uses time() - ethrex_sync_pivot_timestamp (live)
- pivot_age_seconds also updated each push cycle for RPC/peer_top
- Added ETA panels for headers, accounts, bytecodes (remaining/rate)
- Added elapsed panels for all phases in their respective rows
- Overview row: replaced Throughput (N/A during sync) with Phase Elapsed showing all active/completed phase timings

- Reclassify PeerHandler and NoBlockHeaders errors as recoverable (retry sync cycle instead of process::exit)
- Add get_best_peer_excluding() to rotate through all eligible peers before retrying any, preventing the same 1-2 peers from being asked repeatedly
- Bump MAX_TOTAL_FAILURES from 15 to 100
- Don't count "no peers available" rotation resets as failures

Closes #6474

github-actions · 2026-04-14T23:00:27Z

🤖 Claude Code Review

PR Review: `fix(l1): prevent snap sync crash on pivot update failure`

Overall assessment: This is a solid, targeted fix that directly addresses a real crash. The motivation is well-documented, the three changes are cohesive, and the diff is clean. There are a few correctness concerns worth discussing before merge.

`sync.rs` — `is_recoverable` reclassification

Correct and important. Moving NoBlockHeaders and PeerHandler(_) to the recoverable arm is the right call. Both represent transient network/peer conditions, not corrupted local state.

One concern worth raising: all PeerHandlerError variants are now recoverable. If PeerHandlerError includes subtypes that indicate a fundamental protocol incompatibility (e.g., wrong network, unsupported version), treating those as transiently retryable could cause the sync loop to spin indefinitely rather than surfacing the real problem. Worth verifying that no fatal protocol-level variants live inside PeerHandlerError.

`snap_sync.rs` — `update_pivot` refactor

Correctness issue — PeerHandlerError bypasses peer rotation:

The new exclusion-list logic only activates when get_block_header returns Ok(None). When it returns Err(PeerHandlerError), the ? at snap_sync.rs:797 still propagates immediately out of update_pivot:

let result = peers
    .get_block_header(peer_id, &mut connection, new_pivot_block_number)
    .await
    .map_err(SyncError::PeerHandler)?;   // exits update_pivot entirely

So in the PeerHandlerError case:

excluded_peers is never updated for that peer.
The next update_pivot call starts with a fresh, empty excluded_peers.
The same peer may be selected and fail again in an identical cycle.

The peer exclusion machinery only helps for None responses; the other half of the failure space (protocol errors) still bypasses it. Since PeerHandlerError is now recoverable, this won't crash — but it weakens the "rotate through all peers before retrying" guarantee. Consider handling the error path the same way:

match peers.get_block_header(...).await {
    Err(e) => {
        peers.peer_table.record_failure(peer_id)?;
        total_failures = total_failures.saturating_add(1);
        break; // move to next peer via exclusion
    }
    Ok(result) => { /* existing None/Some handling */ }
}

consecutive_failures_on_current is always MAX_RETRIES_PER_PEER at the log site:

The inner for loop runs the full 0..MAX_RETRIES_PER_PEER range and only exits early on success. Every time the peer is excluded, consecutive_failures_on_current equals MAX_RETRIES_PER_PEER (3). The debug log at snap_sync.rs:838 will always print "3 failures". Either use the constant directly or restructure to track only actual failures if the peer can also fail via PeerHandlerError (Point 1 above).

MAX_RETRY_DELAY is unreachable:

const MAX_RETRY_DELAY: Duration = Duration::from_secs(30);
// ...
let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
let delay = delay.min(MAX_RETRY_DELAY);

The exponent is capped at 4, so the maximum multiplier is 2^4 = 16, giving 1s * 16 = 16s. The .min(MAX_RETRY_DELAY) (30s) is never triggered. Either raise the exponent cap to 5 (1 << 5 gives 32s, just above 30s, so the MAX_RETRY_DELAY clamp actually takes effect) or remove MAX_RETRY_DELAY and rely on the .min(4) cap alone. This was present before the PR but the refactor is a good opportunity to clean it up.
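The arithmetic can be checked directly. A small sketch of the quoted formula, assuming `INITIAL_RETRY_DELAY` is 1s (inferred from the "1s * 16 = 16s" figure above) and that `total_failures` is a `u32`; `retry_delay` and `max_delay_up_to` are hypothetical helper names:

```rust
use std::time::Duration;

// MAX_RETRY_DELAY is quoted in the review; INITIAL_RETRY_DELAY = 1s is an
// assumption inferred from the review's arithmetic.
const INITIAL_RETRY_DELAY: Duration = Duration::from_secs(1);
const MAX_RETRY_DELAY: Duration = Duration::from_secs(30);

/// Delay as computed by the reviewed formula.
fn retry_delay(total_failures: u32) -> Duration {
    let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
    delay.min(MAX_RETRY_DELAY)
}

/// Largest delay the formula produces for any failure count in 0..=limit.
fn max_delay_up_to(limit: u32) -> Duration {
    (0..=limit).map(retry_delay).max().unwrap_or(Duration::ZERO)
}
```

Under these assumptions, no failure count produces a delay above 16s, so the 30s clamp never changes the result.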

Backoff fires on rotation resets (minor, likely intentional):

After excluded_peers.clear() and continue, the next outer-loop iteration hits the if total_failures > 0 { sleep(...) } guard and sleeps again before selecting a peer. This means there are two consecutive sleeps when transitioning between rotations. The comment says "Don't count rotation resets as failures" (correct), but the backoff still fires. Probably fine as a throttle, but worth a comment to make the intent clear.

`peer_table.rs` — `do_get_best_peer_excluding`

Code duplication: do_get_best_peer and do_get_best_peer_excluding are identical except for the excluded.contains(id) guard. The simpler implementation:

fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> {
    self.do_get_best_peer_excluding(capabilities, &[])
}

This eliminates a divergence risk if the peer-selection logic changes in the future.

excluded.contains(id) is O(n) per peer per lookup. With typical peer counts (dozens) this is not a practical problem, but if excluded_peers is expected to grow (e.g., in denser networks), a HashSet<H256> would be more appropriate. Given H256 is Copy + Eq + Hash, this is a one-line change at the call site.
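For illustration, a minimal sketch of the HashSet variant. It uses a `[u8; 32]` alias as a stand-in for H256, and `eligible` is a hypothetical helper, not a function from peer_table.rs:

```rust
use std::collections::HashSet;

/// Stand-in for H256 in this sketch; any `Copy + Eq + Hash` id works.
type PeerId = [u8; 32];

/// Returns the candidates not present in `excluded`, preserving order.
/// Each membership check is O(1) on average instead of a linear scan.
fn eligible(candidates: &[PeerId], excluded: &HashSet<PeerId>) -> Vec<PeerId> {
    candidates
        .iter()
        .copied()
        .filter(|id| !excluded.contains(id))
        .collect()
}
```

At peer counts in the dozens the difference is unlikely to be measurable, so this is mostly a matter of expressing intent.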

Unnecessary clone per loop iteration in snap_sync.rs:

.get_best_peer_excluding(SUPPORTED_ETH_CAPABILITIES.to_vec(), excluded_peers.clone())

The Vec<H256> is cloned on every outer-loop iteration because the protocol trait takes ownership. This is likely unavoidable with the actor-message model (which requires owned arguments for serialization), but worth confirming. If the trait could take &[H256] instead, the allocation would be eliminated. Minor in practice.

Positive notes

Reclassifying NoBlockHeaders and PeerHandler as recoverable directly solves the crash without touching unrelated code. Good surgical scope.
The empty-vs-nonempty excluded_peers distinction in the None branch (snap_sync.rs:746-760) is correct and well-commented.
MAX_TOTAL_FAILURES: 100 with the new accounting is much more appropriate than 15.
Removing the unnecessary .clone() on the Ok(pivot) return is a good cleanup.
The typo fix ("Succesfully" → "Successfully") is appreciated.

Automated review by Claude (Anthropic) · sonnet · custom prompt

greptile-apps · 2026-04-14T23:00:30Z

+        // Exponential backoff based on how many full rotations we've done
+        if total_failures > 0 {
+            let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
            let delay = delay.min(MAX_RETRY_DELAY);


Misleading backoff comment

The comment says "based on how many full rotations we've done" but the formula total_failures / 10 is not the number of full peer rotations — it's just total failures divided by 10. With 3 failures per peer and, say, 4 peers, a full rotation produces 12 failures, making the "per-rotation" interpretation off by 1.2×. Consider updating the comment to reflect the actual formula.

Suggested change

-        // Exponential backoff based on how many full rotations we've done
-        if total_failures > 0 {
-            let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
-            let delay = delay.min(MAX_RETRY_DELAY);
+        // Exponential backoff: delay doubles every 10 total failures, capped at MAX_RETRY_DELAY
+        if total_failures > 0 {
+            let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
+            let delay = delay.min(MAX_RETRY_DELAY);

Addressed in 87880b0. Replaced total_failures / 10 with explicit rotation_count; backoff now doubles per rotation, comment matches implementation.

greptile-apps · 2026-04-14T23:00:31Z

        else {
-            debug!("We tried to get peers during update_pivot, but we found no free peers");
-            #[cfg(feature = "metrics")]
-            ethrex_metrics::sync::METRICS_SYNC.inc_pivot_update("no_peers");
-            consecutive_failures = consecutive_failures.saturating_add(1);
-            total_failures = total_failures.saturating_add(1);
+            if excluded_peers.is_empty() {
+                // Genuinely no peers at all
+                debug!("update_pivot: no free peers available");
+                #[cfg(feature = "metrics")]
+                ethrex_metrics::sync::METRICS_SYNC.inc_pivot_update("no_peers");
+                total_failures = total_failures.saturating_add(1);
+            } else {
+                // Tried all peers, start a new rotation
+                debug!(
+                    "update_pivot: exhausted all {} peers, clearing exclude list for next rotation",
+                    excluded_peers.len()
+                );
+                excluded_peers.clear();
+                // Don't count rotation resets as failures
+            }
            continue;


Busy peers can trigger premature rotation reset

get_best_peer_excluding returns None for two distinct reasons: (a) every non-excluded peer was tried this rotation, or (b) non-excluded peers exist but are all over the MAX_CONCURRENT_REQUESTS_PER_PEER limit or lack ETH capabilities. The code treats both as "rotation complete" because excluded_peers.is_empty() is false in either case.

In case (b), excluded_peers.clear() re-admits recently-failed peers to the candidate pool before the busy peers become available, potentially looping through the same failed peer(s) again while perfectly healthy-but-momentarily-busy peers wait. The rotation isn't truly exhausted — it just has no current slot.

A tighter guard (e.g., comparing excluded_peers.len() against eligible_count from the diagnostics snapshot already fetched a few lines down) would distinguish the two cases, though this is minor given the 100-failure budget.

Addressed in 87880b0. When get_best_peer_excluding returns None we now probe get_best_peer (no exclusions) to tell whether (a) no peer is eligible at all (wait), (b) peers exist but none selectable (retry), or (c) all non-excluded peers were tried (rotation complete, clear list and bump rotation_count). Busy peers no longer trigger a premature reset.
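One plausible way to express that three-way distinction is sketched below. The enum, the `classify` function, and the `eligible_count` input are all assumptions made for illustration; the actual logic in 87880b0 may differ:

```rust
/// Why the excluding lookup returned no peer (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum NoPeerReason {
    /// (a) No peer is eligible at all: wait.
    NoneEligible,
    /// (b) Eligible peers exist but none is selectable right now: retry.
    AllBusy,
    /// (c) A peer is selectable once exclusions are ignored, so the
    /// rotation is complete: clear the list and bump rotation_count.
    RotationComplete,
}

/// `probe_without_exclusions` models the result of get_best_peer with no
/// exclusions; `eligible_count` models a count of capability-matching peers.
fn classify(probe_without_exclusions: Option<u32>, eligible_count: usize) -> NoPeerReason {
    match (probe_without_exclusions, eligible_count) {
        (Some(_), _) => NoPeerReason::RotationComplete,
        (None, 0) => NoPeerReason::NoneEligible,
        (None, _) => NoPeerReason::AllBusy,
    }
}
```

Only the `RotationComplete` case should clear the exclude list; the other two leave it untouched so recently failed peers are not re-admitted early.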

greptile-apps · 2026-04-14T23:00:32Z

+    /// Like `do_get_best_peer`, but excludes specific peers from selection.
+    /// Used by `update_pivot` to rotate through peers on repeated failures.
+    fn do_get_best_peer_excluding(
+        &self,
+        capabilities: &[Capability],
+        excluded: &[H256],
+    ) -> Option<(H256, PeerConnection)> {
+        self.peers
+            .iter()
+            .filter_map(|(id, peer_data)| {
+                if excluded.contains(id)
+                    || !self.can_try_more_requests(&peer_data.score, &peer_data.requests)
+                    || !capabilities
+                        .iter()
+                        .any(|cap| peer_data.supported_capabilities.contains(cap))
+                {
+                    None
+                } else {
+                    let connection = peer_data.connection.clone()?;
+                    Some((*id, peer_data.score, peer_data.requests, connection))
+                }
+            })
+            .max_by_key(|(_, score, reqs, _)| self.weight_peer(score, reqs))
+            .map(|(k, _, _, v)| (k, v))
+    }


Duplicated peer-selection logic

do_get_best_peer_excluding is a strict superset of do_get_best_peer — only one extra excluded.contains(id) condition. Consider delegating to a shared helper to keep the scoring/eligibility logic in one place:

fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> { self.do_get_best_peer_excluding(capabilities, &[]) }

This avoids future drift if the eligibility check (e.g., can_try_more_requests) changes in one method but not the other.

Addressed in 87880b0. do_get_best_peer now delegates to do_get_best_peer_excluding(caps, &[]).
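The delegation pattern, shown on a toy table. This is a sketch only: `PeerInfo`, `PeerTable`, and the `busy` flag are simplified stand-ins for the real peer data, scoring, and `can_try_more_requests` check:

```rust
/// Simplified peer record (hypothetical; the real one holds scores,
/// in-flight requests, capabilities, and a connection handle).
struct PeerInfo {
    id: u32,
    score: i64,
    busy: bool,
}

struct PeerTable {
    peers: Vec<PeerInfo>,
}

impl PeerTable {
    /// The plain lookup is just the excluding lookup with nothing excluded,
    /// so eligibility and scoring live in exactly one place.
    fn best_peer(&self) -> Option<u32> {
        self.best_peer_excluding(&[])
    }

    fn best_peer_excluding(&self, excluded: &[u32]) -> Option<u32> {
        self.peers
            .iter()
            .filter(|p| !p.busy && !excluded.contains(&p.id))
            .max_by_key(|p| p.score)
            .map(|p| p.id)
    }
}
```

Any later change to the eligibility filter then applies to both entry points automatically.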

- Narrow PeerHandlerError recoverability: dead actor, storage full, and snap errors stay fatal; only transient peer-interaction variants retry
- Catch recoverable errors inside update_pivot inner loop so protocol errors advance the rotation like Ok(None) does
- Track rotation_count explicitly instead of total_failures; cap at MAX_ROTATIONS=5 so the budget scales with peer count
- Use rotation_count for exponential backoff (was raw failures)
- Distinguish "rotation exhausted" from "all peers at capacity"; only clear exclude list in the former case
- Deduplicate do_get_best_peer by delegating to do_get_best_peer_excluding

ElFantasma · 2026-04-15T14:15:35Z

Review feedback addressed (commit `87880b0`)

Thanks for the thorough reviews. All eight findings across Codex, Claude, and Greptile are addressed.

Codex

1. MAX_TOTAL_FAILURES = 100 too low for large peer sets: Fixed. Replaced the hardcoded failure count with MAX_ROTATIONS = 5. The cap now scales naturally with peer count: one rotation = every eligible peer tried once.
2. Backoff formula scaled with failures, not rotations: Fixed. Now uses an explicit rotation_count; backoff doubles per rotation (1s, 2s, 4s, 8s, 16s, 30s).
3. PeerHandlerError wholesale recoverable: Fixed. Added PeerHandlerError::is_recoverable() and made SyncError::is_recoverable() delegate. Only transient peer-interaction variants (BlockHeaders, SendMessageToPeer, EmptyResponseFromPeer, timeouts, etc.) retry. Actor failures (PeerTableError), StorageFull, and Snap(_) stay fatal, so a dead peer-table actor now exits the node instead of looping forever.
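A reduced sketch of that delegation follows. The variant names come from this thread, but their payload types, the `CorruptedState` variant, and the omission of the remaining variants are assumptions of the example:

```rust
/// Trimmed-down stand-in for the real error type; payloads are assumed.
#[derive(Debug)]
enum PeerHandlerError {
    SendMessageToPeer(String),
    EmptyResponseFromPeer,
    StorageFull,
    PeerTableError(String),
}

impl PeerHandlerError {
    /// Transient peer-interaction problems retry; local failures do not.
    fn is_recoverable(&self) -> bool {
        match self {
            Self::SendMessageToPeer(_) | Self::EmptyResponseFromPeer => true,
            Self::StorageFull | Self::PeerTableError(_) => false,
        }
    }
}

#[derive(Debug)]
enum SyncError {
    NoBlocks,
    NoBlockHeaders,
    PeerHandler(PeerHandlerError),
    /// Hypothetical fatal variant, included only to show the `false` arm.
    CorruptedState,
}

impl SyncError {
    fn is_recoverable(&self) -> bool {
        match self {
            Self::NoBlocks | Self::NoBlockHeaders => true,
            // Delegate so the inner error decides whether to retry.
            Self::PeerHandler(e) => e.is_recoverable(),
            Self::CorruptedState => false,
        }
    }
}
```

Matching exhaustively without a wildcard arm means a newly added variant forces an explicit recoverable-or-fatal decision at compile time.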

Claude

4. PeerHandlerError bypassed peer rotation: Fixed. The inner loop now matches the outcome: Ok(Some) succeeds, Ok(None) and recoverable errors both count as a failure (advance retries, exclude peer on exhaustion), non-recoverable errors propagate. This closes the rotation gap you identified.
5. consecutive_failures_on_current always printed 3: Fixed. Variable is now peer_failures, counted separately from the retry loop index. After #4 it reflects actual failure attempts (not just the loop cap), though in the error-free case it will still equal MAX_RETRIES_PER_PEER.
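The per-peer retry loop described in these two items might look roughly like the following. It is a synchronous sketch: `FetchError`, `PeerOutcome`, and `query_peer` are hypothetical, and the closure stands in for the async `get_block_header` call:

```rust
const MAX_RETRIES_PER_PEER: u64 = 3;

/// Hypothetical error type: one transient and one fatal case.
#[derive(Debug, PartialEq)]
enum FetchError {
    Transient,
    Fatal,
}

impl FetchError {
    fn is_recoverable(&self) -> bool {
        matches!(self, FetchError::Transient)
    }
}

#[derive(Debug, PartialEq)]
enum PeerOutcome {
    /// The peer returned a header (modelled here as a block number).
    Header(u64),
    /// Retries exhausted; the caller should add the peer to the exclude list.
    Exclude { peer_failures: u64 },
}

/// Tries one peer up to MAX_RETRIES_PER_PEER times.
fn query_peer<F>(mut fetch: F) -> Result<PeerOutcome, FetchError>
where
    F: FnMut(u64) -> Result<Option<u64>, FetchError>,
{
    let mut peer_failures = 0;
    for attempt in 0..MAX_RETRIES_PER_PEER {
        match fetch(attempt) {
            Ok(Some(header)) => return Ok(PeerOutcome::Header(header)),
            // Empty responses and recoverable errors both count as a failure.
            Ok(None) => peer_failures += 1,
            Err(e) if e.is_recoverable() => peer_failures += 1,
            // Non-recoverable errors propagate to the caller.
            Err(e) => return Err(e),
        }
    }
    Ok(PeerOutcome::Exclude { peer_failures })
}
```

In this sketch `peer_failures` still equals `MAX_RETRIES_PER_PEER` whenever the loop runs to exhaustion, which matches the caveat in item 5.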

Greptile (inline replies above)

6. Misleading backoff comment: fixed with #1/#2.
7. Busy peers triggering premature rotation reset: fixed by probing get_best_peer (no exclusions) to distinguish "rotation done" from "all at capacity".
8. Duplicated peer-selection logic: do_get_best_peer now delegates to do_get_best_peer_excluding(caps, &[]).

Not addressed

Claude's note about MAX_RETRY_DELAY being unreachable is now addressed as a side-effect of #2: the new per-rotation backoff (1s << rotation_count) reaches 16s at rotation 4 and hits the 30s cap on rotation 5.
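The per-rotation schedule can be reproduced with a few lines. A sketch assuming `INITIAL_RETRY_DELAY` is 1s and `rotation_count` is a `u32`; `rotation_delay` is a hypothetical helper name:

```rust
use std::time::Duration;

// Assumed values: 1s initial delay, 30s cap (the cap is quoted in the review).
const INITIAL_RETRY_DELAY: Duration = Duration::from_secs(1);
const MAX_RETRY_DELAY: Duration = Duration::from_secs(30);

/// Delay before starting rotation number `rotation_count`:
/// doubles per rotation and is clamped to MAX_RETRY_DELAY.
fn rotation_delay(rotation_count: u32) -> Duration {
    // checked_shl returns None for shifts of 32 or more; saturate instead.
    let factor = 1u32.checked_shl(rotation_count).unwrap_or(u32::MAX);
    INITIAL_RETRY_DELAY
        .saturating_mul(factor)
        .min(MAX_RETRY_DELAY)
}
```

For rotation counts 0 through 5 this gives 1s, 2s, 4s, 8s, 16s, and 30s, so the clamp is reached on rotation 5.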

- Add §1.18 observability tooling (PR #6470)
- Add §1.19 pivot update reliability (PR #6475, issue #6474)
- Add §1.20 big-account within-trie parallelization (issue #6477)
- Add §1.21 small-account batching (issue #6476)
- Add §1.22 decoded TrieLayerCache (PR #6348)
- Add §1.23 bloom filter for non-existent storage (PR #6288)
- Add §1.24 adaptive request sizing + bisection (PR #6181)
- Add §1.25 concurrent bytecode + storage (PR #6205)
- Add §1.26 phase completion markers (PR #6189)
- Add §2.18 StorageTrieTracker refactor (PR #6171)
- Update current-state bottleneck table with small-account and pivot-update findings
- Reprioritize timeline: pivot-update crash fix is now priority 0
- Add two risks (pivot crash masks perf work, DB corruption on every crash)
- Bump doc version to 1.3

iovoid · 2026-04-17T12:57:21Z

+            | SyncError::NoBlocks
+            | SyncError::NoBlockHeaders => true,
+            // PeerHandler handled above by delegation
+            SyncError::PeerHandler(_) => unreachable!(),


Why not call e.is_recoverable() here?

iovoid · 2026-04-17T13:33:08Z

+        let mut peer_failures: u64 = 0;
+        for attempt in 0..MAX_RETRIES_PER_PEER {
+            let outcome = peers
+                .get_block_header(peer_id, &mut connection, new_pivot_block_number)


get_block_header doesn't actually propagate most error types, it returns Ok(None)

(it seems like it could return PeerHandlerError::BlockHeaders, but that is unreachable)

Moved to #6506

Brings in main commits since the prior merge:

- #6516 EIP-8025 compliance (Electra-aligned ExecutionRequests typed container in NewPayloadRequest, MAX_CONSOLIDATION_REQUESTS_PER_PAYLOAD corrected from 1 to 2, to_encoded_requests() helper for EIP-7685 bytes, removal of ExecutionPayloadHeader/NewPayloadRequestHeader, new byte-oriented execution_program entrypoint that decodes the wire format internally and returns valid: false instead of erroring on post-decode failures)
- #6463 BAL withdrawal reverse check (DB->BAL direction so a malicious builder can't omit a withdrawal recipient from the BAL)
- #6505 Kademlia k-bucket revert (PeerTableServer::spawn no longer takes a node_id)
- plus snap-sync observability + dashboards (#6470), pivot-update crash fix (#6475), weighted peer selection (#6428), txpool_contentFrom/txpool_inspect RPC (#6446), block-by-block exec fallback (#6464), Amsterdam EELS branch pin (#6495), and rollup store SQLite v9->v10 migration (#6514)

Conflict resolutions:

- crates/common/types/stateless_ssz.rs: this branch had already moved the EIP-8025 SSZ types out of crates/common/types/eip8025_ssz.rs into stateless_ssz.rs and tucked the native-rollup containers below them. Kept that layout, applied #6516's content updates to the EIP-8025 section (renamed spec-limit constants, ExecutionRequests typed container with to_encoded_requests, dropped header types and their tests), pulled in the EncodedRequests import, and kept both the new test_execution_requests_to_encoded_bytes and the branch's stateless round-trip tests.
- crates/guest-program/src/l1/program.rs: adopted #6516's new execution_program(bytes: &[u8], crypto) API with the internal decode_eip8025 call, the validate_eip8025_execution helper, and the decode-failure test. Rewrote all `eip-8025` feature gates as `experimental-devnet` and all `eip8025_ssz::` paths as `stateless_ssz::` to match this branch's renames.
- crates/guest-program/bin/{sp1,risc0,zisk,openvm}/src/main.rs: applied #6516's simplification (drop decode_eip8025 import, pass &input straight to execution_program) under the experimental-devnet feature gate. Also flipped the rkyv::rancor::Error import gate from the old `eip-8025` name to `experimental-devnet` so the non-devnet build still has the import it needs.
- crates/prover/src/backend/exec.rs: kept #6516's updated comment ("raw input bytes" instead of "(NewPayloadRequest, ExecutionWitness)") under the experimental-devnet feature gate.

Auto-merged regions checked: crates/vm/backends/levm/mod.rs picked up all of #6463's Part B (DB->BAL) reverse check intact, and cmd/ethrex/l2/initializers.rs picked up #6505's PeerTableServer::spawn signature change.

Verified cargo fmt --all clean, cargo check --workspace clean, cargo check --workspace --tests clean, and cargo check -p ethrex-guest-program --features experimental-devnet --tests clean.

Conflicts in snap/client.rs and sync/snap_sync.rs from main's permit-based peer-request refactor (#6523) and the snap-sync rotation logic (#6470/#6475). Adopted main's pattern (connection.outgoing_request + drop(permit), labeled break flow, peer rotation) and kept the async disk I/O changes.

Conflicts in peer_handler.rs and snap_sync.rs from main's permit-based peer-request refactor (#6523), observability (#6470), and pivot fix (#6475). Adopted main's permit pattern. Kept the new download_headers_background function and extended sync_cycle_snap/snap_sync signatures to accept the diagnostics handle from main alongside the snap_enabled flag from this PR.

Test fixes:

- snap_server_tests.rs: updated state_trie.hash() to take &NativeCrypto
- added #![allow(clippy::unwrap_used)] for lazy_static block (test file)

ElFantasma and others added 30 commits April 13, 2026 18:07

fix(l1): add diagnostics for header download phase in snap sync

e363561

fix(l1): compute live sync diagnostics on query instead of stale snap…

bc7b049

feat(l1): render arrays of objects as tables in REPL formatter

f20c95f

refactor(l1): rename _check_degradation to _check_alert_conditions

5066523

feat(l1): peer_top responsive layout — use full terminal width

3bbfdc8

chore(l1): remove redundant peer_top.sh — peer_top.py is the canonica…

68c9d9b

…l tool

fix(l1): default watched phases to empty — opt-in, not opt-out

e217075

Merge branch 'main' into feat/snap-sync-observability-pr

e04782a

feat(l1): add MULTISYNC_WATCHED_PHASES to Makefile for opt-in TRACE

fc87aa6

Merge branch 'main' into feat/snap-sync-observability-pr

bc3764e

fix(l1): remove internal jargon from code comment

572fa9c

feat(l1): enable metrics in multisync containers, expose mainnet port…

36e9d3d

… 3701

fix(l1): fix global declaration order in docker_monitor.py

ef39564

fix(l1): expose pivot_timestamp gauge — Grafana computes age as time(…

9fef0cb

…) - timestamp

fix(l1): push peer health metrics (eligible, snap, inflight) during s…

05a659f

…ync, not only after

refactor(l1): remove duplicate peer health push from after_sync — dur…

8ff1321

…ing_syncing covers it

refactor(l1): remove redundant prometheus pushes from snap_sync — pus…

f7f8670

…h function in network.rs covers them

feat(l1): expose metrics ports for all three multisync chains (3701/3…

0b860ad

…702/3703)

fix(l1): deduplicate dashboard panels, consistent 4-panel rows

0507f59

fix(l1): consistent dashboard layout — Progress+Counts, Rate, ETA, El…

964be80

…apsed per phase + full-width timeseries

Merge branch 'main' into feat/snap-sync-observability-pr

eb1155b

greptile-apps Bot reviewed Apr 14, 2026

ElFantasma added 3 commits April 14, 2026 20:37

feat(l1): add composite Sync Rates Overview chart to dashboard top se…

4d29538

…ction

fix(l1): move composite chart legend to bottom for time axis alignment

54db019

ElFantasma added 3 commits April 15, 2026 11:17

fix(l1): distinguish watched-phase tracing from real degradation in m…

866ea60

…onitor wording

fix(l1): log reason changes even when already in alert state (monitor…

905e368

… visibility)

Merge remote-tracking branch 'origin/feat/snap-sync-observability-pr'…

fc58f7a

… into fix/pivot-update-crash

ElFantasma mentioned this pull request Apr 15, 2026

docs(l1): snapsync roadmap #6112

Open

1 task

iovoid reviewed Apr 17, 2026

iovoid requested changes Apr 17, 2026

github-project-automation Bot moved this from In Review to In Progress in ethrex_l1 Apr 17, 2026

avilagaston9 approved these changes Apr 17, 2026

iovoid mentioned this pull request Apr 20, 2026

Improve get_block_header's usage of errors #6506

Open

iovoid approved these changes Apr 20, 2026

github-project-automation Bot moved this from In Progress to In Review in ethrex_l1 Apr 20, 2026

Base automatically changed from feat/snap-sync-observability-pr to main April 20, 2026 17:59

ilitteri approved these changes Apr 20, 2026

Merge branch 'main' into fix/pivot-update-crash

5b8b39d

ilitteri enabled auto-merge April 20, 2026 18:13

ilitteri added this pull request to the merge queue Apr 20, 2026

Merged via the queue into main with commit d03925f Apr 20, 2026
69 of 71 checks passed

ilitteri deleted the fix/pivot-update-crash branch April 20, 2026 19:42

github-project-automation Bot moved this from In Review to Done in ethrex_l1 Apr 20, 2026

fix(l1): prevent snap sync crash on pivot update failure #6475
ilitteri merged 37 commits into main from fix/pivot-update-crash

4 participants
