fix(l1): prevent snap sync crash on pivot update failure#6475
Conversation
…or diagnostics Add multi-layered observability for snap sync to diagnose intermittent pivot update failures on mainnet: - Prometheus metrics: eligible peers, inflight requests, pivot age, phase tracking, pivot update/storage request/header resolution outcome counters - Admin RPC: admin_peerScores (full peer table with scores, eligibility) and admin_syncStatus (phase, pivot, staleness, recent events) - Log instrumentation: TRACE per-peer dialogue, DEBUG for staleness, pivot updates with peer pool state, error classification - Docker monitor: polls new RPC endpoints, keeps rolling snapshot buffer, dumps on degradation detection with adaptive frequency - REPL: admin.peerScores and admin.syncStatus commands
…n issues detected The docker monitor bumps log level to TRACE via admin_setLogLevel RPC when degradation is detected, capturing detailed per-peer dialogue in container logs. Restores to DEBUG on recovery. Also sets default RUST_LOG in docker-compose to enable DEBUG-level sync events (staleness, pivot updates, error classification).
…shots Read pivot age from current time and progress counters from METRICS atomics on each admin_syncStatus call, so the RPC always returns fresh values instead of stale data from the last phase transition.
When a JSON response contains an array of objects (like admin_peerScores peers list), render as an aligned table with headers instead of collapsing to a single line. Makes diagnostic endpoints readable in the REPL.
…ect healing phase - on_failure: do a final RPC poll and bump log level before dumping - _dump_snapshots: support force=True to re-dump even if already dumped - Detect healing phase as degradation trigger to get 5s polling during the high-risk window where pivot updates happen - Always capture peer state at time of failure for post-mortem analysis
The method now detects both actual degradation (low eligible peers, staleness) and high-risk phases (healing), so the name should reflect the broader scope.
…radation The monitor was trying to dump peer snapshots to the run directory during degradation events, but the directory didn't exist yet (created at run end by save_all_logs). Now set_run_id() creates the directory immediately. Also adds peer_top.py live viewer script.
…ixes - fix clippy redundant closure in formatter.rs - fix unnecessary u64 casts in sync_manager.rs - migrate MetricsSync to default prometheus registry (Pattern B) so new sync metrics register once at init instead of per-gather. Removes gather_metrics() — exported via gather_default_metrics() automatically - fix admin_syncStatus reporting stale phase after recoverable error - fix on_failure() poll bypass — add force param so final poll executes even when instance status is already "failed" - fix peer_top.sh Python 3.12+ f-string syntax — use str.format() - remove dead DEGRADATION_STALL_TIMEOUT constant
Rename hardcoded "healing" trigger to configurable WATCHED_PHASES set. Phases in this set get TRACE logging and fast polling — useful for investigating specific sync stages without editing the script. Default: healing (current investigation target). Override via --watched-phases flag, e.g. --watched-phases "healing,storage_insertion"
Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the observability PR, with improvements: - Add progress gauges: headers, accounts, storage, healing, bytecodes (downloaded/inserted/total) + stage + pivot_block - Push from METRICS atomics via push_sync_prometheus_metrics() in network.rs, called each polling cycle and on phase completion - Grafana dashboard with 7 rows: overview, peer health, headers, accounts, storage, healing, bytecodes — with progress gauges, rate panels (using Grafana rate() instead of app-computed rates), and ETA - All metrics use default Prometheus registry (register at init) - New peer-health row with eligible peers, pivot age, inflight requests, and pivot update outcomes — not present in the original PR Supersedes #6468.
- Add ethrex_sync_phase_start_timestamp{phase} labeled gauge — set on
each phase transition, persists for completed phases
- Grafana computes elapsed as time() - timestamp, per phase
- Pivot Age now uses time() - ethrex_sync_pivot_timestamp (live)
- pivot_age_seconds also updated each push cycle for RPC/peer_top
- Added ETA panels for headers, accounts, bytecodes (remaining/rate)
- Added elapsed panels for all phases in their respective rows
- Overview row: replaced Throughput (N/A during sync) with Phase Elapsed
showing all active/completed phase timings
…ync, not only after
…ing_syncing covers it
…h function in network.rs covers them
…apsed per phase + full-width timeseries
- Reclassify PeerHandler and NoBlockHeaders errors as recoverable (retry sync cycle instead of process::exit) - Add get_best_peer_excluding() to rotate through all eligible peers before retrying any, preventing the same 1-2 peers from being asked repeatedly - Bump MAX_TOTAL_FAILURES from 15 to 100 - Don't count "no peers available" rotation resets as failures Closes #6474
🤖 Claude Code ReviewNow I have everything I need to write a thorough review. PR Review:
|
| // Exponential backoff based on how many full rotations we've done | ||
| if total_failures > 0 { | ||
| let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4)); | ||
| let delay = delay.min(MAX_RETRY_DELAY); |
There was a problem hiding this comment.
The comment says "based on how many full rotations we've done" but the formula total_failures / 10 is not the number of full peer rotations — it's just total failures divided by 10. With 3 failures per peer and, say, 4 peers, a full rotation produces 12 failures, making the "per-rotation" interpretation off by 1.2×. Consider updating the comment to reflect the actual formula.
| // Exponential backoff based on how many full rotations we've done | |
| if total_failures > 0 { | |
| let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4)); | |
| let delay = delay.min(MAX_RETRY_DELAY); | |
| // Exponential backoff: delay doubles every 10 total failures, capped at MAX_RETRY_DELAY | |
| if total_failures > 0 { | |
| let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4)); | |
| let delay = delay.min(MAX_RETRY_DELAY); |
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/snap_sync.rs
Line: 725-728
Comment:
**Misleading backoff comment**
The comment says "based on how many full rotations we've done" but the formula `total_failures / 10` is not the number of full peer rotations — it's just total failures divided by 10. With 3 failures per peer and, say, 4 peers, a full rotation produces 12 failures, making the "per-rotation" interpretation off by 1.2×. Consider updating the comment to reflect the actual formula.
```suggestion
// Exponential backoff: delay doubles every 10 total failures, capped at MAX_RETRY_DELAY
if total_failures > 0 {
let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
let delay = delay.min(MAX_RETRY_DELAY);
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Addressed in 87880b0. Replaced total_failures / 10 with explicit rotation_count; backoff now doubles per rotation, comment matches implementation.
| else { | ||
| debug!("We tried to get peers during update_pivot, but we found no free peers"); | ||
| #[cfg(feature = "metrics")] | ||
| ethrex_metrics::sync::METRICS_SYNC.inc_pivot_update("no_peers"); | ||
| consecutive_failures = consecutive_failures.saturating_add(1); | ||
| total_failures = total_failures.saturating_add(1); | ||
| if excluded_peers.is_empty() { | ||
| // Genuinely no peers at all | ||
| debug!("update_pivot: no free peers available"); | ||
| #[cfg(feature = "metrics")] | ||
| ethrex_metrics::sync::METRICS_SYNC.inc_pivot_update("no_peers"); | ||
| total_failures = total_failures.saturating_add(1); | ||
| } else { | ||
| // Tried all peers, start a new rotation | ||
| debug!( | ||
| "update_pivot: exhausted all {} peers, clearing exclude list for next rotation", | ||
| excluded_peers.len() | ||
| ); | ||
| excluded_peers.clear(); | ||
| // Don't count rotation resets as failures | ||
| } | ||
| continue; |
There was a problem hiding this comment.
Busy peers can trigger premature rotation reset
get_best_peer_excluding returns None for two distinct reasons: (a) every non-excluded peer was tried this rotation, or (b) non-excluded peers exist but are all over the MAX_CONCURRENT_REQUESTS_PER_PEER limit or lack ETH capabilities. The code treats both as "rotation complete" because excluded_peers.is_empty() is false in either case.
In case (b), excluded_peers.clear() re-admits recently-failed peers to the candidate pool before the busy peers become available, potentially looping through the same failed peer(s) again while perfectly healthy-but-momentarily-busy peers wait. The rotation isn't truly exhausted — it just has no current slot.
A tighter guard (e.g., comparing excluded_peers.len() against eligible_count from the diagnostics snapshot already fetched a few lines down) would distinguish the two cases, though this is minor given the 100-failure budget.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/snap_sync.rs
Line: 740-756
Comment:
**Busy peers can trigger premature rotation reset**
`get_best_peer_excluding` returns `None` for two distinct reasons: (a) every non-excluded peer was tried this rotation, or (b) non-excluded peers exist but are all over the `MAX_CONCURRENT_REQUESTS_PER_PEER` limit or lack ETH capabilities. The code treats both as "rotation complete" because `excluded_peers.is_empty()` is false in either case.
In case (b), `excluded_peers.clear()` re-admits recently-failed peers to the candidate pool before the busy peers become available, potentially looping through the same failed peer(s) again while perfectly healthy-but-momentarily-busy peers wait. The rotation isn't truly exhausted — it just has no current slot.
A tighter guard (e.g., comparing `excluded_peers.len()` against `eligible_count` from the diagnostics snapshot already fetched a few lines down) would distinguish the two cases, though this is minor given the 100-failure budget.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Addressed in 87880b0. When get_best_peer_excluding returns None we now probe get_best_peer (no exclusions) to tell whether (a) no peer is eligible at all (wait), (b) peers exist but none selectable (retry), or (c) all non-excluded peers were tried (rotation complete, clear list and bump rotation_count). Busy peers no longer trigger a premature reset.
| /// Like `do_get_best_peer`, but excludes specific peers from selection. | ||
| /// Used by `update_pivot` to rotate through peers on repeated failures. | ||
| fn do_get_best_peer_excluding( | ||
| &self, | ||
| capabilities: &[Capability], | ||
| excluded: &[H256], | ||
| ) -> Option<(H256, PeerConnection)> { | ||
| self.peers | ||
| .iter() | ||
| .filter_map(|(id, peer_data)| { | ||
| if excluded.contains(id) | ||
| || !self.can_try_more_requests(&peer_data.score, &peer_data.requests) | ||
| || !capabilities | ||
| .iter() | ||
| .any(|cap| peer_data.supported_capabilities.contains(cap)) | ||
| { | ||
| None | ||
| } else { | ||
| let connection = peer_data.connection.clone()?; | ||
| Some((*id, peer_data.score, peer_data.requests, connection)) | ||
| } | ||
| }) | ||
| .max_by_key(|(_, score, reqs, _)| self.weight_peer(score, reqs)) | ||
| .map(|(k, _, _, v)| (k, v)) | ||
| } |
There was a problem hiding this comment.
Duplicated peer-selection logic
do_get_best_peer_excluding is a strict superset of do_get_best_peer — only one extra excluded.contains(id) condition. Consider delegating to a shared helper to keep the scoring/eligibility logic in one place:
fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> {
self.do_get_best_peer_excluding(capabilities, &[])
}This avoids future drift if the eligibility check (e.g., can_try_more_requests) changes in one method but not the other.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 912-936
Comment:
**Duplicated peer-selection logic**
`do_get_best_peer_excluding` is a strict superset of `do_get_best_peer` — only one extra `excluded.contains(id)` condition. Consider delegating to a shared helper to keep the scoring/eligibility logic in one place:
```rust
fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> {
self.do_get_best_peer_excluding(capabilities, &[])
}
```
This avoids future drift if the eligibility check (e.g., `can_try_more_requests`) changes in one method but not the other.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Addressed in 87880b0. do_get_best_peer now delegates to do_get_best_peer_excluding(caps, &[]).
- Narrow PeerHandlerError recoverability: dead actor, storage full, and snap errors stay fatal; only transient peer-interaction variants retry - Catch recoverable errors inside update_pivot inner loop so protocol errors advance the rotation like Ok(None) does - Track rotation_count explicitly instead of total_failures; cap at MAX_ROTATIONS=5 so the budget scales with peer count - Use rotation_count for exponential backoff (was raw failures) - Distinguish "rotation exhausted" from "all peers at capacity"; only clear exclude list in the former case - Deduplicate do_get_best_peer by delegating to do_get_best_peer_excluding
Review feedback addressed (commit 87880b0)Thanks for the thorough reviews. All eight findings across Codex, Claude, and Greptile are addressed. Codex
Claude
Greptile (inline replies above)
Not addressedClaude's note about |
… into fix/pivot-update-crash
- Add §1.18 observability tooling (PR #6470) - Add §1.19 pivot update reliability (PR #6475, issue #6474) - Add §1.20 big-account within-trie parallelization (issue #6477) - Add §1.21 small-account batching (issue #6476) - Add §1.22 decoded TrieLayerCache (PR #6348) - Add §1.23 bloom filter for non-existent storage (PR #6288) - Add §1.24 adaptive request sizing + bisection (PR #6181) - Add §1.25 concurrent bytecode + storage (PR #6205) - Add §1.26 phase completion markers (PR #6189) - Add §2.18 StorageTrieTracker refactor (PR #6171) - Update current-state bottleneck table with small-account and pivot-update findings - Reprioritize timeline: pivot-update crash fix is now priority 0 - Add two risks (pivot crash masks perf work, DB corruption on every crash) - Bump doc version to 1.3
| | SyncError::NoBlocks | ||
| | SyncError::NoBlockHeaders => true, | ||
| // PeerHandler handled above by delegation | ||
| SyncError::PeerHandler(_) => unreachable!(), |
There was a problem hiding this comment.
Why not call e.is_recoverable() here?
| let mut peer_failures: u64 = 0; | ||
| for attempt in 0..MAX_RETRIES_PER_PEER { | ||
| let outcome = peers | ||
| .get_block_header(peer_id, &mut connection, new_pivot_block_number) |
There was a problem hiding this comment.
get_block_header doesn't actually propagate most error types, it returns Ok(None)
(it seems like it could return PeerHandlerError::BlockHeaders, but that is unreachable)
Brings in main commits since the prior merge: #6516 EIP-8025 compliance (Electra-aligned ExecutionRequests typed container in NewPayloadRequest, MAX_CONSOLIDATION_REQUESTS_PER_PAYLOAD corrected from 1 to 2, to_encoded_requests() helper for EIP-7685 bytes, removal of ExecutionPayloadHeader/NewPayloadRequestHeader, new byte-oriented execution_program entrypoint that decodes the wire format internally and returns valid: false instead of erroring on post-decode failures), #6463 BAL withdrawal reverse check (DB->BAL direction so a malicious builder can't omit a withdrawal recipient from the BAL), #6505 Kademlia k-bucket revert (PeerTableServer::spawn no longer takes a node_id), plus snap-sync observability + dashboards (#6470), pivot-update crash fix (#6475), weighted peer selection (#6428), txpool_contentFrom/txpool_inspect RPC (#6446), block-by-block exec fallback (#6464), Amsterdam EELS branch pin (#6495), and rollup store SQLite v9->v10 migration (#6514). Conflict resolutions: - crates/common/types/stateless_ssz.rs: this branch had already moved the EIP-8025 SSZ types out of crates/common/types/eip8025_ssz.rs into stateless_ssz.rs and tucked the native-rollup containers below them. Kept that layout, applied #6516's content updates to the EIP-8025 section (renamed spec-limit constants, ExecutionRequests typed container with to_encoded_requests, dropped header types and their tests), pulled in the EncodedRequests import, and kept both the new test_execution_requests_to_encoded_bytes and the branch's stateless round-trip tests. - crates/guest-program/src/l1/program.rs: adopted #6516's new execution_program(bytes: &[u8], crypto) API with the internal decode_eip8025 call, the validate_eip8025_execution helper, and the decode-failure test. Rewrote all `eip-8025` feature gates as `experimental-devnet` and all `eip8025_ssz::` paths as `stateless_ssz::` to match this branch's renames. - crates/guest-program/bin/{sp1,risc0,zisk,openvm}/src/main.rs: applied #6516's simplification (drop decode_eip8025 import, pass &input straight to execution_program) under the experimental-devnet feature gate. Also flipped the rkyv::rancor::Error import gate from the old `eip-8025` name to `experimental-devnet` so the non-devnet build still has the import it needs. - crates/prover/src/backend/exec.rs: kept #6516's updated comment ("raw input bytes" instead of "(NewPayloadRequest, ExecutionWitness)") under the experimental-devnet feature gate. Auto-merged regions checked: crates/vm/backends/levm/mod.rs picked up all of #6463's Part B (DB->BAL) reverse check intact, and cmd/ethrex/l2/initializers.rs picked up #6505's PeerTableServer::spawn signature change. Verified cargo fmt --all clean, cargo check --workspace clean, cargo check --workspace --tests clean, and cargo check -p ethrex-guest-program --features experimental-devnet --tests clean.
Conflicts in peer_handler.rs and snap_sync.rs from main's permit-based peer-request refactor (#6523), observability (#6470), and pivot fix (#6475). Adopted main's permit pattern. Kept the new download_headers_background function and extended sync_cycle_snap/snap_sync signatures to accept the diagnostics handle from main alongside the snap_enabled flag from this PR. Test fixes: - snap_server_tests.rs: updated state_trie.hash() to take &NativeCrypto - added #![allow(clippy::unwrap_used)] for lazy_static block (test file)
Motivation
Snap sync crashes ~20% of the time on mainnet (and deterministically on small networks like hoodi) when
update_pivotexhausts its 15-failure budget by repeatedly asking the same 1–2 peers. The crash (process::exit(2)) also corrupts the DB, requiring a fullremovedband resync.Root cause analysis and full details in #6474.
Description
Three targeted fixes that prevent the crash while leaving deeper peer selection improvements for a follow-up:
Reclassify
PeerHandlerandNoBlockHeaderserrors as recoverable — instead ofprocess::exit(2), the sync cycle retries. This is the key fix: even if peer selection is suboptimal, the node stays alive.Add
get_best_peer_excluding()—update_pivotnow tracks which peers already failed and excludes them from selection. When all eligible peers have been tried, the exclude list clears and a new rotation starts. This ensures every peer gets a chance before any is retried.Bump
MAX_TOTAL_FAILURESfrom 15 to 100 and stop counting "no peers available" rotation resets as failures. The old budget of 15 was exhausted in ~2 minutes, with 40–73% of failures being passive waits rather than actual requests.Checklist
cargo fmtclean