Skip to content

fix(l1): prevent snap sync crash on pivot update failure#6475

Merged
ilitteri merged 37 commits into
mainfrom
fix/pivot-update-crash
Apr 20, 2026
Merged

fix(l1): prevent snap sync crash on pivot update failure#6475
ilitteri merged 37 commits into
mainfrom
fix/pivot-update-crash

Conversation

@ElFantasma
Copy link
Copy Markdown
Contributor

Motivation

Snap sync crashes ~20% of the time on mainnet (and deterministically on small networks like hoodi) when update_pivot exhausts its 15-failure budget by repeatedly asking the same 1–2 peers. The crash (process::exit(2)) also corrupts the DB, requiring a full removedb and resync.

Root cause analysis and full details in #6474.

Description

Three targeted fixes that prevent the crash while leaving deeper peer selection improvements for a follow-up:

  1. Reclassify PeerHandler and NoBlockHeaders errors as recoverable — instead of process::exit(2), the sync cycle retries. This is the key fix: even if peer selection is suboptimal, the node stays alive.

  2. Add get_best_peer_excluding()update_pivot now tracks which peers already failed and excludes them from selection. When all eligible peers have been tried, the exclude list clears and a new rotation starts. This ensures every peer gets a chance before any is retried.

  3. Bump MAX_TOTAL_FAILURES from 15 to 100 and stop counting "no peers available" rotation resets as failures. The old budget of 15 was exhausted in ~2 minutes, with 40–73% of failures being passive waits rather than actual requests.

Checklist

  • Blockchain / VM / EVM tests pass
  • cargo fmt clean
  • No new warnings

ElFantasma and others added 30 commits April 13, 2026 18:07
…or diagnostics

Add multi-layered observability for snap sync to diagnose intermittent
pivot update failures on mainnet:

- Prometheus metrics: eligible peers, inflight requests, pivot age,
  phase tracking, pivot update/storage request/header resolution
  outcome counters
- Admin RPC: admin_peerScores (full peer table with scores, eligibility)
  and admin_syncStatus (phase, pivot, staleness, recent events)
- Log instrumentation: TRACE per-peer dialogue, DEBUG for staleness,
  pivot updates with peer pool state, error classification
- Docker monitor: polls new RPC endpoints, keeps rolling snapshot
  buffer, dumps on degradation detection with adaptive frequency
- REPL: admin.peerScores and admin.syncStatus commands
…n issues detected

The docker monitor bumps log level to TRACE via admin_setLogLevel RPC
when degradation is detected, capturing detailed per-peer dialogue in
container logs. Restores to DEBUG on recovery.

Also sets default RUST_LOG in docker-compose to enable DEBUG-level sync
events (staleness, pivot updates, error classification).
…shots

Read pivot age from current time and progress counters from METRICS
atomics on each admin_syncStatus call, so the RPC always returns
fresh values instead of stale data from the last phase transition.
When a JSON response contains an array of objects (like admin_peerScores
peers list), render as an aligned table with headers instead of collapsing
to a single line. Makes diagnostic endpoints readable in the REPL.
…ect healing phase

- on_failure: do a final RPC poll and bump log level before dumping
- _dump_snapshots: support force=True to re-dump even if already dumped
- Detect healing phase as degradation trigger to get 5s polling during
  the high-risk window where pivot updates happen
- Always capture peer state at time of failure for post-mortem analysis
The method now detects both actual degradation (low eligible peers,
staleness) and high-risk phases (healing), so the name should reflect
the broader scope.
…radation

The monitor was trying to dump peer snapshots to the run directory during
degradation events, but the directory didn't exist yet (created at run end
by save_all_logs). Now set_run_id() creates the directory immediately.

Also adds peer_top.py live viewer script.
…ixes

- fix clippy redundant closure in formatter.rs
- fix unnecessary u64 casts in sync_manager.rs
- migrate MetricsSync to default prometheus registry (Pattern B) so new
  sync metrics register once at init instead of per-gather. Removes
  gather_metrics() — exported via gather_default_metrics() automatically
- fix admin_syncStatus reporting stale phase after recoverable error
- fix on_failure() poll bypass — add force param so final poll executes
  even when instance status is already "failed"
- fix peer_top.sh Python 3.12+ f-string syntax — use str.format()
- remove dead DEGRADATION_STALL_TIMEOUT constant
Rename hardcoded "healing" trigger to configurable WATCHED_PHASES set.
Phases in this set get TRACE logging and fast polling — useful for
investigating specific sync stages without editing the script.

Default: healing (current investigation target).
Override via --watched-phases flag, e.g. --watched-phases "healing,storage_insertion"
Incorporate progress metrics from PR #6468 (Tomi/Esteve) into the
observability PR, with improvements:

- Add progress gauges: headers, accounts, storage, healing, bytecodes
  (downloaded/inserted/total) + stage + pivot_block
- Push from METRICS atomics via push_sync_prometheus_metrics() in
  network.rs, called each polling cycle and on phase completion
- Grafana dashboard with 7 rows: overview, peer health, headers,
  accounts, storage, healing, bytecodes — with progress gauges, rate
  panels (using Grafana rate() instead of app-computed rates), and ETA
- All metrics use default Prometheus registry (register at init)
- New peer-health row with eligible peers, pivot age, inflight requests,
  and pivot update outcomes — not present in the original PR

Supersedes #6468.
- Add ethrex_sync_phase_start_timestamp{phase} labeled gauge — set on
  each phase transition, persists for completed phases
- Grafana computes elapsed as time() - timestamp, per phase
- Pivot Age now uses time() - ethrex_sync_pivot_timestamp (live)
- pivot_age_seconds also updated each push cycle for RPC/peer_top
- Added ETA panels for headers, accounts, bytecodes (remaining/rate)
- Added elapsed panels for all phases in their respective rows
- Overview row: replaced Throughput (N/A during sync) with Phase Elapsed
  showing all active/completed phase timings
- Reclassify PeerHandler and NoBlockHeaders errors as recoverable
  (retry sync cycle instead of process::exit)
- Add get_best_peer_excluding() to rotate through all eligible peers
  before retrying any, preventing the same 1-2 peers from being asked
  repeatedly
- Bump MAX_TOTAL_FAILURES from 15 to 100
- Don't count "no peers available" rotation resets as failures

Closes #6474
@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have everything I need to write a thorough review.


PR Review: fix(l1): prevent snap sync crash on pivot update failure

Overall assessment: This is a solid, targeted fix that directly addresses a real crash. The motivation is well-documented, the three changes are cohesive, and the diff is clean. There are a few correctness concerns worth discussing before merge.


sync.rsis_recoverable reclassification

Correct and important. Moving NoBlockHeaders and PeerHandler(_) to the recoverable arm is the right call. Both represent transient network/peer conditions, not corrupted local state.

One concern worth raising: all PeerHandlerError variants are now recoverable. If PeerHandlerError includes subtypes that indicate a fundamental protocol incompatibility (e.g., wrong network, unsupported version), treating those as transiently retryable could cause the sync loop to spin indefinitely rather than surfacing the real problem. Worth verifying that no fatal protocol-level variants live inside PeerHandlerError.


snap_sync.rsupdate_pivot refactor

Correctness issue — PeerHandlerError bypasses peer rotation:

The new exclusion-list logic only activates when get_block_header returns Ok(None). When it returns Err(PeerHandlerError), the ? at snap_sync.rs:797 still propagates immediately out of update_pivot:

let result = peers
    .get_block_header(peer_id, &mut connection, new_pivot_block_number)
    .await
    .map_err(SyncError::PeerHandler)?;   // exits update_pivot entirely

So in the PeerHandlerError case:

  • excluded_peers is never updated for that peer.
  • The next update_pivot call starts with a fresh, empty excluded_peers.
  • The same peer may be selected and fail again in an identical cycle.

The peer exclusion machinery only helps for None responses; the other half of the failure space (protocol errors) still bypasses it. Since PeerHandlerError is now recoverable, this won't crash — but it weakens the "rotate through all peers before retrying" guarantee. Consider handling the error path the same way:

match peers.get_block_header(...).await {
    Err(e) => {
        peers.peer_table.record_failure(peer_id)?;
        total_failures = total_failures.saturating_add(1);
        break; // move to next peer via exclusion
    }
    Ok(result) => { /* existing None/Some handling */ }
}

consecutive_failures_on_current is always MAX_RETRIES_PER_PEER at the log site:

The inner for loop runs the full 0..MAX_RETRIES_PER_PEER range and only exits early on success. Every time the peer is excluded, consecutive_failures_on_current equals MAX_RETRIES_PER_PEER (3). The debug log at snap_sync.rs:838 will always print "3 failures". Either use the constant directly or restructure to track only actual failures if the peer can also fail via PeerHandlerError (Point 1 above).

MAX_RETRY_DELAY is unreachable:

const MAX_RETRY_DELAY: Duration = Duration::from_secs(30);
// ...
let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
let delay = delay.min(MAX_RETRY_DELAY);

The exponent is capped at 4, so the maximum multiplier is 2^4 = 16, giving 1s * 16 = 16s. The .min(MAX_RETRY_DELAY) (30s) is never triggered. Either raise the cap to 1 << 5 = 32s (just above 30s so the cap kicks in) or remove MAX_RETRY_DELAY and use the .min(4) directly. This was present before the PR but the refactor is a good opportunity to clean it up.

Backoff fires on rotation resets (minor, likely intentional):

After excluded_peers.clear() and continue, the next outer-loop iteration hits the if total_failures > 0 { sleep(...) } guard and sleeps again before selecting a peer. This means there are two consecutive sleeps when transitioning between rotations. The comment says "Don't count rotation resets as failures" (correct), but the backoff still fires. Probably fine as a throttle, but worth a comment to make the intent clear.


peer_table.rsdo_get_best_peer_excluding

Code duplication: do_get_best_peer and do_get_best_peer_excluding are identical except for the excluded.contains(id) guard. The simpler implementation:

fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> {
    self.do_get_best_peer_excluding(capabilities, &[])
}

This eliminates a divergence risk if the peer-selection logic changes in the future.

excluded.contains(id) is O(n) per peer per lookup. With typical peer counts (dozens) this is not a practical problem, but if excluded_peers is expected to grow (e.g., in denser networks), a HashSet<H256> would be more appropriate. Given H256 is Copy + Eq + Hash, this is a one-line change at the call site.

Unnecessary clone per loop iteration in snap_sync.rs:

.get_best_peer_excluding(SUPPORTED_ETH_CAPABILITIES.to_vec(), excluded_peers.clone())

The Vec<H256> is cloned on every outer-loop iteration because the protocol trait takes ownership. This is likely unavoidable with the actor-message model (which requires owned arguments for serialization), but worth confirming. If the trait could take &[H256] instead, the allocation is eliminated. Minor in practice.


Positive notes

  • Reclassifying NoBlockHeaders and PeerHandler as recoverable directly solves the crash without touching unrelated code. Good surgical scope.
  • The empty-vs-nonempty excluded_peers distinction in the None branch (snap_sync.rs:746-760) is correct and well-commented.
  • MAX_TOTAL_FAILURES: 100 with the new accounting is much more appropriate than 15.
  • Removing the unnecessary .clone() on the Ok(pivot) return is a good cleanup.
  • The typo fix ("Succesfully" → "Successfully") is appreciated.

Automated review by Claude (Anthropic) · sonnet · custom prompt

Comment thread crates/networking/p2p/sync/snap_sync.rs Outdated
Comment on lines 725 to 728
// Exponential backoff based on how many full rotations we've done
if total_failures > 0 {
let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
let delay = delay.min(MAX_RETRY_DELAY);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading backoff comment

The comment says "based on how many full rotations we've done" but the formula total_failures / 10 is not the number of full peer rotations — it's just total failures divided by 10. With 3 failures per peer and, say, 4 peers, a full rotation produces 12 failures, making the "per-rotation" interpretation off by 1.2×. Consider updating the comment to reflect the actual formula.

Suggested change
// Exponential backoff based on how many full rotations we've done
if total_failures > 0 {
let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
let delay = delay.min(MAX_RETRY_DELAY);
// Exponential backoff: delay doubles every 10 total failures, capped at MAX_RETRY_DELAY
if total_failures > 0 {
let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
let delay = delay.min(MAX_RETRY_DELAY);
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/snap_sync.rs
Line: 725-728

Comment:
**Misleading backoff comment**

The comment says "based on how many full rotations we've done" but the formula `total_failures / 10` is not the number of full peer rotations — it's just total failures divided by 10. With 3 failures per peer and, say, 4 peers, a full rotation produces 12 failures, making the "per-rotation" interpretation off by 1.2×. Consider updating the comment to reflect the actual formula.

```suggestion
        // Exponential backoff: delay doubles every 10 total failures, capped at MAX_RETRY_DELAY
        if total_failures > 0 {
            let delay = INITIAL_RETRY_DELAY.saturating_mul(1 << (total_failures / 10).min(4));
            let delay = delay.min(MAX_RETRY_DELAY);
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 87880b0. Replaced total_failures / 10 with explicit rotation_count; backoff now doubles per rotation, comment matches implementation.

Comment on lines 740 to 756
else {
debug!("We tried to get peers during update_pivot, but we found no free peers");
#[cfg(feature = "metrics")]
ethrex_metrics::sync::METRICS_SYNC.inc_pivot_update("no_peers");
consecutive_failures = consecutive_failures.saturating_add(1);
total_failures = total_failures.saturating_add(1);
if excluded_peers.is_empty() {
// Genuinely no peers at all
debug!("update_pivot: no free peers available");
#[cfg(feature = "metrics")]
ethrex_metrics::sync::METRICS_SYNC.inc_pivot_update("no_peers");
total_failures = total_failures.saturating_add(1);
} else {
// Tried all peers, start a new rotation
debug!(
"update_pivot: exhausted all {} peers, clearing exclude list for next rotation",
excluded_peers.len()
);
excluded_peers.clear();
// Don't count rotation resets as failures
}
continue;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Busy peers can trigger premature rotation reset

get_best_peer_excluding returns None for two distinct reasons: (a) every non-excluded peer was tried this rotation, or (b) non-excluded peers exist but are all over the MAX_CONCURRENT_REQUESTS_PER_PEER limit or lack ETH capabilities. The code treats both as "rotation complete" because excluded_peers.is_empty() is false in either case.

In case (b), excluded_peers.clear() re-admits recently-failed peers to the candidate pool before the busy peers become available, potentially looping through the same failed peer(s) again while perfectly healthy-but-momentarily-busy peers wait. The rotation isn't truly exhausted — it just has no current slot.

A tighter guard (e.g., comparing excluded_peers.len() against eligible_count from the diagnostics snapshot already fetched a few lines down) would distinguish the two cases, though this is minor given the 100-failure budget.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/snap_sync.rs
Line: 740-756

Comment:
**Busy peers can trigger premature rotation reset**

`get_best_peer_excluding` returns `None` for two distinct reasons: (a) every non-excluded peer was tried this rotation, or (b) non-excluded peers exist but are all over the `MAX_CONCURRENT_REQUESTS_PER_PEER` limit or lack ETH capabilities. The code treats both as "rotation complete" because `excluded_peers.is_empty()` is false in either case.

In case (b), `excluded_peers.clear()` re-admits recently-failed peers to the candidate pool before the busy peers become available, potentially looping through the same failed peer(s) again while perfectly healthy-but-momentarily-busy peers wait. The rotation isn't truly exhausted — it just has no current slot.

A tighter guard (e.g., comparing `excluded_peers.len()` against `eligible_count` from the diagnostics snapshot already fetched a few lines down) would distinguish the two cases, though this is minor given the 100-failure budget.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 87880b0. When get_best_peer_excluding returns None we now probe get_best_peer (no exclusions) to tell whether (a) no peer is eligible at all (wait), (b) peers exist but none selectable (retry), or (c) all non-excluded peers were tried (rotation complete, clear list and bump rotation_count). Busy peers no longer trigger a premature reset.

Comment on lines +912 to +936
/// Like `do_get_best_peer`, but excludes specific peers from selection.
/// Used by `update_pivot` to rotate through peers on repeated failures.
fn do_get_best_peer_excluding(
&self,
capabilities: &[Capability],
excluded: &[H256],
) -> Option<(H256, PeerConnection)> {
self.peers
.iter()
.filter_map(|(id, peer_data)| {
if excluded.contains(id)
|| !self.can_try_more_requests(&peer_data.score, &peer_data.requests)
|| !capabilities
.iter()
.any(|cap| peer_data.supported_capabilities.contains(cap))
{
None
} else {
let connection = peer_data.connection.clone()?;
Some((*id, peer_data.score, peer_data.requests, connection))
}
})
.max_by_key(|(_, score, reqs, _)| self.weight_peer(score, reqs))
.map(|(k, _, _, v)| (k, v))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicated peer-selection logic

do_get_best_peer_excluding is a strict superset of do_get_best_peer — only one extra excluded.contains(id) condition. Consider delegating to a shared helper to keep the scoring/eligibility logic in one place:

fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> {
    self.do_get_best_peer_excluding(capabilities, &[])
}

This avoids future drift if the eligibility check (e.g., can_try_more_requests) changes in one method but not the other.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 912-936

Comment:
**Duplicated peer-selection logic**

`do_get_best_peer_excluding` is a strict superset of `do_get_best_peer` — only one extra `excluded.contains(id)` condition. Consider delegating to a shared helper to keep the scoring/eligibility logic in one place:

```rust
fn do_get_best_peer(&self, capabilities: &[Capability]) -> Option<(H256, PeerConnection)> {
    self.do_get_best_peer_excluding(capabilities, &[])
}
```

This avoids future drift if the eligibility check (e.g., `can_try_more_requests`) changes in one method but not the other.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 87880b0. do_get_best_peer now delegates to do_get_best_peer_excluding(caps, &[]).

- Narrow PeerHandlerError recoverability: dead actor, storage full, and
  snap errors stay fatal; only transient peer-interaction variants retry
- Catch recoverable errors inside update_pivot inner loop so protocol
  errors advance the rotation like Ok(None) does
- Track rotation_count explicitly instead of total_failures; cap at
  MAX_ROTATIONS=5 so the budget scales with peer count
- Use rotation_count for exponential backoff (was raw failures)
- Distinguish "rotation exhausted" from "all peers at capacity";
  only clear exclude list in the former case
- Deduplicate do_get_best_peer by delegating to do_get_best_peer_excluding
@ElFantasma
Copy link
Copy Markdown
Contributor Author

Review feedback addressed (commit 87880b0)

Thanks for the thorough reviews. All eight findings across Codex, Claude, and Greptile are addressed.

Codex

  1. MAX_TOTAL_FAILURES = 100 too low for large peer sets — Fixed. Replaced the hardcoded failure count with MAX_ROTATIONS = 5. The cap now scales naturally with peer count: one rotation = every eligible peer tried once.

  2. Backoff formula scaled with failures, not rotations — Fixed. Now uses an explicit rotation_count; backoff doubles per rotation (1s, 2s, 4s, 8s, 16s, 30s).

  3. PeerHandlerError wholesale recoverable — Fixed. Added PeerHandlerError::is_recoverable() and made SyncError::is_recoverable() delegate. Only transient peer-interaction variants (BlockHeaders, SendMessageToPeer, EmptyResponseFromPeer, timeouts, etc.) retry. Actor failures (PeerTableError), StorageFull, and Snap(_) stay fatal — a dead peer-table actor now exits the node instead of looping forever.

Claude

  1. PeerHandlerError bypassed peer rotation — Fixed. The inner loop now matches the outcome: Ok(Some) succeeds, Ok(None) and recoverable errors both count as a failure (advance retries, exclude peer on exhaustion), non-recoverable errors propagate. This closes the rotation gap you identified.

  2. consecutive_failures_on_current always printed 3 — Fixed. Variable is now peer_failures, counted separately from the retry loop index. After docs: update milestones. #4 it reflects actual failure attempts (not just the loop cap), though in the error-free case it will still equal MAX_RETRIES_PER_PEER.

Greptile (inline replies above)

  1. Misleading backoff comment — fixed with docs: add milestones #1/chore: create project structure #2.
  2. Busy peers triggering premature rotation reset — fixed by probing get_best_peer (no exclusions) to distinguish "rotation done" from "all at capacity".
  3. Duplicated peer-selection logic — do_get_best_peer now delegates to do_get_best_peer_excluding(caps, &[]).

Not addressed

Claude's note about MAX_RETRY_DELAY being unreachable is now addressed as a side-effect of #2: the new per-rotation backoff (1s << rotation_count) reaches 16s at rotation 4 and hits the 30s cap on rotation 5.

ElFantasma added a commit that referenced this pull request Apr 15, 2026
- Add §1.18 observability tooling (PR #6470)
- Add §1.19 pivot update reliability (PR #6475, issue #6474)
- Add §1.20 big-account within-trie parallelization (issue #6477)
- Add §1.21 small-account batching (issue #6476)
- Add §1.22 decoded TrieLayerCache (PR #6348)
- Add §1.23 bloom filter for non-existent storage (PR #6288)
- Add §1.24 adaptive request sizing + bisection (PR #6181)
- Add §1.25 concurrent bytecode + storage (PR #6205)
- Add §1.26 phase completion markers (PR #6189)
- Add §2.18 StorageTrieTracker refactor (PR #6171)
- Update current-state bottleneck table with small-account and pivot-update findings
- Reprioritize timeline: pivot-update crash fix is now priority 0
- Add two risks (pivot crash masks perf work, DB corruption on every crash)
- Bump doc version to 1.3
@ElFantasma ElFantasma mentioned this pull request Apr 15, 2026
1 task
| SyncError::NoBlocks
| SyncError::NoBlockHeaders => true,
// PeerHandler handled above by delegation
SyncError::PeerHandler(_) => unreachable!(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not call e.is_recoverable() here?

let mut peer_failures: u64 = 0;
for attempt in 0..MAX_RETRIES_PER_PEER {
let outcome = peers
.get_block_header(peer_id, &mut connection, new_pivot_block_number)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_block_header doesn't actually propagate most error types, it returns Ok(None)

(it seems like it could return PeerHandlerError::BlockHeaders, but that is unreachable)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to #6506

@github-project-automation github-project-automation Bot moved this from In Review to In Progress in ethrex_l1 Apr 17, 2026
@github-project-automation github-project-automation Bot moved this from In Progress to In Review in ethrex_l1 Apr 20, 2026
Base automatically changed from feat/snap-sync-observability-pr to main April 20, 2026 17:59
@ilitteri ilitteri enabled auto-merge April 20, 2026 18:13
@ilitteri ilitteri added this pull request to the merge queue Apr 20, 2026
Merged via the queue into main with commit d03925f Apr 20, 2026
69 of 71 checks passed
@ilitteri ilitteri deleted the fix/pivot-update-crash branch April 20, 2026 19:42
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 Apr 20, 2026
avilagaston9 pushed a commit that referenced this pull request Apr 27, 2026
Brings in main commits since the prior merge: #6516 EIP-8025 compliance
(Electra-aligned ExecutionRequests typed container in NewPayloadRequest,
MAX_CONSOLIDATION_REQUESTS_PER_PAYLOAD corrected from 1 to 2,
to_encoded_requests() helper for EIP-7685 bytes, removal of
ExecutionPayloadHeader/NewPayloadRequestHeader, new byte-oriented
execution_program entrypoint that decodes the wire format internally and
returns valid: false instead of erroring on post-decode failures), #6463
BAL withdrawal reverse check (DB->BAL direction so a malicious builder
can't omit a withdrawal recipient from the BAL), #6505 Kademlia k-bucket
revert (PeerTableServer::spawn no longer takes a node_id), plus snap-sync
observability + dashboards (#6470), pivot-update crash fix (#6475),
weighted peer selection (#6428), txpool_contentFrom/txpool_inspect RPC
(#6446), block-by-block exec fallback (#6464), Amsterdam EELS branch pin
(#6495), and rollup store SQLite v9->v10 migration (#6514).

Conflict resolutions:
- crates/common/types/stateless_ssz.rs: this branch had already moved
  the EIP-8025 SSZ types out of crates/common/types/eip8025_ssz.rs into
  stateless_ssz.rs and tucked the native-rollup containers below them.
  Kept that layout, applied #6516's content updates to the EIP-8025
  section (renamed spec-limit constants, ExecutionRequests typed
  container with to_encoded_requests, dropped header types and their
  tests), pulled in the EncodedRequests import, and kept both the new
  test_execution_requests_to_encoded_bytes and the branch's stateless
  round-trip tests.
- crates/guest-program/src/l1/program.rs: adopted #6516's new
  execution_program(bytes: &[u8], crypto) API with the internal
  decode_eip8025 call, the validate_eip8025_execution helper, and the
  decode-failure test. Rewrote all `eip-8025` feature gates as
  `experimental-devnet` and all `eip8025_ssz::` paths as
  `stateless_ssz::` to match this branch's renames.
- crates/guest-program/bin/{sp1,risc0,zisk,openvm}/src/main.rs: applied
  #6516's simplification (drop decode_eip8025 import, pass &input
  straight to execution_program) under the experimental-devnet feature
  gate. Also flipped the rkyv::rancor::Error import gate from the old
  `eip-8025` name to `experimental-devnet` so the non-devnet build still
  has the import it needs.
- crates/prover/src/backend/exec.rs: kept #6516's updated comment ("raw
  input bytes" instead of "(NewPayloadRequest, ExecutionWitness)") under
  the experimental-devnet feature gate.

Auto-merged regions checked: crates/vm/backends/levm/mod.rs picked up
all of #6463's Part B (DB->BAL) reverse check intact, and
cmd/ethrex/l2/initializers.rs picked up #6505's PeerTableServer::spawn
signature change. Verified cargo fmt --all clean, cargo check --workspace
clean, cargo check --workspace --tests clean, and cargo check -p
ethrex-guest-program --features experimental-devnet --tests clean.
ElFantasma added a commit that referenced this pull request Apr 29, 2026
Conflicts in snap/client.rs and sync/snap_sync.rs from main's permit-based
peer-request refactor (#6523) and the snap-sync rotation logic (#6470/#6475).
Adopted main's pattern (connection.outgoing_request + drop(permit), labeled
break flow, peer rotation) and kept the async disk I/O changes.
ElFantasma added a commit that referenced this pull request May 4, 2026
Conflicts in peer_handler.rs and snap_sync.rs from main's permit-based
peer-request refactor (#6523), observability (#6470), and pivot fix (#6475).
Adopted main's permit pattern. Kept the new download_headers_background
function and extended sync_cycle_snap/snap_sync signatures to accept the
diagnostics handle from main alongside the snap_enabled flag from this PR.

Test fixes:
- snap_server_tests.rs: updated state_trie.hash() to take &NativeCrypto
- added #![allow(clippy::unwrap_used)] for lazy_static block (test file)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants