feat(l1): add EIP-8189 snap/2 (BAL-based state healing)#6544

Open
edg-l wants to merge 1 commit into main from snap-v2

@edg-l
Contributor

@edg-l edg-l commented Apr 28, 2026

Motivation

Implements EIP-8189: snap/2, which replaces iterative trie-node fetching during state healing with Block Access List (EIP-7928) replay. This is cleaner, avoids the need for partial-trie reconstruction, and lets the local node serve the same BALs to peers over snap/2 once sync is complete.

Depends on EIP-7928 (BAL header commitment) already present in this repo.

Description

Wire protocol

  • Advertises snap(2) alongside snap(1); per-connection negotiation picks the highest common version.
  • New messages: GetBlockAccessLists (0x08) and BlockAccessLists (0x09).
  • snap/2 connections reject GetTrieNodes/TrieNodes; snap/1 connections serve them unchanged.
  • BlockAccessLists response: Vec<Option<BlockAccessList>> with 0x80 RLP sentinel for absent entries, preserving position correspondence with the request.
  • Compile-time assertions in rlpx/message.rs guard the new message range against BASED_CAPABILITY_OFFSET.
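The negotiation and per-version gating described in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `SnapVersion` and `negotiate` are hypothetical names, and the codes 0x00 to 0x07 are the standard snap/1 message codes, with 0x08 and 0x09 taken from the description.

```rust
/// Snap capability versions, as negotiated per connection.
/// (Hypothetical type; the PR's own enum may differ.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SnapVersion {
    V1 = 1,
    V2 = 2,
}

impl SnapVersion {
    fn from_u8(v: u8) -> Option<Self> {
        match v {
            1 => Some(Self::V1),
            2 => Some(Self::V2),
            _ => None,
        }
    }

    /// Whether a snap message code (relative to the capability offset)
    /// is allowed on a connection of this version.
    pub fn is_valid_code(self, code: u8) -> bool {
        match code {
            // Account, storage and bytecode range messages: both versions.
            0x00..=0x05 => true,
            // GetTrieNodes / TrieNodes: snap/1 only.
            0x06 | 0x07 => self == Self::V1,
            // GetBlockAccessLists / BlockAccessLists: snap/2 only.
            0x08 | 0x09 => self == Self::V2,
            _ => false,
        }
    }
}

/// Picks the highest snap version advertised by both sides, if any.
pub fn negotiate(local: &[u8], remote: &[u8]) -> Option<SnapVersion> {
    local
        .iter()
        .filter(|v| remote.contains(v))
        .filter_map(|v| SnapVersion::from_u8(*v))
        .max()
}
```

Gating at the codec level means a snap/2 connection never needs handler logic for trie-node messages; an out-of-version code is simply rejected on decode.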

BAL persistence

  • New storage table BLOCK_ACCESS_LISTS with store_block_access_list, get_block_access_list, iter_block_access_lists_by_hashes.
  • execute_block returns Option<BlockAccessList>; store_block takes Option<&BlockAccessList>. Pre-Amsterdam blocks pass None.
  • All block-import paths thread the BAL: add_block, add_block_pipeline_inner, add_blocks_in_batch.
  • BAL hash validated once per block in execute_block / execute_block_from_state.

Server / client / replay engine

  • Server process_block_access_lists_request respects a 2 MiB soft cap (BAL_RESPONSE_SOFT_CAP_BYTES); first slot always included; remaining over-cap slots filled with None.
  • Client request_block_access_lists returns (Vec<Option<BlockAccessList>>, peer_id) to keep failure attribution coherent.
  • apply_bal (sync/bal_healing/apply.rs) applies BAL as state diffs: balance/nonce/code/storage post-values; implicit-empty entry triggers account deletion; missing local slots treated as zero pre-value; storage writes go directly to backend.
  • advance_state_via_bals fetches in batches, validates ordering/hash before each apply, applies in strict block order using locally-tracked current_root, persists each BAL. On peer exhaustion falls back to snap/1 heal_state_trie_wrap.
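The soft-cap behaviour described for the server can be sketched roughly as below, following the wording of this description (over-cap slots become `None`, the first slot is always kept). `apply_soft_cap` and `EncodedBal` are hypothetical stand-ins; the real handler works on decoded `BlockAccessList` values and may truncate differently.

```rust
/// 2 MiB soft cap, mirroring the value quoted in the description.
pub const BAL_RESPONSE_SOFT_CAP_BYTES: usize = 2 * 1024 * 1024;

/// Stand-in for an encoded BAL; only its byte length matters here.
pub type EncodedBal = Vec<u8>;

/// Keeps slots in request order until the byte budget is exceeded.
/// Slot 0 is always kept, even if it alone exceeds the cap; once the
/// cap is hit, every remaining slot is reported as `None`.
pub fn apply_soft_cap(slots: Vec<Option<EncodedBal>>, cap: usize) -> Vec<Option<EncodedBal>> {
    let mut used = 0usize;
    let mut capped = false;
    slots
        .into_iter()
        .enumerate()
        .map(|(i, slot)| {
            if capped {
                return None;
            }
            match slot {
                None => None,
                Some(bal) => {
                    if i > 0 && used.saturating_add(bal.len()) > cap {
                        capped = true;
                        None
                    } else {
                        used += bal.len();
                        Some(bal)
                    }
                }
            }
        })
        .collect()
}
```

Because the output always has the same length as the input, the position correspondence with the request is preserved regardless of where the cap falls.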

Sync state machine

The staleness loop in snap_sync.rs branches on snap/2 peer reachability: the V2 path runs advance_state_via_bals, then validates the final state_root; the V1 path is unchanged.

Tests

  • 6 codec round-trip tests for new wire types.
  • 5 server tests for process_block_access_lists_request (empty, all-known, mixed, 2 MiB cap, snap/1 rejection).
  • 7 apply_bal unit tests: account creation/destruction, storage deletion, code deployment, EIP-7702 delegation clear, fresh storage slot, bad-state-root detection.
  • 1 batch-import BAL persistence regression test in test/tests/blockchain/batch_tests.rs.
  • 3 storage tests for the new BAL API.
  • cargo test -p ethrex-p2p: 37 passing (was 27). cargo test -p ethrex-test: 421 passing (was 416).

Notes

  • End-to-end multi-node snap/2 integration tests (8.5/8.6 from the EIP) are deferred. test/tests/p2p/ has no multi-node RLPx harness, and the existing hive devp2p snap simulator calls the external Go devp2p CLI. A dedicated hive snap/2 simulator is needed.
  • Updated STORE_SCHEMA_VERSION in crates/storage/lib.rs for the new BLOCK_ACCESS_LISTS table.

Checklist

  • Updated STORE_SCHEMA_VERSION (crates/storage/lib.rs) if the PR includes breaking changes to the Store requiring a re-sync.

@github-actions github-actions Bot added the L1 Ethereum client label Apr 28, 2026
@github-actions

github-actions Bot commented Apr 28, 2026

Lines of code report

Total lines added: 832
Total lines removed: 0
Total lines changed: 832

Detailed view
+-----------------------------------------------------------+-------+------+
| File                                                      | Lines | Diff |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs              | 641   | +36  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/connection/codec.rs     | 249   | +7   |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/connection/handshake.rs | 508   | +13  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/connection/server.rs    | 1593  | +77  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/message.rs              | 504   | +45  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/p2p.rs                  | 321   | +1   |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/snap/codec.rs           | 366   | +82  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/snap/messages.rs        | 75    | +11  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/rlpx/snap/mod.rs             | 8     | +1   |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/constants.rs            | 28    | +4   |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync.rs                      | 380   | +19  |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/bal_healing/apply.rs    | 111   | +111 |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/bal_healing/mod.rs      | 314   | +314 |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs            | 1182  | +101 |
+-----------------------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs                            | 2733  | +10  |
+-----------------------------------------------------------+-------+------+

@github-actions

github-actions Bot commented May 28, 2026

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. This file is the source of truth for the Known Issues section that the L1 workflow appends to each ef-tests job summary and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened test-stateless to extract the entire for_amsterdam/ tree from the zkevm bundle and run all of it under --features stateless; combined with this branch's bal-devnet-7 semantics, that scope produces ~549 GasUsedMismatch / ReceiptsRootMismatch / BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

@edg-l edg-l changed the title from "feat(l1): EIP-8189 snap/2 (BAL-based state healing)" to "feat(l1): add EIP-8189 snap/2 (BAL-based state healing)" May 28, 2026
@edg-l edg-l force-pushed the snap-v2 branch 2 times, most recently from 601fd88 to 9304c5c Compare May 28, 2026 14:08
@edg-l edg-l moved this to In Progress in ethrex_l1 May 28, 2026
@edg-l edg-l force-pushed the snap-v2 branch 2 times, most recently from 6a01e0c to 5f60e0b Compare May 28, 2026 14:43
@edg-l edg-l marked this pull request as ready for review May 28, 2026 14:44
@edg-l edg-l requested a review from a team as a code owner May 28, 2026 14:44
@ethrex-project-sync ethrex-project-sync Bot moved this from In Progress to In Review in ethrex_l1 May 28, 2026
@github-actions

🤖 Kimi Code Review

This PR implements snap/2 (EIP-8189) Block Access List (BAL) based state healing, replacing iterative GetTrieNodes round-trips with batched BAL downloads for post-Amsterdam blocks. The implementation is comprehensive and generally well-structured.

Critical Observations

1. Storage Trie Cache Interaction (apply_bal)
File: crates/networking/p2p/sync/bal_healing/apply.rs (lines 87-103)

The comment explains that open_storage_trie receives parent_state_root for the cache layer, but writes bypass the cache via write_batch. Ensure this assumption holds if TrieLayerCache is later modified to prefetch or shadow entries. Currently safe because:

  • BAL replay runs during exclusive sync phases (no concurrent state mutations)
  • write_batch writes directly to the backend (STORAGE_TRIE_NODES table)

2. Async Blocking Risk (load_headers_range)
File: crates/networking/p2p/sync/bal_healing/mod.rs (lines 395-420)

The function mixes async (get_canonical_block_hash) with sync (get_block_header_by_hash) storage calls. While get_block_header_by_hash is typically fast (cached), heavy disk I/O here could block the async executor. Consider using spawn_blocking for the synchronous header fetches if profiling reveals latency issues.

3. Header Lookup N+1 Query
File: crates/networking/p2p/rlpx/connection/server.rs (lines 1630-1635)

build_snap2_bal_response performs sequential get_block_header_by_hash calls within the loop. For 64 hashes, this is 64 storage lookups. Acceptable for MVP, but consider adding a get_block_headers_by_hashes batch API to the storage layer for future optimization.

Security & Correctness

4. RLP None Sentinel
File: crates/networking/p2p/rlpx/snap/codec.rs (lines 360-385)

The Snap2OptionalBal wrapper correctly encodes None as 0x80 (RLP empty string) per EIP-8189 §50/§58, distinguishing it from eth/71's 0xc0 (empty list). The test snap2_bal_none_payload_omits_0xc0_sentinel locks this behavior—critical for cross-client compatibility.
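To make the sentinel concrete, here is a small sketch of how a list of optional BALs could be framed in RLP, with `None` written as the empty string 0x80 rather than the empty list 0xc0. It assumes each present BAL is already RLP-encoded; `encode_bal_slots` and `rlp_list` are hypothetical helpers, and the sketch covers only the slot list, not the full message.

```rust
/// RLP empty string: the snap/2 marker for an absent BAL.
const RLP_EMPTY_STRING: u8 = 0x80;

/// Wraps an already-encoded payload in an RLP list header.
fn rlp_list(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 9);
    if payload.len() < 56 {
        // Short list: single header byte 0xc0 + length.
        out.push(0xc0 + payload.len() as u8);
    } else {
        // Long list: 0xf7 + length-of-length, then the big-endian length.
        let len_bytes = payload.len().to_be_bytes();
        let first = len_bytes
            .iter()
            .position(|b| *b != 0)
            .unwrap_or(len_bytes.len() - 1);
        let len_bytes = &len_bytes[first..];
        out.push(0xf7 + len_bytes.len() as u8);
        out.extend_from_slice(len_bytes);
    }
    out.extend_from_slice(payload);
    out
}

/// Encodes response slots in request order. Each `Some` holds a BAL that
/// is already RLP-encoded; each `None` becomes a single 0x80 byte.
pub fn encode_bal_slots(slots: &[Option<Vec<u8>>]) -> Vec<u8> {
    let mut payload = Vec::new();
    for slot in slots {
        match slot {
            Some(encoded) => payload.extend_from_slice(encoded),
            None => payload.push(RLP_EMPTY_STRING),
        }
    }
    rlp_list(&payload)
}
```

With this framing an absent BAL (0x80) stays distinguishable from a present but empty one (0xc0), which is the property the test named above is meant to lock in.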

5. Version Gating
File: crates/networking/p2p/rlpx/message.rs (lines 310-330)

The SnapCapVersion::is_valid_code check prevents snap/1 peers from receiving snap/2 messages (and vice versa) at the codec level. The compile-time assertions (lines 41-44) ensuring message codes don't bleed into the based capability range are excellent defensive programming.

6. State Root Validation
File: crates/networking/p2p/sync/bal_healing/apply.rs (lines 170-175)

Per-block state root verification against header.state_root is correctly implemented. Any mismatch returns SyncError::StateRootMismatch, which is classified as recoverable (Item 4 in SyncError::is_recoverable), allowing peer rotation on bad BAL data.

Minor Issues

7. Typo in Constant Name
File: crates/networking/p2p/rlpx/connection/server.rs (line 28)

use crate::snap::constants::BAL_RESPONSE_SOFT_CAP_BYTES;

The constant is correctly spelled in constants.rs, but verify the import path is consistent (appears correct in diff).

8. Unused Parameter
File: crates/networking/p2p/sync/bal_healing/mod.rs (line 436)

_remaining_headers: &[BlockHeader],

The fallback function ignores the remaining headers (healing works from state root diffs). Consider removing the parameter or adding a comment explaining why it's intentionally unused.

9. Documentation Reference
File: crates/networking/p2p/peer_handler.rs (line 623)

/// B2: uses `Message::Snap2GetBlockAccessLists`...

The "B2" reference appears to be an internal ticket code. Consider removing or replacing with a public issue reference before merge.

Testing

The test coverage is thorough:

  • Codec: Round-trip encoding/decoding with snappy compression
  • Version gating: snap/1 vs snap/2 message code rejection
  • Server logic: Response truncation, pre-Amsterdam handling, unknown hashes
  • BAL application: Account creation/destruction, storage diffs, code deployment, delegation clearing
  • Integration: Duplex pipe E2E test simulating full request/response cycle

Recommendations

  1. Merge: The PR is ready for merge after addressing the minor documentation/cleanup items above.
  2. Monitoring: Add metrics for fallback_to_snap1_healing triggers to detect BAL validation failures in production.
  3. Future: Consider parallelizing header fetches in build_snap2_bal_response if CPU profiling shows it as a bottleneck.

The implementation correctly handles the EIP-8189 edge cases (orphaned blocks, pre-Amsterdam headers, byte budget truncation) and maintains safe fallback paths to snap/1 healing.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Codex Code Review

Findings

  1. A malformed snap/2 response can wedge BAL healing in an infinite retry loop. request_snap2_bals treats any Snap2BlockAccessLists with the right id as success, even if bals is empty or otherwise makes no progress (peer_handler.rs). In advance_state_via_bals, the loop only increments retries for explicit None slots; a zero-length response hits the zip, updates nothing, and the outer while batch_filled.iter().any(...) never terminates (bal_healing/mod.rs, bal_healing/mod.rs). This is a peer-triggerable DoS. Reject non-progress responses up front, or count all unanswered requested hashes as failures before looping again.

  2. The server-side response_bytes cap is applied too late to protect resources. build_snap2_bal_response first calls iter_block_access_lists_by_hashes(&req.block_hashes) for the full request, and that helper is just a per-hash loop over the DB (store.rs). A malicious peer can send a very large Snap2GetBlockAccessLists frame with a tiny response_bytes, and still force DB lookups and allocation for every hash before the later truncation logic runs. Given the 16 MiB frame limit, this is a meaningful CPU/memory DoS vector. The handler should stream hashes and stop once the budget is reached, and ideally hard-cap the number of requested hashes before any DB work.
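Both findings come down to validating sizes before doing work. The sketch below shows one possible shape for those checks; it is not the PR's code. `MAX_BAL_HASHES_PER_REQUEST`, `check_request_len`, `check_progress` and `BalResponseError` are hypothetical, and the value 64 is only borrowed from the batch size mentioned in the other reviews.

```rust
/// Hypothetical hard cap on hashes per request (assumption, not from the PR).
pub const MAX_BAL_HASHES_PER_REQUEST: usize = 64;

#[derive(Debug, PartialEq, Eq)]
pub enum BalResponseError {
    TooManyHashes { requested: usize, max: usize },
    TooLong { requested: usize, received: usize },
    NoProgress,
}

/// Server side: refuse oversized requests before any database lookup.
pub fn check_request_len(requested: usize) -> Result<(), BalResponseError> {
    if requested > MAX_BAL_HASHES_PER_REQUEST {
        return Err(BalResponseError::TooManyHashes {
            requested,
            max: MAX_BAL_HASHES_PER_REQUEST,
        });
    }
    Ok(())
}

/// Client side: returns how many requested hashes are still unanswered,
/// so the caller can count them as failures. A response that answers
/// nothing is an error, which breaks the retry loop described above.
pub fn check_progress<T>(requested: usize, slots: &[Option<T>]) -> Result<usize, BalResponseError> {
    if slots.len() > requested {
        return Err(BalResponseError::TooLong {
            requested,
            received: slots.len(),
        });
    }
    let answered = slots.iter().filter(|s| s.is_some()).count();
    if answered == 0 {
        return Err(BalResponseError::NoProgress);
    }
    Ok(requested - answered)
}
```

Treating a zero-answer response as an error lets the caller penalise or rotate the peer instead of looping on it.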

I couldn’t run the new tests locally because cargo test failed in this environment when rustup tried to create temp files on a read-only path.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@greptile-apps

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR implements EIP-8189 (snap/2), replacing iterative trie-node fetching during state healing with Block Access List (BAL) replay. It adds the full wire protocol (new GetBlockAccessLists/BlockAccessLists messages at codes 0x08/0x09, SnapCapVersion negotiation, compile-time range guards), BAL persistence in storage, and the advance_state_via_bals orchestration that fetches, validates, and applies BALs block-by-block from pivot to chain head.

  • Wire layer (rlpx/message.rs, snap/codec.rs, connection/server.rs): SnapCapVersion is negotiated after the Hello handshake; is_valid_code gates snap/1-only and snap/2-only message codes at decode time; compile-time asserts prevent the new codes from bleeding into the based capability range; SNAP1_ONLY_CAPABILITIES ensures trie-node healing never selects snap/2 peers.
  • BAL apply engine (sync/bal_healing/apply.rs, mod.rs): apply_bal writes balance/nonce/code/storage diffs directly to the backend trie, validates the per-block state root, and persists the BAL so the node can serve it onward; advance_state_via_bals drives batched fetching with per-block retry and snap/1 fallback.
  • Snap-sync integration (sync/snap_sync.rs): The staleness healing loop branches on snap/2 peer availability and post-Amsterdam pivot; on successful BAL replay healing_done = true is set, skipping snap/1 trie and storage healing.

Confidence Score: 3/5

The wire-protocol and BAL-apply layers are well-structured and well-tested, but the snap-sync integration has a gap that could leave a node with an incomplete trie after sync completes.

When BAL replay produces the correct chain-head state root, the healing loop exits immediately with healing_done = true without calling heal_storage_trie. The storage_accounts map populated during the download phase is silently discarded, potentially leaving incomplete storage tries in the store for accounts not touched by any replayed BAL. Because trie roots are hash-based, the final state-root check passes even when some trie content is absent.

crates/networking/p2p/sync/snap_sync.rs (storage healing skip on BAL replay success) and crates/networking/p2p/sync/bal_healing/mod.rs (throwaway CodeHashCollector in fallback path)

Important Files Changed

Filename: Overview

  • crates/networking/p2p/sync/snap_sync.rs: Staleness loop now branches on snap/2 availability. On successful BAL replay healing_done is set to true without running heal_storage_trie, leaving storage_accounts unprocessed and potentially incomplete storage tries in the store.
  • crates/networking/p2p/sync/bal_healing/mod.rs: New BAL-replay orchestration: header loading, batch fetching, per-block apply, retry and fallback logic. Fallback creates a throwaway CodeHashCollector; inner-batch ordering guard is correct but can cause redundant re-fetches when responses are out-of-order.
  • crates/networking/p2p/sync/bal_healing/apply.rs: New apply_bal: correctly applies balance/nonce/code/storage diffs, validates state root, writes trie nodes. Empty-BAL early return skips state-root validation (protected by caller but not self-enforcing).
  • crates/networking/p2p/rlpx/connection/server.rs: Adds SnapCapVersion negotiation after Hello handshake, build_snap2_bal_response with 2 MiB cap, defense-in-depth snap/2 check. Byte-cap truncation correctly preserves first entry and uses break rather than None-filling (per spec).
  • crates/networking/p2p/rlpx/message.rs: Adds SnapCapVersion enum with is_valid_code gate; compile-time assertions guard the snap/2 message range against BASED_CAPABILITY_OFFSET. Clean, correct dispatch.
  • crates/networking/p2p/peer_handler.rs: Adds request_snap2_bals which selects a snap/2 peer, sends GetBlockAccessLists, and attributes failures to the correct peer_id.
  • crates/storage/store.rs: Adds iter_block_access_lists_by_hashes as a thin ordered wrapper over get_block_access_list; always returns exactly hashes.len() entries. Correct and simple.
  • test/tests/p2p/bal_healing_tests.rs: Thorough unit tests for apply_bal (creation, destruction, storage, code, delegation clear, bad-root) and try_apply_bal_block (happy path, bad parent, bad hash, bad root, chain of three). Good coverage.

Sequence Diagram

sequenceDiagram
    participant SS as snap_sync
    participant AV as advance_state_via_bals
    participant PH as PeerHandler (snap/2)
    participant AP as apply_bal
    participant ST as Store

    SS->>SS: should_use_bal_replay?
    SS->>AV: advance_state_via_bals(pivot-head)
    loop per batch of 64 blocks
        AV->>PH: request_snap2_bals(batch_hashes)
        PH-->>AV: "Vec<Option<BAL>>, peer_id"
        loop per slot in batch (strict order)
            AV->>AP: apply_bal(store, parent_root, bal, header)
            AP->>ST: open_state_trie(parent_root)
            AP->>ST: write_batch(ACCOUNT_TRIE_NODES)
            AP->>ST: write_batch(STORAGE_TRIE_NODES)
            AP-->>AV: new_state_root
            AV->>ST: store_block_access_list(hash, bal)
        end
    end
    AV-->>SS: final_root
    SS->>SS: "final_root == head.state_root?"
    alt match
        SS->>SS: "healing_done = true"
    else mismatch
        SS->>SS: heal_state_trie_wrap + heal_storage_trie (snap/1 fallback)
    end
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
crates/networking/p2p/sync/snap_sync.rs:540-542
**Storage trie healing skipped on successful BAL replay**

When BAL replay succeeds and `healing_done = true` is set, `heal_storage_trie` is never called. The `storage_accounts` map accumulated during the snap download phase (tracking accounts with incomplete storage tries) is silently discarded. BAL replay only writes storage trie nodes for accounts that appear in one of the replayed BALs; accounts untouched by any BAL in the pivot-to-head range retain whatever (potentially incomplete) storage nodes were downloaded during the snap download phase. Because trie roots are computed from hashes rather than full content, the final state-root check can pass even when underlying storage trie content is missing — leaving the synced node unable to serve those storage slots correctly. The snap/1 path always calls `heal_storage_trie(&storage_accounts, …)` before setting `healing_done = true`; the same step should be required here.

### Issue 2 of 3
crates/networking/p2p/sync/bal_healing/apply.rs:52-55
**Empty BAL early return bypasses per-block state-root validation**

When `bal.is_empty()` the function returns `parent_state_root` without checking it against `block_header.state_root`. In the production call-chain this is protected by the BAL hash check in `try_apply_bal_block`, but `apply_bal` is a public API and callers that bypass `try_apply_bal_block` (e.g. future direct callers or tests with a mismatched header) would silently receive the wrong root. A simple guard — `if bal.is_empty() && parent_state_root != block_header.state_root { return Err(SyncError::StateRootMismatch(…)) }` — would make the contract self-enforcing.

### Issue 3 of 3
crates/networking/p2p/sync/bal_healing/mod.rs:363-380
**Fallback `CodeHashCollector` writes to a throwaway temp directory**

`fallback_to_snap1_healing` creates a fresh `CodeHashCollector` rooted at `std::env::temp_dir().join("ethrex_bal_fallback_code_hashes")` rather than reusing the main sync's collector. Any code hashes discovered during snap/1 fallback healing are accumulated in this separate directory and never merged back into the main `code_hash_collector`. On the next sync iteration the main collector has no record of them, so they may be fetched again or silently missed. Additionally, the temp path is not cleaned up after use, which can accumulate stale files across sync attempts. The main collector (or at least its path) should be threaded into this function.

Reviews (1): Last reviewed commit: "feat(l1): EIP-8189 snap/2 BAL-based stat..."

Comment on lines +540 to +542
if new_root == final_header.state_root {
// BAL replay succeeded — skip snap/1 trie healing and storage healing.
healing_done = true;

P1 Storage trie healing skipped on successful BAL replay

When BAL replay succeeds and healing_done = true is set, heal_storage_trie is never called. The storage_accounts map accumulated during the snap download phase (tracking accounts with incomplete storage tries) is silently discarded. BAL replay only writes storage trie nodes for accounts that appear in one of the replayed BALs; accounts untouched by any BAL in the pivot-to-head range retain whatever (potentially incomplete) storage nodes were downloaded during the snap download phase. Because trie roots are computed from hashes rather than full content, the final state-root check can pass even when underlying storage trie content is missing — leaving the synced node unable to serve those storage slots correctly. The snap/1 path always calls heal_storage_trie(&storage_accounts, …) before setting healing_done = true; the same step should be required here.


Comment on lines +52 to +55
// Empty BAL: state root unchanged.
if bal.is_empty() {
return Ok(parent_state_root);
}

P2 Empty BAL early return bypasses per-block state-root validation

When bal.is_empty() the function returns parent_state_root without checking it against block_header.state_root. In the production call-chain this is protected by the BAL hash check in try_apply_bal_block, but apply_bal is a public API and callers that bypass try_apply_bal_block (e.g. future direct callers or tests with a mismatched header) would silently receive the wrong root. A simple guard — if bal.is_empty() && parent_state_root != block_header.state_root { return Err(SyncError::StateRootMismatch(…)) } — would make the contract self-enforcing.
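As a self-contained illustration of that guard, the sketch below isolates the empty-BAL branch. `H256`, `BlockHeader` and the fields of `StateRootMismatch` are stand-ins chosen for the example (the comment elides the real variant's payload), and `root_for_empty_bal` is a hypothetical helper rather than a function from the PR.

```rust
/// Stand-in for a 32-byte hash type (hypothetical; the PR uses its own H256).
pub type H256 = [u8; 32];

/// Minimal header stand-in carrying only the field the guard needs.
pub struct BlockHeader {
    pub state_root: H256,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SyncError {
    /// Field names are assumptions made for this sketch.
    StateRootMismatch { expected: H256, computed: H256 },
}

/// Empty-BAL branch: the state is unchanged, so the parent root must
/// already equal the root committed to in the block header.
pub fn root_for_empty_bal(parent_state_root: H256, header: &BlockHeader) -> Result<H256, SyncError> {
    if parent_state_root != header.state_root {
        return Err(SyncError::StateRootMismatch {
            expected: header.state_root,
            computed: parent_state_root,
        });
    }
    Ok(parent_state_root)
}
```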


Comment on lines +363 to +380
let mut dummy_leafs: u64 = 0;
let mut dummy_storage = AccountStorageRoots::default();

// Use a temporary directory in the system temp path for the fallback code-hash collector.
let tmp_dir = std::env::temp_dir().join("ethrex_bal_fallback_code_hashes");
let mut dummy_collector = crate::sync::code_collector::CodeHashCollector::new(tmp_dir);

heal_state_trie_wrap(
state_root,
store.clone(),
peers,
staleness_timestamp,
&mut dummy_leafs,
&mut dummy_storage,
&mut dummy_collector,
)
.await?;
Ok(())

P2 Fallback CodeHashCollector writes to a throwaway temp directory

fallback_to_snap1_healing creates a fresh CodeHashCollector rooted at std::env::temp_dir().join("ethrex_bal_fallback_code_hashes") rather than reusing the main sync's collector. Any code hashes discovered during snap/1 fallback healing are accumulated in this separate directory and never merged back into the main code_hash_collector. On the next sync iteration the main collector has no record of them, so they may be fetched again or silently missed. Additionally, the temp path is not cleaned up after use, which can accumulate stale files across sync attempts. The main collector (or at least its path) should be threaded into this function.


@github-actions

🤖 Claude Code Review

Now I have a thorough understanding of the PR. Let me write the review.


Review: feat(l1): add EIP-8189 snap/2 (BAL-based state healing)

Overall this is a well-structured implementation. The wire protocol, codec, capability negotiation, and test coverage are solid. A few correctness and design issues are worth addressing before merge.


1. Inner fallback_to_snap1_healing heals the wrong root

File: crates/networking/p2p/sync/bal_healing/mod.rs (fallback_to_snap1_healing and its call sites)

When BAL replay stalls (no peer, exhausted retries), the code calls:

fallback_to_snap1_healing(store, peers, current_root, ...).await?;
return Ok(current_root);

fallback_to_snap1_healing invokes heal_state_trie_wrap(current_root, ...). This tries to heal missing snap trie nodes for an intermediate root that exists only because BAL replay wrote it — peers cannot serve snap/1 trie nodes for arbitrary mid-chain roots they've never advertised. At best this is a no-op (nodes already present); at worst it hangs or fails silently.

Back in snap_sync.rs, the returned current_root won't equal final_header.state_root, so the snap_sync fallback path also runs heal_state_trie_wrap(pivot_header.state_root, ...) — the correct root. The inner healing is therefore wasted work with a risk of peer confusion.

Suggested fix: Remove fallback_to_snap1_healing entirely from this module. Return a dedicated sentinel or Err so that snap_sync.rs is the sole fallback-to-snap/1 decision point, always healing from pivot_header.state_root.
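One way to picture the suggested shape: the BAL module reports an outcome, and the caller alone decides whether to fall back. The sketch is hypothetical throughout (`BalReplayOutcome`, `HealingPlan` and `plan_after_replay` do not appear in the PR), and it covers only the state-trie fallback decision discussed in this item.

```rust
/// Stand-in for a 32-byte hash type.
pub type H256 = [u8; 32];

/// What BAL replay reports back instead of healing on its own.
#[derive(Debug, PartialEq, Eq)]
pub enum BalReplayOutcome {
    /// Every block in the range was applied.
    Completed { final_root: H256 },
    /// Replay stopped early (no peer, retries exhausted).
    Stalled { reached_root: H256, blocks_applied: usize },
}

/// The decision taken by the single fallback point in the sync loop.
#[derive(Debug, PartialEq, Eq)]
pub enum HealingPlan {
    /// Replay reached the expected head root; no snap/1 state-trie fallback.
    NoFallbackNeeded,
    /// Run snap/1 healing from this root (always the pivot root).
    Snap1HealFrom(H256),
}

/// Falls back to the pivot root unless replay completed with the
/// expected head state root. Intermediate roots are never healed.
pub fn plan_after_replay(outcome: &BalReplayOutcome, head_root: H256, pivot_root: H256) -> HealingPlan {
    match outcome {
        BalReplayOutcome::Completed { final_root } if *final_root == head_root => {
            HealingPlan::NoFallbackNeeded
        }
        _ => HealingPlan::Snap1HealFrom(pivot_root),
    }
}
```

Keeping the intermediate root out of `HealingPlan` makes it impossible for a caller to request snap/1 healing for a root that peers never advertised.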


2. _remaining_headers is accepted but silently ignored in the fallback

File: crates/networking/p2p/sync/bal_healing/mod.rs:1399

async fn fallback_to_snap1_healing(
    ...
    _remaining_headers: &[BlockHeader],   // ← underscore = unused
    staleness_timestamp: u64,
) -> Result<(), SyncError> {

The remaining headers are passed to indicate how much is left to heal, but nothing is done with them. This is a dead parameter that will confuse future readers. Either use it (e.g., to log progress) or remove it.


3. store_code_sync bypasses the normal code-storage API

File: crates/networking/p2p/sync/bal_healing/apply.rs:1024-1035

fn store_code_sync(store: &Store, code: Code) -> Result<(), SyncError> {
    store.write(ACCOUNT_CODES, hash_key_bytes.clone(), buf)?;
    store.write(ACCOUNT_CODE_METADATA, hash_key_bytes, metadata)?;
    Ok(())
}

This writes bytecode || jump_targets directly to ACCOUNT_CODES and a big-endian length to ACCOUNT_CODE_METADATA, bypassing Store::store_account_code. If that API ever changes its on-disk format (e.g., to add a version prefix or change jump-target serialization), store_code_sync will silently write in the old format and get_account_code will return garbled code — a silent state-corruption bug.

The test apply_bal_code_deployment does validate the round-trip today, but the coupling is invisible. Either call store.store_account_code(code) (if it can be made synchronous / infallible), or add a comment tying this format to the get_account_code decoder and add a unit test that cross-reads between the two paths.


4. Dual EMPTY_TRIE_HASH constants in the destruction check

File: crates/networking/p2p/sync/bal_healing/apply.rs:870,881,990-991

use ethrex_common::constants::{EMPTY_KECCACK_HASH, EMPTY_TRIE_HASH};
use ethrex_trie::EMPTY_TRIE_HASH as TRIE_EMPTY;

let is_destroyed = ...
    && (account_state.storage_root == *EMPTY_TRIE_HASH
        || account_state.storage_root == *TRIE_EMPTY);

Two separate constants imported from different crates are ORed together. If they hold different values the OR masks a real inconsistency in the codebase; if they're identical the second arm is dead code. Either way the duplication is a red flag. This should be resolved at the crate boundary (make one re-export the other, or assert equality at compile time), not worked around with an OR.
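
One way the "assert equality at compile time" option could look, as a sketch only. It assumes both values are available as plain `const` byte arrays; the two constant names below are stand-ins. In the snippet above the constants are dereferenced (`*EMPTY_TRIE_HASH`), which suggests lazily initialized statics, and in that case an ordinary unit test comparing the two values would be the closest equivalent.

```rust
/// Byte-wise comparison usable in const context (`==` on arrays is not const).
const fn bytes_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut i = 0;
    while i < 32 {
        if a[i] != b[i] {
            return false;
        }
        i += 1;
    }
    true
}

/// Stand-in for the constant exported by the common crate:
/// keccak256 of the RLP empty string, i.e. the empty trie root.
pub const COMMON_EMPTY_TRIE_HASH: [u8; 32] = [
    0x56, 0xe8, 0x1f, 0x17, 0x1b, 0xcc, 0x55, 0xa6, 0xff, 0x83, 0x45, 0xe6, 0x92, 0xc0, 0xf8,
    0x6e, 0x5b, 0x48, 0xe0, 0x1b, 0x99, 0x6c, 0xad, 0xc0, 0x01, 0x62, 0x2f, 0xb5, 0xe3, 0x63,
    0xb4, 0x21,
];

/// Stand-in for the trie crate's copy; here it simply re-exports the value.
pub const TRIE_EMPTY_TRIE_HASH: [u8; 32] = COMMON_EMPTY_TRIE_HASH;

// Fails the build if the two constants ever diverge.
const _: () = assert!(bytes_eq(&COMMON_EMPTY_TRIE_HASH, &TRIE_EMPTY_TRIE_HASH));
```

Once such a guard exists, the destruction check can compare against a single constant.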


5. load_headers_range emits a useless H256::zero() in the error

File: crates/networking/p2p/sync/bal_healing/mod.rs:1381

.get_canonical_block_hash(number)
.await?
.ok_or(SyncError::MissingHeaderForBal(H256::zero()))?;

When a canonical hash is absent at a given block number, the error carries H256::zero() — not the missing block's hash (which isn't known yet) and not even the block number. The error is opaque; under production conditions there's no way to know which height triggered the failure. Consider adding a MissingCanonicalHash(u64) variant, or at minimum logging the block number before returning the error.
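
A sketch of what a height-carrying variant might look like. The `SyncError` below is a cut-down stand-in with only the proposed variant, and `canonical_hash_at` replaces the async store lookup with a slice so the example stays self-contained; neither is the PR's actual code.

```rust
use std::fmt;

/// Cut-down stand-in for the real error enum, showing only the proposed variant.
#[derive(Debug, PartialEq, Eq)]
pub enum SyncError {
    /// No canonical hash is recorded at this block number.
    MissingCanonicalHash(u64),
}

impl fmt::Display for SyncError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SyncError::MissingCanonicalHash(number) => {
                write!(f, "no canonical block hash at height {number}")
            }
        }
    }
}

impl std::error::Error for SyncError {}

/// Stand-in for the store lookup: `canonical[n]` is the hash at height `n`, if known.
pub fn canonical_hash_at(
    canonical: &[Option<[u8; 32]>],
    number: u64,
) -> Result<[u8; 32], SyncError> {
    canonical
        .get(number as usize)
        .copied()
        .flatten()
        // The error now names the height that was missing.
        .ok_or(SyncError::MissingCanonicalHash(number))
}
```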


6. Hardcoded temp directory risks collision between concurrent runs

File: crates/networking/p2p/sync/bal_healing/mod.rs:1407-1408

let tmp_dir = std::env::temp_dir().join("ethrex_bal_fallback_code_hashes");

Multiple sync instances (e.g., two nodes on the same machine, integration tests) would clobber the same directory. Use tempfile::TempDir or append a PID/UUID suffix.
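
A standard-library-only sketch of the PID-suffix option. The function name and the in-process counter are assumptions added for illustration; the counter covers the case of several sync instances inside one process, such as parallel tests.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

/// Distinguishes directories created by the same process (hypothetical helper state).
static NEXT_DIR_ID: AtomicU64 = AtomicU64::new(0);

/// Builds a per-process, per-call path under the system temp directory.
/// The directory itself is not created here.
pub fn unique_fallback_dir() -> PathBuf {
    let seq = NEXT_DIR_ID.fetch_add(1, Ordering::Relaxed);
    std::env::temp_dir().join(format!(
        "ethrex_bal_fallback_code_hashes_{}_{}",
        std::process::id(),
        seq
    ))
}
```

This avoids collisions between live processes but leaves cleanup to the caller; `tempfile::TempDir` would additionally remove the directory on drop.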


7. Deferred-BAL batching discards received data unnecessarily

File: crates/networking/p2p/sync/bal_healing/mod.rs:1259-1262

let all_prior_filled = (0..batch_idx).all(|k| batch_filled[k]);
if !all_prior_filled {
    continue;
}

When a BAL for block N arrives before block N-1 is applied, it is silently discarded. The next outer iteration re-requests block N from a peer. This is correct but means one extra RTT per out-of-order response. A small HashMap<usize, BlockAccessList> cache for "received but deferred" BALs within the batch would make the retry loop more efficient. This is a perf nit, not a bug.


8. Validation order: validate_ordering before compute_hash

File: crates/networking/p2p/sync/bal_healing/mod.rs:1103-1109

bal.validate_ordering().map_err(ApplyBalError::BadOrdering)?;

let expected_bal_hash = header.block_access_list_hash.unwrap_or(*EMPTY_BLOCK_ACCESS_LIST_HASH);
let actual_bal_hash = bal.compute_hash();
if actual_bal_hash != expected_bal_hash { ... }

If compute_hash() hashes the RLP of entries in their provided order, ordering validation is implied by the hash check — no correctly-hashed BAL can have wrong ordering without a hash collision. If compute_hash() sorts before hashing, the ordering check catches a BAL that was sorted for hashing but not for application (a practical attack vector). This distinction is not obvious from reading this code. A brief comment explaining which scenario validate_ordering defends against would help future reviewers.


Minor / Positive notes

  • The Snap2OptionalBal sentinel comment explicitly distinguishing 0x80 (RLP empty string) from eth/71's 0xc0 (empty list) is excellent — lock it with the codec test as intended.
  • Compile-time assertions in message.rs guarding snap message codes against the BASED_CAPABILITY_OFFSET range are good practice.
  • The SNAP1_ONLY_CAPABILITIES rename in trie-node healing call sites correctly prevents snap/2 peers from being selected for GetTrieNodes — this is a non-obvious but critical correctness fix.
  • is_valid_code cleanly enforces the snap/2 message-code exclusions at codec level (defense-in-depth on top of the server-level cap check).
  • Test coverage is substantial and the codec round-trip tests correctly pin the 0x80 sentinel.

Automated review by Claude (Anthropic) · sonnet · custom prompt

Implements snap/2 alongside snap/1 with version-gated codec dispatch.

Wire: SnapCapVersion enum, Snap2GetBlockAccessLists (0x08) /
Snap2BlockAccessLists (0x09) with None encoded as RLP 0x80 per
EIP-8189 §50,§58 (distinct from eth/71's 0xc0). Trie-node healing
restricted to snap/1 peers via SNAP1_ONLY_CAPABILITIES.

Server: honors §50 always-respond, §51 tail-truncation, §52
orphaned-block serving, §60 min(response_bytes, 2 MiB) cap, §100
None for pre-Amsterdam headers.

Client + applier: PeerHandler::request_snap2_bals with snap/2-only
peer filter; apply_bal validates per §68
(keccak256(rlp.encode(bal)) vs header.block_access_list_hash) and
bal.validate_ordering() before applying diffs.

Integration: when snap/2 peer connected and pivot is post-Amsterdam,
advance_state_via_bals replaces snap/1 trie-node healing at the
post-bulk-download site. Falls back to snap/1 on missing peer,
hash failure, or chain reorg.

Errors: StateRootMismatch (recoverable), MissingHeaderForBal and
ChainReorgDetected (non-recoverable).

Tests: 12 codec/version-gate, 7 server, 8 applier, 4 driver,
2 storage helper. E2E two-node BAL replay deferred (no harness).
Contributor

@ElFantasma ElFantasma left a comment

It LGTM. Left several comments but none is too important. Feel free to decide if they are actionable in some way (this PR, new PR or disregard)

// yet had BAL[1] applied — producing the wrong root.
let all_prior_filled = (0..batch_idx).all(|k| batch_filled[k]);
if !all_prior_filled {
    continue;

Strict-ordering continue silently drops a usable BAL. If the peer returns [BAL[0]=None, BAL[1]=Some, BAL[2]=Some], we increment retry_counts[0] and then hit this guard for batch_idx=1 and batch_idx=2 — both Some BALs we just paid bandwidth for are discarded, and the next request re-asks for indices [0, 1, 2]. With BAL_REQUEST_BATCH_SIZE=64 and even a small per-block miss rate at slot 0, you wind up re-fetching almost the entire batch on every miss.

Low-effort fix: buffer the BAL into a pending: Vec<Option<&BlockAccessList>> parallel to batch_filled, and when slot K finally fills, drain forward through K+1, K+2, ... applying any already-buffered BALs in order. Pure perf, not correctness — the spec doesn't require this — but the savings are real on a flaky peer set.

Non-blocking.
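
A rough sketch of the buffer-and-drain idea described in this comment. `PendingBatch` is a hypothetical helper, `T` stands in for `BlockAccessList`, and the buffer owns its items rather than borrowing them from the response. It assumes each BAL has already passed its hash check before being buffered, and it does not model what happens when applying a drained BAL fails.

```rust
/// Per-batch reorder buffer; `T` stands in for `BlockAccessList`.
pub struct PendingBatch<T> {
    /// Received-but-not-yet-applied items, indexed by batch slot.
    pending: Vec<Option<T>>,
    /// Lowest slot that has not been handed out for application yet.
    next_to_apply: usize,
}

impl<T> PendingBatch<T> {
    pub fn new(batch_len: usize) -> Self {
        Self {
            pending: (0..batch_len).map(|_| None).collect(),
            next_to_apply: 0,
        }
    }

    /// Buffers `item` for slot `idx`, then returns every item that can now be
    /// applied, in block order. Out-of-range or already-applied slots are ignored.
    pub fn accept(&mut self, idx: usize, item: T) -> Vec<T> {
        if idx >= self.next_to_apply && idx < self.pending.len() {
            self.pending[idx] = Some(item);
        }
        let mut ready = Vec::new();
        while self.next_to_apply < self.pending.len() {
            match self.pending[self.next_to_apply].take() {
                Some(next) => {
                    ready.push(next);
                    self.next_to_apply += 1;
                }
                None => break, // gap: wait for this slot before draining further
            }
        }
        ready
    }

    /// Slots that still need to be requested from a peer.
    pub fn missing(&self) -> Vec<usize> {
        (self.next_to_apply..self.pending.len())
            .filter(|&i| self.pending[i].is_none())
            .collect()
    }
}
```

In the `[None, Some, Some]` example above, only slot 0 would be re-requested, and its arrival releases all three BALs in order.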

let hash = store
    .get_canonical_block_hash(number)
    .await?
    .ok_or(SyncError::MissingHeaderForBal(H256::zero()))?;

SyncError::MissingHeaderForBal(H256::zero()) here loses the only piece of information that would let an operator debug this — the missing block number. If get_canonical_block_hash(number) returns None, we know the canonical chain is missing at number (a real DB inconsistency given we just successfully walked from start_number to end_number), but the error reports a zero hash instead of the number.

Two options:

  • Add a MissingCanonicalAtNumber(u64) variant for this branch and use it (clearest).
  • Or at minimum keep the existing variant but include the number in the error string by formatting it into the hash, or extending the variant payload.

Non-blocking, but cheap to fix and saves a 30-minute debug session later.


/// `PeerHandler` requires an `RLPxInitiator` actor to construct; that makes
/// it impractical to directly unit-test `advance_state_via_bals` here. The
/// orchestration is covered by the deferred E2E test (M4 — Phase 3). What

The honesty about the coverage gap is appreciated, but the orchestration in advance_state_via_bals has the densest bug surface of the PR — out-of-order batches, per-block retry accounting, batch-boundary parent linkage, peer-failure recording, the early-exit conditions. "Deferred to M4 — Phase 3" leaves the most error-prone code uncovered until the integration phase.

A cheap intermediate: factor out the inner peer-response → fill-batch step into a pure function that takes a Vec<Option<BlockAccessList>> (the response) + the current batch_filled / retry_counts / current_root state, and returns the next state. That function is unit-testable without a PeerHandler, and the cases that benefit are exactly the ones I'm worried about (out-of-order responses, single-slot exhaustion, mid-batch reorg detection).

Non-blocking — the comment is fair — but worth pulling forward if M4 slips.
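
To make the "pure function" suggestion concrete, here is a small sketch of the response-to-state step with the retry accounting pulled out of the network loop. Everything here is an assumption made for illustration: the names `fill_batch_step` and `StepOutcome`, the positional alignment of `response` with the batch, and the rule that only absent slots count as a retry. It keeps the PR's strict-ordering behaviour and does not apply BALs or track `current_root`.

```rust
/// Result of processing one peer response (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum StepOutcome {
    /// Slots that became applicable, in block order (possibly empty).
    Progress(Vec<usize>),
    /// This slot reached the retry limit; the caller should fall back.
    Exhausted(usize),
}

/// Updates `filled` and `retry_counts` from one response, with no I/O.
/// `response[i]` is assumed to correspond to batch slot `i`.
pub fn fill_batch_step<T>(
    filled: &mut [bool],
    retry_counts: &mut [u32],
    response: &[Option<T>],
    max_retries: u32,
) -> StepOutcome {
    debug_assert_eq!(filled.len(), retry_counts.len());
    let mut newly_filled = Vec::new();
    for idx in 0..filled.len() {
        if filled[idx] {
            continue;
        }
        match response.get(idx) {
            Some(Some(_)) => {
                // Strict ordering: a slot is usable only once all earlier slots are filled.
                if filled[..idx].iter().all(|&f| f) {
                    filled[idx] = true;
                    newly_filled.push(idx);
                }
            }
            // Absent or truncated slot: count a retry against it.
            _ => {
                retry_counts[idx] += 1;
                if retry_counts[idx] >= max_retries {
                    return StepOutcome::Exhausted(idx);
                }
            }
        }
    }
    StepOutcome::Progress(newly_filled)
}
```

Because the function takes only slices, out-of-order responses and single-slot exhaustion can be exercised with plain arrays in a unit test, without constructing a `PeerHandler`.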

break;
}

let header = storage

Per-hash header lookup runs even when we already have a BAL. When raw_bals[i] is Some(bal), we already know the block is post-Amsterdam (we wouldn't have stored a BAL for a pre-Amsterdam block). The only branch that needs header.block_access_list_hash is the §100 case — i.e. when raw_bal is None and we need to distinguish "pre-Amsterdam" from "unknown/pruned".

Cheap rewrite:

let slot = match raw_bal {
    Some(bal) => Some(bal),
    None => match storage.get_block_header_by_hash(*hash)? {
        Some(h) if h.block_access_list_hash.is_none() => None, // §100
        _ => None, // unknown/pruned
    },
};

Saves N storage reads on the happy path where the peer has all the BALs. On a 1024-hash request that's ~1024 fewer disk hits per response, which is the kind of work this defense-against-DoS limit was designed to bound in the first place.

Non-blocking perf.


impl RLPDecode for Snap2OptionalBal {
    fn decode_unfinished(rlp: &[u8]) -> Result<(Self, &[u8]), RLPDecodeError> {
        if rlp.first() == Some(&0x80) {

The byte-peek decoder is correct, but the invariant it relies on isn't written down here. RLP encodes a short string (0 to 55 bytes) with a first byte in 0x80..0xb7 and a list with a first byte in 0xc0..0xff. 0x80 is unambiguous for None only because a BlockAccessList is always encoded as a list (first byte >= 0xc0). If anyone ever refactors BlockAccessList into a wrapper type whose RLP encoding could collapse into a short string, this decoder silently mis-decodes the first BAL of the batch as None.

Worth a one-line invariant comment right above the if, e.g.:

// INVARIANT: BlockAccessList encodes as an RLP list (first byte >= 0xc0),
// so 0x80 is unambiguously the None sentinel.

or a stricter check (rest @ [0x80] -> None; otherwise decode_unfinished). Either keeps the next refactor from biting silently.

Non-blocking.
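
One possible reading of the stricter check, sketched on raw bytes so it does not depend on the crate's RLP traits. `SlotKind`, `SlotError` and `classify_slot` are hypothetical names. The sketch only inspects the first byte: it consumes the `0x80` sentinel, hands list-prefixed input back untouched for the real list decoder, and rejects every other prefix instead of guessing.

```rust
/// What the first byte of a response slot indicates (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum SlotKind {
    /// `0x80`: the peer has no BAL for this block.
    Absent,
    /// First byte >= `0xc0`: an RLP list follows.
    List,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SlotError {
    /// No bytes left to decode.
    Empty,
    /// Neither the sentinel nor a list prefix.
    UnexpectedPrefix(u8),
}

/// Classifies a slot and returns the bytes the next decoder should see.
/// For `Absent` the sentinel byte is consumed; for `List` the input is
/// returned unchanged so the list decoder can read its own header.
pub fn classify_slot(rlp: &[u8]) -> Result<(SlotKind, &[u8]), SlotError> {
    match rlp.split_first() {
        None => Err(SlotError::Empty),
        Some((&0x80, rest)) => Ok((SlotKind::Absent, rest)),
        Some((&first, _)) if first >= 0xc0 => Ok((SlotKind::List, rlp)),
        Some((&first, _)) => Err(SlotError::UnexpectedPrefix(first)),
    }
}
```

With an explicit error for other prefixes, a future change that makes a BAL encode as a string would surface as a decode failure rather than a silent None.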


fn encode(&self, buf: &mut dyn BufMut) -> Result<(), RLPEncodeError> {
    let mut encoded_data = vec![];
    let bals: Vec<Snap2OptionalBal> = self.bals.iter().cloned().map(Snap2OptionalBal).collect();

.iter().cloned() deep-clones every BlockAccessList in the response just to wrap it in Snap2OptionalBal for the encoder. BALs are not small (kilobytes per block, hundreds of kilobytes in a 64-block batch), so this is a real allocation hit on the server side, every response.

The wrapper exists only to give RLPEncode somewhere to dispatch the None-as-0x80 case. A borrowing variant — e.g. struct Snap2OptionalBalRef<'a>(Option<&'a BlockAccessList>) with its own RLPEncode impl — gets the same wire format with zero copies. Then:

let bals: Vec<Snap2OptionalBalRef<'_>> =
    self.bals.iter().map(|b| Snap2OptionalBalRef(b.as_ref())).collect();

Decode side can keep Snap2OptionalBal (owning) since the wire bytes are owned anyway.

Non-blocking perf.
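
A self-contained sketch of the borrowing-wrapper idea, using a toy encoding trait in place of the crate's `RLPEncode`. `ToyEncode`, `ToyBal`, `OptionalBalRef` and `encode_slots` are all hypothetical, and the sketch writes only the concatenated slot payloads, not the outer RLP list header.

```rust
/// Toy stand-in for the crate's RLP encoding trait (hypothetical).
pub trait ToyEncode {
    fn encode(&self, out: &mut Vec<u8>);
}

/// Stand-in for `BlockAccessList`: carries already-encoded list bytes.
pub struct ToyBal(pub Vec<u8>);

impl ToyEncode for ToyBal {
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.0);
    }
}

/// Borrowing wrapper: encodes `None` as the 0x80 sentinel without owning the BAL.
pub struct OptionalBalRef<'a, T>(pub Option<&'a T>);

impl<T: ToyEncode> ToyEncode for OptionalBalRef<'_, T> {
    fn encode(&self, out: &mut Vec<u8>) {
        match self.0 {
            Some(bal) => bal.encode(out),
            None => out.push(0x80),
        }
    }
}

/// Encodes every slot by reference; no `T` is cloned.
pub fn encode_slots<T: ToyEncode>(bals: &[Option<T>]) -> Vec<u8> {
    let mut out = Vec::new();
    for slot in bals {
        OptionalBalRef(slot.as_ref()).encode(&mut out);
    }
    out
}
```

Note that `ToyBal` does not implement `Clone`, so the example would not compile if the encoder tried to copy a BAL.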

Labels

L1 Ethereum client

Projects

Status: In Review

3 participants