Conversation
Lines of code reportTotal lines added: Detailed view |
|
601fd88 to
9304c5c
Compare
6a01e0c to
5f60e0b
Compare
🤖 Kimi Code ReviewThis PR implements snap/2 (EIP-8189) Block Access List (BAL) based state healing, replacing iterative Critical Observations1. Storage Trie Cache Interaction ( The comment explains that
2. Async Blocking Risk ( The function mixes async ( 3. Header Lookup N+1 Query
Security & Correctness4. RLP None Sentinel The 5. Version Gating The 6. State Root Validation Per-block state root verification against Minor Issues7. Typo in Constant Name use crate::snap::constants::BAL_RESPONSE_SOFT_CAP_BYTES;The constant is correctly spelled in 8. Unused Parameter _remaining_headers: &[BlockHeader],The fallback function ignores the remaining headers (healing works from state root diffs). Consider removing the parameter or adding a comment explaining why it's intentionally unused. 9. Documentation Reference /// B2: uses `Message::Snap2GetBlockAccessLists`...
The "B2" reference appears to be an internal ticket code. Consider removing or replacing with a public issue reference before merge. TestingThe test coverage is thorough:
Recommendations
The implementation correctly handles the EIP-8189 edge cases (orphaned blocks, pre-Amsterdam headers, byte budget truncation) and maintains safe fallback paths to snap/1 healing. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Codex Code ReviewFindings
I couldn’t run the new tests locally because Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Greptile SummaryThis PR implements EIP-8189 (snap/2), replacing iterative trie-node fetching during state healing with Block Access List (BAL) replay. It adds the full wire protocol (new
Confidence Score: 3/5The wire-protocol and BAL-apply layers are well-structured and well-tested, but the snap-sync integration has a gap that could leave a node with an incomplete trie after sync completes. When BAL replay produces the correct chain-head state root, the healing loop exits immediately with healing_done = true without calling heal_storage_trie. The storage_accounts map populated during the download phase is silently discarded, potentially leaving incomplete storage tries in the store for accounts not touched by any replayed BAL. Because trie roots are hash-based, the final state-root check passes even when some trie content is absent. crates/networking/p2p/sync/snap_sync.rs (storage healing skip on BAL replay success) and crates/networking/p2p/sync/bal_healing/mod.rs (throwaway CodeHashCollector in fallback path)
|
| Filename | Overview |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Staleness loop now branches on snap/2 availability. On successful BAL replay healing_done is set to true without running heal_storage_trie, leaving storage_accounts unprocessed and potentially incomplete storage tries in the store. |
| crates/networking/p2p/sync/bal_healing/mod.rs | New BAL-replay orchestration: header loading, batch fetching, per-block apply, retry and fallback logic. Fallback creates a throwaway CodeHashCollector; inner-batch ordering guard is correct but can cause redundant re-fetches when responses are out-of-order. |
| crates/networking/p2p/sync/bal_healing/apply.rs | New apply_bal: correctly applies balance/nonce/code/storage diffs, validates state root, writes trie nodes. Empty-BAL early return skips state-root validation (protected by caller but not self-enforcing). |
| crates/networking/p2p/rlpx/connection/server.rs | Adds SnapCapVersion negotiation after Hello handshake, build_snap2_bal_response with 2 MiB cap, defense-in-depth snap/2 check. Byte-cap truncation correctly preserves first entry and uses break rather than None-filling (per spec). |
| crates/networking/p2p/rlpx/message.rs | Adds SnapCapVersion enum with is_valid_code gate; compile-time assertions guard the snap/2 message range against BASED_CAPABILITY_OFFSET. Clean, correct dispatch. |
| crates/networking/p2p/peer_handler.rs | Adds request_snap2_bals which selects a snap/2 peer, sends GetBlockAccessLists, and attributes failures to the correct peer_id. |
| crates/storage/store.rs | Adds iter_block_access_lists_by_hashes as a thin ordered wrapper over get_block_access_list; always returns exactly hashes.len() entries. Correct and simple. |
| test/tests/p2p/bal_healing_tests.rs | Thorough unit tests for apply_bal (creation, destruction, storage, code, delegation clear, bad-root) and try_apply_bal_block (happy path, bad parent, bad hash, bad root, chain of three). Good coverage. |
Sequence Diagram
sequenceDiagram
participant SS as snap_sync
participant AV as advance_state_via_bals
participant PH as PeerHandler (snap/2)
participant AP as apply_bal
participant ST as Store
SS->>SS: should_use_bal_replay?
SS->>AV: advance_state_via_bals(pivot-head)
loop per batch of 64 blocks
AV->>PH: request_snap2_bals(batch_hashes)
PH-->>AV: "Vec<Option<BAL>>, peer_id"
loop per slot in batch (strict order)
AV->>AP: apply_bal(store, parent_root, bal, header)
AP->>ST: open_state_trie(parent_root)
AP->>ST: write_batch(ACCOUNT_TRIE_NODES)
AP->>ST: write_batch(STORAGE_TRIE_NODES)
AP-->>AV: new_state_root
AV->>ST: store_block_access_list(hash, bal)
end
end
AV-->>SS: final_root
SS->>SS: "final_root == head.state_root?"
alt match
SS->>SS: "healing_done = true"
else mismatch
SS->>SS: heal_state_trie_wrap + heal_storage_trie (snap/1 fallback)
end
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
crates/networking/p2p/sync/snap_sync.rs:540-542
**Storage trie healing skipped on successful BAL replay**
When BAL replay succeeds and `healing_done = true` is set, `heal_storage_trie` is never called. The `storage_accounts` map accumulated during the snap download phase (tracking accounts with incomplete storage tries) is silently discarded. BAL replay only writes storage trie nodes for accounts that appear in one of the replayed BALs; accounts untouched by any BAL in the pivot-to-head range retain whatever (potentially incomplete) storage nodes were downloaded during the snap download phase. Because trie roots are computed from hashes rather than full content, the final state-root check can pass even when underlying storage trie content is missing — leaving the synced node unable to serve those storage slots correctly. The snap/1 path always calls `heal_storage_trie(&storage_accounts, …)` before setting `healing_done = true`; the same step should be required here.
### Issue 2 of 3
crates/networking/p2p/sync/bal_healing/apply.rs:52-55
**Empty BAL early return bypasses per-block state-root validation**
When `bal.is_empty()` the function returns `parent_state_root` without checking it against `block_header.state_root`. In the production call-chain this is protected by the BAL hash check in `try_apply_bal_block`, but `apply_bal` is a public API and callers that bypass `try_apply_bal_block` (e.g. future direct callers or tests with a mismatched header) would silently receive the wrong root. A simple guard — `if bal.is_empty() && parent_state_root != block_header.state_root { return Err(SyncError::StateRootMismatch(…)) }` — would make the contract self-enforcing.
### Issue 3 of 3
crates/networking/p2p/sync/bal_healing/mod.rs:363-380
**Fallback `CodeHashCollector` writes to a throwaway temp directory**
`fallback_to_snap1_healing` creates a fresh `CodeHashCollector` rooted at `std::env::temp_dir().join("ethrex_bal_fallback_code_hashes")` rather than reusing the main sync's collector. Any code hashes discovered during snap/1 fallback healing are accumulated in this separate directory and never merged back into the main `code_hash_collector`. On the next sync iteration the main collector has no record of them, so they may be fetched again or silently missed. Additionally, the temp path is not cleaned up after use, which can accumulate stale files across sync attempts. The main collector (or at least its path) should be threaded into this function.
Reviews (1): Last reviewed commit: "feat(l1): EIP-8189 snap/2 BAL-based stat..." | Re-trigger Greptile
| if new_root == final_header.state_root { | ||
| // BAL replay succeeded — skip snap/1 trie healing and storage healing. | ||
| healing_done = true; |
There was a problem hiding this comment.
Storage trie healing skipped on successful BAL replay
When BAL replay succeeds and healing_done = true is set, heal_storage_trie is never called. The storage_accounts map accumulated during the snap download phase (tracking accounts with incomplete storage tries) is silently discarded. BAL replay only writes storage trie nodes for accounts that appear in one of the replayed BALs; accounts untouched by any BAL in the pivot-to-head range retain whatever (potentially incomplete) storage nodes were downloaded during the snap download phase. Because trie roots are computed from hashes rather than full content, the final state-root check can pass even when underlying storage trie content is missing — leaving the synced node unable to serve those storage slots correctly. The snap/1 path always calls heal_storage_trie(&storage_accounts, …) before setting healing_done = true; the same step should be required here.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/snap_sync.rs
Line: 540-542
Comment:
**Storage trie healing skipped on successful BAL replay**
When BAL replay succeeds and `healing_done = true` is set, `heal_storage_trie` is never called. The `storage_accounts` map accumulated during the snap download phase (tracking accounts with incomplete storage tries) is silently discarded. BAL replay only writes storage trie nodes for accounts that appear in one of the replayed BALs; accounts untouched by any BAL in the pivot-to-head range retain whatever (potentially incomplete) storage nodes were downloaded during the snap download phase. Because trie roots are computed from hashes rather than full content, the final state-root check can pass even when underlying storage trie content is missing — leaving the synced node unable to serve those storage slots correctly. The snap/1 path always calls `heal_storage_trie(&storage_accounts, …)` before setting `healing_done = true`; the same step should be required here.
How can I resolve this? If you propose a fix, please make it concise.| // Empty BAL: state root unchanged. | ||
| if bal.is_empty() { | ||
| return Ok(parent_state_root); | ||
| } |
There was a problem hiding this comment.
Empty BAL early return bypasses per-block state-root validation
When bal.is_empty() the function returns parent_state_root without checking it against block_header.state_root. In the production call-chain this is protected by the BAL hash check in try_apply_bal_block, but apply_bal is a public API and callers that bypass try_apply_bal_block (e.g. future direct callers or tests with a mismatched header) would silently receive the wrong root. A simple guard — if bal.is_empty() && parent_state_root != block_header.state_root { return Err(SyncError::StateRootMismatch(…)) } — would make the contract self-enforcing.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/bal_healing/apply.rs
Line: 52-55
Comment:
**Empty BAL early return bypasses per-block state-root validation**
When `bal.is_empty()` the function returns `parent_state_root` without checking it against `block_header.state_root`. In the production call-chain this is protected by the BAL hash check in `try_apply_bal_block`, but `apply_bal` is a public API and callers that bypass `try_apply_bal_block` (e.g. future direct callers or tests with a mismatched header) would silently receive the wrong root. A simple guard — `if bal.is_empty() && parent_state_root != block_header.state_root { return Err(SyncError::StateRootMismatch(…)) }` — would make the contract self-enforcing.
How can I resolve this? If you propose a fix, please make it concise.| let mut dummy_leafs: u64 = 0; | ||
| let mut dummy_storage = AccountStorageRoots::default(); | ||
|
|
||
| // Use a temporary directory in the system temp path for the fallback code-hash collector. | ||
| let tmp_dir = std::env::temp_dir().join("ethrex_bal_fallback_code_hashes"); | ||
| let mut dummy_collector = crate::sync::code_collector::CodeHashCollector::new(tmp_dir); | ||
|
|
||
| heal_state_trie_wrap( | ||
| state_root, | ||
| store.clone(), | ||
| peers, | ||
| staleness_timestamp, | ||
| &mut dummy_leafs, | ||
| &mut dummy_storage, | ||
| &mut dummy_collector, | ||
| ) | ||
| .await?; | ||
| Ok(()) |
There was a problem hiding this comment.
Fallback
CodeHashCollector writes to a throwaway temp directory
fallback_to_snap1_healing creates a fresh CodeHashCollector rooted at std::env::temp_dir().join("ethrex_bal_fallback_code_hashes") rather than reusing the main sync's collector. Any code hashes discovered during snap/1 fallback healing are accumulated in this separate directory and never merged back into the main code_hash_collector. On the next sync iteration the main collector has no record of them, so they may be fetched again or silently missed. Additionally, the temp path is not cleaned up after use, which can accumulate stale files across sync attempts. The main collector (or at least its path) should be threaded into this function.
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/sync/bal_healing/mod.rs
Line: 363-380
Comment:
**Fallback `CodeHashCollector` writes to a throwaway temp directory**
`fallback_to_snap1_healing` creates a fresh `CodeHashCollector` rooted at `std::env::temp_dir().join("ethrex_bal_fallback_code_hashes")` rather than reusing the main sync's collector. Any code hashes discovered during snap/1 fallback healing are accumulated in this separate directory and never merged back into the main `code_hash_collector`. On the next sync iteration the main collector has no record of them, so they may be fetched again or silently missed. Additionally, the temp path is not cleaned up after use, which can accumulate stale files across sync attempts. The main collector (or at least its path) should be threaded into this function.
How can I resolve this? If you propose a fix, please make it concise.
🤖 Claude Code ReviewNow I have a thorough understanding of the PR. Let me write the review. Review: feat(l1): add EIP-8189 snap/2 (BAL-based state healing)Overall this is a well-structured implementation. The wire protocol, codec, capability negotiation, and test coverage are solid. A few correctness and design issues are worth addressing before merge. 1. Inner
|
Implements snap/2 alongside snap/1 with version-gated codec dispatch. Wire: SnapCapVersion enum, Snap2GetBlockAccessLists (0x08) / Snap2BlockAccessLists (0x09) with None encoded as RLP 0x80 per EIP-8189 §50,§58 (distinct from eth/71's 0xc0). Trie-node healing restricted to snap/1 peers via SNAP1_ONLY_CAPABILITIES. Server: honors §50 always-respond, §51 tail-truncation, §52 orphaned-block serving, §60 min(response_bytes, 2 MiB) cap, §100 None for pre-Amsterdam headers. Client + applier: PeerHandler::request_snap2_bals with snap/2-only peer filter; apply_bal validates per §68 (keccak256(rlp.encode(bal)) vs header.block_access_list_hash) and bal.validate_ordering() before applying diffs. Integration: when snap/2 peer connected and pivot is post-Amsterdam, advance_state_via_bals replaces snap/1 trie-node healing at the post-bulk-download site. Falls back to snap/1 on missing peer, hash failure, or chain reorg. Errors: StateRootMismatch (recoverable), MissingHeaderForBal and ChainReorgDetected (non-recoverable). Tests: 12 codec/version-gate, 7 server, 8 applier, 4 driver, 2 storage helper. E2E two-node BAL replay deferred (no harness).
ElFantasma
left a comment
There was a problem hiding this comment.
It LGTM. Left several comments but none is too important. Feel free to decide if they are actionable in some way (this PR, new PR or disregard)
| // yet had BAL[1] applied — producing the wrong root. | ||
| let all_prior_filled = (0..batch_idx).all(|k| batch_filled[k]); | ||
| if !all_prior_filled { | ||
| continue; |
There was a problem hiding this comment.
Strict-ordering continue silently drops a usable BAL. If the peer returns [BAL[0]=None, BAL[1]=Some, BAL[2]=Some], we increment retry_counts[0] and then hit this guard for batch_idx=1 and batch_idx=2 — both Some BALs we just paid bandwidth for are discarded, and the next request re-asks for indices [0, 1, 2]. With BAL_REQUEST_BATCH_SIZE=64 and any small per-block hit rate at slot 0, you wind up re-fetching almost the entire batch on every miss.
Low-effort fix: buffer the BAL into a pending: Vec<Option<&BlockAccessList>> parallel to batch_filled, and when slot K finally fills, drain forward through K+1, K+2, ... applying any already-buffered BALs in order. Pure perf, not correctness — the spec doesn't require this — but the savings are real on a flaky peer set.
Non-blocking.
| let hash = store | ||
| .get_canonical_block_hash(number) | ||
| .await? | ||
| .ok_or(SyncError::MissingHeaderForBal(H256::zero()))?; |
There was a problem hiding this comment.
SyncError::MissingHeaderForBal(H256::zero()) here loses the only piece of information that would let an operator debug this — the missing block number. If get_canonical_block_hash(number) returns None, we know the canonical chain is missing at number (a real DB inconsistency given we just successfully walked from start_number to end_number), but the error reports a zero hash instead of the number.
Two options:
- Add a
MissingCanonicalAtNumber(u64)variant for this branch and use it (clearest). - Or at minimum keep the existing variant but include the number in the error string by formatting it into the hash, or extending the variant payload.
Non-blocking, but cheap to fix and saves a 30-minute debug session later.
|
|
||
| /// `PeerHandler` requires an `RLPxInitiator` actor to construct; that makes | ||
| /// it impractical to directly unit-test `advance_state_via_bals` here. The | ||
| /// orchestration is covered by the deferred E2E test (M4 — Phase 3). What |
There was a problem hiding this comment.
The honesty about the coverage gap is appreciated, but the orchestration in advance_state_via_bals has the densest bug surface of the PR — out-of-order batches, per-block retry accounting, batch-boundary parent linkage, peer-failure recording, the early-exit conditions. "Deferred to M4 — Phase 3" leaves the most error-prone code uncovered until the integration phase.
A cheap intermediate: factor out the inner peer-response → fill-batch step into a pure function that takes a Vec<Option<BlockAccessList>> (the response) + the current batch_filled / retry_counts / current_root state, and returns the next state. That function is unit-testable without a PeerHandler, and the cases that benefit are exactly the ones I'm worried about (out-of-order responses, single-slot exhaustion, mid-batch reorg detection).
Non-blocking — the comment is fair — but worth pulling forward if M4 slips.
| break; | ||
| } | ||
|
|
||
| let header = storage |
There was a problem hiding this comment.
Per-hash header lookup runs even when we already have a BAL. When raw_bals[i] is Some(bal), we already know the block is post-Amsterdam (we wouldn't have stored a BAL for a pre-Amsterdam block). The only branch that needs header.block_access_list_hash is the §100 case — i.e. when raw_bal is None and we need to distinguish "pre-Amsterdam" from "unknown/pruned".
Cheap rewrite:
let slot = match raw_bal {
Some(bal) => Some(bal),
None => match storage.get_block_header_by_hash(*hash)? {
Some(h) if h.block_access_list_hash.is_none() => None, // §100
_ => None, // unknown/pruned
},
};Saves N storage reads on the happy path where the peer has all the BALs. On a 1024-hash request that's ~1024 fewer disk hits per response, which is the kind of work this defense-against-DoS limit was designed to bound in the first place.
Non-blocking perf.
|
|
||
| impl RLPDecode for Snap2OptionalBal { | ||
| fn decode_unfinished(rlp: &[u8]) -> Result<(Self, &[u8]), RLPDecodeError> { | ||
| if rlp.first() == Some(&0x80) { |
There was a problem hiding this comment.
The byte-peek decoder is correct, but the invariant it relies on isn't written down here. RLP can encode a single-byte short string as 0x80..0xb7 and a list as 0xc0..0xff. 0x80 is unambiguous for None only because a BlockAccessList is always encoded as a list (first byte >= 0xc0). If anyone ever refactors BlockAccessList into a wrapper type whose RLP encoding could collapse into a short string, this decoder silently mis-decodes the first BAL of the batch as None.
Worth a one-line invariant comment right above the if, e.g.:
// INVARIANT: BlockAccessList encodes as an RLP list (first byte >= 0xc0),
// so 0x80 is unambiguously the None sentinel.or a stricter check (rest @ [0x80] -> None; otherwise decode_unfinished). Either keeps the next refactor from biting silently.
Non-blocking.
|
|
||
| fn encode(&self, buf: &mut dyn BufMut) -> Result<(), RLPEncodeError> { | ||
| let mut encoded_data = vec![]; | ||
| let bals: Vec<Snap2OptionalBal> = self.bals.iter().cloned().map(Snap2OptionalBal).collect(); |
There was a problem hiding this comment.
.iter().cloned() deep-clones every BlockAccessList in the response just to wrap it in Snap2OptionalBal for the encoder. BALs are not small (kilobytes per block, hundreds in a 64-block batch), so this is a real allocation hit on the server side, every response.
The wrapper exists only to give RLPEncode somewhere to dispatch the None-as-0x80 case. A borrowing variant — e.g. struct Snap2OptionalBalRef<'a>(Option<&'a BlockAccessList>) with its own RLPEncode impl — gets the same wire format with zero copies. Then:
let bals: Vec<Snap2OptionalBalRef<'_>> =
self.bals.iter().map(|b| Snap2OptionalBalRef(b.as_ref())).collect();Decode side can keep Snap2OptionalBal (owning) since the wire bytes are owned anyway.
Non-blocking perf.
Motivation
Implements EIP-8189: snap/2, which replaces iterative trie-node fetching during state healing with Block Access List (EIP-7928) replay. This is cleaner, avoids the need for partial-trie reconstruction, and lets the local node serve the same BALs to peers over snap/2 once sync is complete.
Depends on EIP-7928 (BAL header commitment) already present in this repo.
Description
Wire protocol
snap(2)alongsidesnap(1); per-connection negotiation picks the highest common version.GetBlockAccessLists(0x08) andBlockAccessLists(0x09).GetTrieNodes/TrieNodes; snap/1 connections serve them unchanged.BlockAccessListsresponse:Vec<Option<BlockAccessList>>with 0x80 RLP sentinel for absent entries, preserving position correspondence with the request.rlpx/message.rsguard the new message range againstBASED_CAPABILITY_OFFSET.BAL persistence
BLOCK_ACCESS_LISTSwithstore_block_access_list,get_block_access_list,iter_block_access_lists_by_hashes.execute_blockreturnsOption<BlockAccessList>;store_blocktakesOption<&BlockAccessList>. Pre-Amsterdam blocks passNone.add_block,add_block_pipeline_inner,add_blocks_in_batch.execute_block/execute_block_from_state.Server / client / replay engine
process_block_access_lists_requestrespects a 2 MiB soft cap (BAL_RESPONSE_SOFT_CAP_BYTES); first slot always included; remaining over-cap slots filled withNone.request_block_access_listsreturns(Vec<Option<BlockAccessList>>, peer_id)to keep failure attribution coherent.apply_bal(sync/bal_healing/apply.rs) applies BAL as state diffs: balance/nonce/code/storage post-values; implicit-empty entry triggers account deletion; missing local slots treated as zero pre-value; storage writes go directly to backend.advance_state_via_balsfetches in batches, validates ordering/hash before each apply, applies in strict block order using locally-trackedcurrent_root, persists each BAL. On peer exhaustion falls back to snap/1heal_state_trie_wrap.Sync state machine
Staleness loop in
snap_sync.rsbranches on snap/2 peer reachability: V2 path runsadvance_state_via_balsthen validates finalstate_root; V1 path is unchanged.Tests
process_block_access_lists_request(empty, all-known, mixed, 2 MiB cap, snap/1 rejection).apply_balunit tests: account creation/destruction, storage deletion, code deployment, EIP-7702 delegation clear, fresh storage slot, bad-state-root detection.test/tests/blockchain/batch_tests.rs.cargo test -p ethrex-p2p: 37 passing (was 27).cargo test -p ethrex-test: 421 passing (was 416).Notes
test/tests/p2p/has no multi-node RLPx harness, and the existing hive devp2p snap simulator calls the external Godevp2pCLI. A dedicated hive snap/2 simulator is needed.STORE_SCHEMA_VERSIONincrates/storage/lib.rsfor the newBLOCK_ACCESS_LISTStable.Checklist
STORE_SCHEMA_VERSION(crates/storage/lib.rs) if the PR includes breaking changes to theStorerequiring a re-sync.