Skip to content

fix(l1): fix storage range fetching for big accounts#6553

Open
iovoid wants to merge 5 commits into
mainfrom
fix/snap-schedule-big-accounts
Open

fix(l1): fix storage range fetching for big accounts#6553
iovoid wants to merge 5 commits into
mainfrom
fix/snap-schedule-big-accounts

Conversation

@iovoid
Copy link
Copy Markdown
Contributor

@iovoid iovoid commented Apr 30, 2026

Motivation

Storage range download would get stuck on large networks with small pivot times.

Description

We always queued bulk tasks from [0, MAX] regardless of an account's existing intervals, relying on the response handler's big account split path to re-fire each call. That path only fires when a peer's response is truncated. When a peer fully covered a known-big account in one response, no branch ran, and the account's intervals stayed in accounts_with_storage_root forever.

@iovoid iovoid requested a review from a team as a code owner April 30, 2026 15:36
@github-actions github-actions Bot added the L1 Ethereum client label Apr 30, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 30, 2026

Lines of code report

Total lines added: 65
Total lines removed: 0
Total lines changed: 65

Detailed view
+---------------------------------------------+-------+------+
| File                                        | Lines | Diff |
+---------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/client.rs | 1205  | +65  |
+---------------------------------------------+-------+------+

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Now I have everything I need. Here is my review:


PR #6553: fix(l1): fix storage range fetching for big accounts

Overview

The PR fixes a real correctness bug in snap sync: big accounts (those with existing interval tracking) were permanently stuck in accounts_with_storage_root because the task scheduler queued all accounts from [0, MAX] unconditionally. The "big account split" path in the response handler only fires when a peer truncates its response; if a peer fully covered a big account in one response, the split path never ran, intervals never drained, and the account was stuck forever.

The fix partitions scheduling into two explicit paths:

  • Bulk tasks for fresh accounts with empty intervals (batched as before).
  • Per-interval tasks for big accounts, one task per interval, and a new completion path that removes each interval after it's served.

The approach is sound. A few issues are worth addressing:


Correctness Issues

1. Linear scan picks the last matching account, not the first

crates/networking/p2p/snap/client.rs, new diff lines 94–101:

let mut acc_hash: H256 = H256::zero();
for account in accounts_by_root_hash[start_index].1.iter() {
    if let Some((_, old_intervals)) = account_storage_roots
        .accounts_with_storage_root
        .get(account)
        && !old_intervals.is_empty()
    {
        acc_hash = *account;   // no break — overwrites on every match
    }
}

Because there is no break, if the group at start_index contains multiple accounts with non-empty intervals (possible when multiple accounts share the same storage root), acc_hash ends up being the last one rather than any deterministic choice. The subsequent get_mut + remove only cleans up that one account's entry, while the others retain the stale interval. A find-based approach is cleaner and makes intent explicit:

let acc_hash = accounts_by_root_hash[start_index].1.iter().find(|account| {
    account_storage_roots
        .accounts_with_storage_root
        .get(*account)
        .map_or(false, |(_, ivs)| !ivs.is_empty())
}).copied();

2. Interval matched by end hash only

New diff line 113:

.position(|(_old_start, end)| end == &hash_end)

The _old_start is ignored. If two tracked intervals for the same account happen to share an end hash (edge case, but possible in pathological state), this silently removes the wrong interval. Since start_hash is available in the StorageTask currently being processed, matching on (start_hash, end_hash) would be strictly safer.

3. Interval cleanup inconsistency for shared-root account groups

When an interval is drained, the code removes it from acc_hash's entry (diff line 117) but marks all accounts in the group as done (diff lines 119–122). If every account in the group has its own separate entry in accounts_with_storage_root (which is the expected case — they all share the same storage root and therefore the same intervals), the other accounts' entries are never cleaned up. They stay with stale intervals in the map forever, even though those accounts are added to healed_accounts. This is a memory leak of tracking state. Either clean up all accounts' entries, or document/enforce the invariant that only first_account (the canonical representative) holds the interval list.


Minor Issues

4. Two-level if/else if could be an else clause

Diff line 53:

} else {
    if let Some(start) = bulk_chunk_start {

This is a Clippy lint (clippy::collapsible_else_if) and can be written as } else if let Some(start) = bulk_chunk_start {.

5. Very long line in get_mut error

Diff line 110 has a 130+ character line for the ok_or message. Wrapping it to match the surrounding style (as is done for the other ok_or_else calls) would keep the file consistent.

6. InternalError used for invariant violations

SnapError::InternalError is returned for conditions like "empty accounts vector" and "interval not found", which are programmer invariants, not recoverable runtime errors. Using expect() or a debug_assert! before the operation would communicate that intent more clearly, though returning Err is the more conservative choice for a network-path function. This is a style preference.


Missing Test Coverage

The new per-interval scheduling path and the interval completion/removal path have no unit tests. Given that the original bug was subtle (it only manifested when a peer's response exactly covered a big account without truncating), a regression test — even a minimal one with a mocked peer — would prevent future regressions and validate the accounting logic. This is the most impactful gap in the PR.


Summary

The root cause diagnosis is correct and the fix is logically sound. Items 1 and 2 above are the most important to address before merging: the wrong-account selection and end-hash-only matching could produce incorrect behavior in multi-account-per-root groups or edge cases with interval overlap. Item 3 (stale map entries) is lower severity but leaves the tracking state inconsistent. A regression test would significantly strengthen confidence in the fix.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR fixes a liveness bug in snap-sync storage downloading: accounts with large storage tries would get stuck because the per-interval re-queue path only fired on truncated peer responses, so a peer that covered the whole account in one reply left the intervals undrained forever. The fix partitions tasks at function entry — fresh accounts get bulk tasks, previously-split accounts get one task per recorded interval — and adds a completion handler that removes the finished interval and finalises the account group when all intervals are gone.

  • The partitioning loop keys the bulk-vs-per-interval decision on first_account, taken from a Vec built by iterating a HashMap (non-deterministic order). Because the split path stores intervals under the first_acc_hash determined in a different call, first_account can differ between calls for groups with more than one account sharing a storage root, silently rescheduling a big account as a bulk task and reproducing the original stuck behavior.
  • The completion handler matches an interval for removal by end_hash alone; matching (start_hash, end_hash) would be safer.

Confidence Score: 3/5

The fix resolves the described bug but introduces a latent regression path for accounts sharing a storage root due to non-deterministic HashMap iteration affecting interval lookup.

A P1 finding (non-deterministic first_account selection) can reproduce the original liveness bug for a specific account topology, capping confidence at 4; the additional P2 (end-hash-only matching) lowers it slightly further to 3.

crates/networking/p2p/snap/client.rs — partitioning loop (line ~571) and interval-removal handler (line ~944)

Important Files Changed

Filename Overview
crates/networking/p2p/snap/client.rs Fixes stuck storage-range download for big accounts by partitioning tasks into bulk and per-interval paths at function entry; introduces an interval-removal handler for fully-covered per-interval tasks, but the partitioning relies on first_account (non-deterministic HashMap iteration order) to detect existing intervals, which can silently fall back to the buggy bulk path when multiple accounts share a storage root.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[request_storage_ranges called] --> B[Build accounts_by_root_hash from HashMap]
    B --> C{For each group: check first_account.intervals}
    C -- empty intervals --> D[Schedule bulk StorageTask start_hash=zero]
    C -- non-empty intervals --> E[Schedule per-interval StorageTasks]
    D --> F[Task dispatched to peer]
    E --> F
    F --> G{peer response: remaining_start < remaining_end?}
    G -- yes, partial --> H{hash_start.is_zero?}
    H -- yes bulk partial --> I[Re-queue remaining bulk chunk]
    H -- no, per-interval partial --> J[Update interval start_hash, re-queue]
    G -- no, fully covered --> K{hash_end.is_some?}
    K -- no, bulk complete --> L[Mark accounts done]
    K -- yes, per-interval complete NEW PATH --> M[Find acc with non-empty intervals in group]
    M --> N[Remove matching interval by end_hash]
    N --> O{intervals empty?}
    O -- yes --> P[Mark all group accounts done + healed]
    O -- no --> Q[More intervals remain, await other tasks]
    G -- no, fully covered & hash_start!=0 & no hash_end --> R[Big account detected: split into chunks, store intervals]
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
crates/networking/p2p/snap/client.rs:571-583
**Interval lookup keyed on non-deterministic `first_account`**

The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.

### Issue 2 of 2
crates/networking/p2p/snap/client.rs:944-946
**Interval matched by end-hash only**

`position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.

```suggestion
                    let pos = old_intervals
                        .iter()
                        .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
                        .ok_or(SnapError::InternalError(
                            "Could not find an old interval that we were tracking".to_owned(),
                        ))?;
```

Reviews (1): Last reviewed commit: "snap: address review nits in scheduling ..." | Re-trigger Greptile

Comment thread crates/networking/p2p/snap/client.rs Outdated
Comment on lines +571 to +583
for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
let first_account = *accounts.first().ok_or_else(|| {
SnapError::InternalError("Empty accounts vector while scheduling tasks".to_owned())
})?;
let intervals = &account_storage_roots
.accounts_with_storage_root
.get(&first_account)
.ok_or_else(|| {
SnapError::InternalError(
"Could not find intervals for account while scheduling".to_owned(),
)
})?
.1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Interval lookup keyed on non-deterministic first_account

The partitioning loop decides bulk vs. per-interval by looking up first_account.intervals, where first_account is accounts.first() from a Vec built by iterating account_storage_roots.accounts_with_storage_root — a HashMap whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, first_acc_hash = accounts_by_root_hash[remaining_start].1.first(), which was determined in a previous call's HashMap iteration order. In a subsequent call, accounts_by_root_hash[i].1.first() can land on a different account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/snap/client.rs
Line: 571-583

Comment:
**Interval lookup keyed on non-deterministic `first_account`**

The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread crates/networking/p2p/snap/client.rs Outdated
Comment on lines +944 to +946
let pos = old_intervals
.iter()
.position(|(_old_start, end)| end == &hash_end)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Interval matched by end-hash only

position(|(_old_start, end)| end == &hash_end) ignores start_hash when locating the completed interval. If two intervals in the same account happen to share the same end_hash (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both (start_hash, end_hash) would make the lookup unambiguous.

Suggested change
let pos = old_intervals
.iter()
.position(|(_old_start, end)| end == &hash_end)
let pos = old_intervals
.iter()
.position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
.ok_or(SnapError::InternalError(
"Could not find an old interval that we were tracking".to_owned(),
))?;
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/snap/client.rs
Line: 944-946

Comment:
**Interval matched by end-hash only**

`position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.

```suggestion
                    let pos = old_intervals
                        .iter()
                        .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
                        .ok_or(SnapError::InternalError(
                            "Could not find an old interval that we were tracking".to_owned(),
                        ))?;
```

How can I resolve this? If you propose a fix, please make it concise.

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

No blocking findings.

The new task partitioning around client.rs:561 and the explicit “completed interval” cleanup in client.rs:918 fit the existing big-account bookkeeping, and I don’t see a correctness, security, or consensus-risk regression in the changed logic.

Residual risk: I don’t see targeted regression coverage for the exact case this fixes. A focused test that starts with persisted old_intervals, reruns request_storage_ranges, and hits the remaining_start == remaining_end && hash_end.is_some() path would make this safer to maintain. I also couldn’t run cargo test here because the sandboxed toolchain failed creating temp files under /home/runner/.rustup.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

Comment thread crates/networking/p2p/snap/client.rs Outdated
let chunk_end = (chunk_start + chunk_size).min(accounts_by_root_hash.len());
let mut bulk_chunk_start: Option<usize> = None;
for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
let first_account = *accounts.first().ok_or_else(|| {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit invariant: this uses first_account to look up intervals for the entire group (multiple accounts sharing the same storage_root), assuming all accounts in the group share the same interval state. The pre-existing handler at line 836 has the same assumption, so this isn't introduced here — but it's worth a doc comment near accounts_by_root_hash's declaration (~line 549) noting that all accounts in a group are expected to share intervals because they share storage. If that invariant ever breaks (e.g., one account in a group finalizes while a sibling is still pending), this scheduling code silently uses the first account's state for all of them.

Comment thread crates/networking/p2p/snap/client.rs Outdated
// acc_hash stays zero when a sibling per-interval task for the
// same account already drained the last interval and finalized
// it earlier in this call's loop — there's nothing left to do.
if !acc_hash.is_zero() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth pinning the contrast with the existing similar block at lines 766-768: that one panic!s when acc_hash.is_zero(), while this one silently skips. The reason is given in the comment above (sibling already drained), but the differing-invariant-with-sibling-block is subtle. A one-liner like // (Compare to lines 766-768 which panic on zero — different scenario: that path runs only on remaining_start < remaining_end, so a missing acc_hash there genuinely means corruption.) would help future readers tell the two paths apart.

Comment thread crates/networking/p2p/snap/client.rs Outdated
// same account already drained the last interval and finalized
// it earlier in this call's loop — there's nothing left to do.
if !acc_hash.is_zero() {
let (_, old_intervals) = account_storage_roots
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this block (940-956) and the existing block at lines 769-787 do the same three-step dance — find acc_hash, remove the matching interval by end == hash_end, and on empty-intervals mark the account done + healed. Worth extracting a helper like clear_completed_interval(account_storage_roots, accounts_by_root_hash, accounts_done, start_index, hash_end) -> Result<(), SnapError> and calling it from both. Reduces drift risk if the bookkeeping shape changes later.

Comment thread crates/networking/p2p/snap/client.rs Outdated
.1;
if intervals.is_empty() {
let chunk_start = *bulk_chunk_start.get_or_insert(i);
if i + 1 - chunk_start >= chunk_size {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off-by-one check: i + 1 - chunk_start >= chunk_size flushes when the accumulated batch reaches chunk_size accounts (since i+1 is the exclusive end and chunk_start is the inclusive start). That matches STORAGE_BATCH_SIZE semantics elsewhere in the file. ✓ Just confirming I read this right.

@ilitteri ilitteri requested a review from ElFantasma May 14, 2026 16:25
@github-project-automation github-project-automation Bot moved this to In Review in ethrex_l1 May 28, 2026
@github-actions
Copy link
Copy Markdown

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known
Issues
section the L1 workflow appends to each ef-tests job summary
and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened
test-stateless to extract the entire for_amsterdam/ tree from the
zkevm bundle and run all of it under --features stateless; combined with
this branch's bal-devnet-7 semantics that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch /
BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

5 participants