fix(l1): fix storage range fetching for big accounts by iovoid · Pull Request #6553 · lambdaclass/ethrex

iovoid · 2026-04-30T15:36:24Z

Motivation

Storage range download would get stuck on large networks with small pivot times.

Description

We always queued bulk tasks from [0, MAX] regardless of an account's existing intervals, relying on the response handler's big account split path to re-fire each call. That path only fires when a peer's response is truncated. When a peer fully covered a known-big account in one response, no branch ran, and the account's intervals stayed in accounts_with_storage_root forever.

github-actions · 2026-04-30T15:39:01Z

Lines of code report

Total lines added: 65
Total lines removed: 0
Total lines changed: 65

Detailed view

+---------------------------------------------+-------+------+
| File                                        | Lines | Diff |
+---------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/client.rs | 1205  | +65  |
+---------------------------------------------+-------+------+

github-actions · 2026-04-30T15:39:11Z

🤖 Claude Code Review

Now I have everything I need. Here is my review:

PR #6553: fix(l1): fix storage range fetching for big accounts

Overview

The PR fixes a real correctness bug in snap sync: big accounts (those with existing interval tracking) were permanently stuck in accounts_with_storage_root because the task scheduler queued all accounts from [0, MAX] unconditionally. The "big account split" path in the response handler only fires when a peer truncates its response; if a peer fully covered a big account in one response, the split path never ran, intervals never drained, and the account was stuck forever.

The fix partitions scheduling into two explicit paths:

Bulk tasks for fresh accounts with empty intervals (batched as before).
Per-interval tasks for big accounts, one task per interval, and a new completion path that removes each interval after it's served.

The approach is sound. A few issues are worth addressing:

Correctness Issues

1. Linear scan picks the last matching account, not the first

crates/networking/p2p/snap/client.rs, new diff lines 94–101:

let mut acc_hash: H256 = H256::zero();
for account in accounts_by_root_hash[start_index].1.iter() {
    if let Some((_, old_intervals)) = account_storage_roots
        .accounts_with_storage_root
        .get(account)
        && !old_intervals.is_empty()
    {
        acc_hash = *account;   // no break — overwrites on every match
    }
}

Because there is no break, if the group at start_index contains multiple accounts with non-empty intervals (possible when multiple accounts share the same storage root), acc_hash ends up being the last one rather than any deterministic choice. The subsequent get_mut + remove only cleans up that one account's entry, while the others retain the stale interval. A find-based approach is cleaner and makes intent explicit:

let acc_hash = accounts_by_root_hash[start_index].1.iter().find(|account| {
    account_storage_roots
        .accounts_with_storage_root
        .get(*account)
        .map_or(false, |(_, ivs)| !ivs.is_empty())
}).copied();

2. Interval matched by end hash only

New diff line 113:

.position(|(_old_start, end)| end == &hash_end)

The _old_start is ignored. If two tracked intervals for the same account happen to share an end hash (edge case, but possible in pathological state), this silently removes the wrong interval. Since start_hash is available in the StorageTask currently being processed, matching on (start_hash, end_hash) would be strictly safer.

3. Interval cleanup inconsistency for shared-root account groups

When an interval is drained, the code removes it from acc_hash's entry (diff line 117) but marks all accounts in the group as done (diff lines 119–122). If every account in the group has its own separate entry in accounts_with_storage_root (which is the expected case — they all share the same storage root and therefore the same intervals), the other accounts' entries are never cleaned up. They stay with stale intervals in the map forever, even though those accounts are added to healed_accounts. This is a memory leak of tracking state. Either clean up all accounts' entries, or document/enforce the invariant that only first_account (the canonical representative) holds the interval list.

Minor Issues

4. Two-level if/else if could be an else clause

Diff line 53:

} else {
    if let Some(start) = bulk_chunk_start {

This is a Clippy lint (clippy::collapsible_else_if) and can be written as } else if let Some(start) = bulk_chunk_start {.

5. Very long line in get_mut error

Diff line 110 has a 130+ character line for the ok_or message. Wrapping it to match the surrounding style (as is done for the other ok_or_else calls) would keep the file consistent.

6. InternalError used for invariant violations

SnapError::InternalError is returned for conditions like "empty accounts vector" and "interval not found", which are programmer invariants, not recoverable runtime errors. Using expect() or a debug_assert! before the operation would communicate that intent more clearly, though returning Err is the more conservative choice for a network-path function. This is a style preference.

Missing Test Coverage

The new per-interval scheduling path and the interval completion/removal path have no unit tests. Given that the original bug was subtle (it only manifested when a peer's response exactly covered a big account without truncating), a regression test — even a minimal one with a mocked peer — would prevent future regressions and validate the accounting logic. This is the most impactful gap in the PR.

Summary

The root cause diagnosis is correct and the fix is logically sound. Items 1 and 2 above are the most important to address before merging: the wrong-account selection and end-hash-only matching could produce incorrect behavior in multi-account-per-root groups or edge cases with interval overlap. Item 3 (stale map entries) is lower severity but leaves the tracking state inconsistent. A regression test would significantly strengthen confidence in the fix.

Automated review by Claude (Anthropic) · sonnet · custom prompt

greptile-apps · 2026-04-30T15:40:03Z

Greptile Summary

This PR fixes a liveness bug in snap-sync storage downloading: accounts with large storage tries would get stuck because the per-interval re-queue path only fired on truncated peer responses, so a peer that covered the whole account in one reply left the intervals undrained forever. The fix partitions tasks at function entry — fresh accounts get bulk tasks, previously-split accounts get one task per recorded interval — and adds a completion handler that removes the finished interval and finalises the account group when all intervals are gone.

The partitioning loop keys the bulk-vs-per-interval decision on first_account, taken from a Vec built by iterating a HashMap (non-deterministic order). Because the split path stores intervals under the first_acc_hash determined in a different call, first_account can differ between calls for groups with more than one account sharing a storage root, silently rescheduling a big account as a bulk task and reproducing the original stuck behavior.
The completion handler matches an interval for removal by end_hash alone; matching (start_hash, end_hash) would be safer.

Confidence Score: 3/5

The fix resolves the described bug but introduces a latent regression path for accounts sharing a storage root due to non-deterministic HashMap iteration affecting interval lookup.

A P1 finding (non-deterministic first_account selection) can reproduce the original liveness bug for a specific account topology, capping confidence at 4; the additional P2 (end-hash-only matching) lowers it slightly further to 3.

crates/networking/p2p/snap/client.rs — partitioning loop (line ~571) and interval-removal handler (line ~944)

Important Files Changed

Filename	Overview
crates/networking/p2p/snap/client.rs	Fixes stuck storage-range download for big accounts by partitioning tasks into bulk and per-interval paths at function entry; introduces an interval-removal handler for fully-covered per-interval tasks, but the partitioning relies on `first_account` (non-deterministic HashMap iteration order) to detect existing intervals, which can silently fall back to the buggy bulk path when multiple accounts share a storage root.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[request_storage_ranges called] --> B[Build accounts_by_root_hash from HashMap]
    B --> C{For each group: check first_account.intervals}
    C -- empty intervals --> D[Schedule bulk StorageTask start_hash=zero]
    C -- non-empty intervals --> E[Schedule per-interval StorageTasks]
    D --> F[Task dispatched to peer]
    E --> F
    F --> G{peer response: remaining_start < remaining_end?}
    G -- yes, partial --> H{hash_start.is_zero?}
    H -- yes bulk partial --> I[Re-queue remaining bulk chunk]
    H -- no, per-interval partial --> J[Update interval start_hash, re-queue]
    G -- no, fully covered --> K{hash_end.is_some?}
    K -- no, bulk complete --> L[Mark accounts done]
    K -- yes, per-interval complete NEW PATH --> M[Find acc with non-empty intervals in group]
    M --> N[Remove matching interval by end_hash]
    N --> O{intervals empty?}
    O -- yes --> P[Mark all group accounts done + healed]
    O -- no --> Q[More intervals remain, await other tasks]
    G -- no, fully covered & hash_start!=0 & no hash_end --> R[Big account detected: split into chunks, store intervals]

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
crates/networking/p2p/snap/client.rs:571-583
**Interval lookup keyed on non-deterministic `first_account`**

The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.

### Issue 2 of 2
crates/networking/p2p/snap/client.rs:944-946
**Interval matched by end-hash only**

`position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.

```suggestion
                    let pos = old_intervals
                        .iter()
                        .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
                        .ok_or(SnapError::InternalError(
                            "Could not find an old interval that we were tracking".to_owned(),
                        ))?;
```

_{Reviews (1): Last reviewed commit: "snap: address review nits in scheduling ..." | Re-trigger Greptile}

greptile-apps · 2026-04-30T15:40:07Z

+    for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
+        let first_account = *accounts.first().ok_or_else(|| {
+            SnapError::InternalError("Empty accounts vector while scheduling tasks".to_owned())
+        })?;
+        let intervals = &account_storage_roots
+            .accounts_with_storage_root
+            .get(&first_account)
+            .ok_or_else(|| {
+                SnapError::InternalError(
+                    "Could not find intervals for account while scheduling".to_owned(),
+                )
+            })?
+            .1;


Interval lookup keyed on non-deterministic first_account

The partitioning loop decides bulk vs. per-interval by looking up first_account.intervals, where first_account is accounts.first() from a Vec built by iterating account_storage_roots.accounts_with_storage_root — a HashMap whose iteration order is non-deterministic across calls.

The split path stores intervals under exactly one key, first_acc_hash = accounts_by_root_hash[remaining_start].1.first(), which was determined in a previous call's HashMap iteration order. In a subsequent call, accounts_by_root_hash[i].1.first() can land on a different account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/networking/p2p/snap/client.rs Line: 571-583 Comment: **Interval lookup keyed on non-deterministic `first_account`** The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls. The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-30T15:40:08Z

+                    let pos = old_intervals
+                        .iter()
+                        .position(|(_old_start, end)| end == &hash_end)


Interval matched by end-hash only

position(|(_old_start, end)| end == &hash_end) ignores start_hash when locating the completed interval. If two intervals in the same account happen to share the same end_hash (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both (start_hash, end_hash) would make the lookup unambiguous.

Suggested change

let pos = old_intervals

.iter()

.position(|(_old_start, end)| end == &hash_end)

let pos = old_intervals

.iter()

.position(|(old_start, end)| old_start == &hash_start && end == &hash_end)

.ok_or(SnapError::InternalError(

"Could not find an old interval that we were tracking".to_owned(),

))?;

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/networking/p2p/snap/client.rs Line: 944-946 Comment: **Interval matched by end-hash only** `position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous. ```suggestion let pos = old_intervals .iter() .position(|(old_start, end)| old_start == &hash_start && end == &hash_end) .ok_or(SnapError::InternalError( "Could not find an old interval that we were tracking".to_owned(), ))?; ``` How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-04-30T15:41:33Z

🤖 Codex Code Review

No blocking findings.

The new task partitioning around client.rs:561 and the explicit “completed interval” cleanup in client.rs:918 fit the existing big-account bookkeeping, and I don’t see a correctness, security, or consensus-risk regression in the changed logic.

Residual risk: I don’t see targeted regression coverage for the exact case this fixes. A focused test that starts with persisted old_intervals, reruns request_storage_ranges, and hits the remaining_start == remaining_end && hash_end.is_some() path would make this safer to maintain. I also couldn’t run cargo test here because the sandboxed toolchain failed creating temp files under /home/runner/.rustup.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

ElFantasma · 2026-04-30T18:52:51Z

-        let chunk_end = (chunk_start + chunk_size).min(accounts_by_root_hash.len());
+    let mut bulk_chunk_start: Option<usize> = None;
+    for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
+        let first_account = *accounts.first().ok_or_else(|| {


Implicit invariant: this uses first_account to look up intervals for the entire group (multiple accounts sharing the same storage_root), assuming all accounts in the group share the same interval state. The pre-existing handler at line 836 has the same assumption, so this isn't introduced here — but it's worth a doc comment near accounts_by_root_hash's declaration (~line 549) noting that all accounts in a group are expected to share intervals because they share storage. If that invariant ever breaks (e.g., one account in a group finalizes while a sibling is still pending), this scheduling code silently uses the first account's state for all of them.

ElFantasma · 2026-04-30T18:52:51Z

+                // acc_hash stays zero when a sibling per-interval task for the
+                // same account already drained the last interval and finalized
+                // it earlier in this call's loop — there's nothing left to do.
+                if !acc_hash.is_zero() {


Worth pinning the contrast with the existing similar block at lines 766-768: that one panic!s when acc_hash.is_zero(), while this one silently skips. The reason is given in the comment above (sibling already drained), but the differing-invariant-with-sibling-block is subtle. A one-liner like // (Compare to lines 766-768 which panic on zero — different scenario: that path runs only on remaining_start < remaining_end, so a missing acc_hash there genuinely means corruption.) would help future readers tell the two paths apart.

ElFantasma · 2026-04-30T18:52:52Z

+                // same account already drained the last interval and finalized
+                // it earlier in this call's loop — there's nothing left to do.
+                if !acc_hash.is_zero() {
+                    let (_, old_intervals) = account_storage_roots


nit: this block (940-956) and the existing block at lines 769-787 do the same three-step dance — find acc_hash, remove the matching interval by end == hash_end, and on empty-intervals mark the account done + healed. Worth extracting a helper like clear_completed_interval(account_storage_roots, accounts_by_root_hash, accounts_done, start_index, hash_end) -> Result<(), SnapError> and calling it from both. Reduces drift risk if the bookkeeping shape changes later.

ElFantasma · 2026-04-30T18:52:52Z

+            .1;
+        if intervals.is_empty() {
+            let chunk_start = *bulk_chunk_start.get_or_insert(i);
+            if i + 1 - chunk_start >= chunk_size {


Off-by-one check: i + 1 - chunk_start >= chunk_size flushes when the accumulated batch reaches chunk_size accounts (since i+1 is the exclusive end and chunk_start is the inclusive start). That matches STORAGE_BATCH_SIZE semantics elsewhere in the file. ✓ Just confirming I read this right.

…-accounts

github-actions · 2026-05-28T15:34:40Z

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known
Issues section the L1 workflow appends to each ef-tests job summary
and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened
test-stateless to extract the entire for_amsterdam/ tree from the
zkevm bundle and run all of it under --features stateless; combined with
this branch's bal-devnet-7 semantics that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch /
BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

iovoid added 3 commits April 28, 2026 10:32

snap: schedule per-interval tasks for big accounts at function entry

ddbfc6b

snap: drop per-interval task interval on clean finish

5d2c088

snap: address review nits in scheduling fix

391b03c

iovoid requested a review from a team as a code owner April 30, 2026 15:36

github-actions Bot assigned iovoid Apr 30, 2026

github-actions Bot added the L1 Ethereum client label Apr 30, 2026

github-project-automation Bot added this to ethrex_l1 Apr 30, 2026

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

ElFantasma reviewed May 4, 2026

View reviewed changes

azteca1998 approved these changes May 6, 2026

View reviewed changes

ilitteri approved these changes May 14, 2026

View reviewed changes

snap: address review nits on big-account scheduling

ae7c501

ilitteri requested a review from ElFantasma May 14, 2026 16:25

edg-l approved these changes May 28, 2026

View reviewed changes

github-project-automation Bot moved this to In Review in ethrex_l1 May 28, 2026

Merge remote-tracking branch 'origin/main' into fix/snap-schedule-big…

931fdb4

…-accounts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(l1): fix storage range fetching for big accounts#6553

fix(l1): fix storage range fetching for big accounts#6553
iovoid wants to merge 5 commits into
mainfrom
fix/snap-schedule-big-accounts

iovoid commented Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

greptile-apps Bot commented Apr 30, 2026

Important Files Changed

Uh oh!

greptile-apps Bot Apr 30, 2026

Uh oh!

greptile-apps Bot Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026

Uh oh!

ElFantasma Apr 30, 2026

Uh oh!

ElFantasma Apr 30, 2026

Uh oh!

ElFantasma Apr 30, 2026

Uh oh!

ElFantasma Apr 30, 2026

Uh oh!

github-actions Bot commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

iovoid commented Apr 30, 2026

Uh oh!

github-actions Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Lines of code report

Uh oh!

github-actions Bot commented Apr 30, 2026

🤖 Claude Code Review

PR #6553: fix(l1): fix storage range fetching for big accounts

Overview

Correctness Issues

Minor Issues

Missing Test Coverage

Summary

Uh oh!

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 30, 2026

🤖 Codex Code Review

Uh oh!

ElFantasma Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

ElFantasma Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 28, 2026

⚠️ Known Issues — intentionally skipped tests

Known Issues

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

github-actions Bot commented Apr 30, 2026 •

edited

Loading