fix(l1): fix storage range fetching for big accounts #6553
🤖 Claude Code Review
Now I have everything I need. Here is my review:
PR #6553: fix(l1): fix storage range fetching for big accounts
Overview
The PR fixes a real correctness bug in snap sync: big accounts (those with existing interval tracking) were permanently stuck in
The fix partitions scheduling into two explicit paths:
The approach is sound. A few issues are worth addressing:
Correctness Issues
1. Linear scan picks the last matching account, not the first
    let mut acc_hash: H256 = H256::zero();
    for account in accounts_by_root_hash[start_index].1.iter() {
        if let Some((_, old_intervals)) = account_storage_roots
            .accounts_with_storage_root
            .get(account)
            && !old_intervals.is_empty()
        {
            acc_hash = *account; // no break: overwrites on every match
        }
    }
Because there is no
    let acc_hash = accounts_by_root_hash[start_index].1.iter().find(|account| {
        account_storage_roots
            .accounts_with_storage_root
            .get(*account)
            .map_or(false, |(_, ivs)| !ivs.is_empty())
    }).copied();
2. Interval matched by end hash only
New diff line 113:
    .position(|(_old_start, end)| end == &hash_end)
The
3. Interval cleanup inconsistency for shared-root account groups
When an interval is drained, the code removes it from
Minor Issues
4. Two-level
Diff line 53:
    } else {
        if let Some(start) = bulk_chunk_start {
This is a Clippy lint (
5. Very long line in
Diff line 110 has a 130+ character line for the
6.
Missing Test Coverage
The new per-interval scheduling path and the interval completion/removal path have no unit tests. Given that the original bug was subtle (it only manifested when a peer's response exactly covered a big account without truncating), a regression test, even a minimal one with a mocked peer, would prevent future regressions and validate the accounting logic. This is the most impactful gap in the PR.
Summary
The root cause diagnosis is correct and the fix is logically sound. Items 1 and 2 above are the most important to address before merging: the wrong-account selection and end-hash-only matching could produce incorrect behavior in multi-account-per-root groups or edge cases with interval overlap. Item 3 (stale map entries) is lower severity but leaves the tracking state inconsistent. A regression test would significantly strengthen confidence in the fix.
Automated review by Claude (Anthropic) · sonnet · custom prompt
Greptile Summary
This PR fixes a liveness bug in snap-sync storage downloading: accounts with large storage tries would get stuck because the per-interval re-queue path only fired on truncated peer responses, so a peer that covered the whole account in one reply left the intervals undrained forever. The fix partitions tasks at function entry (fresh accounts get bulk tasks, previously-split accounts get one task per recorded interval) and adds a completion handler that removes the finished interval and finalises the account group when all intervals are gone.
Confidence Score: 3/5
The fix resolves the described bug but introduces a latent regression path for accounts sharing a storage root due to non-deterministic HashMap iteration affecting interval lookup.
A P1 finding (non-deterministic first_account selection) can reproduce the original liveness bug for a specific account topology, capping confidence at 4; the additional P2 (end-hash-only matching) lowers it slightly further to 3.
crates/networking/p2p/snap/client.rs: partitioning loop (line ~571) and interval-removal handler (line ~944)
| Filename | Overview |
|---|---|
| crates/networking/p2p/snap/client.rs | Fixes stuck storage-range download for big accounts by partitioning tasks into bulk and per-interval paths at function entry; introduces an interval-removal handler for fully-covered per-interval tasks, but the partitioning relies on first_account (non-deterministic HashMap iteration order) to detect existing intervals, which can silently fall back to the buggy bulk path when multiple accounts share a storage root. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[request_storage_ranges called] --> B[Build accounts_by_root_hash from HashMap]
B --> C{For each group: check first_account.intervals}
C -- empty intervals --> D[Schedule bulk StorageTask start_hash=zero]
C -- non-empty intervals --> E[Schedule per-interval StorageTasks]
D --> F[Task dispatched to peer]
E --> F
F --> G{peer response: remaining_start < remaining_end?}
G -- yes, partial --> H{hash_start.is_zero?}
H -- yes bulk partial --> I[Re-queue remaining bulk chunk]
H -- no, per-interval partial --> J[Update interval start_hash, re-queue]
G -- no, fully covered --> K{hash_end.is_some?}
K -- no, bulk complete --> L[Mark accounts done]
K -- yes, per-interval complete NEW PATH --> M[Find acc with non-empty intervals in group]
M --> N[Remove matching interval by end_hash]
N --> O{intervals empty?}
O -- yes --> P[Mark all group accounts done + healed]
O -- no --> Q[More intervals remain, await other tasks]
G -- no, fully covered & hash_start!=0 & no hash_end --> R[Big account detected: split into chunks, store intervals]
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
crates/networking/p2p/snap/client.rs:571-583
**Interval lookup keyed on non-deterministic `first_account`**
The partitioning loop decides bulk vs. per-interval by looking up `first_account.intervals`, where `first_account` is `accounts.first()` from a `Vec` built by iterating `account_storage_roots.accounts_with_storage_root` — a `HashMap` whose iteration order is non-deterministic across calls.
The split path stores intervals under exactly one key, `first_acc_hash = accounts_by_root_hash[remaining_start].1.first()`, which was determined in a *previous* call's `HashMap` iteration order. In a subsequent call, `accounts_by_root_hash[i].1.first()` can land on a *different* account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.
### Issue 2 of 2
crates/networking/p2p/snap/client.rs:944-946
**Interval matched by end-hash only**
`position(|(_old_start, end)| end == &hash_end)` ignores `start_hash` when locating the completed interval. If two intervals in the same account happen to share the same `end_hash` (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both `(start_hash, end_hash)` would make the lookup unambiguous.
```suggestion
let pos = old_intervals
.iter()
.position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
.ok_or(SnapError::InternalError(
"Could not find an old interval that we were tracking".to_owned(),
))?;
```
Reviews (1): Last reviewed commit: "snap: address review nits in scheduling ..."
    for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
        let first_account = *accounts.first().ok_or_else(|| {
            SnapError::InternalError("Empty accounts vector while scheduling tasks".to_owned())
        })?;
        let intervals = &account_storage_roots
            .accounts_with_storage_root
            .get(&first_account)
            .ok_or_else(|| {
                SnapError::InternalError(
                    "Could not find intervals for account while scheduling".to_owned(),
                )
            })?
            .1;
Interval lookup keyed on non-deterministic `first_account`
The partitioning loop decides bulk vs. per-interval by looking up first_account.intervals, where first_account is accounts.first() from a Vec built by iterating account_storage_roots.accounts_with_storage_root — a HashMap whose iteration order is non-deterministic across calls.
The split path stores intervals under exactly one key, first_acc_hash = accounts_by_root_hash[remaining_start].1.first(), which was determined in a previous call's HashMap iteration order. In a subsequent call, accounts_by_root_hash[i].1.first() can land on a different account whose intervals are empty, causing the group to fall through to the bulk path and reproduce the original stuck behavior for accounts that share a storage root.
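One possible way to make the lookup independent of iteration order is to scan the whole group for the account that carries intervals instead of trusting `first()`. The sketch below is only an illustration: `AccountHash`, `Interval`, and `account_with_intervals` are hypothetical stand-ins (the real code uses `H256` and a richer value type), and it assumes intervals are stored under at most one account per group.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration; the real code uses H256 keys.
type AccountHash = u64;
type Interval = (u64, u64);

/// Returns the account in `group` that carries recorded intervals, if any.
/// The whole group is scanned, so the result does not depend on which
/// account happens to come first in the Vec.
fn account_with_intervals(
    group: &[AccountHash],
    intervals_by_account: &HashMap<AccountHash, Vec<Interval>>,
) -> Option<AccountHash> {
    group.iter().copied().find(|acc| {
        intervals_by_account
            .get(acc)
            .is_some_and(|ivs| !ivs.is_empty())
    })
}
```

With this shape, a group would go to the per-interval path whenever the function returns `Some`, regardless of how the group's `Vec` was ordered in the current call.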
    let pos = old_intervals
        .iter()
        .position(|(_old_start, end)| end == &hash_end)
Interval matched by end-hash only
position(|(_old_start, end)| end == &hash_end) ignores start_hash when locating the completed interval. If two intervals in the same account happen to share the same end_hash (e.g. arithmetic overflow causes two adjacent chunks to land on the same ceiling value), this will remove the wrong interval and leave the other one dangling. Matching on both (start_hash, end_hash) would make the lookup unambiguous.
    - let pos = old_intervals
    -     .iter()
    -     .position(|(_old_start, end)| end == &hash_end)
    + let pos = old_intervals
    +     .iter()
    +     .position(|(old_start, end)| old_start == &hash_start && end == &hash_end)
    +     .ok_or(SnapError::InternalError(
    +         "Could not find an old interval that we were tracking".to_owned(),
    +     ))?;
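To see why matching on both bounds is unambiguous, here is a small standalone sketch. It is not the PR's code: `Interval` and `remove_completed` are hypothetical, hashes are reduced to `u64`, and it assumes the tracked start of each interval is kept in sync with the start of the task that fetches it.

```rust
// Hypothetical stand-in: an interval is an (inclusive start, end) pair.
type Interval = (u64, u64);

/// Removes the interval matching both bounds.
/// Returns `Some(true)` if the list is empty afterwards, `Some(false)` if
/// intervals remain, and `None` if no tracked interval matched.
fn remove_completed(intervals: &mut Vec<Interval>, start: u64, end: u64) -> Option<bool> {
    let pos = intervals
        .iter()
        .position(|&(s, e)| s == start && e == end)?;
    intervals.remove(pos);
    Some(intervals.is_empty())
}
```

If two intervals ever shared an end value, an end-only match would remove whichever comes first in the list, while the two-field match removes exactly the one the task covered.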
🤖 Codex Code Review
No blocking findings. The new task partitioning around client.rs:561 and the explicit “completed interval” cleanup in client.rs:918 fit the existing big-account bookkeeping, and I don’t see a correctness, security, or consensus-risk regression in the changed logic.
Residual risk: I don’t see targeted regression coverage for the exact case this fixes. A focused test that starts with persisted
Automated review by OpenAI Codex · gpt-5.4 · custom prompt
    let chunk_end = (chunk_start + chunk_size).min(accounts_by_root_hash.len());
    let mut bulk_chunk_start: Option<usize> = None;
    for (i, (_, accounts)) in accounts_by_root_hash.iter().enumerate() {
        let first_account = *accounts.first().ok_or_else(|| {
Implicit invariant: this uses first_account to look up intervals for the entire group (multiple accounts sharing the same storage_root), assuming all accounts in the group share the same interval state. The pre-existing handler at line 836 has the same assumption, so this isn't introduced here — but it's worth a doc comment near accounts_by_root_hash's declaration (~line 549) noting that all accounts in a group are expected to share intervals because they share storage. If that invariant ever breaks (e.g., one account in a group finalizes while a sibling is still pending), this scheduling code silently uses the first account's state for all of them.
    // acc_hash stays zero when a sibling per-interval task for the
    // same account already drained the last interval and finalized
    // it earlier in this call's loop; there's nothing left to do.
    if !acc_hash.is_zero() {
Worth pinning the contrast with the existing similar block at lines 766-768: that one panic!s when acc_hash.is_zero(), while this one silently skips. The reason is given in the comment above (sibling already drained), but the differing-invariant-with-sibling-block is subtle. A one-liner like // (Compare to lines 766-768 which panic on zero — different scenario: that path runs only on remaining_start < remaining_end, so a missing acc_hash there genuinely means corruption.) would help future readers tell the two paths apart.
    // same account already drained the last interval and finalized
    // it earlier in this call's loop; there's nothing left to do.
    if !acc_hash.is_zero() {
        let (_, old_intervals) = account_storage_roots
nit: this block (940-956) and the existing block at lines 769-787 do the same three-step dance — find acc_hash, remove the matching interval by end == hash_end, and on empty-intervals mark the account done + healed. Worth extracting a helper like clear_completed_interval(account_storage_roots, accounts_by_root_hash, accounts_done, start_index, hash_end) -> Result<(), SnapError> and calling it from both. Reduces drift risk if the bookkeeping shape changes later.
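A rough sketch of what such a helper could look like follows. Everything here is an assumption for illustration: hashes are `u64` instead of `H256`, `InternalError` stands in for `SnapError::InternalError`, the parameter list is simplified from the one proposed above, and the "healed" bookkeeping is left out. It follows the skip-on-missing behavior of the new block; the older block that panics on a zero `acc_hash` would need its own handling.

```rust
use std::collections::{HashMap, HashSet};

type Hash = u64; // stand-in for H256
type Interval = (Hash, Hash);

#[derive(Debug, PartialEq)]
struct InternalError(String); // stand-in for SnapError::InternalError

/// Removes the interval ending at `hash_end` from the group's tracking
/// account and marks the whole group done once no intervals remain.
fn clear_completed_interval(
    intervals_by_account: &mut HashMap<Hash, Vec<Interval>>,
    group: &[Hash],
    accounts_done: &mut HashSet<Hash>,
    hash_end: Hash,
) -> Result<(), InternalError> {
    // Find the account in the group that still tracks intervals.
    let Some(acc) = group.iter().copied().find(|a| {
        intervals_by_account
            .get(a)
            .is_some_and(|ivs| !ivs.is_empty())
    }) else {
        return Ok(()); // a sibling task already drained the group
    };
    let ivs = intervals_by_account
        .get_mut(&acc)
        .ok_or_else(|| InternalError("account vanished while clearing".to_owned()))?;
    let pos = ivs
        .iter()
        .position(|&(_, end)| end == hash_end)
        .ok_or_else(|| {
            InternalError("Could not find an old interval that we were tracking".to_owned())
        })?;
    ivs.remove(pos);
    if ivs.is_empty() {
        accounts_done.extend(group.iter().copied());
    }
    Ok(())
}
```

Both call sites would then differ only in how they react to the early-return case, which keeps the interval bookkeeping in one place.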
            .1;
        if intervals.is_empty() {
            let chunk_start = *bulk_chunk_start.get_or_insert(i);
            if i + 1 - chunk_start >= chunk_size {
Off-by-one check: i + 1 - chunk_start >= chunk_size flushes when the accumulated batch reaches chunk_size accounts (since i+1 is the exclusive end and chunk_start is the inclusive start). That matches STORAGE_BATCH_SIZE semantics elsewhere in the file. ✓ Just confirming I read this right.
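For anyone who wants to double-check the arithmetic in isolation, here is a minimal sketch of the same flush condition. `bulk_batches` is a hypothetical function, and it assumes every index is a bulk candidate; the real loop also flushes when it hits a group with recorded intervals.

```rust
/// Splits indices `0..n` into batches of at most `chunk_size`, using the
/// same inclusive-start / exclusive-end arithmetic as the check above.
/// Each returned pair is (inclusive start, exclusive end).
fn bulk_batches(n: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    let mut batches = Vec::new();
    let mut chunk_start: Option<usize> = None;
    for i in 0..n {
        let start = *chunk_start.get_or_insert(i);
        // i + 1 is the exclusive end, so this is the current batch length.
        if i + 1 - start >= chunk_size {
            batches.push((start, i + 1));
            chunk_start = None;
        }
    }
    // Flush a trailing partial batch, if any.
    if let Some(start) = chunk_start {
        batches.push((start, n));
    }
    batches
}
```

For five items and a chunk size of two, this yields two full batches and one trailing batch of a single item.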
Motivation
Storage range download would get stuck on large networks with small pivot times.
Description
We always queued bulk tasks from [0, MAX] regardless of an account's existing intervals, relying on the response handler's big account split path to re-fire each call. That path only fires when a peer's response is truncated. When a peer fully covered a known-big account in one response, no branch ran, and the account's intervals stayed in accounts_with_storage_root forever.
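As a simplified illustration of the scheduling split this fix moves to, the sketch below queues one task per recorded interval for accounts that already have intervals and a bulk task otherwise. The `Task` enum, `partition_tasks`, and the `u64` hashes are hypothetical; the actual client works on groups of accounts sharing a storage root and batches bulk tasks together.

```rust
// Hypothetical stand-ins for illustration; the real client uses H256 hashes
// and its own StorageTask type.
type Hash = u64;
type Interval = (Hash, Hash);

#[derive(Debug, PartialEq)]
enum Task {
    /// Fresh account: fetch the whole range starting at zero.
    Bulk { account: Hash },
    /// Previously split account: fetch one recorded interval.
    PerInterval { account: Hash, start: Hash, end: Hash },
}

/// Decides the task kind at scheduling time, based on the recorded
/// intervals, instead of relying on the response handler to re-split.
fn partition_tasks(accounts: &[(Hash, Vec<Interval>)]) -> Vec<Task> {
    let mut tasks = Vec::new();
    for (account, intervals) in accounts {
        if intervals.is_empty() {
            tasks.push(Task::Bulk { account: *account });
        } else {
            for &(start, end) in intervals {
                tasks.push(Task::PerInterval { account: *account, start, end });
            }
        }
    }
    tasks
}
```

Because the decision no longer depends on whether a peer truncates its response, a known-big account always gets per-interval tasks and its intervals can be drained one by one.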