Skip to content

perf(l1): remove redundant ancestor deletes in healing trie writes#6384

Closed
ElFantasma wants to merge 1 commit into
mainfrom
perf/remove-redundant-healing-deletes
Closed

perf(l1): remove redundant ancestor deletes in healing trie writes#6384
ElFantasma wants to merge 1 commit into
mainfrom
perf/remove-redundant-healing-deletes

Conversation

@ElFantasma
Copy link
Copy Markdown
Contributor

Motivation

Storage healing is the #1 bottleneck in snap sync, taking ~1h 10m (44% of total sync time) on mainnet. Profiling the write path revealed that for every node written, the code generated O(depth) delete operations for every ancestor path — deleting entries at the root, then [0], then [0,1], etc. For a node at trie depth 10, this meant 11 DB operations (10 deletes + 1 write) instead of just 1 write.

These ancestor deletes were intended to clear stale parent encodings from previous pivots, but they are redundant because:

  1. Healing commits bottom-up: commit_node walks from leaf to root via healing_queue, adding every ancestor to the write batch. So stale ancestors get overwritten by the new value in the same batch.

  2. get_node_checked validates hashes: When checking if a child exists, it verifies node.compute_hash() == expected_hash. A stale node from a previous pivot has a different hash, so it's detected as missing and re-downloaded. The subsequent put overwrites it atomically.

  3. Orphan nodes are harmless: Nodes from old pivots at paths not visited by healing become unreachable from the current root. Trie traversal always follows child references from the root — orphans are never read.

Description

  • Remove the for i in 0..path.len() loop in storage.rs that generated O(depth) empty-vec entries (which write_storage_trie_nodes_batch interprets as deletes)
  • Remove the same pattern in state.rs, plus the now-unused BTreeMap import
  • Update comments: "avoid out of order deletes" → "preserve write ordering"
  • Remove stale PERF comments that referenced the delete requirement

This reduces DB operations per healing node from ~O(depth) to O(1), which should meaningfully reduce the 1h 10m storage healing phase.

Checklist

  • Tested on mainnet snap sync to verify healing completes correctly
  • Compared sync times before/after

@ElFantasma ElFantasma requested a review from a team as a code owner March 19, 2026 01:00
@github-actions github-actions Bot added L1 Ethereum client performance Block execution throughput and performance in general labels Mar 19, 2026
@github-actions
Copy link
Copy Markdown

🤖 Kimi Code Review

⚠️ Review failed: Kimi API request failed with status 429: The engine is currently overloaded, please try again later


Automated review by Kimi (Moonshot AI)

@github-actions
Copy link
Copy Markdown

🤖 Codex Code Review

No findings.

The diff looks sound. The removed parent-prefix writes in state.rs and storage.rs appear to be redundant write amplification rather than required correctness machinery, because trie child presence checks are hash-validated via get_node_checked, not simple key existence.

Residual risk is coverage, not the implementation itself: I couldn’t find targeted tests for interrupted/stale healing resume paths. A regression test around partial flushes and next-cycle resume for the logic around state.rs and storage.rs would make this safer.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

Lines of code report

Total lines added: 0
Total lines removed: 6
Total lines changed: 6

Detailed view
+------------------------------------------------------+-------+------+
| File                                                 | Lines | Diff |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/state.rs   | 386   | -3   |
+------------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/healing/storage.rs | 624   | -3   |
+------------------------------------------------------+-------+------+

@github-actions
Copy link
Copy Markdown

🤖 Claude Code Review

Review: perf(l1): remove redundant ancestor deletes in healing trie writes

Summary

The optimization is well-reasoned and the correctness arguments hold. Removing O(depth) redundant operations per node is a meaningful improvement for the storage healing bottleneck.


Correctness Analysis

The three correctness arguments in the PR description all check out:

  • Bottom-up ordering: commit_node (state.rs:377, storage.rs:700–703) recursively commits children before parents. Within a single DB transaction, no ordering dependency exists between them.
  • get_node_checked hash validation: Any stale node from a previous pivot that hasn't been overwritten will fail hash validation and be re-fetched (node.rs, the compute_hash() == *hash check).
  • Orphan unreachability: Nodes outside the current root's subtree are never traversed during healing.

Issues

Behavioral difference between state.rs and storage.rs — minor imprecision in PR description

The PR describes both files as performing "deletes," but the storage layers behave differently:

  • In storage.rs: write_storage_trie_nodes_batch (store.rs:1186–1188) explicitly checks node_data.is_empty() and calls txn.delete(...). These were real deletes.
  • In state.rs: put_batch in trie.rs (line 139–151) has no empty-vec-as-delete logic — it calls tx.put_batch(table, vec![(key, value)]) regardless of value content. The old empty-vec entries in state.rs were storing empty byte arrays, not deleting. These would surface as corrupted/invalid nodes if read between batches (before the ancestor's real write arrived in a later batch).

The BTreeMap deduplication in the old state.rs code was doing the heavy lifting: if the ancestor was also committed in the same batch, its actual encoded node overwrote the empty-vec entry. For ancestors committed in a future batch, there was a transient window where an empty byte array sat at that path — making get_node_checked return None and unnecessarily re-trigger a download for what might have been a valid node.

The new code avoids both the wasted write and this transient-window risk. The removal is correct and actually fixes a subtle latent issue in state.rs.


Pre-existing: unhandled error in storage.rs

// storage.rs ~line 221 (unchanged by this PR)
if !db_joinset.is_empty() {
    db_joinset.join_next().await;  // result discarded
}

Contrast with state.rs:

if !db_joinset.is_empty()
    && let Some(result) = db_joinset.join_next().await
{
    result??;  // properly propagated
}

A DB write failure in the storage healing path will be silently swallowed. Not introduced by this PR, but worth fixing — a failed write means healing completes with missing nodes.


Checklist not completed

Both mainnet items are unchecked. Given that this changes the correctness invariants of the healing path (not just a micro-optimization), the results of the mainnet test should be in before merging. The "before/after sync time" comparison is also valuable to validate the 44% claim.


Code Quality

  • Removing the unused BTreeMap import is correct.
  • The Vec<_> and .collect() type inference in both files resolves correctly given the call sites.
  • Comment update from "avoid out of order deletes" → "preserve write ordering" is more accurate.
  • Removing the stale PERF comment is appropriate since the obstacle it referenced (needing to remove nodes) no longer applies.

Verdict

Approve with notes. The optimization is correct and the implementation is clean. The only blocking item is completing the mainnet validation from the checklist before merge. The pre-existing error-swallowing in storage.rs is worth fixing in a follow-up.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Mar 19, 2026

Greptile Summary

This PR removes an O(depth) redundant delete pattern from the snap-sync healing write paths in both state.rs and storage.rs. Previously, for every node written, the code generated empty-vec entries for all ancestor paths (e.g. a node at depth 10 triggered 10 deletes + 1 write = 11 DB ops). The new code reduces this to a single write per node, which should substantially reduce the ~1h 10m storage healing phase on mainnet.

The correctness argument is sound and verified against the implementation:

  • commit_node bottom-up chain: The recursive commit_node function (both in state.rs lines 370–400 and storage.rs lines 692–736) only commits a parent when all its children have been committed (pending_children_count == 0), building nodes_to_write in leaf-to-root order. Every ancestor along any committed path is therefore included in the same atomic write batch — the "stale ancestor overwrite" concern is fully addressed.

  • get_node_checked hash validation (node.rs lines 81–86): When checking if a child already exists, the code calls .filter(|rlp| !rlp.is_empty()) then verifies node.compute_hash() == *hash. A stale node with a different hash returns None (treated as missing) and is re-downloaded and overwritten. Crucially, the empty-value filter also means the old "soft deletes" (empty vecs written via put_batch in state.rs) were equivalent to actual deletes from the trie's perspective — reinforcing that the explicit delete loop was redundant.

  • Atomic batches: Both BackendTrieDB::put_batch (used in state.rs) and write_storage_trie_nodes_batch (used in storage.rs) operate within a single write transaction committed atomically. Ordering within a batch is irrelevant; the removal of the BTreeMap (which previously sorted root-to-leaf) does not affect correctness.

  • Orphan node safety: Trie traversal always starts from the canonical state root and follows child hashes, so unreachable nodes from previous pivots are simply never visited.

Changes are clean and well-motivated:

  • BTreeMap import removed from state.rs (now unused).
  • Stale PERF comments (which referenced the delete requirement) removed.
  • The single-task-in-background comment updated from "avoid out of order deletes" → "preserve write ordering", which is accurate.

One gap to be aware of: Both mainnet test checklist items (Tested on mainnet snap sync to verify healing completes correctly and Compared sync times before/after) remain unchecked. The theoretical case is solid, but a mainnet validation run would give higher confidence that no edge case (e.g. a partially-healed pivot followed by resumption) was overlooked.

Confidence Score: 4/5

  • Safe to merge with low risk — the optimization is theoretically sound and supported by the code structure, pending mainnet validation.
  • The three correctness pillars (bottom-up commit_node chain, get_node_checked hash validation, orphan-node unreachability) are all verified in the actual implementation. The atomic batch writes mean internal ordering changes are harmless. Score is 4 rather than 5 solely because the mainnet snap-sync test checklist items are unchecked — a real-world resumption from a stale pivot with partially-healed storage tries is the scenario most likely to surface a subtle edge case.
  • No files require special attention — both changed files are straightforward removals of redundant loops with clean diffs.

Important Files Changed

Filename Overview
crates/networking/p2p/sync/healing/state.rs Removes O(depth) empty-vec ancestor insertions in the state trie write path, replacing the BTreeMap batch with a simple Vec. The change is safe: commit_node already adds all ancestors bottom-up (so they overwrite any stale data in the same atomic batch), get_node_checked validates hashes to detect stale nodes, and unreachable orphan nodes are harmless.
crates/networking/p2p/sync/healing/storage.rs Removes the same O(depth) empty-vec ancestor-delete pattern for storage trie healing writes. Semantically identical to the state.rs change. One pre-existing concern (not introduced by this PR): the result of db_joinset.join_next().await at line 217 is discarded, so a panic in the previous write task is silently swallowed rather than propagated.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Download trie node] --> B{Has missing children?}
    B -- Yes --> C[Add to healing_queue\npending_children_count = N]
    B -- No --> D[commit_node - leaf]
    D --> E[Push leaf to nodes_to_write]
    E --> F{Parent pending_children_count == 0?}
    F -- No --> G[Decrement counter, keep in healing_queue]
    F -- Yes --> H[commit_node - parent\nPush parent to nodes_to_write]
    H --> I{More ancestors?}
    I -- Yes --> F
    I -- No / root reached --> J[nodes_to_write has full bottom-up chain]
    J --> K{Flush threshold\nor is_done or is_stale?}
    K -- Yes --> L[OLD: BTreeMap inserts empty vecs for\nall ancestor paths O-depth deletes\nNEW: Vec of path/node pairs only]
    L --> M[Atomic put_batch / write_storage_trie_nodes_batch]
    M --> N[DB updated]
    N --> O{stale ancestor in DB?}
    O -- Yes but different hash --> P[get_node_checked returns None\n= re-download on next traversal]
    O -- Same hash = correct node --> Q[Accept as valid, skip download]
    P --> R[Fresh download overwrites stale node]
    G --> C
Loading

Last reviewed commit: "perf(l1): remove red..."

@ElFantasma
Copy link
Copy Markdown
Contributor Author

Closing — the optimization is incorrect

Removing the ancestor deletes breaks correctness under pivot staleness. The deletes are not redundant — they're a critical invariant.

The bug

During healing, nodes are committed bottom-up: a child is added to to_write, and the parent stays in healing_queue until all its children are done. When the pivot goes stale mid-healing, to_write is flushed to DB but healing_queue is discarded. This means a child (B') can be written at path [1,0] while its parent (A) at path [1] remains from a previous pivot.

The ancestor delete loop (for i in 0..path.len()) deletes the parent's path when flushing the child. Without it, the stale parent A survives in the DB.

The scenario

T:   A at [1] → { B at [1,0], C at [1,1] }
T+2: Healing interrupted — B' written at [1,0], A unchanged at [1]
T+3: Tree reverts to T — root references A with hash(A_T)

At T+3, healing downloads the root, checks children:

  • A at [1]: get_node_checked finds the stale A from T. Since hash(A_T) == hash(A_T3) (tree reverted), the hash matches → healing skips A's subtree.
  • B' at [1,0] is never re-checked → trie is corrupt (parent A claims hash(B) but actual child is B').

With ancestor deletes: writing B' at [1,0] also deletes [1]. At T+3, get_node_checked finds nothing at [1] → A is re-downloaded → children verified → B' detected → re-downloaded.

Why the Merkle argument was wrong

The Merkle property ("matching hash = correct subtree") holds for the node itself, but NOT for the DB state underneath it. get_node_checked verifies that the node's encoding has the right hash, but doesn't verify that the node's children exist in the DB with correct data. The healing algorithm's invariant (line 118: "if a nodehash is present in the db, it and all of its children are present") is maintained by the ancestor deletes — without them, partially-flushed children can violate it.

Note for future optimization

A valid optimization would need to ensure that either:

  1. Ancestor paths are always invalidated when children are written (current approach)
  2. Children are never flushed without their parents (atomic flush of complete subtrees)
  3. get_node_checked is extended to verify children recursively (expensive)

Option 2 is the most promising — if commit_node only adds to to_write when the full path to root is committed, partial flushes can't create inconsistencies. This would require tracking whether ancestors are fully resolved before allowing flush.

A proper test for this scenario requires mocking the peer network to drive heal_storage_trie through the partial-flush + re-heal cycle. Not feasible with the current architecture but should be possible after the planned actor refactor.

@ElFantasma ElFantasma closed this Mar 20, 2026
@github-project-automation github-project-automation Bot moved this from Todo to Done in ethrex_performance Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client performance Block execution throughput and performance in general

Projects

Status: Done
Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant