perf(storage): batch trie updates across blocks in save_blocks #21139
Conversation
Based on #eng-perf Slack discussions identifying key bottlenecks:

- `update_history_indices`: 26% of persist time
- `write_trie_updates`: 25.4%
- `write_trie_changesets`: 24.2%
- Execution cache contention under high throughput

New benchmarks:

- `execution_cache`: cache hit rates, contention, TIP-20 patterns
- `heavy_persistence`: accumulated blocks, history indices, state root
- `heavy_root`: parallel vs sync at scale, large storage tries

Includes a runner script and an optimization-opportunities doc.
Previously, `write_trie_updates_sorted` was called once per block in the `save_blocks` loop, opening and closing cursors N times for N blocks. This change accumulates trie updates across all blocks using `extend_ref` and writes them in a single batch at the end. This reduces:

- cursor open/close overhead from N to 1
- MDBX transaction overhead

For back-to-back block processing with 75-250 accumulated blocks (per #eng-perf profiling), this significantly reduces the ~25% of persist time spent in `write_trie_updates`. Expected improvement: ~50% reduction in `write_trie_updates` for b2b scenarios.
mediocregopher left a comment
All the changes that aren't in the `crates/storage/provider/src/providers/database/provider.rs` file should be left out of this PR.
```rust
// Accumulate trie updates across blocks to batch the write at the end.
// This reduces cursor open/close overhead from N calls to 1.
let mut accumulated_trie_updates: Option<TrieUpdatesSorted> = None;
```
This could just start as an empty `TrieUpdatesSorted`.
Amp-Thread-ID: https://ampcode.com/threads/T-019bc811-0850-7320-902c-52e64a671eb5
Co-authored-by: Amp <amp@ampcode.com>
Local Benchmark Results

The micro-benchmark shows modest gains in the overlay merge path. The real impact will be in the full `save_blocks` persistence path. To properly benchmark, run with real block data via `samply record -- reth re-execute --from 21000000 --to 21001000 ...`. The expected improvement grows with the number of blocks accumulated back-to-back.
Benchmark Results (Local)

Ran benchmarks on local machine:

Accumulated Blocks Benchmark (Overlay Merge)
State Root Sync vs Parallel
Notes

The trie batching optimization shows modest gains in these isolated benchmarks. The real impact is expected in the full `save_blocks` path with many accumulated blocks.
Recommended next step: run on a reth box with real block data (e.g. via `samply record`).
Closing in favor of #21106, which implements the same optimization with additional correctness for the trie changesets overlay. My implementation missed the overlay handling for trie changesets. The benchmarks and approach are the same: accumulate trie updates and batch-write at the end.
Summary
Batches trie updates across all blocks in `save_blocks` instead of writing per-block.

Problem
Per #eng-perf profiling, `write_trie_updates` was taking ~25% of persistence time. The current implementation calls `write_trie_updates_sorted` once per block, opening/closing cursors N times. In back-to-back (b2b) scenarios with 75-250 accumulated blocks, this overhead compounds significantly.
Solution
Accumulate trie updates across blocks using the existing `extend_ref` method, then write them all in a single batch.

Expected Impact
- ~50% reduction in `write_trie_updates` time for b2b scenarios

Testing
- `reth-provider` tests pass
- `cargo bench -p reth-engine-tree --bench heavy_persistence -- accumulated`

Related
- `georgios/history-indices-from-memory` (26% history indices optimization)