
perf(storage): batch trie updates across blocks in save_blocks #21142

Merged
yongkangc merged 6 commits into main from georgios/batch-trie-updates-v2
Jan 17, 2026

Conversation

@gakonst
Member

@gakonst gakonst commented Jan 16, 2026

Summary

Batches trie updates across all blocks in save_blocks instead of writing per-block.

Problem

Per #eng-perf profiling, write_trie_updates was taking ~25% of persistence time. The current implementation calls write_trie_updates_sorted once per block, opening/closing cursors N times.

In back-to-back (b2b) scenarios with 75-250 accumulated blocks, this overhead compounds significantly.

Solution

Accumulate trie updates across blocks using the existing extend_ref method, then write them all in a single batch:

// Accumulate across blocks
let mut accumulated_trie_updates: Option<TrieUpdatesSorted> = None;

for block in blocks {
    // ... other per-block writes ...
    
    match &mut accumulated_trie_updates {
        Some(acc) => acc.extend_ref(&trie_data.trie_updates),
        None => accumulated_trie_updates = Some((*trie_data.trie_updates).clone()),
    }
}

// Single batch write at end
if let Some(trie_updates) = &accumulated_trie_updates {
    self.write_trie_updates_sorted(trie_updates)?;
}

Expected Impact

  • ~50% reduction in write_trie_updates time for b2b scenarios
  • Reduces cursor open/close overhead from N to 1
  • Reduces MDBX transaction overhead

Testing

  • All existing reth-provider tests pass
  • Can be benchmarked with real block replay via reth-bench-compare

Closes RETH-168

Related

Previously, `write_trie_updates_sorted` was called once per block in the
save_blocks loop. This opened/closed cursors N times for N blocks.

This change accumulates trie updates across all blocks using `extend_ref`
and writes them in a single batch at the end. This reduces:
- Cursor open/close overhead from N to 1
- MDBX transaction overhead

For back-to-back block processing with 75-250 accumulated blocks (per
#eng-perf profiling), this significantly reduces the ~25% of persist time
spent in write_trie_updates.

Expected improvement: ~50% reduction in write_trie_updates for b2b scenarios.
@gakonst gakonst added C-perf A change motivated by improving speed, memory usage or disk footprint S-needs-benchmark This set of changes needs performance benchmarking to double-check that they help labels Jan 16, 2026
@github-project-automation github-project-automation bot moved this to Backlog in Reth Tracker Jan 16, 2026
@DaniPopes
Member

This is OK, but it's the same as the other "accumulation" functions: we shouldn't accumulate into a sorted vec, because it re-sorts on each iteration.

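The point about re-sorting is that accumulating k sorted lists by repeated extend-and-sort costs O(n·k·log n) overall, while a k-way merge is O(n log k). A minimal sketch of the k-way approach, using a toy sorted `(u64, u64)` list in place of the real `TrieUpdatesSorted` (the names and types here are illustrative, not the reth API):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of already-sorted key/value lists. For duplicate keys,
/// the entry from the later list wins, mirroring how newer blocks
/// override older trie updates. Toy stand-in, not `merge_batch` itself.
fn merge_batch(lists: &[Vec<(u64, u64)>]) -> Vec<(u64, u64)> {
    let mut heap = BinaryHeap::new();
    for (li, list) in lists.iter().enumerate() {
        if let Some(&(key, val)) = list.first() {
            // Reverse turns the max-heap into a min-heap on (key, list index).
            heap.push(Reverse((key, li, 0usize, val)));
        }
    }
    let mut out: Vec<(u64, u64)> = Vec::new();
    while let Some(Reverse((key, li, idx, val))) = heap.pop() {
        match out.last_mut() {
            // Equal keys pop in list order, so the later list overwrites.
            Some(last) if last.0 == key => last.1 = val,
            _ => out.push((key, val)),
        }
        if let Some(&(nk, nv)) = lists[li].get(idx + 1) {
            heap.push(Reverse((nk, li, idx + 1, nv)));
        }
    }
    out
}
```

Each element is pushed and popped from the heap once, giving the O(n log k) bound.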
gakonst and others added 2 commits January 16, 2026 20:03
Address review feedback: replace iterative extend_ref (which re-sorts on
each call, O(n*k)) with merge_batch which uses k-way merge for O(n log k)
complexity.

Also removes unnecessary intermediate clones by collecting Arc refs and
passing them directly to merge_batch.

Amp-Thread-ID: https://ampcode.com/threads/T-019bc863-a64e-73bd-93d5-fc65229fa862
Co-authored-by: Amp <amp@ampcode.com>
Add merge_batch_hybrid() to TrieUpdatesSorted and HashedPostStateSorted
which uses a hybrid algorithm:
- Small k (< 64): extend_ref loop with low constant factors
- Large k (≥ 64): k-way merge_batch for O(n log k) complexity

This consolidates the threshold logic that was duplicated in lazy_overlay.rs
and makes it available for save_blocks batching.

Amp-Thread-ID: https://ampcode.com/threads/T-019bc863-a64e-73bd-93d5-fc65229fa862
Co-authored-by: Amp <amp@ampcode.com>
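The small-k branch described in this commit can be sketched with the same kind of toy sorted list. `extend_sorted` below is a hypothetical stand-in for `extend_ref`, and it shows why this path only pays off below a threshold: every call re-sorts the whole accumulator.

```rust
/// Merge a second sorted list into `acc`, later entries overriding earlier
/// ones with the same key. Simplified stand-in for `extend_ref`; each call
/// re-sorts, which is why this path is only used for small batch counts.
fn extend_sorted(acc: &mut Vec<(u64, u64)>, other: &[(u64, u64)]) {
    acc.extend_from_slice(other);
    // Stable sort keeps `other`'s entries after `acc`'s for equal keys.
    acc.sort_by_key(|e| e.0);
    // Collapse consecutive duplicates, keeping the later (overriding) value.
    acc.dedup_by(|a, b| if a.0 == b.0 { b.1 = a.1; true } else { false });
}
```

For a handful of blocks the constant factors beat a heap-based merge; past the threshold the repeated sorting dominates.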
Member

@mediocregopher mediocregopher left a comment


This LGTM

@github-project-automation github-project-automation bot moved this from Backlog to In Progress in Reth Tracker Jan 16, 2026
prefix_sets: Default::default(),
}
// Collect all trie data first (blocks are newest-to-oldest)
let trie_data: Vec<_> = blocks.iter().map(|b| b.wait_cloned()).collect();
Member


Not sure if we can remove a collect here?

pub fn merge_batch_hybrid<'a>(states: impl IntoIterator<Item = &'a Self>) -> Self {
const MERGE_BATCH_THRESHOLD: usize = 64;

let states: Vec<_> = states.into_iter().collect();
Member


We can skip this collect again, since we already collected before this?

pub fn merge_batch_hybrid<'a>(updates: impl IntoIterator<Item = &'a Self>) -> Self {
const MERGE_BATCH_THRESHOLD: usize = 64;

let updates: Vec<_> = updates.into_iter().collect();
Member


Same here, I think we can probably remove this collect.

if save_mode.with_state() {
let start = Instant::now();
// Collect Arc refs first to extend their lifetime
let trie_updates: Vec<_> = blocks.iter().map(|b| b.trie_updates()).collect();
Member


Same here re: the collect. Check for a perf regression compared to the previous version.

Addresses review feedback from yongkangc: avoid internal collect() in
merge_batch_hybrid by changing signature to accept &[&Self] slice.

Callers that already have collected data (lazy_overlay, provider) now
pass the slice directly, avoiding a redundant collection inside the
merge function.

Amp-Thread-ID: https://ampcode.com/threads/T-019bc863-a64e-73bd-93d5-fc65229fa862
Co-authored-by: Amp <amp@ampcode.com>
}

if states.len() == 1 {
return (*states[0]).clone();
Member


Is this clone here bad? Not 100% sure why we need to return a clone.

Address yongkangc review feedback:
- Remove unused merge_batch_hybrid functions from reth-trie-common
- Inline optimized hybrid logic in lazy_overlay and provider
- Use Arc::make_mut for copy-on-write in small-k path (avoids clone if refcount is 1)
- Use Arc::try_unwrap to avoid final clone when possible
- Only collect for large-k path where k-way merge needs materialized data
- Single block case returns data directly without allocation

Amp-Thread-ID: https://ampcode.com/threads/T-019bc863-a64e-73bd-93d5-fc65229fa862
Co-authored-by: Amp <amp@ampcode.com>
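The `Arc::make_mut` / `Arc::try_unwrap` pattern from this commit can be shown in isolation. A hedged sketch with `Vec<u64>` standing in for the sorted trie structures; `accumulate` is a hypothetical helper, not reth code:

```rust
use std::sync::Arc;

/// Copy-on-write accumulation over Arc-wrapped per-block data.
/// `Arc::try_unwrap` takes ownership without cloning when the refcount
/// is exactly 1; `Arc::make_mut` clones the inner data only if someone
/// else still holds a reference. Illustrative only.
fn accumulate(blocks: Vec<Arc<Vec<u64>>>) -> Vec<u64> {
    let mut iter = blocks.into_iter();
    let Some(mut acc) = iter.next() else { return Vec::new() };
    for block in iter {
        // Clones the inner Vec at most once, and only if `acc` is shared.
        Arc::make_mut(&mut acc).extend_from_slice(&block);
    }
    // Avoid a final clone when we hold the only reference.
    Arc::try_unwrap(acc).unwrap_or_else(|arc| (*arc).clone())
}
```

If the caller is the sole owner of every `Arc` (the common case when blocks are about to be persisted and dropped), this path performs no data clones at all.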
@yongkangc yongkangc added this pull request to the merge queue Jan 17, 2026
Merged via the queue into main with commit c11c130 Jan 17, 2026
44 checks passed
@yongkangc yongkangc deleted the georgios/batch-trie-updates-v2 branch January 17, 2026 07:25
@github-project-automation github-project-automation bot moved this from In Progress to Done in Reth Tracker Jan 17, 2026
gakonst added a commit that referenced this pull request Jan 17, 2026
This refactor addresses code duplication identified in PR #21142 by extracting
the hybrid merge algorithm into a reusable trait and helper functions.

## Changes

### New HybridMergeSorted trait (reth_trie::utils)
- Trait for sorted types supporting hybrid merging (extend_ref + merge_batch)
- Implemented for TrieUpdatesSorted and HashedPostStateSorted
- Two helper functions:
  - `hybrid_merge_arc_slice`: merges Arc-wrapped items from a slice
  - `hybrid_merge_arc_vec`: consumes Vec for Arc::try_unwrap optimization

### Algorithm details
- 0 items: returns default
- 1 item: returns Arc clone (no data allocation)
- < threshold: uses extend_ref loop with Arc::make_mut (copy-on-write)
- >= threshold: uses k-way merge_batch for O(n log k) complexity

### Updated call sites
1. **lazy_overlay.rs**: merge_blocks now uses hybrid_merge_arc_slice
2. **provider.rs**: save_blocks now batches trie updates using hybrid_merge_arc_vec,
   reducing cursor open/close overhead from N to 1

## Allocation behavior
- No regression vs original code
- Small batches: Arc::make_mut provides copy-on-write
- Large batches: single allocation for k-way merge result
- Arc::try_unwrap used to avoid clone when refcount is 1

## Expected performance impact
- ~50% reduction in write_trie_updates time for b2b scenarios (batched writes)
- Maintains same allocation characteristics as original code
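The trait extraction this follow-up describes could look roughly like the sketch below. The trait and function names follow the commit message, but the signatures and the toy `SortedKv` impl are assumptions, not the actual `reth-trie` code:

```rust
use std::sync::Arc;

/// Sketch of a `HybridMergeSorted`-style trait, with a toy sorted-Vec impl
/// standing in for `TrieUpdatesSorted` / `HashedPostStateSorted`.
trait HybridMergeSorted: Clone + Default {
    /// Merge `other` into `self`, later entries overriding earlier ones.
    fn extend_ref(&mut self, other: &Self);
}

#[derive(Clone, Default, Debug, PartialEq)]
struct SortedKv(Vec<(u64, u64)>);

impl HybridMergeSorted for SortedKv {
    fn extend_ref(&mut self, other: &Self) {
        self.0.extend_from_slice(&other.0);
        // Stable sort keeps `other`'s entries after `self`'s for equal keys.
        self.0.sort_by_key(|e| e.0);
        self.0.dedup_by(|a, b| if a.0 == b.0 { b.1 = a.1; true } else { false });
    }
}

/// `hybrid_merge_arc_vec` analogue: consumes the Vec so `Arc::try_unwrap`
/// can skip the clone for uniquely-owned items (small-k path only; the
/// real helper dispatches to a k-way merge above a threshold).
fn hybrid_merge_arc_vec<T: HybridMergeSorted>(items: Vec<Arc<T>>) -> T {
    let mut iter = items.into_iter();
    let Some(first) = iter.next() else { return T::default() };
    let mut acc = Arc::try_unwrap(first).unwrap_or_else(|a| (*a).clone());
    for item in iter {
        acc.extend_ref(&item);
    }
    acc
}
```

Putting the merge behind a trait lets both call sites (lazy_overlay and save_blocks) share the threshold and ownership logic instead of duplicating it.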
gakonst added a commit that referenced this pull request Jan 17, 2026
Batches trie updates across all blocks in `save_blocks` instead of writing
per-block, reducing cursor open/close overhead from N to 1.

## Changes

### lazy_overlay.rs
- Move MERGE_BATCH_THRESHOLD constant inside function
- Use data directly instead of Arc::clone (avoids unnecessary refcount bump)
- Move Arc::make_mut into the loop for proper copy-on-write semantics

### provider.rs
Add batched trie updates with hybrid merge algorithm:
- 0 blocks: default
- 1 block: Arc::try_unwrap to avoid clone if refcount is 1
- < 30 blocks: extend_ref with Arc::make_mut (copy-on-write)
- >= 30 blocks: k-way merge_batch for O(n log k) complexity

## Allocation Behavior (No Regression)

Small batches avoid collect() by using Arc::make_mut directly in the loop.
Only large batches (>= 30) collect Arcs for k-way merge.

## Expected Impact

- ~50% reduction in write_trie_updates time for b2b scenarios
- Maintains same allocation characteristics as original code

## Related

- Based on optimizations from Slack #eng-perf thread
- Learned from PR #21142 review: avoid unnecessary collect(), use Arc::make_mut
gakonst added a commit that referenced this pull request Jan 18, 2026
Batches trie updates across all blocks in `save_blocks` instead of writing
per-block, reducing cursor open/close overhead from N to 1.

## Changes

### lazy_overlay.rs
- Move MERGE_BATCH_THRESHOLD constant inside function
- Use data directly instead of Arc::clone (avoids unnecessary refcount bump)
- Move Arc::make_mut into the loop for proper copy-on-write semantics

### provider.rs
Add batched trie updates with hybrid merge algorithm:
- 0 blocks: default
- 1 block: Arc::try_unwrap to avoid clone if refcount is 1
- < 30 blocks: extend_ref with Arc::make_mut (copy-on-write)
- >= 30 blocks: k-way merge_batch for O(n log k) complexity

## Allocation Behavior (No Regression)

Small batches avoid collect() by using Arc::make_mut directly in the loop.
Only large batches (>= 30) collect Arcs for k-way merge.

## Expected Impact

- ~50% reduction in write_trie_updates time for b2b scenarios
- Maintains same allocation characteristics as original code

## Related

- Based on optimizations from Slack #eng-perf thread
- Learned from PR #21142 review: avoid unnecessary collect(), use Arc::make_mut
@mediocregopher
Member

Closes #20611

Vui-Chee added a commit to okx/reth that referenced this pull request Jan 20, 2026
* tag 'v1.10.1': (49 commits)
  chore: bump version to 1.10.1 (paradigmxyz#21188)
  chore: rename extend_ref methods on sorted data structures (paradigmxyz#21043)
  fix(flashblocks): Add flashblock ws connection retry period (paradigmxyz#20510)
  chore(bench): add --disable-tx-gossip to benchmark node args (paradigmxyz#21171)
  refactor(stages): reuse history index cache buffers in `collect_history_indices` (paradigmxyz#21017)
  feat(download): resumable snapshot downloads with auto-retry (paradigmxyz#21161)
  ci: update to tempoxyz (paradigmxyz#21176)
  chore: apply spelling and typo fixes (paradigmxyz#21182)
  docs: document minimal storage mode in pruning FAQ (paradigmxyz#21025)
  chore(deps): weekly `cargo update` (paradigmxyz#21167)
  feat(execution-types): add receipts_iter helper (paradigmxyz#21162)
  revert: undo Chain crate, add LazyTrieData to trie-common (paradigmxyz#21155)
  feat(engine): add new_payload_interval metric (start-to-start) (paradigmxyz#21159)
  feat(engine): add time_between_new_payloads metric (paradigmxyz#21158)
  fix(storage-api): gate reth-chain dependency behind std feature
  perf(storage): batch trie updates across blocks in save_blocks (paradigmxyz#21142)
  refactor: use ExecutionOutcome::single instead of tuple From (paradigmxyz#21152)
  chore(chain-state): reorganize deferred_trie.rs impl blocks (paradigmxyz#21151)
  feat(primitives-traits): add try_recover_signers for parallel batch recovery (paradigmxyz#21103)
  perf: make Chain use DeferredTrieData (paradigmxyz#21137)
  ...

5 participants