
perf: collect-based fast path for small block tx iterator#22078

Closed
yongkangc wants to merge 3 commits into main from yk/small-block-tx-iter-fastpath

Conversation


@yongkangc yongkangc commented Feb 11, 2026

Summary
Reth spawns the same heavy parallel tx conversion pipeline regardless of block size: rayon's for_each_with sends results out of order through an intermediate channel, and a second spawn_blocking task then reorders them via a BTreeMap. For small blocks (<30 txs), this coordination overhead exceeds the benefit. This PR adds a collect-based fast path using par_iter().map().collect(), which preserves order natively and eliminates the intermediate channel and reorder task. For large blocks, it replaces the BTreeMap with a pre-allocated Vec<Option<_>> for O(1) reordering.

Changes

  • Add PARALLEL_REORDER_TX_THRESHOLD (30 txs)
  • Small blocks: single rayon::spawn with collect() + sequential channel sends
  • Large blocks: Vec<Option<_>> reorder buffer instead of BTreeMap

For blocks with < 128 transactions, use rayon's order-preserving
collect() instead of the out-of-order channel + BTreeMap reorder
pipeline. This eliminates a spawn_blocking task and BTreeMap
allocations for the common small block case.
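As a sketch, the fast path amounts to dispatching on transaction count before choosing a pipeline. This is a std-only stand-in: in the real code the collect branch would be rayon's into_par_iter().map().collect(), which preserves input order for indexed parallel iterators, and convert stands in for the actual tx env conversion. Names and the threshold value are illustrative.

```rust
// Hypothetical sketch of the small-block dispatch; `convert` and the
// constant are illustrative, not reth's actual API.
const PARALLEL_REORDER_TX_THRESHOLD: usize = 30;

fn convert(tx: u64) -> u64 {
    tx.wrapping_mul(2) // stand-in for the real tx env conversion
}

/// Returns `Some(converted)` when the block qualifies for the fast path.
fn fast_path(txs: Vec<u64>) -> Option<Vec<u64>> {
    (txs.len() < PARALLEL_REORDER_TX_THRESHOLD)
        // With rayon: txs.into_par_iter().map(convert).collect().
        // The ordered collect makes the channel + reorder task unnecessary.
        .then(|| txs.into_iter().map(convert).collect())
}

fn main() {
    assert_eq!(fast_path(vec![1, 2, 3]), Some(vec![2, 4, 6]));
    assert!(fast_path((0..100).collect()).is_none()); // large block: slow path
}
```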

For larger blocks, replace BTreeMap reorder with pre-allocated
Vec<Option<_>> buffer for O(1) insert and lookup.
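The large-block reorder buffer can be sketched with std threads and an mpsc channel standing in for the rayon for_each_with pipeline. Names and payload types are illustrative; the real buffer holds conversion Results keyed by transaction index.

```rust
use std::sync::mpsc;
use std::thread;

// Std-only sketch of the pre-allocated Vec<Option<_>> reorder buffer:
// workers send (index, value) pairs out of order; the consumer inserts
// in O(1) and drains the completed prefix in original order.
fn reorder(rx: mpsc::Receiver<(usize, u32)>, len: usize) -> Vec<u32> {
    let mut buf: Vec<Option<u32>> = vec![None; len];
    let mut next = 0;
    let mut out = Vec::with_capacity(len);
    for (idx, val) in rx {
        debug_assert!(idx < buf.len(), "index {idx} out of range");
        buf[idx] = Some(val); // O(1) insert keyed by original tx index
        while next < len {
            match buf[next].take() {
                Some(v) => {
                    out.push(v); // emit the next in-order element
                    next += 1;
                }
                None => break, // gap: wait for more results
            }
        }
    }
    out
}

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Simulate out-of-order completion from parallel workers.
        for pair in [(2, 30), (0, 10), (1, 20)] {
            tx.send(pair).unwrap();
        }
    });
    assert_eq!(reorder(rx, 3), vec![10, 20, 30]);
}
```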

Amp-Thread-ID: https://ampcode.com/threads/T-019c4d83-74c0-739b-84aa-6437db8d213a
@yongkangc yongkangc added C-perf A change motivated by improving speed, memory usage or disk footprint A-engine Related to the engine implementation labels Feb 11, 2026
@github-project-automation github-project-automation bot moved this to Backlog in Reth Tracker Feb 11, 2026

github-actions bot commented Feb 11, 2026

⚠️ Changelog not found.

A changelog entry is required before merging. We've generated a suggested changelog based on your changes:

Preview
---
reth-engine-tree: patch
---

Optimized transaction iterator with fast-path for small blocks. Added a threshold-based approach that uses order-preserving `collect()` for blocks with fewer than 30 transactions, eliminating unnecessary channel overhead and reordering logic for common cases.

Add changelog to commit this to your branch.


@yongkangc yongkangc left a comment


Review: perf: collect-based fast path for small block tx iterator

Correctness

1. Prewarming latency regression on the small-block path (blocking)

The par_iter().map().collect() path delays all prewarm sends until every transaction conversion completes. In the current code, prewarm_tx.send() happens immediately inside for_each_with as each conversion finishes, so the prewarm task can start executing speculatively while other conversions are still running.

This matters most for blocks with 50–127 transactions (above SMALL_BLOCK_TX_THRESHOLD so prewarm is enabled, but below the new 128 threshold so the collect path runs). These blocks will get zero prewarm overlap with conversion, making prewarming less effective or useless. The current streaming design is intentionally pipelined for this reason.

Suggestion: Keep the prewarm sends streaming (inside the map closure or via for_each_with) even if the execution stream ordering is optimized. Prewarming order does not matter.
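The suggestion can be sketched as follows, std-only and sequential for illustration. With rayon the closure would need a cloneable, Sync sender (e.g. crossbeam's), and prewarm sends would arrive out of order, which is fine since prewarming is order-insensitive. The sender name and conversion are stand-ins.

```rust
use std::sync::mpsc;

// Sketch: stream each converted tx to prewarm immediately inside the map
// closure, while the ordered collect still feeds the execution stream.
// `prewarm_tx` and the conversion here are illustrative stand-ins.
fn convert_and_stream(txs: Vec<u32>, prewarm_tx: &mpsc::Sender<u32>) -> Vec<u32> {
    txs.into_iter()
        .map(|tx| {
            let converted = tx * 2; // stand-in for the real conversion
            let _ = prewarm_tx.send(converted); // fire-and-forget prewarm
            converted
        })
        .collect() // execution still sees results in input order
}

fn main() {
    let (ptx, prx) = mpsc::channel();
    let ordered = convert_and_stream(vec![1, 2, 3], &ptx);
    drop(ptx);
    assert_eq!(ordered, vec![2, 4, 6]);
    // Sequential here, so prewarm order matches too; under rayon it may not.
    assert_eq!(prx.iter().collect::<Vec<_>>(), vec![2, 4, 6]);
}
```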

2. Silent transaction drop in the large-block path (blocking)

} else if idx < buf.len() {
    buf[idx] = Some(tx);
}

If idx >= buf.len(), the transaction is silently dropped. While enumerate() on an IndexedParallelIterator of known length should never produce an out-of-range index, silently discarding a transaction is a critical failure mode: the execute_rx consumer will block forever waiting for a transaction that was dropped, or the block will be built with missing transactions. This should at minimum be a debug_assert! / panic, not a silent discard.

// Suggested fix:
debug_assert!(idx < buf.len(), "transaction index {idx} out of range for buffer length {}", buf.len());
buf[idx] = Some(tx);

Performance

3. Vec<Option<Result<…>>> pre-allocation overhead

The Vec is pre-allocated for all transactions:

let mut buf: Vec<Option<Result<_, _>>> = (0..tx_count).map(|_| None).collect();

The BTreeMap only ever holds the out-of-order tail — typically bounded by worker parallelism (rayon thread count), not by block size. For a block with 5000 transactions, this allocates and zeroes 5000 slots of Option<Result<WithTxEnv<…>, Err>> (which are not small), while the BTreeMap would typically hold <64 entries at any time.

The Vec gives O(1) lookup versus O(log n) for the BTreeMap, but with the working set bounded by parallelism, the BTreeMap overhead is negligible while its memory usage stays constant.

Suggestion: Benchmark before switching. If the Vec approach is pursued, at least use Box<Result<…>> per slot to reduce the per-slot size to a pointer.
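The per-slot saving from boxing is easy to check. The payload type below is a stand-in; actual sizes depend on reth's WithTxEnv and the error type.

```rust
// Illustrates the Box<...> suggestion: Option<Box<T>> is pointer-sized
// thanks to the null-pointer niche, while Option<T> carries T inline.
// [u64; 8] is a stand-in for a large Result payload.
fn main() {
    let inline = std::mem::size_of::<Option<[u64; 8]>>();
    let boxed = std::mem::size_of::<Option<Box<[u64; 8]>>>();
    assert!(boxed < inline);
    assert_eq!(boxed, std::mem::size_of::<usize>()); // niche: just a pointer
    println!("inline slot: {inline} bytes, boxed slot: {boxed} bytes");
}
```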

Nits

4. PARALLEL_REORDER_TX_THRESHOLD as an associated constant

Defining it as const PARALLEL_REORDER_TX_THRESHOLD: usize = 128; inside the impl block is fine, but it would be more consistent with SMALL_BLOCK_TX_THRESHOLD (module-level pub const) to also define this at module level with a doc comment explaining the rationale and how the value was chosen.

5. Blank line left in imports

 use std::{
-    collections::BTreeMap,
+
     ops::Not,

The removal leaves a blank line inside the use block.

6. Missing benchmark data

PR body says "Benchmark pending." Given this is a perf PR, it is hard to evaluate whether the threshold of 128 or the collect-vs-streaming tradeoff is net positive without data. Would recommend running reth-bench before merging.

Verdict

Changes requested. The prewarming latency regression (#1) and the silent transaction drop (#2) are blocking issues. The performance claim needs benchmark validation (#6). The core idea of optimizing the reorder buffer is reasonable but the current implementation introduces regressions in the prewarming pipeline that should be preserved.

Ubuntu added 2 commits February 11, 2026 20:12
- Remove blank line in std use block (fmt)
- Collapse single-line vec init (fmt)
- Add backticks around BTreeMap in doc comment (clippy)

Amp-Thread-ID: https://ampcode.com/threads/T-019c4e7e-4c60-7228-a6e8-e24c9bdb0aa4
@yongkangc

Closing — the small-block collect-based path introduces a barrier that serializes the convert→prewarm→execute pipeline: no prewarm or execution sends happen until ALL conversions finish, which is a latency regression vs the current streaming design. The Vec<Option<_>> reorder for large blocks is directionally fine but marginal. No benchmark data was provided to show a measurable improvement. If the goal is reducing per-block coordination overhead for small blocks, a better approach would be sequential convert-and-send (no rayon, no reorder thread) while preserving streaming, rather than a collect barrier.
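The alternative named here, sequential convert-and-send for small blocks, would look roughly like this (std-only sketch; function and channel names are illustrative):

```rust
use std::sync::mpsc;

// Sketch of the suggested small-block path: convert sequentially on one
// task and send each result downstream as soon as it is ready, so there
// is no rayon coordination and no reorder step, and streaming to the
// prewarm and execution stages is preserved.
fn sequential_convert_and_send(
    txs: Vec<u32>,
    prewarm_tx: &mpsc::Sender<u32>,
    execute_tx: &mpsc::Sender<u32>,
) {
    for tx in txs {
        let converted = tx + 100; // stand-in for the tx env conversion
        let _ = prewarm_tx.send(converted); // speculative prewarm, order-free
        let _ = execute_tx.send(converted); // execution stream, already in order
    }
}

fn main() {
    let (ptx, prx) = mpsc::channel();
    let (etx, erx) = mpsc::channel();
    sequential_convert_and_send(vec![1, 2, 3], &ptx, &etx);
    drop(ptx);
    drop(etx);
    assert_eq!(erx.iter().collect::<Vec<_>>(), vec![101, 102, 103]);
    assert_eq!(prx.iter().collect::<Vec<_>>(), vec![101, 102, 103]);
}
```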

@yongkangc yongkangc closed this Feb 11, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in Reth Tracker Feb 11, 2026