perf: collect-based fast path for small block tx iterator#22078
Conversation
For blocks with < 128 transactions, use rayon's order-preserving `collect()` instead of the out-of-order channel + `BTreeMap` reorder pipeline. This eliminates a `spawn_blocking` task and `BTreeMap` allocations for the common small-block case. For larger blocks, replace the `BTreeMap` reorder with a pre-allocated `Vec<Option<_>>` buffer for O(1) insert and lookup.

Amp-Thread-ID: https://ampcode.com/threads/T-019c4d83-74c0-739b-84aa-6437db8d213a
yongkangc
left a comment
Review: perf: collect-based fast path for small block tx iterator
Correctness
1. Prewarming latency regression on the small-block path (blocking)
The `par_iter().map().collect()` path delays all prewarm sends until every transaction conversion completes. In the current code, `prewarm_tx.send()` happens immediately inside `for_each_with` as each conversion finishes, so the prewarm task can start executing speculatively while other conversions are still running.
This matters most for blocks with 50–127 transactions (above `SMALL_BLOCK_TX_THRESHOLD`, so prewarm is enabled, but below the new 128 threshold, so the collect path runs). These blocks will get zero prewarm overlap with conversion, making prewarming less effective or useless. The current streaming design is intentionally pipelined for this reason.
Suggestion: Keep the prewarm sends streaming (inside the `map` closure or via `for_each_with`) even if the execution stream ordering is optimized. Prewarming order does not matter.
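The suggestion above can be sketched as follows. Names are hypothetical, and a plain sequential iterator stands in for rayon's `par_iter()` so the sketch compiles with std only; in the real code the sender would be cloned per worker (as `for_each_with` does), since out-of-order prewarm sends are fine.

```rust
use std::sync::mpsc;

// Sketch: keep prewarm sends streaming inside the map closure, so prewarming
// overlaps conversion, while collect() still yields in-order results for the
// execution stream. The conversion itself is a placeholder (tx * 2).
fn convert_with_streaming_prewarm(
    txs: &[u64],
    prewarm_tx: &mpsc::Sender<u64>,
) -> Vec<Result<u64, String>> {
    txs.iter()
        .map(|&tx| {
            let converted = tx * 2; // placeholder for the real tx-env conversion
            // Send to the prewarm task immediately; prewarming does not care
            // about ordering, so this is safe under parallel execution too.
            let _ = prewarm_tx.send(converted);
            Ok(converted)
        })
        .collect() // execution stream still receives in-order results
}
```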
2. Silent transaction drop in the large-block path (blocking)
```rust
} else if idx < buf.len() {
    buf[idx] = Some(tx);
}
```

If `idx >= buf.len()`, the transaction is silently dropped. While `enumerate()` on an `IndexedParallelIterator` of known length should never produce an out-of-range index, silently discarding a transaction is a critical failure mode: the `execute_rx` consumer will block forever waiting for a transaction that was dropped, or the block will be built with missing transactions. This should at minimum be a `debug_assert!` / panic, not a silent discard.
```rust
// Suggested fix:
debug_assert!(
    idx < buf.len(),
    "transaction index {idx} out of range for buffer length {}",
    buf.len()
);
buf[idx] = Some(tx);
```

Performance
3. `Vec<Option<Result<…>>>` pre-allocation overhead
The Vec is pre-allocated for all transactions:

```rust
let mut buf: Vec<Option<Result<_, _>>> = (0..tx_count).map(|_| None).collect();
```

The `BTreeMap` only ever holds the out-of-order tail, typically bounded by worker parallelism (the rayon thread count), not by block size. For a block with 5000 transactions, this allocates and zeroes 5000 slots of `Option<Result<WithTxEnv<…>, Err>>` (which are not small), while the `BTreeMap` would typically hold <64 entries at any time.
The Vec is O(1) lookup vs O(log n) for the `BTreeMap`, but with the working set bounded by parallelism, the `BTreeMap` overhead is negligible while memory usage stays constant.
Suggestion: Benchmark before switching. If the Vec approach is pursued, at least use `Box<Result<…>>` per slot to reduce the per-slot size to a pointer.
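The boxed-slot variant suggested above might look like this minimal sketch (function and variable names are illustrative, not the PR's): each slot is a pointer-sized `Option<Box<T>>`, so pre-allocating `tx_count` slots costs one machine word per transaction instead of the full size of the result type.

```rust
// Reorder an out-of-order stream of (index, item) pairs into input order
// using a pre-allocated buffer of boxed slots: O(1) insert per item.
fn reorder<T>(out_of_order: Vec<(usize, T)>, tx_count: usize) -> Vec<T> {
    let mut buf: Vec<Option<Box<T>>> = (0..tx_count).map(|_| None).collect();
    for (idx, item) in out_of_order {
        // Fail loudly instead of silently dropping, per review item 2.
        assert!(idx < tx_count, "index {idx} out of range for {tx_count} slots");
        buf[idx] = Some(Box::new(item)); // O(1) insert
    }
    buf.into_iter()
        .map(|slot| *slot.expect("every index must be produced exactly once"))
        .collect()
}
```

In the real pipeline the consumer drains slots as soon as the next expected index arrives rather than waiting for all of them; this sketch only shows the buffer shape and insert path.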
Nits
4. `PARALLEL_REORDER_TX_THRESHOLD` as an associated constant
Defining it as `const PARALLEL_REORDER_TX_THRESHOLD: usize = 128;` inside the impl block is fine, but it would be more consistent with `SMALL_BLOCK_TX_THRESHOLD` (a module-level `pub const`) to also define this at module level, with a doc comment explaining the rationale and how the value was chosen.
5. Blank line left in imports
```diff
 use std::{
-    collections::BTreeMap,
+
     ops::Not,
```

The removal leaves a blank line inside the `use` block.
6. Missing benchmark data
PR body says "Benchmark pending." Given that this is a perf PR, it is hard to evaluate whether the 128 threshold or the collect-vs-streaming tradeoff is net positive without data. Would recommend running `reth-bench` before merging.
Verdict
Changes requested. The prewarming latency regression (#1) and the silent transaction drop (#2) are blocking issues. The performance claim needs benchmark validation (#6). The core idea of optimizing the reorder buffer is reasonable but the current implementation introduces regressions in the prewarming pipeline that should be preserved.
- Remove blank line in std `use` block (fmt)
- Collapse single-line vec init (fmt)
- Add backticks around `BTreeMap` in doc comment (clippy)

Amp-Thread-ID: https://ampcode.com/threads/T-019c4e7e-4c60-7228-a6e8-e24c9bdb0aa4
Closing: the small-block collect-based path introduces a barrier that serializes the convert → prewarm → execute pipeline: no prewarm or execution sends happen until ALL conversions finish, which is a latency regression vs the current streaming design. The `Vec<Option<_>>` reorder for large blocks is directionally fine but marginal. No benchmark data was provided to show a measurable improvement. If the goal is reducing per-block coordination overhead for small blocks, a better approach would be sequential convert-and-send (no rayon, no reorder thread) while preserving streaming, rather than a collect barrier.
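The sequential convert-and-send alternative mentioned in the closing comment could be sketched as follows. All names (channels, the placeholder conversion) are hypothetical; the point is that each result is sent the moment it is ready, so the pipeline stays streaming with no rayon coordination and no reorder step.

```rust
use std::sync::mpsc;

// Small-block alternative: convert sequentially on the current thread and
// forward each result immediately. Results are produced in input order, so
// no reorder buffer or reorder task is needed, and prewarm/execute consumers
// can start working while later conversions are still running.
fn sequential_convert_and_send(
    txs: &[u64],
    prewarm_tx: &mpsc::Sender<u64>,
    execute_tx: &mpsc::Sender<Result<u64, String>>,
) {
    for &tx in txs {
        let converted = tx * 2; // placeholder for the real tx-env conversion
        let _ = prewarm_tx.send(converted); // prewarm can start immediately
        let _ = execute_tx.send(Ok(converted)); // already in order
    }
}
```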
Summary
Reth spawns the same heavy parallel tx conversion pipeline regardless of block size: rayon `for_each_with` sends results out-of-order through an intermediate channel, then a second `spawn_blocking` task reorders via `BTreeMap`. For small blocks (<30 txs), this coordination overhead exceeds the benefit. This PR adds a collect-based fast path using `par_iter().map().collect()`, which preserves order natively, eliminating the intermediate channel and the reorder task. For large blocks, it replaces `BTreeMap` with a pre-allocated `Vec<Option<_>>` for O(1) reorder.

Changes
- Add `PARALLEL_REORDER_TX_THRESHOLD` (30 txs)
- Replace `rayon::spawn` with `collect()` + sequential channel sends
- `Vec<Option<_>>` reorder buffer instead of `BTreeMap`
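The small-block fast path summarized above can be sketched like this. A plain sequential iterator stands in for rayon's `par_iter()`, which preserves input order in `collect()` the same way on an indexed iterator; the channel name and placeholder conversion are illustrative, not the PR's actual code.

```rust
use std::sync::mpsc;

// Fast path for small blocks: collect converted transactions in order, then
// send them sequentially. No intermediate out-of-order channel and no
// BTreeMap/reorder task is needed, because collect() already yields results
// in input order.
fn small_block_fast_path(
    txs: &[u64],
    execute_tx: &mpsc::Sender<Result<u64, String>>,
) {
    // In the PR this is par_iter().map(convert).collect().
    let converted: Vec<Result<u64, String>> =
        txs.iter().map(|&tx| Ok(tx * 2)).collect();
    // Sequential channel sends, already ordered.
    for result in converted {
        let _ = execute_tx.send(result);
    }
}
```

Note the tradeoff flagged in the review: nothing is sent until every conversion has finished, which is the barrier the closing comment objects to.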