perf(l1): prefetch all BAL storage synchronously before execution#6732
perf(l1): prefetch all BAL storage synchronously before execution#6732edg-l wants to merge 1 commit into
Conversation
|
Lines of code reportTotal lines added: Detailed view |
Benchmark Block Execution Results Comparison Against Main
|
60257f9 to
f627914
Compare
f627914 to
c9069cc
Compare
The BAL warmer and the parallel executor run concurrently sharing one CachingDatabase. `prefetch_storage` publishes to the cache only after its whole parallel fetch completes, so warming storage from the background warmer left the executor racing it to the trie for SSTORE original values (~22% of CPU on cold-cache import-bench). Prefetch all BAL storage slots synchronously before spawning the executor thread (parallel, on all cores) so execution runs fully warm with no storage fallthrough and no warmer contention. The concurrent warmer is reduced to account states and contract code, which overlap execution. Storage is warmed in natural account order. bal-devnet-7-mainnet-mix-460, import-bench --with-bal vs main: ~ -24%.
c9069cc to
2a745f9
Compare
🤖 Kimi Code ReviewOverall Assessment: The PR correctly addresses a race condition between concurrent storage warming and EVM execution. The approach is sound, but there's a critical error-handling oversight and a minor type consistency issue. Issues Found1. Silent error handling masks prefetch failuresFile: Ignoring the error with
Fix: Log the error or propagate it: if let Err(e) = caching_store.prefetch_storage(&slots) {
tracing::warn!("Storage prefetch failed, proceeding with cold cache: {}", e);
}2. Type path inconsistencyFile: The new function uses bare Fix: Use pub fn bal_storage_slots(bal: &BlockAccessList) -> Vec<(Address, ethrex_common::H256)> {
// ...
.map(move |slot| (ac.address, ethrex_common::H256::from_uint(&slot)))3. Potential missing cancellation check (Minor)File: The synchronous storage prefetch runs before the thread scope and doesn't check the cancellation flag. While this is likely acceptable (rayon parallel fetch should be fast), for very large BALs this could delay shutdown. Consider checking Positive Observations
Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
Greptile SummaryThis PR moves BAL storage-slot prefetching from the concurrent warmer thread to a synchronous parallel pass that runs before the executor thread starts, eliminating the race where the executor fell through to the trie for SSTORE original values while the warmer was still building its cache snapshot. Account-state and code warming remain on the concurrent warmer because they already overlap execution well and moving them synchronously regressed.
Confidence Score: 4/5Safe to merge; the change is purely additive to the warm-up path and correctness falls back gracefully to trie reads on any failure. The synchronous prefetch is logically sound and well-guarded by feature flags. The one rough edge is that a prefetch_storage failure is silently discarded rather than logged, which would make it hard to diagnose a silent regression to cold-cache behavior in production. crates/blockchain/blockchain.rs — specifically the silent error discard on the synchronous prefetch call.
|
| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Adds a synchronous prefetch_storage call for all BAL slots before the executor thread starts, removing the race between the warmer and executor for cold-cache SSTORE reads; the prefetch error is silently discarded. |
| crates/vm/backends/levm/mod.rs | Extracts storage-slot enumeration into a new bal_storage_slots helper and removes Phase 2 (storage) from the concurrent warmer, keeping only account-state and code warming in the background thread. |
| CHANGELOG.md | Adds a changelog entry for the synchronous BAL storage prefetch optimization. |
Sequence Diagram
sequenceDiagram
participant BC as blockchain.rs
participant CP as CachingDatabase
participant Trie as State Trie
participant WM as Warmer Thread
participant EX as Executor Thread
Note over BC: Before executor threads start
BC->>CP: "bal_storage_slots(bal) -> slots"
BC->>CP: prefetch_storage(slots) [parallel, all cores]
CP->>Trie: batch fetch storage slots
Trie-->>CP: "values -> cache (write-lock once)"
Note over BC: Now spawn threads concurrently
par Warmer Thread
BC->>WM: spawn warm_block_from_bal
WM->>CP: prefetch_accounts(addresses)
CP->>Trie: fetch account states
Trie-->>CP: account cache warm
WM->>CP: get_account_code (par)
and Executor Thread
BC->>EX: spawn execute_block_pipeline
EX->>CP: storage reads (SSTORE original values)
CP-->>EX: cache HIT (warm from sync prefetch)
Note over EX,CP: No trie fallthrough for storage
end
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
crates/blockchain/blockchain.rs:520
The `prefetch_storage` error is silently discarded here, unlike the warmer thread where a failure is at least logged as `debug!("BAL warming failed (non-fatal): {e}")`. If the synchronous prefetch fails (e.g., DB I/O error, allocation failure), execution proceeds with a fully cold cache and there is no diagnostic — effectively an invisible regression to pre-optimization behavior on every block without any signal in the logs.
```suggestion
if let Err(e) = caching_store.prefetch_storage(&slots) {
debug!("synchronous BAL storage prefetch failed (non-fatal): {e}");
}
```
Reviews (1): Last reviewed commit: "perf(l1): prefetch all BAL storage synch..." | Re-trigger Greptile
🤖 Codex Code Review
I did not find a correctness, security, gas-accounting, or EIP-compliance regression in the actual storage/account/code warming reordering. Prefetching block pre-state here is still compatible with LEVM’s per-transaction Residual risk: the PR does not add targeted regression coverage for the new synchronous BAL storage prefetch path. I also could not run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
| if let Some(bal) = bal { | ||
| let slots = LEVM::bal_storage_slots(bal); | ||
| if !slots.is_empty() { | ||
| let _ = caching_store.prefetch_storage(&slots); |
There was a problem hiding this comment.
The
prefetch_storage error is silently discarded here, unlike the warmer thread where a failure is at least logged as debug!("BAL warming failed (non-fatal): {e}"). If the synchronous prefetch fails (e.g., DB I/O error, allocation failure), execution proceeds with a fully cold cache and there is no diagnostic — effectively an invisible regression to pre-optimization behavior on every block without any signal in the logs.
| let _ = caching_store.prefetch_storage(&slots); | |
| if let Err(e) = caching_store.prefetch_storage(&slots) { | |
| debug!("synchronous BAL storage prefetch failed (non-fatal): {e}"); | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/blockchain/blockchain.rs
Line: 520
Comment:
The `prefetch_storage` error is silently discarded here, unlike the warmer thread where a failure is at least logged as `debug!("BAL warming failed (non-fatal): {e}")`. If the synchronous prefetch fails (e.g., DB I/O error, allocation failure), execution proceeds with a fully cold cache and there is no diagnostic — effectively an invisible regression to pre-optimization behavior on every block without any signal in the logs.
```suggestion
if let Err(e) = caching_store.prefetch_storage(&slots) {
debug!("synchronous BAL storage prefetch failed (non-fatal): {e}");
}
```
How can I resolve this? If you propose a fix, please make it concise.
🤖 Claude Code ReviewHere is the review: PR #6732 —
|
What
The BAL warmer and the parallel executor run concurrently, sharing one
CachingDatabase. The warmer prefetched the block's storage slots from a background thread, andprefetch_storageonly populates the cache after its whole parallel fetch completes. So the executor started against a cold cache and raced the warmer to the trie for its SSTORE original values, losing for the early transactions. A perf-record profile on the fixture below showed those SSTORE original-value trie walks were ~22% of total CPU, with ~72% ofCachingDatabasestorage reads falling through to the trie.This change prefetches all BAL storage slots synchronously, before the executor thread starts. The prefetch is parallel (one
par_iteracross all cores), so it is fast, and it lets execution run fully warm with no trie fallthrough for storage and no warmer contending with the executor for CPU and the cache lock. The concurrent warmer is reduced to account states and contract code, which overlap execution well.Storage is warmed in natural account order (slots grouped per account for storage-trie locality). Account prefetch is intentionally left concurrent: warming accounts synchronously too was measured slower, since account reads already overlap execution and a synchronous pass adds serial latency plus a double-fetch against the warmer.
Why synchronous, not on the concurrent warmer?
The question is not "warmer thread vs exec thread", it is synchronous (before exec) vs concurrent (racing exec). The warmer and executor are spawned together and start at the same instant;
prefetch_storagepublishes to the cache only after its whole parallel fetch completes (a single write-lock at the end, no incremental visibility). So while the warmer is fetching, the executor sees a cold cache and reaches the trie itself for early-transaction SSTORE original values, paying a synchronous trie walk on the critical path.Because the executor always starts with zero lead over the warmer, no ordering or chunking of the concurrent prefetch removes that race; concurrent variants topped out around -7% to -13%. The two threads also compete for CPU cores and the cache
RwLock. Doing the storage prefetch up front, fully in parallel, costs only a few ms/block and lets exec run against a fully warm cache with no contention, for ~ -24%.The intuition that overlapping the warmer with exec is free does not hold here: on this workload the overlap was a net loss (race plus contention). Only storage is moved; accounts and code stay on the concurrent warmer because they already overlap exec well (moving accounts synchronously regressed).
Tradeoff to note: on a live single-block
newPayloadthe prefetch is now on the critical path before exec, so it can add a little latency versus the fully overlapped warmer; it is parallel and removes more fallthrough than it adds, so it should still net positive, and import / sync (back-to-back blocks) clearly win.Benchmark
Fixture:
bal-devnet-7-mainnet-mix-460(460 blocks, ~30 Ggas, transfer/EVM mix).release-with-debug,import-bench --with-bal. Baseline =main. 4 runs each.Correctness
The prefetch only fills a read cache with values fetched from the underlying DB (
or_insert, never overwriting executor writes); any cache miss falls through to the trie as before. Per EIP-7928 the BAL covers every storage slot accessed during the block, so coverage is complete. import-bench validates each block's state root and ran clean.Related: #6729.