perf(l1): prefetch all BAL storage synchronously before execution #6732

Open. edg-l wants to merge 1 commit into main from perf/bal-warmer-txorder-prefetch

edg-l (Contributor) commented May 27, 2026

What

The BAL warmer and the parallel executor run concurrently, sharing one CachingDatabase. The warmer prefetched the block's storage slots from a background thread, and prefetch_storage only populates the cache after its whole parallel fetch completes. So the executor started against a cold cache and raced the warmer to the trie for its SSTORE original values, losing for the early transactions. A perf-record profile on the fixture below showed those SSTORE original-value trie walks were ~22% of total CPU, with ~72% of CachingDatabase storage reads falling through to the trie.

This change prefetches all BAL storage slots synchronously, before the executor thread starts. The prefetch is parallel (one par_iter across all cores), so it is fast, and it lets execution run fully warm with no trie fallthrough for storage and no warmer contending with the executor for CPU and the cache lock. The concurrent warmer is reduced to account states and contract code, which overlap execution well.

Storage is warmed in natural account order (slots grouped per account for storage-trie locality). Account prefetch is intentionally left concurrent: warming accounts synchronously too was measured slower, since account reads already overlap execution and a synchronous pass adds serial latency plus a double-fetch against the warmer.
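The "natural account order" flattening can be sketched as follows. This is a minimal illustration, not the PR's code: `AccountAccess`, its fields, and the `Address` / `SlotKey` aliases are simplified stand-ins assumed for the example, and only the name `bal_storage_slots` is taken from the PR.

```rust
// Hypothetical, simplified stand-ins; the real ethrex types differ.
type Address = [u8; 20];
type SlotKey = u64;

/// One account's entry in a block access list (assumed shape).
struct AccountAccess {
    address: Address,
    storage_changes: Vec<SlotKey>, // slots written in the block
    storage_reads: Vec<SlotKey>,   // slots only read in the block
}

/// Flattens a BAL into (address, slot) pairs. Each account's slots stay
/// adjacent, so lookups against one storage trie run back to back.
fn bal_storage_slots(bal: &[AccountAccess]) -> Vec<(Address, SlotKey)> {
    bal.iter()
        .flat_map(|ac| {
            ac.storage_changes
                .iter()
                .chain(ac.storage_reads.iter())
                .map(move |&slot| (ac.address, slot))
        })
        .collect()
}

fn main() {
    let bal = vec![
        AccountAccess {
            address: [0xaa; 20],
            storage_changes: vec![1, 2],
            storage_reads: vec![7],
        },
        AccountAccess {
            address: [0xbb; 20],
            storage_changes: vec![],
            storage_reads: vec![3],
        },
    ];
    // All slots of the first account are listed before the second account's.
    for (addr, slot) in bal_storage_slots(&bal) {
        println!("{:#04x}.. slot {}", addr[0], slot);
    }
}
```

The flat list is what a batch prefetch would consume; grouping by account is a locality heuristic, not a correctness requirement.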

Why synchronous, not on the concurrent warmer?

The question is not "warmer thread vs exec thread", it is synchronous (before exec) vs concurrent (racing exec). The warmer and executor are spawned together and start at the same instant; prefetch_storage publishes to the cache only after its whole parallel fetch completes (a single write-lock at the end, no incremental visibility). So while the warmer is fetching, the executor sees a cold cache and reaches the trie itself for early-transaction SSTORE original values, paying a synchronous trie walk on the critical path.
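The "single write-lock at the end, no incremental visibility" behaviour can be sketched with the standard library alone. Everything here is an assumption made for illustration: `CachingDb`, `backing_read`, the `Key` alias, and the use of scoped threads in place of rayon are not the ethrex implementation.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::thread;

type Key = (u8, u64); // (account id, slot); a hypothetical stand-in

/// Stand-in for a storage-trie read.
fn backing_read(key: Key) -> u64 {
    key.1 * 10 + key.0 as u64
}

struct CachingDb {
    cache: RwLock<HashMap<Key, u64>>,
}

impl CachingDb {
    fn new() -> Self {
        Self { cache: RwLock::new(HashMap::new()) }
    }

    /// Fetches every key in parallel, then publishes all results at once.
    fn prefetch_storage(&self, keys: &[Key], workers: usize) {
        let workers = workers.max(1);
        let chunk = ((keys.len() + workers - 1) / workers).max(1);
        // Phase 1: parallel fetch. The cache is not touched, so a reader
        // running concurrently would still see it cold.
        let fetched: Vec<(Key, u64)> = thread::scope(|s| {
            let handles: Vec<_> = keys
                .chunks(chunk)
                .map(|part| {
                    s.spawn(move || {
                        part.iter().map(|&k| (k, backing_read(k))).collect::<Vec<_>>()
                    })
                })
                .collect();
            handles
                .into_iter()
                .flat_map(|h| h.join().expect("worker panicked"))
                .collect()
        });
        // Phase 2: one write lock; results become visible only here.
        let mut cache = self.cache.write().expect("lock poisoned");
        for (k, v) in fetched {
            cache.entry(k).or_insert(v);
        }
    }

    fn cached(&self, key: Key) -> Option<u64> {
        self.cache.read().expect("lock poisoned").get(&key).copied()
    }
}

fn main() {
    let db = CachingDb::new();
    let keys = [(1, 2), (1, 3), (2, 9)];
    println!("before: {:?}", db.cached((1, 2)));
    db.prefetch_storage(&keys, 2);
    println!("after:  {:?}", db.cached((1, 2)));
}
```

Because nothing is published during phase 1, a reader that starts at the same instant as the prefetch gets no benefit until phase 2 finishes, which is the race described above.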

Because the executor always starts with zero lead over the warmer, no ordering or chunking of the concurrent prefetch removes that race; concurrent variants topped out around -7% to -13%. The two threads also compete for CPU cores and the cache RwLock. Doing the storage prefetch up front, fully in parallel, costs only a few ms/block and lets exec run against a fully warm cache with no contention, for ~ -24%.
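A toy model of the ordering change, counting how many executor reads fall through to the trie. It is a sketch under stated assumptions: `CachingDb`, `trie_value`, `run_block`, and the `fallthroughs` counter are hypothetical, the warmer thread is an empty placeholder, and the `prefetch_first = false` case models only the worst case of the race, where the executor reads every slot before the warmer has published anything.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;
use std::thread;

/// Stand-in for a storage-trie walk.
fn trie_value(slot: u64) -> u64 {
    slot + 1000
}

#[derive(Default)]
struct CachingDb {
    cache: RwLock<HashMap<u64, u64>>,
    /// Executor reads that missed the cache and went to the trie.
    fallthroughs: AtomicUsize,
}

impl CachingDb {
    fn prefetch(&self, slots: &[u64]) {
        let fetched: Vec<(u64, u64)> = slots.iter().map(|&s| (s, trie_value(s))).collect();
        let mut cache = self.cache.write().expect("lock poisoned");
        for (k, v) in fetched {
            cache.entry(k).or_insert(v);
        }
    }

    fn read(&self, slot: u64) -> u64 {
        let hit = self.cache.read().expect("lock poisoned").get(&slot).copied();
        if let Some(v) = hit {
            return v;
        }
        self.fallthroughs.fetch_add(1, Ordering::Relaxed);
        trie_value(slot)
    }
}

/// Runs one "block" and returns the number of trie fallthroughs.
fn run_block(slots: &[u64], prefetch_first: bool) -> usize {
    let db = CachingDb::default();
    if prefetch_first {
        // Synchronous: completes before any thread is spawned.
        db.prefetch(slots);
    }
    thread::scope(|s| {
        // Concurrent warmer: accounts and code only (omitted here).
        s.spawn(|| {});
        // Executor: reads every slot the block touches.
        s.spawn(|| {
            for &slot in slots {
                let _ = db.read(slot);
            }
        });
    });
    db.fallthroughs.load(Ordering::Relaxed)
}

fn main() {
    let slots = [1, 2, 3, 4];
    println!("cold start:    {} fallthroughs", run_block(&slots, false));
    println!("prefetch first: {} fallthroughs", run_block(&slots, true));
}
```

With the prefetch finished before `thread::scope` starts, the executor has a guaranteed lead rather than zero lead, so the fallthrough count no longer depends on scheduling.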

The intuition that overlapping the warmer with exec is free does not hold here: on this workload the overlap was a net loss (race plus contention). Only storage is moved; accounts and code stay on the concurrent warmer because they already overlap exec well (moving accounts synchronously regressed).

Tradeoff to note: on a live single-block newPayload the prefetch is now on the critical path before exec, so it can add a little latency versus the fully overlapped warmer; it is parallel and removes more fallthrough than it adds, so it should still net positive, and import / sync (back-to-back blocks) clearly win.

Benchmark

Fixture: bal-devnet-7-mainnet-mix-460 (460 blocks, ~30 Ggas, transfer/EVM mix). release-with-debug, import-bench --with-bal. Baseline = main. 4 runs each.

| metric | baseline (main) | this PR | delta |
|---|---|---|---|
| wall time | 4687.9 ms | 3577.9 ms | -1110 ms (-23.7%) |
| exec total | 4263.2 ms | 3166.4 ms | -1097 ms (-25.7%) |
| warmer total | 3148.3 ms | 864.8 ms | warmer now covers accounts + code only |
| warmer overlap | 97.0% | 97.8% | preserved |
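For readers checking the delta column, the arithmetic is the difference divided by the baseline. The helper name `delta` below is an assumption for illustration and is not part of import-bench.

```rust
/// Returns (absolute delta in ms, relative delta in percent of baseline).
fn delta(baseline_ms: f64, new_ms: f64) -> (f64, f64) {
    let abs = new_ms - baseline_ms;
    (abs, abs / baseline_ms * 100.0)
}

fn main() {
    for (name, base, new) in [
        ("wall time", 4687.9, 3577.9),
        ("exec total", 4263.2, 3166.4),
    ] {
        let (abs, pct) = delta(base, new);
        println!("{name}: {abs:.0} ms ({pct:.1}%)");
    }
}
```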

Correctness

The prefetch only fills a read cache with values fetched from the underlying DB (or_insert, never overwriting executor writes); any cache miss falls through to the trie as before. Per EIP-7928 the BAL covers every storage slot accessed during the block, so coverage is complete. import-bench validates each block's state root and ran clean.
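The `or_insert` guarantee can be shown with a plain `HashMap`. This is a minimal sketch: `publish_prefetched` and the `Key` alias are hypothetical names, and the real cache sits behind a lock that is omitted here.

```rust
use std::collections::HashMap;

type Key = (u8, u64); // (account id, slot); a hypothetical stand-in

/// Publishes prefetched values without overwriting entries already present.
fn publish_prefetched(cache: &mut HashMap<Key, u64>, fetched: Vec<(Key, u64)>) {
    for (key, value) in fetched {
        // `or_insert` only writes when the key is absent.
        cache.entry(key).or_insert(value);
    }
}

fn main() {
    let mut cache = HashMap::new();
    // Pretend the executor already wrote slot (1, 7).
    cache.insert((1, 7), 99);
    // The prefetch arrives with the pre-state value for (1, 7) and a new slot.
    publish_prefetched(&mut cache, vec![((1, 7), 5), ((1, 8), 6)]);
    println!("(1, 7) = {}", cache[&(1, 7)]); // executor's value is kept
    println!("(1, 8) = {}", cache[&(1, 8)]); // missing slot is filled
}
```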

Related: #6729.

github-actions Bot commented May 27, 2026

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known Issues
section the L1 workflow appends to each ef-tests job summary
and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened test-stateless to extract the entire for_amsterdam/ tree
from the zkevm bundle and run all of it under --features stateless; combined
with this branch's bal-devnet-7 semantics that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch / BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

github-actions Bot added the L1 (Ethereum client) and performance (Block execution throughput and performance in general) labels May 27, 2026
github-actions Bot commented May 27, 2026

Lines of code report

Total lines added: 7
Total lines removed: 3
Total lines changed: 10

Detailed view
+----------------------------------------+-------+------+
| File                                   | Lines | Diff |
+----------------------------------------+-------+------+
| ethrex/crates/blockchain/blockchain.rs | 2558  | +7   |
+----------------------------------------+-------+------+
| ethrex/crates/vm/backends/levm/mod.rs  | 2393  | -3   |
+----------------------------------------+-------+------+

github-actions Bot commented May 27, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 66.415 ± 0.095 | 66.278 | 66.540 | 1.00 ± 0.00 |
| head | 66.131 ± 0.227 | 65.735 | 66.371 | 1.00 |

edg-l changed the title from "perf(l1): prefetch BAL storage in execution order" to "perf(l1): prefetch all BAL storage synchronously before execution" May 27, 2026
edg-l force-pushed the perf/bal-warmer-txorder-prefetch branch from 60257f9 to f627914 May 27, 2026 13:02
edg-l force-pushed the perf/bal-warmer-txorder-prefetch branch from f627914 to c9069cc May 27, 2026 13:11

The BAL warmer and the parallel executor run concurrently sharing one
CachingDatabase. `prefetch_storage` publishes to the cache only after its
whole parallel fetch completes, so warming storage from the background
warmer left the executor racing it to the trie for SSTORE original values
(~22% of CPU on cold-cache import-bench).

Prefetch all BAL storage slots synchronously before spawning the executor
thread (parallel, on all cores) so execution runs fully warm with no
storage fallthrough and no warmer contention. The concurrent warmer is
reduced to account states and contract code, which overlap execution.
Storage is warmed in natural account order.

bal-devnet-7-mainnet-mix-460, import-bench --with-bal vs main: ~ -24%.

edg-l force-pushed the perf/bal-warmer-txorder-prefetch branch from c9069cc to 2a745f9 May 27, 2026 13:23
edg-l marked this pull request as ready for review May 27, 2026 13:30
edg-l requested a review from a team as a code owner May 27, 2026 13:30
ethrex-project-sync Bot moved this to In Review in ethrex_l1 May 27, 2026
🤖 Kimi Code Review

Overall Assessment: The PR correctly addresses a race condition between concurrent storage warming and EVM execution. The approach is sound, but there's a critical error-handling oversight and a minor type consistency issue.

Issues Found

1. Silent error handling masks prefetch failures

File: crates/blockchain/blockchain.rs
Line: ~510 (the let _ = caching_store.prefetch_storage(&slots); line)

Ignoring the error with let _ = is dangerous. If storage prefetch fails (e.g., database corruption, I/O error, or trie node missing), execution proceeds with a cold cache, potentially causing:

  • Silent performance degradation (violating the -24% performance guarantee)
  • Inconsistent behavior between nodes if prefetch fails on some but not others

Fix: Log the error or propagate it:

if let Err(e) = caching_store.prefetch_storage(&slots) {
    tracing::warn!("Storage prefetch failed, proceeding with cold cache: {}", e);
}

2. Type path inconsistency

File: crates/vm/backends/levm/mod.rs
Lines: 2093, 2097

The new function uses bare H256 while the existing code in the same file uses ethrex_common::H256. If H256 isn't imported in the module scope, this won't compile. Even if it compiles, explicit paths improve clarity.

Fix: Use ethrex_common::H256 for consistency with the surrounding code:

pub fn bal_storage_slots(bal: &BlockAccessList) -> Vec<(Address, ethrex_common::H256)> {
    // ...
    .map(move |slot| (ac.address, ethrex_common::H256::from_uint(&slot)))

3. Potential missing cancellation check (Minor)

File: crates/blockchain/blockchain.rs
Context: Between the new prefetch block and std::thread::scope

The synchronous storage prefetch runs before the thread scope and doesn't check the cancellation flag. While this is likely acceptable (rayon parallel fetch should be fast), for very large BALs this could delay shutdown. Consider checking cancelled before the prefetch if the flag is accessible at this scope.

Positive Observations

  1. Correct race condition fix: Moving storage prefetch out of the concurrent warmer and into the synchronous pre-execution phase correctly eliminates the race for SSTORE original values described in the comments.

  2. Appropriate feature gating: The #[cfg(all(feature = "rayon", not(feature = "eip-8025")))] attributes correctly match the conditional compilation requirements.

  3. Excellent documentation: The inline comments explaining the performance tradeoffs (-24% vs -7% to -13%), the rationale for not prefetching accounts, and the live-node latency implications are thorough and will prevent future regressions.

  4. Memory efficiency check: The if !slots.is_empty() guard before prefetching avoids unnecessary overhead for blocks without BAL storage accesses.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR moves BAL storage-slot prefetching from the concurrent warmer thread to a synchronous parallel pass that runs before the executor thread starts, eliminating the race where the executor fell through to the trie for SSTORE original values while the warmer was still building its cache snapshot. Account-state and code warming remain on the concurrent warmer because they already overlap execution well and moving them synchronously regressed.

  • crates/blockchain/blockchain.rs: Adds a synchronous prefetch_storage call over all BAL slots (via the new bal_storage_slots helper) before spawning the executor/warmer threads, gated on the same rayon/eip-8025 feature flags as the warmer.
  • crates/vm/backends/levm/mod.rs: Introduces bal_storage_slots to enumerate (address, slot) pairs from a BAL, and removes Phase 2 (storage) from warm_block_from_bal, leaving only account-state and code prefetch in the concurrent warmer.

Confidence Score: 4/5

Safe to merge; the change is purely additive to the warm-up path and correctness falls back gracefully to trie reads on any failure.

The synchronous prefetch is logically sound and well-guarded by feature flags. The one rough edge is that a prefetch_storage failure is silently discarded rather than logged, which would make it hard to diagnose a silent regression to cold-cache behavior in production.

crates/blockchain/blockchain.rs — specifically the silent error discard on the synchronous prefetch call.

Important Files Changed

| Filename | Overview |
|---|---|
| crates/blockchain/blockchain.rs | Adds a synchronous prefetch_storage call for all BAL slots before the executor thread starts, removing the race between the warmer and executor for cold-cache SSTORE reads; the prefetch error is silently discarded. |
| crates/vm/backends/levm/mod.rs | Extracts storage-slot enumeration into a new bal_storage_slots helper and removes Phase 2 (storage) from the concurrent warmer, keeping only account-state and code warming in the background thread. |
| CHANGELOG.md | Adds a changelog entry for the synchronous BAL storage prefetch optimization. |

Sequence Diagram

sequenceDiagram
    participant BC as blockchain.rs
    participant CP as CachingDatabase
    participant Trie as State Trie
    participant WM as Warmer Thread
    participant EX as Executor Thread

    Note over BC: Before executor threads start
    BC->>CP: "bal_storage_slots(bal) -> slots"
    BC->>CP: prefetch_storage(slots) [parallel, all cores]
    CP->>Trie: batch fetch storage slots
    Trie-->>CP: "values -> cache (write-lock once)"

    Note over BC: Now spawn threads concurrently
    par Warmer Thread
        BC->>WM: spawn warm_block_from_bal
        WM->>CP: prefetch_accounts(addresses)
        CP->>Trie: fetch account states
        Trie-->>CP: account cache warm
        WM->>CP: get_account_code (par)
    and Executor Thread
        BC->>EX: spawn execute_block_pipeline
        EX->>CP: storage reads (SSTORE original values)
        CP-->>EX: cache HIT (warm from sync prefetch)
        Note over EX,CP: No trie fallthrough for storage
    end
Reviews (1): Last reviewed commit: "perf(l1): prefetch all BAL storage synch..."


🤖 Codex Code Review

  1. Low: crates/blockchain/blockchain.rs silently drops prefetch_storage errors with let _ = .... After this PR, BAL storage warming no longer has a second chance inside warm_block_from_bal, so a trie/db failure or poisoned cache lock now disables all BAL slot warming for that block with no signal at all. At minimum this should log like the warmer thread does; ideally reuse the same non-fatal error path so regressions in the new critical-path prefetch are observable.

I did not find a correctness, security, gas-accounting, or EIP-compliance regression in the actual storage/account/code warming reordering. Prefetching block pre-state here is still compatible with LEVM’s per-transaction storage_original_values tracking.

Residual risk: the PR does not add targeted regression coverage for the new synchronous BAL storage prefetch path. I also could not run cargo test here because rustup failed to create temp files under /home/runner/.rustup/tmp due to a read-only filesystem error.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

if let Some(bal) = bal {
    let slots = LEVM::bal_storage_slots(bal);
    if !slots.is_empty() {
        let _ = caching_store.prefetch_storage(&slots);

P2 The prefetch_storage error is silently discarded here, unlike the warmer thread where a failure is at least logged as debug!("BAL warming failed (non-fatal): {e}"). If the synchronous prefetch fails (e.g., DB I/O error, allocation failure), execution proceeds with a fully cold cache and there is no diagnostic — effectively an invisible regression to pre-optimization behavior on every block without any signal in the logs.

Suggested change
- let _ = caching_store.prefetch_storage(&slots);
+ if let Err(e) = caching_store.prefetch_storage(&slots) {
+     debug!("synchronous BAL storage prefetch failed (non-fatal): {e}");
+ }


🤖 Claude Code Review

Here is the review:


PR #6732: perf(l1): prefetch all BAL storage synchronously before execution

Overview

Moves BAL storage slot prefetching from the concurrent warmer thread to a synchronous, parallel pass that completes before the executor thread starts. This eliminates the race between the warmer and executor on the CachingDatabase write lock for SSTORE original-value lookups, yielding ~24% exec time improvement on the benchmark fixture. The concurrent warmer is reduced to accounts and contract code, which continue to overlap execution well.


Code Correctness

bal type — no ownership issue.
bal: Option<&BlockAccessList> is Copy (references are Copy), so the variable can be used at line 476 (bal.map(synthesize_bal_updates)), at line 517 (the new synchronous block), and at line 536 (the warmer closure) without any move issue. This is correct.
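The ownership point can be checked in isolation. In this sketch `BlockAccessList` is a hypothetical one-field stand-in and `use_three_times` is an illustrative name; only the fact that shared references, and therefore `Option<&T>`, are `Copy` comes from the language itself.

```rust
// Hypothetical stand-in for the real BlockAccessList type.
struct BlockAccessList {
    accounts: usize,
}

/// Uses the same `Option<&BlockAccessList>` three times without cloning,
/// mirroring the three call sites named in the review.
fn use_three_times(bal: Option<&BlockAccessList>) -> (usize, usize, usize) {
    // 1. `map` takes `self` by value; the option is copied, not moved.
    let first = bal.map(|b| b.accounts).unwrap_or(0);
    // 2. A `move` closure (like the warmer's) copies the reference in.
    let warmer = move || bal.map_or(0, |b| b.accounts);
    let second = warmer();
    // 3. `bal` is still usable after the `move` closure captured it.
    let third = bal.map_or(0, |b| b.accounts);
    (first, second, third)
}

fn main() {
    let bal = BlockAccessList { accounts: 4 };
    println!("{:?}", use_three_times(Some(&bal)));
    println!("{:?}", use_three_times(None));
}
```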

or_insert semantics confirmed.
CachingDatabase::prefetch_storage (db/mod.rs:210) uses cache.entry(key).or_insert(value), which never overwrites existing entries. Since the synchronous prefetch runs before the executor starts, there are no executor writes yet — this is safe and correct.

cfg consistency.
The call site in blockchain.rs and CachingDatabase::prefetch_storage both gate on #[cfg(all(feature = "rayon", not(feature = "eip-8025")))], so dynamic dispatch via Arc<dyn LevmDatabase> will always call the parallel rayon implementation when this path is active. Consistent.


Issues and Suggestions

Silent error discard — inconsistency with the warmer.

// blockchain.rs:520
let _ = caching_store.prefetch_storage(&slots);

The concurrent warmer logs non-fatal errors:

if let Err(e) = LEVM::warm_block_from_bal(bal, caching_store, cancelled_ref) {
    debug!("BAL warming failed (non-fatal): {e}");
}

A failure here silently degrades to a cold-cache run with no diagnostic. Suggest:

if let Err(e) = caching_store.prefetch_storage(&slots) {
    debug!("Storage prefetch failed (non-fatal): {e}");
}

All-or-nothing failure in prefetch_storage.

CachingDatabase::prefetch_storage (db/mod.rs:207) uses:

.collect::<Result<_, _>>()?;

If a single slot lookup fails (e.g., transient IO), the entire fetch aborts and no slots are written to the cache. The write-lock is only taken after a successful full collection (db/mod.rs:208). A single failure thus leaves storage entirely cold for that block, causing maximum trie fallthrough. This is functionally correct (execution falls through) but the performance impact of the regression is larger than the single failing slot would suggest.

For production resilience, consider tolerating per-slot failures:

let fetched: Vec<_> = keys
    .par_iter()
    .filter_map(|&(addr, key)| {
        self.inner.get_storage_value(addr, key)
            .map(|v| ((addr, key), v))
            .ok()
    })
    .collect();

This is a pre-existing behaviour that this PR didn't introduce, but it becomes more load-bearing now that storage prefetch is on the critical path.
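The difference between the two collection strategies can be seen with sequential iterators. This is a sketch under assumptions: `get_storage_value` here is a fake lookup that fails for slot 2, and `fetch_all_or_nothing` / `fetch_tolerant` are illustrative names, not functions in ethrex. The same `collect` and `filter_map` shapes apply to the rayon parallel iterator quoted above.

```rust
/// Fake storage lookup: slot 2 simulates a transient failure.
fn get_storage_value(slot: u64) -> Result<u64, String> {
    if slot == 2 {
        Err(format!("transient I/O error on slot {slot}"))
    } else {
        Ok(slot * 100)
    }
}

/// Collecting into `Result<Vec<_>, _>` stops at the first error,
/// so one bad slot discards every fetched value.
fn fetch_all_or_nothing(slots: &[u64]) -> Result<Vec<(u64, u64)>, String> {
    slots
        .iter()
        .map(|&s| get_storage_value(s).map(|v| (s, v)))
        .collect()
}

/// `filter_map` with `.ok()` drops only the failing slot.
fn fetch_tolerant(slots: &[u64]) -> Vec<(u64, u64)> {
    slots
        .iter()
        .filter_map(|&s| get_storage_value(s).map(|v| (s, v)).ok())
        .collect()
}

fn main() {
    let slots = [1, 2, 3];
    println!("all-or-nothing: {:?}", fetch_all_or_nothing(&slots));
    println!("tolerant:       {:?}", fetch_tolerant(&slots));
}
```

The tolerant variant hides the error entirely, so in practice it would likely need to count or log the dropped slots to stay diagnosable.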

Comment length.

The 25-line comment block at blockchain.rs:491–515 is thorough and the DO NOT warnings are genuinely useful guards against future regressions. However, the measurement prose (lines 500–503) duplicates the PR description verbatim. Consider condensing to keep the warnings and drop the duplicate numbers:

// Measured: concurrent variants ~ -7% to -13%; this approach ~ -24%.
// DO NOT move storage back to the concurrent warmer; the race is the problem.
// DO NOT add synchronous account prefetch; account reads overlap exec well
// and a synchronous pass both adds serial latency and double-fetches.

Design Observations

  • The PR description correctly identifies the core issue: prefetch_storage publishes the full result atomically at the end, so there is zero incremental visibility while the warmer runs. Chunking or reordering within the concurrent warmer cannot fix this.
  • The live-node tradeoff (prefetch on critical path for single newPayload) is acknowledged and measured. Accepted.
  • bal_storage_slots is a clean, pure extraction of the flattening logic. Natural account order for storage-trie locality is a sound heuristic.
  • Cancellation check removal is correct — the synchronous prefetch has no racing thread to cancel against.

Summary

The logic is correct, the benchmark numbers are compelling, and the design tradeoffs are clearly reasoned. The main actionable item is adding a debug!() log on prefetch failure to match the warmer's error-handling style. The all-or-nothing failure mode in prefetch_storage is worth a follow-up but is not introduced by this PR.


Automated review by Claude (Anthropic) · sonnet · custom prompt
