
fix(l1): bound RocksDB index and filter block memory#6735

Draft
ilitteri wants to merge 2 commits into main from fix/rocksdb-bounded-index-filter-memory

ilitteri (Collaborator) commented May 27, 2026

Motivation

ethrex's RocksDB backend ties resident memory to database size: as the on-disk state grows, so does the in-heap footprint, with no ceiling. On any long-running node this presents as resident memory that climbs indefinitely (operationally indistinguishable from a memory leak), and on a large enough database it will eventually exhaust the host. The mechanism behind this (RocksDB keeping every SST file's index and filter blocks pinned in heap, outside its bounded LRU cache) is detailed below.

Description

This PR ships in two commits.

1. Store index and filter blocks in the shared block cache (67f3492f).
Enables cache_index_and_filter_blocks(true) + pin_l0_filter_and_index_blocks_in_cache(true) on every column family. With this change, RocksDB stops pinning every open SST's index and bloom-filter blocks in heap and instead routes them through its shared LRU cache. Total RocksDB resident memory now tracks the block cache size, not the database size.

2. Expose the block cache size as a CLI option (32ffd479).
Adds --rocksdb.block-cache-size <BYTES> (env ETHREX_ROCKSDB_BLOCK_CACHE_SIZE), default 20 GiB. Plumbed through a new StoreConfig struct and *_with_config constructor variants on Store and the init_store / load_store / open_store helpers; the existing zero-config constructors keep working with the default and are unchanged for tests, tools, and L2 callers.
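The plumbing described in item 2 can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the field name `block_cache_size`, the `path` argument, and the stand-in `Store` type are assumptions made for illustration. The real constructors take other arguments and return errors.

```rust
/// Default block cache size: 20 GiB, the default stated in this PR.
pub const DEFAULT_BLOCK_CACHE_SIZE: u64 = 20 * 1024 * 1024 * 1024;

/// Hypothetical shape of the config struct; the real field names may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoreConfig {
    /// Capacity in bytes of the shared RocksDB block cache.
    pub block_cache_size: u64,
}

impl Default for StoreConfig {
    fn default() -> Self {
        Self { block_cache_size: DEFAULT_BLOCK_CACHE_SIZE }
    }
}

/// Stand-in for the real `Store`; it only records the config it was opened with.
#[derive(Debug)]
pub struct Store {
    pub path: String,
    pub config: StoreConfig,
}

impl Store {
    /// Zero-config constructor: same signature as before, delegates with the default.
    pub fn new(path: &str) -> Self {
        Self::new_with_config(path, StoreConfig::default())
    }

    /// New variant that lets the caller override the cache size.
    pub fn new_with_config(path: &str, config: StoreConfig) -> Self {
        Self { path: path.to_string(), config }
    }
}
```

Having the zero-config constructor delegate to the `_with_config` variant is one way to keep existing callers (tests, tools, L2) source-compatible while the CLI passes an explicit value.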

The cache size now governs the memory vs. block-import-throughput trade-off: filter and index blocks share the cache with data blocks, so a cache that is too small to hold the filter + index working set plus a useful amount of hot data will stall execution. The CLI help text states this explicitly and warns against lowering the value below the default.
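As a back-of-the-envelope illustration of that trade-off, the sketch below computes how much cache is left for data blocks once the filter + index working set is resident. The 8 GiB working-set figure is an assumption, rounded from the ~8 GB of index and filter blocks in the heap profile. The real hot working set is likely smaller, since not every block is touched while following the head.

```rust
pub const GIB: u64 = 1024 * 1024 * 1024;

/// Bytes left for data blocks once the filter + index working set is resident.
/// Saturates at zero when the metadata alone does not fit in the cache.
pub fn data_headroom(cache_bytes: u64, metadata_working_set: u64) -> u64 {
    cache_bytes.saturating_sub(metadata_working_set)
}
```

Under that assumption, `data_headroom(4 * GIB, 8 * GIB)` is zero (metadata alone overflows the cache), while `data_headroom(20 * GIB, 8 * GIB)` leaves 12 GiB for hot data.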

Validation (live on mainnet, 60-block window of head-following, same chain segment)

| Metric | Stock baseline | Fix @ 4 GiB | Fix @ 20 GiB (default) |
|---|---|---|---|
| Median block-import | 35.4 ms | 53.0 ms | 31.5 ms |
| Mean block-import | 38.1 ms | 66.2 ms | 36.4 ms |
| Median ratio vs baseline | | 1.50× | 0.89× |
| Mean ratio vs baseline | | 1.74× | 1.00× |
| RSS at ~500 GB DB | 16 GB, climbing | 7 GB, bounded | 27 GB, bounded |
| RSS projected at 1 TB DB | ~28–30 GB | ~7 GB | ~27 GB (unchanged) |

A jemalloc heap profile of the unfixed baseline attributed ~92% of resident memory to RocksDB, dominated by ~8 GB of index and bloom-filter blocks (~6 GB of which are bloom filters). With the fix applied, the corresponding PrefetchIndexAndFilterBlocks allocations drop from ~8 GB to under 1 GB — the rest is now demand-loaded into the bounded cache via GetOrReadFilterBlock.

At the 20 GiB default, block-import is at parity with the unfixed baseline and resident memory is bounded forever regardless of database growth.

Trade-off worth noting

At today's ~500 GB mainnet database the default 20 GiB cache uses more memory than the unfixed baseline (~27 GB vs ~16 GB). The value of the fix is bounded memory forever — the unfixed baseline keeps climbing as the database grows (state DBs only grow); the crossover lands around a ~1 TB database. Operators who need a lower ceiling at the cost of throughput can lower the cache size; the help text documents this.
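The crossover estimate can be reproduced with a short calculation. This sketch assumes the baseline RSS grows linearly with database size between the two points in the validation table (16 GB at ~500 GB, ~28–30 GB at 1 TB). The linearity is an assumption for illustration, not something measured here.

```rust
/// Linear interpolation through two (db_size_gb, rss_gb) points, solved for
/// the database size at which the baseline RSS reaches `target_rss_gb`.
pub fn crossover_db_size_gb(p0: (f64, f64), p1: (f64, f64), target_rss_gb: f64) -> f64 {
    let slope = (p1.1 - p0.1) / (p1.0 - p0.0); // GB of RSS per GB of database
    p0.0 + (target_rss_gb - p0.1) / slope
}
```

Calling `crossover_db_size_gb((500.0, 16.0), (1000.0, 29.0), 27.0)`, with 29 GB taken as the midpoint of the projected range, gives a little over 920 GB. Using 28 GB or 30 GB instead moves the result within roughly 890–960 GB, consistent with the ~1 TB stated above.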

… instead of

pinning them per open file. With max_open_files(-1) every SST stays open, and the
RocksDB default (cache_index_and_filter_blocks = false) keeps each file's index and
filter blocks in heap for the reader's lifetime, so table memory grows without bound
with the number of SST files. On a 490 GB mainnet DB this reached ~8 GB of pinned
index/filter blocks (~6 GB of it bloom filters), driving resident memory to ~20 GB.

Enabling cache_index_and_filter_blocks moves index and filter blocks into the bounded
block cache, capping total table memory at the cache size. pin_l0_filter_and_index_blocks_in_cache
keeps the hottest level's metadata resident to avoid a read-latency cliff on the cache.
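To make the mechanism concrete, here is a toy model of a byte-bounded LRU cache in which some entries (standing in for L0 index and filter blocks) are pinned and never evicted. It is only a sketch of the idea: RocksDB's real block cache is sharded and handle-based, and every name below is hypothetical.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone)]
struct Entry {
    id: u64,
    bytes: usize,
    pinned: bool,
}

/// Toy byte-bounded LRU. Front of the queue = least recently used.
pub struct BlockCache {
    capacity: usize,
    used: usize,
    entries: VecDeque<Entry>,
}

impl BlockCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, used: 0, entries: VecDeque::new() }
    }

    pub fn used(&self) -> usize {
        self.used
    }

    pub fn contains(&self, id: u64) -> bool {
        self.entries.iter().any(|e| e.id == id)
    }

    /// Look up a block; on a miss, load it and evict unpinned LRU entries
    /// until the cache is back within capacity.
    pub fn access(&mut self, id: u64, bytes: usize, pinned: bool) {
        if let Some(pos) = self.entries.iter().position(|e| e.id == id) {
            // Hit: move the entry to the most-recently-used end.
            let entry = self.entries.remove(pos).unwrap();
            self.entries.push_back(entry);
            return;
        }
        self.entries.push_back(Entry { id, bytes, pinned });
        self.used += bytes;
        while self.used > self.capacity {
            // Evict the least recently used entry that is not pinned.
            match self.entries.iter().position(|e| !e.pinned) {
                Some(pos) => {
                    let evicted = self.entries.remove(pos).unwrap();
                    self.used -= evicted.bytes;
                }
                None => break, // only pinned entries remain in this toy model
            }
        }
    }
}
```

With a capacity of 100, accessing a pinned 40-byte block followed by two unpinned 40-byte blocks evicts the older unpinned block and leaves 80 bytes in use. The point is that total usage stays near the capacity no matter how many distinct blocks exist on disk.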

github-actions Bot commented May 27, 2026

⚠️ Known Issues — intentionally skipped tests

Source: docs/known_issues.md

Known Issues

Tests intentionally excluded from CI. Source of truth for the Known Issues
section the L1 workflow appends to each ef-tests job summary
and posts as a sticky PR comment.

EF Tests — Stateless coverage narrowed to EIP-8025 optional-proofs

make -C tooling/ef_tests/blockchain test calls test-stateless-zkevm
instead of test-stateless. The zkevm@v0.3.3 fixtures are filled against
bal@v5.6.1, out of sync with current bal spec; the broad target trips ~549
fixtures. Re-broaden once the zkevm bundle is regenerated.

Why and resolution path

PR #6527 broadened test-stateless to extract the entire for_amsterdam/ tree
from the zkevm bundle and run all of it under --features stateless; combined
with this branch's bal-devnet-7 semantics, that scope produces ~549
GasUsedMismatch / ReceiptsRootMismatch / BlockAccessListHashMismatch failures.

test-stateless-zkevm filters cargo to the eip8025_optional_proofs
suite, which still validates the stateless harness without the bal-version
mismatch.

Re-broaden by switching test: back to test-stateless in
tooling/ef_tests/blockchain/Makefile once the zkevm bundle is regenerated
against the current bal spec.

@github-actions github-actions Bot added the L1 Ethereum client label May 27, 2026

github-actions Bot commented May 27, 2026

Lines of code report

Total lines added: 111
Total lines removed: 0
Total lines changed: 111

Detailed view
+------------------------------------------+-------+------+
| File                                     | Lines | Diff |
+------------------------------------------+-------+------+
| ethrex/cmd/ethrex/cli.rs                 | 1243  | +46  |
+------------------------------------------+-------+------+
| ethrex/cmd/ethrex/initializers.rs        | 676   | +21  |
+------------------------------------------+-------+------+
| ethrex/cmd/ethrex/l2/initializers.rs     | 386   | +5   |
+------------------------------------------+-------+------+
| ethrex/crates/storage/backend/rocksdb.rs | 334   | +5   |
+------------------------------------------+-------+------+
| ethrex/crates/storage/store.rs           | 2757  | +34  |
+------------------------------------------+-------+------+

(--rocksdb.block-cache-size, env ETHREX_ROCKSDB_BLOCK_CACHE_SIZE) with a default
of 20 GiB. Because the previous commit moved index and bloom-filter blocks into
the bounded block cache, the cache size now governs total RocksDB resident memory
and significantly influences block-import throughput. Measured on a synced mainnet
node: at a 4 GiB cache, filter blocks monopolize the cache and block exec is ~76%
slower than the unbounded baseline; at 20 GiB the cache comfortably holds the
filter + index working set plus the EVM's hot data and exec is at parity. The
help text spells the trade-off out explicitly and only recommends lowering it on
resource-constrained hosts.
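
One way to picture how the option, its environment variable, and the default combine is the sketch below. It is a standard-library-only illustration, not the CLI code in this commit: the function name is hypothetical, and the flag-over-environment precedence is an assumption based on how CLI parsers commonly behave.

```rust
use std::num::ParseIntError;

/// 20 GiB, the default stated in this commit.
pub const DEFAULT_CACHE_BYTES: u64 = 20 * 1024 * 1024 * 1024;

/// Resolve the block cache size in bytes.
/// Assumed precedence: CLI flag, then environment variable, then the default.
pub fn resolve_block_cache_size(
    cli_value: Option<&str>,
    env_value: Option<&str>,
) -> Result<u64, ParseIntError> {
    match cli_value.or(env_value) {
        Some(raw) => raw.trim().parse::<u64>(),
        None => Ok(DEFAULT_CACHE_BYTES),
    }
}
```

A caller could pass `std::env::var("ETHREX_ROCKSDB_BLOCK_CACHE_SIZE").ok().as_deref()` as the second argument. Values must be plain byte counts in this sketch, so a string such as `20GiB` is rejected.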

Plumbed through a new StoreConfig struct (exposed from ethrex-storage) and
Store::new_with_config / new_from_genesis_with_config /
{init,load,open}_store_with_config variants. The existing zero-config
constructors continue to use the default and remain unchanged for tests and
tools, so callers that don't need to override the cache size are unaffected.
