refactor: use file backed CID hashset to reduce memory usage in chain export #6939
hanabi1224 merged 9 commits into main
Conversation
Walkthrough
Generic "seen" tracking for IPLD/chain streaming was introduced: a file-backed CID set (`FileBackedCidHashSet`) can now stand in for the in-memory `CidHashSet` behind a common set-like interface.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RPC
    participant Chain
    participant IpldStream
    participant FileBackedSet
    participant ParityDB
    rect rgba(200,200,255,0.5)
    Client->>RPC: export request
    RPC->>Chain: export_from_head(ExportOptions{ seen: FileBackedCidHashSet::new(temp_dir) })
    Chain->>IpldStream: stream_chain(..., seen)
    end
    rect rgba(200,255,200,0.5)
    IpldStream->>FileBackedSet: insert(cid)
    FileBackedSet->>ParityDB: get/check key
    ParityDB-->>FileBackedSet: present? / not present
    alt not present
        FileBackedSet->>ParityDB: write Set (empty bytes)
        FileBackedSet-->>IpldStream: inserted = true
    else present
        FileBackedSet-->>IpldStream: inserted = false
    end
    IpldStream-->>Chain: yield block(s)
    Chain-->>RPC: stream CAR / response
    RPC-->>Client: deliver export
    end
```
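The seam the diagram exercises, `insert(cid)` reporting whether a CID was new, is what lets the in-memory and file-backed sets swap freely. A minimal sketch of that seam, assuming the `CidHashSetLike` shape discussed later in this thread (a fallible `insert` returning `anyhow::Result<bool>`); the merged trait may differ:

```rust
use anyhow::Result;
use cid::Cid;
use std::collections::HashSet;

/// Dedup backend for streaming: `insert` returns Ok(true) iff the CID was new.
pub trait CidHashSetLike {
    fn insert(&mut self, cid: &Cid) -> Result<bool>;
}

/// In-memory backend: fastest, but RAM grows with the number of unique CIDs.
impl CidHashSetLike for HashSet<Cid> {
    fn insert(&mut self, cid: &Cid) -> Result<bool> {
        Ok(HashSet::insert(self, *cid))
    }
}
```

The file-backed implementor in this PR (parity-db plus an LRU front) plugs into the same seam, trading the roughly 20% slower export reported below for bounded memory.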
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/cid_collections/hash_set.rs (2)
101-144: Consider returning `Result` instead of panicking in `Default`. The `panic!` on lines 140-142 could crash the process if temp directory creation fails (e.g., disk full, permission issues). Per coding guidelines, production code should avoid `unwrap()` and panics. Consider either:
- Implementing a fallible constructor `new() -> anyhow::Result<Self>` and using that at call sites
- Or documenting this panic clearly if it's an intentional fail-fast for unrecoverable conditions
Given the author's preference for fail-fast approaches (per learnings), this may be intentional, but the panic message should be more descriptive about potential causes.
💡 Alternative: fallible constructor
```rust
impl FileBackedCidHashSet {
    pub fn new() -> anyhow::Result<Self> {
        const MAX_ATTEMPTS: usize = 10;
        for attempt in 0..MAX_ATTEMPTS {
            if let Some(dir) = tempfile::tempdir_in(".")
                .ok()
                .or_else(|| tempfile::tempdir().ok())
            {
                // ... existing setup ...
                if let Ok(db) = parity_db::Db::open_or_create(&options) {
                    return Ok(Self { db, _dir: dir, lru: ... });
                }
            }
        }
        anyhow::bail!(
            "failed to create parity db temp directory after {MAX_ATTEMPTS} attempts. \
             Possible causes: disk full, permission denied, or too many open files"
        );
    }
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cid_collections/hash_set.rs` around lines 101 - 144, The Default impl for FileBackedCidHashSet currently panics after MAX_ATTEMPTS when creating a temp parity DB; instead implement a fallible constructor FileBackedCidHashSet::new() -> anyhow::Result<Self> that contains the loop (using parity_db::Db::open_or_create and the same options) and returns Ok(Self { db, _dir: dir, lru: ... }) on success or anyhow::bail! with a descriptive message listing likely causes (disk full, permission denied, etc.) on failure, then update impl Default for FileBackedCidHashSet to call FileBackedCidHashSet::new().expect("descriptive message") if you must keep panic in Default, or remove Default usages and propagate the Result at call sites.
159-168: Silent error handling may mask issues. Lines 159 and 163-165 silently ignore errors:
- `self.db.get(col, &key).ok().flatten()` - DB read errors treated as "not found"
- `_ = self.db.commit_changes_bytes(...)` - DB write errors silently discarded

For a file-backed set, write failures could lead to duplicate CIDs being reported as "newly inserted" on subsequent calls, potentially causing incorrect behavior during chain export.
Consider at minimum logging these errors at trace/debug level for diagnosability.
💡 Suggested improvement for error visibility
```diff
-        if self.db.get(col, &key).ok().flatten().is_some() {
+        let exists = match self.db.get(col, &key) {
+            Ok(Some(_)) => true,
+            Ok(None) => false,
+            Err(e) => {
+                tracing::trace!("FileBackedCidHashSet get error (treating as miss): {e}");
+                false
+            }
+        };
+        if exists {
             self.lru.insert(small, ());
             false
         } else {
-            _ = self
+            if let Err(e) = self
                 .db
-                .commit_changes_bytes([(col, parity_db::Operation::Set(key, EMPTY_VALUE.clone()))]);
+                .commit_changes_bytes([(col, parity_db::Operation::Set(key, EMPTY_VALUE.clone()))])
+            {
+                tracing::trace!("FileBackedCidHashSet commit error: {e}");
+            }
             self.lru.insert(small, ());
             true
         }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cid_collections/hash_set.rs` around lines 159 - 168, The DB read and write errors are currently ignored in the contains/insert path (calls to self.db.get(...) and self.db.commit_changes_bytes(...)), which can hide failures; change the logic to inspect the Result from self.db.get(col, &key) and from self.db.commit_changes_bytes(...) and log any Err at debug/trace level (e.g., using tracing::debug! or log::debug!) including the column (col) and key details, and for the read error treat it as an error case (not silently "not found") and for the write error log the failure and ensure the method's return reflects the failed persistence (do not unconditionally return true on commit error). Ensure you update places referencing EMPTY_VALUE, small, and lru insertion: only insert to lru after a successful read or successful commit, and include the error logs near the calls to self.db.get and self.db.commit_changes_bytes to aid diagnosability.
src/tool/subcommands/archive_cmd.rs (1)
700-700: Consider using `FileBackedCidHashSet` for large snapshot merges. The merge operation uses the in-memory `CidHashSet`. For merging large snapshots (potentially the entire chain from genesis), this could consume significant memory. Was the in-memory choice intentional for merge performance, or should this also use `FileBackedCidHashSet` for consistency with the memory-reduction goals?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tool/subcommands/archive_cmd.rs` at line 700, The code currently constructs the stream graph using an in-memory CidHashSet via stream_graph::<_, _, _, CidHashSet>(&store, heaviest_tipset.chain(&store), 0) which can OOM for large merges; replace CidHashSet with FileBackedCidHashSet and instantiate a FileBackedCidHashSet (e.g., using a temp file or configured backing path) so stream_graph::<_, _, _, FileBackedCidHashSet>(...) uses the file-backed implementation; update any constructors or trait-bound calls around stream_graph, and ensure FileBackedCidHashSet is created and passed where the previous in-memory set was assumed (for example where blocks is defined) so large snapshot merges use disk-backed storage.
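If the merge path is switched as the prompt describes, the edit is essentially a type-parameter swap; a hedged before/after sketch (the `stream_graph` call shape is taken from the prompt above, and how the set gets constructed depends on the final signature):

```rust
// Before: in-memory seen-set, RAM grows with the number of unique CIDs.
let blocks = stream_graph::<_, _, _, CidHashSet>(&store, heaviest_tipset.chain(&store), 0);

// After (sketch): disk-backed seen-set bounds memory for genesis-to-head merges.
let blocks = stream_graph::<_, _, _, FileBackedCidHashSet>(&store, heaviest_tipset.chain(&store), 0);
```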
src/ipld/util.rs (1)
424-430: Consider adding a default type parameter for consistency. `ChainStream` defines `S = CidHashSet` as a default type parameter, but `IpldStream` does not. While the single existing usage of `IpldStream` in the codebase (src/dev/subcommands/export_state_tree_cmd.rs:112) already explicitly provides both type parameters, aligning the pattern with `ChainStream` would improve consistency.
♻️ Suggested change
```diff
 pin_project! {
-    pub struct IpldStream<DB, S> {
+    pub struct IpldStream<DB, S = CidHashSet> {
         db: DB,
         cid_vec: Vec<Cid>,
         seen: S,
     }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ipld/util.rs` around lines 424 - 430, Add a default type parameter for IpldStream so its second generic defaults to CidHashSet for consistency with ChainStream; update the type declaration pub struct IpldStream<DB, S = CidHashSet> and ensure any uses that relied on the old signature (e.g., IpldStream in export_state_tree_cmd.rs) continue to compile without changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/chain/mod.rs`:
- Around line 35-42: The ExportOptions struct lost its Clone and Debug derives
because its field seen: FileBackedCidHashSet contains
non-cloneable/non-debuggable types (parity_db::Db and tempfile::TempDir), so
restore correct behavior by documenting this as a breaking change rather than
reintroducing derives; update the crate's CHANGELOG and public API notes to
state that ExportOptions no longer implements Clone/Debug (mention ExportOptions
and FileBackedCidHashSet and the underlying parity_db::Db and tempfile::TempDir
types), and if external consumers require clone/debug, provide guidance or an
alternative API (e.g., a lightweight Clone/Debug DTO or factory functions)
rather than attempting to derive these traits on ExportOptions itself.
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (9)
Cargo.toml, src/chain/mod.rs, src/cid_collections/hash_set.rs, src/cid_collections/mod.rs, src/dev/subcommands/export_state_tree_cmd.rs, src/ipld/util.rs, src/tool/subcommands/archive_cmd.rs, src/tool/subcommands/benchmark_cmd.rs, src/tool/subcommands/snapshot_cmd.rs
What's the memory footprint reduction?
@LesnyRumcajs 7-8 GiB on mainnet and 1.8-2 GiB on calibnet
Worth it!
LesnyRumcajs left a comment
Let's add a changelog entry for this; folks would like to know Forest potentially dropped its memory requirements by half (at the cost of slightly longer export times)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/rpc/methods/chain.rs (1)
384-407: ⚠️ Potential issue | 🟠 Major
Guarantee export-state cleanup across early `?` returns. The export lock and status are set before this new `FileBackedCidHashSet::new(..)?` call. If it fails, `handle` returns before `end_export()` and before `CHAIN_EXPORT_LOCK` is cleared, so every later export request is rejected until process restart.
Suggested fix
```diff
     start_export();
-
-    let head = ctx.chain_store().load_required_tipset_or_heaviest(&tsk)?;
-    let start_ts =
-        ctx.chain_index()
-            .tipset_by_height(epoch, head, ResolveNullTipset::TakeOlder)?;
-
-    let options = ExportOptions {
-        skip_checksum,
-        include_receipts,
-        include_events,
-        include_tipset_keys,
-        seen: FileBackedCidHashSet::new(ctx.temp_dir.as_path())?,
-    };
-    let writer = if dry_run {
-        tokio_util::either::Either::Left(VoidAsyncWriter)
-    } else {
-        tokio_util::either::Either::Right(tokio::fs::File::create(&output_path).await?)
-    };
-    let result = match version {
+    let result: anyhow::Result<ApiExportResult> = async {
+        let head = ctx.chain_store().load_required_tipset_or_heaviest(&tsk)?;
+        let start_ts =
+            ctx.chain_index()
+                .tipset_by_height(epoch, head, ResolveNullTipset::TakeOlder)?;
+
+        let options = ExportOptions {
+            skip_checksum,
+            include_receipts,
+            include_events,
+            include_tipset_keys,
+            seen: FileBackedCidHashSet::new(ctx.temp_dir.as_path())?,
+        };
+        let writer = if dry_run {
+            tokio_util::either::Either::Left(VoidAsyncWriter)
+        } else {
+            tokio_util::either::Either::Right(tokio::fs::File::create(&output_path).await?)
+        };
+
+        match version {
             FilecoinSnapshotVersion::V1 => {
                 // ...
             }
             FilecoinSnapshotVersion::V2 => {
                 // ...
             }
-    };
+        }
+    }.await;

     end_export();
     let mut guard = CHAIN_EXPORT_LOCK.lock().await;
     *guard = None;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/rpc/methods/chain.rs` around lines 384 - 407, The code sets CHAIN_EXPORT_LOCK and calls start_export() before constructing FileBackedCidHashSet::new(...)? which can early-return and leave the lock/status set; to fix, ensure export-state cleanup on all exit paths by either creating the FileBackedCidHashSet (FileBackedCidHashSet::new(ctx.temp_dir.as_path())) before acquiring CHAIN_EXPORT_LOCK/start_export(), or wrap the locked section with a scope guard/finalizer that always calls end_export() and clears CHAIN_EXPORT_LOCK (using the same CancellationToken and token.clone()) so that any early `?` return does not leave the export state stuck; update the code around CHAIN_EXPORT_LOCK, start_export, end_export, and ExportOptions to implement one of these approaches.
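For the scope-guard variant the prompt mentions, one possible shape is sketched below. The `scopeguard` crate is an assumption on my part (the suggested diff above uses an async block instead), and only the synchronous `end_export()` can run in `Drop`; clearing the async `CHAIN_EXPORT_LOCK` still needs explicit handling on the normal path.

```rust
// Sketch only: end_export() now runs on every exit path, including early
// `?` returns during the fallible setup below.
start_export();
let _cleanup = scopeguard::guard((), |_| end_export());

let seen = FileBackedCidHashSet::new(ctx.temp_dir.as_path())?; // safe to fail now
// ... rest of the export path ...
```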
src/tool/subcommands/archive_cmd.rs (1)
587-603: ⚠️ Potential issue | 🟠 Major
Don't back the seen-set in the current working directory. `FileBackedCidHashSet::new(".")?` ignores the user-selected export destination. A large export can now fail because the cwd is read-only or out of space even when `output_path` is valid, and it can spill multi-GiB temp files into an unrelated directory.
Suggested fix

```diff
-    let seen = if let Some(diff) = diff {
+    let output_path = build_output_path(network.to_string(), genesis.timestamp, epoch, output_path);
+    let seen_root = output_path
+        .parent()
+        .unwrap_or_else(|| std::path::Path::new("."));
+
+    let seen = if let Some(diff) = diff {
         let diff_ts: Tipset = index
             .tipset_by_height(diff, ts.shallow_clone(), ResolveNullTipset::TakeOlder)
             .context("diff epoch must be smaller than target epoch")?;
         let diff_ts: &Tipset = &diff_ts;
         let diff_limit = diff_depth.map(|depth| diff_ts.epoch() - depth).unwrap_or(0);
         let mut stream = stream_chain(
             store.clone(),
             diff_ts.clone().chain_owned(store.clone()),
             diff_limit,
-            FileBackedCidHashSet::new(".")?,
+            FileBackedCidHashSet::new(seen_root)?,
         );
         while stream.try_next().await?.is_some() {}
         stream.into_seen()
     } else {
-        FileBackedCidHashSet::new(".")?
+        FileBackedCidHashSet::new(seen_root)?
     };
-
-    let output_path = build_output_path(network.to_string(), genesis.timestamp, epoch, output_path);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tool/subcommands/archive_cmd.rs` around lines 587 - 603, The seen-set is being created with FileBackedCidHashSet::new(".") which uses the current working directory; change both instances where FileBackedCidHashSet::new(".")? is called (inside the diff branch and the else branch) to create the file-backed set in the user-selected export destination instead (e.g. FileBackedCidHashSet::new(output_path) or a temp file/dir under output_path), ensuring the export destination variable (output_path/output_dir) is in scope; keep the same use with stream_chain(..., FileBackedCidHashSet::new(...)? ) and when returning the fallback so the seen-set is always backed at the chosen output location rather than ".".
🧹 Nitpick comments (3)
src/ipld/util.rs (1)
201-246: Update the public docs for the new `seen` parameter. `stream_chain`/`stream_graph` now require callers to provide the dedup store, but the doc comment still describes only the old three-argument API. Please document what `seen` is expected to do and who owns its storage/lifecycle.
As per coding guidelines: "Document public functions and structs with doc comments".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ipld/util.rs` around lines 201 - 246, Update the public doc comments for stream_chain and stream_graph to document the new seen parameter: explain that seen is a deduplication store of CIDs (implements CidHashSetLike) used to avoid revisiting already-yielded nodes during DFS, specify that the caller provides and owns the seen storage (ownership/lifecycle managed by caller and may be reused across calls for incremental traversal), state expected behavior when seen contains CIDs (they will be skipped) and when it's empty, and note any thread-safety or mutability expectations (e.g., that seen will be mutated by the stream). Also update the top-level function docs to mention seen in the argument list and its purpose so downstream callers know to construct and pass an appropriate CidHashSetLike instance.
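One possible wording for the requested doc comment; illustrative only. The signature is abbreviated from the call sites visible in this PR, not copied from the code:

```rust
/// Streams the chain graph depth-first, skipping any CID already in `seen`.
///
/// `seen` is the deduplication backend (any `CidHashSetLike`): CIDs it
/// already contains are not yielded again, and the stream mutates it as it
/// walks. The caller constructs and owns `seen`, and can recover it
/// afterwards (e.g. via `into_seen()`) to reuse across calls.
pub fn stream_chain<DB, T, S>(db: DB, tipset_iter: T, stateroot_limit: i64, seen: S) -> ChainStream<DB, T, S>
```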
src/db/gc/snapshot.rs (1)
101-103: Add path context to the temp-dir setup. This introduces a new filesystem dependency during GC initialization, but a bare `create_dir_all` error won't say which path failed.
As per coding guidelines: "Use `anyhow::Result<T>` for most operations and add context with `.context()` when errors occur".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/db/gc/snapshot.rs` around lines 101 - 103, The call to std::fs::create_dir_all(&chain_tmp_root) lacks contextual error information; wrap the filesystem operation with anyhow context so failures include the path (use .with_context or .context()) e.g. add context referencing chain_tmp_root and chain_data_path; update the create_dir_all call near the chain_tmp_root variable initialization so any error message names the tmp dir path, and keep db_root(&chain_data_path)? unchanged.
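A minimal, self-contained sketch of the suggested context; the helper name and message text are illustrative:

```rust
use anyhow::{Context, Result};
use std::path::Path;

fn ensure_chain_tmp_root(chain_tmp_root: &Path) -> Result<()> {
    // Failures now name the offending path instead of a bare io::Error.
    std::fs::create_dir_all(chain_tmp_root)
        .with_context(|| format!("failed to create GC temp dir {}", chain_tmp_root.display()))
}
```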
src/chain/mod.rs (1)
35-41: Document the new `seen` backend contract on `ExportOptions`. `ExportOptions<S>` now carries the main behavioral choice for export deduplication, but the struct itself has no rustdoc explaining what `seen` is expected to do or when callers should prefer `CidHashSet` versus `FileBackedCidHashSet`. A short doc comment here would make the generic API much easier to use correctly.
As per coding guidelines, `**/*.rs`: Document public functions and structs with doc comments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/chain/mod.rs` around lines 35 - 41, Add a Rust doc comment to the public ExportOptions<S> struct documenting that the seen field is the export deduplication backend: it must behave like a set of CIDs with contains/insert semantics used to decide whether a CID has already been exported; document when to use the in-memory CidHashSet (ephemeral, faster) versus FileBackedCidHashSet (persistent across process restarts), and note any expected trait/behavioral requirements (e.g., that callers provide a type S implementing set-like operations used by the export logic). Explain typical ownership/lifetime expectations (caller supplies and retains the backend) so callers know how to choose and supply seen.
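A hedged sketch of such a doc comment. The non-`seen` fields are taken from the suggested fix earlier in this thread; the wording and the bound-free generic are illustrative:

```rust
/// Options controlling a chain export.
pub struct ExportOptions<S> {
    /// Deduplication backend for CIDs already written to the exported CAR.
    ///
    /// `CidHashSet` keeps the set in memory (fastest, RAM grows with unique
    /// CIDs); `FileBackedCidHashSet` spills it to disk to bound memory on
    /// mainnet-sized exports. The caller constructs and owns the backend.
    pub seen: S,
    // skip_checksum, include_receipts, include_events, include_tipset_keys ...
}
```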
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cid_collections/hash_set.rs`:
- Around line 175-176: The QuickCheck helper test_cid_hashset is calling
Vec::dedup(), which only removes consecutive duplicates and allows inputs like
[a, b, a] to remain duplicated; fix by ensuring cids contains unique entries
before testing—either sort the vector then dedup (e.g., call sort() then
dedup()) or rebuild uniqueness via a HashSet and collect back to a Vec; apply
the same change to the other similar helper referenced around lines 187-188 so
both insertion loops operate on fully de-duplicated CID lists.
- Around line 13-17: Change the CidHashSetLike::insert signature to return
anyhow::Result<bool> and update the FileBackedCidHashSet implementation to
surface I/O errors instead of swallowing them: propagate errors from Db::get and
from commit_changes_bytes (do not ignore with "_ ="), wrapping them with context
(e.g., "reading seen-set" / "writing seen-set"). Update all call sites that
previously expected bool (the conditionals in functions referenced in
src/ipld/util.rs, the filter in src/utils/db/car_util.rs, and the four tests) to
handle anyhow::Result<bool> (propagate the error with ? or map_err with context,
or adjust tests to expect Result). Ensure unique symbols updated are
CidHashSetLike::insert and FileBackedCidHashSet::insert plus any internal calls
to Db::get and commit_changes_bytes.
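The `Vec::dedup` pitfall flagged in the first hash_set.rs item above is easy to demonstrate; a runnable illustration with integers standing in for CIDs:

```rust
fn main() {
    let mut v = vec![1, 2, 1];
    v.dedup();
    assert_eq!(v, vec![1, 2, 1]); // dedup only drops *consecutive* duplicates

    v.sort();
    v.dedup();
    assert_eq!(v, vec![1, 2]); // sort + dedup yields truly unique values
}
```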
In `@src/dev/subcommands/export_state_tree_cmd.rs`:
- Line 112: The current call creates a CWD-dependent seen-set with
FileBackedCidHashSet::new(".") inside IpldStream::new, which can fail or leak
files; change this to create a deterministic/writable path (e.g. use
tempfile::tempdir() or std::env::temp_dir() or a configurable output directory)
and pass that path to FileBackedCidHashSet::new instead of "." (update the code
around the IpldStream::new invocation and the FileBackedCidHashSet::new call to
construct and use the temp/config path and propagate any errors accordingly).
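For the export_state_tree item, a minimal sketch of the tempdir route. It assumes `FileBackedCidHashSet::new` accepts a directory path (as at the other call sites in this PR) and that `IpldStream::new` takes the seen-set last; both are assumptions:

```rust
// `tempfile::tempdir()` yields a unique, writable directory that is removed
// when `tmp` drops, so keep it alive for the lifetime of `seen`.
let tmp = tempfile::tempdir()?;
let seen = FileBackedCidHashSet::new(tmp.path())?;
let stream = IpldStream::new(store, roots, seen); // argument shape assumed
```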
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (22)
CHANGELOG.md, Cargo.toml, docs/docs/users/guides/gc.md, src/chain/mod.rs, src/chain/tests.rs, src/cid_collections/hash_set.rs, src/cid_collections/mod.rs, src/daemon/context.rs, src/daemon/mod.rs, src/db/gc/snapshot.rs, src/dev/subcommands/export_state_tree_cmd.rs, src/ipld/util.rs, src/rpc/methods/chain.rs, src/rpc/methods/sync.rs, src/rpc/mod.rs, src/tool/offline_server/server.rs, src/tool/subcommands/api_cmd/generate_test_snapshot.rs, src/tool/subcommands/api_cmd/test_snapshot.rs, src/tool/subcommands/archive_cmd.rs, src/tool/subcommands/benchmark_cmd.rs, src/tool/subcommands/snapshot_cmd.rs, src/utils/db/car_stream/tests.rs
✅ Files skipped from review due to trivial changes (3)
- docs/docs/users/guides/gc.md
- src/tool/subcommands/benchmark_cmd.rs
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (1)
- Cargo.toml
LesnyRumcajs left a comment
LGTM, anything from 🐰 to address?
🧹 Nitpick comments (1)
src/cid_collections/hash_set.rs (1)
157-159: Consider propagating `db.get()` errors for consistency. Read errors are swallowed with `.ok()` while write errors on line 164 are now propagated. Under disk corruption or I/O failure, this treats read failures as cache misses, potentially causing duplicate processing (though idempotent and not incorrect). For symmetric error handling:
♻️ Suggested change
```diff
-        if self.db.get(col, &key).ok().flatten().is_some() {
+        if self
+            .db
+            .get(col, &key)
+            .context("failed to read CID presence from parity-db")?
+            .is_some()
+        {
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cid_collections/hash_set.rs` around lines 157 - 159, The read path currently swallows errors by calling self.db.get(col, &key).ok().flatten().is_some(), which makes read failures appear as cache misses while writes propagate errors; change this to propagate read errors consistently: call self.db.get(col, &key) and handle the Result so that any Err from the DB is returned (propagated) instead of mapped to None, then only when Ok(Some(_)) update self.lru.insert(small, ()) and return Ok(false); use the same error type/propagation style used by the surrounding function so reads and writes behave symmetrically.
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (3)
CHANGELOG.md, src/cid_collections/hash_set.rs, src/ipld/util.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- CHANGELOG.md
♻️ Duplicate comments (2)
src/cid_collections/hash_set.rs (1)
161-168: ⚠️ Potential issue | 🟠 Major
Stop swallowing parity-db read failures. Line 161 still converts `Db::get` errors into "not present", so a corrupt or full backing store can silently reprocess CIDs and degrade export correctness. Since `insert` is already fallible, propagate the read error and add context on both DB operations. As per coding guidelines, `**/*.rs`: Use `anyhow::Result` for most operations and add context with `.context()` when errors occur.
Suggested fix
```diff
-        if self.db.get(col, &key).ok().flatten().is_some() {
+        if self
+            .db
+            .get(col, &key)
+            .context("failed to read CID presence from parity-db")?
+            .is_some()
+        {
             self.lru.insert(small, ());
             Ok(false)
         } else {
-            self.db.commit_changes_bytes([(
-                col,
-                parity_db::Operation::Set(key, EMPTY_VALUE.clone()),
-            )])?;
+            self.db
+                .commit_changes_bytes([(
+                    col,
+                    parity_db::Operation::Set(key, EMPTY_VALUE.clone()),
+                )])
+                .context("failed to persist CID presence into parity-db")?;
             self.lru.insert(small, ());
             Ok(true)
         }
```

```bash
#!/bin/bash
# Verify that `FileBackedCidHashSet::insert` still suppresses parity-db read errors
# and lacks context on the write path.
sed -n '148,172p' src/cid_collections/hash_set.rs
rg -n 'db\.get|commit_changes_bytes|context\(' src/cid_collections/hash_set.rs
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cid_collections/hash_set.rs` around lines 161 - 168, In FileBackedCidHashSet::insert, stop converting parity_db::Db::get errors into "not present": change the function to return anyhow::Result and propagate the error from self.db.get(...) instead of .ok().flatten(), using .context("reading CID presence from parity-db") to add context; likewise wrap the self.db.commit_changes_bytes(...) call with .context("writing empty value to parity-db for CID") so both read and write failures surface with helpful messages while keeping the existing LRU insert and Ok(false)/Ok(true) behavior.
src/tool/subcommands/archive_cmd.rs (1)
587-603: ⚠️ Potential issue | 🟠 Major
Avoid routing export dedupe state through the OS temp dir. These branches hard-code `FileBackedCidHashSet::new_in_temp_dir()`. On hosts where `std::env::temp_dir()` is tmpfs, the "file-backed" seen-set moves back into RAM and undercuts this PR's main memory win; it also makes export success depend on temp-dir policy instead of Forest's configured storage root.

```bash
#!/bin/bash
# Inspect the call site and the default temp-root implementation.
sed -n '587,603p' src/tool/subcommands/archive_cmd.rs
sed -n '143,145p' src/cid_collections/hash_set.rs

# Run this on a representative host to confirm what temp dir Forest will use.
python - <<'PY'
import tempfile, pathlib
print("temp_dir =", pathlib.Path(tempfile.gettempdir()).resolve())
PY

# If `/tmp` or the resolved temp dir is tmpfs/memory-backed, the seen-set is not
# really offloaded from RAM.
mount | grep -E ' on /tmp | on /var/tmp ' || true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tool/subcommands/archive_cmd.rs` around lines 587 - 603, The code currently uses FileBackedCidHashSet::new_in_temp_dir() for the dedupe "seen" set in archive_cmd.rs (the seen variable and stream_chain creation), which forces reliance on OS temp directory; change this to create the file-backed set inside Forest's configured storage root instead. Replace calls to FileBackedCidHashSet::new_in_temp_dir() with a constructor that accepts a Path (e.g., FileBackedCidHashSet::new_at(...) or new_in_dir(...)) and pass a path derived from the configured storage root (use the archive command's config/Store/root path or similar configuration value) with a unique per-run filename (like "archive-seen-<pid|uuid>") and ensure any cleanup/rotation semantics are preserved; update both branches (the diff branch and the else branch) so the same storage-root-backed path is used.
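A sketch of the call-site change, assuming a `Path`-accepting constructor like the `FileBackedCidHashSet::new(dir)` used elsewhere in this review; `config.storage_root` is a hypothetical stand-in for Forest's configured storage root:

```rust
// Back the seen-set under the configured storage root rather than the OS
// temp dir, with a unique per-run subdirectory (cleaned up when dropped).
let seen_root = config.storage_root.join("archive-seen"); // hypothetical config field
std::fs::create_dir_all(&seen_root)?;
let seen_dir = tempfile::tempdir_in(&seen_root)?;
let seen = FileBackedCidHashSet::new(seen_dir.path())?;
```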
📒 Files selected for processing (3)
src/cid_collections/hash_set.rs, src/dev/subcommands/export_state_tree_cmd.rs, src/tool/subcommands/archive_cmd.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/dev/subcommands/export_state_tree_cmd.rs
Summary of changes
To reduce the memory requirement for chain export and garbage collection.
Performance impact: the time cost of chain export is ~20% larger on my box.
Changes introduced in this pull request:
Reference issue to close (if applicable)
Closes
Other information and links
Change checklist
Outside contributions
Summary by CodeRabbit
Refactor
Documentation
Chores
Tests