fix(gc): schedule automatic GC only when the node is in sync #6165

hanabi1224 merged 4 commits into main
Conversation
Walkthrough

Threads a new `SyncStatus` handle (`Arc<RwLock<SyncStatusReport>>`) from the chain follower through the daemon into the snapshot garbage collector, so that automatic GC is scheduled only when the node is in sync.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Daemon
participant ChainFollower
participant SyncStatus as "SyncStatus\n(Arc<RwLock<SyncStatusReport>>)"
participant SnapGC as SnapshotGarbageCollector
participant Services as "start_services callback"
Daemon->>ChainFollower: start()
ChainFollower-->>Daemon: chain_follower.sync_status (SyncStatus)
Daemon->>SnapGC: set_sync_status(sync_status)
Note right of SnapGC #e8f5e9: stores SyncStatus handle
Daemon->>Services: start_services(ctx, sync_status)
```

```mermaid
sequenceDiagram
autonumber
participant SnapGC as SnapshotGarbageCollector
participant SyncStatus as "SyncStatus\n(Arc<RwLock<SyncStatusReport>>)"
participant CARDB as "CAR DB"
participant Export as "Snapshot Export"
loop Scheduler tick
SnapGC->>SyncStatus: read current_head_epoch, network_head_epoch, sync flags
SnapGC->>CARDB: read car_db_head_epoch
alt in-sync && no forks && epoch progressed
SnapGC->>Export: run export/GC
Export-->>SnapGC: completed
SnapGC->>SnapGC: compute & log duration
else not scheduled
SnapGC->>SnapGC: log "not scheduled (network_head_epoch=...)"
end
end
```
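As a rough companion to the second diagram, here is a minimal, self-contained sketch of the gating decision made on each scheduler tick. The struct and field names (`SyncStatusReport`, `synced`, `active_forks`) are simplified stand-ins for the real types in `src/chain_sync/sync_status.rs`, not Forest's actual code.

```rust
use std::sync::Arc;
use parking_lot::RwLock;

// Simplified stand-in for the real report; `synced` models
// `status == NodeSyncStatus::Synced` / `is_synced()`.
struct SyncStatusReport {
    current_head_epoch: i64,
    network_head_epoch: i64,
    synced: bool,
    active_forks: Vec<u64>,
}

type SyncStatus = Arc<RwLock<SyncStatusReport>>;

/// One scheduler tick: should a GC run be triggered?
fn should_trigger(sync_status: &SyncStatus, car_db_head_epoch: i64, interval_epochs: i64) -> bool {
    let report = sync_status.read();
    report.current_head_epoch > 0 // report has been initialized
        && report.current_head_epoch <= report.network_head_epoch // sane range
        && report.synced // node is in sync
        && report.active_forks.is_empty() // no fork being resolved
        && report.current_head_epoch - car_db_head_epoch >= interval_epochs // enough progress
}
```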
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ 1 failed check (warning), ✅ 2 passed
```rust
{
    let head_epoch = head.epoch();
    if head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
```

```rust
const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;
```
I noticed sync_status.status does not reflect reality after a long catchup (potential bug?), so I'm checking `head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch` manually here.
This is a potential bug in SyncStatusReport; I will file an issue for it.
This is because we are using the time diff only:

```rust
let time_diff = now_ts.saturating_sub(heaviest.min_timestamp());
if time_diff < seconds_per_epoch as u64 * SYNCED_EPOCH_THRESHOLD {
    NodeSyncStatus::Synced
} else {
    NodeSyncStatus::Syncing
}
```

But you can use
```rust
if sync_status.is_synced() && sync_status.active_forks.is_empty()
```
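To make the difference concrete, here is a hedged, standalone sketch of both checks. The 30-second epoch time matches Filecoin, but the constants and function names are illustrative only; the real values live in Forest's config.

```rust
// Illustrative constants, not Forest's actual definitions.
const SECONDS_PER_EPOCH: u64 = 30;
const SYNCED_EPOCH_THRESHOLD: u64 = 2;
const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;

// Time-based check, as in the SyncStatusReport snippet above.
fn synced_by_time(now_ts: u64, heaviest_min_timestamp: u64) -> bool {
    now_ts.saturating_sub(heaviest_min_timestamp) < SECONDS_PER_EPOCH * SYNCED_EPOCH_THRESHOLD
}

// Epoch-based check, as applied manually in the GC scheduler.
fn synced_by_epoch(head_epoch: i64, network_head_epoch: i64) -> bool {
    head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch
}

fn main() {
    // With a threshold of 2, a head at epoch 100 counts as in sync
    // while the network head is at 102 or below.
    assert!(synced_by_epoch(100, 102));
    assert!(!synced_by_epoch(100, 103));
    // Time view: 90s behind at 30s/epoch is 3 epochs, i.e. not synced.
    assert!(!synced_by_time(1_000_090, 1_000_000));
}
```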
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/db/gc/snapshot.rs (1)

141-151: Critical: running flag stays true on export failure, blocking future GC

If export_snapshot() returns Err, running is never reset, so the scheduler will never trigger again.
Proposed fix:

```diff
 } else {
     self.running.store(true, Ordering::Relaxed);
-    if let Err(e) = self.export_snapshot().await {
-        tracing::warn!("{e}");
-    }
+    match self.export_snapshot().await {
+        Ok(_) => {}
+        Err(e) => {
+            tracing::warn!("{e}");
+            // Ensure scheduler can re-trigger after a failed run
+            self.running.store(false, Ordering::Relaxed);
+        }
+    }
 }
```
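An alternative pattern worth noting: a drop guard resets the flag on every exit path, including `?` returns. This is only a sketch, and it would also clear the flag on success, whereas the fix above deliberately leaves it set (a successful GC run restarts the node, per the discussion further down).

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical guard: clears the flag when dropped, covering early
/// returns and errors alike.
struct RunningGuard<'a>(&'a AtomicBool);

impl Drop for RunningGuard<'_> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Relaxed);
    }
}

fn run(flag: &AtomicBool) -> Result<(), String> {
    flag.store(true, Ordering::Relaxed);
    let _guard = RunningGuard(flag);
    Err("export failed".into()) // flag still resets when _guard drops
}

fn main() {
    let flag = AtomicBool::new(false);
    let _ = run(&flag);
    assert!(!flag.load(Ordering::Relaxed));
}
```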
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)
178-196: Scheduling gates look right; make threshold configurable

Logic correctly prevents GC unless in-sync and the interval has elapsed. Consider making the in-sync threshold tunable via env/config for different networks.
Apply within this block:

```diff
-const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;
-let sync_status = &*sync_status.read();
+let in_sync_epoch_threshold: i64 = std::env::var("FOREST_SNAPSHOT_GC_IN_SYNC_EPOCH_THRESHOLD")
+    .ok()
+    .and_then(|v| v.parse().ok())
+    .unwrap_or(2);
+let sync_status = &*sync_status.read();
 let network_head_epoch = sync_status.network_head_epoch;
 let head_epoch = sync_status.current_head_epoch;
 if head_epoch > 0
-    && head_epoch <= network_head_epoch
-    && head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch
+    && head_epoch <= network_head_epoch
+    && head_epoch + in_sync_epoch_threshold >= network_head_epoch
     && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
     && self.trigger_tx.try_send(()).is_ok()
 {
```
📜 Review details
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (2)
- src/daemon/mod.rs (5 hunks)
- src/db/gc/snapshot.rs (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/db/gc/snapshot.rs (1)
src/daemon/mod.rs (1)
start(532-574)
src/daemon/mod.rs (2)
src/db/gc/snapshot.rs (2)
set_db (129-131), set_sync_status (137-139)

src/chain_sync/chain_follower.rs (1)
chain_follower(134-331)
🔇 Additional comments (4)
src/db/gc/snapshot.rs (2)
77-77: Sync status plumbing: LGTM

Field, init, and setter are correct and align with daemon propagation.
Also applies to: 115-116, 137-139
249-253: Export duration logging: LGTM

Clear, human-readable timing; consistent with other logs.
src/daemon/mod.rs (2)
565-569: Wiring sync_status into GC: LGTM

Correctly sets DB, sync_status, and CAR DB head epoch in the init callback.
581-581: Use FnOnce for the init callback

The callback runs once; FnOnce is more flexible (allows move captures) with no downside here.

```diff
 pub(super) async fn start_services(
     start_time: chrono::DateTime<chrono::Utc>,
     opts: &CliOpts,
     mut config: Config,
     shutdown_send: mpsc::Sender<()>,
-    on_app_context_and_db_initialized: impl Fn(&AppContext, Arc<RwLock<SyncStatusReport>>),
+    on_app_context_and_db_initialized: impl FnOnce(&AppContext, Arc<RwLock<SyncStatusReport>>),
 ) -> anyhow::Result<()> {
```

All call sites already pass closures compatible with FnOnce.
```rust
running: AtomicBool,
blessed_lite_snapshot: RwLock<Option<PathBuf>>,
db: RwLock<Option<Arc<DB>>>,
sync_status: RwLock<Option<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>>,
```
Do you think this will be better here?

`RwLock<Option<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>>` -> `OnceLock<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>`

This way we can avoid doing `&*sync_status.read()`.
It's set more than once: it is set again after every GC.
Yeah, I missed that this GC restarts the node.
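For context on why `OnceLock` does not fit: it is write-once, while this field must be re-set after each GC-triggered restart. A minimal standalone demo:

```rust
use std::sync::OnceLock;

fn main() {
    let slot: OnceLock<u64> = OnceLock::new();
    assert!(slot.set(1).is_ok());  // first set succeeds
    assert!(slot.set(2).is_err()); // second set is rejected: write-once
    assert_eq!(slot.get(), Some(&1));
}
```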
Why do we need two RwLocks here?
Could we simplify this type?
@elmattic simplified the type by adding `pub type SyncStatus = Arc<RwLock<SyncStatusReport>>;`
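A sketch of what the alias buys in signatures; the function and stand-in types here are illustrative, not the actual Forest code:

```rust
use std::sync::Arc;
use parking_lot::RwLock;

struct SyncStatusReport; // stand-in for the real report type

pub type SyncStatus = Arc<RwLock<SyncStatusReport>>;

// Before: fn set_sync_status(s: Arc<RwLock<SyncStatusReport>>)
// After: the alias keeps signatures short without changing lock usage.
fn set_sync_status(s: SyncStatus) {
    let _report = s.read();
}

fn main() {
    set_sync_status(Arc::new(RwLock::new(SyncStatusReport)));
}
```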
Force-pushed: 794c82a to 94f7193, then 94f7193 to da3f9f6.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)
186-189: Consider adding an active forks check.

Based on past review discussions, it was suggested to also check `sync_status.active_forks.is_empty()` in addition to verifying the sync status. Running GC during active fork resolution could potentially interfere with block validation.

Apply this diff to add the active forks check:

```diff
 if head_epoch > 0
     && head_epoch <= network_head_epoch
     && sync_status.status == NodeSyncStatus::Synced
+    && sync_status.active_forks.is_empty()
     && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
     && self.trigger_tx.try_send(()).is_ok()
```

Based on past review comments.
📜 Review details
📒 Files selected for processing (2)
- src/daemon/mod.rs (5 hunks)
- src/db/gc/snapshot.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/daemon/mod.rs
🧰 Additional context used
🧬 Code graph analysis (1)
src/db/gc/snapshot.rs (1)
src/daemon/mod.rs (1)
start(532-574)
🔇 Additional comments (1)
src/db/gc/snapshot.rs (1)
249-253: LGTM! Good observability improvement.

The addition of export duration logging is helpful for monitoring GC performance and diagnosing potential issues.
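For reference, a minimal sketch of this style of duration logging, assuming `Instant`-based measurement and the `humantime` crate for formatting (the actual code at lines 249-253 of `src/db/gc/snapshot.rs` may differ; `tracing` and `humantime` are crate dependencies):

```rust
use std::time::Instant;

fn export_with_timing() {
    let start = Instant::now();
    // ... run the snapshot export here ...
    tracing::info!(
        "Snapshot export completed in {}",
        humantime::format_duration(start.elapsed())
    );
}
```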
```rust
let sync_status = &*sync_status.read();
let network_head_epoch = sync_status.network_head_epoch;
let head_epoch = sync_status.current_head_epoch;
if head_epoch > 0
```
nit: this test condition is quite a beast, maybe it's time to break it down into functions and add some comments.
I've added a comment to each condition.
Force-pushed: a593c5c to 710f042.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/db/gc/snapshot.rs (1)
148-149: Resetting the running flag on export error, nice catch.

This addresses the earlier concern about being unable to re-trigger GC after a failure.
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)
181-196: Make the scheduling predicate clearer and reusable

Logic is correct. For readability and testability, consider extracting the predicate and making the non-negative gap explicit.
Apply this diff to factor the check and clarify the gap:

```diff
@@
 loop {
     if !self.running.load(Ordering::Relaxed)
         && let Some(car_db_head_epoch) = *self.car_db_head_epoch.read()
         && let Some(sync_status) = &*self.sync_status.read()
     {
-        let sync_status = &*sync_status.read();
-        let network_head_epoch = sync_status.network_head_epoch;
-        let head_epoch = sync_status.current_head_epoch;
-        if head_epoch > 0 // sync_status has been initialized
-            && head_epoch <= network_head_epoch // head epoch is within a sane range
-            && sync_status.is_synced() // chain is in sync
-            && sync_status.active_forks.is_empty() // no active fork
-            && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs // the gap between chain head and car_db head is above threshold
-            && self.trigger_tx.try_send(()).is_ok()
+        let report = &*sync_status.read();
+        let network_head_epoch = report.network_head_epoch;
+        let head_epoch = report.current_head_epoch;
+        let gap_ok = head_epoch >= car_db_head_epoch
+            && (head_epoch - car_db_head_epoch) >= snap_gc_interval_epochs;
+        if should_schedule_gc(head_epoch, network_head_epoch, report, gap_ok)
+            && self.trigger_tx.try_send(()).is_ok()
         {
             tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC scheduled");
         } else {
             tracing::debug!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC not scheduled");
         }
     }
     tokio::time::sleep(snap_gc_check_interval).await;
 }
```

Add this helper inside impl:
```rust
fn should_schedule_gc(
    head_epoch: ChainEpoch,
    network_head_epoch: ChainEpoch,
    report: &crate::chain_sync::SyncStatusReport,
    gap_ok: bool,
) -> bool {
    head_epoch > 0
        && head_epoch <= network_head_epoch
        && report.is_synced()
        && report.active_forks.is_empty()
        && gap_ok
}
```
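Since testability motivates the refactor, a hedged sketch of the kind of unit test it enables, using a simplified stand-in for the report type:

```rust
// Simplified stand-in for SyncStatusReport, for this sketch only.
struct Report {
    synced: bool,
    active_forks: Vec<u64>,
}

impl Report {
    fn is_synced(&self) -> bool {
        self.synced
    }
}

fn should_schedule_gc(head: i64, net_head: i64, r: &Report, gap_ok: bool) -> bool {
    head > 0 && head <= net_head && r.is_synced() && r.active_forks.is_empty() && gap_ok
}

#[test]
fn schedules_only_when_synced_and_gap_elapsed() {
    let r = Report { synced: true, active_forks: vec![] };
    assert!(should_schedule_gc(100, 101, &r, true));
    assert!(!should_schedule_gc(100, 101, &r, false)); // gap below threshold
    assert!(!should_schedule_gc(0, 101, &r, true)); // report not initialized
}
```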
📜 Review details
📒 Files selected for processing (7)
- src/chain_sync/chain_follower.rs (2 hunks)
- src/chain_sync/mod.rs (1 hunk)
- src/chain_sync/sync_status.rs (2 hunks)
- src/daemon/mod.rs (4 hunks)
- src/db/gc/snapshot.rs (6 hunks)
- src/health/mod.rs (3 hunks)
- src/rpc/mod.rs (1 hunk)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/rpc/mod.rs
- src/daemon/mod.rs
- src/health/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
src/chain_sync/chain_follower.rs (1)
src/chain_sync/tipset_syncer.rs (1)
validate_tipset(99-139)
src/db/gc/snapshot.rs (2)
src/chain_sync/chain_follower.rs (2)
new (83-110), new (545-556)

src/daemon/mod.rs (1)
start(531-573)
🔇 Additional comments (5)
src/chain_sync/sync_status.rs (1)
10-10: Good aliasing of sync status with parking_lot RwLock

The `SyncStatus` alias and the `parking_lot::RwLock` import simplify signatures and keep locking lightweight. LGTM.

Based on learnings
Also applies to: 105-106
src/db/gc/snapshot.rs (2)
77-77: Plumbing SyncStatus into GC is clean and consistent

Storing and setting the `SyncStatus` handle looks good and aligns with the new alias across the codebase.

Also applies to: 115-116, 137-140
233-254: Nice timing instrumentation for snapshot export

Measuring and logging export duration improves observability. LGTM.
src/chain_sync/mod.rs (1)
19-21: Re-exporting SyncStatus is the right API polish

This avoids leaking concrete lock types across modules and simplifies imports.
src/chain_sync/chain_follower.rs (1)
47-48: Adopting SyncStatus alias throughout ChainFollower looks correct

Field, constructor, function signature, and update site are consistently switched to the alias; lock usage is appropriate and scoped.
Also applies to: 94-95, 137-138, 242-248
Summary of changes
This fixes a rare condition where GC kicks in during node catchup and breaks block validation.
Changes introduced in this pull request:
- Schedule automatic snapshot GC only when the node is in sync
- `cargo update -p half` to update the yanked `half@2.7.0`

Reference issue to close (if applicable)
Closes
Other information and links
Change checklist
Summary by CodeRabbit
New Features
Improvements
Bug Fixes