fix: do not mark time traveling blocks as bad#6241
Conversation
WalkthroughThe changes implement a fix for sync issues caused by time-traveling blocks by propagating a shared bad block cache through the chain synchronization pipeline. The cache is now passed through Changes
Sequence DiagramsequenceDiagram
participant CF as chain_follower
participant SSM as SyncStateMachine
participant ST as SyncTask
participant TV as validate_tipset
participant BBC as BadBlockCache
CF->>CF: Initialize with bad_block_cache
CF->>SSM: Pass bad_block_cache.clone()
SSM->>ST: Spawn SyncTask with bad_block_cache
ST->>TV: execute() with bad_block_cache param
TV->>TV: Validate tipset blocks
alt Validation Success
TV-->>ST: Ok(())
else Validation Failure (non-time-travel)
TV->>BBC: push(failed_block_cid)
BBC-->>TV: Logged warning
TV-->>ST: Err(TipsetSyncerError)
else Time-Travel Error
TV-->>ST: Err(TimeTravellingBlock)
end
ST-->>SSM: SyncEvent
SSM-->>CF: State transition
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20–25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/chain_sync/bad_block_cache.rs (1)
32-35: Use structured fields for tracing and avoid string formatting.Prefer structured, queryable logs and avoid allocs in the hot path.
Apply:
- tracing::warn!("Marked bad block: {c}"); + tracing::warn!(cid = %c, "Marked bad block");CHANGELOG.md (1)
36-39: Clarify the fix and align wording.Be explicit that future-timestamp blocks aren’t marked bad and are retried later.
[respond_minimally]
-- [#6241](https://github.com/ChainSafe/forest/pull/6241) Fixed a sync issue that is caused by time traveling block(s). +- [#6241](https://github.com/ChainSafe/forest/pull/6241) Do not mark time‑travelling (future‑timestamp) blocks as bad; return a temporal error and retry later (Lotus parity).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)src/chain_sync/bad_block_cache.rs(1 hunks)src/chain_sync/chain_follower.rs(4 hunks)src/chain_sync/tipset_syncer.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5944
File: src/chain/store/index.rs:0-0
Timestamp: 2025-08-18T03:09:47.932Z
Learning: In Forest's tipset_by_height caching implementation, hanabi1224 prefers performance-conscious solutions that leverage finality guarantees rather than expensive chain walking for fork detection. The approach of constraining cache lookups to finalized epochs (using CHECKPOINT_INTERVAL >= CHAIN_FINALITY) provides fork safety without the performance cost of ancestry verification.
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 6057
File: src/cli/subcommands/f3_cmd.rs:0-0
Timestamp: 2025-09-09T10:37:17.947Z
Learning: hanabi1224 prefers having default timeouts (like 10m for --no-progress-timeout) to prevent commands from hanging indefinitely, even when the timeout flag isn't explicitly provided by users. This fail-fast approach is preferred over requiring explicit flag usage.
📚 Learning: 2025-08-11T14:00:47.060Z
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5886
File: src/daemon/db_util.rs:236-259
Timestamp: 2025-08-11T14:00:47.060Z
Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.
Applied to files:
src/chain_sync/tipset_syncer.rs
📚 Learning: 2025-08-08T12:11:55.266Z
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5867
File: src/ipld/util.rs:461-487
Timestamp: 2025-08-08T12:11:55.266Z
Learning: Forest (src/ipld/util.rs, Rust): In UnorderedChainStream::poll_next, dropping `extract_sender` (when no more tipsets and the extract queue is empty) is the intended shutdown signal for workers. Any subsequent attempt to enqueue work after this drop is a logic error and should be treated as an error; do not change `send()` to ignore a missing sender.
Applied to files:
src/chain_sync/tipset_syncer.rs
🧬 Code graph analysis (1)
src/chain_sync/chain_follower.rs (1)
src/chain_sync/tipset_syncer.rs (1)
validate_tipset(100-147)
🪛 GitHub Actions: Link Checker
src/chain_sync/bad_block_cache.rs
[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")
src/chain_sync/tipset_syncer.rs
[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")
src/chain_sync/chain_follower.rs
[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build MacOS
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build Ubuntu
- GitHub Check: All lint checks
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: Check
🔇 Additional comments (2)
src/chain_sync/tipset_syncer.rs (1)
105-106: Don’t mark transient validation errors as permanently bad (exclude TipsetParentNotFound).Current logic excludes only TimeTravellingBlock; TipsetParentNotFound is also transient (see Lines 196–205), but will be pushed to bad_block_cache, causing permanent rejection. Align with Lotus’ isPermanent(err) idea.
Apply within validate_tipset:
- warn!("Validating block [CID = {cid}] in EPOCH = {epoch} failed: {why}"); - match &why { - TipsetSyncerError::TimeTravellingBlock(_, _) => { - // Do not mark a block as bad for temporary errors. - // See <https://github.com/filecoin-project/lotus/blob/v1.34.1/chain/sync.go#L602> in Lotus - } - _ => { - if let Some(bad_block_cache) = bad_block_cache { - bad_block_cache.push(cid); - } - } - }; + warn!(epoch, %cid, error = %why, "block validation failed"); + let is_temp = matches!( + &why, + TipsetSyncerError::TimeTravellingBlock(_, _) + | TipsetSyncerError::TipsetParentNotFound(_) + ); + if !is_temp { + if let Some(ref bad_block_cache) = bad_block_cache { + bad_block_cache.push(cid); + } + }Optionally extract the predicate:
fn is_temporary_sync_error(e: &TipsetSyncerError) -> bool { matches!(e, TipsetSyncerError::TimeTravellingBlock(_, _) | TipsetSyncerError::TipsetParentNotFound(_) ) }Then use
if !is_temporary_sync_error(&why) { … }.Also applies to: 129-140
⛔ Skipped due to learnings
Learnt from: hanabi1224 Repo: ChainSafe/forest PR: 5886 File: src/daemon/db_util.rs:236-259 Timestamp: 2025-08-11T14:00:47.060Z Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.src/chain_sync/chain_follower.rs (1)
233-234: Ripgrep found the validate_tipset call site as expected, but cargo check failed due to environment limitations.The
rgsearch located the call tovalidate_tipsetat line 851, which falls within the changed range (852–859). However, the sandbox environment lacks a C compiler, preventingcargo checkfrom completing. The structural changes appear correct, but manual verification in your local environment is required to confirm compilation:cargo check -qAlso applies to: 260-261, 829-835, 852-859
Summary of changes
To match the logic in Lotus: filecoin-project/lotus#535
Changes introduced in this pull request:
Reference issue to close (if applicable)
Closes
Other information and links
Change checklist
Summary by CodeRabbit