Skip to content

feat: gate mining on local tip catching up to peers' best head#356

Open
constwz wants to merge 10 commits into
developfrom
feat/mining-anti-fork-gate
Open

feat: gate mining on local tip catching up to peers' best head#356
constwz wants to merge 10 commits into
developfrom
feat/mining-anti-fork-gate

Conversation

@constwz
Copy link
Copy Markdown
Contributor

@constwz constwz commented May 20, 2026

Description

Add a peer-head-vs-local-tip lag gate in the BSC miner so a validator refuses to extend a stale tip while it is still behind the network, preventing the validator from persisting a divergent fork to disk when other sync paths (fork_recover, staged-sync pipeline) are temporarily unavailable.

Rationale

cef78d6 deliberately removed the is_syncing() mining gate that main carries, on the grounds that network.is_syncing() never clears in the all-validators-restart cold start and would deadlock mining. The replacement gate (is_network_ready_to_mine) only checks num_connected_peers() > 0, so the miner now produces blocks the instant any peer is connected — without ever checking whether the local tip has caught up to that peer.

In the same commit, the Syncing handling on new_payload was redirected from "synthesize an FCU, let engine-tree's optimistic-sync trigger the staged-sync pipeline" to "spawn fork_recover over bsc/2". This means small-gap recovery (< MAX_FORK_DEPTH) no longer flips network.is_syncing() to true, because the staged-sync pipeline never runs for those gaps — fork_recover runs in its own task.

Combined, the two changes leave a window where:

  • the node has connected peers (peer-ready gate passes),
  • the node is genuinely behind by a small gap (under PIPELINE_TRIGGER_DELTA = 2048),
  • fork_recover is the only active recovery and it is silently failing (bsc/2 stale tx, peer timeout, packet loss, etc.),
  • and the miner sees no gating signal at all and starts producing blocks on top of the stale local tip.

Each mined block is immediately FCU'd to canonical, written to MDBX, and accumulated into TrieDB pathdb difflayers. Periodic pathdb flush pushes the disk layer onto that divergent chain. After ~20 minutes a 3-validator qanet validator can carry a >1700-block divergent fork on disk that cannot self-heal — decide_startup_alignment (reth #175) re-aligns MDBX to TrieDB consistently on the wrong chain, and fork_recover cannot bridge across pathdb-gap'd ancestors.

This PR adds an independent gate that does not depend on is_syncing() or on any specific recovery path: ask peers directly what they think the head is, and refuse to mine if the local tip lags beyond a threshold.

Example

Without the gate (today on develop), a validator that fell behind during a bsc/2 startup race logs:

INFO  bsc::miner: Try new work tip_block=20004789 ...
DEBUG bsc::block_import: Sending new block to import service: number=20004790 ...
INFO  bsc::block_import: New payload returned Syncing - spawning fork recovery block_number=20004790
WARN  bsc::block_import: Fork recovery failed ... err=failed to send GetBlocksByRange command
                                       (block 20004790 NOT imported — local tip stays at 20004789)
INFO  bsc::miner: Sealing local block on parent 20004789 ...
                                       (mined on stale tip → divergent fork written to disk)

With the gate, the same race instead logs:

DEBUG bsc::miner: Skip mining: local tip lags peers' best head beyond threshold local_tip=20004789 peer_best=20006498 lag=1709 threshold=5

Local tip stays put. No fork is persisted. Once fork_recover / pipeline catches up (or operator intervenes), mining resumes automatically the next time the lag closes under threshold.

Changes

src/node/miner/bsc_miner.rs — one new async gate is_caught_up_to_peers(local_tip) invoked from try_new_work right after is_network_ready_to_mine:

  1. network.get_all_peers().await → take max(p.best_number) across connected peers.
  2. If at least one peer has a head: compute lag = peer_best.saturating_sub(local_tip); skip mining when lag > MINING_LAG_THRESHOLD (default 5, loose enough to absorb the one-block-in-flight window between an inbound NewBlock and our canonical-head update).
  3. If no peer has published a head yet (cold start, all best_number = None): grant a PEER_HEAD_WAIT_TIMEOUT_SECS (5 s) grace window via a OnceLock<Instant>, then fall through and permit mining. This preserves cef78d6's all-validators-restart deadlock-break intent without using the is_syncing() signal that no longer fires for small gaps. The fallthrough emits a WARN so it is visible in operator logs.
  4. On get_all_peers() error or missing network handle, skip mining (fail-closed).

Potential Impacts

  • Self-recovery is automatic: once the local tip is within 5 blocks of the highest peer head, mining resumes on the next canonical-state notification — no operator action required for routine sync delays.
  • Diagnostics improved: when the gate fires, the log line carries local_tip, peer_best, lag, threshold. Previously a behind-network miner produced silent divergent forks; this turns silent failure into a visible, structured WARN/DEBUG signal.
  • No new dependencies and no protocol-wire changes: uses reth_network::Peers::get_all_peers, which is already used elsewhere in the import service.
  • Compatibility with cef78d6's deadlock-break: the grace window covers the exact "no peer ever reported a head" cold start that motivated removing is_syncing(); after grace, mining proceeds.
  • Compatibility with fix: bsc protocol stale registry tx #355: the two PRs are complementary, not overlapping. fix: bsc protocol stale registry tx #355 lets bsc/2 self-heal when its sender channel goes stale; this PR ensures that during the window where bsc/2 (or any other recovery path) is unhealthy, the miner does not extend the chain locally. Both should land together to fully close the qanet failure mode that motivated them.
  • Threshold (5) deliberately loose: BSC's 0.75 s blocks mean a 5-block lag is ~3.75 s of head drift. Tighter would risk false negatives during normal NewBlock propagation; looser would risk missing the multi-second divergence window that produces deep forks.

@constwz constwz requested a review from joey0612 as a code owner May 20, 2026 07:29
@hashdit-bot
Copy link
Copy Markdown

hashdit-bot Bot commented May 20, 2026

Pull Request Review

This PR adds a new mining safety gate in src/node/miner/bsc_miner.rs that prevents mining when the local tip is lagging behind peers’ reported best head by more than a fixed threshold. It introduces is_caught_up_to_peers(local_tip) with fail-closed behavior on missing network handle or peer-query errors, a lag threshold (MINING_LAG_THRESHOLD = 5), and a bounded cold-start fallback window (PEER_HEAD_WAIT_TIMEOUT_SECS = 5) using OnceLock<Instant>. The gate is invoked in try_new_work immediately after the existing network-readiness check.

Sensitive Content

No sensitive content detected.

Security Issues

No serious security issues detected.


Generated by Hashdit Bot. This tool can absolutely NOT replace manual audits.

@hashdit-bot
Copy link
Copy Markdown

hashdit-bot Bot commented May 20, 2026

Pull Request Review

This PR adds a new mining safety gate in bsc_miner.rs that suppresses mining when the local tip lags behind peers’ reported best head by more than a threshold, with a bounded cold-start grace period when no peer head is yet available. It also introduces fail-closed behavior for missing network handle / peer query errors and wires this gate into try_new_work before block production. Additionally, it increases BSC protocol handshake timeout in stream.rs from 5s to 30s and expands documentation/comments explaining startup race conditions and rationale.

Sensitive Content

No sensitive content detected.

Security Issues

No serious security issues detected.


Generated by Hashdit Bot. This tool can absolutely NOT replace manual audits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants