feat: gate mining on local tip catching up to peers' best head #356
constwz wants to merge 10 commits into
Pull Request Review
This PR adds a new mining safety gate in
Sensitive Content
No sensitive content detected.
Security Issues
No serious security issues detected.
Generated by Hashdit Bot. This tool can absolutely NOT replace manual audits.
Description
Add a peer-head-vs-local-tip lag gate in the BSC miner so that a validator refuses to extend a stale tip while it is still behind the network. This prevents the validator from persisting a divergent fork to disk when the other sync paths (`fork_recover`, the staged-sync pipeline) are temporarily unavailable.
Rationale
cef78d6 deliberately removed the `is_syncing()` mining gate that `main` carries, on the grounds that `network.is_syncing()` never clears in the all-validators-restart cold start and would deadlock mining. The replacement gate (`is_network_ready_to_mine`) only checks `num_connected_peers() > 0`, so the miner now produces blocks the instant any peer is connected, without ever checking whether the local tip has caught up to that peer.

In the same commit, the `Syncing` handling on `new_payload` was redirected from "synthesize an FCU, let engine-tree's optimistic-sync trigger the staged-sync pipeline" to "spawn `fork_recover` over bsc/2". This means small-gap recovery (< `MAX_FORK_DEPTH`) no longer flips `network.is_syncing()` to true, because the staged-sync pipeline never runs for those gaps: `fork_recover` runs in its own task.

Combined, the two changes leave a window where:
- […] (`PIPELINE_TRIGGER_DELTA = 2048`),
- `fork_recover` is the only active recovery and it is silently failing (bsc/2 stale tx, peer timeout, packet loss, etc.),

Each mined block is immediately FCU'd to canonical, written to MDBX, and accumulated into TrieDB pathdb difflayers. Periodic pathdb flush pushes the disk layer onto that divergent chain. After ~20 minutes, a validator in a 3-validator qanet can carry a >1700-block divergent fork on disk that cannot self-heal: `decide_startup_alignment` (reth #175) re-aligns MDBX to TrieDB consistently on the wrong chain, and `fork_recover` cannot bridge across pathdb-gap'd ancestors.

This PR adds an independent gate that does not depend on `is_syncing()` or on any specific recovery path: ask peers directly what they think the head is, and refuse to mine if the local tip lags beyond a threshold.
Example
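To illustrate why a peer-count check alone is not enough, here is a minimal sketch contrasting it with a lag check. The function names `peer_count_gate` and `lag_gate` are hypothetical and only mirror the behavior described in the rationale; they are not the miner's actual code.

```rust
/// Hypothetical stand-in for the current gate: mining is allowed as soon as
/// at least one peer is connected, regardless of how far behind we are.
fn peer_count_gate(connected_peers: usize) -> bool {
    connected_peers > 0
}

/// Hypothetical lag gate: mining is allowed only if the local tip is within
/// `threshold` blocks of the best head reported by peers.
/// `saturating_sub` keeps the lag at 0 when the local tip is ahead of peers.
fn lag_gate(local_tip: u64, peer_best: u64, threshold: u64) -> bool {
    peer_best.saturating_sub(local_tip) <= threshold
}
```

For a validator at block 10,000 with a single peer at block 11,700, `peer_count_gate(1)` returns `true` while `lag_gate(10_000, 11_700, 5)` returns `false`, which is the case the new gate is meant to catch.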
Without the gate (today on `develop`), a validator that fell behind during a bsc/2 startup race logs:

With the gate, the same race instead logs:

The local tip stays put. No fork is persisted. Once `fork_recover` or the pipeline catches up (or an operator intervenes), mining resumes automatically the next time the lag closes under the threshold.
Changes
`src/node/miner/bsc_miner.rs`: one new async gate, `is_caught_up_to_peers(local_tip)`, invoked from `try_new_work` right after `is_network_ready_to_mine`:
- `network.get_all_peers().await` → take `max(p.best_number)` across connected peers.
- `lag = peer_best.saturating_sub(local_tip)`; skip mining when `lag > MINING_LAG_THRESHOLD` (default 5, loose enough to absorb the one-block-in-flight window between an inbound `NewBlock` and our canonical-head update).
- […] (`best_number = None`): grant a `PEER_HEAD_WAIT_TIMEOUT_SECS` (5 s) grace window via a `OnceLock<Instant>`, then fall through and permit mining. This preserves cef78d6's all-validators-restart deadlock-break intent without using the `is_syncing()` signal that no longer fires for small gaps. The fallthrough emits a WARN so it is visible in operator logs.
- On `get_all_peers()` error or a missing network handle, skip mining (fail-closed).
Potential Impacts
- […] `local_tip`, `peer_best`, `lag`, `threshold`. Previously a behind-network miner produced silent divergent forks; this turns silent failure into a visible, structured WARN/DEBUG signal.
- […] `reth_network::Peers::get_all_peers`, which is already used elsewhere in the import service.
- […] `is_syncing()`; after grace, mining proceeds.
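
The decision flow listed under Changes can be sketched as a pure function. This is an illustration under stated assumptions, not the code in `bsc_miner.rs`: the `GateDecision` enum, the `decide` function, and its parameters are hypothetical, the peer query is modeled as a plain `Result`, and the grace-window start is passed in explicitly instead of being stored in a `OnceLock<Instant>`. Only the two constants and their values come from the description above.

```rust
use std::time::{Duration, Instant};

// Values as stated in the PR description.
const MINING_LAG_THRESHOLD: u64 = 5;
const PEER_HEAD_WAIT_TIMEOUT_SECS: u64 = 5;

/// Hypothetical outcome type, used here only for illustration.
#[derive(Debug, PartialEq, Eq)]
enum GateDecision {
    Mine,
    SkipLagging { peer_best: u64, lag: u64 },
    SkipWaitingForPeerHead,
    MineAfterGrace,
    SkipPeerQueryFailed,
}

/// `peer_heads`: `Err` models a failed peer query or a missing network
/// handle; each slice entry is one connected peer's optional best number.
/// `waiting_since`: when the gate first observed that no peer had a head.
fn decide(
    local_tip: u64,
    peer_heads: Result<&[Option<u64>], ()>,
    waiting_since: Instant,
    now: Instant,
) -> GateDecision {
    // Fail closed when peers cannot be queried at all.
    let heads = match peer_heads {
        Ok(heads) => heads,
        Err(()) => return GateDecision::SkipPeerQueryFailed,
    };

    // Highest head reported by any peer, ignoring peers with no head.
    match heads.iter().flatten().copied().max() {
        Some(peer_best) => {
            let lag = peer_best.saturating_sub(local_tip);
            if lag > MINING_LAG_THRESHOLD {
                GateDecision::SkipLagging { peer_best, lag }
            } else {
                GateDecision::Mine
            }
        }
        // No peer has a head: wait out the grace window, then permit
        // mining so an all-validators restart cannot deadlock.
        None => {
            let grace = Duration::from_secs(PEER_HEAD_WAIT_TIMEOUT_SECS);
            if now.saturating_duration_since(waiting_since) < grace {
                GateDecision::SkipWaitingForPeerHead
            } else {
                GateDecision::MineAfterGrace
            }
        }
    }
}
```

Keeping the decision separate from the async peer query makes each branch (lagging, caught up, grace pending, grace expired, fail-closed) straightforward to unit test without a running network.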