
sync/fix: Clear gap sync on known imported blocks#8445

Merged
lexnv merged 16 commits into master from lexnv/investigate-warpsync on Jun 4, 2025

Conversation

@lexnv

@lexnv lexnv commented May 6, 2025

This PR ensures that warp sync gaps are properly cleared when known blocks are imported. Previously, gaps were only removed in response to ImportedUnknown events.

This limitation caused issues for asset-hub and bridge-hub collators, which remained stuck in the "Block history" state without progressing.

The root cause lies in client.info() reporting a gap during node startup or restart (i.e. when block verification fails). In some cases, a peer may respond with the missing blocks after we've already imported them locally. Those blocks are then reported as known rather than unknown, so the gap is left open.
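The behaviour change can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `GapTracker`, `ImportOutcome`, and the `clear_on_known` switch are hypothetical names used only to contrast the old and new handling, and the gap is modelled as a plain inclusive block-number range.

```rust
use std::ops::RangeInclusive;

/// Hypothetical, simplified outcome of a block import.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ImportOutcome {
    /// The block was new to the client.
    ImportedUnknown,
    /// The block was already in the chain (compare "Block already in chain" in the traces).
    ImportedKnown,
}

/// Hypothetical stand-in for the gap state tracked by the syncing strategy.
#[derive(Debug)]
struct GapTracker {
    /// Inclusive range of block numbers still considered missing.
    gap: Option<RangeInclusive<u64>>,
    /// `false` models the behaviour before the fix, `true` the behaviour after it.
    clear_on_known: bool,
}

impl GapTracker {
    fn new(gap: RangeInclusive<u64>, clear_on_known: bool) -> Self {
        Self { gap: Some(gap), clear_on_known }
    }

    /// Clears the gap once its last block has been imported.
    fn on_block_imported(&mut self, number: u64, outcome: ImportOutcome) {
        // Assumption of this sketch: the gap is complete when its end block is imported.
        let reached_target = self.gap.as_ref().map_or(false, |gap| number == *gap.end());
        let outcome_counts = match outcome {
            ImportOutcome::ImportedUnknown => true,
            // Before the fix, known blocks never closed the gap.
            ImportOutcome::ImportedKnown => self.clear_on_known,
        };
        if reached_target && outcome_counts {
            self.gap = None;
        }
    }
}
```

With a gap of `4935955..=4935955`, as in the traces, importing block 4935955 as `ImportedKnown` leaves `gap` set when `clear_on_known` is `false` and clears it when `clear_on_known` is `true`.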

Grafana link: https://grafana.teleport.parity.io/goto/jCcsBLxNg?orgId=1

Traces from production:

2025-05-06 12:55:34.251 DEBUG                 main sync: [Parachain] Starting gap sync #4935955 - #4935955    

2025-05-06 12:55:34.558 TRACE tokio-runtime-worker sync: [Parachain] New gap block request for 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy, (best:5103253, common:5103253) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(4935955), direction: Descending, max: Some(1) }    

2025-05-06 12:55:34.558 TRACE tokio-runtime-worker sync: [Parachain] Processed `SyncingAction::StartRequest` to 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy with strategy key StrategyKey("ChainSync").    

2025-05-06 12:55:34.608 TRACE tokio-runtime-worker sync: [Parachain] BlockResponse 0 from 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy with 1 blocks  (4935955)    

2025-05-06 12:55:34.608 DEBUG tokio-runtime-worker sync: [Parachain] Drained 1 gap blocks from 4935954    
	
2025-05-06 12:55:35.511 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Starting import of 1 blocks  (4935955)    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block already in chain 4935955: 0x63db2b40cccac020fbc922e5e98bb3955f4cdaa823a2be85ecf22776745ccacc    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block imported successfully Some(4935955) (0x63db…cacc)    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync: [Parachain] Cleared blocks from 4935955 to 4935956    

Testing Done

Added two tests to verify that warp sync gaps are correctly cleared under both block import scenarios. The first test closely follows the operations performed by the node, while the second one emulates the imports.

Next Steps

Added extra debug logs to monitor whether the issue persists (which would point towards a corrupt database, i.e. client.info() always has the gap present).

Closes: #8416

cc @paritytech/networking

lexnv added 6 commits May 5, 2025 14:08
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this May 6, 2025
@lexnv lexnv added T0-node This PR/Issue is related to the topic “node”. I2-bug The node fails to follow expected behavior. labels May 6, 2025
@lexnv lexnv added this to Networking May 6, 2025
lexnv added 2 commits May 6, 2025 17:39
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv requested a review from a team May 6, 2025 14:39
@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/14862527642
Failed job name: test-linux-stable-no-try-runtime

@lexnv

lexnv commented May 6, 2025

Unrelated test failing

 ──── TRY 1 STDOUT:       polkadot-sdk-docs guides::your_first_node::tests::guide_first_runtime_works

running 1 test
test guides::your_first_node::tests::guide_first_runtime_works has been running for over 60 seconds
test guides::your_first_node::tests::guide_first_runtime_works ... FAILED

@skunert skunert left a comment

Changes look good. Just one more sanity check:

  • When we warp sync to block x, the target of the gap sync will be block x - 1.
  • There is no reasonable way that we import a known block x - 1 without completing the gap sync.
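The relationship between the warp sync target and the gap can be sketched as a small helper. This is an illustration only: `gap_after_warp_sync` and its `last_known` parameter are hypothetical and do not exist in the codebase, and the block numbers in the usage note are made up.

```rust
/// Hypothetical helper: the inclusive block range that gap sync has to backfill
/// after warp syncing to `warp_target`, given that the client already holds a
/// contiguous chain up to `last_known` (0 meaning genesis only).
/// Returns `None` when there is nothing left to backfill.
fn gap_after_warp_sync(last_known: u64, warp_target: u64) -> Option<(u64, u64)> {
    // The gap starts right after the last block we already have...
    let start = last_known.checked_add(1)?;
    // ...and ends at x - 1, the block just below the warp sync target x.
    let end = warp_target.checked_sub(1)?;
    (start <= end).then_some((start, end))
}
```

For example, `gap_after_warp_sync(0, 100)` yields `Some((1, 99))`, so the gap sync target is block 99, while `gap_after_warp_sync(0, 1)` yields `None` because no block lies between genesis and the target.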

I am just wondering why it was originally done like this.

@lexnv

lexnv commented May 27, 2025

Coming back to some older PRs after handling litep2p, sorry for the delay 🙏

> There is no reasonable way that we import a known block x - 1 without completing the gap sync.
> I am just wondering why it was originally done like this.

@skunert Yep, that makes sense to me. Maybe this was always a missed case on the sync implementation?

Maybe @dmitry-markin do you know if there were some edge-cases in the past with known blocks that pointed us towards not closing the gap on importing known blocks? 🤔

@lexnv lexnv added this pull request to the merge queue May 27, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 27, 2025
Comment on lines +179 to +180
// Note: Ideally we can deduce this information with #[derive(derive_more::Debug)].
// However, we'd need a bump to the latest version 2 of the crate.

Why not do the bump?


Will come with a follow-up, IIRC they have changed some feature-flags and now we have to explicitly select them :D
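For context, a hand-written `Debug` implementation of the kind the code comment alludes to might look like the sketch below. The `GapState` struct and its fields are hypothetical and are not the type touched by this PR; the sketch uses only the standard library and summarises one field instead of printing it in full.

```rust
use std::fmt;

/// Hypothetical struct used only for illustration.
struct GapState {
    target: u64,
    best_queued: u64,
    /// A potentially large payload that should not be printed in full.
    pending_blocks: Vec<Vec<u8>>,
}

impl fmt::Debug for GapState {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("GapState")
            .field("target", &self.target)
            .field("best_queued", &self.best_queued)
            // Print only the number of pending blocks, not their contents.
            .field("pending_blocks", &self.pending_blocks.len())
            .finish()
    }
}
```

A derive-based version would remove this boilerplate, which is presumably what the follow-up bump is meant to enable.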

@lexnv lexnv added this pull request to the merge queue Jun 4, 2025
Merged via the queue into master with commit 85b71da Jun 4, 2025
182 of 183 checks passed
@lexnv lexnv deleted the lexnv/investigate-warpsync branch June 4, 2025 11:52
@github-project-automation github-project-automation bot moved this to Blocked ⛔️ in Networking Jun 4, 2025
ordian added a commit that referenced this pull request Jun 4, 2025
* master:
  omni-node: fix `benchmark pallet` to work with `--runtime` (#8594)
  Handle and suppress "New unknown `FromSwarm` libp2p event" warning (#8731)
  Implement detailed logging for XCM failures (#8724)
  [pallet-revive] contract's nonce starts at 1 (#8734)
  sync/fix: Clear gap sync on known imported blocks (#8445)
  [PoP] Add personhood tracking pallets (#8164)
  client/net: Use litep2p as the default network backend (#8461)
  Unflake `returns_status_for_pruned_blocks` (#8709)
  [AHM] Report the weights of epmb pallet to expose kusama and polkadot weights (#8704)
  Remove all XCM dependencies from `pallet-revive` (#8584)
  Docker master image tag fix (#8711)
  Record ed as part of the storage deposit (#8718)
  [pallet-revive] update dry-run logic (#8662)
  feat: add collator peer ID to ParachainInherentData (#8708)
  Nest errors in pallet-xcm (#7730)
  pallet-assets ERC20 precompile (#8554)
  Broker: Introduce min price + adjust renewals to lower market. (#8630)
  [AHM] Staking async fixes for XCM and election planning (#8422)
  Staking (EPMB): Add defensive error handling to voter snapshot creation and solution verification (#8687)
pgherveou pushed a commit that referenced this pull request Jun 11, 2025
This PR ensures that warp sync gaps are properly cleared when known
blocks are imported. Previously, gaps were only removed in response to
`ImportedUnknown` events.

This limitation caused issues for asset-hub and bridge-hub collators,
which remained stuck in the "Block history" state without progressing.


The root cause lies in the client.info() reporting a gap during node
startup or restart (i.e. block verification fails). In some cases, a peer
may respond with the missing blocks after we’ve already imported them
locally, leaving the gap open.

Grafana link: https://grafana.teleport.parity.io/goto/jCcsBLxNg?orgId=1

Traces from production:

```
2025-05-06 12:55:34.251 DEBUG                 main sync: [Parachain] Starting gap sync #4935955 - #4935955    

2025-05-06 12:55:34.558 TRACE tokio-runtime-worker sync: [Parachain] New gap block request for 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy, (best:5103253, common:5103253) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(4935955), direction: Descending, max: Some(1) }    

2025-05-06 12:55:34.558 TRACE tokio-runtime-worker sync: [Parachain] Processed `SyncingAction::StartRequest` to 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy with strategy key StrategyKey("ChainSync").    

2025-05-06 12:55:34.608 TRACE tokio-runtime-worker sync: [Parachain] BlockResponse 0 from 12D3KooWAVQMhkXmc5ueSYasdsRWQbKus2YGZ6HDZUB4ViJMCxXy with 1 blocks  (4935955)    

2025-05-06 12:55:34.608 DEBUG tokio-runtime-worker sync: [Parachain] Drained 1 gap blocks from 4935954    
	
2025-05-06 12:55:35.511 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Starting import of 1 blocks  (4935955)    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block already in chain 4935955: 0x63db2b40cccac020fbc922e5e98bb3955f4cdaa823a2be85ecf22776745ccacc    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block imported successfully Some(4935955) (0x63db…cacc)    

2025-05-06 12:55:35.517 TRACE tokio-runtime-worker sync: [Parachain] Cleared blocks from 4935955 to 4935956    
```

### Testing Done

Added two tests to verify that warp sync gaps are correctly cleared
under both block import scenarios. The first test closely follows the
operations performed by the node, while the second one emulates the
imports.

### Next Steps

Added extra debug logs to monitor if the issue persists (pointing
towards a corrupt database, i.e. client.info() always has the gap
present).

Closes: #8416

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Nikolai Kozlov <1431590+nkpar@users.noreply.github.com>
girazoki pushed a commit to moondance-labs/polkadot-sdk that referenced this pull request Aug 11, 2025
girazoki added a commit to moondance-labs/polkadot-sdk that referenced this pull request Aug 12, 2025
RomarQ pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Oct 8, 2025
alvicsam pushed a commit that referenced this pull request Oct 17, 2025
Labels

I2-bug The node fails to follow expected behavior. T0-node This PR/Issue is related to the topic “node”.

Projects

Status: Blocked ⛔️

Development

Successfully merging this pull request may close these issues.

Investigate why collators constantly are in "Block history" state

5 participants