Skip to content

fix: slot-based collator shuts down immediately after init#11628

Merged
sigurpol merged 4 commits intomasterfrom
sigurpol-fix-regression-a1a2bbfdb4
Apr 3, 2026
Merged

fix: slot-based collator shuts down immediately after init#11628
sigurpol merged 4 commits intomasterfrom
sigurpol-fix-regression-a1a2bbfdb4

Conversation

@sigurpol
Copy link
Copy Markdown
Contributor

@sigurpol sigurpol commented Apr 3, 2026

Fix a regression introduced by #11381, where we wrapped the slot-based collator launch in an async task that first calls wait_for_aura, then spawns the actual long-running collator tasks via slot_based::run(). The wrapper was spawned with spawn_essential_handle().

Essential tasks shut down the node when they complete. The init wrapper completes immediately after spawning, the TaskManager sees an essential task exit, and the node shuts down.

This only affects parachain collators started with --authoring=slot-based.

Fix: use spawn_handle() for the short-lived init wrapper. The child tasks inside slot_based::run() remain correctly marked as essential.

An easy way to reproduce (same setup used by staking-miner nightly test - which in fact started to fail after #11381 got merged e.g. here ): spawn a Zombienet network with a 2-validator relay chain and a single slot-based parachain collator. The collator process starts but shuts down immediately.
For example in your SDK repo:

cd substrate/frame/staking-async/runtimes/papi-tests
just setup
just run fake-dev 

which launches zombienet spawning

  • alice (relay validator, port 9944) — polkadot
  • bob (relay validator, port 9945) — polkadot
  • charlie (parachain collator, port 9946) — polkadot-parachain --collator --authoring=slot-based

Port 9946 never comes up.

I have also verified that the fix coming from #11381 still works, running manually ./target/release/polkadot-parachain --chain asset-hub-polkadot --sync warp --authoring=slot-based --tmp -- --sync warp.

Regression from a1a2bbf ("Fix slot-based collator panic during warp
sync"). That commit wrapped the slot-based collator launch in an async
task that first calls `wait_for_aura`, then spawns the actual long-running
collator tasks via `slot_based::run()`. The wrapper was spawned with
`spawn_essential_handle()`.

Essential tasks shut down the node when they complete — by design, they
are expected to run forever. Unlike the lookahead collator (whose
`aura::run_with_export().await` loops indefinitely), `slot_based::run()`
is synchronous: it spawns two child essential tasks and returns. So the
init wrapper completes immediately after spawning, the TaskManager sees
an essential task exit, and the node shuts down.

This only affects parachain collators started with `--authoring=slot-based`
(e.g. the collator on ws port 9946 in a Zombienet setup). Relay chain
nodes (ports 9944/9945) use BABE/GRANDPA and are unaffected.

Fix: use `spawn_handle()` for the short-lived init wrapper. The child
tasks inside `slot_based::run()` remain correctly marked as essential.
@sigurpol
Copy link
Copy Markdown
Contributor Author

sigurpol commented Apr 3, 2026

/cmd prdoc --audience runtime_dev --bump patch

@sigurpol sigurpol added the T9-cumulus This PR/Issue is related to cumulus. label Apr 3, 2026
@sigurpol
Copy link
Copy Markdown
Contributor Author

sigurpol commented Apr 3, 2026

cc @clangenb - PTAL if the change makes sense for you too 🙏

@sigurpol sigurpol requested a review from serban300 April 3, 2026 08:30
@clangenb
Copy link
Copy Markdown
Contributor

clangenb commented Apr 3, 2026

Yoo, sorry, expected the regular non-warp sync case to be tested in CI here, and I did not wait for the para warp sync to finish when I tested. XD

However, it seems there are relevant scenarios not tested in CI - I guess we should add a follow-up issue to that?

EDIT: Fix looks good obviously

@sigurpol
Copy link
Copy Markdown
Contributor Author

sigurpol commented Apr 3, 2026

Yoo, sorry, expected the regular non-warp sync case to be tested in CI here, and I did not wait for the para warp sync to finish. XD

However, it seems there are relevant scenarios not tested in CI - I guess we should add a follow-up issue to that?

EDIT: Fix looks good obviously

Thanks for the feedback - yes, I believe we could improve coverage on CI definitely, we were discussing for staking to make tests / setup under staking-async/runtimes/papi-tests part of CI eventually but we haven't prioritized that yet --- so independently by that, I think we should probably have this basic use case where we spawn something similar to what I described as part of CI - maybe @pepoviola as zombienet's wizard - together with node experts - you have suggestions / ideas, I am definitely not authoritative here. I have noticed the issue just because the staking-miner nightly job spawns the setup I described in the PR vs latest SDK and starts to fail miserably 😅

Copy link
Copy Markdown
Contributor

@skunert skunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Missed that indeed. We have zombienet tests for authoring, but they use test-parachain binary. It is used because it has extra CLI flags that are needed for some scenarios. But might be better if we switched to Omni node for the ones that don't require anything special.

@sigurpol sigurpol added this pull request to the merge queue Apr 3, 2026
Merged via the queue into master with commit 6324a66 Apr 3, 2026
249 of 256 checks passed
@sigurpol sigurpol deleted the sigurpol-fix-regression-a1a2bbfdb4 branch April 3, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T9-cumulus This PR/Issue is related to cumulus.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants