fix: slot-based collator shuts down immediately after init #11628
Regression from a1a2bbf ("Fix slot-based collator panic during warp sync"). That commit wrapped the slot-based collator launch in an async task that first calls `wait_for_aura`, then spawns the actual long-running collator tasks via `slot_based::run()`. The wrapper was spawned with `spawn_essential_handle()`.

Essential tasks shut down the node when they complete; by design, they are expected to run forever. Unlike the lookahead collator (whose `aura::run_with_export().await` loops indefinitely), `slot_based::run()` is synchronous: it spawns two child essential tasks and returns. So the init wrapper completes immediately after spawning, the TaskManager sees an essential task exit, and the node shuts down.

This only affects parachain collators started with `--authoring=slot-based` (e.g. the collator on ws port 9946 in a Zombienet setup). Relay chain nodes (ports 9944/9945) use BABE/GRANDPA and are unaffected.

Fix: use `spawn_handle()` for the short-lived init wrapper. The child tasks inside `slot_based::run()` remain correctly marked as essential.
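To illustrate the failure mode described above, here is a minimal toy model using only the Rust standard library. It is a sketch, not the real `sc_service` API: `ToySpawner`, `spawn_essential`, `spawn`, and `first_essential_exit` are hypothetical names, and threads stand in for async tasks. The model only assumes what the description states: an essential task that returns is treated as a shutdown signal, while a regular task may return freely.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Duration;

/// Toy spawner (hypothetical; NOT the real `sc_service` task handles).
/// An essential task that returns reports its name on `exit_tx`,
/// which this model treats as "the node shuts down".
#[derive(Clone)]
struct ToySpawner {
    exit_tx: Sender<&'static str>,
}

impl ToySpawner {
    /// Essential task: its completion is reported as a shutdown signal.
    fn spawn_essential<F>(&self, name: &'static str, task: F)
    where
        F: FnOnce() + Send + 'static,
    {
        let exit_tx = self.exit_tx.clone();
        thread::spawn(move || {
            task();
            // Only reached if the task returns.
            let _ = exit_tx.send(name);
        });
    }

    /// Regular task: its completion is ignored.
    fn spawn<F>(&self, task: F)
    where
        F: FnOnce() + Send + 'static,
    {
        thread::spawn(task);
    }
}

/// Runs the "init wrapper" and returns the name of the first essential
/// task that exits within a short window, or `None` if none does.
fn first_essential_exit(wrapper_is_essential: bool) -> Option<&'static str> {
    let (exit_tx, exit_rx) = channel();
    let spawner = ToySpawner { exit_tx };
    let child_spawner = spawner.clone();

    // Stand-in for the init wrapper: it spawns a long-running essential
    // child and then returns immediately, like a synchronous `run()`.
    let init_wrapper = move || {
        child_spawner.spawn_essential("collator", || loop {
            thread::sleep(Duration::from_millis(10));
        });
    };

    if wrapper_is_essential {
        spawner.spawn_essential("init-wrapper", init_wrapper); // the regression
    } else {
        spawner.spawn(init_wrapper); // the fix
    }

    exit_rx.recv_timeout(Duration::from_millis(200)).ok()
}
```

In this model, `first_essential_exit(true)` reports the wrapper as an exited essential task, while `first_essential_exit(false)` observes no essential exit because only the long-running child is marked essential.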
/cmd prdoc --audience runtime_dev --bump patch
cc @clangenb - PTAL if the change makes sense for you too 🙏
Yoo, sorry, I expected the regular non-warp sync case to be tested in CI here, and I did not wait for the para warp sync to finish when I tested. XD However, it seems there are relevant scenarios not tested in CI - I guess we should add a follow-up issue for that? EDIT: Fix looks good obviously
Thanks for the feedback - yes, I believe we could definitely improve coverage on CI; for staking, we were discussing making the tests / setup under
skunert left a comment
Thanks! I missed that indeed. We have zombienet tests for authoring, but they use the test-parachain binary, because it has extra CLI flags that are needed for some scenarios. It might be better to switch to Omni node for the tests that don't require anything special.
Fix a regression introduced by #11381, where we wrapped the slot-based collator launch in an async task that first calls `wait_for_aura`, then spawns the actual long-running collator tasks via `slot_based::run()`. The wrapper was spawned with `spawn_essential_handle()`.

Essential tasks shut down the node when they complete. The init wrapper completes immediately after spawning, the TaskManager sees an essential task exit, and the node shuts down.

This only affects parachain collators started with `--authoring=slot-based`.

Fix: use `spawn_handle()` for the short-lived init wrapper. The child tasks inside `slot_based::run()` remain correctly marked as essential.

An easy way to reproduce (the same setup used by the staking-miner nightly test, which in fact started to fail after #11381 got merged, e.g. here): spawn a Zombienet network with a 2-validator relay chain and a single slot-based parachain collator. The collator process starts but shuts down immediately.
For example in your SDK repo:
which launches zombienet spawning
Port 9946 never comes up.
I have also verified that the fix coming from #11381 still works, running manually `./target/release/polkadot-parachain --chain asset-hub-polkadot --sync warp --authoring=slot-based --tmp -- --sync warp`.