
Staking-Async + EPMB: Migrate operations to poll#9925

Merged
kianenigma merged 69 commits into master from kiz-epmb-poll on Dec 1, 2025

@kianenigma
Contributor

@kianenigma kianenigma commented Oct 3, 2025

This PR moves all operations related to staking elections from a mandatory on_initialize with no consideration to weight, to an optional on_poll with accurate, pre-execution weight checking.

Why

  • on_initialize is a mandatory hook. If a single parachain block happens to contain too many of them, this block can never be authored and imported. In solo/relay chains, this is more forgiving, as you would have one slow block, instead of an indefinite stall.
  • For example, message-queue XCMs, scheduler and MBMs might overlap with the staking on_initialize in AH (unlikely, but totally possible), and put the chain at risk.
  • In contrast, poll hooks:
    • Might not be executed at all by frame-executive (e.g. during MBMs)
    • Have access to a clear WeightMeter, allowing the subject to make a decision about whether to proceed or not.

Functional Changes

As seen by the minimal diff in existing tests, this change, in the absence of weight scarcity, is almost a noop. The only difference is that the start signal from the signed pallet to the verifier pallet is now sent at the end of the signed phase, not at the beginning of the signed validation phase.

Non-Functional Changes

  • Now, the only pallets that call on_poll are multi_block (the parent, not verifier and signed), and staking_async. This makes the code easier to audit.
  • Removes a lot of on_initialize terminology from weight functions
  • Cleans up some stale variations in the mock setup that allowed us to skip the signed pallet's on-initialize. Skipping it no longer makes sense, as the parent pallet is the only one that calls on_poll.

More Test Changes

During this PR, I found multiple instances where we are advancing the wrong number of blocks in the EPMB tests. For example, to verify a solution, 3 blocks are needed, but we are calling roll_next 4 times, and the test is still passing. To harden such cases and make sure all future tests are as explicit as possible, I have:

  • Generic fn roll_next is made fully private and in tests should be replaced by:
    • fn roll_next_and_phase(expect: Phase<T>)
    • fn roll_next_and_phase_verifier(expect: Phase<T>, status: Status)
    • fn roll_next_and_verifier(status: Status)
  • This ensures all tests explicitly state what the expected Phase/Status should be after moving a block forward.

All tests are updated to respect this paradigm, which has made the diff slightly larger than I wished it to be.
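
To illustrate why an asserting helper catches over-rolling, here is a minimal sketch. The `Mock` struct, the simplified `Phase` enum and its transition function are hypothetical stand-ins, not the EPMB mock runtime; only the helper name `roll_next_and_phase` comes from the list above.

```rust
// Hypothetical, heavily simplified phases; the real `Phase<T>` has more variants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Phase {
    Off,
    Snapshot(u32),
    Signed(u32),
    Done,
}

// Hypothetical transition function standing in for the pallet's per-block logic.
fn next_phase(p: Phase) -> Phase {
    match p {
        Phase::Off => Phase::Snapshot(1),
        Phase::Snapshot(0) => Phase::Signed(1),
        Phase::Snapshot(n) => Phase::Snapshot(n - 1),
        Phase::Signed(0) => Phase::Done,
        Phase::Signed(n) => Phase::Signed(n - 1),
        Phase::Done => Phase::Off,
    }
}

// Hypothetical test fixture.
struct Mock {
    block: u32,
    phase: Phase,
}

impl Mock {
    fn new() -> Self {
        Self { block: 0, phase: Phase::Off }
    }

    // Private, generic step: tests are not meant to call this directly.
    fn roll_next(&mut self) {
        self.block += 1;
        self.phase = next_phase(self.phase);
    }

    // Advance exactly one block and assert the resulting phase.
    fn roll_next_and_phase(&mut self, expect: Phase) {
        self.roll_next();
        assert_eq!(self.phase, expect, "unexpected phase at block {}", self.block);
    }
}
```

With this shape, a test that rolls one block too many fails at the first mismatching assertion instead of silently passing.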

Implementation/Review Notes

Overall Design

The overall idea is to move all operations to a model similar to dispatchables, where before executing f(input) -> Result, we have access to a w(input) -> Weight that gives us the pre-execution weight. If the pre-execution weight fits within the available weight, we proceed with executing. The execution may then lower the registered weight below the pre-execution value if it wishes.

/// ### Type
///
/// The commonly used `(Weight, Box<dyn Fn(&mut WeightMeter)>)` should be interpreted as such:
///
/// * The `Weight` is the pre-computed worst case weight of the operation that we are going to
///   do.
/// * The `Box<dyn Fn(&mut WeightMeter)>` is the function that represents that the work that
///   will at most consume the said amount of weight. While executing, it will alter the given
///   weight meter to consume the actual weight used. Indeed, the weight that is registered in
///   the `WeightMeter` must never be more than the `Weight` returned as the first item of the
///   tuple.
///
/// In essence, the caller must:
///
/// 1. given an existing `meter`, receive `(worst_weight, exec)`
/// 2. ensure `meter` can consume up to `worst_weight`.
/// 3. if so, call `exec(meter)`, knowing `meter` will accumulate at most `worst_weight` extra.
fn per_block_exec(current_phase: Phase<T>) -> (Weight, Box<dyn Fn(&mut WeightMeter)>) {
    ... 
}
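
The caller-side contract in the doc comment above can be sketched as follows. This is an illustration under stated assumptions, not the pallet's code: `Weight` is reduced to a single `u64`, `WeightMeter` is a hand-written stand-in for the `sp_weights` type, and the `Phase` enum, the weight values and the `poll` function are hypothetical.

```rust
// Stand-in types for illustration only; the real `Weight` and `WeightMeter`
// live in `sp_weights` and track two dimensions (ref_time and proof_size).
type Weight = u64;

struct WeightMeter {
    limit: Weight,
    consumed: Weight,
}

impl WeightMeter {
    fn with_limit(limit: Weight) -> Self {
        Self { limit, consumed: 0 }
    }
    fn remaining(&self) -> Weight {
        self.limit.saturating_sub(self.consumed)
    }
    fn can_consume(&self, w: Weight) -> bool {
        w <= self.remaining()
    }
    fn consume(&mut self, w: Weight) {
        self.consumed = self.consumed.saturating_add(w);
    }
}

// Hypothetical, simplified phases.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Phase {
    Off,
    Snapshot(u32),
}

type ExecuteFn = Box<dyn Fn(&mut WeightMeter)>;

/// Returns the worst-case weight of this block's work and a closure performing it.
fn per_block_exec(phase: Phase) -> (Weight, ExecuteFn) {
    match phase {
        Phase::Off => {
            let worst: Weight = 1; // made-up value
            let exec: ExecuteFn = Box::new(move |meter: &mut WeightMeter| meter.consume(worst));
            (worst, exec)
        }
        Phase::Snapshot(page) => {
            let worst: Weight = 100; // made-up value
            let exec: ExecuteFn = Box::new(move |meter: &mut WeightMeter| {
                // The snapshot work for `page` would happen here.
                let _ = page;
                meter.consume(worst);
            });
            (worst, exec)
        }
    }
}

/// Caller side: execute only if the meter can absorb the worst case.
/// Returns `true` if the work ran and `false` if it was deferred.
fn poll(phase: Phase, meter: &mut WeightMeter) -> bool {
    let (worst_weight, exec) = per_block_exec(phase);
    if meter.can_consume(worst_weight) {
        exec(meter);
        true
    } else {
        false
    }
}
```

Because the check happens before `exec` is called, a block that is already full simply skips the step, and the same phase can be retried in the next block.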

Export Weight

While working on this PR, I realized that we were never registering the weight of the export process. The export is managed by the staking pallet, which previously had no way of knowing the weight of each export step.

Now, we alter the ElectionProvider::status interface so that it signals not only whether we are ready, but also, when we are ready, the weight of the next elect call.

fn status() -> Result<Option<Weight>, ()> {
	match <CurrentPhase<T>>::get() {
		// we're not doing anything.
		Phase::Off => Err(()),

		// we're doing sth but not ready.
		Phase::Signed(_) |
		Phase::SignedValidation(_) |
		Phase::Unsigned(_) |
		Phase::Snapshot(_) |
		Phase::Emergency => Ok(None),

		// we're ready, and this is the weight of the next step
		Phase::Done => Ok(Some(T::WeightInfo::export_non_terminal())),
		Phase::Export(p) =>
			if p.is_zero() {
				Ok(Some(T::WeightInfo::export_terminal()))
			} else {
				Ok(Some(T::WeightInfo::export_non_terminal()))
			},
	}
}
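
As a rough sketch of how a consumer could interpret this return type: the `Action` enum and the `decide` function below are hypothetical and do not reflect what the staking pallet actually does, and `Weight` is simplified to a `u64`.

```rust
// Stand-in for the two-dimensional `Weight` type.
type Weight = u64;

// Hypothetical outcomes for a consumer of `status()`.
#[derive(Debug, PartialEq)]
enum Action {
    /// `Err(())`: no election is ongoing.
    Idle,
    /// `Ok(None)`: an election is ongoing but not ready to export.
    Wait,
    /// `Ok(Some(w))` and `w` fits: call `elect` for the next page.
    Export,
    /// `Ok(Some(w))` but `w` does not fit: try again in a later block.
    Defer,
}

fn decide(status: Result<Option<Weight>, ()>, remaining: Weight) -> Action {
    match status {
        Err(()) => Action::Idle,
        Ok(None) => Action::Wait,
        Ok(Some(w)) if w <= remaining => Action::Export,
        Ok(Some(_)) => Action::Defer,
    }
}
```

The point of the richer return type is that the consumer can make the `Export` versus `Defer` decision before doing any work.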

Integration

The only breaking change of this PR is:

impl multi_block::Config for Runtime {
    // .. 
    type Signed = multi_block_signed::Pallet<Self>;
}

While not mandatory, the fellowship runtimes should use the new check_all_weights function to test the weights.

Path To Weight Refund

The usage of WeightMeter is intentional here to pave the way to reclaiming the weight in a subsequent PR. It will look like this:

--- a/substrate/frame/election-provider-multi-block/src/lib.rs
+++ b/substrate/frame/election-provider-multi-block/src/lib.rs
@@ -1310,6 +1310,7 @@ impl<T: Config> Pallet<T> {
 	/// 2. ensure `meter` can consume up to `worst_weight`.
 	/// 3. if so, call `exec(meter)`, knowing `meter` will accumulate at most `worst_weight` extra.
 	fn per_block_exec(current_phase: Phase<T>) -> (Weight, Box<dyn Fn(&mut WeightMeter)>) {
+		use cumulus_primitives_storage_weight_reclaim::StorageWeightReclaimer;
 		type ExecuteFn = Box<dyn Fn(&mut WeightMeter)>;
 		let noop: (Weight, ExecuteFn) = (T::WeightInfo::per_block_nothing(), Box::new(|_| {}));
 
@@ -1318,8 +1319,9 @@ impl<T: Config> Pallet<T> {
 				// first snapshot
 				let weight = T::WeightInfo::per_block_snapshot_msp();
 				let exec: ExecuteFn = Box::new(move |meter: &mut WeightMeter| {
+					let mut reclaimer = StorageWeightReclaimer::new(meter);
 					Self::create_targets_snapshot();
-					meter.consume(weight)
+					let _reclaimed = reclaimer.reclaim_with_meter(meter);
 				});
 				(weight, exec)
 			},
@@ -1328,8 +1330,9 @@ impl<T: Config> Pallet<T> {
 				// rest of the snapshot, incl last one.
 				let weight = T::WeightInfo::per_block_snapshot_rest();
 				let exec: ExecuteFn = Box::new(move |meter: &mut WeightMeter| {
+					let mut reclaimer = StorageWeightReclaimer::new(meter);
 					Self::create_voters_snapshot_paged(x);
-					meter.consume(weight)
+					let _reclaimed = reclaimer.reclaim_with_meter(meter);
 				});
 				(weight, exec)
 			},

In short, instead of consuming the worst case weight, we consume the accurate amount given to us by the weight reclaimer.
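
The accounting difference can be modelled in isolation. The sketch below does not use `StorageWeightReclaimer` or its API; the meter is a hand-written stand-in, the `work` closure simply reports a made-up "measured" weight, and both function names are hypothetical.

```rust
// Stand-in types for illustration only; not the `sp_weights` API.
type Weight = u64;

struct WeightMeter {
    limit: Weight,
    consumed: Weight,
}

impl WeightMeter {
    fn with_limit(limit: Weight) -> Self {
        Self { limit, consumed: 0 }
    }
    fn remaining(&self) -> Weight {
        self.limit.saturating_sub(self.consumed)
    }
    fn consume(&mut self, w: Weight) {
        self.consumed = self.consumed.saturating_add(w);
    }
}

/// Worst-case accounting: always charge the benchmarked upper bound.
fn exec_worst_case(meter: &mut WeightMeter, worst: Weight, work: impl Fn() -> Weight) {
    let _measured = work();
    meter.consume(worst);
}

/// Reclaim-style accounting: charge what was measured, capped at the worst case.
/// Returns the amount handed back to the block.
fn exec_with_reclaim(meter: &mut WeightMeter, worst: Weight, work: impl Fn() -> Weight) -> Weight {
    let measured = work().min(worst);
    meter.consume(measured);
    worst - measured
}
```

For a step benchmarked at 400 that actually uses 250, the first variant leaves 600 of a 1,000 budget while the second leaves 750.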

TODO

@kianenigma kianenigma requested a review from a team as a code owner October 3, 2025 12:26
@kianenigma kianenigma added T2-pallets This PR/Issue is related to a particular pallet. A4-backport-unstable2507 Pull request must be backported to the unstable2507 release branch labels Oct 3, 2025
@sigurpol sigurpol self-requested a review October 3, 2025 13:21
@kianenigma
Contributor Author

/cmd bench --help

@github-actions
Contributor

github-actions Bot commented Oct 6, 2025

Command help:
usage: /cmd bench [-h] [--quiet] [--clean] [--image IMAGE]
                  [--runtime [{dev,westend,rococo,asset-hub-westend,asset-hub-rococo,bridge-hub-rococo,bridge-hub-westend,collectives-westend,coretime-rococo,coretime-westend,glutton-westend,people-rococo,people-westend} ...]]
                  [--pallet [PALLET ...]] [--fail-fast]

options:
  -h, --help            show this help message and exit
  --quiet               Won't print start/end/failed messages in PR
  --clean               Clean up the previous bot's & author's comments in PR
  --image IMAGE         Override docker image '--image
                        docker.io/paritytech/ci-unified:latest'
  --runtime [{dev,westend,rococo,asset-hub-westend,asset-hub-rococo,bridge-hub-rococo,bridge-hub-westend,collectives-westend,coretime-rococo,coretime-westend,glutton-westend,people-rococo,people-westend} ...]
                        Runtime(s) space separated
  --pallet [PALLET ...]
                        Pallet(s) space separated
  --fail-fast           Fail fast on first failed benchmark

**Examples**:
 Runs all benchmarks 
 /cmd bench

 Runs benchmarks for pallet_balances and pallet_multisig for all runtimes which have these pallets. **--quiet** makes it to output nothing to PR but reactions
 /cmd bench --pallet pallet_balances pallet_xcm_benchmarks::generic --quiet
 
 Runs bench for all pallets for westend runtime and fails fast on first failed benchmark
 /cmd bench --runtime westend --fail-fast
 
 Does not output anything and cleans up the previous bot's & author command triggering comments in PR 
 /cmd bench --runtime westend rococo --pallet pallet_balances pallet_multisig --quiet --clean

@kianenigma
Contributor Author

/cmd bench --pallet pallet_election_provider_multi_block pallet_election_provider_multi_block_signed pallet_election_provider_multi_block_verifier pallet_election_provider_multi_block_unsigned --runtime asset-hub-westend

@github-actions
Contributor

github-actions Bot commented Oct 6, 2025

Command "bench --pallet pallet_election_provider_multi_block pallet_election_provider_multi_block_signed pallet_election_provider_multi_block_verifier pallet_election_provider_multi_block_unsigned --runtime asset-hub-westend" has started 🚀 See logs here

@github-actions
Contributor

github-actions Bot commented Oct 6, 2025

Command "bench --pallet pallet_election_provider_multi_block pallet_election_provider_multi_block_signed pallet_election_provider_multi_block_verifier pallet_election_provider_multi_block_unsigned --runtime asset-hub-westend" has failed ❌! See logs here

Comment thread substrate/frame/election-provider-multi-block/src/lib.rs Outdated
Comment thread substrate/frame/election-provider-multi-block/src/lib.rs Outdated
Comment thread substrate/frame/election-provider-multi-block/src/lib.rs Outdated
Comment thread substrate/frame/election-provider-multi-block/src/lib.rs Outdated
Comment thread substrate/frame/election-provider-multi-block/src/lib.rs Outdated
Comment thread substrate/frame/election-provider-multi-block/src/signed/tests.rs Outdated
@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/19818107671
Failed job name: test-linux-stable-no-try-runtime

@kianenigma kianenigma enabled auto-merge December 1, 2025 10:32
@kianenigma kianenigma added this pull request to the merge queue Dec 1, 2025
Merged via the queue into master with commit 05a3fb1 Dec 1, 2025
251 of 254 checks passed
@kianenigma kianenigma deleted the kiz-epmb-poll branch December 1, 2025 20:54
@paritytech-release-backport-bot

Created backport PR for unstable2507:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-9925-to-unstable2507
git worktree add --checkout .worktree/backport-9925-to-unstable2507 backport-9925-to-unstable2507
cd .worktree/backport-9925-to-unstable2507
git reset --hard HEAD^
git cherry-pick -x 05a3fb107e488378075e956186df34d04c1bf656
git push --force-with-lease

@paritytech-release-backport-bot

Created backport PR for stable2509:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-9925-to-stable2509
git worktree add --checkout .worktree/backport-9925-to-stable2509 backport-9925-to-stable2509
cd .worktree/backport-9925-to-stable2509
git reset --hard HEAD^
git cherry-pick -x 05a3fb107e488378075e956186df34d04c1bf656
git push --force-with-lease

@paritytech-release-backport-bot

Created backport PR for stable2512:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-9925-to-stable2512
git worktree add --checkout .worktree/backport-9925-to-stable2512 backport-9925-to-stable2512
cd .worktree/backport-9925-to-stable2512
git reset --hard HEAD^
git cherry-pick -x 05a3fb107e488378075e956186df34d04c1bf656
git push --force-with-lease

bee344 pushed a commit that referenced this pull request Dec 2, 2025
This PR moves all operations related to staking elections from a
mandatory `on_initialize` with no consideration to weight, to an
optional `on_poll` with accurate, pre-execution weight checking.

## Why

* `on_initialize` is a mandatory hook. If a single parachain block
happens to contain too many of them, this block can never be authored
and imported. In solo/relay chains, this is more forgiving, as you would
have one slow block, instead of an indefinite stall.
* For example, message-queue XCMs, scheduler and MBMs might overlap with
the staking `on_initialize` in AH (unlikely, but totally possible), and
put the chain at risk.
* In contrast, `poll` hooks:
  * Might not be executed at all by `frame-executive` (e.g. during MBMs)
  * Have access to a clear `WeightMeter`, allowing the subject to make a
    decision about whether to proceed or not.

## Functional Changes

As seen by the minimal diff in existing tests, this change, in the
absence of weight scarcity, is almost a noop. The only difference is
that the start signal from the signed pallet to the verifier pallet is
now sent at the end of the signed phase, not at the beginning of the
signed validation phase.

## Non-Functional Changes

* Now, the only pallets that call `on_poll` are `multi_block` (the
parent, not verifier and signed), and `staking_async`. This makes the
code easier to audit.
* Removes a lot of `on_initialize` terminology from weight functions
* Cleans up some stale variations in the mock setup that allowed us to
skip the signed pallet's on-initialize. Skipping it no longer makes
sense, as the parent pallet is the only one that calls `on_poll`.

#### More Test Changes

During this PR, I found multiple instances where we are advancing the
wrong number of blocks in the EPMB tests. For example, to verify
a solution, 3 blocks are needed, but we are calling `roll_next` 4 times,
and the test is still passing. To harden such cases and make sure all
future tests are as explicit as possible, I have:

* Generic `fn roll_next` is made fully private and in tests should be
replaced by:
  * `fn roll_next_and_phase(expect: Phase<T>)`
  * `fn roll_next_and_phase_verifier(expect: Phase<T>, status: Status)`
  * `fn roll_next_and_verifier(status: Status)`
* This ensures all tests explicitly state what the expected
`Phase`/`Status` should be after moving a block forward.

All tests are updated to respect this paradigm, which has made the diff
slightly larger than I wished it to be.

### Implementation/Review Notes

#### Overall Design 

The overall idea is to move all operations to a model similar to
dispatchables, where before executing `f(input) -> Result`, we have
access to a `w(input) -> Weight` that gives us the pre-execution weight.
If the pre-execution weight is good, we proceed with executing. The
execution may override the pre-execution weight to a smaller value if it
wishes so.

```rust
/// ### Type
///
/// The commonly used `(Weight, Box<dyn Fn(&mut WeightMeter)>)` should be interpreted as such:
///
/// * The `Weight` is the pre-computed worst case weight of the operation that we are going to
///   do.
/// * The `Box<dyn Fn(&mut WeightMeter)>` is the function that represents that the work that
///   will at most consume the said amount of weight. While executing, it will alter the given
///   weight meter to consume the actual weight used. Indeed, the weight that is registered in
///   the `WeightMeter` must never be more than the `Weight` returned as the first item of the
///   tuple.
///
/// In essence, the caller must:
///
/// 1. given an existing `meter`, receive `(worst_weight, exec)`
/// 2. ensure `meter` can consume up to `worst_weight`.
/// 3. if so, call `exec(meter)`, knowing `meter` will accumulate at most `worst_weight` extra.
fn per_block_exec(current_phase: Phase<T>) -> (Weight, Box<dyn Fn(&mut WeightMeter)>) {
    ... 
}
``` 

#### Export Weight

While working on this PR, I realized that we were never registering
the weight of the export process. The export is managed by the staking
pallet, which previously had no way of knowing the weight of each
export step.

Now, we alter the `ElectionProvider::status` interface so that it
signals not only whether we are ready, but also, when we are ready,
_the weight of the next `elect`_ call.

```rust
fn status() -> Result<Option<Weight>, ()> {
	match <CurrentPhase<T>>::get() {
		// we're not doing anything.
		Phase::Off => Err(()),

		// we're doing sth but not ready.
		Phase::Signed(_) |
		Phase::SignedValidation(_) |
		Phase::Unsigned(_) |
		Phase::Snapshot(_) |
		Phase::Emergency => Ok(None),

		// we're ready, and this is the weight of the next step
		Phase::Done => Ok(Some(T::WeightInfo::export_non_terminal())),
		Phase::Export(p) =>
			if p.is_zero() {
				Ok(Some(T::WeightInfo::export_terminal()))
			} else {
				Ok(Some(T::WeightInfo::export_non_terminal()))
			},
	}
}
```

## Integration 

The only breaking change of this PR is: 

```
impl multi_block::Config for Runtime {
    // .. 
    type Signed = multi_block_signed::Pallet<Self>;
}
```

While not mandatory, the fellowship runtimes should use the new
`check_all_weights` function to test the weights.

## Path To Weight Refund

The usage of `WeightMeter` is intentional here to pave the way to
reclaiming the weight in a subsequent PR. It will look like this:

```diff
--- a/substrate/frame/election-provider-multi-block/src/lib.rs
+++ b/substrate/frame/election-provider-multi-block/src/lib.rs
@@ -1310,6 +1310,7 @@ impl<T: Config> Pallet<T> {
 	/// 2. ensure `meter` can consume up to `worst_weight`.
 	/// 3. if so, call `exec(meter)`, knowing `meter` will accumulate at most `worst_weight` extra.
 	fn per_block_exec(current_phase: Phase<T>) -> (Weight, Box<dyn Fn(&mut WeightMeter)>) {
+		use cumulus_primitives_storage_weight_reclaim::StorageWeightReclaimer;
 		type ExecuteFn = Box<dyn Fn(&mut WeightMeter)>;
 		let noop: (Weight, ExecuteFn) = (T::WeightInfo::per_block_nothing(), Box::new(|_| {}));
 
@@ -1318,8 +1319,9 @@ impl<T: Config> Pallet<T> {
 				// first snapshot
 				let weight = T::WeightInfo::per_block_snapshot_msp();
 				let exec: ExecuteFn = Box::new(move |meter: &mut WeightMeter| {
+					let mut reclaimer = StorageWeightReclaimer::new(meter);
 					Self::create_targets_snapshot();
-					meter.consume(weight)
+					let _reclaimed = reclaimer.reclaim_with_meter(meter);
 				});
 				(weight, exec)
 			},
@@ -1328,8 +1330,9 @@ impl<T: Config> Pallet<T> {
 				// rest of the snapshot, incl last one.
 				let weight = T::WeightInfo::per_block_snapshot_rest();
 				let exec: ExecuteFn = Box::new(move |meter: &mut WeightMeter| {
+					let mut reclaimer = StorageWeightReclaimer::new(meter);
 					Self::create_voters_snapshot_paged(x);
-					meter.consume(weight)
+					let _reclaimed = reclaimer.reclaim_with_meter(meter);
 				});
 				(weight, exec)
 			},
```

In short, instead of consuming the _worst case weight_, we consume the
accurate amount given to us by the weight reclaimer.

## TODO

- [x] Unit tests
- [x] Unified integration tests for weights 
- [x] weight for RoundRotation is missing (export -> off phase)
- [x] Run all papi-integration tests at the end once.
- [x] Weight update / closes
#7714
- [ ] Test to ensure pallet ordering is not important anymore.
- [x] closes #8910
- [x] Upgrade block `on-init` -> `on-poll`
- [x] queue to audit
- [ ] audit done

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@rachsrl rachsrl moved this from Scheduled to In progress in Security Audit (PRs) - SRLabs Dec 15, 2025
@mmostafas mmostafas moved this from In progress to Waiting for fix in Security Audit (PRs) - SRLabs Jan 8, 2026
@rachsrl rachsrl moved this from Waiting for fix to Audited in Security Audit (PRs) - SRLabs Jan 15, 2026
sigurpol added a commit that referenced this pull request Feb 24, 2026
…igned phase (#11156)

on_initialize claimed on_initialize_into_signed weight (> ~3M
proof_size) every block during the Signed phase, but the heavy work
(loading voter snapshots) only happens once when transitioning from
Snapshot into Signed.
Use discriminant comparison to distinguish phase entry from same-phase
ticks, falling back to on_initialize_nothing for the latter.
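
A minimal sketch of the discriminant comparison described above, assuming a simplified `Phase` enum and made-up weight constants; the function name and signature are hypothetical and this is not the code from #11156.

```rust
use std::mem::discriminant;

// Hypothetical, simplified phases carrying a remaining-blocks counter.
#[derive(Clone, Copy, Debug)]
enum Phase {
    Off,
    Snapshot(u32),
    Signed(u32),
}

// Stand-in for the two-dimensional `Weight` type; the values are made up.
type Weight = u64;
const INTO_SIGNED: Weight = 3_000_000;
const NOTHING: Weight = 1_000;

/// Charge the heavy weight only when the block *enters* the signed phase.
fn on_initialize_weight(before: Phase, after: Phase) -> Weight {
    // `discriminant` compares the enum variant only and ignores the payload,
    // so `Signed(5)` and `Signed(4)` count as the same phase.
    let entered = discriminant(&before) != discriminant(&after);
    match after {
        Phase::Signed(_) if entered => INTO_SIGNED,
        _ => NOTHING,
    }
}
```

Under these assumptions, `Snapshot(0)` to `Signed(5)` is charged `INTO_SIGNED`, while `Signed(5)` to `Signed(4)` falls back to `NOTHING`.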

A proper fix would be backporting #9925 into stable2512 - but this would
be a bigger change not compatible with the tight timeline of 2.1.0
release on Polkadot/Kusama.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
arturgontijo pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Apr 1, 2026
…igned phase (paritytech#11156)

on_initialize claimed on_initialize_into_signed weight (> ~3M
proof_size) every block during the Signed phase, but the heavy work
(loading voter snapshots) only happens once when transitioning from
Snapshot into Signed.
Use discriminant comparison to distinguish phase entry from same-phase
ticks, falling back to on_initialize_nothing for the latter.

A proper fix would be backporting paritytech#9925 into stable2512 - but this would
be a bigger change not compatible with the tight timeline of 2.1.0
release on Polkadot/Kusama.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Labels

A4-backport-stable2509 Pull request must be backported to the stable2509 release branch A4-backport-stable2512 Pull request must be backported to the stable2512 release branch A4-backport-unstable2507 Pull request must be backported to the unstable2507 release branch T2-pallets This PR/Issue is related to a particular pallet.

Projects

Status: Audited
Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants