aura/slot_based: Fix effective slot deadline using relay parent offset by lexnv · Pull Request #11453 · paritytech/polkadot-sdk

lexnv · 2026-03-20T13:17:19Z

This PR addresses two issues with the collators building blocks:

Issue 1: Skipping the first block of the slot:
- Wall clock (slot) 803, parachain slot 803
- last reported slot is 803, next slot 804, remaining 491ms => deadline = 491 - 1000 = 0 (skipps the block)
Issue 2: Wrongfully building the first block of the next slot:
- Wall clock 805, parachain slot 803 (relay chain not yet imported, same parent same paraslot outdated)
- last reported slot is 805, next slot 806, remaining 6s, deadline = 6s - 1s = 5s => builds block

When issue 1 happens, the collator A is building less blocks than expected degrading the block times.

When issue 2 happens, the collator A is competing (outside of his slot) with another collator B. Since both collators are building their blocks at roughly the same time, it is a matter of chance which one gets backed first by the relay chain.

However, in 95% of incidents, collator B's block will get backed by the relay chain.
Collator A is at the end of his slot (one block past it, in fact), while collator B has a fresh connection with the backing group. One theory is that connections might degrade over time and that could explain why collator B block gets backed 95% of incidents.

This exposes the Issue 3 (unaddressed in this PR):

T0: Collator A advertises its block 0xA (build wrongfully)
T0: Collator B advertises at the same height the block 0xB
T1: Collator B imports the block 0xA as best then starts building 9 other blocks on this fork
T2: Relay chain backs 0xB invalidating and degrading the block confidence

Root Cause

Because we use an RP offset=1, we only start building on top of relay_parent=0xa052…1016 relay_parent_num=30397129 when #30397130 (0xa052…1016 → 0x201f…fc48) gets imported.

Node 3 was building properly 10 blocks, skipping 2. Then on the next block production opportunity, the relay block import races with building:

at 10:20:31.002 we start building the 11 block outside our slot because we still build on top of 30397128
at 10:20:31.017 node imports relay 30397130 which would have seen best as 30397129 and detect the paraslot change

Logs

Node 3

// Skips first block opportunity
10:20:24.501
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000) 
  aura_slot=Slot(295623803) relay_parent=0xf970…0b74 relay_parent_num=30397128
    slot=Slot(295623803)
  duration_until_next_slot=491.263683ms
  Adjusted proposal duration. duration=None
  
10:20:25.002
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623804) relay_parent=0xf970…0b74 relay_parent_num=30397128
    slot=Slot(295623803)
  duration_until_next_slot=5.992199112s 
  Adjusted proposal duration. duration=Some(493ms)

// Builds 9 more blocks

// Skips 2 blocks
10:20:30.000
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623804)  relay_parent=0xf970…0b74 relay_parent_num=30397128
  duration_until_next_slot=993.059675ms
  Adjusted proposal duration. duration=None
  
10:20:30.500
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623804) relay_parent=0xf970…0b74 relay_parent_num=30397128
  duration_until_next_slot=493.092078ms
  Adjusted proposal duration. duration=None

// The issue happens here, aura_slot is the wall clock which gets incremented
// while the relay parent stays fixed and results into paraslot 803
// when in fact relay has changed and this node did not see it yet.
//
// Sufficient time to build: allows building wrongfully

10:20:31.002
aura::cumulus: [Parachain] New block production opportunity. slot_duration=SlotDuration(6000)
  aura_slot=Slot(295623805) relay_parent=0xf970…0b74 relay_parent_num=30397128
       slot=Slot(295623803)
  duration_until_next_slot=5.991286702s
  Adjusted proposal duration. duration=Some(492ms)

Node 1 Imports

10:20:24.419 [Relaychain] 🏆 Imported #30397129 (0xf970…0b74 → 0xa052…1016)
10:20:30.960 [Relaychain] 🏆 Imported #30397130 (0xa052…1016 → 0x201f…fc48)
10:20:36.522 [Relaychain] 🏆 Imported #30397131 (0x201f…fc48 → 0xd6b3…5527)

Node 3 Imports

10:20:24.378 [Relaychain] 🏆 Imported #30397129 (0xf970…0b74 → 0xa052…1016)
10:20:31.017 [Relaychain] 🏆 Imported #30397130 (0xa052…1016 → 0x201f…fc48)
10:20:36.524 [Relaychain] 🏆 Imported #30397131 (0x201f…fc48 → 0xd6b3…5527)

This has been detected using:

Lower Block Confidence on Polkadot 96% vs Kusama parachains 99% #11377 (comment)
https://github.com/lexnv/block-confidence-monitor and manual inspection of incidents

Testing Done

have added unit tests
deployment in progress to double check block confidence

cc @sandreim @skunert

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

bkchr · 2026-03-20T20:28:44Z

T5: Relay parent backs 0xC3

But this would be correct? You probably mean the other one?

bkchr

This is the last time I look at such AI slop. IT IS YOUR JOB TO LOOK OVER THIS CODE BEFORE OPENING A PR. Next time I see such a pr, I will just close it.

Reading the explanation is extremely hard to follow what is going on. The changes are just "wild". If I understand it correctly and it is about the relay parent offset, wouldn't it be much simpler to directly use remove the relay parent offset from duration_now? Then we don't need to adjust the slot later on and it should simplify this pr drastically.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv · 2026-03-23T12:08:08Z

Yep Im going to rethink the PR and update the description, believe we can trim it down to a few lines🙏

Over the weekend:

polkadot YAP went from 97.38 to 97.81%: improved by 0.43%
kusama YAP went from 98.8 to 99.66%: improved by 0.86%

…-blocks

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

to none Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

skunert

Looks much better now, left one comment we should quickly discuss, but then ready to approve.

This case 1 that is mentioned here took me some mental gymnastics to get right again, even though its not so complicated.
Came up with this illustration to clarify:

skunert · 2026-03-24T17:13:20Z

+			hash: Option<H256>,
+			/// True if the collator built a block for the current relay parent, false otherwise.
+			///
+			/// This state is needed, otherwise the opportunity 1 might mark the block as


I am wondering if this is true. Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway. If we run into this situation it should not matter whether this has triggered 🤔 .

Opportunity 1 can only happen if we skip the last block because of slot handover. So the collator authoring the blocks for the next RP is someone else entirely anyway.

In this case, the same collator that sees opportunity 1 is the collator that will build in the next wall slot (going through all the rest of the block building opportunities). Then if we ignore the has_built flag, we are effectively short-circuiting the building of 10 blocks in wall slot 804 and para slot 803 🤔

Have added a new diagram here for the race condition: #11453 (comment)

lexnv · 2026-03-25T12:31:46Z

This diagram shows the race conditions:

(issue 1 from pr description): Wall clock is 803 and parachain slot is 803: The collator A attempts to build this block, but building is skipped because the next wall clock 804 arrives in ~490ms and we have the 1s drift
Then the collator A builds 10 more blocks in Wall clock 803 and parachain slot 803 (expected behavior)
Then the collaotr A skips building 2 blocks because of the 1s drift (expected)
(issue 2 from pr description): Wall clock is now 805 and parachain slot is still 803 (a new block hasnt been imported yet): Collator A wrongfully builds this block (14th attempts in total)

Reproducing the edge-case with the PR fix:
node1-building-v2.txt

bkchr · 2026-03-25T14:56:53Z

+		//
+		//                 [ wall slot | para slot | next wall slot]
+		// opportunity 1:  [ 803       |       803 |         490ms ]
+		// 	  - The wall slot is behind the para slot deduced by the relay block


They are both the same here?

bkchr · 2026-03-25T14:57:00Z

+		// opportunity 1:  [ 803       |       803 |         490ms ]
+		// 	  - The wall slot is behind the para slot deduced by the relay block
+		// 	  - The next slot 804 arrives in 490ms leaving no room for the 1s authoring duration
+		//    - collator must skip the building the first block for this relay block


Yep exactly, when we first deduce para slot 803, it is overlaped in wall slot 803

This leaves us ~490ms until next wall slot 804 arrives, and because it is within the 1s of the next slot it gets skipepd

I don't understand what you are saying. Your example in the code doesn't make sense to me nor your explanation. Why do we only have 490ms?

Maybe this diagram makes a bit easier to follow: #11453 (comment)

We've got two slots, one which is the wall clock (aura slot) that wakes us up every 500ms, and the one that is inferred from the relay parent (para slot). It happens that we get the following situation:

aura slot is 803, currently 500ms away from the next aura slot 804

para slot is 803: the slot in which the collator is allowed to build

We use the aura slot to "cut" the last 1s of the slot, that is we are not building any blocks in the last second of the wall clock (aura slot). For 12 cores, we are building 10 blocks at 500ms, then skipping a full 1s.

Because the aura slot 803 is 500ms away from 804, we'll skip building the block (500 - 1000 cappes at 0).
The pattern is: Skip, Build 10 blocks, Skip Skip, Build 1 block (wrongfully)

bkchr · 2026-03-25T14:58:36Z

+		//     the wall slot ticks.
+		//   - We don't want to build on this relay parent and instead skip until the next relay
+		//     block arrives.
+		struct ParentTracker {


This should be moved outside of this function (the type declaration)

bkchr · 2026-03-25T15:00:23Z

+			has_terminated: bool,
+		}
+
+		let mut parent_tracker =


Isn't it not just enough to check that the block number of the relay chain is strictly increasing? So, we do not build on the same block twice?

Let me know if I got the idea right:

We use a block number that we set when finished building the 10 blocks for this slot, because we reach the 1s limit at the end of the slot (this signals we are at the end of the slot and the block is terminated)

Therefore, when the next wall slot arrives, the relay block on which we are building must be strictly greater than the block we just set, otherwise we are building on the same stale parent

Does this guard against the first case? The scenario where the first block opportunity is skipped for the same wall slot and para slot? 🤔

Also it might have a tiny race with reorgs? Maybe I got the idea wrong:

// We use a simple number to detect when we terminated with the block production let mut last_terminated_relay_number: Option<u32> = None; ... let relay_parent = rp_data.relay_parent().hash(); let relay_parent_header = rp_data.relay_parent().clone(); // Could re-org with a different parent, but mostly ok since that doesnt happen that often? if last_terminated_relay_number >= Some(relay_parent_header.number()) { continue; } ... let Some(adjusted_authoring_duration) = adjusted_authoring_duration else { // But this case doesnt guard against the first opportunity last_terminated_relay_number = Some(relay_parent_header.number()); }

Isn't it not just enough to check that the block number of the relay chain is strictly increasing? So, we do not build on the same block twice?

This sounds right to me.

We can also remove the 1s time offset which makes things a bit easier to understand, However then we'd still have to wait for the RC block import, before we can build.

If using only the relay number we can brick the elastic scaling (maybe missing something from the initial suggestion):

core 0 (0ms): we build one block

core 1 (0ms + 500ms): the same aura slot, but the relay is not increasing

Have coupled the aura slot (wall clock) with a relay parent number. They are set only when we successfully build a block, such that we don't race with the T0 (from the comment) which is a soft error.

Managed to simplify from a Hash and 2 bools to a 2 ints. However, I still believe we could use hashes to tighten the code against reorgs (which we'll now skip until the RP advances)?

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

skunert

Logic is basically the same as with the two bools, but based on block numbers. Lets get this merged.

skunert · 2026-03-27T10:27:30Z

+		//
+		//  - T0: Aura 803 begins. The relay parent is still old (0xA). The node skips.
+		//
+		//  - T1 (soft failure): Aura 803 last block production opportunity. Relay parent 0xB (para


nit: I think its fine to just mention here the race condition in a shorter form and link to this PR or something. 30 lines of comment for this edge case seems a bit much.

eskimor

I have two issues:

One is already mentioned in code. (We should not be using the parachain slot.)

But even with that fix, the code is still not quite right:

Startup

The current block producer will behave itself with this added check. If we are in the next slot, it will wait for the next relay parent, realizing that it is already the next guy's turn and be done. The other guy does not have any safeguards, but also does not need any - as it wasn't its turn on the old relay parent anyways. Now, this works except on startup: There was no previously built block, so nobody behaves itself. This is minor (a fork when booting up a chain), but already hints that the solution is not fully sound.

2 possibilities to satisfy the condition

You wake up with the condition met, before the relay parent of the current slot arrives: Result, you start building, but only after a short-while the next guys will still conflict with you as the next relay parent has arrived. It is a bit unpredictable, with the forks then, which will stay and so on, but in principle this situation could be sticky. (The conditions keeps holding.)
You wake up with the condition met, after the relay parent of the current slot arrives: This seems the desired state, but is still inherently racy with relay chain block propagation. The good thing: Once we are in this state it is also kind of sticky as the logic will enforce us staying in that state ... as long as no block producer skips its slot/reboots or similar things.

Proposals for fixing

Use relay parent of previous slot

Fix this properly. The underlying issue is that we are trying to use a relay parent that is currently being built, we try to accommodate for that fact with the offset, but that is just a hack. The proper solution is to do what we are also properly enforcing with v3: Always build on the relay parent of the last finished relay chain slot. The network assumption is that building/propagation & import can fit in a slot, so by this means we can actually be "sure" that the relay parent is there when we start our work.

This proposal could come with downsides as of now, as we would then use an older relay parent:

Again increased messaging latency
Session boundaries are more rough: With effective synchronous backing, we can predict the session change, with proper async backing (this proposal), this is not possible and we will always drop blocks on session boundaries.

Could still be good, as point (2) only works with empty blocks (but that is predominant situation). Actually utilized elastic scaling has no way to work within synchronous backing boundaries. And for (1) we are already accepting this trade-off and you are already working on a solution (speculative messaging).

Use relay parent of current slot

Always build on the relay parent of the current slot & enforce this: Instead of relying on previously built blocks, just check whether the currently in scope relay parent is the one of the current slot - if not, wait until it arrives.

Downsides:

Block production will have jitter depending on relay chain block propagation.
We will always skip block production opportunities at the beginning of the relay chain slot. (Waiting for the next relay parent to arrive)

But it should not be any worse then what this PR does, as this is essentially the desired effect of it anyways (2nd possibility above).

eskimor · 2026-03-27T12:04:21Z

+			// If the wall slot changes, and the relay parent number is smaller than the last built
+			// relay parent, it means we are in a new slot but a new relay parent has not been
+			// updated on yet (races with import).
+			if aura_slot > last_slot_number && relay_parent_number <= last_rp_built_number {


This looks wrong. aura_slot is the parachain slot, so with parachain slots > relay chain slots, this will not have the desired effect.

skunert · 2026-03-27T14:36:51Z

Use relay parent of previous slot

I like the way we do this in V3. But if my reasoning is correct, it needs to be enforceable. Otherwise someone else could just build on the most recent RP and ignore this rule. Its only node-side after all and parachain clock can not be trusted without enforcement by the RP. As long as they provide a complete RP descendantry it will go through.

eskimor · 2026-03-27T15:42:43Z

I like the way we do this in V3. But if my reasoning is correct, it needs to be enforceable. Otherwise someone else could just build on the most recent RP and ignore this rule. Its only node-side after all and parachain clock can not be trusted without enforcement by the RP. As long as they provide a complete RP descendantry it will go through.

True, I wasn't too concerned about this, as this whole PR is avoiding forks by voluntary back-off by the current block producer. I am afraid that in v2 we can't enforce slots properly in any case, but I agree using last slot makes it arguably worse. I am also leaning towards the more hacky second approach as it better matches the status quo, proper fix will come with v3 anyways.

This reverts commit 2ee71e4.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv · 2026-04-08T09:54:48Z

Closing this in favor of:

Enforce current relay parent to be available #11621

lexnv added 5 commits March 20, 2026 12:03

aura/slot_based: Allow building blocks in the effective parachain slots

1dd1358

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

aura/slot_based: Use saturating add for overflows

35ffc0d

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

aura/slot_based: Log effective slot for debuggability

3c730de

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

aura/slot_based: Detect stale effective slots on different authors

a963b39

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

aura/slot_based: Add tests

e9fdc0b

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv self-assigned this Mar 20, 2026

lexnv added the T0-node This PR/Issue is related to the topic “node”. label Mar 20, 2026

github-project-automation Bot added this to [preview] release tracker Mar 20, 2026

github-project-automation Bot moved this to Todo in [preview] release tracker Mar 20, 2026

aura/slot_based: Adjust the limits based on kusama data

68067d4

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

pgherveou removed this from [preview] release tracker Mar 20, 2026

bkchr requested changes Mar 20, 2026

View reviewed changes

aura/slot_based: Revert the minimum duration

30a70de

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv added 4 commits March 23, 2026 12:32

Merge remote-tracking branch 'origin/master' into lexnv/fix-authoring…

f3a752f

…-blocks

Revert to origin master

fead047

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

aura: Ensure block builder marks the slot as terminated after adjusting

dcb9520

to none Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

aura: Add more documentation into the race condition

542d7e7

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv requested review from bkchr and skunert March 24, 2026 14:48

skunert reviewed Mar 24, 2026

View reviewed changes

bkchr reviewed Mar 25, 2026

View reviewed changes

Use block numbers

2ee71e4

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

skunert approved these changes Mar 27, 2026

View reviewed changes

eskimor requested changes Mar 27, 2026

View reviewed changes

sandreim approved these changes Mar 27, 2026

View reviewed changes

lexnv added 2 commits March 31, 2026 10:39

Revert "Use block numbers"

df8c3c7

This reverts commit 2ee71e4.

Update comment

0b47879

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

This was referenced Apr 1, 2026

[DNM/noreview] Deploy latest patches on YAP Polkadot w 3 cores fixed #11598

Open

Enforce current relay parent to be available #11621

Merged

lexnv requested review from bkchr and eskimor April 7, 2026 10:05

lexnv closed this Apr 8, 2026

Conversation

lexnv commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root Cause

Logs

Node 3

Node 1 Imports

Node 3 Imports

Testing Done

Uh oh!

bkchr commented Mar 20, 2026

Uh oh!

bkchr left a comment

Choose a reason for hiding this comment

Uh oh!

lexnv commented Mar 23, 2026

Uh oh!

skunert left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lexnv commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lexnv Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lexnv Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

skunert left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eskimor left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Startup

2 possibilities to satisfy the condition

Proposals for fixing

Use relay parent of previous slot

Use relay parent of current slot

Uh oh!

Choose a reason for hiding this comment

Uh oh!

skunert commented Mar 27, 2026

Uh oh!

eskimor commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lexnv commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

lexnv commented Mar 20, 2026 •

edited

Loading

lexnv commented Mar 25, 2026 •

edited

Loading

lexnv Mar 25, 2026 •

edited

Loading

lexnv Mar 26, 2026 •

edited

Loading

eskimor left a comment •

edited

Loading

eskimor commented Mar 27, 2026 •

edited

Loading