Fix core sharing and make use of scheduling_lookahead #4724

Merged: 37 commits into master from tsv-ct-core-sharing, Jun 19, 2024

Conversation

@tdimitrov (Contributor) commented Jun 7, 2024:

Implements most of #1797

Core sharing (two or more parachains scheduled on the same core with the same PartsOf57600 value) was not working correctly. The expected behaviour is to have a Backed and an Included event in each block for the paras sharing the core, with the paras taking turns. E.g. for two paras sharing a core we expect: Backed(a); Included(a)+Backed(b); Included(b)+Backed(a); etc. Instead, each block contains just one event and there are a lot of gaps (blocks without events) during the session.
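
For illustration only (not part of the PR), a minimal sketch of the expected alternation for two paras sharing a core, assuming a candidate backed in one block becomes available (Included) in the next:

```rust
// Illustrative sketch only: prints the expected per-block events for two
// hypothetical paras ("a", "b") sharing one core with equal PartsOf57600.
fn main() {
    let paras = ["a", "b"];
    for block in 0usize..6 {
        let backed = paras[block % paras.len()];
        if block == 0 {
            println!("block {block}: Backed({backed})");
        } else {
            // The candidate backed in the previous block is now included.
            let included = paras[(block - 1) % paras.len()];
            println!("block {block}: Included({included}) + Backed({backed})");
        }
    }
}
```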

Core sharing should also work when collators are building collations ahead of time.

TODOs:

  • Add a zombienet test verifying that the behaviour mentioned above works.
  • prdoc

@eskimor (Member) left a comment:

Ok. I don't get how this code fixes anything. The only place that is now fully taking advantage of the claim queue is prospective parachains, but that should not help much if we only ever provide collations for claim queue depth 0.

What am I missing?

@alexggh self-requested a review, June 7, 2024 10:28
@eskimor (Member) commented Jun 7, 2024:

OK, it likely works because max_ancestry_len is long enough that your collation will still be valid on the next occurrence in the claim queue. This is why the change in prospective parachains has an effect.

@alindima (Contributor) replied:

> that your collation will still be valid on the next occurrence in the claim queue. This is why the change in prospective parachains has an effect.

Correct. Still, we don't seem to need an allowed_ancestry_len larger than 2. I tested with this value and 4 parachains sharing a core with 1.8-second candidates, and it works in zombienet.

@eskimor at this moment, we don't have any collator that builds collations for more than one relay chain block in advance (lookahead and slot-based don't do this). So adding any change here that takes the full claim queue into account will not have any effect.

Collators search for an availability core where the para id is scheduled at this block or at the very next block, and build max_candidate_depth + 1 candidates.

That's why the fixes in this PR (besides the one in prospective-parachains) don't need the claim queue and can make do with the next_up_on_available value.
The fixes in collator-protocol (both validator and collator sides) work because we were previously not looking into next_up_on_available at all; we were using the para id of the already occupied core.

The only place where the claim queue makes a difference is indeed in prospective-parachains. There, we must not drop candidates for a para which are not scheduled for the next core occupancy but may be scheduled for the one after that.
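
A rough sketch of the collator-protocol idea described above, with hypothetical stand-in types (the real subsystem works with the runtime's occupied-core data): the paras considered assigned to an occupied core are derived from next_up_on_available rather than from the para currently occupying the core.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the occupied-core info exposed by the runtime API.
#[allow(dead_code)]
struct OccupiedCore {
    occupying_para: u32,
    // The para that gets the core once the current candidate clears availability.
    next_up_on_available: Option<u32>,
}

// Before the fix (conceptually), only the para currently occupying the core was
// considered; after it, the set of assigned paras is built from `next_up_on_available`.
fn assigned_paras(occupied_cores: &[OccupiedCore]) -> HashSet<u32> {
    occupied_cores
        .iter()
        .filter_map(|core| core.next_up_on_available)
        .collect()
}
```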

@alindima (Contributor) commented:

Discussed offline with @eskimor. So far, the PR was fixing core sharing but not implementing the changes required for collators to build collations ahead of time.
While it works with the current collators, we want collators that are able to build collations before they're scheduled (particularly useful for on-demand parachains). I'll make these changes.

@alindima self-assigned this, Jun 12, 2024
@alindima changed the title from "Use claim queue in collator-protocol and prospective-parachains" to "Fix core sharing and make use of scheduling_lookahead", Jun 12, 2024
@alindima added the T8-polkadot label, Jun 12, 2024
@alindima marked this pull request as ready for review, June 12, 2024 13:50
@alindima requested a review from a team as a code owner, June 13, 2024 15:04
@eskimor (Member) left a comment:

We need to ensure fairness between paras. This can be a follow-up though.

polkadot/node/core/backing/src/lib.rs (outdated, resolved)
};

- if assigned_para_id != candidate_para_id {
+ if !assigned_paras.contains(&candidate_para_id) {
Member commented:

Not introduced here, but would be good to clear up. This function is called core_index_from_statement ... I would not expect it to do any validation. Do we have a better place to do this check?

Contributor replied:

I thought about this, and we could move this bit of validation to the caller, but I think that hurts readability. core_index_from_statement is already doing several validation steps.

polkadot/node/core/prospective-parachains/src/lib.rs (outdated, resolved)
polkadot/node/core/prospective-parachains/src/tests.rs (outdated, resolved)
@@ -293,7 +293,7 @@ impl Collations {
} else {
	1
};
- self.seconded_count < seconded_limit
+ self.seconded_count >= seconded_limit
Member commented:

Note: this limit is based on max_candidate_depth, but is not per para (despite the docs suggesting so). Maybe we still want a global (for all paras) limit, but it certainly is not enough: a global limit only protects the validator, not the parachains from each other.

For the latter we need to ensure fairness in collation fetching (a rough sketch follows the list):

  1. Paras should get fetches proportional to how often they occur in the claim queue. E.g. for a claim queue [A, B, A], A can have twice as many fetches as B.
  2. If for two paras the weighted fetch count (actual fetches / claim queue occurrences) is equal, then prefer the para that is higher up in the claim queue, as it is more urgent.
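
A rough sketch of such a weighted selection (a hypothetical helper, not code from this PR or its follow-up):

```rust
use std::collections::HashMap;

// Pick the para to fetch a collation for next: lowest weighted fetch count
// (actual fetches / claim queue occurrences) first, ties broken by the earlier,
// more urgent, claim queue position. All names are made up for illustration.
fn next_para_to_fetch(claim_queue: &[u32], fetches: &HashMap<u32, usize>) -> Option<u32> {
    let mut best: Option<(f64, usize, u32)> = None;
    for (pos, &para) in claim_queue.iter().enumerate() {
        let occurrences = claim_queue.iter().filter(|&&p| p == para).count();
        let weight = *fetches.get(&para).unwrap_or(&0) as f64 / occurrences as f64;
        // Prefer the lowest weight; on ties, the earlier position wins.
        if best.map_or(true, |(w, p, _)| (weight, pos) < (w, p)) {
            best = Some((weight, pos, para));
        }
    }
    best.map(|(_, _, para)| para)
}
```

Driving fetches with this rule on the [A, B, A] example converges to roughly twice as many fetches for A as for B, with ties going to the more urgent claim.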

Contributor replied:

Yes, I decided to leave the fairness bit as a follow-up.

I have some thoughts on this; I'll add them to #1797.

polkadot/node/subsystem-util/src/vstaging.rs (outdated, resolved)
polkadot/zombienet_tests/smoke/assign-core-parts.js (outdated, resolved)
@paritytech-cicd-pr commented:

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6477941

@alindima requested a review from @eskimor, June 17, 2024 14:22
polkadot/node/core/backing/src/lib.rs (resolved)
@@ -31,7 +31,7 @@ const LOG_TARGET: &'static str = "parachain::subsystem-util-vstaging";

/// A snapshot of the runtime claim queue at an arbitrary relay chain block.
#[derive(Default)]
- pub struct ClaimQueueSnapshot(BTreeMap<CoreIndex, VecDeque<ParaId>>);
+ pub struct ClaimQueueSnapshot(pub BTreeMap<CoreIndex, VecDeque<ParaId>>);
Member commented:

Why do we need this pub? I mean as long as we are not keeping any invariants via the wrapper, I don't see a problem, but is it needed?

Contributor replied:

There aren't any invariants kept, and you suggested using the VecDeque directly: #4724 (comment).
Instead of adding more methods that look similar to each other, I decided there's no reason why we can't expose its inner type.

Contributor commented:

Nit: alternatively you could use derive_more::Deref for read-only access.
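
For reference, a sketch of what that alternative might look like (assuming polkadot_primitives::{CoreIndex, Id as ParaId} and the derive_more crate; not what the PR ended up doing):

```rust
use std::collections::{BTreeMap, VecDeque};
use polkadot_primitives::{CoreIndex, Id as ParaId};

/// Read-only access to the inner map via Deref instead of a `pub` field.
#[derive(Default, derive_more::Deref)]
pub struct ClaimQueueSnapshot(BTreeMap<CoreIndex, VecDeque<ParaId>>);

// Callers could then iterate the queues directly, e.g.:
// for (core_idx, queue) in snapshot.iter() { ... }
```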

polkadot/runtime/parachains/src/scheduler.rs (resolved)
# assign core 0 to be shared by all paras.
validator-0: js-script ./assign-core.js with "0,2000,14400,2001,14400,2002,14400,2003,14400" return is 0 within 600 seconds

collator-2000: reports block height is at least 6 within 200 seconds
Contributor commented:

With these high timeouts and low block heights, aren't we at a high risk of the test passing accidentally?

@alindima (Contributor) replied, Jun 19, 2024:

Each parachain should get one block every 4 relay chain blocks, so one every 24 seconds (4 relay blocks × 6 seconds), which would amount to about 144 seconds for a block height of 6. An extra minute is needed for the session change (we're registering the parachains manually, because zombienet would otherwise also assign cores). Just to be sure, I ran this test with the code on master and it fails.

Contributor replied:

Alright, makes sense; then the timeouts aren't that high. Thank you!

@@ -1029,7 +1030,7 @@ fn core_index_from_statement(
?group_rotation_info,
?statement,
?validator_to_group,
- n_cores = ?cores.len(),
+ n_cores,
Member commented:

Not for this PR, but once we get rid of the legacy stuff, I think we should make e.g. this iterator accessible. That should enable us to get rid of this parameter as well (and of fetching the cores to begin with).

@eskimor (Member) commented Jun 19, 2024:

Thank you! Great work!

@alindima added this pull request to the merge queue, Jun 19, 2024
Merged via the queue into master with commit 739c37b, Jun 19, 2024
165 of 166 checks passed
@alindima deleted the tsv-ct-core-sharing branch, June 19, 2024 10:24
Labels: T8-polkadot (This PR/Issue is related to/affects the Polkadot network.)

6 participants