Fix core sharing and make use of scheduling_lookahead #4724

Merged: 37 commits into master from tsv-ct-core-sharing, Jun 19, 2024

Conversation

@tdimitrov (Contributor) commented Jun 7, 2024:

Implements most of #1797

Core sharing (two or more parachains scheduled on the same core with the same PartsOf57600 value) was not working correctly. The expected behaviour is to have a Backed and an Included event in each block for the paras sharing the core, with the paras taking turns. E.g. for two paras sharing a core we expect: Backed(a); Included(a)+Backed(b); Included(b)+Backed(a); etc. Instead, each block contains just one event and there are a lot of gaps (blocks without events) during the session.
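
For illustration only (not part of the PR), a minimal sketch of the expected alternation for two paras sharing a core, assuming a candidate backed in one block becomes available (Included) in the next:

```rust
// Illustrative sketch only: prints the expected per-block events for two
// hypothetical paras ("a", "b") sharing one core with equal PartsOf57600.
fn main() {
    let paras = ["a", "b"];
    for block in 0usize..6 {
        let backed = paras[block % paras.len()];
        if block == 0 {
            println!("block {block}: Backed({backed})");
        } else {
            // The candidate backed in the previous block is now included.
            let included = paras[(block - 1) % paras.len()];
            println!("block {block}: Included({included}) + Backed({backed})");
        }
    }
}
```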

Core sharing should also work when collators are building collations ahead of time.

TODOs:

  • Add a zombienet test verifying that the behaviour mentioned above works.
  • prdoc

@eskimor (Member) left a comment:

Ok. I don't get how this code fixes anything. The only place that is now fully taking advantage of the claim queue is prospective parachains, but that should not help much if we only ever provide collations for claim queue depth 0.

What am I missing?

@alexggh self-requested a review, June 7, 2024 10:28
@eskimor (Member) commented Jun 7, 2024:

OK, it likely works because max_ancestry_len is long enough that your collation will still be valid on the next occurrence in the claim queue. This is why the change in prospective parachains has an effect.

@alindima (Contributor) replied:

> that your collation will still be valid on the next occurrence in the claim queue. This is why the change in prospective parachains has an effect.

Correct. Still, we don't seem to need an allowed_ancestry_len larger than 2. I tested with this value and 4 parachains sharing a core with 1.8-second candidates, and it works in zombienet.

@eskimor at this moment, we don't have any collator that builds collations for more than one relay chain block in advance (lookahead and slot-based don't do this). So adding any change here that takes the full claim queue into account will not have any effect.

Collators search for an availability core where the para id is scheduled at this block or at the very next block, and build max_candidate_depth + 1 candidates.

That's why the fixes in this PR (besides the one in prospective-parachains) don't need the claim queue and can make do with the next_up_on_available value.
The fixes in collator-protocol (both validator and collator sides) work because we were previously not looking into next_up_on_available at all; we were using the para id of the already occupied core.

The only place where the claim queue makes a difference is indeed in prospective-parachains. There, we must not drop candidates for a para which are not scheduled for the next core occupancy but may be scheduled for the one after that.
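
A rough sketch of the collator-protocol idea described above, with hypothetical stand-in types (the real subsystem works with the runtime's occupied-core data): the paras considered assigned to an occupied core are derived from next_up_on_available rather than from the para currently occupying the core.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the occupied-core info exposed by the runtime API.
#[allow(dead_code)]
struct OccupiedCore {
    occupying_para: u32,
    // The para that gets the core once the current candidate clears availability.
    next_up_on_available: Option<u32>,
}

// Before the fix (conceptually), only the para currently occupying the core was
// considered; after it, the set of assigned paras is built from `next_up_on_available`.
fn assigned_paras(occupied_cores: &[OccupiedCore]) -> HashSet<u32> {
    occupied_cores
        .iter()
        .filter_map(|core| core.next_up_on_available)
        .collect()
}
```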

@alindima (Contributor) commented:

Discussed offline with @eskimor. So far, the PR was fixing core sharing but not implementing the changes required for collators to build collations ahead of time.
While it works with the current collators, we want collators that are able to build collations before they're scheduled (particularly useful for on-demand parachains). I'll make these changes.

@alindima self-assigned this, Jun 12, 2024
@alindima changed the title from "Use claim queue in collator-protocol and prospective-parachains" to "Fix core sharing and make use of scheduling_lookahead", Jun 12, 2024
@alindima added the T8-polkadot label, Jun 12, 2024
@alindima marked this pull request as ready for review, June 12, 2024 13:50
@alindima requested a review from a team as a code owner, June 13, 2024 15:04
@eskimor (Member) left a comment:

We need to ensure fairness between paras. This can be a follow-up though.

polkadot/node/core/backing/src/lib.rs (outdated, resolved)
};

- if assigned_para_id != candidate_para_id {
+ if !assigned_paras.contains(&candidate_para_id) {
Member commented:

Not introduced here, but would be good to clear up. This function is called core_index_from_statement ... I would not expect it to do any validation. Do we have a better place to do this check?

Contributor replied:

I thought about this, and we could move this bit of validation to the caller, but I think that hurts readability. core_index_from_statement is already doing several validation steps.

polkadot/node/core/prospective-parachains/src/lib.rs (outdated, resolved)
polkadot/node/core/prospective-parachains/src/tests.rs (outdated, resolved)
@@ -293,7 +293,7 @@ impl Collations {
} else {
	1
};
- self.seconded_count < seconded_limit
+ self.seconded_count >= seconded_limit
Member commented:

Note: this limit is based on max_candidate_depth, but is not per para (despite the docs suggesting so). Maybe we still want a global (for all paras) limit, but it certainly is not enough: a global limit only protects the validator, not the parachains from each other.

For the latter we need to ensure fairness in collation fetching (a rough sketch follows the list):

  1. Paras should get fetches proportional to how often they occur in the claim queue. E.g. for a claim queue [A, B, A], A can have twice as many fetches as B.
  2. If for two paras the weighted fetch count (actual fetches / claim queue occurrences) is equal, then prefer the para that is higher up in the claim queue, as it is more urgent.
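
A rough sketch of such a weighted selection (a hypothetical helper, not code from this PR or its follow-up):

```rust
use std::collections::HashMap;

// Pick the para to fetch a collation for next: lowest weighted fetch count
// (actual fetches / claim queue occurrences) first, ties broken by the earlier,
// more urgent, claim queue position. All names are made up for illustration.
fn next_para_to_fetch(claim_queue: &[u32], fetches: &HashMap<u32, usize>) -> Option<u32> {
    let mut best: Option<(f64, usize, u32)> = None;
    for (pos, &para) in claim_queue.iter().enumerate() {
        let occurrences = claim_queue.iter().filter(|&&p| p == para).count();
        let weight = *fetches.get(&para).unwrap_or(&0) as f64 / occurrences as f64;
        // Prefer the lowest weight; on ties, the earlier position wins.
        if best.map_or(true, |(w, p, _)| (weight, pos) < (w, p)) {
            best = Some((weight, pos, para));
        }
    }
    best.map(|(_, _, para)| para)
}
```

Driving fetches with this rule on the [A, B, A] example converges to roughly twice as many fetches for A as for B, with ties going to the more urgent claim.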

Contributor replied:

Yes, I decided to leave the fairness bit as a follow-up.

I have some thoughts on this; I'll add them to #1797.

polkadot/node/subsystem-util/src/vstaging.rs (outdated, resolved)
polkadot/zombienet_tests/smoke/assign-core-parts.js (outdated, resolved)
@paritytech-cicd-pr commented:

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6477941

@alindima requested a review from @eskimor, June 17, 2024 14:22
polkadot/node/core/backing/src/lib.rs (resolved)
@@ -31,7 +31,7 @@ const LOG_TARGET: &'static str = "parachain::subsystem-util-vstaging";

/// A snapshot of the runtime claim queue at an arbitrary relay chain block.
#[derive(Default)]
- pub struct ClaimQueueSnapshot(BTreeMap<CoreIndex, VecDeque<ParaId>>);
+ pub struct ClaimQueueSnapshot(pub BTreeMap<CoreIndex, VecDeque<ParaId>>);
Member commented:

Why do we need this pub? I mean as long as we are not keeping any invariants via the wrapper, I don't see a problem, but is it needed?

Contributor replied:

There aren't any invariants kept, and you suggested using the VecDeque directly: #4724 (comment).
Instead of adding more methods that look similar to each other, I decided there's no reason why we can't expose its inner type.

Contributor commented:

Nit: alternatively you could use derive_more::Deref for read-only access.
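
For reference, a sketch of what that alternative might look like (assuming polkadot_primitives::{CoreIndex, Id as ParaId} and the derive_more crate; not what the PR ended up doing):

```rust
use std::collections::{BTreeMap, VecDeque};
use polkadot_primitives::{CoreIndex, Id as ParaId};

/// Read-only access to the inner map via Deref instead of a `pub` field.
#[derive(Default, derive_more::Deref)]
pub struct ClaimQueueSnapshot(BTreeMap<CoreIndex, VecDeque<ParaId>>);

// Callers could then iterate the queues directly, e.g.:
// for (core_idx, queue) in snapshot.iter() { ... }
```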

polkadot/runtime/parachains/src/scheduler.rs (resolved)
# assign core 0 to be shared by all paras.
validator-0: js-script ./assign-core.js with "0,2000,14400,2001,14400,2002,14400,2003,14400" return is 0 within 600 seconds

collator-2000: reports block height is at least 6 within 200 seconds
Contributor commented:

With these high timeouts and low block heights, aren't we at a high risk of the test passing accidentally?

@alindima (Contributor) replied, Jun 19, 2024:

Each parachain should get one block every 4 relay chain blocks, so one every 24 seconds (4 relay blocks × 6 seconds), which would amount to about 144 seconds for a block height of 6. An extra minute is needed for the session change (we're registering the parachains manually, because zombienet would otherwise also assign cores). Just to be sure, I ran this test with the code on master and it fails.

Contributor replied:

Alright, makes sense; then the timeouts aren't that high. Thank you!

@@ -1029,7 +1030,7 @@ fn core_index_from_statement(
?group_rotation_info,
?statement,
?validator_to_group,
- n_cores = ?cores.len(),
+ n_cores,
Member commented:

Not for this PR, but once we get rid of the legacy stuff, I think we should make e.g. this iterator accessible. That should enable us to get rid of this parameter as well (and of fetching the cores to begin with).

@eskimor (Member) commented Jun 19, 2024:

Thank you! Great work!

@alindima added this pull request to the merge queue, Jun 19, 2024
Merged via the queue into master with commit 739c37b, Jun 19, 2024
165 of 166 checks passed
@alindima deleted the tsv-ct-core-sharing branch, June 19, 2024 10:24
Labels: T8-polkadot (This PR/Issue is related to/affects the Polkadot network.)

6 participants