Conversation
Force-pushed from be812cb to 340fdfb.
I think we can do better than this, with some separation of concerns and a more robust check. We should actually query the collator protocol and ask it whether we can use a specific relay parent.
Please look at `distribute_collation`, as there are multiple checks on the relay parent there which we could do before deciding to create a collation on it.
Also, we should keep in mind that the advertisement can actually happen later, and by that time the relay parent might no longer be valid if a new block was created in a new session. This means that what we do here will not always prevent the situation from happening.
```rust
where
    Client: RelayChainInterface,
{
    let Ok(relay_best_hash) = relay_client.best_block_hash().await else {
```
You already have `relay_best_hash` in the caller scope; you can just pass it to the fn.
@sandreim can you expand a bit on your last message / explain more what you had in mind? I'll call checking if the relay parent is in an old session for
From where should we query the collator protocol? Later you say:
Thus you make it sound like performing
Hmm, did you mean I should perform
By "query the collator protocol", do you mean adding a method to the trait, i.e. `collator.collator_service().some_function()`? I feel properly confused now 😅
I thought we wanted to fix this on the relay chain side? #9766
I have a check here which already checks whether there will be an epoch change in the relay parent's ancestry. If yes, I am including the next epoch's authorities for verification. Currently these blocks are getting dropped anyway, so the implementation is already forward-looking, because I assumed at some point we will not drop them anymore 😬.
We want to avoid even building the parablock in the first place, to not waste resources (degrading block confidence), so it must happen on the parachain side then, right?
Confidence is dropping because we drop candidates at session boundaries. If we didn't do that, confidence would not be reduced and parachains could keep producing blocks as they do now, right? #9766 But yeah, I assume it takes too long or is not scheduled?
#9766 is a different issue. If you read the ticket, it is about candidates that have already been backed on chain and are pending availability. To solve that one we need to fix availability. The issue in #9977 is that these candidates are not even advertised by the collator protocol, because the relay parent is already out of scope. To properly fix it we need to allow candidates with relay parents from the previous session. That fix would require changes in the collator protocol, backing, and prospective-parachains. We will need to do this to support low-latency parachains. IIRC we discussed with @eskimor about decoupling the relay parent we use for execution context from the one we use for scheduling information.
The fix in this PR should be very easy but will not be perfect. The candidate could be fetched because a new session was not observed yet, but then dropped from prospective-parachains as soon as the RC advances into the new session. What we can do for now is to not build a collation on an older RP if we've already seen the RC best block in the new session.
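The stop-gap described here, skipping collation building on an older RP once the best block is in a new session, boils down to a session comparison. A minimal illustrative sketch; `SessionIndex`, `RelayBlockInfo`, and `should_build_on` are hypothetical names for this model, not the actual polkadot-sdk API:

```rust
// Hypothetical model of the "don't build on a previous-session relay parent"
// check. In the real code the session would come from the relay chain runtime.
#[derive(Clone, Copy, PartialEq, Eq)]
struct SessionIndex(u32);

struct RelayBlockInfo {
    session: SessionIndex,
}

/// Returns true if it is safe (session-wise) to build a collation on
/// `relay_parent`, given the best relay chain block we have seen.
fn should_build_on(relay_parent: &RelayBlockInfo, best: &RelayBlockInfo) -> bool {
    // If the best block is already in a newer session, the candidate would be
    // dropped by prospective-parachains anyway, so don't waste resources.
    relay_parent.session == best.session
}

fn main() {
    let rp = RelayBlockInfo { session: SessionIndex(7) };
    let same_session_best = RelayBlockInfo { session: SessionIndex(7) };
    let new_session_best = RelayBlockInfo { session: SessionIndex(8) };
    assert!(should_build_on(&rp, &same_session_best));
    assert!(!should_build_on(&rp, &new_session_best));
}
```

As the discussion notes, this is best-effort: the advertisement can still happen after the session boundary has passed.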
I am not familiar with this code, does it solve what I said above? Also, I don't think this should be solved in the cumulus code, because of separation of concerns. That's why I am proposing to query the collator protocol subsystem to do a sanity check on the relay parent before proceeding with block production. When we allow RPs from the previous session, you won't need to change anything in cumulus.
Yes, before, I propose you send a message to |
The check I mentioned above checks whether any of the RP descendants we use to enforce the offset contain a session-change digest. If they do, we include additional relay chain authorities in the inherent storage proof. I did this because I was assuming that we will allow relay parents from old sessions at some point. So it does not currently fix the issue you want to fix, but the check can be used for this, because the condition is the same.
The concerns of finding the correct relay parent are currently not separated anyway. We have the `parent_search`, which tries to find a suitable parent block that lives in the same session as the tip of the chain. While thinking about this issue here I realized that we pass the
So I think what we should do is:
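The descendant scan described above can be modeled roughly as follows. This is a minimal sketch with simplified stand-ins for the real Substrate header and digest types; `contains_epoch_change` mirrors the name used in the diff, while `Header`, `DigestItem`, and `find_descendants` here are illustrative only:

```rust
// Simplified stand-in for a relay chain header and its digest items.
enum DigestItem {
    EpochChange,
    Other,
}

struct Header {
    number: u32,
    digest: Vec<DigestItem>,
}

/// Stand-in for the real `sc_consensus_babe::contains_epoch_change`.
fn contains_epoch_change(header: &Header) -> bool {
    header.digest.iter().any(|d| matches!(d, DigestItem::EpochChange))
}

/// Returns `None` when any header in the relay parent's descendant chain
/// signals an epoch change, i.e. building on this relay parent would land
/// the candidate in a previous session.
fn find_descendants(headers: &[Header]) -> Option<Vec<u32>> {
    if headers.iter().any(contains_epoch_change) {
        return None;
    }
    Some(headers.iter().map(|h| h.number).collect())
}

fn main() {
    let clean = vec![
        Header { number: 10, digest: vec![DigestItem::Other] },
        Header { number: 11, digest: vec![] },
    ];
    let with_change = vec![
        Header { number: 10, digest: vec![DigestItem::Other] },
        Header { number: 11, digest: vec![DigestItem::EpochChange] },
    ];
    assert_eq!(find_descendants(&clean), Some(vec![10, 11]));
    assert_eq!(find_descendants(&with_change), None);
}
```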
Force-pushed from 1716d38 to 5a3ee8a.
Force-pushed from 5a3ee8a to a05663e.
@skunert @sandreim I've tried the approach proposed by Sebastian (at least according to my potentially flawed understanding of his instructions) and ran the Zombienet tests with these changes, and it did NOT fix the issue, see changes below:
Then I re-applied my implementation from two weeks ago and ran the same Zombienet tests, and it DOES solve the issue. I've probably misunderstood Sebastian's input... but yeah, I'm blocked until further input.
This reverts commit a05663e.
…ecting epoch changes in header.
I've verified that this PR fixes the issue by running the ZN test. I've also added some unit tests of
sandreim left a comment:
Can we also enhance an existing ZN test to ensure that we don't build on a previous-session relay parent?
```rust
}

if sc_consensus_babe::contains_epoch_change::<RelayBlock>(&relay_header) {
    tracing::debug!(target: LOG_TARGET, ?relay_best_block, "Relay header contains epoch change.");
```
That is true, but it doesn't fully help debugging if you don't have much context. I'd make the log more explicit, like this:

```diff
- tracing::debug!(target: LOG_TARGET, ?relay_best_block, "Relay header contains epoch change.");
+ tracing::debug!(target: LOG_TARGET, ?relay_best_block, "Relay parent is in previous session");
```

Let's also log the relay block number here. When reading the logs, we then don't have to do the mapping to see at which height we are. I've started doing this more and more; even though it's technically not necessary, it makes debugging easier.
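As a rough illustration of that suggestion, including the block number makes a log line readable without mapping hashes to heights. The `log_line` helper below is hypothetical, standing in for the field list of the actual `tracing::debug!` call:

```rust
// Hypothetical helper showing the shape of the suggested log line: hash plus
// block number, so no hash-to-height lookup is needed when reading logs.
fn log_line(relay_best_block: &str, number: u32) -> String {
    format!("relay_best_block={relay_best_block} number={number}: Relay parent is in previous session")
}

fn main() {
    let line = log_line("0x1234…abcd", 1234);
    assert!(line.contains("number=1234"));
    println!("{line}");
}
```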
```rust
    .relay_parent_header
    .clone();
if sc_consensus_babe::contains_epoch_change::<RelayBlock>(&next_header) {
    tracing::debug!(target: LOG_TARGET, ?relay_best_block, ancestor = %next_header.hash(), "Next header contains epoch change.");
```
```diff
  };

- let Ok(rp_data) = offset_relay_parent_find_descendants(
+ let Ok(Some(rp_data)) = offset_relay_parent_find_descendants(
```
Is there any place where we actually log the reason we are not building a block?

Inside `offset_relay_parent_find_descendants` there is logging.
I agree that for each abort reason there should be at least some message telling why we skipped, either inside the helpers or here in the main loop.
Instead of `Option<RelayParentData>` I can use a specific enum with more contextual information, perhaps (please help with naming):

```diff
+ enum RelayParentDescendantsSearchOutcome {
+     /// A potentially suitable relay parent to build upon.
+     Data(RelayParentData),
+     /// We should not build on this relay parent, it is in an old session.
+     RelayParentIsInPreviousSession,
+ }

  async fn offset_relay_parent_find_descendants<RelayClient>(
      // args omitted
- ) -> Result<Option<RelayParentData>, ()> {
+ ) -> Result<RelayParentDescendantsSearchOutcome, ()> {
      ...
      if relay_parent_offset == 0 {
-         return Ok(Some(RelayParentData::new(relay_header)));
+         return Ok(RelayParentDescendantsSearchOutcome::Data(RelayParentData::new(relay_header)));
      }
      if sc_consensus_babe::contains_epoch_change::<RelayBlock>(&relay_header) {
          ...
-         return Ok(None);
+         return Ok(RelayParentDescendantsSearchOutcome::RelayParentIsInPreviousSession);
      }
      ...
      while required_ancestors.len() < relay_parent_offset as usize {
          ...
          if sc_consensus_babe::contains_epoch_change::<RelayBlock>(&next_header) {
              ...
-             return Ok(None);
+             return Ok(RelayParentDescendantsSearchOutcome::RelayParentIsInPreviousSession);
          }
          ...
      }
      ...
-     Ok(Some(RelayParentData::new_with_descendants(relay_parent, required_ancestors.into())))
+     Ok(RelayParentDescendantsSearchOutcome::Data(RelayParentData::new_with_descendants(relay_parent, required_ancestors.into())))
  }
```
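At the call site, such an enum would let the main loop log the abort reason explicitly instead of silently skipping. A sketch under stated assumptions: the enum payload is simplified to a stand-in type, and `handle` is a hypothetical name for the call-site match:

```rust
// Simplified version of the proposed outcome enum; the real `Data` variant
// would carry a `RelayParentData`.
enum SearchOutcome {
    Data(u32),
    RelayParentIsInPreviousSession,
}

/// Hypothetical call-site handling: each abort reason produces an explicit
/// message, addressing the review comment about logging why we skipped.
fn handle(outcome: SearchOutcome) -> Option<u32> {
    match outcome {
        SearchOutcome::Data(rp_data) => Some(rp_data),
        SearchOutcome::RelayParentIsInPreviousSession => {
            // Stand-in for a tracing::debug! with the skip reason.
            eprintln!("Skipping block production: relay parent is in a previous session");
            None
        },
    }
}

fn main() {
    assert_eq!(handle(SearchOutcome::Data(42)), Some(42));
    assert_eq!(handle(SearchOutcome::RelayParentIsInPreviousSession), None);
}
```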
```rust
}

#[tokio::test]
async fn offset_returns_none_when_best_header_contains_epoch_change() {
```
Nice tests!
One nitpick is that you could use `rstest`, example here:
I like that the parameterization makes the file a bit shorter. But it's optional, do as you like here.

Of course, nice! I did not know we had used `rstest` before (I'm a bit careful about assuming I can just add any crate as a dependency).
Co-authored-by: Sebastian Kunert <skunert49@gmail.com>
All GitHub workflows were cancelled due to the failure of one of the required jobs.

On it! I need to understand how best to add new ZN tests using the "new" ZN crate (instead of the ZN binary and toml/zndsl files); I've asked here
Force-pushed from 06a4864 to 6acef9b.
…_in_old_session_issue_9977
* [CI/CD] Check semver job improvements (paritytech#10323): adds a couple of improvements to the check-semver job for the stable branches: 1. The `validate: false` option can now be set not only on `major` bumps but on `minor` and `patch` as well; this is useful for backports where the desired bump does not match the one the `parity-publish` semver check predicted. 2. The check can be skipped when it is really not needed but still fails (like on post-crates-release PRs). Closes paritytech/release-engineering#274.
* Cumulus: fix pre-connect to backers for lonely collators (paritytech#10305): when running a single collator (most commonly on testnets), the block builder task is always able to claim a slot, so the pre-connect mechanism, which happens for slots owned by other authors, was never triggered. Also fixes some tests.
* frame-system: only enable special benchmarking code when running in `no_std` (paritytech#10321): fixes `cargo test -p cumulus-pallet-parachain-system --features runtime-benchmarks`.
* fix: support `paginationStartKey` parameter for `archive_v1_storage` (paritytech#10329): fixes paritytech#10185 by adding support for the `paginationStartKey` parameter in the `archive_v1_storage` JSON-RPC API for the query types `descendantsValues` and `descendantsHashes`, per the latest specs.
* Rename `SlotSchedule` to `TargetBlockRate` (paritytech#10316): renames the `SlotSchedule` runtime API to `TargetBlockRate` and changes the signature to return only the target block rate; as discussed at the retreat, the block time is not needed as part of this runtime API.
* chore: update zombienet environment vars (paritytech#10293).
* Skip building blocks on relay parents in old session (paritytech#9990): this PR, already merged to master; fixes paritytech#9977.
* ci: ci-unified with resolc 0.5.0 (paritytech#10325).
* Introduce `ReplayProofSizeProvider`, `RecordingProofProvider` & transactional extensions (paritytech#9930): the `ProofSizeExt` extension serves the proof size to the runtime via the proof recorder. The `RecordingProofProvider` extension can record the calls to the proof-size function; later, `ReplayProofSizeProvider` can replay these recorded proof sizes, so the proof recorder is no longer required. Extensions are now also hooked into the transactional system: they are called when a new transaction is created and informed when a transaction is committed or reverted.
* fix P256Verify precompile address (paritytech#10336): fixes paritytech/contract-issues#220.
* parachain-consensus: do not pin blocks on the relay chain during syncing (paritytech#10333): there were past reports of `polkadot-parachain` consuming a lot of memory during syncing (rising to ~50 GB before stabilizing around ~20 GB). The large memory release towards the end was caused by finality notifications: `parachain-consensus` performs long-blocking finalization operations and keeps finality notifications around while doing so. A new task now fetches the included block and then immediately releases the relay chain's finality notifications; memory is now more bounded at around ~12 GB. Closes paritytech#1662.
* Allow DT CI to be manually triggered (paritytech#10337): allows the differential-testing job to be triggered manually instead of only by PRs.
* [Release|CI/CD] Use larger runners only for the polkadot-parachain and polkadot-omni-node builds (paritytech#10343): the large GitHub runners are now used only for the `polkadot-parachain` and `polkadot-omni-node` builds; the other binaries build fine on the standard runners, which saves costs and resources. Closes paritytech/release-engineering#279.
* Make tasks local only (paritytech#10162): related to paritytech#9693. In the transaction pool, transactions are identified by the tag they provide; for tasks, the provided tag is simply the hash of the encoded task. Nothing in the docs says implementers should be careful that a single operation does not give rise to too many valid tasks: e.g. if a task is `migrate_keys(limit)` with valid limits from 1 to 10_000, then `migrate_keys(1)`, `migrate_keys(2)`, ..., `migrate_keys(10_000)` are all valid and effectively do the same operation (they all migrate part of the keys), so a malicious person could spam the transaction pool with 10_000 transactions at once. Possible solutions: (1) be careful when implementing tasks and make the docs clear, but the API stays error-prone; (2) return a new value that is the provided tag for the task, or use the task index as the provided tag; (3) only accept local tasks, which is what that PR implements. (2) may be a better API if we want external submission of tasks.
* Version bumps and prdocs reordering from stable2509-2 (paritytech#10339): backports regular version bumps and prdoc reordering from the release branch back to master.
* Fix the `CodeNotFound` issue in PolkaVM tests (paritytech#10298): bumps the commit hash of the revive-differential-tests framework to a version that fixes the `CodeNotFound` issue seen with PolkaVM; the framework now uploads the code of all contracts prior to running the tests.
* Don't require PR for uploading comment for DT CI (paritytech#10347): the DT CI no longer requires a PR for uploading the report to the CI job.
* add flow to create an old tag

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>, cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>, Bastian Köcher <git@kchr.de>, Thang X. Vu <zthangxv@gmail.com>, DenzelPenzel <15388928+DenzelPenzel@users.noreply.github.com>, Javier Viola <363911+pepoviola@users.noreply.github.com>, Alexander Cyon <Sajjon@users.noreply.github.com>, Sebastian Kunert <skunert49@gmail.com>, Alexander Samusev <41779041+alvicsam@users.noreply.github.com>, PG Herveou <pgherveou@gmail.com>, Omar <OmarAbdulla7@hotmail.com>, Guillaume Thiolliere <gui.thiolliere@gmail.com>, BDevParity <bruno.devic@parity.io>
…ay parents in old session

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

Fixes: #9977

On our Kusama Canary chain YAP-3392 has the log entry:

```
Collation wasn't advertised because it was built on a relay chain block that is now part of an old session
```

showing up [400+ times (2025-10-03 to 2025-10-10)](https://grafana.teleport.parity.io/goto/spoPcDeHR?orgId=1).

# Changes

Changed `offset_relay_parent_find_descendants` to return `None` if the `relay_best_hash` or any of its ancestors contains an epoch change.