Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry timeout #2438

Closed
wants to merge 3 commits into from
Closed

Retry timeout #2438

wants to merge 3 commits into from

Conversation

eagr
Copy link
Contributor

@eagr eagr commented Nov 22, 2023

closes #660

Is the retry delay being used (200ms) good enough for the retry for job timeout?

Btw, how serious are you with the SchrodingersCandidate? I got zero problem with the name. :)) @mrcnski @s0me0ne-unkn0wn

@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 2/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4429721

@mrcnski
Copy link
Contributor

mrcnski commented Nov 22, 2023

Btw, how serious are you with the SchrodingersCandidate? I got zero problem with the name. :)) @mrcnski @s0me0ne-unkn0wn

I love that you asked this. 😂 I vote "yes", only question is, should we include the umlaut?

Comment on lines +360 to +361
// Either the worker or the job timed out. Kill the worker in either case. Treated as
// definitely-invalid, because if we timed out, there's no time left for a retry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be reworded as it's potentially confusing. It is only definitely-invalid in backing where we don't retry the possibly-invalid candidates. In backing there's no time for a retry (and it's stricter).

@mrcnski
Copy link
Contributor

mrcnski commented Nov 22, 2023

Hmm, actually I'm not sure if we should retry, at least with the current timeout which is DEFAULT_APPROVAL_EXECUTION_TIMEOUT = 12. That's already a long time given for execution.

One idea is to allow a retry but cut that timeout in half. The idea is that if execution is taking 6 seconds (which is already a long time) then something went wrong (i.e. maybe the worker/job communication broke down).

And speaking of communication, the host/worker comms (note, not worker/job comm) have a really long timeout (like factor of 4 of the default timeout). We shouldn't retry if that fails because there'd just be no time left. But that long timeout needs to remain long, because it's wall clock time, whereas in the job we timeout based on CPU clock time. (That probably sounds really confusing, so see paritytech/polkadot#4217.)

But getting back to cutting the approval timeout in half, I wonder if opens up some attacks, like candidates getting through backing but timing out on some validators with slower machines, leading to disputes. 6 seconds would already be 3x the backing timeout, but I can see it happening. cc @eskimor

@eagr
Copy link
Contributor Author

eagr commented Nov 22, 2023

You have a change of mind now or did I misunderstood your comment?

12s is a really long time. How often does a timeout happen on a valid candidate in practice? If it's extremely rare, I suppose we shouldn't. In that case, I'm happy to close this as it no longer makes sense to switch the category.

@mrcnski
Copy link
Contributor

mrcnski commented Nov 22, 2023

You understood correctly, I just forgot that we had bumped the approval timeout to 12s.

How often does a timeout happen on a valid candidate in practice?

Not sure, I suppose my concerns are more theoretical in nature. Like if there is a validator that is 3x slower than the rest, it can potentially be exploited to create disputes. I think in practice it's more likely that an approval timeout happens due to some local conditions, like some deadlock, comm bug, etc... though still not very likely.

Maybe some compromise is possible, like 8s + a retry? Not sure, @eskimor would have a better idea.

@eskimor
Copy link
Member

eskimor commented Nov 22, 2023

Definitely no to shortening the timeout. We have measured fluctuations on the same machine up to a factor of 3. An additional factor of 2 does not seem to be too conservative, given that validators don't run identical machines.

Whether a retry on timeout makes sense is an interesting question:

  1. We would be too slow to vote and thus approval voting would escalate because of our no-show. This is an issue, as without the retry we would have raised a dispute - which is a full escalation, but in that case at least someone will pay the bill.
  2. Retrying increases the chances of us voting valid on a candidate. If a super majority finds this borderline candidate to be invalid, we just increased our chances of getting slashed significantly. Although if everybody does this, then the chances of a super majority finding the candidate invalid also gets smaller. On top of this with Slash approval voters on approving invalid blocks - dynamically #635 in place, collectively it does not matter much if 1 or 10 validators voted valid on an invalid candidate.

2 is not much of a reason, but 1 means it is free to trigger an escalation (and break sharding). So retry on timeout is not an option ... in fact if we had the time, I don't see why we would ever retry instead of just increasing the timeout to begin with. What would be the value of retrying, instead of increasing the timeout? If you restart - any work already done (even if little) would go to waste. While with a longer timeout you keep it and have a better success chance, given the same time constraints.

@eskimor
Copy link
Member

eskimor commented Nov 22, 2023

12s is a really long time. How often does a timeout happen on a valid candidate in practice?

It is.

If it's extremely rare

We at least hope so. That's the whole point of that long timeout. We actually should increase the noshow timeout because of above reasoning, as execution time alone is not the full story.

For handling dishonest backers, we need time disputes.

@mrcnski
Copy link
Contributor

mrcnski commented Nov 22, 2023

We have measured fluctuations on the same machine up to a factor of 3.

Wow, a 3x fluctuation on the same machine? This must be wall clock time? The timeouts in question are CPU time. But still, I agree that we should have a large margin for differently-performing validators.

in fact if we had the time, I don't see why we would ever retry instead of just increasing the timeout to begin with. What would be the value of retrying, instead of increasing the timeout?

If execution is taking that long, then it could be due to some deadlock etc. Waiting a second and then retrying could avoid some local nondeterminism. Then again, I can't think of any specific internal scenarios that would cause this to occur, except for high load on the machine, in which case retrying could make things worse.

Maybe, there is some way for the node to query whether the execution job is at least making progress as opposed to spinning in a loop. But I might be getting into the weeds here.

@eagr
Copy link
Contributor Author

eagr commented Nov 22, 2023

Thanks for the explanation and pointers! @eskimor Those are so informative. Curious, what difference does it make if we no-show for longer time, if any? Say we double the timeout.

You could just close this if there's no reason to proceed. :) @mrcnski

@burdges
Copy link

burdges commented Nov 22, 2023

I've forgotten our last discussion, but I think..

We designed time disputes aka overtime billing so that candidate validation timeouts could be increased to 30s or 1 minute, which reduces our risks from execution times differing. We've maybe deprioritized actually doing this until polkavm reaches some maturity. If polkavm becomes twice as fast then maybe we'll have deterministic metering via the jan's floating point overflow trick. It's possible jan's floating point overflow trick work for metering wasmtime too. We'll restart the overtime billing plan though if the floating point trick becomes dubious.

@mrcnski
Copy link
Contributor

mrcnski commented Nov 22, 2023

If polkavm becomes twice as fast

Looks like it just happened.

@mrcnski
Copy link
Contributor

mrcnski commented Nov 22, 2023

Closing. Thanks for the hustle @eagr and apologies for the spurious suggestion! Will also close the related issue as there's nothing left to do.

@mrcnski
Copy link
Contributor

mrcnski commented Nov 23, 2023

Maybe, there is some way for the node to query whether the execution job is at least making progress as opposed to spinning in a loop.

I suppose it can be done in PVM by checking if gas is going up. Maybe overly paranoid. I wouldn't do it unless timeouts become a source of issues.

bkchr pushed a commit that referenced this pull request Apr 10, 2024
* Asynchronous backing PR (#2300)

* Update substrate & polkadot

* min changes to make async backing compile

* (async backing) parachain-system: track limitations for unincluded blocks (#2438)

* unincluded segment draft

* read para head from storage proof

* read_para_head -> read_included_para_head

* Provide pub interface

* add errors

* fix unincluded segment update

* BlockTracker -> Ancestor

* add a dmp limit

* Read para head depending on the storage switch

* doc comments

* storage items docs

* add a sanity check on block initialize

* Check watermark

* append to the segment on block finalize

* Move segment update into set_validation_data

* Resolve para head todo

* option watermark

* fix comment

* Drop dmq check

* fix weight

* doc-comments on inherent invariant

* Remove TODO

* add todo

* primitives tests

* pallet tests

* doc comments

* refactor unincluded segment length into a ConsensusHook (#2501)

* refactor unincluded segment length into a ConsensusHook

* add docs

* refactor bandwidth_out calculation

Co-authored-by: Chris Sosnin <[email protected]>

* test for limits from impl

* fmt

* make tests compile

* update comment

* uncomment test

* fix collator test by adding parent to state proof

* patch HRMP watermark rules for unincluded segment

* get consensus-common tests to pass, using unincluded segment

* fix unincluded segment tests

* get all tests passing

* fmt

* rustdoc CI

* aura-ext: limit the number of authored blocks per slot (#2551)

* aura_ext consensus hook

* reverse dependency

* include weight into hook

* fix tests

* remove stray println

Co-authored-by: Chris Sosnin <[email protected]>

* fix test warning

* fix doc link

---------

Co-authored-by: Chris Sosnin <[email protected]>
Co-authored-by: Chris Sosnin <[email protected]>

* parachain-system: ignore go ahead signal once upgrade is processed (#2594)

* handle goahead signal for unincluded segment

* doc comment

* add test

* parachain-system: drop processed messages from inherent data (#2590)

* implement `drop_processed_messages`

* drop messages based on relay parent number

* adjust tests

* drop changes to mqc

* fix comment

* drop test

* drop more dead code

* clippy

* aura-ext: check slot in consensus hook and remove all `CheckInherents` logic (#2658)

* aura-ext: check slot in consensus hook

* convert relay chain slot

* Make relay chain slot duration generic

* use fixed velocity hook for pallets with aura

* purge timestamp inherent

* fix warning

* adjust runtime tests

* fix slots in tests

* Make `xcm-emulator` test pass for new consensus hook (#2722)

* add pallets on_initialize

* tests pass

* add AuraExt on_init

* ".git/.scripts/commands/fmt/fmt.sh"

---------

Co-authored-by: command-bot <>

---------

Co-authored-by: Ignacio Palacios <[email protected]>

* update polkadot git refs

* CollationGenerationConfig closure is now optional (#2772)

* CollationGenerationConfig closure is now optional

* fix test

* propagate network-protocol-staging feature (#2899)

* Feature Flagging Consensus Hook Type Parameter (#2911)

* First pass

* fmt

* Added as default feature in tomls

* Changed to direct dependency feature

* Dealing with clippy error

* Update pallets/parachain-system/src/lib.rs

Co-authored-by: asynchronous rob <[email protected]>

---------

Co-authored-by: asynchronous rob <[email protected]>

* fmt

* bump deps and remove warning

* parachain-system: update RelevantMessagingState according to the unincluded segment (#2948)

* mostly address 2471 with a bug introduced

* adjust relevant messaging state after computing total

* fmt

* max -> min

* fix test implementation of xcmp source

* add test

* fix test message sending logic

* fix + test

* add more to unincluded segment test

* fmt

---------

Co-authored-by: Chris Sosnin <[email protected]>

* Integrate new Aura / Parachain Consensus Logic in Parachain-Template / Polkadot-Parachain (#2864)

* add a comment

* refactor client/service utilities

* deprecate start_collator

* update parachain-template

* update test-service in the same way

* update polkadot-parachain crate

* fmt

* wire up new SubmitCollation message

* some runtime utilities for implementing unincluded segment runtime APIs

* allow parachains to configure their level of sybil-resistance when starting the network

* make aura-ext compile

* update to specify sybil resistance levels

* fmt

* specify relay chain slot duration in milliseconds

* update Aura to explicitly produce Send futures

also, make relay_chain_slot_duration a Duration

* add authoring duration to basic collator and document params

* integrate new basic collator into parachain-template

* remove assert_send used for testing

* basic-aura: only author when parent included

* update polkadot-parachain-bin

* fmt

* some fixes

* fixes

* add a RelayNumberMonotonicallyIncreases

* add a utility function for initializing subsystems

* some logging for timestamp adjustment

* fmt

* some fixes for lookahead collator

* add a log

* update `find_potential_parents` to account for sessions

* bound the loop

* restore & deprecate old start_collator and start_full_node functions.

* remove unnecessary await calls

* fix warning

* clippy

* more clippy

* remove unneeded logic

* ci

* update comment

Co-authored-by: Marcin S. <[email protected]>

* (async backing) restore `CheckInherents` for backwards-compatibility (#2977)

* bring back timestamp

* Restore CheckInherents

* revert to empty CheckInherents

* make CheckInherents optional

* attempt

* properly end system blocks

* add some more comments

* ignore failing system parachain tests

* update refs after main feature branch merge

* comment out the offending tests because CI runs ignored tests

* fix warnings

* fmt

* revert to polkadot master

* cargo update -p polkadot-primitives -p sp-io

---------

Co-authored-by: asynchronous rob <[email protected]>
Co-authored-by: Ignacio Palacios <[email protected]>
Co-authored-by: Bradley Olson <[email protected]>
Co-authored-by: Marcin S. <[email protected]>
Co-authored-by: eskimor <[email protected]>
Co-authored-by: Andronik <[email protected]>
(cherry picked from commit 6ef1117)

* Companion: restructure macro related exports (#3015)

* restructure macro related exports

* restructure macro related exports

* wip

* wip

* update cargo lock

* refactor RuntimeDebug on unincluded segment

* fmt

* Companion: restructure `benchmarking` macro related exports (#3039)

* wip

* wip

* restructure benchmarking macro related exports

* add cargo lock

---------

Co-authored-by: parity-processbot <>
(cherry picked from commit 8349c8d)

* Add missing workspace members (#3056)

* Add dependencies

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Add missing workspace members

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Fix more

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
(cherry picked from commit 44499cf)

* Add CI for monorepo (#1145)

* Add CI for monorepo

* fix frame tests

* Format features

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* add note for skipping tests and disable test-linux-stable-all

* Fix tests and compile issues (#1152)

* Fix feature dependant import

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Bump test timeout

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Remove feature gate

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Add resolver 2

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Remove old lockfile

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Format features

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Fix check-dependency-rules

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* rm test-runtime

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Actually fix script

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* enable cargo-check-each-crate-macos

* Run check-each-crate on 6 machines (#1163)

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: Oliver Tale-Yazdi <[email protected]>
(cherry picked from commit e494934)

* Fix features (#1194)

* Manually fix conflicting ?

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Remove duplicates

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Autofix feature propagation

zepter lint propagate-feature --feature try-runtime --left-side-feature-missing=ignore --workspace --fix --feature-enables-dep="try-runtime:frame-try-runtime"
zepter lint propagate-feature --feature runtime-benchmarks --left-side-feature-missing=ignore --workspace --fix --feature-enables-dep="runtime-benchmarks:frame-benchmarking"
zepter lint propagate-feature --feature std --left-side-feature-missing=ignore --workspace --fix
zepter f f

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Bump zepter

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Add some duplicates

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Revert "Add some duplicates"

This reverts commit c6ce627.

* Remove default enabled features

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Bump Zepter

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Bump in correct location 🤦

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* DNM: Add some mistakes

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* DNM: Add some mistakes

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Revert "DNM: Add some mistakes"

This reverts commit d469b3f.

* Revert "DNM: Add some mistakes"

This reverts commit d892a73.

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
(cherry picked from commit 0400ed9)

* Fix build profiles (#1229)

* Fix build profiles

Closes #1155

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Manually set version to 1.0.0

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Use workspace repo

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* 'Authors and Edition from workspace

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
(cherry picked from commit dcda0e5)

* Set test crates to nopublish (#1240)

* Set test crates to nopublish

* Don't publish more crates

* Set even more crates to nopublish

---------

Co-authored-by: Oliver Tale-Yazdi <[email protected]>
(cherry picked from commit 1c7ef1f)

* Add missing licenses and tune the scanning workflow (#1288)

* Add missing Cumulus licenses

* Typo

* Add missing Substrate licenses

* Single job checking the sub-repos in steps

* Remove dates

* Remove dates

* Add missing (C)

* Update FRAME UI tests

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

* Update more UI tests

Signed-off-by: Oliver Tale-Yazdi <[email protected]>

---------

Signed-off-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: Oliver Tale-Yazdi <[email protected]>
(cherry picked from commit bfb241d)

* Restructure `dispatch` macro related exports (#1162)

* restructure dispatch macro related exports

* moved Dispatchable to lib.rs

* fix .gitignore final newline

* ".git/.scripts/commands/fmt/fmt.sh"

* fix rustdocs

* wip

---------

Co-authored-by: Liam Aharon <[email protected]>
Co-authored-by: command-bot <>
Co-authored-by: ordian <[email protected]>
(cherry picked from commit bdbe982)

* Fixes

* Fix clippy

---------

Co-authored-by: Chris Sosnin <[email protected]>
Co-authored-by: Juan <[email protected]>
Co-authored-by: Oliver Tale-Yazdi <[email protected]>
Co-authored-by: Alexander Samusev <[email protected]>
Co-authored-by: Lulu <[email protected]>
Co-authored-by: Przemek Rzad <[email protected]>
@eagr eagr deleted the retry-timeout branch September 19, 2024 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PVF: Add a new class of "possibly invalid" errors
5 participants