-
Notifications
You must be signed in to change notification settings - Fork 799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry timeout #2438
Retry timeout #2438
Conversation
The CI pipeline was cancelled due to failure one of the required jobs. |
I love that you asked this. 😂 I vote "yes", only question is, should we include the umlaut? |
// Either the worker or the job timed out. Kill the worker in either case. Treated as | ||
// definitely-invalid, because if we timed out, there's no time left for a retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be reworded as it's potentially confusing. It is only definitely-invalid in backing where we don't retry the possibly-invalid candidates. In backing there's no time for a retry (and it's stricter).
Hmm, actually I'm not sure if we should retry, at least with the current timeout which is One idea is to allow a retry but cut that timeout in half. The idea is that if execution is taking 6 seconds (which is already a long time) then something went wrong (i.e. maybe the worker/job communication broke down). And speaking of communication, the host/worker comms (note, not worker/job comm) have a really long timeout (like factor of 4 of the default timeout). We shouldn't retry if that fails because there'd just be no time left. But that long timeout needs to remain long, because it's wall clock time, whereas in the job we timeout based on CPU clock time. (That probably sounds really confusing, so see paritytech/polkadot#4217.) But getting back to cutting the approval timeout in half, I wonder if opens up some attacks, like candidates getting through backing but timing out on some validators with slower machines, leading to disputes. 6 seconds would already be 3x the backing timeout, but I can see it happening. cc @eskimor |
You have a change of mind now or did I misunderstood your comment? 12s is a really long time. How often does a timeout happen on a valid candidate in practice? If it's extremely rare, I suppose we shouldn't. In that case, I'm happy to close this as it no longer makes sense to switch the category. |
You understood correctly, I just forgot that we had bumped the approval timeout to 12s.
Not sure, I suppose my concerns are more theoretical in nature. Like if there is a validator that is 3x slower than the rest, it can potentially be exploited to create disputes. I think in practice it's more likely that an approval timeout happens due to some local conditions, like some deadlock, comm bug, etc... though still not very likely. Maybe some compromise is possible, like 8s + a retry? Not sure, @eskimor would have a better idea. |
Definitely no to shortening the timeout. We have measured fluctuations on the same machine up to a factor of 3. An additional factor of 2 does not seem to be too conservative, given that validators don't run identical machines. Whether a retry on timeout makes sense is an interesting question:
2 is not much of a reason, but 1 means it is free to trigger an escalation (and break sharding). So retry on timeout is not an option ... in fact if we had the time, I don't see why we would ever retry instead of just increasing the timeout to begin with. What would be the value of retrying, instead of increasing the timeout? If you restart - any work already done (even if little) would go to waste. While with a longer timeout you keep it and have a better success chance, given the same time constraints. |
It is.
We at least hope so. That's the whole point of that long timeout. We actually should increase the noshow timeout because of above reasoning, as execution time alone is not the full story. For handling dishonest backers, we need time disputes. |
Wow, a 3x fluctuation on the same machine? This must be wall clock time? The timeouts in question are CPU time. But still, I agree that we should have a large margin for differently-performing validators.
If execution is taking that long, then it could be due to some deadlock etc. Waiting a second and then retrying could avoid some local nondeterminism. Then again, I can't think of any specific internal scenarios that would cause this to occur, except for high load on the machine, in which case retrying could make things worse. Maybe, there is some way for the node to query whether the execution job is at least making progress as opposed to spinning in a loop. But I might be getting into the weeds here. |
I've forgotten our last discussion, but I think.. We designed time disputes aka overtime billing so that candidate validation timeouts could be increased to 30s or 1 minute, which reduces our risks from execution times differing. We've maybe deprioritized actually doing this until polkavm reaches some maturity. If polkavm becomes twice as fast then maybe we'll have deterministic metering via the jan's floating point overflow trick. It's possible jan's floating point overflow trick work for metering wasmtime too. We'll restart the overtime billing plan though if the floating point trick becomes dubious. |
Looks like it just happened. |
Closing. Thanks for the hustle @eagr and apologies for the spurious suggestion! Will also close the related issue as there's nothing left to do. |
I suppose it can be done in PVM by checking if gas is going up. Maybe overly paranoid. I wouldn't do it unless timeouts become a source of issues. |
* Asynchronous backing PR (#2300) * Update substrate & polkadot * min changes to make async backing compile * (async backing) parachain-system: track limitations for unincluded blocks (#2438) * unincluded segment draft * read para head from storage proof * read_para_head -> read_included_para_head * Provide pub interface * add errors * fix unincluded segment update * BlockTracker -> Ancestor * add a dmp limit * Read para head depending on the storage switch * doc comments * storage items docs * add a sanity check on block initialize * Check watermark * append to the segment on block finalize * Move segment update into set_validation_data * Resolve para head todo * option watermark * fix comment * Drop dmq check * fix weight * doc-comments on inherent invariant * Remove TODO * add todo * primitives tests * pallet tests * doc comments * refactor unincluded segment length into a ConsensusHook (#2501) * refactor unincluded segment length into a ConsensusHook * add docs * refactor bandwidth_out calculation Co-authored-by: Chris Sosnin <[email protected]> * test for limits from impl * fmt * make tests compile * update comment * uncomment test * fix collator test by adding parent to state proof * patch HRMP watermark rules for unincluded segment * get consensus-common tests to pass, using unincluded segment * fix unincluded segment tests * get all tests passing * fmt * rustdoc CI * aura-ext: limit the number of authored blocks per slot (#2551) * aura_ext consensus hook * reverse dependency * include weight into hook * fix tests * remove stray println Co-authored-by: Chris Sosnin <[email protected]> * fix test warning * fix doc link --------- Co-authored-by: Chris Sosnin <[email protected]> Co-authored-by: Chris Sosnin <[email protected]> * parachain-system: ignore go ahead signal once upgrade is processed (#2594) * handle goahead signal for unincluded segment * doc comment * add test * parachain-system: drop processed messages from inherent data (#2590) * implement `drop_processed_messages` * drop messages based on relay parent number * adjust tests * drop changes to mqc * fix comment * drop test * drop more dead code * clippy * aura-ext: check slot in consensus hook and remove all `CheckInherents` logic (#2658) * aura-ext: check slot in consensus hook * convert relay chain slot * Make relay chain slot duration generic * use fixed velocity hook for pallets with aura * purge timestamp inherent * fix warning * adjust runtime tests * fix slots in tests * Make `xcm-emulator` test pass for new consensus hook (#2722) * add pallets on_initialize * tests pass * add AuraExt on_init * ".git/.scripts/commands/fmt/fmt.sh" --------- Co-authored-by: command-bot <> --------- Co-authored-by: Ignacio Palacios <[email protected]> * update polkadot git refs * CollationGenerationConfig closure is now optional (#2772) * CollationGenerationConfig closure is now optional * fix test * propagate network-protocol-staging feature (#2899) * Feature Flagging Consensus Hook Type Parameter (#2911) * First pass * fmt * Added as default feature in tomls * Changed to direct dependency feature * Dealing with clippy error * Update pallets/parachain-system/src/lib.rs Co-authored-by: asynchronous rob <[email protected]> --------- Co-authored-by: asynchronous rob <[email protected]> * fmt * bump deps and remove warning * parachain-system: update RelevantMessagingState according to the unincluded segment (#2948) * mostly address 2471 with a bug introduced * adjust relevant messaging state after computing total * fmt * max -> min * fix test implementation of xcmp source * add test * fix test message sending logic * fix + test * add more to unincluded segment test * fmt --------- Co-authored-by: Chris Sosnin <[email protected]> * Integrate new Aura / Parachain Consensus Logic in Parachain-Template / Polkadot-Parachain (#2864) * add a comment * refactor client/service utilities * deprecate start_collator * update parachain-template * update test-service in the same way * update polkadot-parachain crate * fmt * wire up new SubmitCollation message * some runtime utilities for implementing unincluded segment runtime APIs * allow parachains to configure their level of sybil-resistance when starting the network * make aura-ext compile * update to specify sybil resistance levels * fmt * specify relay chain slot duration in milliseconds * update Aura to explicitly produce Send futures also, make relay_chain_slot_duration a Duration * add authoring duration to basic collator and document params * integrate new basic collator into parachain-template * remove assert_send used for testing * basic-aura: only author when parent included * update polkadot-parachain-bin * fmt * some fixes * fixes * add a RelayNumberMonotonicallyIncreases * add a utility function for initializing subsystems * some logging for timestamp adjustment * fmt * some fixes for lookahead collator * add a log * update `find_potential_parents` to account for sessions * bound the loop * restore & deprecate old start_collator and start_full_node functions. * remove unnecessary await calls * fix warning * clippy * more clippy * remove unneeded logic * ci * update comment Co-authored-by: Marcin S. <[email protected]> * (async backing) restore `CheckInherents` for backwards-compatibility (#2977) * bring back timestamp * Restore CheckInherents * revert to empty CheckInherents * make CheckInherents optional * attempt * properly end system blocks * add some more comments * ignore failing system parachain tests * update refs after main feature branch merge * comment out the offending tests because CI runs ignored tests * fix warnings * fmt * revert to polkadot master * cargo update -p polkadot-primitives -p sp-io --------- Co-authored-by: asynchronous rob <[email protected]> Co-authored-by: Ignacio Palacios <[email protected]> Co-authored-by: Bradley Olson <[email protected]> Co-authored-by: Marcin S. <[email protected]> Co-authored-by: eskimor <[email protected]> Co-authored-by: Andronik <[email protected]> (cherry picked from commit 6ef1117) * Companion: restructure macro related exports (#3015) * restructure macro related exports * restructure macro related exports * wip * wip * update cargo lock * refactor RuntimeDebug on unincluded segment * fmt * Companion: restructure `benchmarking` macro related exports (#3039) * wip * wip * restructure benchmarking macro related exports * add cargo lock --------- Co-authored-by: parity-processbot <> (cherry picked from commit 8349c8d) * Add missing workspace members (#3056) * Add dependencies Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Add missing workspace members Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Fix more Signed-off-by: Oliver Tale-Yazdi <[email protected]> --------- Signed-off-by: Oliver Tale-Yazdi <[email protected]> (cherry picked from commit 44499cf) * Add CI for monorepo (#1145) * Add CI for monorepo * fix frame tests * Format features Signed-off-by: Oliver Tale-Yazdi <[email protected]> * add note for skipping tests and disable test-linux-stable-all * Fix tests and compile issues (#1152) * Fix feature dependant import Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Bump test timeout Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Remove feature gate Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Add resolver 2 Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Remove old lockfile Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Format features Signed-off-by: Oliver Tale-Yazdi <[email protected]> --------- Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Fix check-dependency-rules Signed-off-by: Oliver Tale-Yazdi <[email protected]> * rm test-runtime Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Actually fix script Signed-off-by: Oliver Tale-Yazdi <[email protected]> * enable cargo-check-each-crate-macos * Run check-each-crate on 6 machines (#1163) --------- Signed-off-by: Oliver Tale-Yazdi <[email protected]> Co-authored-by: Oliver Tale-Yazdi <[email protected]> (cherry picked from commit e494934) * Fix features (#1194) * Manually fix conflicting ? Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Remove duplicates Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Autofix feature propagation zepter lint propagate-feature --feature try-runtime --left-side-feature-missing=ignore --workspace --fix --feature-enables-dep="try-runtime:frame-try-runtime" zepter lint propagate-feature --feature runtime-benchmarks --left-side-feature-missing=ignore --workspace --fix --feature-enables-dep="runtime-benchmarks:frame-benchmarking" zepter lint propagate-feature --feature std --left-side-feature-missing=ignore --workspace --fix zepter f f Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Bump zepter Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Add some duplicates Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Revert "Add some duplicates" This reverts commit c6ce627. * Remove default enabled features Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Bump Zepter Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Bump in correct location 🤦 Signed-off-by: Oliver Tale-Yazdi <[email protected]> * DNM: Add some mistakes Signed-off-by: Oliver Tale-Yazdi <[email protected]> * DNM: Add some mistakes Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Revert "DNM: Add some mistakes" This reverts commit d469b3f. * Revert "DNM: Add some mistakes" This reverts commit d892a73. --------- Signed-off-by: Oliver Tale-Yazdi <[email protected]> (cherry picked from commit 0400ed9) * Fix build profiles (#1229) * Fix build profiles Closes #1155 Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Manually set version to 1.0.0 Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Use workspace repo Signed-off-by: Oliver Tale-Yazdi <[email protected]> * 'Authors and Edition from workspace Signed-off-by: Oliver Tale-Yazdi <[email protected]> --------- Signed-off-by: Oliver Tale-Yazdi <[email protected]> (cherry picked from commit dcda0e5) * Set test crates to nopublish (#1240) * Set test crates to nopublish * Don't publish more crates * Set even more crates to nopublish --------- Co-authored-by: Oliver Tale-Yazdi <[email protected]> (cherry picked from commit 1c7ef1f) * Add missing licenses and tune the scanning workflow (#1288) * Add missing Cumulus licenses * Typo * Add missing Substrate licenses * Single job checking the sub-repos in steps * Remove dates * Remove dates * Add missing (C) * Update FRAME UI tests Signed-off-by: Oliver Tale-Yazdi <[email protected]> * Update more UI tests Signed-off-by: Oliver Tale-Yazdi <[email protected]> --------- Signed-off-by: Oliver Tale-Yazdi <[email protected]> Co-authored-by: Oliver Tale-Yazdi <[email protected]> (cherry picked from commit bfb241d) * Restructure `dispatch` macro related exports (#1162) * restructure dispatch macro related exports * moved Dispatchable to lib.rs * fix .gitignore final newline * ".git/.scripts/commands/fmt/fmt.sh" * fix rustdocs * wip --------- Co-authored-by: Liam Aharon <[email protected]> Co-authored-by: command-bot <> Co-authored-by: ordian <[email protected]> (cherry picked from commit bdbe982) * Fixes * Fix clippy --------- Co-authored-by: Chris Sosnin <[email protected]> Co-authored-by: Juan <[email protected]> Co-authored-by: Oliver Tale-Yazdi <[email protected]> Co-authored-by: Alexander Samusev <[email protected]> Co-authored-by: Lulu <[email protected]> Co-authored-by: Przemek Rzad <[email protected]>
closes #660
Is the retry delay being used (200ms) good enough for the retry for job timeout?
Btw, how serious are you with the
SchrodingersCandidate
? I got zero problem with the name. :)) @mrcnski @s0me0ne-unkn0wn