This repository was archived by the owner on Nov 15, 2023. It is now read-only.
pvf: fix the deadlock when any of the channels get at capacity#5297
Merged
paritytech-processbot[bot] merged 1 commit intomasterfrom Apr 9, 2022
Merged
pvf: fix the deadlock when any of the channels get at capacity#5297paritytech-processbot[bot] merged 1 commit intomasterfrom
paritytech-processbot[bot] merged 1 commit intomasterfrom
Conversation
The PVF host is designed to avoid spawning tasks to minimize knowledge of outer code. Using `async_std::task::spawn` (or Tokio's counterpart) deemed unacceptable, `SpawnNamed` undesirable. Instead there is only one task returned that is spawned by the candidate-validation subsystem. The tasks from the sub-components are polled by that root task. However, the way the tasks are bundled was incorrect. There was a giant select that was polling those tasks. Particularly, that implies that as soon as one of the arms of that select goes into await those sub-tasks stop getting polled. This is a recipe for a deadlock which indeed happened here. Specifically, the deadlock happened during sending messages to the execute queue by calling [`send_execute`](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L601). When the channel to the queue reaches the capacity, the control flow is suspended until the queue handles those messages. Since this code is essentially reached from [one of the select arms](https://github.com/paritytech/polkadot/blob/818da57386900e9b3eb934eaf868161ed3eb6d70/node/core/pvf/src/host.rs#L371), the queue won't be given the control and thus no further progress can be made. This problem is solved by bundling the tasks one level higher instead, by `selecting` over those long-running tasks. We also stop treating returning from those long-running tasks as error conditions, since that can happen during legit shutdown.
bkchr
approved these changes
Apr 9, 2022
rphmeier
approved these changes
Apr 9, 2022
Contributor
|
@vstakhov if you find it correct, please merge |
vstakhov
approved these changes
Apr 9, 2022
Contributor
|
bot merge |
drahnr
reviewed
Apr 10, 2022
Comment on lines
+225
to
+229
| _ = run_host.fuse() => {}, | ||
| _ = run_prepare_queue.fuse() => {}, | ||
| _ = run_prepare_pool.fuse() => {}, | ||
| _ = run_execute_queue.fuse() => {}, | ||
| _ = run_sweeper.fuse() => {}, |
Contributor
There was a problem hiding this comment.
nit: We should maintain logs here which one terminated.
Contributor
Author
There was a problem hiding this comment.
Which severity you reckon?
Contributor
There was a problem hiding this comment.
error! I'd say, this should really never happen
Contributor
Author
There was a problem hiding this comment.
The reason why I removed never!s is that those paths can theoretically be hit during normal shutdown.
ordian
added a commit
that referenced
this pull request
Apr 13, 2022
* master: (62 commits) Bump tracing from 0.1.32 to 0.1.33 (#5299) Add staging runtime api (#5048) CI: rename ambiguous jobs (#5313) Prepare for rust 1.60 (#5282) Bump proc-macro2 from 1.0.36 to 1.0.37 (#5265) Fixes the dead lock when any of the channels get at capacity. (#5297) Bump syn from 1.0.90 to 1.0.91 (#5273) create .github/workflows/pr-custom-review.yml (#5280) [ci] fix pallet benchmark script (#5247) (#5292) bump spec_version to 9190 (#5291) bump version to 0.9.19 (#5290) Session is actually ancient not unknown. (#5286) [ci] point benchmark script to sub-commands (#5247) (#5289) Remove Travis CI (#5287) Improve error handling in approval voting block import (#5283) fix .github/pr-custom-review.yml (#5279) Co #11164: Sub-commands for `benchmark` (#5247) Bump clap from 3.1.6 to 3.1.8 (#5240) Custom Slots Migration for Fund Index Change (#5227) create pr-custom-review.yml (#5253) ...
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The PVF host is designed to avoid spawning tasks to minimize knowledge
of outer code. Using
async_std::task::spawn(or Tokio's counterpart)deemed unacceptable,
SpawnNamedundesirable. Instead there is only onetask returned that is spawned by the candidate-validation subsystem.
The tasks from the sub-components are polled by that root task.
However, the way the tasks are bundled was incorrect. There was a giant
select that was polling those tasks. Particularly, that implies that as soon as
one of the arms of that select goes into await those sub-tasks stop
getting polled. This is a recipe for a deadlock which indeed happened
here.
Specifically, the deadlock happened during sending messages to the
execute queue by calling
send_execute.When the channel to the queue reaches the capacity, the control flow is
suspended until the queue handles those messages. Since this code is
essentially reached from one of the select
arms,
the queue won't be given the control and thus no further progress can be
made.
This problem is solved by bundling the tasks one level higher instead,
by
selectingover those long-running tasks.We also stop treating returning from those long-running tasks as error
conditions, since that can happen during legit shutdown.