This repository has been archived by the owner on Jan 8, 2024. It is now read-only.
Fix bug in concurrent step groups with remote runner #3115
Merged
What problem does this fix?
Currently, if you try to send two open step groups from a remote runner, the client blocks forever.
To reproduce:
waypoint project apply -data-source="git" -git-url="https://github.com/hashicorp/waypoint-examples" -git-path="docker/static" -git-ref=main nginx-project
waypoint build -project=nginx-project -local=false
It hangs on `Running build v1` forever. It doesn't even respect Ctrl-C! (a separate bug)

The problem has surfaced since #3081, when we started emitting concurrent step groups for most ops.
Root cause
The problem arises when the second step group arrives as an event from the server to the client. The client sees that it already has an existing step group and calls sg.Wait(), which never returns.
This solution
In practice, if we remove that call to `sg.Wait()`, everything seems to just work. It tracks the concurrent step groups correctly. With this change in place:

Note two concurrent open step groups:
![Screen Shot 2022-03-18 at 5 30 44 PM](https://user-images.githubusercontent.com/8404559/159086918-3094df8b-1a6f-43d2-9bd9-b41378b32ba1.png)
Note the UI has closed the two step groups from above, and opened the next one:
![Screen Shot 2022-03-18 at 5 31 53 PM](https://user-images.githubusercontent.com/8404559/159086935-1ce7612a-abd1-4c92-bb75-36197ef982fd.png)
Why does this change work?
Only steps have actual output displayed, not step groups, so we don't need to directly track the step groups to get correct output.
The step group's `Wait()` function is intended to tell us when all of a step group's steps have completed and it's safe to stop rendering. In practice, we don't finish rendering until the job closes, so we don't need to invoke `.Wait()`.

In theory, this issue would also be solved by #1480. We should revisit this logic once we've finished auditing and retooling how we're tracking steps and step groups.