Skip to content
This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Fix bug in concurrent step groups with remote runner #3115

Merged
merged 1 commit into from
Mar 21, 2022

Conversation

izaaklauer
Copy link
Contributor

@izaaklauer izaaklauer commented Mar 18, 2022

What problem does this fix?

Currently, if you try to send two open step groups from a remote runner, the client blocks forever.

To reproduce:

  • Run a waypoint server and runner locally
  • Configure a project for remote builds (e.x. waypoint project apply -data-source="git" -git-url="https://github.com/hashicorp/waypoint-examples" -git-path="docker/static" -git-ref=main nginx-project)
  • Run waypoint build -project=nginx-project -local=false
  • Observe the following output:
$ waypoint build -local=false

» Building web...
There are local changes that do not match the remote repository. By default,
Waypoint will perform this operation using a remote runner that will use the
remote repository’s git ref and not these local changes. For these changes
to be used for future operations, either commit and push, or run the operation
locally with the -local flag.
  Performing this operation on a remote runner with id "01FYFF8QE03Y02PM8R2GHY9TQV"

» Cloning data from Git
  URL: https://github.com/hashicorp/waypoint-examples
  Ref: main
⠹ Running build v1

It hangs on Running build v1 forever. It doesn't even respect ctrl-c! (a separate bug)

The problem surfaced since #3081, when we started emitting concurrent step groups for most ops.

Root cause

The problem comes in when the second step group comes in as an event from the server to the client. The client sees it has an existing step group, and calls sg.Wait(), which never returns.

This solution

In practice, if we remove that call to sg.Wait(), everything seems to just work. It tracks the concurrent step groups correctly. With this change in place:

Note two concurrent open step groups:
Screen Shot 2022-03-18 at 5 30 44 PM

Note the UI has closed the two step groups from above, and opened then next one:
Screen Shot 2022-03-18 at 5 31 53 PM

Why does this change work?

Only steps have actual output displayed, not step groups, so we don't need to directly track the step groups to get correct output.

The stepgroup's Wait() function is intended to tell us when a step group has had all of its steps complete, and it's safe to stop rendering. In practice for us, we don't finish rendering until the job closes, so we don't need to invoke .Wait().

In theory, this issue would also be solved by #1480. We should revisit this logic once we've finished auditing and retooling how we're tracking steps and step groups.

@github-actions github-actions bot added the core label Mar 18, 2022
@izaaklauer izaaklauer added the pr/no-changelog No automatic changelog entry required for this pull request label Mar 18, 2022
@izaaklauer izaaklauer requested a review from a team March 18, 2022 22:43
Copy link
Member

@briancain briancain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried this locally, and fixes the issue for me. On main I don't see any output at all actually past the first stepgroup, so if there are other bugs with removing Wait at least you can see the output with this removed 👍🏻 @evanphx mentioned the client is already "waiting" for the job to finish anyway so this check is probably ok to remove for now 🤔

edit: also I think this one is worth a changelog!

@izaaklauer
Copy link
Contributor Author

I had skipped the changelog because the bug hasn't been released yet!

@izaaklauer izaaklauer merged commit 5476991 into main Mar 21, 2022
@izaaklauer izaaklauer deleted the fix-concurrent-step-groups branch March 21, 2022 15:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
core pr/no-changelog No automatic changelog entry required for this pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants