Add "WatchTask" operation to watch ODR launch tasks and expose more debug info #3306
Conversation
This looks pretty good to me! I am requesting changes because the Trigger HTTP handler expects to receive a job triple with the source job in the middle, and this PR breaks that assumption. It should be an easy fix to filter those jobs properly in the trigger HTTP handler so it keeps working on main.
We should also remember to add WatchTask to pb.Task and the state machine driver at some point, perhaps in a future PR as we finish out WatchTask for the remaining builtin plugins.
// ▼ │
// ┌────────────────┐ │
// │ Stop Task │◀────────────┘
// └────────────────┘
Love having these directly in the code docs! ✨ 📈
Fixed triggers with @briancain. They no longer require hardcoded offsets and are resilient to any future changes we'd make to how we wrap jobs. 😄
Thank you! 👍🏻
@@ -70,25 +70,27 @@ func (s *Service) queueJobMulti(
 	ctx context.Context,
 	req []*pb.QueueJobRequest,
 ) ([]*pb.QueueJobResponse, error) {
-	jobs := make([]*pb.Job, 0, len(req))
+	jobQueue := make([]*pb.Job, 0, len(req)*4)
I presume the *4 is just a guess so there isn't a slice resize operation?
Yep.
resp := make([]*pb.QueueJobResponse, len(jobIds))
for i, id := range jobIds {
	resp[i] = &pb.QueueJobResponse{JobId: id}
Great simplification of the logic to hide the ODR jobs.
One note: this will create a lot more pressure on the job log system, because every "normal" job in an ODR context (i.e. the common case) will generate double the output, as WatchTask will now contribute to job log storage as well.
That's not a big issue, but we should revisit what our limits on the logs are.
Yeah, the log requirement for jobs will roughly double (since they're mostly duplicated). There's a lot we can do to mitigate this if it ends up being a problem.
👍
Depends on: hashicorp/waypoint-plugin-sdk#71
This adds a new operation type, WatchTask, that runs on the static runner (the same place as StartTask and StopTask). The WatchTask operation is automatically created as part of the ODR wrapping process, resulting in a job dependency tree for any job.
Why?
The primary motivator is pipelines, but there are more general improvements as well.
For pipelines, this is going to let us execute non-Waypoint-runner tasks and (1) know when each task exits and (2) stream its output back to the Waypoint server and track it with the job system. Currently, all tasks are run via a Waypoint runner, so we lean on it doing the right things via runner APIs to let us know. This gives us a second source of information.
More generally, this exposes previously difficult-to-reach debug information. If the Waypoint runner failed to launch for any reason, it was very unclear what went wrong. With this, the watch task should still get the platform logs, so we should be able to see errors from before or after the runner runs! This should dramatically limit the number of times Waypoint users/operators/maintainers need to reach into the platform directly to answer "why isn't my task running?"
Semantics
A Building Block
This PR doesn't expose the watch task information in any easily consumable way; it is more about establishing the building block.
From here, we could discuss how we expose this information more readily in UIs, CLIs, etc.
TODO (doesn't block this PR)
This PR only implements task watching for Docker. We need to implement task watching for the other platforms (especially Kubernetes). But as noted above, leaving them unimplemented doesn't change current behavior -- watch tasks are currently allowed to fail.