Add "WatchTask" operation to watch ODR launch tasks and expose more debug info #3306
Conversation
This looks pretty good to me! I am requesting changes because the Trigger HTTP handler expects to receive a job triple with the source job in the middle, and this PR breaks that assumption. It should be an easy fix to filter those jobs properly in the trigger HTTP handler so it keeps working on main.
We should also remember to add WatchTask to pb.Task and the state machine driver at some point, perhaps in a future PR as we finish out WatchTask for the remaining builtin plugins.
// ▼ │
// ┌────────────────┐ │
// │ Stop Task │◀────────────┘
// └────────────────┘
Love having these directly in the code docs! ✨ 📈
Fixed triggers with @briancain. They no longer require hardcoded offsets and are resilient to any future changes we'd make to how we wrap jobs. 😄
Thank you! 👍🏻
@@ -70,25 +70,27 @@ func (s *Service) queueJobMulti(
 	ctx context.Context,
 	req []*pb.QueueJobRequest,
 ) ([]*pb.QueueJobResponse, error) {
-	jobs := make([]*pb.Job, 0, len(req))
+	jobQueue := make([]*pb.Job, 0, len(req)*4)
I presume the *4 is just a guess so there isn't a slice resize operation?
Yep.
resp := make([]*pb.QueueJobResponse, len(jobIds))
for i, id := range jobIds {
	resp[i] = &pb.QueueJobResponse{JobId: id}
Great simplification of the logic to hide the ODR jobs.
One note: this will create a lot more pressure on the job log system, because every "normal" job in an ODR context (i.e. the common case) will generate double the output, as WatchTask will now contribute to job log storage as well.
That's not a big issue, but we should revisit what our limits on the logs are.
Yeah, the log requirement for jobs will roughly double (since they're mostly duplicated). There's a lot we can do to mitigate this if it ends up being a problem.
👍
Depends on: hashicorp/waypoint-plugin-sdk#71
This adds a new operation type, WatchTask, that runs on the static runner (the same place as StartTask and StopTask). The WatchTask operation is automatically created as part of the ODR wrapping process, resulting in a job dependency tree for any job.
Why?
The primary motivator is pipelines, but there are more general improvements as well.
For pipelines, this is going to let us execute non-Waypoint-runner tasks and (1) know when each task exits and (2) stream its output back to the Waypoint server and track it with the job system. Currently, all tasks are run via a Waypoint runner, so we lean on it doing the right things via runner APIs to let us know. This gives us a second source of information.
More generally, this exposes previously difficult-to-reach debug information. If the Waypoint runner failed to launch for any reason, it was very unclear what went wrong. With this, the watch task should still get the platform logs, so we should be able to see errors from before or after the runner runs! This should dramatically limit the number of times Waypoint users/operators/maintainers need to reach into the platform directly to answer "why isn't my task running?"
Semantics
A Building Block
This PR doesn't expose the watch task information in any easily consumable way; it is more about establishing the building block.
From here, we could discuss how we expose this information more readily in UIs, CLIs, etc.
TODO (doesn't block this PR)
This PR only implements task watching for Docker. We need to implement task watching for the other platforms (especially Kubernetes). But as noted above, leaving them unimplemented doesn't change current behavior -- watch tasks are currently allowed to fail.