chore: k8s pod & node informer actor refactor [DET-9597] #7182

carolinaecalderon · 2023-06-21T14:56:53Z

Description

As part of the actor refactor project, remove all references to the actor system from the pod & node informers for Kubernetes. To test these changes, informer_intg_test.go is added, in addition to generated mock files.

Test Plan

CircleCI workflows: OSS, EE

Set-up: Follow instructions here to run Determined backed by Kubernetes, but outside of Kubernetes. Start your Kubernetes cluster & start the Determined devcluster.

Case 1a: Suppose an experiment starts/completes successfully, so the pod informer reports “Pending”/“Running” until the resources are deleted. The devcluster logs should match the content below:

# To start:
DEBU[...] pod informer is starting                      Informer=default
time="..." level=info msg="created pod <exp-pod-name>" ...
DEBU[...] informer got new pod event for pod: <exp-pod-name> Pending Informer=default
DEBU[...] informer got new pod event for pod: <exp-pod-name> Running Informer=default

# Upon canceling from Determined WebUI:
DEBU[2023-07-10T14:33:42-04:00] informer got new pod event for pod: <exp-pod-name> Failed Informer=default

# Upon successful completion:
# If the informer receives any updates AFTER the experiment is completed, a warning is logged.
DEBU[...] informer got new pod event for pod: <exp-pod-name> Running  Informer=default
INFO[...] transitioning pod state from RUNNING to TERMINATED  ...
INFO[...] pod exited successfully                       ...
INFO[...] requesting to delete kubernetes resources     ...
DEBU[...] resources state changed: ... ResourcesStopped:resources exited successfully  ...
time="..." level=info msg="deleted pod <exp-pod-name>" ...
INFO[...] de-registering pod handler                    ... pod=<exp-pod-name>
DEBU[...] informer got new pod event for pod: <exp-pod-name> Running  Informer=default
WARN[...] received pod status update for un-registered pod  ... pod-name=<exp-pod-name>
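The last three log lines above (de-registering the handler, one more informer event, then a warning) can be illustrated with a minimal sketch. It assumes a simple map of per-pod handlers; `podRegistry`, `register`, `deregister`, and `dispatch` are hypothetical names for illustration and are not taken from the Determined code base.

```go
package main

import "fmt"

// podRegistry is a hypothetical stand-in for a map of per-pod handlers.
type podRegistry struct {
	handlers map[string]func(phase string)
}

func newPodRegistry() *podRegistry {
	return &podRegistry{handlers: make(map[string]func(string))}
}

// register associates a handler with a pod name.
func (r *podRegistry) register(podName string, h func(phase string)) {
	r.handlers[podName] = h
}

// deregister removes the handler, e.g. after the pod's resources are deleted.
func (r *podRegistry) deregister(podName string) {
	delete(r.handlers, podName)
}

// dispatch forwards a status update to the pod's handler. It logs a warning
// and reports false when no handler is registered for the pod.
func (r *podRegistry) dispatch(podName, phase string) bool {
	h, ok := r.handlers[podName]
	if !ok {
		fmt.Printf("WARN received pod status update for un-registered pod pod-name=%s\n", podName)
		return false
	}
	h(phase)
	return true
}

func main() {
	reg := newPodRegistry()
	reg.register("exp-pod", func(phase string) {
		fmt.Println("handler saw phase:", phase)
	})
	reg.dispatch("exp-pod", "Running") // delivered to the handler
	reg.deregister("exp-pod")
	reg.dispatch("exp-pod", "Running") // late event: only a warning is logged
}
```

In this sketch a late event is dropped with a warning rather than treated as an error, which matches the behavior the logs describe.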

Case 2a: Suppose there is a long-running experiment whose pod is killed manually (on the Kubernetes side), so the pod informer reports a failure. The devcluster logs should match the content below. This tells us that Determined “heard” about the pod failure & the informer cannot accept more events.

INFO[...] requesting to delete kubernetes resources ...
DEBU[...] resources state changed: ... ResourcesStopped:resources failed with non-zero exit code: ... 
INFO[...] resources are released for <pod-name> ...
DEBU[...] informer got new pod event for pod: <exp-pod-name> Failed Informer=default
INFO[...] experiment shut down successfully ...

Case 1b: Suppose an experiment starts/completes successfully, so the node informers report “Pending”/“Running” until the resources are deleted. TODO
Upon start:

DEBU[2023-07-13T13:44:22-04:00] informer added node: <node-name>                 component=nodeInformer
DEBU[2023-07-13T13:44:22-04:00] node informer is starting             ...
DEBU[2023-07-13T15:15:45-04:00] informer got new node event(MODIFIED) for node: minikube   component=nodeInformer

Checklist

Changes have been manually QA'd
User-facing API changes need the "User-facing API Change" label.
Release notes should be added as a separate file under docs/release-notes/.
See Release Note for details.
Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-9597

master/internal/rm/kubernetesrm/events.go

master/internal/rm/kubernetesrm/pods.go

master/internal/rm/kubernetesrm/informer.go

stoksc

this looks mostly good but there is a flaw that runs throughout: the informer should be long lived and send a stream of events to the pods actor, whereas in this code it is always called synchronously and only returns one event. i think if you change to start the informer in a background routine and fire the event via a callback, all the other required changes should sort of follow from that.
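A minimal sketch of the shape suggested in this comment: a long-lived informer that runs in a background goroutine and fires a callback for every event. All names here (`podEvent`, `informer`, `run`, `collectPhases`) are hypothetical, and a plain Go channel stands in for the Kubernetes watch; this is not the code that was merged.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// podEvent is a hypothetical stand-in for a Kubernetes watch event.
type podEvent struct {
	PodName string
	Phase   string
}

// informer is a hypothetical long-lived watcher. The events channel stands in
// for the result channel of a real watch; cb is fired once per event.
type informer struct {
	events <-chan podEvent
	cb     func(podEvent)
}

func newInformer(events <-chan podEvent, cb func(podEvent)) *informer {
	return &informer{events: events, cb: cb}
}

// run blocks until ctx is cancelled or the event source is closed, firing the
// callback for every event instead of returning after the first one.
func (i *informer) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev, ok := <-i.events:
			if !ok {
				return
			}
			i.cb(ev)
		}
	}
}

// collectPhases starts the informer in a background goroutine, feeds it the
// given events, and returns the phases the callback observed, in order.
func collectPhases(evs []podEvent) []string {
	src := make(chan podEvent)
	var mu sync.Mutex
	var phases []string
	inf := newInformer(src, func(ev podEvent) {
		mu.Lock()
		defer mu.Unlock()
		phases = append(phases, ev.Phase)
	})

	done := make(chan struct{})
	go func() {
		defer close(done)
		inf.run(context.Background())
	}()

	for _, ev := range evs {
		src <- ev
	}
	close(src) // closing the source ends the informer loop
	<-done
	return phases
}

func main() {
	phases := collectPhases([]podEvent{
		{PodName: "exp-pod", Phase: "Pending"},
		{PodName: "exp-pod", Phase: "Running"},
	})
	fmt.Println(phases) // prints "[Pending Running]"
}
```

The caller owns the goroutine and the callback, so the informer itself needs no reference to an actor system.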

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/nodes.go

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/pods.go

maxrussell

Still looking, but submitting for now

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/informer_intg_test.go

maxrussell

Looking really good! A request, some suggestions, and a couple nits

master/internal/rm/kubernetesrm/informer_intg_test.go

master/internal/rm/kubernetesrm/nodes.go

stoksc

looks great, just a few comments then should be gtg

master/internal/rm/kubernetesrm/pods.go

master/internal/rm/kubernetesrm/nodes.go

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/nodes.go

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/pods.go

master/internal/rm/kubernetesrm/nodes.go

master/internal/rm/kubernetesrm/informer.go

stoksc

lgtm, though probably should get 2 approvals considering the size

stoksc

looks good just a few final comments.

master/internal/rm/kubernetesrm/informer.go

stoksc · 2023-07-17T20:57:33Z

master/internal/rm/kubernetesrm/informer.go

-		ctx.Log().WithError(err).Warnf("error retrieving internal resource version")
-		actors.NotifyAfter(ctx, defaultInformerBackoff, startInformer{})
-		return
+		panic(fmt.Sprint("pod informer has failed", err))


i would return this error and let the caller (pods.go) decide to panic, since panicking unnecessarily in libraries can make them hard to consume correctly (you have to handle an error, but you can accidentally not handle a panic). the panic in the retry watcher case is slightly different, but mostly because it requires a lot more infra to propagate; if we had more time i would probably say it shouldn't panic, either (and pods should get notified, async, of its error and decide to panic or restart it or something else).

@maxrussell / @erikwilson curious your thoughts here.

Totally agreed.

I pretty much only use panic when the system shouldn't try to recover—when I intend to crash the program—so pretty much when I'm sure it's some unrecoverable error. In this case, I don't think this function has enough context to know whether we're in that state, so returning an error so calling code can decide seems like the best option.

all good points -- my justification for panicking vs returning an error is that if the startInformer() functions were going to be placed in Initialize, then any errors that occur there are higher stakes. Additionally, I don't recall seeing any error handling within any 'initialize' functions -- although please correct me if I'm wrong & this is correlation rather than causation.
Perhaps the better solution to preserve the error handling (vs defaulting to panicking) would be to move the start functions out of Initialize -- something that @stoksc has alluded to in later comments.
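The option discussed in this thread, returning the error and letting the caller decide, can be sketched as follows. The names `newPodInformer`, `listFunc`, and `errNoResourceVersion` are hypothetical, and the function passed in stands in for the initial list call; this is an illustration of the pattern, not the merged implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoResourceVersion is a hypothetical error standing in for a failure to
// retrieve the initial resource version.
var errNoResourceVersion = errors.New("error retrieving internal resource version")

// listFunc is a hypothetical stand-in for the initial list call.
type listFunc func() (resourceVersion string, err error)

type podInformer struct {
	resourceVersion string
}

// newPodInformer returns the error to its caller instead of panicking, so the
// caller can choose to panic, retry, or log.
func newPodInformer(list listFunc) (*podInformer, error) {
	rv, err := list()
	if err != nil {
		// %w keeps the original error available to errors.Is / errors.As.
		return nil, fmt.Errorf("pod informer has failed: %w", err)
	}
	return &podInformer{resourceVersion: rv}, nil
}

func main() {
	failing := func() (string, error) { return "", errNoResourceVersion }

	inf, err := newPodInformer(failing)
	if err != nil {
		// The caller (pods.go in the discussion) has the context to decide
		// whether this is unrecoverable; this sketch only logs it.
		fmt.Println("caller decides what to do:", err)
		return
	}
	fmt.Println("informer started at resource version", inf.resourceVersion)
}
```

A caller that does consider the failure fatal can still write `panic(err)` at its own call site, which keeps that decision out of the informer.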

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/nodes.go

master/internal/rm/kubernetesrm/pods.go

stoksc

lgtm! great work.

cla-bot bot added the cla-signed label Jun 21, 2023

carolinaecalderon requested review from erikwilson, maxrussell and stoksc June 21, 2023 14:57

carolinaecalderon changed the title ~~draft work on informer/pods, TODO nodeInformer/pods~~ chore: k8s informer actor refactor [DET-9597] Jun 21, 2023

carolinaecalderon force-pushed the 9597-informer-refactor branch from 86a82f2 to f3c7905 Compare June 21, 2023 18:41

carolinaecalderon commented Jun 22, 2023

master/internal/rm/kubernetesrm/events.go

carolinaecalderon marked this pull request as ready for review June 22, 2023 15:10

carolinaecalderon requested a review from a team as a code owner June 22, 2023 15:10

carolinaecalderon commented Jun 22, 2023

master/internal/rm/kubernetesrm/pods.go Outdated

carolinaecalderon commented Jun 22, 2023

master/internal/rm/kubernetesrm/pods.go Outdated

carolinaecalderon commented Jun 22, 2023

master/internal/rm/kubernetesrm/informer.go

stoksc reviewed Jun 23, 2023

master/internal/rm/kubernetesrm/informer.go Outdated

carolinaecalderon force-pushed the 9597-informer-refactor branch from 7d6cdac to 145f769 Compare June 26, 2023 21:07

maxrussell reviewed Jun 26, 2023

master/internal/rm/kubernetesrm/informer.go

master/internal/rm/kubernetesrm/informer.go Outdated

master/internal/rm/kubernetesrm/informer.go Outdated

carolinaecalderon force-pushed the 9597-informer-refactor branch from 145f769 to 7d425bc Compare June 27, 2023 14:49

carolinaecalderon commented Jun 28, 2023

master/internal/rm/kubernetesrm/nodes.go Outdated

carolinaecalderon commented Jun 28, 2023

master/internal/rm/kubernetesrm/nodes.go Outdated

carolinaecalderon requested review from stoksc and maxrussell June 28, 2023 19:42

erikwilson reviewed Jun 28, 2023

master/internal/rm/kubernetesrm/informer.go Outdated

erikwilson reviewed Jun 28, 2023

master/internal/rm/kubernetesrm/pods.go Outdated

maxrussell reviewed Jun 29, 2023

master/internal/rm/kubernetesrm/informer.go Outdated

master/internal/rm/kubernetesrm/informer_intg_test.go Outdated

carolinaecalderon force-pushed the 9597-informer-refactor branch from 94cb80d to d00c2b5 Compare June 29, 2023 14:58

carolinaecalderon mentioned this pull request Jun 29, 2023

chore: events & preemption listener actor refactor [DET-9617] #7256

Merged

4 tasks

carolinaecalderon requested review from erikwilson and maxrussell July 3, 2023 21:21

carolinaecalderon force-pushed the 9597-informer-refactor branch 3 times, most recently from bb94d28 to fe75a53 Compare July 5, 2023 20:02

carolinaecalderon force-pushed the 9597-informer-refactor branch from 5423fd6 to 43f2407 Compare July 5, 2023 20:41

maxrussell previously requested changes Jul 7, 2023

carolinaecalderon requested review from maxrussell and stoksc and removed request for stoksc July 10, 2023 18:32

carolinaecalderon commented Jul 11, 2023

master/internal/rm/kubernetesrm/nodes.go Outdated

stoksc reviewed Jul 12, 2023

master/internal/rm/kubernetesrm/nodes.go Outdated

stoksc reviewed Jul 12, 2023

master/internal/rm/kubernetesrm/informer.go Outdated

carolinaecalderon force-pushed the 9597-informer-refactor branch 2 times, most recently from 0c3f38b to 2bde7f9 Compare July 13, 2023 14:46

carolinaecalderon requested a review from stoksc July 13, 2023 14:46

stoksc approved these changes Jul 13, 2023

carolinaecalderon changed the title ~~chore: k8s informer actor refactor [DET-9597]~~ chore: k8s pod & node informer actor refactor [DET-9597] Jul 13, 2023

carolinaecalderon force-pushed the 9597-informer-refactor branch 2 times, most recently from 9998b74 to 3c5eace Compare July 17, 2023 15:17

stoksc reviewed Jul 17, 2023

stoksc approved these changes Jul 19, 2023

chore: k8s pod & node informer refactor

1967057

carolinaecalderon force-pushed the 9597-informer-refactor branch from 49d8034 to 1967057 Compare July 19, 2023 20:02

carolinaecalderon merged commit e2112c9 into determined-ai:main Jul 19, 2023

carolinaecalderon deleted the 9597-informer-refactor branch July 19, 2023 20:33

carolinaecalderon mentioned this pull request Jul 25, 2023

chore: consolidate k8s informers code & fix Makefile mocks #7455

Merged

4 tasks

NicholasBlaskey pushed a commit that referenced this pull request Jul 25, 2023

chore: k8s pod & node informer refactor (#7182)

e3e2627

dannysauer added this to the 0.23.4 milestone Feb 6, 2024

stoksc Jul 17, 2023

maxrussell Jul 17, 2023

carolinaecalderon Jul 18, 2023