chore: k8s pod & node informer actor refactor [DET-9597] #7182

Merged

carolinaecalderon
Contributor

@carolinaecalderon carolinaecalderon commented Jun 21, 2023

Description

As part of the actor refactor project, this PR removes all references to the actor system from the Kubernetes pod & node informers. To test these changes, it adds informer_intg_test.go along with generated mock files.

Test Plan

CircleCI workflows: OSS, EE

Set-up: Follow instructions here to run Determined backed by Kubernetes, but outside of Kubernetes. Start your Kubernetes cluster & start the Determined devcluster.

Case 1a: Suppose an experiment starts/completes successfully, so the pod informers report “Pending”/“Running” until the resources are deleted. The devcluster logs should match the content below:

# To start:
DEBU[...] pod informer is starting                      Informer=default
time="..." level=info msg="created pod <exp-pod-name>" ...
DEBU[...] informer got new pod event for pod: <exp-pod-name> Pending Informer=default
DEBU[...] informer got new pod event for pod: <exp-pod-name> Running Informer=default

# Upon canceling from Determined WebUI:
DEBU[2023-07-10T14:33:42-04:00] informer got new pod event for pod: <exp-pod-name> Failed Informer=default

# Upon successful completion:
# If the informer receives any updates AFTER the experiment is completed, a warning is logged.
DEBU[...] informer got new pod event for pod: <exp-pod-name> Running  Informer=default
INFO[...] transitioning pod state from RUNNING to TERMINATED  ...
INFO[...] pod exited successfully                       ...
INFO[...] requesting to delete kubernetes resources     ...
DEBU[...] resources state changed: ... ResourcesStopped:resources exited successfully  ...
time="..." level=info msg="deleted pod <exp-pod-name>" ...
INFO[...] de-registering pod handler                    ... pod=<exp-pod-name>
DEBU[...] informer got new pod event for pod: <exp-pod-name> Running  Informer=default
WARN[...] received pod status update for un-registered pod  ... pod-name=<exp-pod-name>
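
The last three log lines above (de-register, late event, warning) are consistent with a handler registry keyed by pod name. The following is a minimal sketch of that idea, not the code in this PR: `podRegistry`, `register`, `deregister`, and `dispatch` are hypothetical names, and the log line is printed with `fmt` rather than a real logger.

```go
package main

import "fmt"

// podRegistry is a hypothetical map from pod name to status handler.
type podRegistry struct {
	handlers map[string]func(phase string)
}

func newPodRegistry() *podRegistry {
	return &podRegistry{handlers: map[string]func(phase string){}}
}

// register adds a handler for the named pod.
func (r *podRegistry) register(name string, h func(phase string)) {
	r.handlers[name] = h
}

// deregister removes the handler, e.g. after the pod's resources are deleted.
func (r *podRegistry) deregister(name string) {
	delete(r.handlers, name)
}

// dispatch forwards an informer event to the pod's handler. If the pod is no
// longer registered, it prints a warning and returns false.
func (r *podRegistry) dispatch(name, phase string) bool {
	h, ok := r.handlers[name]
	if !ok {
		fmt.Printf("WARN received pod status update for un-registered pod pod-name=%s\n", name)
		return false
	}
	h(phase)
	return true
}

func main() {
	r := newPodRegistry()
	r.register("exp-pod", func(phase string) { fmt.Println("handled", phase) })
	r.dispatch("exp-pod", "Running") // handled by the registered handler
	r.deregister("exp-pod")
	r.dispatch("exp-pod", "Running") // late event: warning only
}
```

Under this assumption, an event that arrives after de-registration is harmless: it is dropped with a warning instead of reaching a stale handler.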

Case 2a: Suppose the pod of a long-running experiment is killed manually (on the Kubernetes side), so the pod informer reports a failure. The devcluster logs should match the content below. This tells us that Determined “heard” about the pod failure & the informer cannot accept more events.

INFO[...] requesting to delete kubernetes resources ...
DEBU[...] resources state changed: ... ResourcesStopped:resources failed with non-zero exit code: ... 
INFO[...] resources are released for <pod-name> ...
DEBU[...] informer got new pod event for pod: <exp-pod-name> Failed Informer=default
INFO[...] experiment shut down successfully ...

Case 1b: Suppose an experiment starts/completes successfully, so the node informers report “Pending”/“Running” until the resources are deleted. TODO
Upon start:

DEBU[2023-07-13T13:44:22-04:00] informer added node: <node-name>                 component=nodeInformer
DEBU[2023-07-13T13:44:22-04:00] node informer is starting             ...
DEBU[2023-07-13T15:15:45-04:00] informer got new node event(MODIFIED) for node: minikube   component=nodeInformer

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

DET-9597

@cla-bot cla-bot bot added the cla-signed label Jun 21, 2023
@carolinaecalderon carolinaecalderon changed the title from "draft work on informer/pods, TODO nodeInformer/pods" to "chore: k8s informer actor refactor [DET-9597]" Jun 21, 2023
@carolinaecalderon carolinaecalderon marked this pull request as ready for review June 22, 2023 15:10
@carolinaecalderon carolinaecalderon requested a review from a team as a code owner June 22, 2023 15:10
Contributor

@stoksc stoksc left a comment

this looks mostly good but there is a flaw that runs throughout: the informer should be long lived and send a stream of events to the pods actor, whereas in this code it is always called synchronously and only returns one event. i think if you change to start the informer in a background routine and fire the event via a callback, all the other required changes should sort of follow from that.
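
The suggestion in this comment (a long-lived informer running in a background goroutine and firing a callback per event) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: `podEvent`, `podCallback`, `informer`, and `collectPhases` are hypothetical names, and a plain Go channel stands in for the result channel of a real Kubernetes watch.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// podEvent is a hypothetical, simplified stand-in for a Kubernetes watch event.
type podEvent struct {
	PodName string
	Phase   string
}

// podCallback is invoked once per event for as long as the informer runs.
type podCallback func(podEvent)

// informer is a hypothetical long-lived watcher. The events channel stands in
// for the result channel of a real watch.
type informer struct {
	events <-chan podEvent
	cb     podCallback
}

// run blocks, forwarding every event to the callback until the context is
// canceled or the event source is closed.
func (i *informer) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev, ok := <-i.events:
			if !ok {
				return
			}
			i.cb(ev)
		}
	}
}

// collectPhases starts the informer in a background goroutine, feeds it the
// given events, and returns the phases the callback observed, in order.
func collectPhases(evs []podEvent) []string {
	source := make(chan podEvent)
	var mu sync.Mutex
	var seen []string
	inf := &informer{events: source, cb: func(ev podEvent) {
		mu.Lock()
		defer mu.Unlock()
		seen = append(seen, ev.Phase)
	}}

	done := make(chan struct{})
	go func() {
		defer close(done)
		inf.run(context.Background())
	}()

	for _, ev := range evs {
		source <- ev
	}
	close(source)
	<-done
	return seen
}

func main() {
	phases := collectPhases([]podEvent{
		{PodName: "exp-pod", Phase: "Pending"},
		{PodName: "exp-pod", Phase: "Running"},
	})
	fmt.Println(phases) // prints [Pending Running]
}
```

The point of the shape is that the caller starts the informer once and keeps receiving events through the callback, instead of calling it synchronously and getting a single event back.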

master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
Contributor

@maxrussell maxrussell left a comment

Still looking, but submitting for now

master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
master/internal/rm/kubernetesrm/informer_intg_test.go (outdated, resolved)
@carolinaecalderon carolinaecalderon force-pushed the 9597-informer-refactor branch from 5423fd6 to 43f2407 Compare July 5, 2023 20:41
Contributor

@maxrussell maxrussell left a comment

Looking really good! A request, some suggestions, and a couple nits

@carolinaecalderon carolinaecalderon requested review from maxrussell and stoksc and removed request for stoksc July 10, 2023 18:32
Contributor

@stoksc stoksc left a comment

looks great, just a few comments then should be gtg

master/internal/rm/kubernetesrm/pods.go (outdated, resolved)
master/internal/rm/kubernetesrm/nodes.go (outdated, resolved)
master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
master/internal/rm/kubernetesrm/nodes.go (outdated, resolved)
master/internal/rm/kubernetesrm/nodes.go (outdated, resolved)
master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
master/internal/rm/kubernetesrm/pods.go (outdated, resolved)
master/internal/rm/kubernetesrm/pods.go (outdated, resolved)
@carolinaecalderon carolinaecalderon force-pushed the 9597-informer-refactor branch 2 times, most recently from 0c3f38b to 2bde7f9 Compare July 13, 2023 14:46
@carolinaecalderon carolinaecalderon requested a review from stoksc July 13, 2023 14:46
Contributor

@stoksc stoksc left a comment

lgtm, though probably should get 2 approvals considering the size

@carolinaecalderon carolinaecalderon changed the title from "chore: k8s informer actor refactor [DET-9597]" to "chore: k8s pod & node informer actor refactor [DET-9597]" Jul 13, 2023
@carolinaecalderon carolinaecalderon force-pushed the 9597-informer-refactor branch 2 times, most recently from 9998b74 to 3c5eace Compare July 17, 2023 15:17
Contributor

@stoksc stoksc left a comment

looks good just a few final comments.

master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
ctx.Log().WithError(err).Warnf("error retrieving internal resource version")
actors.NotifyAfter(ctx, defaultInformerBackoff, startInformer{})
return
panic(fmt.Sprint("pod informer has failed", err))
Contributor

i would return this error and let the caller (pods.go) decide whether to panic, since panicking unnecessarily in libraries can make them hard to consume correctly (you have to handle an error, you can accidentally not handle a panic). the panic in the retry watcher case is slightly different, but mostly because it requires a lot more infra to propagate; if we had more time i would probably say it shouldn't panic, too (and pods should get notified, async, of its error and decide to panic or restart it or something else).

@maxrussell / @erikwilson curious your thoughts here.

Contributor

Totally agreed.

I pretty much only use panic when the system shouldn't try to recover—when I intend to crash the program—so pretty much when I'm sure it's some unrecoverable error. In this case, I don't think this function has enough context to know whether we're in that state, so returning an error so calling code can decide seems like the best option.

Contributor Author

all good points -- my justification for panicking vs returning an error is that if the startInformer() functions were going to be placed in Initialize, then my bias is that any errors that occur there are higher stakes. Additionally, I don't think I've seen any error handling within any 'initialize' functions from my memory -- although please correct me if I'm wrong & this correlation is not causation.
Perhaps the better solution to preserve the error handling (vs defaulting to panicking) would be to move the start functions out of Initialize -- something that @stoksc has alluded to in later comments

master/internal/rm/kubernetesrm/informer.go (outdated, resolved)
master/internal/rm/kubernetesrm/nodes.go (outdated, resolved)
master/internal/rm/kubernetesrm/pods.go (outdated, resolved)
master/internal/rm/kubernetesrm/pods.go (outdated, resolved)
Contributor

@stoksc stoksc left a comment

lgtm! great work.

@carolinaecalderon carolinaecalderon merged commit e2112c9 into determined-ai:main Jul 19, 2023
@carolinaecalderon carolinaecalderon deleted the 9597-informer-refactor branch July 19, 2023 20:33
@dannysauer dannysauer added this to the 0.23.4 milestone Feb 6, 2024
5 participants