chore: k8s pod & node informer actor refactor [DET-9597] #7182
Conversation
This looks mostly good, but there is a flaw that runs throughout: the informer should be long-lived and send a stream of events to the pods actor, whereas in this code it is always called synchronously and only returns one event. I think if you change it to start the informer in a background routine and fire events via a callback, all the other required changes should follow from that.
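A minimal sketch of the shape described above, assuming a hypothetical startPodInformer helper built on client-go's watch API (the actual function names and package layout in this PR may differ):

```go
package informer

import (
	"context"

	metaV1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	typedV1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// startPodInformer (hypothetical name) starts a long-lived watch in a
// background goroutine and fires cb once per event, rather than being
// called synchronously and returning a single event.
func startPodInformer(
	ctx context.Context,
	pods typedV1.PodInterface,
	cb func(watch.Event),
) error {
	w, err := pods.Watch(ctx, metaV1.ListOptions{})
	if err != nil {
		return err // startup failures are reported to the caller
	}

	go func() {
		defer w.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case event, ok := <-w.ResultChan():
				if !ok {
					return // watch channel closed; the caller may restart
				}
				cb(event)
			}
		}
	}()
	return nil
}
```

The caller would pass a callback that forwards each event on to the pods actor, matching the "fire the event via a callback" suggestion above.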
Still looking, but submitting for now
Looking really good! A request, some suggestions, and a couple nits
Looks great, just a few comments, then should be gtg.
lgtm, though probably should get 2 approvals considering the size
Looks good, just a few final comments.
ctx.Log().WithError(err).Warnf("error retrieving internal resource version")
actors.NotifyAfter(ctx, defaultInformerBackoff, startInformer{})
return
panic(fmt.Sprint("pod informer has failed", err))
I would return this error and let the caller (pods.go) decide to panic, since panicking unnecessarily in libraries can make them hard to consume correctly (you have to handle an error, but you can accidentally not handle a panic). The panic in the retry watcher is slightly different, but mostly because it requires a lot more infra to propagate; if we had more time I would probably say it shouldn't panic either (and pods should get notified, async, of its error and decide to panic or restart it or something else).
@maxrussell / @erikwilson curious your thoughts here.
Totally agreed. I pretty much only use panic when the system shouldn't try to recover—when I intend to crash the program—so pretty much when I'm sure it's some unrecoverable error. In this case, I don't think this function has enough context to know whether we're in that state, so returning an error so calling code can decide seems like the best option.
All good points -- my justification for panicking vs. returning an error is that if the startInformer() functions were going to be placed in Initialize, then my bias is that any errors that occur there are higher stakes. Additionally, I don't recall seeing any error handling within any 'initialize' functions -- although please correct me if I'm wrong & this correlation is not causation.
Perhaps the better solution to preserve the error handling (vs. defaulting to panicking) would be to move the start functions out of Initialize -- something that @stoksc has alluded to in later comments.
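A hypothetical sketch of the direction discussed in this thread: the informer package returns the error, and the caller in pods.go (with the start calls moved out of Initialize) decides whether a failed start is fatal. Names here are illustrative, not the PR's actual API.

```go
package pods

import (
	"context"
	"fmt"
)

// startInformersOrDie is a hypothetical caller-side helper: pods.go has the
// context to know that a failed informer start is unrecoverable, so the
// panic lives here instead of inside the informer package.
func startInformersOrDie(ctx context.Context, starts ...func(context.Context) error) {
	for _, start := range starts {
		if err := start(ctx); err != nil {
			panic(fmt.Sprintf("failed to start informer: %v", err))
		}
	}
}
```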
lgtm! great work.
Description
As part of the actor refactor project, remove all references to the actor system from the pod & node informers for Kubernetes. To test these changes, informer_intg_test.go is added, in addition to generated mock files.
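A hedged sketch of what such a test could look like, exercising the hypothetical startPodInformer from the earlier sketch against client-go's fake clientset; the actual informer_intg_test.go in this PR likely uses the generated mocks instead, so names and structure here are illustrative only.

```go
package informer

import (
	"context"
	"testing"

	coreV1 "k8s.io/api/core/v1"
	metaV1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes/fake"
)

// TestPodInformerForwardsEvents checks that events observed by the watch are
// delivered to the callback, using a fake clientset instead of a real cluster.
func TestPodInformerForwardsEvents(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	clientset := fake.NewSimpleClientset()
	events := make(chan watch.Event, 1)

	err := startPodInformer(ctx, clientset.CoreV1().Pods("default"), func(e watch.Event) {
		events <- e
	})
	if err != nil {
		t.Fatalf("failed to start informer: %v", err)
	}

	// Creating a pod through the fake clientset should surface as an Added event.
	_, err = clientset.CoreV1().Pods("default").Create(ctx, &coreV1.Pod{
		ObjectMeta: metaV1.ObjectMeta{Name: "trial-pod"},
	}, metaV1.CreateOptions{})
	if err != nil {
		t.Fatalf("failed to create pod: %v", err)
	}

	if e := <-events; e.Type != watch.Added {
		t.Fatalf("expected Added event, got %v", e.Type)
	}
}
```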
Test Plan
CircleCI workflows: OSS, EE
Set-up: Follow instructions here to run Determined backed by Kubernetes, but outside of Kubernetes. Start your Kubernetes cluster & start the Determined devcluster.
Case 1a: Suppose an experiment starts/completes successfully, so the pod informers report “Pending”/“Running” until the resources are deleted. The devcluster logs should match the content below:
Case 2a: Suppose there is a long-running experiment whose pod is killed manually (on the Kubernetes side), so the pod informer fails. The devcluster logs should match the content below; this tells us that Determined “heard” about the pod failure & the informer cannot accept more events.
Case 1b: Suppose an experiment starts/completes successfully, so the node informers report “Pending”/“Running” until the resources are deleted. TODO
Upon start:
Checklist
docs/release-notes/. See Release Note for details.
Ticket
DET-9597