events: make kube broadcaster shutdown gracefully and tune correlator so we don't lose events #777
|
/hold this will cause panics if we try to write events after shutdown. You probably don't want that. |
e1c672a to
32d27cb
Compare
|
@deads2k this flushes what is in the sink and waits until all those events are recorded. It is true that new events arriving after we call shutdown won't make it, but I don't know how we can stop them from coming. Could we time the shutdown as the last thing we call, AFTER all controllers are down?
The issue isn't that we don't get them. The issue is that the event write will panic because the incoming channel is closed. |
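The failure mode described above can be reproduced in isolation. This is a minimal sketch, not the broadcaster code: `incoming` and `trySend` are hypothetical names, and the only fact relied on is that Go panics on a send to a closed channel.

```go
package main

import "fmt"

// trySend attempts a send on ch and reports whether the send panicked.
// It is a hypothetical helper used only to make the panic observable.
func trySend(ch chan string, msg string) (panicked bool) {
	defer func() {
		// A send on a closed channel panics; recover turns that into a result.
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	ch <- msg
	return false
}

func main() {
	// incoming stands in for the broadcaster's incoming event channel.
	incoming := make(chan string, 1)

	fmt.Println("panicked before close:", trySend(incoming, "event-1"))

	// Shutdown closes the channel; any later write panics.
	close(incoming)
	fmt.Println("panicked after close:", trySend(incoming, "event-2"))
}
```

So the recorder needs either a guard in front of the write or a guarantee that nothing writes after shutdown.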
32d27cb to
03633a2
Compare
|
/retest |
c25731d to
ada3238
Compare
|
@deads2k proof: openshift/cluster-kube-apiserver-operator#837 I made some tweaks:
|
| Name: "total_events_count", | ||
| Help: "Total count of events processed by this event recorder per involved object", | ||
| StabilityLevel: metrics.ALPHA, | ||
| }, []string{"namespace", "name"}) |
There was a problem hiding this comment.
@deads2k I think this will have reasonable cardinality and can show us how many events we receive per operator... I wonder if we want to break this into warning vs. normal events :-)
I imagine an alert that fires if we see an abnormal number of warning events in a period of time...
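To make the warning vs. normal split concrete, here is a rough stdlib-only sketch of a counter keyed by namespace, name and event type. It does not use the real metrics package; `eventCounter`, `labels` and the label values are hypothetical, and the extra `eventType` label is the assumption being illustrated.

```go
package main

import (
	"fmt"
	"sync"
)

// labels mirrors the metric labels from the diff, plus a hypothetical eventType label.
type labels struct{ namespace, name, eventType string }

// eventCounter is a stand-in for a labeled counter vector.
type eventCounter struct {
	mu     sync.Mutex
	counts map[labels]int
}

func newEventCounter() *eventCounter {
	return &eventCounter{counts: map[labels]int{}}
}

// inc increments the counter for one label combination.
func (c *eventCounter) inc(namespace, name, eventType string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[labels{namespace, name, eventType}]++
}

// get returns the current value for one label combination.
func (c *eventCounter) get(namespace, name, eventType string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[labels{namespace, name, eventType}]
}

func main() {
	c := newEventCounter()
	c.inc("openshift-kube-apiserver-operator", "kube-apiserver-operator", "Normal")
	c.inc("openshift-kube-apiserver-operator", "kube-apiserver-operator", "Warning")
	c.inc("openshift-kube-apiserver-operator", "kube-apiserver-operator", "Warning")
	fmt.Println("warnings:", c.get("openshift-kube-apiserver-operator", "kube-apiserver-operator", "Warning"))
}
```

The type label only has two values, so it doubles the series count at most, which keeps cardinality bounded.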
0584af7 to
c4cafbc
Compare
|
/retest found typo in resourcesynccontroller unit test: #780 |
7f935c9 to
ceda9aa
Compare
|
@deads2k updated, made it configurable. |
|
updated the proof as well: openshift/cluster-kube-apiserver-operator#837 |
|
|
||
| // fallbackRecorder is used when the kube recorder is shutting down | ||
| // in that case we create the events directly. | ||
| fallbackRecorder Recorder |
There was a problem hiding this comment.
oh what a wicked web we weave
There was a problem hiding this comment.
I can just log the events that come after shutdown... I think the chance we leak events after shutdown is triggered is really small (the window is basically ~1-2s), but this makes sure we don't miss any event at all...
| fallbackRecorder Recorder | ||
| } | ||
|
|
||
| var DefaultOperatorEventRecorderOptions = record.CorrelatorOptions{ |
There was a problem hiding this comment.
still forgot the keyFunc that needs to include the message to avoid de-duping different messages like we have today.
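A sketch of why the message has to be part of the key. This uses a trimmed-down, hypothetical `event` struct instead of `corev1.Event`, and assumes the aggregate key is a plain join of fields, as the key function fragment in this PR suggests.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a hypothetical, trimmed-down stand-in for corev1.Event.
type event struct {
	Component, Kind, Namespace, Name, Type, Reason, Message string
}

// keyWithoutMessage groups events by everything except the message,
// so two different messages with the same reason share one key.
func keyWithoutMessage(e event) string {
	return strings.Join([]string{e.Component, e.Kind, e.Namespace, e.Name, e.Type, e.Reason}, "")
}

// keyWithMessage also includes the message, so different messages get different keys.
func keyWithMessage(e event) string {
	return strings.Join([]string{e.Component, e.Kind, e.Namespace, e.Name, e.Type, e.Reason, e.Message}, "")
}

func main() {
	a := event{"operator", "Deployment", "ns", "op", "Normal", "StatusChanged", "Degraded changed to False"}
	b := a
	b.Message = "Progressing changed to True"

	fmt.Println("same key without message:", keyWithoutMessage(a) == keyWithoutMessage(b))
	fmt.Println("same key with message:", keyWithMessage(a) == keyWithMessage(b))
}
```

With the message in the key, only events that are truly identical are candidates for de-duplication.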
|
|
||
| var DefaultOperatorEventRecorderOptions = record.CorrelatorOptions{ | ||
| BurstSize: 60, // default: 25 (change allows a single source to send a burst of up to 60 events about an object) | ||
| QPS: 1. / 60., // default: 1/300 (change raises the refill rate to 1 new event every minute) |
There was a problem hiding this comment.
I think it should just be 1.0, for one per second. We want them all!
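For anyone comparing the two QPS values, here is a toy token bucket that shows the difference. It is a simplified model with made-up names (`bucket`, `allow`), not the rate limiter client-go actually uses; time is passed in as plain seconds to keep it deterministic.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is a toy token bucket: it starts full with `burst` tokens
// and refills at `qps` tokens per second, capped at `burst`.
type bucket struct {
	burst, qps, tokens, last float64
}

func newBucket(burst int, qps float64) *bucket {
	return &bucket{burst: float64(burst), qps: qps, tokens: float64(burst)}
}

// allow reports whether one event may pass at time `now` (in seconds).
func (b *bucket) allow(now float64) bool {
	b.tokens = math.Min(b.burst, b.tokens+(now-b.last)*b.qps)
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	fast := newBucket(60, 1.0)      // one new event per second
	slow := newBucket(60, 1.0/60.0) // one new event per minute

	// Use up the initial burst on both buckets at t=0.
	for i := 0; i < 60; i++ {
		fast.allow(0)
		slow.allow(0)
	}

	fmt.Println("fast allows at t=1s:", fast.allow(1))
	fmt.Println("slow allows at t=1s:", slow.allow(1))
}
```

Once the burst is spent, the refill rate alone decides how many events survive, which is why the value matters for a chatty operator.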
ceda9aa to
4df8c12
Compare
4df8c12 to
d23c2fd
Compare
| kubeInformers := informers.NewSharedInformerFactoryWithOptions(o.kubeClient, 10*time.Minute, informers.WithNamespace(o.Namespace)) | ||
|
|
||
| eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), "cert-syncer", | ||
| eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), record.CorrelatorOptions{}, "cert-syncer", |
There was a problem hiding this comment.
we want the operator level one for this
d23c2fd to
7d0764b
Compare
|
before change, openshift-kube-apiserver-operator had 186 events, now it has 383 events. There's a lot, but I think we can manage it. |
|
@mfojtik some with the same message are showing up as separate events now. This surprises me. |
| event.Reason, | ||
| event.Message, | ||
| }, | ||
| ""), event.Message |
| // Event emits the normal type event. | ||
| func (r *upstreamRecorder) Event(reason, message string) { | ||
| defer r.incrementEventsCounter(corev1.EventTypeNormal) | ||
| if r.isShuttingDown() { |
| // Warning emits the warning type event. | ||
| func (r *upstreamRecorder) Warning(reason, message string) { | ||
| defer r.incrementEventsCounter(corev1.EventTypeWarning) | ||
| if r.isShuttingDown() { |
There was a problem hiding this comment.
Shutdown can be called after isShuttingDown has released the lock but before the event is written. We would lose events.
| kubeInformers := informers.NewSharedInformerFactoryWithOptions(o.kubeClient, 10*time.Minute, informers.WithNamespace(o.Namespace)) | ||
|
|
||
| eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), "cert-syncer", | ||
| eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), events.RecommendedClusterSingletonCorrelatorOptions(), "cert-syncer", |
There was a problem hiding this comment.
award for the longest identifier! 🏆
There was a problem hiding this comment.
it has some Java smell, I'm not surprised :D
1288255 to
c04379f
Compare
|
|
||
| // NewKubeRecorder returns new event recorder. | ||
| func NewKubeRecorder(client corev1client.EventInterface, sourceComponentName string, involvedObjectRef *corev1.ObjectReference) Recorder { | ||
| func NewKubeRecorder(client corev1client.EventInterface, options record.CorrelatorOptions, sourceComponentName string, involvedObjectRef *corev1.ObjectReference) Recorder { |
There was a problem hiding this comment.
do we want to break this interface?
There was a problem hiding this comment.
I can make NewKubeRecorderWithOptions ?
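One way that could look, keeping the old constructor as a thin wrapper. This is only a sketch: the types below are simplified, hypothetical stand-ins (no client, no involved object reference), and passing zero-value options from the old constructor is an assumption about the desired default.

```go
package main

import "fmt"

// CorrelatorOptions is a simplified stand-in for record.CorrelatorOptions.
type CorrelatorOptions struct {
	BurstSize int
	QPS       float32
}

// Recorder is a simplified stand-in for the events.Recorder interface.
type Recorder interface {
	Event(reason, message string)
}

type kubeRecorder struct {
	component string
	options   CorrelatorOptions
}

func (r *kubeRecorder) Event(reason, message string) {
	fmt.Printf("[%s] %s: %s\n", r.component, reason, message)
}

// NewKubeRecorderWithOptions is the new entry point that accepts correlator options.
func NewKubeRecorderWithOptions(options CorrelatorOptions, component string) Recorder {
	return &kubeRecorder{component: component, options: options}
}

// NewKubeRecorder keeps its old signature and delegates with zero-value options,
// so existing callers compile unchanged.
func NewKubeRecorder(component string) Recorder {
	return NewKubeRecorderWithOptions(CorrelatorOptions{}, component)
}

func main() {
	NewKubeRecorder("cert-syncer").Event("Started", "old call site still works")
	NewKubeRecorderWithOptions(CorrelatorOptions{BurstSize: 60, QPS: 1}, "operator").Event("Started", "tuned recorder")
}
```

Existing callers stay source-compatible, and only the call sites that care about tuning move to the new function.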
| // Event emits the normal type event. | ||
| func (r *upstreamRecorder) Event(reason, message string) { | ||
| r.shutdownMutex.Lock() | ||
| defer r.shutdownMutex.Unlock() |
There was a problem hiding this comment.
this is a long lock. Use a RW lock and lock here read-only, and write-lock in the shutdown call func.
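A sketch of the suggested locking scheme, combined with the fallback idea from earlier in the thread. Everything here is hypothetical (`recorder`, `incoming`, the in-memory `fallback` slice); it only illustrates that a read lock held across the check and the write, plus a write lock in Shutdown, closes the race without serializing writers.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder is a hypothetical stand-in for upstreamRecorder.
type recorder struct {
	mu           sync.RWMutex // read-locked by writers, write-locked by Shutdown
	shuttingDown bool
	incoming     chan string // stands in for the broadcaster queue

	fallbackMu sync.Mutex
	fallback   []string // stands in for the fallback recorder
}

func newRecorder(size int) *recorder {
	return &recorder{incoming: make(chan string, size)}
}

// Event holds the read lock across the check and the write, so Shutdown
// cannot close the channel in between. Many Event calls may run concurrently.
func (r *recorder) Event(msg string) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if r.shuttingDown {
		r.fallbackMu.Lock()
		r.fallback = append(r.fallback, msg)
		r.fallbackMu.Unlock()
		return
	}
	r.incoming <- msg
}

// Shutdown takes the write lock, so it waits for in-flight Event calls.
func (r *recorder) Shutdown() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.shuttingDown {
		return
	}
	r.shuttingDown = true
	close(r.incoming)
}

// run sends total events concurrently while shutting down and reports
// how many went through the queue and how many through the fallback.
func run(total int) (delivered, viaFallback int) {
	r := newRecorder(8)
	done := make(chan int)
	go func() {
		n := 0
		for range r.incoming {
			n++
		}
		done <- n
	}()
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.Event("event")
		}()
	}
	r.Shutdown()
	wg.Wait()
	return <-done, len(r.fallback)
}

func main() {
	delivered, viaFallback := run(50)
	fmt.Println("all events accounted for:", delivered+viaFallback == 50)
}
```

One caveat with this shape: a writer blocked on a full queue holds the read lock, so Shutdown waits until the queue drains.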
c04379f to
209722f
Compare
| // This is needed if the binary is sending a lot of events. | ||
| // Using events.DefaultOperatorEventRecorderOptions here makes a good default for normal operator binary. | ||
| func (b *ControllerBuilder) WithEventRecorderOptions(options record.CorrelatorOptions) *ControllerBuilder { | ||
| b.eventRecorderOptions = options |
There was a problem hiding this comment.
I expected to set it to the recommended options by default. Is that the case?
There was a problem hiding this comment.
I guess https://github.com/openshift/library-go/pull/777/files#diff-ff937793ae2933db156923ac8eaebbd8R271 is doing this, right?
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mfojtik, sttts
|
/hold cancel |
events: make kube broadcaster shutdown gracefully and tune correlator so we don't lose events
This change wires the broadcaster Shutdown() function into the library-go event recorder. This is later used in the controllercmd builder, where it is called when the binary's leader election changes.
In addition, this change provides more fine-tuned correlator options for operators that send a lot of events and don't want to lose any. These are now the default for controllercmd builder based operators.