
events: make kube broadcaster shutdown gracefully and tune correlator so we don't lose events #777

Merged
openshift-merge-robot merged 1 commit into openshift:master from mfojtik:events-shutdown on Apr 27, 2020

Conversation

mfojtik (Contributor) commented Apr 21, 2020

This change wires the broadcaster Shutdown() function into the library-go event recorder.
This is later used in the controllercmd builder, where it is called when the binary's leader election changes.

In addition, this change provides more fine-tuned correlator options that operators which send a lot of events and don't want to lose events should use. These are now the default for controllercmd builder-based operators.
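As a rough illustration of the intended flow (not the exact controllercmd wiring; the type assertion below is only a hedge, since where exactly Shutdown() is exposed is an implementation detail of this PR):

```go
// A minimal sketch, assuming the recorder gained the Shutdown() this PR describes;
// the surrounding wiring is illustrative, not the actual controllercmd code.
package operator

import (
	"context"

	"github.com/openshift/library-go/pkg/operator/events"
)

func runOperator(ctx context.Context, recorder events.Recorder) {
	// Controllers run until leader election is lost or the process is asked to stop.
	<-ctx.Done()

	// After all controllers have returned, flush the kube broadcaster so events that
	// are still buffered get written instead of being dropped on the leader handoff.
	if s, ok := recorder.(interface{ Shutdown() }); ok {
		s.Shutdown()
	}
}
```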

deads2k (Contributor) commented Apr 21, 2020

/hold

this will cause panics if we try to write events after shutdown. You probably don't want that.

openshift-ci-robot added the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") on Apr 21, 2020
mfojtik (Contributor Author) commented Apr 21, 2020

@deads2k this will flush what is in the sink and wait until all those events are recorded... it is true that "new" events that arrive after we call shutdown won't make it... but I don't know how we can stop them from coming. We could time the shutdown as the last thing we call AFTER all controllers are down?

deads2k (Contributor) commented Apr 21, 2020

> @deads2k this will flush what is in the sink and wait until all those events are recorded... it is true that "new" events that arrive after we call shutdown won't make it... but I don't know how we can stop them from coming. We could time the shutdown as the last thing we call AFTER all controllers are down?

The issue isn't that we don't get them. The issue is that the event write will panic because the incoming channel is closed.

mfojtik (Contributor Author) commented Apr 22, 2020

/retest

mfojtik force-pushed the events-shutdown branch 2 times, most recently from c25731d to ada3238, on April 22, 2020 08:20
mfojtik (Contributor Author) commented Apr 22, 2020

@deads2k proof: openshift/cluster-kube-apiserver-operator#837

I made some tweaks:

  • During shutdown we switch to a normal client, so we keep creating events even while the broadcaster is flushing
  • I added a Prometheus counter for events, so in the future we can check how many events we send and how many events we see
  • I tweaked the upstream broadcaster to allow twice as many events from a single source as normal (this should improve our event throughput, although I would want to see the current rate of dropped events first maybe).

Name: "total_events_count",
Help: "Total count of events processed by this event recorder per involved object",
StabilityLevel: metrics.ALPHA,
}, []string{"namespace", "name"})
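For context, a hedged reconstruction of what the full counter declaration in the excerpt above might look like with component-base metrics; the actual variable name, labels, and registration in the PR may differ:

```go
// A sketch only: the real counter in this PR may use different names or labels.
package events

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var eventsCounter = metrics.NewCounterVec(&metrics.CounterOpts{
	Name:           "total_events_count",
	Help:           "Total count of events processed by this event recorder per involved object",
	StabilityLevel: metrics.ALPHA,
}, []string{"namespace", "name"})

func init() {
	// Expose the counter on the default /metrics endpoint.
	legacyregistry.MustRegister(eventsCounter)
}
```

The recorder would then bump it with something like `eventsCounter.WithLabelValues(ref.Namespace, ref.Name).Inc()` on every emitted event.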
mfojtik (Contributor Author):

@deads2k i think this will have reasonable cardinality and can show us how many events we receive per operator... i wonder if we want to break this into warnings vs. normal events :-)

i imagine an alert that can fire if we see "abnormal" amount of "warning" events in a period of time...


mfojtik force-pushed the events-shutdown branch 5 times, most recently from 0584af7 to c4cafbc, on April 22, 2020 11:39
mfojtik (Contributor Author) commented Apr 22, 2020

/retest

found typo in resourcesynccontroller unit test: #780

mfojtik force-pushed the events-shutdown branch 2 times, most recently from 7f935c9 to ceda9aa, on April 22, 2020 13:11
mfojtik (Contributor Author) commented Apr 22, 2020

@deads2k updated, made it configurable.

mfojtik (Contributor Author) commented Apr 22, 2020

updated the proof as well: openshift/cluster-kube-apiserver-operator#837


// fallbackRecorder is used when the kube recorder is shutting down
// in that case we create the events directly.
fallbackRecorder Recorder
Contributor:

oh what a wicked web we weave

mfojtik (Contributor Author):

i can just log the events that come after shutdown... i think the chance we leak events after shutdown is triggered is really small (the window is basically ~1-2s), but this makes sure we don't miss any events at all...
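A hedged sketch of that fallback idea, with field names mirroring the diff excerpts in this thread (the merged code may differ):

```go
// Sketch: once shutdown starts, events bypass the draining broadcaster and are written
// through a plain client-backed recorder instead, so nothing panics or gets lost.
package events

import corev1 "k8s.io/api/core/v1"

type recorderSketch struct {
	shuttingDown     bool
	fallbackRecorder Recorder                                // client-backed, used during shutdown
	broadcast        func(eventType, reason, message string) // normal broadcaster path
}

func (r *recorderSketch) Event(reason, message string) {
	if r.shuttingDown {
		r.fallbackRecorder.Event(reason, message)
		return
	}
	r.broadcast(corev1.EventTypeNormal, reason, message)
}
```

The unguarded boolean here is exactly the race discussed further down in this review; the RW-lock sketch later in the thread addresses it.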

fallbackRecorder Recorder
}

var DefaultOperatorEventRecorderOptions = record.CorrelatorOptions{
Contributor:

still missing the keyFunc that needs to include the message, to avoid de-duping different messages like we have today.

mfojtik (Contributor Author):

keyFunc added
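For reference, a hedged sketch of a key function that includes the message (the exact fields joined in the merged code may differ; the later diff excerpt in this review joins reason and message the same way):

```go
// Sketch of a CorrelatorOptions.KeyFunc that keeps the message in the aggregation key,
// so only truly identical events are de-duplicated.
package events

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

func eventKeyFunc(event *corev1.Event) (aggregateKey string, localKey string) {
	return strings.Join([]string{
		event.Source.Component,
		event.InvolvedObject.Kind,
		event.InvolvedObject.Namespace,
		event.InvolvedObject.Name,
		event.Type,
		event.Reason,
		event.Message, // the addition under discussion
	}, ""), event.Message
}
```

It would be plugged in through record.CorrelatorOptions{KeyFunc: eventKeyFunc, ...}.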


var DefaultOperatorEventRecorderOptions = record.CorrelatorOptions{
BurstSize: 60, // default: 25 (change allows a single source to send 50 events about object per minute)
QPS: 1. / 60., // default: 1/300 (change allows refill rate to 1 new event every 2 minutes)
Contributor:

I think it is just 1.0 for one per second. We want them all!

mfojtik (Contributor Author):

done
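Illustrative values only (the merged defaults live behind the recommended-options helper and may differ): BurstSize and QPS feed the correlator's spam-filter token bucket, so one source/object key can burst BurstSize events and then refills at QPS tokens per second.

```go
// Sketch with assumed values; the upstream defaults are a burst of 25 and a refill of 1/300.
package events

import "k8s.io/client-go/tools/record"

var operatorCorrelatorOptions = record.CorrelatorOptions{
	BurstSize: 60,  // allow a burst of 60 events per key before throttling kicks in
	QPS:       1.0, // then refill one event per second, per the suggestion above
}
```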

mfojtik changed the title from "events: wire shutdown" to "events: make kube broadcaster shutdown gracefully and tune correlator so we don't lose events" on Apr 23, 2020
kubeInformers := informers.NewSharedInformerFactoryWithOptions(o.kubeClient, 10*time.Minute, informers.WithNamespace(o.Namespace))

eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), "cert-syncer",
eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), record.CorrelatorOptions{}, "cert-syncer",
Contributor:

we want the operator-level one for this

mfojtik (Contributor Author):

fixed

deads2k (Contributor) commented Apr 23, 2020

Before the change, openshift-kube-apiserver-operator had 186 events; now it has 383. That's a lot, but I think we can manage it.

deads2k (Contributor) commented Apr 23, 2020

@mfojtik some with the same message are showing up as separate events now. This surprises me.

04:39:27 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:27 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:27 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:28 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:28 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:28 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:28 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:29 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:29 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:41 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)
04:39:58 (1) "openshift-kube-apiserver-operator" OperatorStatusChanged Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node ""ip-10-0-132-198.us-east-2.compute.internal"" not ready since 2020-04-23 08:34:47 +0000 UTC because KubeletNotReady (runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?)

from proof PR https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/837/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial/321

mfojtik (Contributor Author) commented Apr 23, 2020

@deads2k those messages are truncated, see #784

i think the increase in events is reasonable... we should perhaps consider which events are useful and which are not, and how many we fire from controllers?

event.Reason,
event.Message,
},
""), event.Message
Contributor:

odd indentation

// Event emits the normal type event.
func (r *upstreamRecorder) Event(reason, message string) {
defer r.incrementEventsCounter(corev1.EventTypeNormal)
if r.isShuttingDown() {
Contributor:

this races

mfojtik (Contributor Author):

races how?

// Warning emits the warning type event.
func (r *upstreamRecorder) Warning(reason, message string) {
defer r.incrementEventsCounter(corev1.EventTypeWarning)
if r.isShuttingDown() {
Contributor:

this races

Contributor:

shutdown can be called after isShuttingDown left the lock. We would lose events.

kubeInformers := informers.NewSharedInformerFactoryWithOptions(o.kubeClient, 10*time.Minute, informers.WithNamespace(o.Namespace))

eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), "cert-syncer",
eventRecorder := events.NewKubeRecorder(o.kubeClient.CoreV1().Events(o.Namespace), events.RecommendedClusterSingletonCorrelatorOptions(), "cert-syncer",
Contributor:

award for the longest identifier! 🏆

mfojtik (Contributor Author):

goes to @deads2k for suggesting it :)

Contributor:

it has some Java smell, I am not surprised :D

mfojtik force-pushed the events-shutdown branch 2 times, most recently from 1288255 to c04379f, on April 24, 2020 12:03

// NewKubeRecorder returns new event recorder.
func NewKubeRecorder(client corev1client.EventInterface, sourceComponentName string, involvedObjectRef *corev1.ObjectReference) Recorder {
func NewKubeRecorder(client corev1client.EventInterface, options record.CorrelatorOptions, sourceComponentName string, involvedObjectRef *corev1.ObjectReference) Recorder {
Contributor:

do we want to break this interface?

mfojtik (Contributor Author):

I can make NewKubeRecorderWithOptions ?
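A sketch of that non-breaking alternative (whether the merged PR kept the old signature or added the WithOptions variant, the shape would be roughly):

```go
// Sketch: keep the existing constructor and add a *WithOptions variant it delegates to.
package events

import (
	corev1 "k8s.io/api/core/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// NewKubeRecorder keeps its original signature and uses default correlator options.
func NewKubeRecorder(client corev1client.EventInterface, sourceComponentName string, involvedObjectRef *corev1.ObjectReference) Recorder {
	return NewKubeRecorderWithOptions(client, record.CorrelatorOptions{}, sourceComponentName, involvedObjectRef)
}

// NewKubeRecorderWithOptions is the tuning-aware entry point (body elided in this sketch).
func NewKubeRecorderWithOptions(client corev1client.EventInterface, options record.CorrelatorOptions, sourceComponentName string, involvedObjectRef *corev1.ObjectReference) Recorder {
	// ... construct the broadcaster-backed recorder with the given correlator options ...
	return nil
}
```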

// Event emits the normal type event.
func (r *upstreamRecorder) Event(reason, message string) {
r.shutdownMutex.Lock()
defer r.shutdownMutex.Unlock()
Contributor:

this is a long lock. Use a RW lock and lock here read-only, and write-lock in the shutdown call func.
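A minimal sketch of the suggested scheme (names assumed, not the merged code): emits take a read lock, shutdown takes the write lock, so an emit can't observe "not shutting down" and then race with the broadcaster being closed.

```go
// Sketch of the RW-lock pattern around the shutdown flag.
package events

import "sync"

type shutdownGuard struct {
	mu           sync.RWMutex
	shuttingDown bool
}

// emit stays cheap for concurrent writers: read-lock only.
func (g *shutdownGuard) emit(send, fallback func()) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	if g.shuttingDown {
		fallback() // write directly, bypassing the draining broadcaster
		return
	}
	send()
}

// shutdown flips the flag under the write lock, then drains the broadcaster.
func (g *shutdownGuard) shutdown(drain func()) {
	g.mu.Lock()
	g.shuttingDown = true
	g.mu.Unlock()
	drain() // safe: any later emit now goes through the fallback
}
```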

// This is needed if the binary is sending a lot of events.
// Using events.DefaultOperatorEventRecorderOptions here makes a good default for normal operator binary.
func (b *ControllerBuilder) WithEventRecorderOptions(options record.CorrelatorOptions) *ControllerBuilder {
b.eventRecorderOptions = options
Contributor:

I expected this to be set to the recommended options by default. Is that the case?
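A hedged usage sketch; only WithEventRecorderOptions and RecommendedClusterSingletonCorrelatorOptions come from this PR's diffs, and whether the builder applies the recommended options when this is never called is exactly the open question above.

```go
// Sketch: an operator that emits many events opts into the tuned correlator explicitly.
package operator

import (
	"github.com/openshift/library-go/pkg/controller/controllercmd"
	"github.com/openshift/library-go/pkg/operator/events"
)

func tuneEventRecorder(builder *controllercmd.ControllerBuilder) *controllercmd.ControllerBuilder {
	return builder.WithEventRecorderOptions(events.RecommendedClusterSingletonCorrelatorOptions())
}
```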


sttts (Contributor) commented Apr 27, 2020

/approve
/lgtm

openshift-ci-robot added the lgtm label ("Indicates that a PR is ready to be merged.") on Apr 27, 2020
openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mfojtik, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") on Apr 27, 2020
deads2k (Contributor) commented Apr 27, 2020

/hold cancel

openshift-ci-robot removed the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") on Apr 27, 2020
openshift-merge-robot merged commit 6d48516 into openshift:master on Apr 27, 2020
bertinatto pushed a commit to bertinatto/library-go that referenced this pull request Jul 2, 2020
events: make kube broadcaster shutdown gracefully and tune correlator so we don't loose events