
Revise leader election logic for endpoints controller #12021

Merged
merged 4 commits into main from matei/leader-elect-fix on Feb 1, 2024

Conversation

Member
@mateiidavid commented Jan 31, 2024

Revise leader election logic for endpoints controller

Our leader election logic can result in updates being missed under certain
conditions. Leases expire after their duration is up, even if their current
holder has been terminated. During this dead time, any changes in the
system will be observed by other controllers, but will not be written to
the API Server.

For example, during a rollout, a controller that has come up will not be
able to acquire the lease for a maximum time of 30 seconds (lease
duration). Within this time frame, any changes to the system (e.g. modified
workloads, services, deleted endpointslices) will be observed but not acted
on by the newly created controller. Once the controller gets into a bad
state, it can only recover after 10 minutes (via service resyncs) or if any
resources are modified.

To address this, we change our leader election mechanism. Instead of
pushing leader election to the edge (i.e. when performing writes) we only
allow events to be observed when a controller is leading (i.e. by
registering callbacks). When a controller stops leading, all of its
callbacks will be de-registered.

NOTE:

  • controllers will have a grace period during which they can renew their
    lease; their callbacks will be de-registered only if this fails, so we
    will not be registering and de-registering callbacks frequently for a
    single controller.
  • we do not lose out on any state. Other informers will continue to run
    (e.g. destination readers). When callbacks are registered, we pass all of
    the cached objects through them. In other words, we do not issue API
    requests on registration, we process the state of the cluster as observed
    from the cache.
  • we make another change that's slightly orthogonal. Before we shut down,
    we drain the queue. This should not race, since we first block until the
    queue is drained and only then signal to the leader elector loop that we
    are done. This gives us confidence that all observed events have been
    processed before shutdown.

Signed-off-by: Matei David [email protected]
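
For illustration, here is a minimal sketch (in Go, using client-go's leaderelection package) of the callback-gated approach described above. The addHandlers/removeHandlers names are the controller methods touched by this PR; the lease name, namespace, renew deadline, and retry period are assumptions made for the sketch, with only the 30-second lease duration taken from the description.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	hostname, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "endpoints-controller-write", // hypothetical lease name
			Namespace: "linkerd",                    // hypothetical namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 30 * time.Second, // the "maximum of 30 seconds" dead time above
		RenewDeadline: 10 * time.Second, // grace period to renew before leadership is lost
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			// Only register informer callbacks while the lease is held; shared
			// informers keep their caches warm regardless, so registration
			// replays cached state instead of issuing fresh API requests.
			OnStartedLeading: func(ctx context.Context) {
				// ec.addHandlers()
			},
			// De-register callbacks once renewal fails past the grace period.
			OnStoppedLeading: func() {
				// ec.removeHandlers()
			},
		},
	})
}
```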

@mateiidavid mateiidavid requested a review from a team as a code owner January 31, 2024 17:20
@mateiidavid mateiidavid marked this pull request as draft January 31, 2024 17:20
@mateiidavid mateiidavid force-pushed the matei/leader-elect-fix branch 2 times, most recently from 4bd7048 to 6de39c4 on January 31, 2024 17:23
@mateiidavid mateiidavid changed the title WIP: Revise leader election for endpoints controller Revise leader election logic for endpoints controller Feb 1, 2024
@mateiidavid mateiidavid marked this pull request as ready for review February 1, 2024 11:40

// addHandlers will register a set of callbacks with the different informers
// needed to synchronise endpoint state.
func (ec *EndpointsController) addHandlers() error {
Member

There is going to be a problem here. Imagine the following sequence of events:

  1. Controller acquires the lease.
  2. Creates endpoints for a service and external workload.
  3. Drops the lease and removes its callbacks.
  4. Meanwhile, endpoints membership is modified by another controller.
  5. Our controller acquires the lease again.
  6. We get an ADD for the service, external workload, etc.
  7. We try to reconcile, but the state of the endpoints tracker is stale because we never saw the DELETE.
  8. We try to requeue, but our state never recovers and we drop the update.
  9. We end up in an inconsistent state.

The problem here is that you are keeping state in the endpointslice tracker that might change under your feet. If you look at the upstream implementation, you will notice that it starts with a new endpointslice tracker each time it acquires the lease.

We need to do that here as well. Keeping this state around will get us into trouble.

Member Author

Great catch. The endpoint tracker does add a bit of complexity, and the scenario above is compelling and likely to happen.

To fix this, we wipe the tracker before callback registration happens. I've also added a test that exercises the scenario you described; I can get the test to fail without the aforementioned change.
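
For illustration, a minimal sketch of this fix, assuming a hypothetical tracker type and field names (the real controller keeps richer state); the essential part is that addHandlers swaps in a fresh tracker under the controller lock before any callbacks are (re-)registered.

```go
package endpoints

import "sync"

// endpointSliceTracker is an illustrative stand-in for the real tracker: it
// records the EndpointSlice generations this controller last wrote.
type endpointSliceTracker struct {
	expected map[string]int64 // slice name -> last generation written by us
}

func newEndpointSliceTracker() *endpointSliceTracker {
	return &endpointSliceTracker{expected: make(map[string]int64)}
}

// EndpointsController fields shown here are hypothetical.
type EndpointsController struct {
	sync.Mutex
	tracker *endpointSliceTracker
}

// addHandlers registers informer callbacks once the lease is acquired.
func (ec *EndpointsController) addHandlers() error {
	ec.Lock()
	defer ec.Unlock()

	// Wipe the tracker before (re-)registering callbacks: while this
	// controller was not leading, another leader may have created, modified,
	// or deleted EndpointSlices, so anything recorded before the lease was
	// lost is potentially stale.
	ec.tracker = newEndpointSliceTracker()

	// ... register Service / EndpointSlice / ExternalWorkload callbacks ...
	return nil
}
```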

Member
@zaharidichev left a comment

LGTM, with a minor comment to avoid races


// removeHandlers will de-register callbacks
func (ec *EndpointsController) removeHandlers() error {
var err error
Member

Obtain the lock here. Since addHandlers and removeHandlers are called from the leader election callbacks, we should assume there is no strict synchronization between them.

Member Author

ehh completely forgot to do it 🤦🏻 thx.
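
For illustration, a sketch of the suggested locking applied to removeHandlers. The informer and registration fields are assumptions; the de-registration itself relies on client-go's RemoveEventHandler, which takes the handle returned earlier by AddEventHandler.

```go
package endpoints

import (
	"sync"

	"k8s.io/client-go/tools/cache"
)

// EndpointsController fields shown here are hypothetical; the real controller
// keeps one registration per informer it subscribes to.
type EndpointsController struct {
	sync.Mutex
	informer      cache.SharedIndexInformer
	registrations []cache.ResourceEventHandlerRegistration
}

// removeHandlers de-registers all callbacks. It takes the controller lock
// because it is invoked from the leader elector's OnStoppedLeading callback,
// which is not otherwise synchronized with addHandlers.
func (ec *EndpointsController) removeHandlers() error {
	ec.Lock()
	defer ec.Unlock()

	var err error
	for _, reg := range ec.registrations {
		if e := ec.informer.RemoveEventHandler(reg); e != nil {
			err = e
		}
	}
	ec.registrations = nil
	return err
}
```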

Comment on lines +270 to +271
// Drain the queue before signalling the lease to terminate
ec.queue.ShutDownWithDrain()
Member

Good!
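
For illustration, a sketch of this shutdown ordering. The Stop method and the stop channel are assumptions; ShutDownWithDrain is the workqueue call shown in the diff above, and it blocks until queued and in-flight items have been processed.

```go
package endpoints

import "k8s.io/client-go/util/workqueue"

// Fields shown here are hypothetical stand-ins for the controller's members.
type EndpointsController struct {
	queue workqueue.RateLimitingInterface
	stop  chan struct{} // read by the leader election loop
}

// Stop drains the work queue before signalling the lease to terminate, so
// every event observed while leading is fully processed before the lease is
// given up.
func (ec *EndpointsController) Stop() {
	// Block until queued and in-flight items have been handled by the workers.
	ec.queue.ShutDownWithDrain()
	// Only then let the leader elector loop return and release the lease.
	close(ec.stop)
}
```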

// fail, since any changes made to the resources will not be observed while the
// lease is not held; these changes will result in stale cache entries (since
// the state diverged).
func TestLeaderElectionSyncsState(t *testing.T) {
Member

Very happy you have that here!

Signed-off-by: Matei David <[email protected]>
Member
@alpeb left a comment

Makes total sense!

@alpeb alpeb merged commit d4f99b3 into main Feb 1, 2024
33 checks passed
@alpeb alpeb deleted the matei/leader-elect-fix branch February 1, 2024 22:46
alpeb added a commit that referenced this pull request Feb 1, 2024
This edge release contains performance and stability improvements to the
Destination controller, and continues stabilizing support for ExternalWorkloads.

* Reduced the load on the Destination controller by only processing Server
  updates on workloads affected by the Server ([#12017])
* Changed how the Destination controller reacts to target clusters (in
  multicluster pod-to-pod mode) whose Server CRD is outdated: skip them and log
  an error instead of panicking ([#12008])
* Improved the leader election of the ExternalWorkloads Endpoints controller to
  avoid missing events ([#12021])
* Improved naming of EndpointSlices generated by ExternalWorkloads ([#12016])
alpeb added a commit that referenced this pull request Feb 2, 2024
This edge release contains performance and stability improvements to the
Destination controller, and continues stabilizing support for ExternalWorkloads.

* Reduced the load on the Destination controller by only processing Server
  updates on workloads affected by the Server ([#12017])
* Changed how the Destination controller reacts to target clusters (in
  multicluster pod-to-pod mode) whose Server CRD is outdated: skip them and log
  an error instead of panicking ([#12008])
* Improved the leader election of the ExternalWorkloads Endpoints controller to
  avoid missing events ([#12021])
* Improved naming of EndpointSlices generated by ExternalWorkloads ([#12016])
* Restricted the number of IPs an ExternalWorkload can have ([#12026])