
Revise leader election logic for endpoints controller #12021

Merged
merged 4 commits into main from matei/leader-elect-fix on Feb 1, 2024

Conversation

Member
@mateiidavid commented Jan 31, 2024

Revise leader election logic for endpoints controller

Our leader election logic can result in updates being missed under certain
conditions. Leases expire after their duration is up, even if their current
holder has been terminated. During this dead time, any changes in the
system will be observed by other controllers, but will not be written to
the API Server.

For example, during a rollout, a controller that has come up will not be
able to acquire the lease for a maximum time of 30 seconds (lease
duration). Within this time frame, any changes to the system (e.g. modified
workloads, services, deleted endpointslices) will be observed but not acted
on by the newly created controller. Once the controller gets into a bad
state, it can only recover after 10 minutes (via service resyncs) or if any
resources are modified.

To address this, we change our leader election mechanism. Instead of
pushing leader election to the edge (i.e. when performing writes) we only
allow events to be observed when a controller is leading (i.e. by
registering callbacks). When a controller stops leading, all of its
callbacks will be de-registered.

NOTE:

  • controllers will have a grace period during which they can renew their
    lease; their callbacks will be de-registered only if this fails, so we
    will not be registering and de-registering callbacks frequently for a
    single controller.
  • we do not lose out on any state. Other informers will continue to run
    (e.g. destination readers). When callbacks are registered, we pass all of
    the cached objects through them. In other words, we do not issue API
    requests on registration, we process the state of the cluster as observed
    from the cache.
  • we make another change that's slightly orthogonal. Before we shut down,
    we drain the queue. This should not race, since we first block until the
    queue is drained and only then signal to the leader elector loop that we
    are done. This gives us confidence that all observed events have been
    processed before shutdown.

Signed-off-by: Matei David [email protected]
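
For illustration, here is a minimal sketch (in Go, using client-go's leaderelection package) of the callback-gated approach described above. The addHandlers/removeHandlers names are the controller methods touched by this PR; the lease name, namespace, renew deadline, and retry period are assumptions made for the sketch, with only the 30-second lease duration taken from the description.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	hostname, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "endpoints-controller-write", // hypothetical lease name
			Namespace: "linkerd",                    // hypothetical namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: hostname},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 30 * time.Second, // the "maximum of 30 seconds" dead time above
		RenewDeadline: 10 * time.Second, // grace period to renew before leadership is lost
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			// Only register informer callbacks while the lease is held; shared
			// informers keep their caches warm regardless, so registration
			// replays cached state instead of issuing fresh API requests.
			OnStartedLeading: func(ctx context.Context) {
				// ec.addHandlers()
			},
			// De-register callbacks once renewal fails past the grace period.
			OnStoppedLeading: func() {
				// ec.removeHandlers()
			},
		},
	})
}
```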

@mateiidavid mateiidavid requested a review from a team as a code owner January 31, 2024 17:20
@mateiidavid mateiidavid marked this pull request as draft January 31, 2024 17:20
@mateiidavid mateiidavid force-pushed the matei/leader-elect-fix branch 2 times, most recently from 4bd7048 to 6de39c4 on January 31, 2024 17:23
@mateiidavid mateiidavid changed the title WIP: Revise leader election for endpoints controller Revise leader election logic for endpoints controller Feb 1, 2024
@mateiidavid mateiidavid marked this pull request as ready for review February 1, 2024 11:40

// addHandlers will register a set of callbacks with the different informers
// needed to synchronise endpoint state.
func (ec *EndpointsController) addHandlers() error {
Member

There is going to be a problem here. Imagine the following sequence of events:

  1. Controller acquires the lease.
  2. Creates endpoints for a service and external workload.
  3. Drops the lease and removes its callbacks.
  4. Meanwhile, endpoints membership is modified by another controller.
  5. Our controller acquires the lease again.
  6. We get an ADD for the service, external workload, etc.
  7. We try to reconcile, but the state of the endpoints tracker is stale because we never saw the DELETE.
  8. We try to requeue, but our state never recovers and we drop the update.
  9. We end up in an inconsistent state.

The problem here is that you are keeping state in the endpointslice tracker that might change under your feet. If you look at the upstream implementation, you will notice that it starts with a new endpointslice tracker each time it acquires the lease.

We need to do that here as well. Keeping this state around will get us into trouble.

Member Author

Great catch. The endpoint tracker does add a bit of complexity, and the scenario above is compelling and likely to happen.

To fix this, we wipe the tracker before callback registration happens. I've also added a test that exercises the scenario you described; I can get the test to fail without the aforementioned change.
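
For illustration, a minimal sketch of this fix, assuming a hypothetical tracker type and field names (the real controller keeps richer state); the essential part is that addHandlers swaps in a fresh tracker under the controller lock before any callbacks are (re-)registered.

```go
package endpoints

import "sync"

// endpointSliceTracker is an illustrative stand-in for the real tracker: it
// records the EndpointSlice generations this controller last wrote.
type endpointSliceTracker struct {
	expected map[string]int64 // slice name -> last generation written by us
}

func newEndpointSliceTracker() *endpointSliceTracker {
	return &endpointSliceTracker{expected: make(map[string]int64)}
}

// EndpointsController fields shown here are hypothetical.
type EndpointsController struct {
	sync.Mutex
	tracker *endpointSliceTracker
}

// addHandlers registers informer callbacks once the lease is acquired.
func (ec *EndpointsController) addHandlers() error {
	ec.Lock()
	defer ec.Unlock()

	// Wipe the tracker before (re-)registering callbacks: while this
	// controller was not leading, another leader may have created, modified,
	// or deleted EndpointSlices, so anything recorded before the lease was
	// lost is potentially stale.
	ec.tracker = newEndpointSliceTracker()

	// ... register Service / EndpointSlice / ExternalWorkload callbacks ...
	return nil
}
```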

Member
@zaharidichev left a comment

LGTM, with a minor comment to avoid races


// removeHandlers will de-register callbacks
func (ec *EndpointsController) removeHandlers() error {
var err error
Member

Obtain the lock here. Since addHandlers and removeHandlers are called from the leader election callbacks, we should assume there is no strict synchronization between them.

Member Author

ehh completely forgot to do it 🤦🏻 thx.
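
For illustration, a sketch of the suggested locking applied to removeHandlers. The informer and registration fields are assumptions; the de-registration itself relies on client-go's RemoveEventHandler, which takes the handle returned earlier by AddEventHandler.

```go
package endpoints

import (
	"sync"

	"k8s.io/client-go/tools/cache"
)

// EndpointsController fields shown here are hypothetical; the real controller
// keeps one registration per informer it subscribes to.
type EndpointsController struct {
	sync.Mutex
	informer      cache.SharedIndexInformer
	registrations []cache.ResourceEventHandlerRegistration
}

// removeHandlers de-registers all callbacks. It takes the controller lock
// because it is invoked from the leader elector's OnStoppedLeading callback,
// which is not otherwise synchronized with addHandlers.
func (ec *EndpointsController) removeHandlers() error {
	ec.Lock()
	defer ec.Unlock()

	var err error
	for _, reg := range ec.registrations {
		if e := ec.informer.RemoveEventHandler(reg); e != nil {
			err = e
		}
	}
	ec.registrations = nil
	return err
}
```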

Comment on lines +270 to +271
// Drain the queue before signalling the lease to terminate
ec.queue.ShutDownWithDrain()
Member

Good!
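
For illustration, a sketch of this shutdown ordering. The Stop method and the stop channel are assumptions; ShutDownWithDrain is the workqueue call shown in the diff above, and it blocks until queued and in-flight items have been processed.

```go
package endpoints

import "k8s.io/client-go/util/workqueue"

// Fields shown here are hypothetical stand-ins for the controller's members.
type EndpointsController struct {
	queue workqueue.RateLimitingInterface
	stop  chan struct{} // read by the leader election loop
}

// Stop drains the work queue before signalling the lease to terminate, so
// every event observed while leading is fully processed before the lease is
// given up.
func (ec *EndpointsController) Stop() {
	// Block until queued and in-flight items have been handled by the workers.
	ec.queue.ShutDownWithDrain()
	// Only then let the leader elector loop return and release the lease.
	close(ec.stop)
}
```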

// fail, since any changes made to the resources will not be observed while the
// lease is not held; these changes will result in stale cache entries (since
// the state diverged).
func TestLeaderElectionSyncsState(t *testing.T) {
Member

Very happy you have that here!

Signed-off-by: Matei David <[email protected]>
Member
@alpeb left a comment

Makes total sense!

@alpeb alpeb merged commit d4f99b3 into main Feb 1, 2024
33 checks passed
@alpeb alpeb deleted the matei/leader-elect-fix branch February 1, 2024 22:46
alpeb added a commit that referenced this pull request Feb 1, 2024
This edge release contains performance and stability improvements to the
Destination controller, and continues stabilizing support for ExternalWorkloads.

* Reduced the load on the Destination controller by only processing Server
  updates on workloads affected by the Server ([#12017])
* Changed how the Destination controller reacts to target clusters (in
  multicluster pod-to-pod mode) whose Server CRD is outdated: skip them and log
  an error instead of panicking ([#12008])
* Improved the leader election of the ExternalWorkloads Endpoints controller to
  avoid missing events ([#12021])
* Improved naming of EndpointSlices generated by ExternalWorkloads ([#12016])
alpeb added a commit that referenced this pull request Feb 2, 2024
This edge release contains performance and stability improvements to the
Destination controller, and continues stabilizing support for ExternalWorkloads.

* Reduced the load on the Destination controller by only processing Server
  updates on workloads affected by the Server ([#12017])
* Changed how the Destination controller reacts to target clusters (in
  multicluster pod-to-pod mode) whose Server CRD is outdated: skip them and log
  an error instead of panicking ([#12008])
* Improved the leader election of the ExternalWorkloads Endpoints controller to
  avoid missing events ([#12021])
* Improved naming of EndpointSlices generated by ExternalWorkloads ([#12016])
* Restricted the number of IPs an ExternalWorkload can have ([#12026])