Revise leader election logic for endpoints controller #12021
Conversation
Force-pushed from 4bd7048 to 6de39c4
Our leader election logic can result in updates being missed under certain conditions. Leases expire after their duration is up, even if their current holder has been terminated. During this dead time, any changes in the system will be observed by other controllers, but will not be written to the API Server.

For example, during a rollout, a controller that has just come up will not be able to acquire the lease for up to 30 seconds (the lease duration). Within this time frame, any changes to the system (e.g. modified workloads, services, deleted endpointslices) will be observed but not acted on by the newly created controller. Once the controller gets into this bad state, it can only recover after 10 minutes (via service resyncs) or when a resource is modified.

To address this, we change our leader election mechanism. Instead of pushing leader election to the edge (i.e. when performing writes), we only allow events to be observed while a controller is leading (i.e. by registering callbacks). When a controller stops leading, all of its callbacks are de-registered.

NOTE:

* Controllers have a grace period during which they can renew their lease. Their callbacks are de-registered only if this fails, so for a single controller we will not be registering and de-registering callbacks very often.
* We do not lose any state. Other informers continue to run (e.g. destination readers). When callbacks are registered, we pass all of the cached objects through them; in other words, we do not issue API requests on registration, we process the state of the cluster as observed from the cache.
* We make another change that is slightly orthogonal: before we shut down, we drain the queue. This should not be a race, since we first block until the queue is drained and only then signal to the leader elector loop that we are done. This gives us some confidence that all observed events have been processed.

Signed-off-by: Matei David <[email protected]>
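To make the mechanism concrete, here is a minimal sketch using client-go's leaderelection package. The durations, the `handlerRegistrar` interface, and the error handling are illustrative assumptions; only the `addHandlers`/`removeHandlers` names come from the diff below.

```go
package endpoints

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// handlerRegistrar is an illustrative stand-in for the endpoints controller;
// only the two methods that appear in the diff are assumed here.
type handlerRegistrar interface {
	addHandlers() error
	removeHandlers() error
}

// runLeaderElection gates event observation on lease ownership: callbacks are
// registered only while leading and de-registered once the lease is lost.
func runLeaderElection(ctx context.Context, lock resourcelock.Interface, ec handlerRegistrar) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 30 * time.Second, // the "dead time" upper bound mentioned above
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Registration replays the cached objects through the new
				// handlers, so no extra API requests are issued here.
				_ = ec.addHandlers() // error handling elided in this sketch
			},
			OnStoppedLeading: func() {
				// Reached only once the renew grace period is exhausted.
				_ = ec.removeHandlers() // error handling elided in this sketch
			},
		},
	})
}
```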
Force-pushed from 6de39c4 to 879b1cf
// addHandlers will register a set of callbacks with the different informers
// needed to synchronise endpoint state.
func (ec *EndpointsController) addHandlers() error {
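For reference, a hedged sketch of what such a registration could look like against a client-go shared informer; the informer field, the enqueue placeholders, and the struct shape are assumptions, not the PR's code. The registration handle returned by `AddEventHandler` (client-go >= v0.26) is what later allows the handler to be removed.

```go
package endpoints

import "k8s.io/client-go/tools/cache"

// Sketch only: field names are illustrative stand-ins.
type EndpointsController struct {
	svcInformer cache.SharedIndexInformer
	svcHandle   cache.ResourceEventHandlerRegistration
}

// addHandlers registers callbacks with the service informer and keeps the
// returned registration handle so removeHandlers can de-register it later.
func (ec *EndpointsController) addHandlers() error {
	handle, err := ec.svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* enqueue the service */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue if relevant fields changed */ },
		DeleteFunc: func(obj interface{}) { /* enqueue for cleanup */ },
	})
	if err != nil {
		return err
	}
	ec.svcHandle = handle
	return nil
}
```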
There is going to be a problem here. Imagine the following sequence of events:
- The controller acquires the lease.
- It creates endpoints for a service and external workload.
- It drops the lease and removes its callbacks.
- Meanwhile, endpoint membership is modified by another controller.
- Our controller acquires the lease.
- We get an ADD for the service, external workload, etc.
- We try to do a reconciliation, but the state of the endpointslice tracker is stale because we never saw the DELETE.
- We try to requeue, but our state never recovers and we drop the update.
- We end up in an inconsistent state.
The problem here is that you are keeping state in the endpointslice tracker that might change under your feet. If you look at the upstream implementation, you will notice that we start with a new endpointslice tracker each time we acquire the lease.
We need to do that here as well. Keeping this state around will get us into trouble.
Great catch. The endpoint tracker does add a bit of complexity, and the scenario above is super compelling and likely to happen.
To fix, we wipe the tracker before callback registration happens. I've also added a test that exercises the scenario you described. I can get the test to fail without the aforementioned change.
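A small sketch of the ordering described in the fix, with illustrative stand-ins for the tracker and controller (the real tracker bookkeeping is in the PR, not reproduced here): the tracker is rebuilt before callbacks are registered, so the first reconciliation after acquiring the lease works from the informer cache rather than from state recorded under a previous lease.

```go
package endpoints

// Sketch only: endpointSliceTracker is an illustrative stand-in for the
// controller's tracker state, not the actual implementation.
type endpointSliceTracker struct {
	expectedByService map[string]int64 // e.g. expected slice generations keyed by service
}

func newEndpointSliceTracker() *endpointSliceTracker {
	return &endpointSliceTracker{expectedByService: make(map[string]int64)}
}

// trackedController is a minimal stand-in with just the pieces needed to show
// the ordering: reset the tracker first, then register callbacks.
type trackedController struct {
	tracker     *endpointSliceTracker
	addHandlers func() error
}

func (c *trackedController) onStartedLeading() error {
	// Wipe the tracker before any callbacks fire: slices created, modified,
	// or deleted by another leader while this controller was not observing
	// events must not be reconciled against stale expectations.
	c.tracker = newEndpointSliceTracker()
	// Registration replays cached objects, so reconciliation starts from the
	// informer cache rather than from state held under a previous lease.
	return c.addHandlers()
}
```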
Signed-off-by: Matei David <[email protected]>
Signed-off-by: Matei David <[email protected]>
LGTM, with a minor comment to avoid races
// removeHandlers will de-register callbacks
func (ec *EndpointsController) removeHandlers() error {
	var err error
Obtain the lock here. Since addHandlers and removeHandlers are called from the leader callbacks, we should assume there is no strict synchronization here.
ehh completely forgot to do it 🤦🏻 thx.
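Following up on the locking suggestion, a hedged sketch of de-registration guarded by a mutex; the fields are illustrative, and `RemoveEventHandler` is the client-go (>= v0.26) call that takes the handle returned at registration time. Since `addHandlers` and `removeHandlers` run from the leader-election callbacks, the lock is what provides their mutual exclusion.

```go
package endpoints

import (
	"sync"

	"k8s.io/client-go/tools/cache"
)

// Sketch only: a fragment showing the locking pattern; field names are
// illustrative stand-ins for the controller's own fields.
type handlerState struct {
	mu          sync.Mutex
	svcInformer cache.SharedIndexInformer
	svcHandle   cache.ResourceEventHandlerRegistration
}

// removeHandlers de-registers callbacks under the lock, since add/remove are
// invoked from leader callbacks with no other synchronization between them.
func (h *handlerState) removeHandlers() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.svcHandle == nil {
		return nil
	}
	if err := h.svcInformer.RemoveEventHandler(h.svcHandle); err != nil {
		return err
	}
	h.svcHandle = nil
	return nil
}
```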
// Drain the queue before signalling the lease to terminate
ec.queue.ShutDownWithDrain()
Good!
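For illustration, a brief sketch of the drain-before-terminate ordering; the `done` channel and its wiring to the leader-elector loop are assumptions, while `ShutDownWithDrain` is the workqueue call shown in the diff.

```go
package endpoints

import "k8s.io/client-go/util/workqueue"

// shutdown drains the queue before signalling the leader-elector loop, so
// every event observed while leading gets processed before the lease is
// released. Sketch only: the done-channel wiring is illustrative.
func shutdown(queue workqueue.RateLimitingInterface, done chan<- struct{}) {
	// Blocks until queued and in-flight items have been processed.
	queue.ShutDownWithDrain()
	// Only then tell the leader-elector loop that we are finished.
	close(done)
}
```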
// fail, since any changes made to the resources will not be observed while the
// lease is not held; these changes will result in stale cache entries (since
// the state diverged).
func TestLeaderElectionSyncsState(t *testing.T) {
Very happy you have that here!
Signed-off-by: Matei David <[email protected]>
Makes total sense!
This edge release contains performance and stability improvements to the Destination controller, and continues stabilizing support for ExternalWorkloads.

* Reduced the load on the Destination controller by only processing Server updates on workloads affected by the Server ([#12017])
* Changed how the Destination controller reacts to target clusters (in multicluster pod-to-pod mode) whose Server CRD is outdated: skip them and log an error instead of panicking ([#12008])
* Improved the leader election of the ExternalWorkloads Endpoints controller to avoid missing events ([#12021])
* Improved naming of EndpointSlices generated by ExternalWorkloads ([#12016])
* Restricted the number of IPs an ExternalWorkload can have ([#12026])