Return error when leader lease is lost #1130

kate-osborn · 2023-10-11T18:10:49Z

Proposed changes

Problem: If NGF Pod loses connection to the API server and cannot renew the leader election lease, it stops leading and cannot become the leader again. After it stops leading, it will not report any statuses until it is restarted.

Solution: Update the leader elector Start method to return an error if the leader lease is lost. This will cause the controller-runtime manager to exit, and the NGF container will restart. The new container will then attempt to become the leader again. This aligns with how the controller-runtime library handles losing a leader lease.

Testing: Verified that if NGF cannot renew the leader lease, it retries until it reaches the renew deadline of 10s and then exits:

I1011 16:53:39.872956       8 leaderelection.go:250] attempting to acquire leader lease nginx-gateway/nginx-gateway-leader-election...
I1011 16:53:39.891998       8 leaderelection.go:260] successfully acquired lease nginx-gateway/nginx-gateway-leader-election
{"level":"info","ts":"2023-10-11T16:53:39Z","logger":"leaderElector","msg":"Started leading"}
E1011 16:54:36.257646       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:38.261217       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:40.262373       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:42.261366       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
E1011 16:54:44.261783       8 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
I1011 16:54:46.254506       8 leaderelection.go:285] failed to renew lease nginx-gateway/nginx-gateway-leader-election: timed out waiting for the condition
{"level":"info","ts":"2023-10-11T16:54:46Z","logger":"leaderElector","msg":"Stopped leading"}
{"level":"info","ts":"2023-10-11T16:54:46Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2023-10-11T16:54:46Z","msg":"Stopping and waiting for leader election runnables"}
Error: failed to start control loop: leader election lost
      --leader-election-disable            Disable leader election. Leader election is used to avoid multiple replicas of the NGINX Gateway Fabric reporting the status of the Gateway API resources. If disabled, all replicas of NGINX Gateway Fabric will update the statuses of the Gateway API resources.
      --leader-election-lock-name string   The name of the leader election lock. A Lease object with this name will be created in the same Namespace as the controller. (default "nginx-gateway-leader-election-lock")
failed to start control loop: leader election lost

Once the underlying error is resolved, which in this case is missing RBAC permissions, the new NGF Pod becomes the leader:

I1011 16:54:47.416192      67 leaderelection.go:250] attempting to acquire leader lease nginx-gateway/nginx-gateway-leader-election...
E1011 16:54:47.421393      67 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
{"level":"info","ts":"2023-10-11T16:54:47Z","logger":"statusUpdater","msg":"Skipping updating Gateway API status because not leader"}
{"level":"info","ts":"2023-10-11T16:54:47Z","logger":"statusUpdater","msg":"Skipping updating Nginx Gateway status because not leader"}
E1011 16:54:50.564170      67 leaderelection.go:332] error retrieving resource lock nginx-gateway/nginx-gateway-leader-election: leases.coordination.k8s.io "nginx-gateway-leader-election" is forbidden: User "system:serviceaccount:nginx-gateway:nginx-gateway" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "nginx-gateway"
I1011 16:54:54.367481      67 leaderelection.go:260] successfully acquired lease nginx-gateway/nginx-gateway-leader-election
{"level":"info","ts":"2023-10-11T16:54:54Z","logger":"leaderElector","msg":"Started leading"}

Please focus on (optional): Right now, we use the core client's default values for leader election:

// The duration that the acting master will retry refreshing leadership before giving up.
renewDeadline = 10 * time.Second
//  The duration that non-leader candidates will wait to force acquire leadership
leaseDuration = 15 * time.Second
// The duration the LeaderElector clients should wait between tries of actions.
retryPeriod   = 2 * time.Second

If NGF loses connectivity to the API server, it will try to renew the leader lease 5 times over a 10-second period.

We can increase the lease duration and renew deadline so NGF will try to renew the leader lease for a longer period of time and be able to recover from temporary connection issues. However, increasing these values means that it will take longer to recover after the leader is lost. A candidate Pod waits until the existing lease exceeds the lease duration, and then a new election is forced.

Closes #1100

Checklist

Before creating a PR, run through this checklist and mark each as complete.

I have read the CONTRIBUTING doc
I have added tests that prove my fix is effective or that my feature works
I have checked that all unit tests pass after adding my changes
I have updated necessary documentation
I have rebased my branch onto main
I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

pleshakov · 2023-10-11T18:29:47Z

This will cause the controller-runtime manager to exit, and the Pod will restart. The new Pod will then attempt to become the leader again.

Will NGINX container continue to run? or will it also be brought down?

If NGF loses connectivity to the API server, it will try to renew the leader lease 5 times over a 10-second period.

is it true for non-leader as well?

Let's imagine we have 5 NGF pods running and they temporarily loose connectivity to the API server (let's say it is undergoing an upgrade). Could that lead to a restart of all 5 pods? or only the leader?

kate-osborn · 2023-10-11T18:34:54Z

Will NGINX container continue to run? or will it also be brought down?

It continues to run.

is it true for non-leader as well?

No. The non-leaders continue to retry until they succeed or the context is canceled.

Let's imagine we have 5 NGF pods running and they temporarily loose connectivity to the API server (let's say it is undergoing an upgrade). Could that lead to a restart of all 5 pods? or only the leader?

I tested this and only the leader restarts.

kate-osborn requested a review from a team as a code owner October 11, 2023 18:10

github-actions bot added the bug Something isn't working label Oct 11, 2023

pleshakov approved these changes Oct 11, 2023

View reviewed changes

sjberman approved these changes Oct 11, 2023

View reviewed changes

bjee19 approved these changes Oct 11, 2023

View reviewed changes

kate-osborn force-pushed the fix/leader-election branch 2 times, most recently from b4b4f73 to 988c4ee Compare October 12, 2023 14:11

Return error when leader lease is lost

4053c22

kate-osborn force-pushed the fix/leader-election branch from 988c4ee to 4053c22 Compare October 12, 2023 15:29

kate-osborn merged commit 4f40fca into nginxinc:main Oct 12, 2023
22 checks passed

vepatel mentioned this pull request Oct 23, 2023

Once pod stops leading it cannot become the leader again nginxinc/kubernetes-ingress#4506

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return error when leader lease is lost #1130

Return error when leader lease is lost #1130

kate-osborn commented Oct 11, 2023 •

edited

Loading

pleshakov commented Oct 11, 2023

kate-osborn commented Oct 11, 2023

Return error when leader lease is lost #1130

Return error when leader lease is lost #1130

Conversation

kate-osborn commented Oct 11, 2023 • edited Loading

Proposed changes

Checklist

pleshakov commented Oct 11, 2023

kate-osborn commented Oct 11, 2023

kate-osborn commented Oct 11, 2023 •

edited

Loading