Return error when leader lease is lost #1130
Merged
Proposed changes
Problem: If the NGF Pod loses its connection to the API server and cannot renew the leader election lease, it stops leading and cannot become the leader again. After it stops leading, it does not report any statuses until it is restarted.
Solution: Update the leader elector Start method to return an error if the leader lease is lost. This will cause the controller-runtime manager to exit, and the NGF container will restart. The new container will then attempt to become the leader again. This aligns with how the controller-runtime library handles losing a leader lease.
Testing: Verified that if NGF cannot renew the leader lease, it retries until it reaches the renew deadline of 10s and then exits:
Once the underlying error is resolved, which in this case is missing RBAC permissions, the new NGF Pod becomes the leader:
Please focus on (optional): Right now, we use the core client's default values for leader election:
If NGF loses connectivity to the API server, it will try to renew the leader lease 5 times over a 10-second period.
We can increase the lease duration and renew deadline so that NGF keeps trying to renew the leader lease for longer and can recover from temporary connection issues. However, increasing these values also means that it takes longer to recover after the leader is lost: a candidate Pod waits until the existing lease has gone unrenewed for longer than the lease duration before it forces a new election.
Closes #1100
Checklist
Before creating a PR, run through this checklist and mark each as complete.