Scheduler should terminate on loosing leader lock #81306

ravisantoshgudimetla · 2019-08-12T19:17:58Z

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind bug

What this PR does / why we need it:
As of now, we're not terminating scheduler on loosing leader lock causing the scheduler to continue watching pod, node objects and allowing the pod to be scheduled. While this may not cause problem in single scheduler topology, this will certainly cause problems if we have HA enabled for control plane components. This PR ensures that scheduler terminates instead of silently ignoring the loss of leader lock and proceeding with scheduling. Without this patch, we'd notice that in HA deployment, schedulers would race to schedule pods(with or without leader lock) causing multiple bind failures as the individual scheduler will have a stale cache.

Thanks to @gnufied for identifying the issue.

/cc @bsalamat @Huang-Wei @ahg-g @damemi @ingvagabund
Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Scheduler should terminate when it looses leader lock.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

ravisantoshgudimetla · 2019-08-12T19:18:16Z

/sig scheduling
/priority important

ravisantoshgudimetla · 2019-08-12T19:20:40Z

/priority important-soon

Huang-Wei · 2019-08-12T19:44:03Z

cmd/kube-scheduler/app/server.go

@@ -262,7 +261,7 @@ func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, regis
 		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
 			OnStartedLeading: run,
 			OnStoppedLeading: func() {
-				utilruntime.HandleError(fmt.Errorf("lost master"))
+				klog.Fatalf("leaderelection lost")


Looks good. Now we're consistent with our neighbours:

cmd/kube-controller-manager/app/controllermanager.go 281: OnStoppedLeading: func() { 282- klog.Fatalf("leaderelection lost") cmd/kube-scheduler/app/server.go 264: OnStoppedLeading: func() { 265- utilruntime.HandleError(fmt.Errorf("lost master")) cmd/cloud-controller-manager/app/controllermanager.go 209: OnStoppedLeading: func() { 210- klog.Fatalf("leaderelection lost")

bsalamat

/lgtm
/approve

Thanks, @ravisantoshgudimetla for fixing this important bug. We must backport this to older versions (1.13+) of K8s.

k8s-ci-robot · 2019-08-12T22:44:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, ravisantoshgudimetla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-scheduler/OWNERS~~ [bsalamat,ravisantoshgudimetla]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Huang-Wei · 2019-08-13T00:31:49Z

/retest

gaorong · 2019-08-13T02:02:43Z

/hold
(if I have permission)

gaorong · 2019-08-13T02:54:22Z

Sorry to bother you, could you please provide some real cases and how to reproduce this issue?
I have tested this and scheduler can successfully exit.
step as below:

start two schedulers, say scheduler1 and scheduler2.
the scheduler1 comes as the leader.
break the connection between scheduler1 and apiserver using iptables: iptables -I INPUT -p tcp --dport 8083 -j DROP.
scheduler1 can not schedule pod because of broken connection and scheduler2 become as the leader successfully.
remove iptables from step 3 and connection recover. then scheduler1 finds himself has lost lease and EXITs successfully.
node agent supervisor(e.g. systemd) restart scheduler1.

logs as below: (ignore the unsupported 1.10 version, this snippet hasn't changed since then)

I0813 09:57:56.190610    3114 server.go:562] Version: v1.10.11
I0813 09:57:56.191312    3114 server.go:581] starting healthz server on 0.0.0.0:10251
I0813 09:57:57.091874    3114 controller_utils.go:1027] Waiting for caches to sync for scheduler controller
I0813 09:57:57.191994    3114 controller_utils.go:1034] Caches are synced for scheduler controller
I0813 09:57:57.192019    3114 leaderelection.go:175] attempting to acquire leader lease  kube-system/kube-scheduler...
I0813 09:58:15.926175    3114 leaderelection.go:184] successfully acquired lease kube-system/kube-scheduler
I0813 09:59:10.889224    3114 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0813 09:59:10.889265    3114 server.go:619] lost master
error: lost lease

It seems all work fine.
IIUC, If scheduler loses lease, it will return from leaderElector.Run(ctx) and then invoke return fmt.Errorf("lost lease") to exit.

kubernetes/cmd/kube-scheduler/app/server.go

Lines 261 to 275 in 890b50f

    
           if cc.LeaderElection != nil { 
        
           	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{ 
        
           		OnStartedLeading: run, 
        
           		OnStoppedLeading: func() { 
        
           			utilruntime.HandleError(fmt.Errorf("lost master")) 
        
           		}, 
        
           	} 
        
           	leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection) 
        
           	if err != nil { 
        
           		return fmt.Errorf("couldn't create leader elector: %v", err) 
        
           	} 
        
           	leaderElector.Run(ctx) 
        
           	return fmt.Errorf("lost lease")

if we want to be consistent with our neighbors, we can merge this but it targets as cleanup (not bugfix, so no need to backport).

ravisantoshgudimetla · 2019-08-13T03:10:42Z

I0813 09:59:10.889224 3114 leaderelection.go:203] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition

It depends on how long you waited before the leader looses lock, in my case the scheduler was able to communicate with apiserver before it timed out waiting for condition, the scenario I am talking about is a network condition where we would loose connectivity with apiserver for 30 seconds and then scheduler is able to communicate again, it's not a permanent network failure which would cause the above scenario you mentioned.

gaorong · 2019-08-13T03:29:30Z

It depends on how long you waited before the leader looses lock

I think no matters how long the scheduler wait, once the connection recover, the scheduler will detect himself has lost lock and exit.

ravisantoshgudimetla · 2019-08-13T21:57:34Z

@gaorong - Thanks for testing the above scenario. What type of resourceLocks are you using? configmaps or endpoints? Can you simulate the same test with kube 1.12+ and see if you're able to reproduce this?

It's interesting that the scheduler is terminating for you, I looked more in 1.10 code base and I think leader election code is different.

Following are the logs from v1.14.0+0faddd8

I0810 18:54:37.287283       1 leaderelection.go:263] failed to renew lease openshift-kube-scheduler/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded
E0810 18:54:37.287313       1 server.go:248] lost master
E0810 18:54:38.278507       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E0810 18:54:38.278654       1 reflector.go:126] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:234: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E0810 18:54:38.278787       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
I0810 18:55:22.386588       1 scheduler.go:417] Failed to bind pod: test-insights/insights-operator-9868bfd69-kklx8
E0810 18:55:22.386635       1 factory.go:660] Error scheduling test-insights/insights-operator-9868bfd69-kklx8: Operation cannot be fulfilled on pods/binding "insights-operator-9868bfd69-kklx8": pod insights-operator-9868bfd69-kklx8 is already assigned to node "test-hlb8q-master-0"; retrying

You can clearly see that after lost master, we're not exiting from scheduler and continuing with scheduling.

What's happening?

The code path is slightly different in leader election code, scheduler in your case is failing at lock acquistion state while renewing and the channel is closing right after that here - https://github.com/kubernetes/kubernetes/blob/release-1.10/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go#L162

We moved to contexts from channels in #57932 1.12+ and I think this line

kubernetes/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go

Line 265 in f0c11ee

}, le.config.RetryPeriod, ctx.Done())

, won't always cancel the parent context causing explicit exit in the component that calls the leader election code.

gaorong · 2019-08-14T02:54:03Z

What type of resourceLocks are you using? configmaps or endpoints?

the default: endpoint

Can you simulate the same test with kube 1.12+ and see if you're able to reproduce this?

Yes, scheduler can still terminate as before.
I run v1.15.0 scheduler as a foreground process with minimal configuration.

logs as below:

[root@k8s-test-master01 ]#  /usr/bin/kube-scheduler  --v=0 --master=http://127.0.0.1:8080
I0814 10:17:08.141925   11894 serving.go:319] Generated self-signed cert in-memory
W0814 10:17:08.313407   11894 authentication.go:249] No authentication-kubeconfig provided in order to lookup client-ca-file in configmap/extension-apiserver-authentication in kube-system, so client certificate authentication won't work.
W0814 10:17:08.313424   11894 authentication.go:252] No authentication-kubeconfig provided in order to lookup requestheader-client-ca-file in configmap/extension-apiserver-authentication in kube-system, so request-header client certificate authentication won't work.
W0814 10:17:08.313439   11894 authorization.go:146] No authorization-kubeconfig provided, so SubjectAccessReview of authorization tokens won't work.
I0814 10:17:08.314082   11894 server.go:142] Version: v1.15.0
I0814 10:17:08.314117   11894 defaults.go:87] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W0814 10:17:08.315223   11894 authorization.go:47] Authorization is disabled
W0814 10:17:08.315234   11894 authentication.go:55] Authentication is disabled
I0814 10:17:08.315244   11894 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I0814 10:17:08.315639   11894 secure_serving.go:116] Serving securely on [::]:10259
I0814 10:17:09.218792   11894 leaderelection.go:235] attempting to acquire leader lease  kube-system/kube-scheduler...
I0814 10:17:24.964249   11894 leaderelection.go:245] successfully acquired lease kube-system/kube-scheduler
E0814 10:19:09.392164   11894 leaderelection.go:324] error retrieving resource lock kube-system/kube-scheduler: Get http://127.0.0.1:8080/api/v1/namespaces/kube-system/endpoints/kube-scheduler?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0814 10:19:09.392245   11894 event.go:296] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'k8s-test-master01_bc8bdc92-a9d8-423d-9177-e48bc60b3f80 stopped leading'
I0814 10:19:09.392515   11894 leaderelection.go:281] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0814 10:19:09.392538   11894 server.go:254] lost master
lost lease
[root@k8s-test-master01 ]#

it's weird to have different behavior. I think leader-election should be compatible with previous behavior (in v1.10).
Are there some special configurations for you?

ravisantoshgudimetla · 2019-08-14T03:39:06Z

I run v1.15.0 scheduler as a foreground process with minimal configuration.

Do you have connectivity to apiserver from scheduler, looking at the logs, it seems apiserver is not running. The scenario is specific to situation where the scheduler acquired lock and releasing it back.

Are there some special configurations for you?

I am using configmaps instead of endpoints/

gaorong · 2019-08-14T06:59:34Z

Do you have connectivity to apiserver from scheduler, looking at the logs, it seems apiserver is not running

apiserver is listening on port 8080, I drop all tcp packages sending to apiserver by iptables: iptables -I INPUT -p tcp --dport 8080 -j DROP. maybe this is equivalent to apiserver is down. as shown in the scheduler flag in above example, the scheduler is connecting to apiserver by localhost, so when breaking one of apiservers' connectivity, other schedulers can still successfully connect to apiserver within the same host and acquire the lock.

The scenario is specific to situation where the scheduler acquired lock and releasing it back.

How can we simulate this scenario?

gaorong · 2019-08-14T07:20:08Z

The scenario is specific to situation where the scheduler acquired lock and releasing it back.

Do you set ReleaseOnCancel to true by changing scheduler code? the scheduler would not release lock proactively.

kubernetes/staging/src/k8s.io/client-go/tools/leaderelection/leaderelection.go

Lines 148 to 153 in a520302

    
           // ReleaseOnCancel should be set true if the lock should be released 
        
           // when the run context is cancelled. If you set this to true, you must 
        
           // ensure all code guarded by this lease has successfully completed 
        
           // prior to cancelling the context, or you may have two processes 
        
           // simultaneously acting on the critical path. 
        
           ReleaseOnCancel bool

yqwang-ms · 2019-08-14T08:03:32Z

@ravisantoshgudimetla I am not familiar with kube-scheduler in HA mode, but I have a few small questions, could you please help:

Shouldn't the previous leader be fenced when the new leader is selected?

Seems the previous leader just exit as soon as possible, still cannot totally avoid your race condition that multiple scheduler are scheduling pods at the same time, right?
Such as, the binding request from the previous leader is being sending to ApiServer, exit the process may not be able to cancel the sending (at least when the request is already sent to network).

So, can I consider this PR is still best effort to avoid the race condition above, but it largerly reduces the chance, right?
BTW, seems kube-scheduler do not need to totally avoid it (by fencing etc)?

ravisantoshgudimetla · 2019-08-14T16:24:43Z

so when breaking one of apiservers' connectivity, other schedulers can still successfully connect to apiserver within the same host and acquire the lock.

How're the other schedulers connecting to apiserver? Are they running on the same host? If you block inbound traffic on port 8080, they'd also loose connectivity, isn't it?

How can we simulate this scenario?

I am currently using a 3 node control plane cluster, where each scheduler on every node connects to apiserver running locally. As of now, I am bouncing apiserver one node at a time to simulate this scenario

So, can I consider this PR is still best effort to avoid the race condition above, but it largerly reduces the chance, right?
BTW, seems kube-scheduler do not need to totally avoid it (by fencing etc)?

Good question, while I am not certain that scenario can happen in the above situation, there is a very good chance that apiserver received the request just before scheduler exited, the other scheduler which gained the lock won't send the request again because it'll have to wait for the informer caches to sync initially(actually even before acquiring the lock). The informers would talk to apiserver and get the latest state of pod before building the local cache and start scheduling. The main problem in the situation that I mentioned here is scheduler without lock is continuously sending requests to apiserver for binding or trying to reschedule pods that are already scheduled by the right scheduler (the one which has leader lock).

But in general, kubernetes is based on eventual consistency, where failure can happen sometimes because of various reasons but system should correct itself after some time.

gaorong · 2019-08-15T00:30:19Z

Are they running on the same host?

No, there aren't. my environment is same as you: each scheduler on every node connects to apiserver running locally.

k82cn · 2019-08-15T01:45:39Z

@gaorong , @ravisantoshgudimetla , is there any impact if we terminate scheduler in this case, similar to other component? IMO, terminate the scheduler seems a safer way to handle such kind of case :)

gaorong · 2019-08-15T03:03:55Z

@k82cn Agree, The main concern here is if this is a confirmed bug-fix or just a cleanup. could we backport a cleanup?

anyway, agree with merging this first
/hold cancel

yqwang-ms · 2019-08-15T03:38:37Z

@ravisantoshgudimetla Thanks for your detailed explaination. Totally agree we should be "eventual consistency", and add one more principle, the "eventual" duration should be as shorter as possible. :)

fejta-bot · 2019-08-15T05:46:05Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

fejta-bot · 2019-08-15T11:22:06Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

fejta-bot · 2019-08-15T14:31:06Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

bsalamat · 2019-08-15T16:48:52Z

This change is fine and we can merge it, but I agree with @gaorong that we don't need to back port it if it is a clean-up.

alculquicondor · 2019-08-15T17:07:08Z

I would consider this a bug and +1 on backporting

ravisantoshgudimetla added 2 commits August 9, 2019 15:23

Scheduler should terminate when it looses leader lock

e2b29cd

Autogenerated files

3a1354d

k8s-ci-robot requested review from ahg-g, bsalamat, damemi and Huang-Wei August 12, 2019 19:17

k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Aug 12, 2019

k8s-ci-robot requested a review from ingvagabund August 12, 2019 19:17

k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 12, 2019

ravisantoshgudimetla changed the title ~~Terminate on leader election~~ Scheduler should terminate on loosing leader lock Aug 12, 2019

k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Aug 12, 2019

Huang-Wei reviewed Aug 12, 2019

View reviewed changes

k8s-ci-robot assigned bsalamat Aug 12, 2019

bsalamat approved these changes Aug 12, 2019

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 12, 2019

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2019

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 15, 2019

k8s-ci-robot merged commit de3aed8 into kubernetes:master Aug 15, 2019

k8s-ci-robot added this to the v1.16 milestone Aug 15, 2019

superseb mentioned this pull request Sep 12, 2019

Unhealthy controller manager and scheduler after leaving it running overnight rancher/rancher#14036

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler should terminate on loosing leader lock #81306

Scheduler should terminate on loosing leader lock #81306

ravisantoshgudimetla commented Aug 12, 2019

ravisantoshgudimetla commented Aug 12, 2019

ravisantoshgudimetla commented Aug 12, 2019

Huang-Wei Aug 12, 2019

bsalamat left a comment

k8s-ci-robot commented Aug 12, 2019

Huang-Wei commented Aug 13, 2019

gaorong commented Aug 13, 2019

gaorong commented Aug 13, 2019 •

edited

Loading

ravisantoshgudimetla commented Aug 13, 2019 •

edited

Loading

gaorong commented Aug 13, 2019

ravisantoshgudimetla commented Aug 13, 2019 •

edited

Loading

gaorong commented Aug 14, 2019

ravisantoshgudimetla commented Aug 14, 2019

gaorong commented Aug 14, 2019

gaorong commented Aug 14, 2019

yqwang-ms commented Aug 14, 2019 •

edited

Loading

ravisantoshgudimetla commented Aug 14, 2019

gaorong commented Aug 15, 2019 •

edited

Loading

k82cn commented Aug 15, 2019

gaorong commented Aug 15, 2019 •

edited

Loading

yqwang-ms commented Aug 15, 2019

fejta-bot commented Aug 15, 2019

fejta-bot commented Aug 15, 2019

fejta-bot commented Aug 15, 2019

bsalamat commented Aug 15, 2019

alculquicondor commented Aug 15, 2019

Scheduler should terminate on loosing leader lock #81306

Scheduler should terminate on loosing leader lock #81306

Conversation

ravisantoshgudimetla commented Aug 12, 2019

ravisantoshgudimetla commented Aug 12, 2019

ravisantoshgudimetla commented Aug 12, 2019

Huang-Wei Aug 12, 2019

Choose a reason for hiding this comment

bsalamat left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Aug 12, 2019

Huang-Wei commented Aug 13, 2019

gaorong commented Aug 13, 2019

gaorong commented Aug 13, 2019 • edited Loading

ravisantoshgudimetla commented Aug 13, 2019 • edited Loading

gaorong commented Aug 13, 2019

ravisantoshgudimetla commented Aug 13, 2019 • edited Loading

gaorong commented Aug 14, 2019

ravisantoshgudimetla commented Aug 14, 2019

gaorong commented Aug 14, 2019

gaorong commented Aug 14, 2019

yqwang-ms commented Aug 14, 2019 • edited Loading

ravisantoshgudimetla commented Aug 14, 2019

gaorong commented Aug 15, 2019 • edited Loading

k82cn commented Aug 15, 2019

gaorong commented Aug 15, 2019 • edited Loading

yqwang-ms commented Aug 15, 2019

fejta-bot commented Aug 15, 2019

fejta-bot commented Aug 15, 2019

fejta-bot commented Aug 15, 2019

bsalamat commented Aug 15, 2019

alculquicondor commented Aug 15, 2019

gaorong commented Aug 13, 2019 •

edited

Loading

ravisantoshgudimetla commented Aug 13, 2019 •

edited

Loading

ravisantoshgudimetla commented Aug 13, 2019 •

edited

Loading

yqwang-ms commented Aug 14, 2019 •

edited

Loading

gaorong commented Aug 15, 2019 •

edited

Loading

gaorong commented Aug 15, 2019 •

edited

Loading