Design doc of 'Schedule DS Pod by default scheduler'. #1714
Conversation
That should be alpha in 1.10, as TaintNodeByCondition is still alpha right now. Any comments?
/cc @kow3ns
@@ -0,0 +1,71 @@
#Schedule DaemonSet Pods by default scheduler, not DaemonSet controller
there should be a space after the #
* **Q**: Will this change/constrain update strategies, such as scheduling an updated pod to a node before the previous pod is gone?

**A**: nop, this will NOT change update strategies.
nop typo
Before the discussion of solutions/options, there’s some requirements/questions on DaemonSet:

* **Q**: DaemonSet controller can make pods even the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network?
even "if"
* **Q**: DaemonSet controller can make pods even the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network?

**A**: This will be handled in Support scheduling tolerating workloads on NotReady Nodes ([#45717](https://github.com/kubernetes/kubernetes/issues/45717)); after moving to check node’s taint, the DaemonSet pods will tolerant `NetworkUnavailable` taint.
should say 'tolerate' , not 'tolerant'
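For reference, once the `NetworkUnavailable` node condition is surfaced as a taint (TaintNodesByCondition), the DaemonSet pod spec would carry a matching toleration. A minimal sketch, assuming the `node.kubernetes.io/network-unavailable` taint key is what gets applied for that condition:

```yaml
# Sketch: a toleration that lets DaemonSet Pods schedule onto nodes whose
# network is not yet available (taint key assumed from the
# NetworkUnavailable node condition under TaintNodesByCondition).
tolerations:
- key: node.kubernetes.io/network-unavailable
  operator: Exists
  effect: NoSchedule
```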
Currently, pods of DaemonSet are created and scheduled by DaemonSet controller:

1. DS controller filter nodes by nodeSelector and scheduler’s predicates
2. For each nodes, create a Pod for it by setting spec.hostName directly; it’ll skip default scheduler
should be 'node'
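To illustrate the current behaviour described above: the controller pins each Pod to its target node itself, so the default scheduler never sees it. A minimal sketch of such a pre-bound Pod; the names and image are placeholders, and in the Pod API the field is `spec.nodeName` (the draft writes `spec.hostName`):

```yaml
# Sketch: a DaemonSet Pod as the controller creates it today, with the node
# fixed up front via spec.nodeName so default scheduling is bypassed.
apiVersion: v1
kind: Pod
metadata:
  generateName: example-daemon-     # hypothetical name
spec:
  nodeName: dest_hostname           # placeholder node name from the doc's example
  containers:
  - name: daemon
    image: example/daemon:latest    # hypothetical image
```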
- dest_hostname
```

##Reference
need space after ##
##Reference

* [DaemonsetController can't feel it when node has more resources, e.g. other Pod exits](https://github.com/kubernetes/kubernetes/issues/46935)
* [DaemonsetController can't feel it when node recovered from outofdisk state](https://github.com/kubernetes/kubernetes/issues/46935)
wrong link, this is the same link as above; should be kubernetes/kubernetes#45628
Thanks, @k82cn and sorry for my late review.
* Hard to debug why DaemonSet’s Pod is not created, e.g. not enough resources; it’s better to have a pending Pods with predicates’ event
* Hard to support preemption in different components, e.g. DS and default scheduler

After [discussions](https://docs.google.com/document/d/1v7hsusMaeImQrOagktQb40ePbK6Jxp1hzgFB9OZa_ew/edit#), we come to a agreement that making DaemonSet to just produce pods like every other controller, and let them be scheduled by the regular scheduler, than to be its own scheduler.
s/a agreement/an agreement/
I would rephrase the sentence "we come to an agreement..." to:
SIG scheduling approved changing DaemonSet controller to create DaemonSet Pods and set their node-affinity and let them be scheduled by default scheduler. After this change, DaemonSet controller will no longer schedule DaemonSet Pods directly.
* **Q**: DaemonSet controller can make pods even the network of node is unavailable, e.g. CNI network providers (Calico, Flannel), Will this impact bootstrapping, such as in the case that a DaemonSet is being used to provide the pod network?

**A**: This will be handled in Support scheduling tolerating workloads on NotReady Nodes ([#45717](https://github.com/kubernetes/kubernetes/issues/45717)); after moving to check node’s taint, the DaemonSet pods will tolerant `NetworkUnavailable` taint.
s/in Support/by supporting/
also tolerant
should be tolerate
* **Q**: DaemonSet controller can make pods even when the scheduler has not been started, which can help cluster bootstrap.

**A**: As the scheduling logic is moved to default scheduler, the kube-scheduler is required after this proposal.
s/the kube-scheduler is required after this proposal/the kube-scheduler must be started during cluster start-up/
This option is to leverage NodeAffinity feature to avoid introducing scheduler’s predicates in DS controller:

1. DS controller filter nodes by nodeSelector, but did NOT check against scheduler’s predicates (e.g. PodFitHostResources)
s/did not/does not/
This option is to leverage NodeAffinity feature to avoid introducing scheduler’s predicates in DS controller:

1. DS controller filter nodes by nodeSelector, but did NOT check against scheduler’s predicates (e.g. PodFitHostResources)
2. For each nodes, DS controller creates a Pod for it with following NodeAffinity
s/each nodes/each node/
s/with following/with the following/
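Step 2 of the quoted flow refers to node affinity that pins each DaemonSet Pod to its target node. A minimal sketch of what that affinity could look like, reusing the `dest_hostname` placeholder from the design doc and assuming the node is matched via the well-known `kubernetes.io/hostname` label (the doc may settle on a different key):

```yaml
# Sketch: required node affinity restricting the Pod to one specific node,
# so the default scheduler can only place it there.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - dest_hostname   # placeholder: the target node's hostname label value
```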
1. DS controller filter nodes by nodeSelector, but did NOT check against scheduler’s predicates (e.g. PodFitHostResources)
2. For each nodes, DS controller creates a Pod for it with following NodeAffinity
3. When sync Pods, DS controller will map nodes and pods by this NodeAffinity to check whether Pods are started for nodes
4. In scheduler, the Daemon pods will keep pending if predicates failed, e.g. PodFitHostResources; for critical daemons, DS controller will create Pods with critical pods annotation and leverage scheduler/kubelet’s logic to handle it; similar practice to [priority/preemption](https://github.com/kubernetes/features/issues/268)
As a part of enabling priority and preemption, we must ensure that all critical DaemonSet Pods get an appropriate critical priority. If they have critical priority, scheduler will ensure that they will be scheduled even when the cluster is under resource pressure. Scheduler preempts other Pods in such condition to schedule critical Pods.
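As a concrete illustration of that point: with priority/preemption, criticality would be expressed through a PriorityClass on the DaemonSet's pod template rather than the critical-pod annotation. A minimal sketch, assuming the built-in `system-node-critical` class:

```yaml
# Sketch: giving DaemonSet Pods a critical priority so the scheduler can
# preempt lower-priority Pods for them when the cluster is under pressure.
spec:
  template:
    spec:
      priorityClassName: system-node-critical
```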
1. DS controller filter nodes by nodeSelector, but does NOT check against scheduler’s predicates (e.g. PodFitHostResources)
2. For each node, DS controller creates a Pod for it with the following NodeAffinity
3. When sync Pods, DS controller will map nodes and pods by this NodeAffinity to check whether Pods are started for nodes
4. In scheduler, the Daemon pods will keep pending if predicates failed, e.g. PodFitHostResources; for critical daemons,
s/the Daemon pods will keep pending if predicates failed, e.g. PodFitHostResources; for critical daemons,/Daemon Pods will stay pending if scheduling predicates fail. To avoid this,/
done
Signed-off-by: Da K. Ma <[email protected]>
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, k82cn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these OWNERS Files:
Approvers can indicate their approval by writing `/approve` in a comment.
Design doc of 'Schedule DS Pod by default scheduler'.
xref kubernetes/enhancements#548
/cc @bgrant0607 , @bsalamat , @kubernetes/sig-apps-feature-requests