Add AntiAffinity to kube-dns. #2693 #2705

Closed

Conversation

jamesbucher

@jamesbucher jamesbucher commented Jun 11, 2017

Fixes: #2693

This is not quite ready as I have the following left to do:

  • Testing (waiting on DNS registration stuff so I can use kops from my home machine).

  • Determine whether "preferredDuringSchedulingIgnoredDuringExecution" is the correct setting. The other settings would prevent the pod from scheduling on the same node and, I think, mark the pod as unschedulable. Since DNS is pretty critical to a cluster, it is probably better to have the scheduler attempt to schedule the pod so the cluster stays resilient against DNS pod crashes. A sketch of this setting is shown below.
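For illustration, a soft anti-affinity term on the kube-dns pod template would look roughly like the sketch below (the weight and label selector are assumptions for this example, not the final change):

```yaml
# Rough sketch only: soft anti-affinity so the scheduler tries to keep kube-dns
# replicas on different nodes but can still co-locate them under pressure.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100  # weight value is an assumption for this example
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: k8s-app
                operator: In
                values: ["kube-dns"]
          topologyKey: kubernetes.io/hostname
```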



@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 11, 2017
@k8s-ci-robot
Contributor

Hi @jamesbucher. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 11, 2017
@jamesbucher
Author

I have verified that a clean deployment works for both Kubernetes 1.6 and 1.5 by deploying to AWS and checking kube-scheduler.log for errors on the master node. I also made a few typo fixes and rebased everything to a single commit.

@chrislovecnm
Contributor

@k8s-bot ok to test

How have you tested? Which k8s versions?

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 12, 2017
@jamesbucher
Author

I tested for Kubernetes 1.6.0 by doing the following with a custom build of kops (with the above changes):

  1. Create a new cluster using:
    ./kops create cluster --node-count 3 --zones=us-west-2a,us-west-2b,us-west-2c --state s3://jamesbucher-kops/ kubernetes.jamesbucher.info
  2. Edit the Cluster to the appropriate version of kubernetes (to downgrade to 1.5.3)
    ./kops edit cluster --state s3://jamesbucher-kops/ kubernetes.jamesbucher.info
  3. Run the actual cluster creation with:
    ./kops update cluster --state s3://jamesbucher-kops/ kubernetes.jamesbucher.info --yes
  4. Wait for the cluster to come online with:
    ./kops validate cluster --state s3://jamesbucher-kops/ kubernetes.jamesbucher.info
  5. When the cluster is online do the following:
    1. Check the DNS pod deployment to ensure that the annotation and antiAffinity are set (both for 1.6.2 and just the annotation for 1.5.3):
      kubectl --namespace kube-system get deployment kube-dns -o yaml
    2. Check the 2 DNS pods to ensure that they are not scheduled on the same node.
    3. SSH into a master node and check the kube-scheduler logs for errors.
    4. Verify that the pods are not on the same node using: kubectl --namespace kube-system get pods -l k8s-app=kube-dns -o=jsonpath="{.items[*].spec.nodeName}"
    5. Kill the DNS pods using kubectl --namespace kube-system delete pods -l k8s-app=kube-dns
    6. Repeat steps 4 & 5 several times to ensure pods don't get scheduled on the same nodes. I noticed two things here. First, pods in their stable state (i.e. no terminating pods) were not scheduled on the same nodes. Second, during termination, when a new pod was spawned it could end up on the same node. I don't think the second item is an issue.

I am not sure of a better way to test this; the scheduler behavior is a bit out of my hands, but it appears to be working correctly. There is at least API validation for 1.6.2, as it checks the YAML validity.

I attempted to test this on Kubernetes 1.5.3, but the annotation was not there on the deployment. I checked the pre-1.6 yaml file in S3 and it was there... I don't really understand why it's not there in the actual deployment. Is this file picked up from somewhere else? The file https://s3-us-west-2.amazonaws.com/jamesbucher-kops/kubernetes.jamesbucher.info/addons/kube-dns.addons.k8s.io/pre-k8s-1.6.yaml from the deployment contains the annotation correctly, but the actual deployment does not for 1.5.3.

To be clear, I don't think this works for Kubernetes 1.5.*. It does, however, work for 1.6.0.

I will do some investigating tomorrow night to see why 1.5.* doesn't seem to be working. I am not sure why the 1.5 deployment didn't include the annotation and am open to ideas as to why that might be.

@kenden
Contributor

kenden commented Jun 13, 2017

We should tag this as important. If the node where the kube-dns pods are running goes down...

@jordanjennings
Contributor

jordanjennings commented Jun 13, 2017

Just to throw in my 2 cents: how about using requiredDuringSchedulingIgnoredDuringExecution instead? Using "preferred" means that it's still possible to end up in a situation where a single node failure takes out DNS. Using "required" would cause the scheduling of the additional kube-dns pods to wait until suitable nodes were available.

For example, I have a cluster on AWS that uses spot instances for the nodes. Yesterday I lost all the nodes during a crazy spot price spike. The kube-dns pods could easily end up on the same node in this situation with only soft anti-affinity rules. Maybe spot instances aren't the best example, but I fear that momentary cluster instability, for whatever reason, could leave the DNS pods together, and then we're back to the single point of failure scenario we're trying to avoid with this change.
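For comparison with the soft variant sketched earlier, a hard (required) rule would look roughly like this (the selector and topology key are assumptions for the sketch):

```yaml
# Rough sketch only: hard anti-affinity, so an extra kube-dns pod stays Pending
# rather than landing on a node that already runs one.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: k8s-app
              operator: In
              values: ["kube-dns"]
        topologyKey: kubernetes.io/hostname
```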

@jamesbucher
Author

@jordanjennings: Yeah, I had the same thought about requiredDuringSchedulingIgnoredDuringExecution or even requiredDuringSchedulingRequiredDuringExecution. My original thinking was that it would be better to have the scheduler schedule both pods (even if they end up on the same node), in case one crashes. However, since I didn't see a scheduler option to do a sweep and move pods once the pressure is resolved, this opens things up to losing DNS when a node goes down.

I think the spot price hike example is a good one. It would be better to have the scheduler favor anti-affinity over being able to schedule the pod, so that you never end up in a state where DNS is only on one node. The caveat is that clusters under pressure may end up with DNS not being scheduled. I will change this to either requiredDuringSchedulingIgnoredDuringExecution or requiredDuringSchedulingRequiredDuringExecution. To me requiredDuringSchedulingRequiredDuringExecution makes the most sense, but we may want to wait for a PodDisruptionBudget before allowing for eviction.

When https://github.com/kubernetes/community/blob/master/contributors/design-proposals/rescheduling-for-critical-pods.md comes along, we should probably add it to the DNS pods, as having DNS go down is very bad.

@jordanjennings
Contributor

@jamesbucher Yes, I think requiredDuringSchedulingRequiredDuringExecution would be ideal, but to my understanding that option hasn't been implemented yet.

@jamesbucher
Author

@jordanjennings OK, I will take a look at the check-ins and such to see if I can determine whether requiredDuringSchedulingRequiredDuringExecution is implemented yet. If it is, I will use that for Kubernetes 1.6 at the least, and 1.5 if it supports it. Otherwise I will fall back to requiredDuringSchedulingIgnoredDuringExecution. First and foremost, though, I need to figure out why my 1.5 cluster didn't seem to have the annotation.

@jamesbucher
Author

jamesbucher commented Jun 14, 2017

OK, I figured out what was wrong with the annotation in 1.5 (it was on the wrong deployment). I also changed the annotation to requiredDuringSchedulingRequiredDuringExecution and was able to validate that this works in Kubernetes 1.5.3 and 1.6.0 by creating a cluster with 3 nodes (which creates 2 DNS pods) and then scaling down to one node. After scaling down, the second DNS pod gets stuck in a Pending state because it can't be scheduled on the master node and the remaining node already has a DNS pod. Scaling back up results in the expected behavior of the new pod getting scheduled.

I have left this as a separate commit to make it easier to see the delta between this and the previous commit. I will rebase before final merge if desired.

Future work here should probably include zone antiAffinity as well, to ensure pods are spread across zones for added resiliency.
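As a rough illustration of that future work, zone spreading could reuse the same kind of term with a zone topology key (the zone label key below is an assumption for this sketch, taken from the labels commonly used at the time):

```yaml
# Rough sketch only: prefer spreading kube-dns across availability zones
# in addition to nodes. The zone label key is an assumption for this example.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchLabels:
              k8s-app: kube-dns
          topologyKey: failure-domain.beta.kubernetes.io/zone
```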

@chrislovecnm
Contributor

Can we use preferred instead of required? I am worried that we could run into load issues as well. If a lot of nodes are at capacity, we may not be able to schedule. Then, if DNS is busy, we may not be able to scale.

@jordanjennings
Contributor

jordanjennings commented Jun 14, 2017

@chrislovecnm these are already marked as critical pods, doesn't that mean in a high load situation the scheduler will evict other things to make sure these are running? Or does that just mean once running they won't get evicted?

@chrislovecnm
Contributor

It is the rescheduling that I am concerned about. Required forces k8s to never have two DNS pods on the same node. Preferred will allow it, but try hard not to co-locate the pods. For instance, with required, a cluster of one node will not have more than one DNS pod, which is not good.

@jordanjennings
Contributor

jordanjennings commented Jun 14, 2017

@chrislovecnm What are the "rules" for which cluster components can run on a master? If kube-dns can run on master then even in a single node scenario there would be resilience to single pod failure.

If it's not OK to run DNS on the master, is it safe to say resilience for DNS matters more on a multi-node cluster than a single-node cluster? Here's what I assume:

  • True single node clusters are rare and couldn't be expected to have great uptime anyway
  • Most single node clusters are temporary, due to a failure or maintenance scenario
  • DNS pods are less likely to fail than nodes (this could be totally wrong, but anecdotally I have seen more EC2 instances fail than I have seen stable pods fail)

Given those assumptions (even ignoring the last one), it makes more sense to me to require dns pods to be on separate nodes than it does to allow them to stack on a single node. Thoughts?

@jamesbucher
Author

@jordanjennings - Under the current placement constraints the DNS pods will NOT run on the master. I think the rescheduler that would force pods to be rescheduled is still somewhat up in the air. There is a priority/preemption proposal that fits this need; reading the comments, #268 subsumes Prioritized Scheduling of System Pods.

@chrislovecnm - I think that required is actually the better choice. Since there is no rescheduling enforcement, if 2 DNS pods end up on the same node they may not get moved for a long time. This opens the cluster up to having DNS go down with a node.

In my mind it breaks down into the following categories:

  1. Cluster is not at capacity: DNS should always be scheduled on different nodes. - Required is ideal here.
  2. Cluster is a single node: DNS would ideally be 2 pods, but doing so opens us up to the potential bad states in 1 and 3. Given that the cluster is not highly available (by nature of being 1 node), I think this is a moot point.
  3. Cluster is many nodes but is under extremely high load and DNS cannot be scheduled on 2 nodes. This boils down to 2 scenarios on cluster recovery (either lessened load or increased capacity (ex: autoscaler)) given different policies:
    1. In the case of Preferred at scheduling time, this could result in 2 DNS pods getting scheduled on the same node IF there is space. On recovery these pods are not guaranteed to be moved, opening the cluster to an extended period where, if a node goes down, the cluster could lose DNS. This has the potential advantage that it is more resilient to DNS process crashes (assuming both pods don't hit the same bug).
    2. In the case of Required at scheduling time, the scheduler will NOT schedule two DNS pods on the same node. Because of this, secondary DNS pods will not be scheduled in the case of cluster space exhaustion. However, when the cluster recovers, DNS pods will get scheduled on new nodes. This prevents DNS from staying vulnerable for a long time after resource exhaustion. Ideally DNS should be prioritized over other pods to resolve this issue; Priority/Preemption solves it.

In my opinion, a cluster at 100% capacity is already likely to fail to schedule DNS pods without some sort of pod priority. Without some metric of priority, the scheduler may decide to schedule a non-DNS pod instead of the second (or even first) DNS pod in the event of capacity exhaustion.

I would also like to take the autoscaler into account for this feature. Since the autoscaler will grow or shrink a cluster, I think it is preferable to have the DNS pod not schedule, for two reasons:

  • To indicate to the scheduler that more capacity is needed. The autoscaler will scale up for unschedulable pods which is the behavior we want.
  • To ensure that, on scale up, DNS pods end up spread across nodes rather than stuck on a single node.

Priority/Preemption + requiredDuringSchedulingRequiredDuringExecution gives us exactly what we want longer term, and I think that should be the approach taken. A rough sketch of what that combination might look like is below.
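Roughly, that combination might eventually look like the sketch below. requiredDuringSchedulingRequiredDuringExecution is not implemented, so only the scheduling-time half is shown, and priorityClassName had not landed at the time of this thread, so it is an assumption here (critical pods were then marked with the scheduler.alpha.kubernetes.io/critical-pod annotation):

```yaml
# Rough sketch only: hard scheduling-time anti-affinity plus a priority class so
# DNS could preempt less important pods when capacity is tight.
# priorityClassName is an assumption for this example.
spec:
  priorityClassName: system-cluster-critical
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              k8s-app: kube-dns
          topologyKey: kubernetes.io/hostname
```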

@chrislovecnm
Contributor

Reached out to @justinsb so we can get his input

@justinsb
Member

Asked in sig-scheduling on slack, cc @kubernetes/sig-scheduling-misc

@kenden
Contributor

kenden commented Jun 16, 2017

What about running kube-dns as a daemonset on master nodes, wouldn't that be simpler?
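Roughly, such a DaemonSet could look like the sketch below; the master label, toleration, and image tag are assumptions for this example, and the full kube-dns container spec (dnsmasq, sidecar, probes) is elided:

```yaml
# Rough sketch only: run kube-dns as a DaemonSet pinned to master nodes.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: ""   # master label is an assumption
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: kubedns
          image: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.4  # tag is an assumption
```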

@chrislovecnm
Contributor

@kenden daemonsets are sorta not HA, I think the best pattern is to run this as a deployment.

@davidopp
Member

FYI requiredDuringSchedulingRequiredDuringExecution isn't implemented yet and there is no ETA. requiredDuringSchedulingIgnoredDuringExecution is the closest.

@justinsb What is the question?

@justinsb
Member

@davidopp thanks for chiming in. Is requiredDuringSchedulingIgnoredDuringExecution the right way to avoid multiple kube-dns pods getting scheduled to the same node during a rolling update and then getting stuck there? (It's particularly bad on smaller clusters, where you often end up with 2 kube-dns pods on the same node.) Or should we be looking into the rescheduler or similar?

@kenden
Contributor

kenden commented Jun 20, 2017

Thanks @chrislovecnm. I see how kube-dns as a daemonset might not be HA enough with only one master. Only one master is not HA anyway. But otherwise?
If the pod dies, the daemonset creates a new one. We still have another master with kube-dns.
If a master dies, the instancegroup (autoscaling group in AWS) creates a new one. We still have another master with kube-dns.

If daemonsets do not work, it would be nice to ensure kube-dns pods are in different AZs, if a cluster spans several AZs.

@justinsb
Member

justinsb commented Jul 1, 2017

I think we should merge for kops 1.7. I think required is the correct one because we don't have a rescheduler. The concern over preferred in a small cluster is certainly valid, but I would expect that if we didn't have many nodes we also wouldn't have as much DNS load. And I think required will surface the problem more clearly - the deployment will be at < desired.

I do think that (separately) we should also offer a daemonset option for kube-dns, particularly as I think support for same-node routing is there or almost there. But for a deployment, this feels like the right change until we have a rescheduler.

@justinsb
Member

justinsb commented Jul 1, 2017

@chrislovecnm are you therefore OK to merge?

@chrislovecnm
Contributor

I am extremely nervous about having this as required. Maybe we merge this and add an option to override? Also, what happens to validate cluster? What happens during a rolling update?

@davidopp
Member

davidopp commented Jul 5, 2017

Sorry for the delay in responding.

Most of the things people have said in this thread are correct, so I'm not sure there's much for me to add. requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity among DNS pods will keep a DNS pod pending if the only node(s) where it can schedule are already running a DNS pod. So if you have two nodes, and DNS is in a two-pod ReplicaSet, and you scale down to one node, then one DNS pod will stay pending until you scale back up such that it can find a node that is not already running a DNS pod. preferredDuringSchedulingIgnoredDuringExecution has the problem mentioned earlier that it might allow two DNS replicas to schedule onto the same node, and we don't have any mechanism today that will spread the pods if/when you add more nodes. (The rescheduler will do this, eventually.)

I think the way to look at it, given the constraints of how the system works today, is

  • if you need multiple DNS pods primarily for availability, then requiredDuringSchedulingIgnoredDuringExecution makes sense because putting more than one DNS pod on the same node isn't useful
  • if you need multiple DNS pods primarily for performance, then preferredDuringSchedulingIgnoredDuringExecution makes sense because it will allow the DNS pods to schedule even if they can't be spread across nodes

One other thing to keep in mind in general, is that all of the priority functions (including the one that implements preferredDuringScheduling) are mixed together to make the final decision of which node is "best." We currently give every factor the same weight, which means the results, while deterministic, can be hard to predict. For example, we have a priority function that prefers to put a container on a node that has the packages needed by that container already installed; this will tend to operate counter to a desire to spread pods of the same set. I'm definitely open to suggestions for tuning the weights of the default priority functions, but today they're all equal. (In practice I think if you set preferredDuringScheduling pod anti-affinity it's very likely the pods will be spread anyway, since there is also a built-in priority function that spreads pods from the same set, so in essence the spreading is being applied twice and the package thing only once.)

@justinsb
Member

justinsb commented Jul 6, 2017

So this is great, but it's clear that there are tradeoffs. In kops we try to run as close as possible to the tested kubernetes configuration, which is here, and DNS in particular has caused us a lot of problems with even small deviations.

Can I ask that you PR this change to https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/kubedns-controller.yaml.base first, and then we'll pull it into kops? I'd suggest linking this issue as we've had some great discussion here and we just want to make sure that we can leverage all the e2e testing done!

@jamesbucher
Author

I will file a PR on https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/kubedns-controller.yaml.base and link to this discussion to try and get this in there.

@chrislovecnm
Contributor

Removing blocks-next as we recommended that this gets addressed upstream first.

@chrislovecnm
Contributor

Marking as WIP, and please link the upstream issue / PR here

@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jamesbucher
We suggest the following additional approver: kris-nova

Assign the PR to them by writing /assign @kris-nova in a comment when ready.

Associated issue: 2693

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@blakebarnett

blakebarnett commented Sep 8, 2017

If we did have a daemonset-on-masters option, wouldn't that negate the kube-dns-autoscaler? What about larger clusters, where you'll never have enough masters to handle the load?

@StevenACoffman
Contributor

Hi, I did as requested and made a pull request against the upstream files to revive interest, but I probably won't be able to respond to change requests in a meaningful way. If @jamesbucher or anyone wants to run with it, please take it.

@blakebarnett

Should we close this in favor of the upstream PR then?

@justinsb justinsb added this to the backlog milestone Sep 25, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Oct 16, 2017
Automatic merge from submit-queue (batch tested with PRs 53106, 52193, 51250, 52449, 53861). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

kube-dns-anti-affinity: kube-dns never-co-located-in-the-same-node

**What this PR does / why we need it**:

This is upstreaming the kubernetes/kops#2705 pull request by @jamesbucher that was originally against [kops](github.com/kubernetes/kops).
Please see kubernetes/kops#2705 for more details, including a lengthy discussion.

Briefly, given the constraints of how the system works today:

+ if you need multiple DNS pods primarily for availability, then requiredDuringSchedulingIgnoredDuringExecution makes sense because putting more than one DNS pod on the same node isn't useful
+ if you need multiple DNS pods primarily for performance, then preferredDuringSchedulingIgnoredDuringExecution makes sense because it will allow the DNS pods to schedule even if they can't be spread across nodes

**Which issue this PR fixes**

fixes kubernetes/kops#2693

**Release note**:


```release-note
Improve resilience by annotating kube-dns addon with podAntiAffinity to prefer scheduling on different nodes.
```
@StevenACoffman
Contributor

So I got this merged upstream (with some changes). What needs to happen here?

@chrislovecnm
Contributor

@StevenACoffman if @jamesbucher does not respond, can you rebase on his commit and push a new PR? He will still get credit for the PR, but may not have time to get it in.

@justinsb justinsb added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed WIP: Work In Progress labels Nov 4, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 15, 2018
@johanneswuerbach
Contributor

johanneswuerbach commented Jan 15, 2018

Looks like the proposal got reverted upstream because anti-affinity was causing performance issues, but there is another proposal now that the perf issues have been resolved: kubernetes/kubernetes#57683

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 14, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.