
Make sure we use correct CoreDNS image tags in our upgrade tests #6143

Closed
3 of 4 tasks
sbueringer opened this issue Feb 16, 2022 · 23 comments · Fixed by #6339
Labels: area/testing, good first issue, help wanted, kind/bug
Milestone: v1.2

Comments

@sbueringer (Member) commented Feb 16, 2022

Currently we have a mixture of CoreDNS tags with and without the v prefix in our Kubernetes upgrade test.
We are also not verifying that CoreDNS actually comes up; we only wait until the Deployment has the new image tag.

Some context:

  • It's about the CoreDNS tag we're upgrading to, which is configured via COREDNS_VERSION_UPGRADE_TO
  • There was a migration from k8s.gcr.io/coredns to k8s.gcr.io/coredns/coredns and the following images are available in GCR:
    • k8s.gcr.io/coredns <= 1.6.5 / 1.6.6 / 1.6.7 / 1.7.0
    • k8s.gcr.io/coredns/coredns v1.6.6 / v1.6.7 / v1.6.9 / v1.7.0 / v1.7.1 / v1.8.0 / v1.8.3 / v1.8.4 / v1.8.5 / v1.8.6
    • gcloud container images list-tags k8s.gcr.io/coredns & gcloud container images list-tags k8s.gcr.io/coredns/coredns
  • In KCP we automatically switch to the new image repository starting with CoreDNS v1.8.0

So I would suggest that we use the v prefix for CoreDNS tags >= v1.8.0.
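For illustration, here is a minimal Go sketch of the repository/tag split described above (not the actual KCP code; coreDNSImage is a hypothetical helper and the blang/semver dependency is an assumption):

```go
package main

import (
	"fmt"

	"github.com/blang/semver" // assumption: the semver library used for the version comparison
)

// coreDNSImage is a hypothetical helper mirroring the split described above:
// tags below 1.8.0 live in k8s.gcr.io/coredns without a "v" prefix, while
// v1.8.0 and newer live in k8s.gcr.io/coredns/coredns with a "v" prefix.
func coreDNSImage(tag string) (string, error) {
	v, err := semver.ParseTolerant(tag) // accepts both "1.8.6" and "v1.8.6"
	if err != nil {
		return "", err
	}
	if v.GE(semver.MustParse("1.8.0")) {
		return fmt.Sprintf("k8s.gcr.io/coredns/coredns:v%s", v.String()), nil
	}
	return fmt.Sprintf("k8s.gcr.io/coredns:%s", v.String()), nil
}

func main() {
	for _, tag := range []string{"1.7.0", "v1.8.6"} {
		image, _ := coreDNSImage(tag)
		fmt.Println(image) // k8s.gcr.io/coredns:1.7.0, then k8s.gcr.io/coredns/coredns:v1.8.6
	}
}
```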

Tasks:

  • Change COREDNS_VERSION_UPGRADE_TO in test/e2e/config/docker.yaml: 1.8.4 => v1.8.6 (we should use the default CoreDNS version of Kubernetes 1.23)
  • Check periodic and presubmit Kubernetes upgrade jobs in test-infra
  • Improve the Kubernetes upgrade test to verify that the CoreDNS Deployment is rolled out correctly (see the sketch after this list)
  • I think we should do something similar for the kube-proxy Daemonset in WaitForKubeProxyUpgrade
    • Let's discuss this separately and try it in a separate PR.
    • I think this could break upgrade tests which are using CI images. IIRC when we are injecting a Kubernetes CI version in CAPA/CAPO we download a specific kube-proxy version from GCS which is not available in any registry. I.e. a new kube-proxy version won't come up on old nodes as the new image wasn't downloaded there.
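For the third task, here is a minimal sketch of what "rolled out correctly" could mean, assuming a controller-runtime client and gomega are available in the test framework (the helper name, signature, and intervals are illustrative, not the framework's actual API):

```go
package framework

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WaitForCoreDNSRollout is a hypothetical helper: instead of only matching the
// new image tag, it waits until the coredns Deployment reports the rollout as
// complete (latest spec observed, all replicas updated and available).
func WaitForCoreDNSRollout(ctx context.Context, c client.Client, timeout, interval time.Duration) {
	Eventually(func() bool {
		d := &appsv1.Deployment{}
		key := client.ObjectKey{Namespace: metav1.NamespaceSystem, Name: "coredns"}
		if err := c.Get(ctx, key, d); err != nil {
			return false
		}
		return d.Spec.Replicas != nil &&
			d.Status.ObservedGeneration >= d.Generation &&
			d.Status.UpdatedReplicas == *d.Spec.Replicas &&
			d.Status.AvailableReplicas == *d.Spec.Replicas
	}, timeout, interval).Should(BeTrue(), "coredns Deployment did not finish rolling out")
}
```

The point is to look at ObservedGeneration, UpdatedReplicas, and AvailableReplicas instead of only matching the image tag in the pod template.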

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 16, 2022
@sbueringer (Member Author)

/cc @fabriziopandini @detiber

I'm absolutely not sure about the last task, but I would flag the issue as a good first issue for the first and third tasks if that sounds good to you.

@fabriziopandini (Member)

/area testing
/milestone v1.2
+1 on the overall plan; we should consider putting the last point behind a flag, so we allow opting out if required.
As an additional point, we should also consider hardening webhook validation.

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Feb 16, 2022
@k8s-ci-robot k8s-ci-robot added the area/testing Issues or PRs related to testing label Feb 16, 2022
@sbueringer (Member Author)

we should also consider hardening webhook validation

Agree. My plan was to tackle that as part of: #2599 (comment)

@detiber (Member) commented Feb 16, 2022

/cc @fabriziopandini @detiber

I'm absolutely not sure about the last task, but I would flag the issue as a good first issue for the first and third tasks if that sounds good to you.

The versions of CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI versions, I would expect them to be overridden through config and the upgrade to still complete. The same way the default versions are not used for non-CI based upgrades today.

@detiber (Member) commented Feb 16, 2022

One other thing that I will add....

There is currently no verification of progress when checking that all control plane machines are available, that all MachineDeployment machines are available, or that scaling operations and rollouts for KCP/MD complete. So if there is a terminal failure (failure message/reason set on an owned Machine), you need to wait for the wait-machine-upgrade, wait-control-plane, and/or wait-worker-nodes timeout to trigger. It would be nice if there was also a separate timeout for progress being made; that would help these types of failures cause the test to fail quicker and in an easier-to-debug way.

@sbueringer (Member Author) commented Feb 16, 2022

One other thing that I will add....

There is currently no verification of progress when checking that all control plane machines are available, that all MachineDeployment machines are available, or that scaling operations and rollouts for KCP/MD complete. So if there is a terminal failure (failure message/reason set on an owned Machine), you need to wait for the wait-machine-upgrade, wait-control-plane, and/or wait-worker-nodes timeout to trigger. It would be nice if there was also a separate timeout for progress being made; that would help these types of failures cause the test to fail quicker and in an easier-to-debug way.

I wonder if we need another timeout or if we can just fail immediately once we discover that a Machine has a terminal failure. Do you want to open a separate issue for that one? (so we can discuss further there)
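For reference, failing fast could be as simple as checking the terminal-failure fields during the existing wait loops; a hedged sketch (hasTerminalFailure is illustrative, not existing framework code):

```go
package framework

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// hasTerminalFailure reports whether any of the given Machines has a terminal
// failure recorded, i.e. FailureReason or FailureMessage is set. A wait loop
// could call this on every poll and abort immediately instead of running into
// the wait-machine-upgrade / wait-control-plane / wait-worker-nodes timeouts.
func hasTerminalFailure(machines []clusterv1.Machine) bool {
	for i := range machines {
		m := &machines[i]
		if m.Status.FailureReason != nil || m.Status.FailureMessage != nil {
			return true
		}
	}
	return false
}
```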

The versions of CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI versions, I would expect them to be overridden through config and the upgrade to still complete. The same way the default versions are not used for non-CI based upgrades today.

I'm not sure if I understand that. In the cluster upgrade test we:

  • are creating KCP without CoreDNS/etcd tags set, so kubeadm will use its default versions
  • are upgrading to the versions we pass in, either the ones from the e2e config or an override via an env var

Today the CoreDNS version in our docker.yaml is not used at all in CI, because we always set the env var.

What part would you like to change?

@detiber (Member) commented Mar 2, 2022

The versions of CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI versions, I would expect them to be overridden through config and the upgrade to still complete. The same way the default versions are not used for non-CI based upgrades today.

I'm not sure if I understand that. In the cluster upgrade test we:

  • are creating KCP without CoreDNS/etcd tags set, so kubeadm will use its default versions
  • are upgrading to the versions we pass in, either the ones from the e2e config or an override via an env var

Today the CoreDNS version in our docker.yaml is not used at all in CI, because we always set the env var.

What part would you like to change?

For the CI jobs, are we not still overriding the CoreDNS/etcd image versions through config or env variables?

If so, then both CoreDNS and etcd should be working following the upgrade, just like in any other upgrade scenario (assuming that the CoreDNS/etcd versions are >= the starting versions). In that case I would expect that we could verify not only that the CoreDNS Deployment is updated with the new image version (which appears to be what we are doing today), but also that the CoreDNS Deployment has successfully rolled out and is in a good state.

As it is right now, it's possible for the upgrade test to succeed even when the cluster is no longer functional after the upgrade.

I don't think this is as much of an issue for etcd, since the upgrade process would not finish if etcd is broken.

@detiber (Member) commented Mar 2, 2022

I think we should do something similar for the kube-proxy Daemonset in WaitForKubeProxyUpgrade

  • Let's discuss this separately and try it in a separate PR.
  • I think this could break upgrade tests which are using CI images. IIRC when we are injecting a Kubernetes CI version in CAPA/CAPO we download a specific kube-proxy version from GCS which is not available in any registry. I.e. a new kube-proxy version won't come up on old nodes as the new image wasn't downloaded there.

Ah, reading a bit closer, I think I understand now. Apologies for my earlier confusion. I was thinking purely about CoreDNS/etcd and not kube-proxy. For the kube-proxy daemonset, I think there are potentially a few ways we could accomplish it:

  • Add the necessary steps to PreKubeadmCommands to load the images into a ci-focused template
  • Add the needed images to PrePullImages in the DockerMachineSpec configs for a ci-focused template
    • I suspect that this will require extending support for images that are not in the local registry, either that or pre-adding the required image(s) to the local registry

@sbueringer (Member Author) commented Mar 2, 2022

Okay I think we have the same opinion :)

I'll flag the issue as good first issue for the first and third task (fixing docker.yaml and improving the e2e test to validate that CoreDNS comes up). I'll probably then create a new issue only for kube-proxy to see if anyone wants to take that up / if there is demand to improve this.

/good-first-issue

@k8s-ci-robot (Contributor)

@sbueringer:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

Okay I think we have roughly the same opinion.

I'll flag the issue as good first issue for the first and third task (fixing docker.yaml and improving the e2e test to validate that CoreDNS comes up). I'll probably then create a new issue only for kube-proxy to see if anyone wants to take that up / if there is demand to improve this.

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 2, 2022
@valaparthvi (Contributor)

I'd like to give the good first issue tasks a shot, if that's alright.

@sbueringer (Member Author)

I'd like to give the good first issue tasks a shot, if that's alright.

Yup sure. Please ignore the 4th task in the list above for now.

@valaparthvi (Contributor)

I have currently picked up #3661, so I won't be assigning this to myself at the moment. But thanks for the heads-up.

@Ab-hishek (Contributor)

I would like to work on this one. Can I assign this to myself?

@sbueringer (Member Author)

I would like to work on this one. Can I assign this to myself?

Yup sure. Please ignore the 4th task in the list above for now.

@Ab-hishek (Contributor)

/assign

@Ab-hishek (Contributor)

Hi. I have opened a pull request for this issue but somehow I am not able to get the EasyCLA check working. Can somebody help me with this?

@Ab-hishek (Contributor)

Hi. I have opened a pull request for this issue but somehow I am not able to get the EasyCLA check working. Can somebody help me with this?

Resolved it. The email in my commits was different from the one used for my GitHub account.

@sbueringer (Member Author)

/reopen
for the last task. I have to think about how to approach this a bit.

/remove-good-first-issue

@k8s-ci-robot k8s-ci-robot reopened this Mar 29, 2022
@k8s-ci-robot (Contributor)

@sbueringer: Reopened this issue.

In response to this:

/reopen
for the last task. I have to think about how to approach this a bit.

/remove-good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Ab-hishek (Contributor)

@sbueringer I have a few queries:

  1. What is CAPA/CAPO?

  2. Why are we not handling the kube-proxy image versions the same way as we are for CoreDNS right now, i.e. via env or e2e config? What is the need for a separate, specific kube-proxy version from GCS which is not available in any registry?

Please excuse me if my questions seem vague. I'm just trying to get the context and understand the thought process of keeping different kube-proxy image versions for different tests. In the upgrade test, we should simply first install some base image of kube-proxy (any globally available kube-proxy image) and then, while upgrading, check the upgraded image tags.

@sbueringer (Member Author) commented Mar 29, 2022

  1. What is CAPA/CAPO?

Cluster API Provider AWS and Cluster API Provider OpenStack.

kube-proxy images are aligned to the specific Kubernetes version we're using, and kube-proxy runs on every node.

If I remember correctly, some providers have e2e tests which use the latest versions of Kubernetes (a commit from the master branch), for which afaik images are not published to any registry, only to Google Cloud Storage. In those cases we download the images for kube-apiserver/kube-controller-manager/kube-scheduler and kube-proxy from GCS and then import them. (xref: https://github.com/kubernetes-sigs/cluster-api/blob/be8ce08326956fb539c128342710aaf267e2dfed/test/framework/kubernetesversions/data/debian_injection_script.envsubst.sh.tpl)

The result is essentially that every Machine we create only has the kube-proxy image for the version the Machine was created with available locally. If KCP now updates the kube-proxy version in the kube-proxy DaemonSet, there is no way for "old" Machines based on the previous Kubernetes version to get the newer kube-proxy image.

So if we validate at the wrong time that kube-proxy is already up, this simply won't work. For example, even if KCP is completely upgraded, the worker machines will still have kube-proxy Pods in ImagePullBackOff.

But because I'm not sure if I remember all that correctly (it has been a while) and how this stuff is used today, I have to do some more research before I'm able to make a reliable statement on how we can/should address this :)

@sbueringer (Member Author)

I'm closing this issue. I took a closer look and I don't think it's worth the effort to implement.

Essentially all tests which are using pre-release versions (e.g. a lot of CAPD-based e2e tests) wouldn't work anymore out of the box, because they need the kube-proxy image baked in.

We could do some kind of pre-loading, but I don't think it's worth the investment. If kube-proxy doesn't come up, our e2e tests will fail anyway, as there is no way that the conformance tests will work without it.

If we hit this case more often we can consider improving the failure output, but right now it basically never happens.
