
Make sure we use correct CoreDNS image tags in our upgrade tests #6143

Closed
3 of 4 tasks
sbueringer opened this issue Feb 16, 2022 · 23 comments · Fixed by #6339
Labels: area/testing, good first issue, help wanted, kind/bug
Milestone: v1.2

Comments

@sbueringer (Member) commented Feb 16, 2022

Currently we have a mixture of CoreDNS tags with and without the v prefix in our Kubernetes upgrade test.
We are also not verifying that CoreDNS actually comes up; we only wait until the Deployment has the new image tag.

Some context:

  • It's about the CoreDNS tag we're upgrading to, which is configured via COREDNS_VERSION_UPGRADE_TO
  • There was a migration from k8s.gcr.io/coredns to k8s.gcr.io/coredns/coredns and the following images are available in GCR:
    • k8s.gcr.io/coredns <= 1.6.5 / 1.6.6 / 1.6.7 / 1.7.0
    • k8s.gcr.io/coredns/coredns v1.6.6 / v1.6.7 / v1.6.9 / v1.7.0 / v1.7.1 / v1.8.0 / v1.8.3 / v1.8.4 / v1.8.5 / v1.8.6
    • gcloud container images list-tags k8s.gcr.io/coredns & gcloud container images list-tags k8s.gcr.io/coredns/coredns
  • In KCP we automatically switch to the new image repository starting with CoreDNS v1.8.0

So I would suggest that we use the v prefix for CoreDNS tags >= v1.8.0.
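For illustration, here is a minimal Go sketch of the repository/tag split described above (not the actual KCP code; coreDNSImage is a hypothetical helper and the blang/semver dependency is an assumption):

```go
package main

import (
	"fmt"

	"github.com/blang/semver" // assumption: the semver library used for the version comparison
)

// coreDNSImage is a hypothetical helper mirroring the split described above:
// tags below 1.8.0 live in k8s.gcr.io/coredns without a "v" prefix, while
// v1.8.0 and newer live in k8s.gcr.io/coredns/coredns with a "v" prefix.
func coreDNSImage(tag string) (string, error) {
	v, err := semver.ParseTolerant(tag) // accepts both "1.8.6" and "v1.8.6"
	if err != nil {
		return "", err
	}
	if v.GE(semver.MustParse("1.8.0")) {
		return fmt.Sprintf("k8s.gcr.io/coredns/coredns:v%s", v.String()), nil
	}
	return fmt.Sprintf("k8s.gcr.io/coredns:%s", v.String()), nil
}

func main() {
	for _, tag := range []string{"1.7.0", "v1.8.6"} {
		image, _ := coreDNSImage(tag)
		fmt.Println(image) // k8s.gcr.io/coredns:1.7.0, then k8s.gcr.io/coredns/coredns:v1.8.6
	}
}
```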

Tasks:

  • Change COREDNS_VERSION_UPGRADE_TO in test/e2e/config/docker.yaml: 1.8.4 => v1.8.6 (we should use the default CoreDNS version of Kubernetes 1.23)
  • Check periodic and presubmit Kubernetes upgrade jobs in test-infra
  • Improve the Kubernetes upgrade test to verify that the CoreDNS Deployment is rolled out correctly (see the sketch after this list)
  • I think we should do something similar for the kube-proxy Daemonset in WaitForKubeProxyUpgrade
    • Let's discuss this separately and try it in a separate PR.
    • I think this could break upgrade tests which are using CI images. IIRC when we are injecting a Kubernetes CI version in CAPA/CAPO we download a specific kube-proxy version from GCS which is not available in any registry. I.e. a new kube-proxy version won't come up on old nodes as the new image wasn't downloaded there.
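For the third task, here is a minimal sketch of what "rolled out correctly" could mean, assuming a controller-runtime client and gomega are available in the test framework (the helper name, signature, and intervals are illustrative, not the framework's actual API):

```go
package framework

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// WaitForCoreDNSRollout is a hypothetical helper: instead of only matching the
// new image tag, it waits until the coredns Deployment reports the rollout as
// complete (latest spec observed, all replicas updated and available).
func WaitForCoreDNSRollout(ctx context.Context, c client.Client, timeout, interval time.Duration) {
	Eventually(func() bool {
		d := &appsv1.Deployment{}
		key := client.ObjectKey{Namespace: metav1.NamespaceSystem, Name: "coredns"}
		if err := c.Get(ctx, key, d); err != nil {
			return false
		}
		return d.Spec.Replicas != nil &&
			d.Status.ObservedGeneration >= d.Generation &&
			d.Status.UpdatedReplicas == *d.Spec.Replicas &&
			d.Status.AvailableReplicas == *d.Spec.Replicas
	}, timeout, interval).Should(BeTrue(), "coredns Deployment did not finish rolling out")
}
```

The point is to look at ObservedGeneration, UpdatedReplicas, and AvailableReplicas instead of only matching the image tag in the pod template.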

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 16, 2022
@sbueringer (Member Author)

/cc @fabriziopandini @detiber

I'm absolutely not sure about the last task, but I would flag the issue as a good first issue for the first and third tasks if that sounds good to you.

@fabriziopandini (Member)

/area testing
/milestone v1.2
+1 on the overall plan; we should consider putting the last point behind a flag, so we allow opting out if required.
As an additional point, we should also consider hardening webhook validation.

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Feb 16, 2022
@k8s-ci-robot k8s-ci-robot added the area/testing Issues or PRs related to testing label Feb 16, 2022
@sbueringer (Member Author)

we should also consider hardening webhook validation

Agree. My plan was to tackle that as part of: #2599 (comment)

@detiber (Member) commented Feb 16, 2022

/cc @fabriziopandini @detiber

I'm absolutely not sure about the last task, but I would flag the issue as a good first issue for the first and third tasks if that sounds good to you.

The versions of CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI versions, I would expect them to be overridden through config and the upgrade to still complete. The same way the default versions are not used for non-CI based upgrades today.

@detiber (Member) commented Feb 16, 2022

One other thing that I will add....

There is currently no verification of progress when checking that all control plane machines are available, that all MachineDeployment machines are available, or that scaling operations and rollouts for KCP/MD complete. So if there is a terminal failure (failure message/reason set on an owned Machine), you need to wait for the wait-machine-upgrade, wait-control-plane, and/or wait-worker-nodes timeout to trigger. It would be nice if there was also a separate timeout for progress being made; that would help these types of failures cause the test to fail quicker and in an easier-to-debug way.

@sbueringer (Member Author) commented Feb 16, 2022

One other thing that I will add....

There is currently no verification of progress when checking that all control plane machines are available, that all MachineDeployment machines are available, or that scaling operations and rollouts for KCP/MD complete. So if there is a terminal failure (failure message/reason set on an owned Machine), you need to wait for the wait-machine-upgrade, wait-control-plane, and/or wait-worker-nodes timeout to trigger. It would be nice if there was also a separate timeout for progress being made; that would help these types of failures cause the test to fail quicker and in an easier-to-debug way.

I wonder if we need another timeout or if we can just fail immediately once we discover that a Machine has a terminal failure. Do you want to open a separate issue for that one? (so we can discuss further there)
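For reference, failing fast could be as simple as checking the terminal-failure fields during the existing wait loops; a hedged sketch (hasTerminalFailure is illustrative, not existing framework code):

```go
package framework

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// hasTerminalFailure reports whether any of the given Machines has a terminal
// failure recorded, i.e. FailureReason or FailureMessage is set. A wait loop
// could call this on every poll and abort immediately instead of running into
// the wait-machine-upgrade / wait-control-plane / wait-worker-nodes timeouts.
func hasTerminalFailure(machines []clusterv1.Machine) bool {
	for i := range machines {
		m := &machines[i]
		if m.Status.FailureReason != nil || m.Status.FailureMessage != nil {
			return true
		}
	}
	return false
}
```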

The versions of CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI versions, I would expect them to be overridden through config and the upgrade to still complete. The same way the default versions are not used for non-CI based upgrades today.

I'm not sure if I understand that. In the cluster upgrade test we:

  • are creating KCP without CoreDNS/etcd tags set, so kubeadm will use its default versions
  • are upgrading to the versions we pass in, either the ones from the e2e config or an override via an env var

Today the CoreDNS version in our docker.yaml is not used at all in CI, because we always set the env var.

What part would you like to change?

@detiber (Member) commented Mar 2, 2022

The versions of CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI versions, I would expect them to be overridden through config and the upgrade to still complete. The same way the default versions are not used for non-CI based upgrades today.

I'm not sure if I understand that. In the cluster upgrade test we:

  • are creating KCP without CoreDNS/etcd tags set, so kubeadm will use its default versions
  • are upgrading to the versions we pass in, either the ones from the e2e config or an override via an env var

Today the CoreDNS version in our docker.yaml is not used at all in CI, because we always set the env var.

What part would you like to change?

For the CI jobs, are we not still overriding the CoreDNS/etcd image versions through config or env variables?

If so, then both CoreDNS and etcd should be working following the upgrade, just like in any other upgrade scenario (assuming that the CoreDNS/etcd versions are >= the starting versions). In that case I would expect that we could verify not only that the CoreDNS Deployment is updated with the new image version (which appears to be what we are doing today), but also that the CoreDNS Deployment has successfully rolled out and is in a good state.

As it is right now, it's possible for the upgrade test to succeed even when the cluster is no longer functional after the upgrade.

I don't think this is as much of an issue for etcd, since the upgrade process would not finish if etcd is broken.

@detiber (Member) commented Mar 2, 2022

I think we should do something similar for the kube-proxy Daemonset in WaitForKubeProxyUpgrade

  • Let's discuss this separately and try it in a separate PR.
  • I think this could break upgrade tests which are using CI images. IIRC when we are injecting a Kubernetes CI version in CAPA/CAPO we download a specific kube-proxy version from GCS which is not available in any registry. I.e. a new kube-proxy version won't come up on old nodes as the new image wasn't downloaded there.

Ah, reading a bit closer, I think I understand now. Apologies for my earlier confusion. I was thinking purely about CoreDNS/etcd and not kube-proxy. For the kube-proxy daemonset, I think there are potentially a few ways we could accomplish it:

  • Add the necessary steps to PreKubeadmCommands to load the images into a ci-focused template
  • Add the needed images to PrePullImages in the DockerMachineSpec configs for a ci-focused template
    • I suspect that this will require extending support for images that are not in the local registry, either that or pre-adding the required image(s) to the local registry

@sbueringer (Member Author) commented Mar 2, 2022

Okay I think we have the same opinion :)

I'll flag the issue as good first issue for the first and third task (fixing docker.yaml and improving the e2e test to validate that CoreDNS comes up). I'll probably then create a new issue only for kube-proxy to see if anyone wants to take that up / if there is demand to improve this.

/good-first-issue

@k8s-ci-robot (Contributor)

@sbueringer:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

Okay I think we have roughly the same opinion.

I'll flag the issue as good first issue for the first and third task (fixing docker.yaml and improving the e2e test to validate that CoreDNS comes up). I'll probably then create a new issue only for kube-proxy to see if anyone wants to take that up / if there is demand to improve this.

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 2, 2022
@valaparthvi (Contributor)

I'd like to give the good first issue tasks a shot, if that's alright.

@sbueringer (Member Author)

I'd like to give the good first issue tasks a shot, if that's alright.

Yup sure. Please ignore the 4th task in the list above for now.

@valaparthvi (Contributor)

I have currently picked up #3661, so I won't be assigning this to myself at the moment. But thanks for the heads-up.

@Ab-hishek (Contributor)

I would like to work on this one. Can I assign this to myself?

@sbueringer (Member Author)

I would like to work on this one. Can I assign this to myself?

Yup sure. Please ignore the 4th task in the list above for now.

@Ab-hishek (Contributor)

/assign

@Ab-hishek (Contributor)

Hi. I have opened a pull request for this issue but somehow I am not able to get the EasyCLA check working. Can somebody help me with this?

@Ab-hishek (Contributor)

Hi. I have opened a pull request for this issue but somehow I am not able to get the EasyCLA check working. Can somebody help me with this?

Resolved it. The email in my commits was different from the one used for my GitHub account.

@sbueringer (Member Author)

/reopen
for the last task. I have to think about how to approach this a bit.

/remove-good-first-issue

@k8s-ci-robot k8s-ci-robot reopened this Mar 29, 2022
@k8s-ci-robot (Contributor)

@sbueringer: Reopened this issue.

In response to this:

/reopen
for the last task. I have to think about how to approach this a bit.

/remove-good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Ab-hishek (Contributor)

@sbueringer I have a few queries:

  1. What is CAPA/CAPO?

  2. Why are we not handling the kube-proxy image versions the same way as we are for CoreDNS right now, i.e. via env or e2e config? What is the need for a separate, specific kube-proxy version from GCS which is not available in any registry?

Please excuse me if my questions seem vague. I'm just trying to get the context and understand the thought process of keeping different kube-proxy image versions for different tests. In the upgrade test, we should simply first install some base image of kube-proxy (any globally available kube-proxy image) and then, while upgrading, check the upgraded image tags.

@sbueringer (Member Author) commented Mar 29, 2022

  1. What is CAPA/CAPO?

Cluster API Provider AWS and Cluster API Provider OpenStack.

kube-proxy images are aligned to the specific Kubernetes version we're using, and kube-proxy runs on every node.

If I remember correctly, some providers have e2e tests which use the latest versions of Kubernetes (a commit from the master branch), for which afaik images are not published to any registry, only to Google Cloud Storage. In those cases we download the images for kube-apiserver/kube-controller-manager/kube-scheduler and kube-proxy from GCS and then import them. (xref: https://github.com/kubernetes-sigs/cluster-api/blob/be8ce08326956fb539c128342710aaf267e2dfed/test/framework/kubernetesversions/data/debian_injection_script.envsubst.sh.tpl)

The result is essentially that every Machine we create only has the kube-proxy image for the version the Machine was created with available locally. If KCP now updates the kube-proxy version in the kube-proxy DaemonSet, there is no way for "old" Machines based on the previous Kubernetes version to get the newer kube-proxy image.

So if we validate at the wrong time that kube-proxy is already up, this simply won't work. For example, even if KCP is completely upgraded, the worker machines will still have kube-proxy Pods in ImagePullBackOff.

But because I'm not sure if I remember all that correctly (it has been a while) and how this stuff is used today, I have to do some more research before I'm able to make a reliable statement on how we can/should address this :)

@sbueringer (Member Author)

I'm closing this issue. I took a closer look and I don't think it's worth the effort to implement.

Essentially all tests which are using pre-release versions (e.g. a lot of CAPD-based e2e tests) wouldn't work anymore out of the box, because they need the kube-proxy image baked in.

We could do some kind of pre-loading, but I don't think it's worth the investment. If kube-proxy doesn't come up, our e2e tests will fail anyway, as there is no way that the conformance tests will work without it.

If we hit this case more often we can consider improving the failure output, but right now it basically never happens.
