Make sure we use correct CoreDNS image tags in our upgrade tests #6143
Comments
I'm absolutely not sure about the last task, but I would flag the issue as a good first issue for the first and third if that sounds good to you.
/area testing
Agree. My plan was to tackle that as part of: #2599 (comment)
The versions of the CoreDNS and etcd images are provided through the e2e test config, so regardless of what versions would be used by default for CI, I would expect them to be overridden through config and the upgrade to still complete, the same way the default versions are not used for non-CI based upgrades today.
One other thing that I will add: there is currently no verification of progress when checking for all control plane machines to be available, all machine deployment machines to be available, scaling operations for KCP/MD, and rolling out updates for KCP/MD today, so if there is a terminal failure (failure message/reason set on an owned Machine), then you need to wait for the timeout to expire.
I wonder if we need another timeout or if we can just immediately fail once we discover that a Machine has a terminal failure. Do you want to open a separate issue for that one? (so we can discuss further there)
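For illustration, a minimal sketch of what failing fast on a terminal Machine failure could look like, assuming the v1beta1 Machine API and a controller-runtime client; the helper name and placement are hypothetical, not existing framework code:

```go
package framework // hypothetical placement, not part of the real test framework

import (
	"context"
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// failFastOnTerminalMachineFailure returns an error as soon as any Machine in the
// given namespace reports a terminal failure (FailureReason/FailureMessage set),
// instead of letting the caller run into the overall timeout.
func failFastOnTerminalMachineFailure(ctx context.Context, c client.Client, namespace string) error {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines, client.InNamespace(namespace)); err != nil {
		return fmt.Errorf("failed to list Machines: %w", err)
	}
	for _, m := range machines.Items {
		if m.Status.FailureReason == nil && m.Status.FailureMessage == nil {
			continue
		}
		reason, message := "", ""
		if m.Status.FailureReason != nil {
			reason = string(*m.Status.FailureReason)
		}
		if m.Status.FailureMessage != nil {
			message = *m.Status.FailureMessage
		}
		return fmt.Errorf("machine %s has a terminal failure: %s: %s", m.Name, reason, message)
	}
	return nil
}
```

A wait helper could call such a predicate on every polling interval and abort immediately instead of waiting for the timeout.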
I'm not sure if I understand that. In the cluster upgrade test we:
Today the CoreDNS version in our docker.yaml is not used at all in CI, because we always set the env var. What part would you like to change?
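To illustrate the precedence being described (an env var, when set, wins over the value in docker.yaml), here is a minimal, hypothetical sketch of such a variable lookup; it is not the actual e2e framework code, just the idea:

```go
package main

import (
	"fmt"
	"os"
)

// getVariable mimics the idea of an e2e config variable lookup: an environment
// variable, if set, takes precedence over the value declared in the config file
// (e.g. test/e2e/config/docker.yaml).
func getVariable(configVars map[string]string, name string) string {
	if value, ok := os.LookupEnv(name); ok {
		return value
	}
	return configVars[name]
}

func main() {
	configVars := map[string]string{"COREDNS_VERSION_UPGRADE_TO": "v1.8.6"}
	// In CI the env var is always set, so the docker.yaml value is never used.
	fmt.Println(getVariable(configVars, "COREDNS_VERSION_UPGRADE_TO"))
}
```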
For the CI jobs, are we not still overriding the CoreDNS/etcd image versions through config or env variables? If so, then both CoreDNS and etcd should be working following the upgrade just like any other upgrade scenario (assuming that the CoreDNS/etcd versions are >= the starting versions). In which case I would expect that we could verify not only that the CoreDNS Deployment is updated with the new image versions (which is what we appear to be doing today), but also that the CoreDNS Deployment has successfully rolled out and is in a good state. As it is right now, it's possible for the upgrade test to succeed even when the cluster is no longer functional after the upgrade. I don't think this is as much of an issue for etcd, since the upgrade process would not finish if etcd is broken.
Ah, reading a bit closer, I think I understand now. Apologies for my earlier confusion. I was thinking purely about CoreDNS/etcd and not kube-proxy. For the kube-proxy daemonset, I think there are potentially a few ways we could accomplish it:
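As an illustration only (a hypothetical sketch, not necessarily one of the approaches meant above), one such check could compare the kube-proxy DaemonSet's status counters after the upgrade:

```go
package framework // hypothetical placement

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// kubeProxyDaemonSetRolledOut reports whether the kube-proxy DaemonSet has
// observed its latest spec and all scheduled Pods are updated and available.
func kubeProxyDaemonSetRolledOut(ctx context.Context, c client.Client) (bool, error) {
	ds := &appsv1.DaemonSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: "kube-system", Name: "kube-proxy"}, ds); err != nil {
		return false, fmt.Errorf("failed to get kube-proxy DaemonSet: %w", err)
	}
	// The controller must have seen the latest spec.
	if ds.Status.ObservedGeneration < ds.Generation {
		return false, nil
	}
	// All scheduled Pods must be on the new spec and available.
	return ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
		ds.Status.NumberAvailable == ds.Status.DesiredNumberScheduled, nil
}
```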
Okay I think we have the same opinion :) I'll flag the issue as a good first issue for the first and third task (fixing docker.yaml and improving the e2e test to validate that CoreDNS comes up). I'll probably then create a new issue only for kube-proxy to see if anyone wants to take that up / if there is demand to improve this.
/good-first-issue
@sbueringer: Guidelines: please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'd like to give the good first issue tasks a shot, if that's alright.
Yup sure. Please ignore the 4th task in the list above for now. |
I have currently picked up #3661, so I won't be assigning this to myself at the moment. But thanks for the heads-up.
I would like to work on this one. Can I assign this to myself? |
Yup sure. Please ignore the 4th task in the list above for now. |
/assign
Hi. I have opened a pull request for this issue but somehow I am not able to get the EasyCLA check working. Can somebody help me with this?
Resolved it. The email in my commits was different from the one used for my GitHub account.
/reopen
/remove-good-first-issue
@sbueringer: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@sbueringer I have a few queries:
Please excuse me if my questions seem vague. I'm just trying to get the context and understand the thought process behind keeping different kube-proxy image versions for different tests. In the upgrade test, we should simply first install some base image of kube-proxy (any globally available kube-proxy image) and then, while upgrading, check the upgraded image tags.
In Cluster API Provider AWS and Cluster API Provider OpenStack, the kube-proxy images are aligned to the specific Kubernetes version we're using. If I remember correctly, some providers have e2e tests which use the latest versions of Kubernetes (a commit from the master branch), for which, afaik, images are not published to a registry, only to Google Cloud Storage. In those cases we download the images for kube-apiserver/kube-controller-manager/kube-scheduler and kube-proxy from GCS and then import them (xref: https://github.com/kubernetes-sigs/cluster-api/blob/be8ce08326956fb539c128342710aaf267e2dfed/test/framework/kubernetesversions/data/debian_injection_script.envsubst.sh.tpl).

The result is essentially that every Machine we create only has the kube-proxy image locally for the version the Machine was created with. If KCP now updates the kube-proxy version in the kube-proxy DaemonSet, there is no way that "old" Machines based on the previous Kubernetes version are able to get the newer kube-proxy image. So if we e.g. validate at the wrong time that kube-proxy is already up, this simply won't work. For example, even if KCP is completely upgraded, the worker machines will still have kube-proxy Pods in ImagePullBackOff.

But because I'm not sure if I remember all of that correctly (it has been a while) and how this stuff is used today, I have to do some more research before I'm able to make a reliable statement on how we can/should address this :)
I'm closing this issue. I took a closer look and I don't think it's worth the effort to implement. Essentially all tests which are using pre-release versions (e.g. a lot of CAPD-based e2e tests) wouldn't work out of the box anymore, because they need the kube-proxy image baked in. We could do some kind of pre-loading, but I don't think it's worth the investment. If kube-proxy doesn't come up, our e2e tests will fail anyway, as there is no way that the conformance tests will work without it. If we hit this case more often we can consider improving the failure output, but right now it basically never happens.
Currently we have a mixture of using CoreDNS tags with and without the `v` prefix in our Kubernetes upgrade test. We are also not verifying that CoreDNS actually comes up; we only wait until the deployment has the new image tag.

Some context:

- `COREDNS_VERSION_UPGRADE_TO`
- `gcloud container images list-tags k8s.gcr.io/coredns` & `gcloud container images list-tags k8s.gcr.io/coredns/coredns`

So I would suggest that we use the `v` prefix for CoreDNS >= v1.8.0.

Tasks:

- Update `COREDNS_VERSION_UPGRADE_TO` in `test/e2e/config/docker.yaml`: 1.8.4 => v1.8.6 (we should use the default CoreDNS version of Kubernetes 1.23)
- Verify in `WaitForDNSUpgrade` that `Deployment.Status.ObservedGeneration` >= `Deployment.Generation` and that `Deployment.Spec.Replicas` and `Deployment.Status.{AvailableReplicas,UpdatedReplicas}` are equal (a minimal sketch of such a check follows this list)
  - `kubectl rollout status` checks a bit more, but I think a simpler version is good enough for us: https://github.com/kubernetes/kubernetes/blob/bfa4188123ed334d4f5dda3a79994cadf663d8f2/staging/src/k8s.io/kubectl/pkg/polymorphichelpers/rollout_status.go#L59-L92
- `WaitForKubeProxyUpgrade`
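A minimal sketch of the rollout check described in the second task, assuming a controller-runtime client pointed at the workload cluster; the function name and placement are illustrative, not the existing `WaitForDNSUpgrade` implementation:

```go
package framework // hypothetical placement alongside the existing e2e helpers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// coreDNSDeploymentRolledOut reports whether the coredns Deployment has observed
// its latest spec and all replicas are updated and available, i.e. the checks
// listed in the second task above.
func coreDNSDeploymentRolledOut(ctx context.Context, c client.Client) (bool, error) {
	d := &appsv1.Deployment{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: "kube-system", Name: "coredns"}, d); err != nil {
		return false, err
	}
	// The deployment controller must have seen the latest spec.
	if d.Status.ObservedGeneration < d.Generation {
		return false, nil
	}
	// All replicas must be updated and available.
	replicas := int32(1)
	if d.Spec.Replicas != nil {
		replicas = *d.Spec.Replicas
	}
	return d.Status.UpdatedReplicas == replicas && d.Status.AvailableReplicas == replicas, nil
}
```

An e2e helper could poll this predicate until it returns true or the configured timeout expires.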
/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]