Helm deploy fails on pre-install hooks (AKS only) #455
Same issue (referenced above: istio/istio#6301) while
This is related to #408. We had our clusters patched yesterday, and now Helm pre-install hooks are working again.
Just tried on a new centralus cluster, and it still fails.
All new cluster creates (in all regions) should now be patched with the fix. Thanks for your patience; please report if you still see issues. Existing clusters will eventually get patched.
I just tried again with Istio on a new cluster in eastus with RBAC enabled, and I'm still seeing this issue. Here are the relevant Tiller logs:
Edit* Just tried installing the tensorflow-notebook chart, which succeeded. Strange that Istio still fails with this.
I happen to have a similar issue on AKS with post-install hooks, but while installing
This is caused by the idle timeout on watches, which is currently expected to be ~60s. We are aware of this issue; a partial fix is rolling out next week, and a full fix is being worked on at high priority. Thanks for reporting the issue.
Works for me now after upgrading the cluster to 1.10.3.
The 60-second idle timeout on watches is also preventing our post-install hooks from succeeding. Our cluster is running 1.9.6.
We tested with 1.10.3 and it still fails. I think an api-server configuration setting is responsible for this.
There's another Istio problem that might be related to this watch timeout. Looking at the logs for the mixer with
The container never crashes; it just logs this error repeatedly. This does not happen on Minikube.
We have rolled out fixes. Please provide feedback if there are issues with new cluster creates.
We had problems with 1.10.3 and the prometheus-operator chart. Will try again today on a new cluster.
We still have the watch timeout issue for a default installation of Istio (RBAC enabled in AKS). Here are the logs:
The watch still closes after 1 minute. Edit* Even though the 1-minute timeout is still not fixed, the mixer issue above (about watching *unstructured.Unstructured) seems to be resolved now. I did find a hacky workaround for now: if I make sure the image for the post-install job exists on every node before the job runs (by creating a DaemonSet with that image and overriding the command to an infinite sleep loop), the job takes less than a minute, so we don't hit the watch timeout failure. Not ideal, but I wanted to report my findings.
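For anyone who wants to try the same workaround, here is a minimal sketch of such a pre-pull DaemonSet. The names and image are placeholders (I used hyperkube; substitute whatever image your chart's hook job actually runs):

```sh
# Sketch of the pre-pull workaround; name and image are placeholders.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepull
spec:
  selector:
    matchLabels:
      app: image-prepull
  template:
    metadata:
      labels:
        app: image-prepull
    spec:
      containers:
      - name: prepull
        # Placeholder: use the image your post-install hook job runs.
        image: k8s.gcr.io/hyperkube:v1.10.5
        # Sleep forever so the pod stays Running and the image
        # stays cached on every node.
        command: ["/bin/sh", "-c", "while true; do sleep 3600; done"]
EOF
```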
Created a new AKS cluster just now and tried to install prometheus-operator:
Can confirm @StianOvrevage's comment. A cluster created yesterday (1.10.3) fails in the same way.
Looks like this might be working on 1.10.5:
Istio now seems to get past the "watch closed before timeout" issue but still fails with "timed out waiting for condition". Tried with Helm 2.8.2 (kubectl client 1.9.1) and Helm 2.9.1 (kubectl client 1.10.5). I gave it a 16-minute timeout window. Tiller logs:
Works on Minikube. Here are the Minikube logs (Minikube 0.25.2, Kubernetes 1.9.4, Helm 2.9.1; I can't get a 1.10.0 Minikube cluster started on Windows for some reason):
Edit* The workaround of deploying a DaemonSet with the ~700 MB hyperkube image before helm-installing Istio still seems to work, though (for client/server 1.9.1/1.9.6 and 1.10.5/1.10.5). Tiller logs of this success:
I can confirm that (as of today), in my case, upgrading to 1.10.5 via the docs at https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster fixed my issue, and I was able to install Prometheus via Helm. My guess is this one can be closed, but it would be great to hear if anyone else has feedback.
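For anyone else trying this, the upgrade in those docs boils down to roughly the following (the resource group and cluster names are placeholders for your own):

```sh
# List the versions the cluster can upgrade to.
az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table

# Upgrade the cluster (placeholder names; nodes are cycled during the upgrade).
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.10.5
```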
This is now failing again on 1.10.5:
:( |
The fix has not rolled out yet, as it did not make it into last week's release. Apologies for the delay; I expect rollout to start today and reach all production regions by the end of the week. To expedite a patch: mailto:[email protected]
@rite2nikhil - The issue is still occurring in a newly created AKS cluster.
This happens to me as well with a newly created cluster.
This still happens for me with the post-install stage of prometheus-operator on a newly created cluster. I've tested with both 1.9.6 and 1.10.6. The region is westeurope. This worked fine on a newly created cluster (1.9.6) last week, though.
The fix for Helm/Tiller will get rolled out by the end of next week, so if this is urgent, please send your cluster info to [email protected].
Is there a way to see that the fix is in when we deploy a new cluster? Is it somehow tied to the acsengineVersion tag?
This still fails on 1.11.1 on a newly created cluster today, @rite2nikhil. Does this mean the fix doesn't fix it, that these are several different problems, or that the fix doesn't apply to new clusters yet? I also agree with @twendt: being able to see which version gets deployed, plus a changelog and/or status of known bugs, would be nice.
@StianOvrevage I had the same problem with prometheus-operator. I was able to deploy it successfully with the --debug parameter: without the parameter it failed after 60s; with the parameter it completed after ~10-15s. The cluster was created yesterday with K8s version 1.11.1.
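For comparison, the two invocations were essentially the following (the chart reference is an assumption; point it at wherever your prometheus-operator chart actually comes from):

```sh
# Failed after ~60s on a fresh cluster (chart path assumed):
helm install stable/prometheus-operator --name prom-op --namespace monitoring

# Clean up the failed release, then retry with --debug;
# this completed in ~10-15s:
helm delete --purge prom-op
helm install stable/prometheus-operator --name prom-op --namespace monitoring --debug
```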
@tkaepp Is it possible that it succeeded the second time because the image was already cached on the nodes? Kubernetes will keep an image on a node for 5(?) minutes after nothing is using it anymore. If Kubernetes does not have to pull the images, some helm installs work, because they're able to complete before the 1-minute timeout. This is actually the basis of the workaround I used to helm install Istio, where I deploy a DaemonSet with the hyperkube image before installing; that way the post-install jobs complete much more quickly (within the 1-minute timeout period). I suspect that if you tried the --debug parameter on a freshly created AKS cluster, it would fail just the same.
Totally agree with you, @StianOvrevage.
@tkaepp and @plc402: This is a known issue that in most cases can be quickly remedied. Please file a support request via portal.azure.com and link to this issue; that way we will be able to provide a fix. Thank you!
@jskulavik
Hi @plc402, please reply to support requesting that they assign the case to me, and we will look at your cluster. Thank you.
Tested with a new cluster today on Kubernetes 1.11.2, Helm 2.10.0, and Istio 1.0.1, and the helm install worked (without the DaemonSet workaround)! I'm still having issues with watches and listing resources (interacting with the apiserver in general), but this particular issue (helm installs with post-install jobs) seems fixed for me. Edit* The region was centralus.
Never mind, it's happening again:
Not fixed here on 1.11.2 on a cluster created 16 minutes ago.
Just created a cluster with 1.11.2; it worked this time. 🤷‍♂️
Fails:
Please see issue #676, which we are actively working to address.
Thanks @jskulavik. Any idea how to implement the workaround of setting the KUBERNETES_* env vars on Helm? Add them to the
Hi @StianOvrevage, yes, that would be the best place to start in this case, given that you're running into Helm issues.
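I haven't verified this end to end, but mechanically the env vars can be injected into the Tiller deployment like this (the API server FQDN below is a placeholder; `kubectl cluster-info` shows your cluster's actual address):

```sh
# Sketch only: override Tiller's in-cluster API endpoint with the
# API server FQDN directly. Replace the placeholder host with your own.
kubectl -n kube-system set env deployment/tiller-deploy \
  KUBERNETES_SERVICE_HOST=mycluster-abc123.hcp.westeurope.azmk8s.io \
  KUBERNETES_SERVICE_PORT=443
```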
I was hitting this
This seems to be fixed by #676 (opt-in preview atm).
Closing this issue as old/stale/resolved. Note: if you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc.), please open an Azure technical support ticket.
Our helm deployments fail to install on the AKS cluster. The same charts work fine on other clusters, including the ACS cluster.
Reproduction:
(I chose the tensorflow-notebook chart for the reproduction because it's not huge and is easily available; see the sketch below. The same thing also happens with other charts.)
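The install itself was nothing special; roughly the following (the chart reference assumes the chart lives in the stable repo, and the release name is a placeholder):

```sh
# Assumes Helm 2 with Tiller already initialized on the cluster.
helm install stable/tensorflow-notebook --name tf-repro
```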
Tiller log:
The secret did get created during this:
We reproduced the issue on three separate AKS clusters, all on Kubernetes 1.9.6, in West and North Europe. We tested with Helm 2.7.0, 2.9.0, and 2.9.1.
As said above, the same charts work without issues in our ACS cluster and on multiple Terraform-based ones (Kubernetes 1.7.7 (ACS), 1.8.5, and 1.8.4).
Charts without a pre-install hook (or maybe other hooks; we didn't isolate that) deploy without any issue on the AKS cluster. The cluster otherwise appears to be fine: calls via kubectl work, kubectl port-forward works, and helm list works.
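For context, this is the generic shape of a pre-install hook inside a chart (a sketch, not taken from the failing charts). Tiller creates the hook Job before applying the chart's main manifests and watches it to completion, and that watch is what hits the ~60s idle timeout on AKS:

```yaml
# templates/preinstall-job.yaml -- generic sketch of a pre-install hook
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-preinstall
  annotations:
    # Run this Job before the chart's main resources are created.
    "helm.sh/hook": pre-install
    # Delete the Job once it has succeeded.
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: preinstall
        image: busybox
        command: ["sh", "-c", "echo preparing && sleep 5"]
```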