
Helm deploy fails on pre-install hooks (AKS only) #455

Closed

msiegenthaler opened this issue Jun 20, 2018 · 44 comments

Comments

@msiegenthaler

Our helm deployments fail to install on the AKS cluster. The same charts work fine on other clusters including the ACS cluster.

Reproduction:

[ms@ ~] helm install --name notebook stable/tensorflow-notebook
Error: watch closed before Until timeout

(I chose the tensorflow-notebook chart for the reproduction because it's not huge and easily available. The same thing also happens with other charts)

Tiller log:

+ tiller-deploy-7ccf99cd64-5745j › tiller
tiller-deploy-7ccf99cd64-5745j tiller [main] 2018/06/20 14:35:43 Starting Tiller v2.9.1 (tls=false)
tiller-deploy-7ccf99cd64-5745j tiller [main] 2018/06/20 14:35:43 GRPC listening on :44134
tiller-deploy-7ccf99cd64-5745j tiller [main] 2018/06/20 14:35:43 Probes listening on :44135
tiller-deploy-7ccf99cd64-5745j tiller [main] 2018/06/20 14:35:43 Storage driver is ConfigMap
tiller-deploy-7ccf99cd64-5745j tiller [main] 2018/06/20 14:35:43 Max history per release is 0
tiller-deploy-7ccf99cd64-5745j tiller [tiller] 2018/06/20 14:35:53 preparing install for notebook
tiller-deploy-7ccf99cd64-5745j tiller [storage] 2018/06/20 14:35:53 getting release history for "notebook"
tiller-deploy-7ccf99cd64-5745j tiller [tiller] 2018/06/20 14:35:53 rendering tensorflow-notebook chart using values
tiller-deploy-7ccf99cd64-5745j tiller [tiller] 2018/06/20 14:35:54 performing install for notebook
tiller-deploy-7ccf99cd64-5745j tiller [tiller] 2018/06/20 14:35:54 executing 1 pre-install hooks for notebook
tiller-deploy-7ccf99cd64-5745j tiller [kube] 2018/06/20 14:35:54 building resources from manifest
tiller-deploy-7ccf99cd64-5745j tiller [kube] 2018/06/20 14:35:54 creating 1 resource(s)
tiller-deploy-7ccf99cd64-5745j tiller [kube] 2018/06/20 14:36:54 Watching for changes to Secret notebook-tensorflow-notebook with timeout of 5m0s
tiller-deploy-7ccf99cd64-5745j tiller [tiller] 2018/06/20 14:36:54 warning: Release notebook pre-install tensorflow-notebook/templates/secrets.yaml could not complete: watch closed before Until timeout
tiller-deploy-7ccf99cd64-5745j tiller [tiller] 2018/06/20 14:36:54 failed install perform step: watch closed before Until timeout

The secret did get created during this:

[ms@ ~] kubectl get secret
NAME                           TYPE                                  DATA      AGE
default-token-vvtqq            kubernetes.io/service-account-token   3         3h
notebook-tensorflow-notebook   Opaque                                1         6m

We reproduced the issue on three separate AKS clusters, all on Kubernetes 1.9.6, in West and North Europe. We tested with Helm 2.7.0, 2.9.0, and 2.9.1.
As mentioned above, the same charts work without issues in our ACS cluster and on multiple Terraform-based ones (Kubernetes versions 1.7.7 (ACS), 1.8.5, 1.8.4).

Charts without a pre-install hook (possibly other hooks too; we didn't isolate that) deploy without any issue on the AKS cluster. The cluster otherwise appears to be fine: calls via kubectl work, kubectl port-forward works, and helm list works.

@m1o1

m1o1 commented Jun 20, 2018

Same issue (referenced above: istio/istio#6301) while installing Istio with Helm. In my case it fails on post-install hooks. Could this be related to admission controllers?

@twendt

twendt commented Jun 21, 2018

This is related to #408

We had our clusters patched yesterday and now helm pre-install hooks are working again.

@m1o1

m1o1 commented Jun 21, 2018

Just tried on a new centralus cluster and it still fails.

@rite2nikhil

All new cluster creates (in all regions) should be patched with the fix. Thanks for your patience; please report if you still see issues. Existing clusters will eventually get patched.

@m1o1

m1o1 commented Jun 22, 2018

I just tried again with Istio on a new cluster in eastus with RBAC enabled, and I'm still seeing this issue. Here are the relevant Tiller logs:

[tiller] 2018/06/22 12:38:30 executing 2 post-install hooks for istio
[kube] 2018/06/22 12:38:30 building resources from manifest
[kube] 2018/06/22 12:38:30 creating 1 resource(s)
[kube] 2018/06/22 12:38:31 Watching for changes to Job istio-cleanup-old-ca with timeout of 5m0s
[kube] 2018/06/22 12:38:31 Add/Modify event for istio-cleanup-old-ca: ADDED
[kube] 2018/06/22 12:38:31 istio-cleanup-old-ca: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2018/06/22 12:39:31 warning: Release istio post-install istio/charts/security/templates/cleanup-old-ca.yaml could not complete: watch closed before Until timeout
[tiller] 2018/06/22 12:39:31 warning: Release "istio" failed post-install: watch closed before Until timeout
[storage] 2018/06/22 12:39:31 updating release "istio.v1"
[tiller] 2018/06/22 12:39:31 failed install perform step: watch closed before Until timeout

Edit* Just tried installing the tensorflow notebook, which succeeded. Strange that Istio still fails with this.
Edit2* Same issue on centralus

@ChingizMard

I'm having a similar issue on AKS with post-install hooks, but while installing cockroachdb. It seems like exactly the same issue @andrew-dinunzio is having. Here are the logs:

[tiller] 2018/06/22 14:50:34 executing 2 post-install hooks for cockroachdb
[kube] 2018/06/22 14:50:34 building resources from manifest
[kube] 2018/06/22 14:50:34 creating 1 resource(s)
[kube] 2018/06/22 14:50:34 Watching for changes to Job cockroachdb-cockroachdb-init with timeout of 5m0s
[kube] 2018/06/22 14:50:34 Add/Modify event for cockroachdb-cockroachdb-init: ADDED
[kube] 2018/06/22 14:50:34 cockroachdb-cockroachdb-init: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[kube] 2018/06/22 14:50:34 Add/Modify event for cockroachdb-cockroachdb-init: MODIFIED
[kube] 2018/06/22 14:50:34 cockroachdb-cockroachdb-init: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2018/06/22 14:51:34 warning: Release cockroachdb post-install cockroachdb/templates/cluster-init.yaml could not complete: watch closed before Until timeout
[tiller] 2018/06/22 14:51:34 warning: Release "cockroachdb" failed post-install: watch closed before Until timeout
[storage] 2018/06/22 14:51:34 updating release "cockroachdb.v1"
[tiller] 2018/06/22 14:51:34 failed install perform step: watch closed before Until timeout

@rite2nikhil

This is caused by the idle timeout on watches, which is currently expected to be ~60s. We are aware of this issue; a partial fix is rolling out next week, and we are working on a full fix at high priority.

Thanks for reporting the issue.

@msiegenthaler
Author

Works for me now after upgrading the cluster to 1.10.3.

@rlucioni

rlucioni commented Jun 22, 2018

The 60-second idle timeout on watches is also preventing our post-install hooks from succeeding. Our cluster is running 1.9.6.

@m1o1

m1o1 commented Jun 22, 2018

We tested with 1.10.3 and it still fails. I think an API server configuration setting is responsible for this.

@m1o1

m1o1 commented Jun 25, 2018

There's another issue with Istio that might be related to this watch timeout issue. Looking at the mixer's logs with kubectl logs -n istio-system istio-telemetry-54b5bf4847-l9qqt mixer, I see the following:

2018-06-25T10:17:50.439694Z     error   istio.io/istio/mixer/pkg/config/crd/store.go:169: Failed to watch *unstructured.Unstructured: the server was unable to return a response in the time allotted, but may still be processing the request (get kuberneteses.config.istio.io)

The container never crashes; it just logs this error repeatedly. This does not happen on Minikube.

@rite2nikhil

We have rolled out fixes. Please provide feedback if there are issues with new cluster creates.

@StianOvrevage

We had problems with 1.10.3 and the prometheus-operator chart. Will try again today on a new cluster.

prometheus-operator/prometheus-operator#1514 (comment)

@m1o1

m1o1 commented Jul 5, 2018

We still have the watch timeout issue for a default installation of Istio (RBAC enabled in AKS).

Here are the logs:

$ k logs -n kube-system tiller-deploy-84f4c8bb78-klc5h
[main] 2018/07/05 14:49:39 Starting Tiller v2.8.2 (tls=false)
[main] 2018/07/05 14:49:39 GRPC listening on :44134
...
[tiller] 2018/07/05 14:49:58 executing 2 post-install hooks for istio
[kube] 2018/07/05 14:49:58 building resources from manifest
[kube] 2018/07/05 14:49:58 creating 1 resource(s)
[kube] 2018/07/05 14:49:59 Watching for changes to Job istio-cleanup-old-ca with timeout of 5m0s
[kube] 2018/07/05 14:49:59 Add/Modify event for istio-cleanup-old-ca: ADDED
[kube] 2018/07/05 14:49:59 istio-cleanup-old-ca: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2018/07/05 14:50:59 warning: Release istio post-install istio/charts/security/templates/cleanup-old-ca.yaml could not complete: watch closed before Until timeout
[tiller] 2018/07/05 14:50:59 warning: Release "istio" failed post-install: watch closed before Until timeout
[storage] 2018/07/05 14:50:59 updating release "istio.v1"
[tiller] 2018/07/05 14:50:59 failed install perform step: watch closed before Until timeout

Watch still closes after 1 minute.

Edit* Even though the 1-minute timeout issue is still not fixed, the mixer issue above (about watching *unstructured.Unstructured) seems to be resolved now.

I think I found a hacky workaround for now: if I make sure the image for the post-install Job exists on every node before the hook runs (by creating a DaemonSet with that image and overriding the command with an infinite sleep loop), the Job takes less than a minute, so we don't hit the watch timeout failure. Not ideal, but I wanted to report my findings; a rough sketch follows.
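Roughly, the pre-pull DaemonSet looks something like this (the image name is only a placeholder; substitute whatever image your chart's hook Job actually runs):

$ kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-hook-image
spec:
  selector:
    matchLabels:
      app: prepull-hook-image
  template:
    metadata:
      labels:
        app: prepull-hook-image
    spec:
      containers:
      - name: prepull
        # placeholder image: use the image the post-install Job runs
        image: k8s.gcr.io/hyperkube:v1.10.5
        # keep the pod running so the image stays cached on every node
        command: ["/bin/sh", "-c", "while true; do sleep 3600; done"]
EOF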

@StianOvrevage

Created a new AKS cluster just now and tried to install prometheus-operator:

$ helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/

$ helm install coreos/prometheus-operator --name prometheus-operator --set serviceMonitorsSelector="{}" --debug
[debug] Created tunnel using local port: '4917'

[debug] SERVER: "127.0.0.1:4917"

[debug] Original chart version: ""
[debug] Fetched coreos/prometheus-operator to /root/.helm/cache/archive/prometheus-operator-0.0.27.tgz

[debug] CHART PATH: /root/.helm/cache/archive/prometheus-operator-0.0.27.tgz

Error: watch closed before Until timeout

@blackbaud-brandonstirnaman

Can confirm @StianOvrevage's comment. A cluster created yesterday (1.10.3) fails the same way.

[debug] Created tunnel using local port: '59464'

[debug] SERVER: "localhost:59464"

[debug] Original chart version: ""
[debug] Fetched coreos/prometheus-operator to /Users/brandon.stirnaman/.helm/cache/archive/prometheus-operator-0.0.27.tgz

[debug] CHART PATH: /Users/brandon.stirnaman/.helm/cache/archive/prometheus-operator-0.0.27.tgz

Error: watch closed before Until timeout

@StianOvrevage

Looks like this might be working on 1.10.5:

$ helm install coreos/prometheus-operator --name prometheus-operator
NAME:   prometheus-operator
LAST DEPLOYED: Mon Jul 16 11:48:55 2018
NAMESPACE: default
STATUS: DEPLOYED

@m1o1

m1o1 commented Jul 17, 2018

Istio now seems to get past the "watch closed before Until timeout" issue, but still fails with "timed out waiting for the condition". Tried with Helm 2.8.2 (kubectl client 1.9.1) and Helm 2.9.1 (kubectl client 1.10.5). I gave it a 16-minute timeout window. Tiller logs:

$ k logs -n kube-system tiller-deploy-5c688d5f9b-mltcv
[main] 2018/07/17 13:05:25 Starting Tiller v2.9.1 (tls=false)
--snip--
[tiller] 2018/07/17 13:07:59 executing 2 post-install hooks for istio
[kube] 2018/07/17 13:07:59 building resources from manifest
[kube] 2018/07/17 13:07:59 creating 1 resource(s)
[kube] 2018/07/17 13:07:59 Watching for changes to Job istio-cleanup-old-ca with timeout of 16m40s
[kube] 2018/07/17 13:07:59 Add/Modify event for istio-cleanup-old-ca: ADDED
[kube] 2018/07/17 13:07:59 istio-cleanup-old-ca: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2018/07/17 13:24:39 warning: Release istio post-install istio/charts/security/templates/cleanup-old-ca.yaml could not complete: timed out waiting for the condition
[tiller] 2018/07/17 13:24:39 warning: Release "istio" failed post-install: timed out waiting for the condition
[storage] 2018/07/17 13:24:39 updating release "istio.v1"
[tiller] 2018/07/17 13:24:39 failed install perform step: timed out waiting for the condition
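For reference, the 16m40s watch window corresponds to a 1000-second hook timeout; in Helm 2 that is set with the --timeout flag, which takes a value in seconds. The chart path below is only illustrative:

$ helm install <path-to-istio-chart> --name istio --namespace istio-system --timeout 1000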

It works on Minikube. Here are the Minikube logs (Minikube 0.25.2, Kubernetes 1.9.4, Helm 2.9.1; I can't get a 1.10.0 Minikube cluster started on Windows for some reason):

[tiller] 2018/07/17 13:42:22 executing 2 post-install hooks for istio
[kube] 2018/07/17 13:42:22 building resources from manifest
[kube] 2018/07/17 13:42:22 creating 1 resource(s)
[kube] 2018/07/17 13:42:22 Watching for changes to Job istio-cleanup-old-ca with timeout of 16m40s
[kube] 2018/07/17 13:42:22 Add/Modify event for istio-cleanup-old-ca: ADDED
[kube] 2018/07/17 13:42:22 istio-cleanup-old-ca: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[kube] 2018/07/17 13:44:16 Add/Modify event for istio-cleanup-old-ca: MODIFIED
[tiller] 2018/07/17 13:44:16 deleting post-install hook istio-mixer-post-install for release istio due to "before-hook-creation" policy
[kube] 2018/07/17 13:44:16 Starting delete for "istio-mixer-post-install" Job
[kube] 2018/07/17 13:44:16 Using reaper for deleting "istio-mixer-post-install"
[kube] 2018/07/17 13:44:16 jobs.batch "istio-mixer-post-install" not found
[kube] 2018/07/17 13:44:16 building resources from manifest
[kube] 2018/07/17 13:44:16 creating 1 resource(s)
[kube] 2018/07/17 13:44:16 Watching for changes to Job istio-mixer-post-install with timeout of 16m40s
[kube] 2018/07/17 13:44:16 Add/Modify event for istio-mixer-post-install: ADDED
[kube] 2018/07/17 13:44:16 istio-mixer-post-install: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[kube] 2018/07/17 13:44:16 Add/Modify event for istio-mixer-post-install: MODIFIED
[kube] 2018/07/17 13:44:16 istio-mixer-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[kube] 2018/07/17 13:44:26 Add/Modify event for istio-mixer-post-install: MODIFIED
[tiller] 2018/07/17 13:44:26 hooks complete for post-install istio
[tiller] 2018/07/17 13:44:26 deleting post-install hook istio-cleanup-old-ca for release istio due to "hook-succeeded" policy
[kube] 2018/07/17 13:44:26 Starting delete for "istio-cleanup-old-ca" Job
[kube] 2018/07/17 13:44:26 Using reaper for deleting "istio-cleanup-old-ca"
[storage] 2018/07/17 13:44:26 updating release "istio.v1"

Edit* The workaround of deploying a DaemonSet with the ~700MB hyperkube image before installing Istio with Helm still seems to work, though (client/server: 1.9.1/1.9.6 and 1.10.5/1.10.5). Tiller logs of this successful run:

[tiller] 2018/07/17 16:53:13 executing 2 post-install hooks for istio
[kube] 2018/07/17 16:53:13 building resources from manifest
[kube] 2018/07/17 16:53:13 creating 1 resource(s)
[kube] 2018/07/17 16:53:13 Watching for changes to Job istio-cleanup-old-ca with timeout of 5m0s
[kube] 2018/07/17 16:53:13 Add/Modify event for istio-cleanup-old-ca: ADDED
[kube] 2018/07/17 16:53:13 istio-cleanup-old-ca: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[kube] 2018/07/17 16:53:28 Add/Modify event for istio-cleanup-old-ca: MODIFIED
[tiller] 2018/07/17 16:53:28 deleting post-install hook istio-mixer-post-install for release istio due to "before-hook-creation" policy
[kube] 2018/07/17 16:53:36 Starting delete for "istio-mixer-post-install" Job
[kube] 2018/07/17 16:53:36 Using reaper for deleting "istio-mixer-post-install"
[kube] 2018/07/17 16:53:36 jobs.batch "istio-mixer-post-install" not found
[kube] 2018/07/17 16:53:36 building resources from manifest
[kube] 2018/07/17 16:53:36 creating 1 resource(s)
[kube] 2018/07/17 16:53:36 Watching for changes to Job istio-mixer-post-install with timeout of 5m0s
[kube] 2018/07/17 16:53:36 Add/Modify event for istio-mixer-post-install: ADDED
[kube] 2018/07/17 16:53:36 istio-mixer-post-install: Jobs active: 0, jobs failed: 0, jobs succeeded: 0
[kube] 2018/07/17 16:53:36 Add/Modify event for istio-mixer-post-install: MODIFIED
[kube] 2018/07/17 16:53:36 istio-mixer-post-install: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[kube] 2018/07/17 16:53:41 Add/Modify event for istio-mixer-post-install: MODIFIED
[tiller] 2018/07/17 16:53:41 hooks complete for post-install istio
[tiller] 2018/07/17 16:53:41 deleting post-install hook istio-cleanup-old-ca for release istio due to "hook-succeeded" policy
[kube] 2018/07/17 16:53:41 Starting delete for "istio-cleanup-old-ca" Job
[kube] 2018/07/17 16:53:41 Using reaper for deleting "istio-cleanup-old-ca"

@necevil

necevil commented Jul 18, 2018

I can confirm that (as of today), in my case, upgrading to 1.10.5 via the docs here (https://docs.microsoft.com/en-us/azure/aks/upgrade-cluster) fixed the issue, and I was able to install Prometheus via Helm.
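For reference, the upgrade described in those docs boils down to roughly the following (resource group and cluster name are placeholders):

$ az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table
$ az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.10.5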

My guess is this one can be closed but it would be great to see if anyone else has some feedback.

@StianOvrevage

This is now failing again on 1.10.5:

helm install coreos/prometheus-operator --name prometheus-operator
Error: watch closed before Until timeout

:(

@rite2nikhil

The fix has not rolled out yet, as it did not make it into the release last week. Apologies for the delay; I expect the rollout to start today and reach all production regions by the end of the week. To expedite a patch, email [email protected] with your subscription ID, resource group, and resource ID (cluster name).

@yanivoliver

@rite2nikhil - The issue is still occurring in a newly created AKS cluster.

[kube] 2018/08/05 12:55:05 Starting delete for "cpanel" Job
[kube] 2018/08/05 12:55:05 Using reaper for deleting "cpanel"
[kube] 2018/08/05 12:55:05 jobs.batch "cpanel" not found
[kube] 2018/08/05 12:55:05 building resources from manifest
[kube] 2018/08/05 12:55:05 creating 1 resource(s)
[kube] 2018/08/05 12:55:05 Watching for changes to Job cpanel with timeout of 0s
[kube] 2018/08/05 12:55:05 Add/Modify event for cpanel: ADDED
[kube] 2018/08/05 12:55:05 cpanel: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
[tiller] 2018/08/05 12:56:05 warning: Release excited-panther post-install data-bootstrap/charts/deployment/cpanel-bootstrap.yaml could not complete: watch closed before Until timeout
[tiller] 2018/08/05 12:56:05 warning: Release "excited-panther" failed post-install: watch closed before Until timeout
[storage] 2018/08/05 12:56:05 updating release "excited-panther.v1"
[tiller] 2018/08/05 12:56:05 failed install perform step: watch closed before Until timeout

@yarinm

yarinm commented Aug 5, 2018

This happens to me as well with a newly created cluster.

@ckanljung

ckanljung commented Aug 6, 2018

This still happens for me with the post-install stage of prometheus-operator on a newly created cluster. I've tested with both 1.9.6 and 1.10.6. Region is West Europe.

This worked fine on a newly created cluster (1.9.6) last week, though.

@rite2nikhil

The fix for Helm/Tiller will get rolled out by the end of next week, so if this is urgent please send your cluster info to [email protected].

@twendt

twendt commented Aug 9, 2018

Is there a way to see whether the fix is in when we deploy a new cluster? Is it somehow tied to the acsengineVersion tag?
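I don't know whether the acsengineVersion tag tracks this particular fix, but you can at least list the tags on the resources in the auto-created node resource group (the MC_<resourceGroup>_<clusterName>_<region> name below is the usual AKS pattern):

$ az resource list --resource-group MC_<resourceGroup>_<clusterName>_<region> --query "[].{name:name, tags:tags}" --output json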

@StianOvrevage

This still fails on 1.11.1 on a newly created cluster today, @rite2nikhil. Does this mean the fix does not fix it, that these are several different problems, or that it does not apply to new clusters yet?

I also agree with @twendt. Being able to see which version gets deployed, and maybe a changelog and/or the status of known bugs, would be nice.

@tkaepp

tkaepp commented Aug 22, 2018

@StianOvrevage I had the same problem with prometheus-operator. I was able to deploy it successfully with the --debug parameter. Without the parameter it failed after 60s; with the parameter it completed after ~10-15s.

The cluster was created yesterday with K8s version 1.11.1.

@m1o1

m1o1 commented Aug 22, 2018

@tkaepp is it possible that it succeeded the second time because the image was already cached on the nodes? Kubernetes will keep the image on a node for 5(?) minutes after nothing is using it anymore. If Kubernetes does not have to pull the images, some helm installs work (because they're able to complete before the 1-minute timeout). This is actually the basis of a workaround I used to helm install Istio, where I deploy a DaemonSet with the hyperkube image before installing; that way the post-install jobs can complete much more quickly (and within the 1-minute timeout period).

I suspect that if you tried the --debug parameter on a freshly created AKS cluster, it would fail similarly.

@doog33k

doog33k commented Aug 22, 2018

Totally agree with you, @StianOvrevage.
I just updated my cluster to 1.11.1 and did a test, and I'm still not able to do my helm upgrade because of:
"Error: UPGRADE FAILED: watch closed before Until timeout"

@jskulavik

@tkaepp and @plc402 This is a known issue that in most cases can be quickly remedied. Please file a support request via portal.azure.com and link to this issue. That way, we will be able to provide a fix. Thank you!

@doog33k

doog33k commented Aug 23, 2018

@jskulavik
Thanks, I did that yesterday, but for now they answered me with the following:
"With Helm being a 3rd party product I am not sure of it’s behavior, but my best guess is that the watch has a timeout of 1m or 60s and it timed out before the operation completed on the Azure side, but the error it gives is misleading as the upgrade did continue and complete on the Azure side."

@jskulavik

Hi @plc402,

Please reply to support requesting that they assign the case to me and we will look at your cluster. Thank you.

@m1o1

m1o1 commented Sep 6, 2018

Tested with a new cluster today on Kubernetes 1.11.2, Helm 2.10.0 and Istio 1.0.1, and the helm install worked (without the DaemonSet workaround)!

Still having issues with watches and listing resources (interacting with the API server in general), but this issue in particular (helm installing with post-install jobs) seems fixed for me.

Edit* Region was centralus

@nphmuller

nphmuller commented Sep 13, 2018

Can also confirm this is fixed. Upgrading to 1.11.2 wasn't enough; I had to redeploy the cluster.

Update: Region is West-Europe

Never mind, it's happening again:

E0914 09:05:33.142333       1 streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 131; INTERNAL_ERROR
[tiller] 2018/09/14 09:05:33 warning: Release myapp post-install myapp/templates/job.yaml could not complete: watch closed before Until timeout
[tiller] 2018/09/14 09:05:33 warning: Release "myapp" failed post-install: watch closed before Until timeout
[storage] 2018/09/14 09:05:33 updating release "myapp.v1"
[tiller] 2018/09/14 09:05:33 failed install perform step: watch closed before Until timeout

@StianOvrevage

Not fixed here on 1.11.2 on a cluster created 16 minutes ago.

helm install coreos/prometheus-operator --name prometheus-operator
Error: watch closed before Until timeout

bash-4.4# kubectl get nodes
NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-28497563-0   Ready     agent     16m       v1.11.2

@nemzes

nemzes commented Sep 18, 2018

Just created a cluster with 1.11.2; it worked this time. 🤷‍♂️

@StianOvrevage

Fails:

Installing helm chart for prometheus-operator
helm install coreos/prometheus-operator --name prometheus-operator
Error: watch closed before Until timeout

@jskulavik

Please see issue #676, which we are actively working to address.

@StianOvrevage

Thanks @jskulavik. Any idea how to implement the workaround of setting the KUBERNETES_* env vars for Helm? Add them to the tiller Deployment, maybe?

@jskulavik

Hi @StianOvrevage, yes, that would be the best place to start in this case, given that you're running into Helm issues.
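A minimal sketch of that approach, assuming the workaround referenced in #676 is to point in-cluster clients at the API server's FQDN (the host value below is a placeholder; take it from your kubeconfig or from kubectl cluster-info):

$ kubectl -n kube-system set env deployment/tiller-deploy \
    KUBERNETES_SERVICE_HOST=<your-cluster-api-fqdn> \
    KUBERNETES_SERVICE_PORT=443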

@f4tq

f4tq commented Oct 16, 2018

I was hitting this "watch closed before Until timeout" with AKS 1.11.3 and Tiller 2.11 when I created the cluster with az aks create --node-name xxx ....
The moment I removed --node-name, Istio 1.0.1 installed cleanly. I had been installing cleanly for a while, but got sick of all my cluster worker nodes being named aks-nodepool1-xxx (i.e. the node pool default); with the custom node name I got the watch timeout.
I hope this provides a clue (and a workaround) for others.

@nphmuller

This seems to be fixed by: #676 (opt-in preview atm)

@jnoller
Contributor

jnoller commented Apr 3, 2019

Closing this issue as old/stale/resolved.

Note:
If this issue still comes up, please confirm you are running the latest AKS release. If you are on the latest release and the issue can be re-created outside of your specific cluster, please open a new GitHub issue.

If you are only seeing this behavior on clusters with a unique configuration (such as custom DNS/VNet/etc) please open an Azure technical support ticket.

@jnoller jnoller closed this as completed Apr 3, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Aug 6, 2020