Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR #676

Closed
juan-lee opened this issue Oct 3, 2018 · 64 comments

Comments

@juan-lee
Copy link
Contributor

juan-lee commented Oct 3, 2018

Symptoms
Pods using the in cluster config to perform a watch on a resource will see intermittent timeouts and the following error in the pod log.

streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1; INTERNAL_ERROR

If the client performing the watch isn't handling errors gracefully, applications can get into an inconsistent state. Impacted applications include, but are not limited to, nginx-ingress and tiller (helm).

A specific manifestation of this bug is the following error when attempting a helm deployment.

Error: watch closed before Until timeout

Root Cause

  1. The configuration of azureproxy results in unexpected timeouts for outbound watches targeting kubernetes.svc.cluster.local.
  2. On the client side, the client-go implementation of watch.Until does not handle intermittent network failures gracefully. See watch closed before Until timeout, Fix broken watches, and Fix waiting in kubectl rollout status for more details.

Workaround
For the pods/containers that see the INTERNAL_ERROR in their logs add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the aks kube-apiserver FQDN is correct.

spec:
  containers:
    env:
    - name: KUBERNETES_PORT_443_TCP_ADDR
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
    - name: KUBERNETES_PORT
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_PORT_443_TCP
      value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
    - name: KUBERNETES_SERVICE_HOST
      value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
@DenisBiondic
Copy link

Is this not the same root cause as to performance degradation issue which is also still opened? #620

@necevil
Copy link

necevil commented Oct 4, 2018

I saw this tagged from the TLS Handshake error thread — #14
Can anyone confirm this as one of the issues leading to the handshake timeout?

Would the above work around with environmental variables be a potential workaround for the timeout issue?

@juan-lee
Copy link
Contributor Author

juan-lee commented Oct 4, 2018

I saw this tagged from the TLS Handshake error thread — #14
Can anyone confirm this as one of the issues leading to the handshake timeout?

Would the above work around with environmental variables be a potential workaround for the timeout issue?

@necevil I don't think this is related to #14. This issue is about calling kube-apiserver from inside of the cluster vs. outside.

@digeler
Copy link

digeler commented Oct 7, 2018

any Eta for permanent fix ?

@juan-lee
Copy link
Contributor Author

juan-lee commented Oct 7, 2018

any Eta for permanent fix ?

I don't have an ETA yet, but the fix is actively being worked on. The permanent fix will be to apply the same mitigation automatically for customer pods. I recommend applying the mitigation if you're running into this issue until we can have the fix automatically applied.

@DenisBiondic
Copy link

Wait, are you saying the workaround from above is the proper fix for this issue, and you are not working on a solution to reliably access the master plane without being routed over the public internet?

@juan-lee
Copy link
Contributor Author

juan-lee commented Oct 8, 2018

Wait, are you saying the workaround from above is the proper fix for this issue, and you are not working on a solution to reliably access the master plane without being routed over the public internet?

That's correct @DenisBiondic, we will apply the same change automatically. Just to clarify, the traffic will be routed over the azure backplane similar to how it works now except we will bypass the hop through azureproxy (kube-svc-redirect).

@DenisBiondic
Copy link

@juan-lee Will tunnelfront be removed then?

@holmesb
Copy link

holmesb commented Oct 9, 2018

This is affecting cert-manager too. Errors continue after deleting & recreating the pod.

@DenisBiondic
Copy link

Yeah, it affects each container that has a connection to the master plane (different operators, external dns, ingress controller, cert manager etc.)

@mattfarina
Copy link

mattfarina commented Oct 10, 2018

@holmesb if you're installing cert-manager via Helm you can add the following to your custom values.yaml (passed in with the -f flag):

extraEnv:
- name: KUBERNETES_PORT_443_TCP_ADDR
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
- name: KUBERNETES_PORT
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_PORT_443_TCP
  value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
- name: KUBERNETES_SERVICE_HOST
  value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io

You can get your specific URL from the portal.

Note, tiller uses client-go so it inherits the bugs client-go has as was previously pointed out.

@ttelfer
Copy link

ttelfer commented Oct 15, 2018

Also running into this problem using https://github.com/argoproj/argo-events. Currently using Azure AKS 1.11.3.

Does anyone know if there is a fix planned?

@dkipping
Copy link

dkipping commented Oct 16, 2018

Workaround
For the pods/containers that see the INTERNAL_ERROR in their logs add the following environment variables to the container spec. Be sure to replace <your-fqdn-prefix> and <region> so the aks kube-apiserver FQDN is correct.

        env:
        - name: KUBERNETES_PORT_443_TCP_ADDR
          value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io
        - name: KUBERNETES_PORT
          value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
        - name: KUBERNETES_PORT_443_TCP
          value: tcp://<your-fqdn-prefix>.hcp.<region>.azmk8s.io:443
        - name: KUBERNETES_SERVICE_HOST
          value: <your-fqdn-prefix>.hcp.<region>.azmk8s.io

Is there a possibility to inject this into a tiller-deploy pod already running on my cluster or does this have to be done at initialisation time?
Can I even pass these environment variables as parameters or is this only possbile by changing the yaml?

EDIT:

Nevermind, I found out that you have to add these lines to the deployment by kubectl edit deployment <deployment-name> -n <namespace>. Afterwards the pods will get updated.
BTW: using the standard editor (vi) caused problems for updating in my terminal, thus using KUBE_EDITOR="nano" kubectl edit deployment <deployment-name> -n <namespace> might be a good idea if you are having issues too.

@evandervalk
Copy link

evandervalk commented Oct 16, 2018

I'm running into the same issue.

I've tried the workaround described by @dkipping to no avail. Setting the environment variables on the tiller-deploy doesn't seem to work.

Edit:
I am also having trouble identifying the pod that is having the issues. So far I've only seen it happen in the logs of kube-ingress-nginx-ingress-controller. Setting the environment variables on the tiller-deploy doesn't feel right. I would expect these variables on the kube-ingress-nginx-ingress-controller. Then again setting these variables in general just feels like a hack.

@dkipping
Copy link

dkipping commented Oct 16, 2018

@evandervalk could you apply the values to the tiller-deploy deployment at least? so are they saved and still there when you print out or enter the specs of the tiller-deploy again?

If they got applied, could you check that they got applied to a (newly spawned) tiller-deploy-<id> pod too?

Edit:
Regarding your edit: you should set these values not in the tiller-deploy but on the kube-ingress-nginx-ingress-controller if the error only occurs in the logs of this pod according to @juan-lee.
But if the error occurs during the install via helm / tiller, you should carefully check tiller's logs. If it occurs there, setting these values on tiller-deploy might be needed too.

@evandervalk
Copy link

Editing the deployment went as you described. I can edit the values in the tiller-deploy and it prints them out when describing the resource.

After editing the configuration it created a new pod with the earlier mentions environment variables defined. So that also seems to work. I have the variables defined in the tiller-deploy-<id> pod.

However, after editing the configuration and making sure everything propagated, I am still seeing:
streamwatcher.go:109] Unable to decode an event from the watch stream: stream error: stream ID 1677; INTERNAL_ERROR multiple times on the kube-ingress-nginx-ingress-controller.

If I describe the pod for kube-ingress-nginx-ingress-controller I do not see these environment variables.

I would assume these environment variables should be defined on kube-ingress-nginx-ingress-controller rather than tiller-deploy?

@dkipping
Copy link

I would assume these environment variables should be defined on kube-ingress-nginx-ingress-controller rather than tiller-deploy?

Yes, sounds like it.
I guess you can't break more than not having it running by trying to add the environment variables there.

@evandervalk
Copy link

evandervalk commented Oct 16, 2018

I will try and report back.

Edit: I've removed the environment variables from tiller-deploy and added them to kube-ingress-nginx-ingress-controller. Monitoring the logs to see if I come across the error message again. So far it's looking good.

Edit 2: Setting the environment variables on the kube-ingress-nginx-ingress-controller solved my issues. Thank you @dkipping for taking the time to walk me through it.

@wutingbupt
Copy link

wutingbupt commented Oct 16, 2018 via email

@fhoy
Copy link

fhoy commented Nov 5, 2018

@weinong : Good to know. Is this being rolled out region by region, and if so, is there a set sequence? I am particularly interested in when this will be rolled out in region westeurope.

@weinong
Copy link
Contributor

weinong commented Nov 5, 2018

West Europe will be later this week. It will be applied to all new clusters at first. Then we will roll it out to all existing clusters shortly after.

@alibengtsson
Copy link

Hi,when will the fix be released for
centralus and southeastasia ?

thanks in advanced

@martinhartig
Copy link

fix has reached westeurope.
Indeed, it has solved several issues. Thx.

@LanceTheDev
Copy link

Still getting the Issue:
2018-11-12T07:49:47.964622Z error Unable to decode an event from the watch stream: stream error: stream ID 1529; INTERNAL_ERROR
2018-11-12T07:49:47.964882Z error Unable to decode an event from the watch stream: stream error: stream ID 1519; INTERNAL_ERROR

Do I need to reinstall Istio for this fix to work?

@alibengtsson
Copy link

Still getting the Issue:
2018-11-12T07:49:47.964622Z error Unable to decode an event from the watch stream: stream error: stream ID 1529; INTERNAL_ERROR
2018-11-12T07:49:47.964882Z error Unable to decode an event from the watch stream: stream error: stream ID 1519; INTERNAL_ERROR

Do I need to reinstall Istio for this fix to work?
try to only restart/delete the envoy ingress gateway and reboot/delete your pods check the logs again in ingress controller.

@wutingbupt
Copy link

Hi,

Do we need to do anything for our cluster? Or it will be applied automatically?

Br,
Tim

@andig
Copy link

andig commented Nov 12, 2018

I understood:

@strtdusty It's a feature flag they set on AKS. For now only applied to new clusters. Not sure if the feature can be disabled manually.

So you'll have to recreate the cluster.

@mmosttler
Copy link

@adinunzio84 You mentioned that with this fix there may need to be an additional ServiceEntry needed for Istio. We are having the issue described and also have Istio installed so I am curious if you have the fix and what ServiceEntry you had to add?

@juan-lee
Copy link
Contributor Author

b

Hi,

Do we need to do anything for our cluster? Or it will be applied automatically?

Br,
Tim

The fix is no longer behind a feature flag. All new clusters will get it automatically. Existing clusters will need to do a scale or upgrade to get the fix.

@wutingbupt
Copy link

@juan-lee Thanks for your reply, we rebuilt our cluster yesterday, however, I still can see this problem this morning.
We are in westeurope, and the cluster version is 1.11.3

Br,
Tim

@alibengtsson
Copy link

Hi, i redeployed with terraform in westeurope and centralus region. with nginx ingress controller. I can confirm from my side that the messages are gone from the log file.
i used version 1.11.2.

thanks for the fix.

@juan-lee
Copy link
Contributor Author

@juan-lee Thanks for your reply, we rebuilt our cluster yesterday, however, I still can see this problem this morning.
We are in westeurope, and the cluster version is 1.11.3

Br,
Tim

Can you provide some more details? Does the pod in question have the appropriate KUBERNETES_ environment variables set.

@m1o1
Copy link

m1o1 commented Nov 13, 2018

@mmosttler Not fully tested, but this kinda works:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: azmk8s-ext
  namespace: default
spec:
  hosts:
  - ${FQDN}
  location: MESH_EXTERNAL
  ports:
  - name: https
    number: 443
    protocol: HTTPS
resolution: DNS

I did cat serviceentry.yml | envsubst | kubectl apply -f - with FQDN=$(az aks show -n ${CLUSTER_NAME} -g ${CLUSTER_RG} --query "fqdn" --output tsv)

This is the closest I got to it working. The javascript Kubernetes client is happy with this, but the Go client is not.

Please let me know if you come up with something better though

@DenisBiondic
Copy link

I can confirm that the problem is gone from our clusters. (west EU)

@weinong
Copy link
Contributor

weinong commented Dec 3, 2018

thanks for reporting back. i'm closing the issue for now

@weinong weinong closed this as completed Dec 3, 2018
@dharmeshkakadia
Copy link

I am still seeing this issue even this week in east us.

@juan-lee
Copy link
Contributor Author

juan-lee commented Dec 4, 2018

I am still seeing this issue even this week in east us.

Can you elaborate on your scenario? Also, keep in mind that pods will need to be restarted in order to get the fix. You can check to see if a pod has the fix by seeing if KUBERNETES_PORT, etc env variables are set for each container.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests