linkerd-destination seemingly serving incorrect Pod IPs causing connection errors #8956
You could try running:
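For example, a diagnostics query against the affected authority (the exact command isn't preserved here; the service name is taken from the issue body) might look like:
$ linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317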
I tried this locally with a trivial setup and wasn't able to encounter this error easily. Set up a local cluster with multiple nodes:
After installing Linkerd, install a daemonset application:
Delete one instance of one of the servers:
Check each client pod for errors:
Unfortunately, there were no errors.
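A rough sketch of that reproduction, assuming a k3d cluster; the manifests, labels, and pod names here are illustrative rather than the exact ones used:
# Multi-node local cluster
$ k3d cluster create repro --agents 3

# With Linkerd already installed, deploy a meshed DaemonSet "server" and a meshed client
$ kubectl apply -f server-daemonset.yaml -f client-deployment.yaml   # hypothetical manifests

# Delete one instance of one of the servers
$ kubectl get pods -l app=server -o wide
$ kubectl delete pod server-abc12   # placeholder pod name

# Check each client pod's proxy for connect errors
$ for p in $(kubectl get pods -l app=client -o name); do
    kubectl logs "$p" -c linkerd-proxy | grep -i "failed to connect"
  done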
I wasn't aware of that. I can send full client debug logs, but this will be a sizeable file; how would you like me to share them?
Yeah, it's going to port-forward to a random destination controller.
Easiest would probably be to attach a gzip to the issue? Or you could try to isolate the ~minute around the target pod being deleted.
Ok, so the most recent example of this is the pod The IP for this pod is I've attached the logs from the client proxy showing connections to this IP - offer-price-index-service-ie-66bbbfcff-f8kbg-debug-logs.txt Around the time the pod dies, at 14:56, I get errors from two out of three destination pods:
Around this time, it also seems we have some shuffling around of linkerd-destination pods. The two destination pods that restarted here were running on preemptible nodes; I assume the node got killed at this time, so two new replicas came up. The otel-agent pod that died at 14:56 would also have been because of the preemptible nodes disappearing. I'm not sure if this helps or makes things more confusing 🤯 One of the destination pods has outlived the other two during this period because it was not on a preemptible node. I don't understand why the client is still trying, hours later, to connect to this dead IP.
I'm going to move the destination pods to non-preemptible nodes to hopefully get more stability, but I wouldn't expect this to be causing issues (unless the client proxy is caching this address for whatever reason, given that diagnostics gives the correct output).
Hi, I wanted to follow up on this as we're still seeing this behaviour, only I've just caught it via diagnostics for the first time. Log entry from our workload:
Diagnostics thinks this pod IP is still valid:
However that pod doesn't exist:
Looking at metrics for this, I can see the pod above stopped around 7am, but Linkerd is still sending traffic to it:
Just digging into destination logs, and I have a number of the following:
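One way to pull those entries out of the destination controller, assuming the container name and the error text quoted later in the thread:
$ kubectl logs -n linkerd deploy/linkerd-destination -c destination --since=12h | grep "not found"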
In the last 12 hours, this log entry has appeared for the following pods:
otel-agent is a daemonset, and the above pods all have one thing in common: they all run on preemptible GKE nodes. I wonder if there is a signal not being received when these nodes are preempted and removed?
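To see which nodes and pods are in that category, assuming GKE's standard preemptible node label:
# Preemptible GKE nodes carry this label
$ kubectl get nodes -l cloud.google.com/gke-preemptible=true
# otel-agent pods and the nodes they run on
$ kubectl get pods -n open-telemetry -o wide | grep otel-agent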
I'm not sure if it's of any help but Google lists the preemption termination process here. I've been digging further into this with the help of @mateiidavid this morning via a Slack thread here. To start with, I'm just comparing the output of diagnostics to kubectl to prove there is a problem:
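The comparison was roughly of this shape (the label selector is an assumption; the authority and namespace come from earlier in the thread):
# Linkerd's view of the endpoints
$ linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317
# The pods that actually exist, with their IPs, for cross-checking
$ kubectl get pods -n open-telemetry -l app=otel-agent -o wide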
Matei advised I can use the diagnostics script to retrieve the endpoints for otel-agent from each destination pod individually. The diagnostics command above simply port-forwards to the deployment, thus you get a "random" response from one of your replicas. I wanted to confirm the view of the world from each destination pod, so I ran:
I ran this for each destination pod (I have three replicas in HA mode) and got wildly different results. I took the get stream, exported it to a sheet, and regex'd around to get counts. Note: I repeated this 3/4 times for each pod to make sure it wasn't a one-off. In a nutshell, all current otel-agent pods exist in each destination pod, but so do a large number of pods that were deleted many hours ago. Each red-highlighted pod no longer exists but is still registered as a target. The pods I've checked so far have had a corresponding log entry, which may or may not be relevant:
As for this pod, I have no idea how to proceed other than disabling HA mode and seeing if it improves. This is now our major blocker to rolling out Linkerd to our production clusters, and I'm reluctant to go to production without HA mode.
This seems to be quite easily reproducible by deleting a GKE node:
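The exact commands aren't preserved in this comment; on GKE, deleting a node generally looks something like the following (node, instance, and zone names are placeholders):
# Drain and remove the Kubernetes Node object
$ kubectl drain gke-pool-default-abc123 --ignore-daemonsets --delete-emptydir-data
$ kubectl delete node gke-pool-default-abc123
# Or delete the underlying VM so the managed instance group replaces it
$ gcloud compute instances delete gke-pool-default-abc123 --zone europe-west1-b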
I have tried manually deleting two nodes with the above commands (one preemptible and one with standard provisioning), and both pods still appear in the diagnostics endpoint. I've left it a good 30+ minutes and that pod is still appearing in the diag endpoints. Now, interestingly: deleting a node and monitoring for a statefulset or deployment pod does not cause this issue. The statefulset/deployment pods disappear from the diagnostics immediately and reappear when the new pod reaches Running status. I then tried repeating this exercise for a different daemonset in the same cluster, deleting a node and watching diagnostics. The pod disappeared immediately. The difference between the two daemonsets is that one is using a ClusterIP service and the other is using a headless ClusterIP service. The service for otel-agent:
So, I created a headless service to sit alongside this:
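The manifest isn't captured above, but a headless counterpart of the existing service would look roughly like this (the service name, selector label, and single port are assumptions; the real service exposes several ports including 4317):
$ kubectl apply -n open-telemetry -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: otel-agent-headless   # hypothetical name
spec:
  clusterIP: None             # headless: resolves directly to pod IPs
  selector:
    app: otel-agent           # assumed selector label
  ports:
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
EOF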
and then repeated the same test:
As you can see, the pod disappeared immediately from the diagnostics for the headless service, but remains in the ClusterIP service. I'm going to keep testing this headless vs non-headless theory with deployments/statefulsets, as I can't remember whether the deployment and STS I tried had headless or non-headless services, but this is promising.
I checked our development cluster this morning, which runs with only one replica of destination, and can confirm the same issue is present, so this is not HA-related.
Hey @dwilliams782, thanks for the investigation and updates on this. I attempted to reproduce this today without luck. I'm going to read through things again to make sure I didn't miss anything, but I'll leave my reproduction steps below. It'd be helpful if you can test this on GKE, as I'm using k3d and not seeing any issues.
Setup
I'm using the following nginx DaemonSet with a ClusterIP service:
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  selector:
    app: nginx-ds
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx-ds
spec:
  selector:
    matchLabels:
      app: nginx-ds
  template:
    metadata:
      labels:
        app: nginx-ds
      annotations:
        linkerd.io/inject: enabled
    spec:
      containers:
      - name: nginx
        image: nginx
I have Linkerd installed, all on the same node:
$ kubectl get -n linkerd pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
linkerd-proxy-injector-77f864cf57-8nfd9 2/2 Running 0 56s 10.42.1.16 k3d-k3s-default-server-0 <none> <none>
linkerd-identity-856d7bd7c-fj52r 2/2 Running 0 56s 10.42.1.15 k3d-k3s-default-server-0 <none> <none>
linkerd-destination-8cbf85b67-2jgf6 4/4 Running 0 56s 10.42.1.14 k3d-k3s-default-server-0 <none> <none>
I create a second node and then begin the test:
$ k3d node create test
...
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k3d-k3s-default-server-0 Ready control-plane,master 15m v1.22.6+k3s1
k3d-test-0 NotReady <none> 3s v1.22.6+k3s1
Attempted Reproduction
Create the nginx DaemonSet and Service and check Linkerd's view:
$ kubectl apply -f ~/Projects/tmp/nginx-daemonset.yaml
service/nginx-svc created
daemonset.apps/nginx-ds created
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-ds-x5w55 2/2 Running 0 99s 10.42.1.17 k3d-k3s-default-server-0 <none> <none>
nginx-ds-jfh9b 2/2 Running 0 97s 10.42.2.2 k3d-test-0 <none> <none>
$ linkerd diagnostics endpoints nginx-svc.default.svc.cluster.local:80
NAMESPACE IP PORT POD SERVICE
default 10.42.1.17 80 nginx-ds-x5w55 nginx-svc.default
default 10.42.2.2 80 nginx-ds-jfh9b nginx-svc.default
Delete the node:
$ kubectl delete node k3d-test-0
node "k3d-test-0" deleted
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-ds-x5w55 2/2 Running 0 111s
Instantly reflected through Linkerd:
$ linkerd diagnostics endpoints nginx-svc.default.svc.cluster.local:80
NAMESPACE IP PORT POD SERVICE
default 10.42.1.17 80 nginx-ds-x5w55 nginx-svc.default
Let me know if you think I'm missing anything obvious, but from your more recent comments I think everything is there and this is not running Linkerd in HA mode.
To be clear, I do think there is a bug here in the destination service, given that you can create two Services that target the same DaemonSet and one stays up to date while the other returns stale endpoints; the issue now is reducing this down to a more minimal example so that the bug is easier to track down. As I stated above, it'd be great if you can try the same reproduction on GKE and let me know if you see anything different. I'd be surprised if GKE was the culprit here, but it could be something different about how Pods are de-scheduled on Nodes that are shutting down compared to k3d (or something else). The log line that you are seeing, `unable to fetch pod ...: pod ... not found`, is definitely relevant. This is occurring in the destination controller when trying to update its view of endpoints. It tries to look up the Pod with the k8s API, expecting to find it, and the lookup fails and returns this error. It seems like this should lead to the address being cleared, but that doesn't appear to be happening. If you're unable to see any more logs than that, it may be worth adding some additional log lines and testing with that.
Hi Kevin, thanks for attempting to recreate. Like you, I have been unable to recreate this with any other resource: I've tried deployments, statefulsets, even additional daemonsets with both headless and ClusterIP services, and cannot reproduce this outside of the affected otel-agent daemonset. A colleague this morning pointed out that there are multiple ports exposed on the service and daemonset, so I tried diagnostics for all ports:
You can see from this that every other service:port combination is correct, and even not specifying a port is correct; however, the profile for 4317 is incorrect?
Well, my quick update: I have tried to fix it, but my knowledge of Linkerd internals is not enough, as it has quite complex, threaded, cached internals. So, Linkerd syncs addresses from endpointslices.discovery.k8s.io (
Once Deployment restarted
How to reproduce:
Open questions: when, and why, does
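One way to inspect those EndpointSlices directly for the service in this issue (the namespace and service name come from the issue body):
# EndpointSlices for a Service are selected by the kubernetes.io/service-name label
$ kubectl get endpointslices -n open-telemetry -l kubernetes.io/service-name=otel-agent
# Compare the endpoint IPs against the pods that actually exist
$ kubectl get pods -n open-telemetry -o wide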
For context, Eugene is my (far more experienced!) colleague and has been looking into this today. We have disabled endpointSlices for now. Edit: thought I'd add that I just noticed the stable version 2.11 sets endpointSlices to false, and the edge release sets it to true by default.
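For reference, the toggle being discussed appears to be the enableEndpointSlices value; assuming a CLI-managed control plane, switching it off might look like:
$ linkerd upgrade --set enableEndpointSlices=false | kubectl apply -f -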
@eugenepaniot In the destination logs that you posted above, had you made any code changes, or are they from an edge release version? I see the original issue was using edge-22.7.1. In regards to the reproduction, I attempted this again. Have you still only seen this occur with the otel-agent daemonset?
@kleimkuhler yes, I added some debug things.
…mmand (#9200) Closes #9141. This introduces the `--destination-pod` flag to the `linkerd diagnostics endpoints` command, which allows users to target a specific destination Pod when there are multiple running in a cluster. This can be useful for issues like #8956, where Linkerd HA is installed and there seem to be stale endpoints in the destination service. Being able to run this command helps identify which destination Pods (if not all) have an incorrect view of the cluster. Signed-off-by: Kevin Leimkuhler <[email protected]>
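A usage sketch of the new flag against the service from this issue (the destination pod name is a placeholder):
# Query a specific destination replica rather than whichever one the port-forward hits
$ linkerd diagnostics endpoints \
    --destination-pod linkerd-destination-xxxxxxxxxx-xxxxx \
    otel-agent.open-telemetry.svc.cluster.local:4317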
@kleimkuhler Thanks for the new flag. We've been unable to reproduce this with anything other than this one particular daemonset, service and port combination. Every other port for that service reported correctly. Since we last posted, this daemonset has undergone a big version bump, and we also removed another deployment/service in the same namespace which also used the same port (4317) but had different selectors. This should make zero difference, but since these changes I've been unable to reproduce. We've now replaced this daemonset with a deployment (massively reducing replica count) and re-enabled endpointSlices, and it looks to be stable so far. There definitely is a bug here, somewhere, but I have no idea what more I can do to try and reproduce it. I may well spin up a new GKE cluster with only Linkerd and the same otel setup we had when we encountered this bug, but that will need to wait until I have some time.
Ok, thanks for the update on this. Sounds like a lot has changed on your end; glad to hear things are stable for you now, but I'm also very interested to hear if you end up being able to reproduce this! I'll keep this open for now; let me know if anything changes on your end.
What is the issue?
Slack thread here.
One of our application proxies is full of failed-to-connect errors. Examples:
Debug logs aren’t much use:
The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so pods are constantly switching with our preemptible nodes. I caught a failed request in Tap, where all other requests are ~1ms:
I then found this log entry:
10.225.13.4 was for a pod, otel-agent-6kg56, until yesterday (21/07) at 13:52. That IP then got reassigned to pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is sending requests to IPs that no longer belong to that DaemonSet. I run three replicas of linkerd-destination, which at the time of writing had:
I searched the last two days of logs from linkerd-destination pods for any logs relating to the otel-agent pod that was no longer relevant and found two log entries:
zqxhr had since been deleted but lxl5f is still alive. My theory is that this linkerd-destination pod has entered an unhealthy state but is still ready, and is sending "bad information" to the linkerd-proxy containers requesting destination information. I then killed the linkerd-destination pod with these errors at 13:10 and have not seen this error come back again.
How can it be reproduced?
Not confirmed, but removing pods from a Daemonset and checking if the destination updates?
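A minimal check along those lines, assuming the otel-agent DaemonSet from this report (the pod name is a placeholder):
# Delete one DaemonSet pod, then watch whether its IP drops out of Linkerd's view
$ kubectl delete pod -n open-telemetry otel-agent-xxxxx
$ watch -n 5 "linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317"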
Logs, error output, etc
See section above.
Output of linkerd check -o short (I have a mismatched CLI version, so ignore the errors):
Environment
Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response