Replies: 2 comments
-
Meanwhile I've tweaked the settings further, which brought slightly more stability, but it's still not ideal. Especially at peak communication via the headless services, the monitors still report a lot of 5xx errors 😢 However, it now recovers a bit more reliably.
Still grateful for any input!
-
Last week's edge release (edge-24.9.3) adds the ability to further configure the multicluster probe (#13061). If your multicluster connections aren't very reliable, relaxing those settings might help.
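For illustration, relaxing those settings could look something like the following fragment of the linkerd-multicluster Helm chart's values. This is only a sketch: the `timeout` and `failureThreshold` keys under `gateway.probe` are assumptions based on the linked change (#13061), so verify the exact key names against your chart version's values reference.

```yaml
# Sketch: relaxed gateway probe settings for the linkerd-multicluster chart.
# Keys beyond `port`/`seconds` are assumed from #13061 -- verify against
# your chart version's values.yaml before applying.
gateway:
  probe:
    port: 4191
    seconds: 10          # probe less frequently
    timeout: "30s"       # tolerate slow cross-region RTTs (assumed key)
    failureThreshold: 5  # allow more consecutive failures (assumed key)
```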
-
🧵 Follow-up discussion from Slack, as suggested
Dear Linkerd Community,
we're currently running a Linkerd multicluster setup connecting 3 managed EKS (1.28) clusters, using Linkerd.
The primary cluster is located in eu-central and connects to clusters based in US-East and AP-Southeast (so the latter two are not connected to each other).
We primarily use the `service-mirror` function of Linkerd's multicluster component to talk from the primary cluster, via mirrored service endpoints, to services on the connected EKS clusters in the other regions. The workload is push-based REST/HTTP rather than a continuous flow of data, usually only a few MB per call at most. That should be enough background information.
We're running the following Linkerd (Helm) versions on all clusters:
Issue
The issue we have is that the endpoints fail often, probably due to the RTT between the clusters. Most of the time they repair themselves, so only a percentage of calls has to be re-submitted.
However, sometimes the endpoints (created by the service-mirror component) don't get repaired at all. We see this issue more often on the EU -> US connection than on EU -> ASIA.
The linkerd-service-mirror will report something like this:
Workaround
We can only solve it temporarily with the following workaround: we keep getting 502s, 503s, and 504s until we restart the linkerd-multicluster components (`linkerd-gateway` and `linkerd-service-mirror-<remote-to-primary>`). This restores our HTTP-based connections, recovering from the errors in the linkerd-proxy sidecar:
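For anyone else hitting this, the restart described above can be done with `kubectl rollout restart`. A sketch, assuming the default `linkerd-multicluster` namespace and deployment names from the chart; substitute your actual remote cluster name for the `<remote-to-primary>` placeholder:

```shell
# Restart the multicluster gateway and the service mirror for one remote.
# Namespace/deployment names assume a default linkerd-multicluster install.
kubectl -n linkerd-multicluster rollout restart deploy/linkerd-gateway
kubectl -n linkerd-multicluster rollout restart \
  deploy/linkerd-service-mirror-<remote-to-primary>

# Wait until both rollouts have completed before re-testing the endpoints.
kubectl -n linkerd-multicluster rollout status deploy/linkerd-gateway
```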
Steps taken so far
One countermeasure we've taken is to extend the following default timeouts of the Linkerd proxy by modifying the ConfigMap, which made it a bit more stable than before:
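As a related sketch (not our exact settings, which are shown above): instead of editing the ConfigMap globally, connect timeouts can also be raised per workload via Linkerd's proxy configuration annotations on the pod template. The annotation names follow Linkerd's documented proxy configuration; the values here are purely illustrative assumptions, not recommendations.

```yaml
# Sketch: raising proxy connect timeouts on a single (hypothetical) workload
# via pod-template annotations instead of the global ConfigMap.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical workload name
spec:
  template:
    metadata:
      annotations:
        # Illustrative values -- tune to your cross-region RTTs.
        config.linkerd.io/proxy-outbound-connect-timeout: "5000ms"
        config.linkerd.io/proxy-inbound-connect-timeout: "1000ms"
```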
Question
Do you know any other parameters that we could tweak to make the multicluster communication (especially the Service Mirror Component) more reliable and stable?
Any input welcome 🙂