Fix `linkerd mc check` failing in the presence of lots of mirrored services #10893

alpeb · 2023-05-12T19:51:20Z

The "all mirror services have endpoints" check can fail when there are many mirrored services: for each service we query the Kubernetes API for its endpoints, and all of those calls reuse the same Go context, which eventually reaches its deadline.

To fix, we create a new context object per call.
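As a rough illustration of the failure mode and the fix, here is a minimal, standard-library-only sketch. It is not the code from this PR: `fakeGet`, `countFailuresShared`, `countFailuresPerCall` and the constants are hypothetical names, the API call is simulated with a timer, and the numbers are the repro below scaled down by 100 (50 calls, 10ms each, against a 300ms deadline). Deriving the per-call context from `context.Background()` is also an assumption; the actual patch may build it differently.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical, scaled-down stand-ins for the numbers in the repro.
const (
	demoCalls   = 50
	demoLatency = 10 * time.Millisecond
	demoTimeout = 300 * time.Millisecond
)

// fakeGet simulates one API call: it completes after latency,
// unless ctx is done first, in which case it returns ctx.Err().
func fakeGet(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// countFailuresShared issues n sequential calls that all share a single
// deadline, so later calls fail once the accumulated latency exceeds it.
func countFailuresShared(n int, latency, timeout time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	failures := 0
	for i := 0; i < n; i++ {
		if err := fakeGet(ctx, latency); err != nil {
			failures++
		}
	}
	return failures
}

// countFailuresPerCall gives every call its own fresh deadline, so the
// total duration of the loop no longer matters.
func countFailuresPerCall(n int, latency, timeout time.Duration) int {
	failures := 0
	for i := 0; i < n; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := fakeGet(ctx, latency)
		cancel() // release the timer right away instead of deferring in a loop
		if err != nil {
			failures++
		}
	}
	return failures
}

func main() {
	fmt.Println("shared context failures:  ", countFailuresShared(demoCalls, demoLatency, demoTimeout))
	fmt.Println("per-call context failures:", countFailuresPerCall(demoCalls, demoLatency, demoTimeout))
}
```

With these values the shared-context loop needs at least 500ms in total, so it should report a nonzero number of failures, while the per-call variant should report none.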

## Repro

First patch `check.go` to introduce a sleep in order to simulate network latency:

diff --git a/multicluster/cmd/check.go b/multicluster/cmd/check.go
index b2b4158bf..f3083f436 100644
--- a/multicluster/cmd/check.go
+++ b/multicluster/cmd/check.go
@@ -627,6 +627,7 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
        for _, svc := range mirrorServices.Items {
                // Check if there is a relevant end-point
                endpoint, err := hc.KubeAPIClient().CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
+               time.Sleep(1 * time.Second)
                if err != nil || len(endpoint.Subsets) == 0 {
                        servicesWithNoEndpoints = append(servicesWithNoEndpoints, fmt.Sprintf("%s.%s mirrored from cluster [%s]", svc.Name, svc.Namespace, svc.Labels[k8s.RemoteClusterNameLabel]))
                }

Then run the `multicluster` integration tests to set up a multicluster scenario, and create lots of mirrored services:

$ bin/docker-build

# adjust the CLI path for your own arch
$ bin/tests --name multicluster --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd

# we are currently in the target cluster context
$ k create ns testing

# create pod
$ k -n testing run nginx --image=nginx --restart=Never

# create 50 services pointing to it, flagged to be mirrored
$ for i in {1..50}; do k -n testing expose po nginx --port 80 --name "nginx-$i" -l mirror.linkerd.io/exported=true; done

# switch to the source cluster
$ k config use-context k3d-source

# this will trigger the creation of the mirrored services, wait till the
# 50 are created
$ k create ns testing

$ bin/go-run cli mc check --verbose
github.com/linkerd/linkerd2/multicluster/cmd
github.com/linkerd/linkerd2/cli/cmd
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
        * target
√ remote cluster access credentials are valid
        * target
√ clusters share trust anchors
        * target
√ service mirror controller has required permissions
        * target
√ service mirror controllers are running
        * target
DEBU[0000] Starting port forward to https://0.0.0.0:34201/api/v1/namespaces/linkerd-multicluster/pods/linkerd-service-mirror-target-7c4496869f-6xsp4/portforward?timeout=30s 39327:9999
DEBU[0000] Port forward initialised
√ probe services able to communicate with all gateway mirrors
        * target
DEBU[0031] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0032] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0033] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0034] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0035] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0036] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0037] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded

mateiidavid

Thanks @alpeb for the explanation and the fix! Looks good.

mateiidavid · 2023-05-16T14:58:22Z

multicluster/cmd/check.go

@@ -625,9 +626,12 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
 		return err
 	}
 	for _, svc := range mirrorServices.Items {
-		// Check if there is a relevant end-point
+		// have to use a new ctx for each call, otherwise we risk reaching the original context deadline


If we have lots of mirror services, I wonder if it's better to add concurrency here to also cut the time it takes to list everything. Not really important though, just a UX consideration 🤷🏻

alpeb · 2023-05-17

Yeah, the check response speed can improve as long as there's no bottleneck on the serving side. That might be interesting to explore as a follow-up.
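For reference, the concurrency idea floated in this thread could look roughly like the sketch below. It is a hypothetical follow-up, not part of this PR: `servicesWithoutEndpoints`, `lookupFunc`, `fakeLookup` and `demoTimeout` are made-up names, and the lookup is simulated rather than calling the Kubernetes API. It bounds the number of in-flight lookups with a buffered channel so the serving side is not flooded, and still uses a fresh context per call.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
	"time"
)

// demoTimeout is a hypothetical per-call timeout for this sketch.
const demoTimeout = time.Second

// lookupFunc reports whether the named service has endpoints.
type lookupFunc func(ctx context.Context, name string) (bool, error)

// servicesWithoutEndpoints runs lookup for every name with at most
// `workers` calls in flight and returns, sorted, the names whose lookup
// failed or found no endpoints.
func servicesWithoutEndpoints(names []string, lookup lookupFunc, workers int, timeout time.Duration) []string {
	if workers < 1 {
		workers = 1
	}
	var (
		mu      sync.Mutex
		missing []string
		wg      sync.WaitGroup
	)
	sem := make(chan struct{}, workers) // counting semaphore

	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` lookups are running
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()

			// A fresh context per call, as in the sequential fix.
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()

			ok, err := lookup(ctx, name)
			if err != nil || !ok {
				mu.Lock()
				missing = append(missing, name)
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	sort.Strings(missing) // goroutines finish in arbitrary order
	return missing
}

// fakeLookup simulates a 10ms API call; only "nginx-3" has no endpoints.
func fakeLookup(ctx context.Context, name string) (bool, error) {
	select {
	case <-time.After(10 * time.Millisecond):
		return name != "nginx-3", nil
	case <-ctx.Done():
		return false, ctx.Err()
	}
}

func main() {
	names := []string{"nginx-1", "nginx-2", "nginx-3", "nginx-4"}
	fmt.Println(servicesWithoutEndpoints(names, fakeLookup, 2, demoTimeout))
}
```

The worker limit is the knob that addresses the serving-side concern: a small value keeps the load close to the sequential loop, while a larger one shortens the check at the cost of more simultaneous requests.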

adleong

Nice! I didn't even realize we had a hardcoded 30 second request timeout in addition to the retry deadline specified by the --wait flag. Good find and fix.

alpeb added area/cli area/multicluster labels May 12, 2023

alpeb requested a review from a team as a code owner May 12, 2023 19:51

mateiidavid approved these changes May 16, 2023

adleong approved these changes May 18, 2023

alpeb merged commit 1d064fa into main May 18, 2023

alpeb deleted the alpeb/mc-check-rate-limiting-fixup branch May 18, 2023 14:18

alpeb added this to the stable-2.13.6 milestone Aug 1, 2023
