Fix `linkerd mc check` failing in the presence of lots of mirrored services
#10893
Conversation
The "all mirror services have endpoints" check can fail in the presence of lots of mirrored services because for each service we query the kube API for its endpoints, and those calls reuse the same golang context, which ends up reaching its deadline. To fix, we create a new context object per call.

## Repro

First patch `check.go` to introduce a sleep in order to simulate network latency:

```diff
diff --git a/multicluster/cmd/check.go b/multicluster/cmd/check.go
index b2b4158bf..f3083f436 100644
--- a/multicluster/cmd/check.go
+++ b/multicluster/cmd/check.go
@@ -627,6 +627,7 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
 	for _, svc := range mirrorServices.Items {
 		// Check if there is a relevant end-point
 		endpoint, err := hc.KubeAPIClient().CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
+		time.Sleep(1 * time.Second)
 		if err != nil || len(endpoint.Subsets) == 0 {
 			servicesWithNoEndpoints = append(servicesWithNoEndpoints, fmt.Sprintf("%s.%s mirrored from cluster [%s]", svc.Name, svc.Namespace, svc.Labels[k8s.RemoteClusterNameLabel]))
 		}
```

Then run the `multicluster` integration tests to set up a multicluster scenario, and then create lots of mirrored services!

```bash
$ bin/docker-build # accommodate to your own arch
$ bin/tests --name multicluster --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd

# we are currently in the target cluster context
$ k create ns testing

# create pod
$ k -n testing run nginx --image=nginx --restart=Never

# create 50 services pointing to it, flagged to be mirrored
$ for i in {1..50}; do k -n testing expose po nginx --port 80 --name "nginx-$i" -l mirror.linkerd.io/exported=true; done

# switch to the source cluster
$ k config use-context k3d-source

# this will trigger the creation of the mirrored services, wait till the
# 50 are created
$ k create ns testing

$ bin/go-run cli mc check --verbose
github.com/linkerd/linkerd2/multicluster/cmd github.com/linkerd/linkerd2/cli/cmd
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
	* target
√ remote cluster access credentials are valid
	* target
√ clusters share trust anchors
	* target
√ service mirror controller has required permissions
	* target
√ service mirror controllers are running
	* target
DEBU[0000] Starting port forward to https://0.0.0.0:34201/api/v1/namespaces/linkerd-multicluster/pods/linkerd-service-mirror-target-7c4496869f-6xsp4/portforward?timeout=30s 39327:9999
DEBU[0000] Port forward initialised
√ probe services able to communicate with all gateway mirrors
	* target
DEBU[0031] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0032] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0033] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0034] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0035] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0036] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0037] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
```
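To make the failure mode concrete, here is a small, self-contained Go sketch (not linkerd code, just an illustration of the repro above): one shared context with a 30-second deadline is reused across 50 sequential one-second calls, so everything after roughly the 30th call fails with `context.DeadlineExceeded`.

```go
// Standalone illustration of the bug: a single context deadline is consumed
// by many sequential ~1s calls, so the remaining calls fail immediately.
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeGet stands in for a kube API request that takes about a second
// (the time.Sleep added in the repro patch above).
func fakeGet(ctx context.Context) error {
	select {
	case <-time.After(1 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err() // context deadline exceeded
	}
}

func main() {
	// One context shared by the whole loop, as in the code before the fix.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	for i := 1; i <= 50; i++ {
		if err := fakeGet(ctx); err != nil {
			fmt.Printf("service %d: %v\n", i, err)
			continue
		}
		fmt.Printf("service %d: ok\n", i)
	}
}
```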
Thanks @alpeb for the explanation and the fix! Looks good.
```diff
@@ -625,9 +626,12 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
 		return err
 	}
 	for _, svc := range mirrorServices.Items {
 		// Check if there is a relevant end-point
+		// have to use a new ctx for each call, otherwise we risk reaching the original context deadline
```
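A minimal sketch of the pattern that comment describes: derive a fresh, short-lived context for each `Endpoints` lookup instead of threading one deadline-bound context through the whole loop. This is illustrative only, not the PR's actual diff; the package, function name, and `requestTimeout` constant are made up for the example.

```go
// Sketch: a fresh context per request, so one slow iteration cannot exhaust a
// deadline shared by the whole loop.
package check

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// requestTimeout is an illustrative per-request budget.
const requestTimeout = 30 * time.Second

func mirrorServicesWithoutEndpoints(client kubernetes.Interface, services []corev1.Service) []string {
	var missing []string
	for _, svc := range services {
		// New context per call: the deadline restarts for every request instead
		// of counting down across all of them (the cause of the
		// "context deadline exceeded" errors with many mirrored services).
		ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
		ep, err := client.CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
		cancel()
		if err != nil || len(ep.Subsets) == 0 {
			missing = append(missing, fmt.Sprintf("%s.%s", svc.Name, svc.Namespace))
		}
	}
	return missing
}
```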
If we have lots of mirror services, I wonder if it's better to add concurrency here to also speed up the time it takes to list everything. Not really important though, just a ux consideration 🤷🏻
Yeah, the check response speed can improve as long as there's no bottleneck on the serving side. That might be interesting to explore as a follow-up.
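A hedged sketch of what that follow-up might look like (not part of this PR): query the endpoints for all mirrored services concurrently, bounded so the API server isn't hammered. The function name, the limit of 10, and the 30-second per-request timeout are assumptions for illustration.

```go
// Possible follow-up: bounded concurrency over the mirrored services, each
// request still getting its own fresh context as in the fix.
package check

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/errgroup"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func mirrorServicesWithoutEndpointsConcurrent(client kubernetes.Interface, services []corev1.Service) []string {
	var (
		mu      sync.Mutex
		missing []string
	)
	g := new(errgroup.Group)
	g.SetLimit(10) // cap concurrent requests to the API server

	for _, svc := range services {
		svc := svc // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			// fresh, short-lived context per request
			ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			defer cancel()
			ep, err := client.CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
			if err != nil || len(ep.Subsets) == 0 {
				mu.Lock()
				missing = append(missing, fmt.Sprintf("%s.%s", svc.Name, svc.Namespace))
				mu.Unlock()
			}
			return nil // collect results rather than failing the whole group
		})
	}
	_ = g.Wait() // no goroutine returns an error in this sketch
	return missing
}
```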
Nice! I didn't even realize we had a hardcoded 30-second request timeout in addition to the retry deadline specified by the `--wait` flag. Good find and fix.
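For context, a rough sketch (again, not linkerd's actual code) of how those two budgets can coexist: an outer retry loop bounded by a `--wait`-style deadline, with each attempt getting its own fresh per-request timeout. `retryUntil` and `perRequestTimeout` are illustrative names.

```go
// Two independent budgets: the overall wait deadline for the check as a whole,
// and a per-request timeout that resets on every attempt.
package check

import (
	"context"
	"time"
)

const perRequestTimeout = 30 * time.Second // the hardcoded per-request budget

// retryUntil keeps calling fn until it succeeds or the overall wait elapses.
// Each attempt gets a fresh context, so a long run of attempts does not eat
// into a single shared deadline.
func retryUntil(wait time.Duration, fn func(ctx context.Context) error) error {
	deadline := time.Now().Add(wait)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), perRequestTimeout)
		err := fn(ctx)
		cancel()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return err
		}
		time.Sleep(time.Second) // simple fixed backoff between attempts
	}
}
```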