Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix linkerd mc check failing in the presence of lots of mirrored services #10893

Merged
merged 1 commit into from
May 18, 2023

Conversation

alpeb
Copy link
Member

@alpeb alpeb commented May 12, 2023

The "all mirror services have endpoints" check can fail in the presence of lots of mirrored services because for each service we query the kube api for its endpoints, and those calls reuse the same golang context, which ends up reaching its deadline.

To fix, we create a new context object per call.

Repro

First patch check.go to introduce a sleep in order to simulate network latency:

diff --git a/multicluster/cmd/check.go b/multicluster/cmd/check.go
index b2b4158bf..f3083f436 100644
--- a/multicluster/cmd/check.go
+++ b/multicluster/cmd/check.go
@@ -627,6 +627,7 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
        for _, svc := range mirrorServices.Items {
                // Check if there is a relevant end-point
                endpoint, err := hc.KubeAPIClient().CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
+               time.Sleep(1 * time.Second)
                if err != nil || len(endpoint.Subsets) == 0 {
                        servicesWithNoEndpoints = append(servicesWithNoEndpoints, fmt.Sprintf("%s.%s mirrored from cluster [%s]", svc.Name, svc.Namespace, svc.Labels[k8s.RemoteClusterNameLabel]))
                }

Then run the multicluster integration tests to setup a multicluster scenario, and then create lots of mirrored services!

$ bin/docker-build

# accommodate to your own arch
$ bin/tests --name multicluster --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd

# we are currently in the target cluster context
$ k create ns testing

# create pod
$ k -n testing run nginx --image=nginx --restart=Never

# create 50 services pointing to it, flagged to be mirrored
$ for i in {1..50}; do k -n testing expose po nginx --port 80 --name "nginx-$i" -l mirror.linkerd.io/exported=true; done

# switch to the source cluster
$ k config use-context k3d-source

# this will trigger the creation of the mirrored services, wait till the
# 50 are created
$ k create ns testing

$ bin/go-run cli mc check --verbose
github.com/linkerd/linkerd2/multicluster/cmd
github.com/linkerd/linkerd2/cli/cmd
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
        * target
√ remote cluster access credentials are valid
        * target
√ clusters share trust anchors
        * target
√ service mirror controller has required permissions
        * target
√ service mirror controllers are running
        * target
DEBU[0000] Starting port forward to https://0.0.0.0:34201/api/v1/namespaces/linkerd-multicluster/pods/linkerd-service-mirror-target-7c4496869f-6xsp4/portforward?timeout=30s 39327:9999
DEBU[0000] Port forward initialised
√ probe services able to communicate with all gateway mirrors
        * target
DEBU[0031] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0032] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0033] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0034] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0035] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0036] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0037] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded

…rvices

The "all mirror services have endpoints" check can fail in the presence
of lots of mirrored services because for each service we query the kube
api for its endpoints, and those calls reuse the same golang context,
which ends up reaching its deadline.

To fix, we create a new context object per call.

## Repro

First patch `check.go` to introduce a sleep in order to simulate network
latency:

```diff
diff --git a/multicluster/cmd/check.go b/multicluster/cmd/check.go
index b2b4158bf..f3083f436 100644
--- a/multicluster/cmd/check.go
+++ b/multicluster/cmd/check.go
@@ -627,6 +627,7 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
        for _, svc := range mirrorServices.Items {
                // Check if there is a relevant end-point
                endpoint, err := hc.KubeAPIClient().CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
+               time.Sleep(1 * time.Second)
                if err != nil || len(endpoint.Subsets) == 0 {
                        servicesWithNoEndpoints = append(servicesWithNoEndpoints, fmt.Sprintf("%s.%s mirrored from cluster [%s]", svc.Name, svc.Namespace, svc.Labels[k8s.RemoteClusterNameLabel]))
                }
```

Then run the `multicluster` integration tests to setup a multicluster
scenario, and then create lots of mirrored services!

```bash
$ bin/docker-build

# accommodate to your own arch
$ bin/tests --name multicluster --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd

# we are currently in the target cluster context
$ k create ns testing

# create pod
$ k -n testing run nginx --image=nginx --restart=Never

# create 50 services pointing to it, flagged to be mirrored
$ for i in {1..50}; do k -n testing expose po nginx --port 80 --name "nginx-$i" -l mirror.linkerd.io/exported=true; done

# switch to the source cluster
$ k config use-context k3d-source

# this will trigger the creation of the mirrored services, wait till the
# 50 are created
$ k create ns testing

$ bin/go-run cli mc check --verbose
github.com/linkerd/linkerd2/multicluster/cmd
github.com/linkerd/linkerd2/cli/cmd
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
        * target
√ remote cluster access credentials are valid
        * target
√ clusters share trust anchors
        * target
√ service mirror controller has required permissions
        * target
√ service mirror controllers are running
        * target
DEBU[0000] Starting port forward to https://0.0.0.0:34201/api/v1/namespaces/linkerd-multicluster/pods/linkerd-service-mirror-target-7c4496869f-6xsp4/portforward?timeout=30s 39327:9999
DEBU[0000] Port forward initialised
√ probe services able to communicate with all gateway mirrors
        * target
DEBU[0031] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0032] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0033] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0034] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0035] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0036] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0037] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
```
Copy link
Member

@mateiidavid mateiidavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alpeb for the explanation and the fix! Looks good.

@@ -625,9 +626,12 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
return err
}
for _, svc := range mirrorServices.Items {
// Check if there is a relevant end-point
// have to use a new ctx for each call, otherwise we risk reaching the original context deadline
Copy link
Member

@mateiidavid mateiidavid May 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have lots of mirror services, I wonder if it's better to add concurrency here to also speed up the time it takes to list everything. Not really important though, just a ux consideration 🤷🏻

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the check response speed can improve as long as there's not bottleneck on the serving side. That might be interesting to explore as a follow up.

Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I didn't even realize we had a hardcoded 30 second request timeout in addition to the retry deadline specified by the --wait flag. Good find and fix.

@alpeb alpeb merged commit 1d064fa into main May 18, 2023
@alpeb alpeb deleted the alpeb/mc-check-rate-limiting-fixup branch May 18, 2023 14:18
@alpeb alpeb added this to the stable-2.13.6 milestone Aug 1, 2023
hawkw pushed a commit that referenced this pull request Aug 9, 2023
…rvices (#10893)

The "all mirror services have endpoints" check can fail in the presence
of lots of mirrored services because for each service we query the kube
api for its endpoints, and those calls reuse the same golang context,
which ends up reaching its deadline.

To fix, we create a new context object per call.

## Repro

First patch `check.go` to introduce a sleep in order to simulate network
latency:

```diff
diff --git a/multicluster/cmd/check.go b/multicluster/cmd/check.go
index b2b4158bf..f3083f436 100644
--- a/multicluster/cmd/check.go
+++ b/multicluster/cmd/check.go
@@ -627,6 +627,7 @@ func (hc *healthChecker) checkIfMirrorServicesHaveEndpoints(ctx context.Context)
        for _, svc := range mirrorServices.Items {
                // Check if there is a relevant end-point
                endpoint, err := hc.KubeAPIClient().CoreV1().Endpoints(svc.Namespace).Get(ctx, svc.Name, metav1.GetOptions{})
+               time.Sleep(1 * time.Second)
                if err != nil || len(endpoint.Subsets) == 0 {
                        servicesWithNoEndpoints = append(servicesWithNoEndpoints, fmt.Sprintf("%s.%s mirrored from cluster [%s]", svc.Name, svc.Namespace, svc.Labels[k8s.RemoteClusterNameLabel]))
                }
```

Then run the `multicluster` integration tests to setup a multicluster
scenario, and then create lots of mirrored services!

```bash
$ bin/docker-build

# accommodate to your own arch
$ bin/tests --name multicluster --skip-cluster-delete $PWD/target/cli/linux-amd64/linkerd

# we are currently in the target cluster context
$ k create ns testing

# create pod
$ k -n testing run nginx --image=nginx --restart=Never

# create 50 services pointing to it, flagged to be mirrored
$ for i in {1..50}; do k -n testing expose po nginx --port 80 --name "nginx-$i" -l mirror.linkerd.io/exported=true; done

# switch to the source cluster
$ k config use-context k3d-source

# this will trigger the creation of the mirrored services, wait till the
# 50 are created
$ k create ns testing

$ bin/go-run cli mc check --verbose
github.com/linkerd/linkerd2/multicluster/cmd
github.com/linkerd/linkerd2/cli/cmd
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
        * target
√ remote cluster access credentials are valid
        * target
√ clusters share trust anchors
        * target
√ service mirror controller has required permissions
        * target
√ service mirror controllers are running
        * target
DEBU[0000] Starting port forward to https://0.0.0.0:34201/api/v1/namespaces/linkerd-multicluster/pods/linkerd-service-mirror-target-7c4496869f-6xsp4/portforward?timeout=30s 39327:9999
DEBU[0000] Port forward initialised
√ probe services able to communicate with all gateway mirrors
        * target
DEBU[0031] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0032] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0033] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0034] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0035] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0036] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
DEBU[0037] error retrieving Endpoints: client rate limiter Wait returned an error: context deadline exceeded
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants