Add support for remote discovery #11224
Conversation
@@ -247,21 +267,21 @@ func decodeK8sConfigFromSecret(data []byte, enableEndpointSlices bool) (*k8s.API
	ctx,
	cfg,
	true,
	k8s.ES, k8s.Pod, k8s.Svc, k8s.SP, k8s.Job, k8s.Srv,
@mateiidavid I couldn't find where we used ServiceProfile or Node data so I removed them. Do you know if these are actually used anywhere?
No, I think it was from bad copy & paste on my end. I re-used the same resources as the destination service. AFAIK SPs are only used for GetProfile(), and Node resources are queried exclusively in the endpoint translator.
The SP informer is still going to get called via NewServer()->watcher.NewProfileWatcher(), and if it's not initialized, won't that generate a panic? As for Nodes, we still use the endpoint translator, no?
Another thing I don't understand is why we need Job (here and in the original destination API; introduced in 03762cc).
I was looking into the Job remark and it does seem like it's not used in the endpoints watcher. We get the owner kind and name here:
linkerd2/controller/api/destination/watcher/endpoints_watcher.go
Lines 974 to 977 in 2d3f3d3
ownerKind, ownerName, err := pp.metadataAPI.GetOwnerKindAndName(context.Background(), pod, false)
if err != nil {
	return Address{}, PodID{}, err
}
This in turn uses the metadata API to query the API server directly for Job/RS info:
linkerd2/controller/k8s/metadata_api.go
Lines 224 to 248 in 2d3f3d3
		if retry {
			parentObj, err = api.client.
				Resource(batchv1.SchemeGroupVersion.WithResource("jobs")).
				Namespace(pod.Namespace).
				Get(ctx, parent.Name, metav1.GetOptions{})
			if err != nil {
				log.Warnf("failed to retrieve job from direct API call %s/%s: %s", pod.Namespace, parent.Name, err)
			}
		}
	}
case "ReplicaSet":
	var rsObj *metav1.PartialObjectMetadata
	rsObj, err = api.getByNamespace(RS, pod.Namespace, parent.Name)
	if err != nil {
		log.Warnf("failed to retrieve replicaset from indexer %s/%s: %s", pod.Namespace, parent.Name, err)
		if retry {
			rsObj, err = api.client.
				Resource(appsv1.SchemeGroupVersion.WithResource("replicasets")).
				Namespace(pod.Namespace).
				Get(ctx, parent.Name, metav1.GetOptions{})
			if err != nil {
				log.Warnf("failed to retrieve replicaset from direct API call %s/%s: %s", pod.Namespace, parent.Name, err)
			}
		}
	}
So while we need RBAC for Jobs, it doesn't seem to be required as an informer. It would seem better for the metadata client to also look for these resources in its cache, but I think that's out of scope here.
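For illustration, here is a minimal sketch of that cache-first/direct-fallback pattern using only upstream client-go's metadata client; the getFromCache parameter is a hypothetical stand-in for the indexer lookup (api.getByNamespace in the real code), and this is not the PR's implementation:

package example

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/metadata"
)

// getJobMetadata consults an informer-backed cache first and only falls back to
// a direct metadata GET against the API server when the cache lookup fails.
// getFromCache is a hypothetical stand-in for the indexer lookup.
func getJobMetadata(
	ctx context.Context,
	client metadata.Interface,
	getFromCache func(ns, name string) (*metav1.PartialObjectMetadata, error),
	ns, name string,
) (*metav1.PartialObjectMetadata, error) {
	if obj, err := getFromCache(ns, name); err == nil {
		return obj, nil
	}
	// Cache miss (or stale index): fall back to querying the API server directly.
	return client.
		Resource(batchv1.SchemeGroupVersion.WithResource("jobs")).
		Namespace(ns).
		Get(ctx, name, metav1.GetOptions{})
}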
Re: the profile watcher, the watchers we build in the cluster store don't start a profile watcher, so we should be fine without it imo. In my case it didn't crash; profile lookups are done "locally":
# Generate template for imported service
$ linkerd profile podinfo-target --template | k apply -f -
serviceprofile.linkerd.io/podinfo-target.default.svc.cluster.local created
$ linkerd dg endpoints podinfo-target.default.svc.cluster.local
NAMESPACE IP PORT POD SERVICE
default 10.24.0.23 9898 podinfo-7c887c7bd5-tvkw9 podinfo.default
# Test discovery resp.
$ go run controller/script/destination-client/main.go -token "{\"nodeName\": \"k3d-multiaz-agent-8\" }" -path podinfo-target.default.svc.cluster.local:80 -method getProfile
INFO[0000] fully_qualified_name:"podinfo-target.default.svc.cluster.local" routes:{condition:{all:{matches:{method:{registered:POST}} matches:{path:{regex:"/authors/\\d+"}}}} response_classes:{condition:{status:{min:500 max:599}} is_failure:true} metrics_labels:{key:"route" value:"/authors/{id}"}} retry_budget:{retry_ratio:0.2 min_retries_per_second:10 ttl:{seconds:10}}
INFO[0000]
Thanks Matei. So we can get rid of k8s.Job when instantiating the remoteAPI, but leave it in the instantiation of metadataAPI.

> It would seem better for the metadata client to also look for these resources in its cache

Don't we do this already? We call api.getByNamespace() and if that errors out, we fall back to the direct call.

About SP, you're right! (And now I get that Node is only required in the metadataAPI, so that's fine too.)
secretsInformerFactory := informers.NewSharedInformerFactoryWithOptions(client, k8s.ResyncTime, informers.WithNamespace(namespace))
secrets := secretsInformerFactory.Core().V1().Secrets().Informer()
Perhaps it's worth going through k8s.NewNamespacedAPI() to get the cache metric for free? You could also leverage the Sync() method on the API it returns.
I considered that, but it would mean also passing in a dynamic client and an l5dCrd client which never get used. It seemed simpler to avoid all of that machinery and just build what we need.
Actually those two clients can be passed as nils.
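For reference, a minimal sketch of the standalone approach using only upstream client-go (the function name and resync period here are illustrative, not the controller's actual code); it also shows the cache-sync step that k8s.NewNamespacedAPI's Sync would otherwise provide:

package example

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchSecrets builds a namespaced Secrets informer, starts the factory, and
// blocks until the informer's cache has synced (or the stop channel is closed).
func watchSecrets(client kubernetes.Interface, namespace string, stop <-chan struct{}) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute, // resync period; k8s.ResyncTime in the controller
		informers.WithNamespace(namespace),
	)
	secrets := factory.Core().V1().Secrets().Informer()
	factory.Start(stop)
	if !cache.WaitForCacheSync(stop, secrets.HasSynced) {
		return nil, fmt.Errorf("failed to sync secrets informer cache")
	}
	return secrets, nil
}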
This looks great! I tested this out in a local environment (2 clusters, one linked against the other). Traffic works well, service profiles were discovered correctly on the imported service, and authz policies on the exported service work well too.

I'll note that linkerd mc check doesn't work at the moment; it will throw an error if the imported service doesn't have endpoints (being addressed in #11226). The test seems to fail because there's a diff in linkerd mc allow:
# > means it was introduced by this change
25c25,28
< resources: ["pods", "endpoints", "services"]
---
> resources: ["pods", "endpoints", "services", "nodes"]
> verbs: ["list", "get", "watch"]
> - apiGroups: ["linkerd.io"]
> resources: ["serviceprofiles"]
30,32d32
< - apiGroups: ["policy.linkerd.io"]
< resources: ["servers"]
< verbs: ["list", "get", "watch"]
36d35
< resourceNames: ["linkerd-config"]
39a39,41
> - apiGroups: ["policy.linkerd.io"]
> resources: ["authorizationpolicies", "httproutes", "meshtlsauthentications", "networkauthentications", "servers", "serverauthorizations"]
> verbs: ["list", "get", "watch"]
🚢 👍
🙌🏻 🚢
## edge-23.8.2

This edge release adds improvements to Linkerd's multi-cluster features as part of the [flat network support] planned for Linkerd stable-2.14.0. In addition, it fixes an issue ([#10764]) where warnings about an invalid metric were logged frequently by the Destination controller.

* Added a new `remoteDiscoverySelector` field to the multicluster `Link` CRD, which enables a service mirroring mode where the control plane performs discovery for the mirrored service from the remote cluster, rather than creating Endpoints for the mirrored service in the source cluster ([#11190], [#11201], [#11220], and [#11224])
* Fixed missing "Services" menu item in the Spanish localization for the `linkerd-viz` web dashboard ([#11229]) (thanks @mclavel!)
* Replaced `server_port_subscribers` Destination controller gauge metric with `server_port_subscribes` and `server_port_unsubscribes` counter metrics ([#11206]; fixes [#10764])
* Replaced deprecated `failure-domain.beta.kubernetes.io` labels in Helm charts with `topology.kubernetes.io` labels ([#11148]; fixes [#11114]) (thanks @piyushsingariya!)

[#10764]: #10764
[#11114]: #11114
[#11148]: #11148
[#11190]: #11190
[#11201]: #11201
[#11206]: #11206
[#11220]: #11220
[#11224]: #11224
[#11229]: #11229
[flat network support]: https://linkerd.io/2023/07/20/enterprise-multi-cluster-at-scale-supporting-flat-networks-in-linkerd/
Adds support for remote discovery to the destination controller.

When the destination controller gets a `Get` request for a Service with the `multicluster.linkerd.io/remote-discovery` label, this is an indication that the destination controller should discover the endpoints for this service from a remote cluster. The destination controller will look for a remote cluster which has been linked to it (using the `linkerd multicluster link` command) with that name. It will look at the `multicluster.linkerd.io/remote-discovery` label for the service name to look up in that cluster. It then streams back the endpoint data for that remote service.

Since we now have multiple client-go informers for the same resource types (one for the local cluster and one for each linked remote cluster), we add a `cluster` label onto the prometheus metrics for the informers and EndpointWatchers to ensure that each of these components' metrics are correctly tracked and don't overwrite each other.
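To illustrate the per-cluster metric labels described in the last paragraph, here is a hedged sketch using the Prometheus Go client; the metric name and function are hypothetical, not the controller's actual metrics:

package example

import "github.com/prometheus/client_golang/prometheus"

// newEndpointsUpdatesCounter registers a counter whose constant "cluster" label
// distinguishes the series produced by per-cluster informers/EndpointWatchers,
// so watchers for different clusters don't overwrite each other's metrics.
func newEndpointsUpdatesCounter(reg prometheus.Registerer, cluster string) prometheus.Counter {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name:        "endpoints_updates", // hypothetical metric name
		Help:        "Number of endpoints updates processed.",
		ConstLabels: prometheus.Labels{"cluster": cluster},
	})
	reg.MustRegister(c)
	return c
}

With a constant cluster label like this, the local watcher and each remote watcher can register into the same registry without their series colliding.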