Introduce an EndpointsWatcher cache structure #11190
Conversation
* EndpointsWatcher cache will be populated with one EndpointsWatcher per cluster (including the local cluster).
* Using a namespace-scoped informer, it will add/remove watchers based on secrets read from the namespace.

Signed-off-by: Matei David <[email protected]>
direction looks good here. we'll want to be able to stop the informers when the secret is deleted, which I think we can do with the channel passed into `Sync`. we may also want to add a `Stop` function to `EndpointsWatcher` to unregister the event listeners and avoid memory leaks, but maybe stopping the informers and removing the `EndpointsWatcher` from the cache is sufficient to make it all eligible for GC.
```diff
@@ -221,8 +221,10 @@ func (cs *ClusterStore) addCluster(clusterName string, secret *v1.Secret) error
 		stopCh,
 	}

+	remoteAPI.Sync(stopCh)
+	metadataAPI.Sync(stopCh)
 	go func() {
```
we could even go a step further and sync each of these in their own goroutine to sync them in parallel.
```go
remoteAPI, err := k8s.NewFakeAPI([]string{}...)
if err != nil {
	return nil, nil, err
}

metadataAPI, err := k8s.NewFakeMetadataAPI(nil)
```
Following up on our off-github convo, I think the race occurs between these two calls, `NewFakeAPI` and `NewFakeMetadataAPI`, because they rely on a call to `ToRuntimeObject()`, which uses this global struct instance from `scheme/register.go` that holds the map being contended over:

```go
var Scheme = runtime.NewScheme()
```

Maybe a simple fix could be to have `ToRuntimeObject()` receive the scheme instance as an argument, which could be instantiated separately for these two API clients.
```go
}

// Handle delete events
if len(tt.deleteClusters) != 0 {
```
I think this line is not needed.
```go
	decodeFn: decodeFn,
}

_, err := cs.k8sAPI.Secret().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
```
Why is it that we're not interested in Update events? Can you add a comment about that?
```diff
@@ -57,6 +58,26 @@ func InitializeMetadataAPI(kubeConfig string, resources ...APIResource) (*Metada
 	return api, nil
 }

+func InitializeMetadataAPIForConfig(kubeConfig *rest.Config, resources ...APIResource) (*MetadataAPI, error) {
```
You can slim down the `InitializeMetadataAPI` body now:

```go
func InitializeMetadataAPI(kubeConfig string, resources ...APIResource) (*MetadataAPI, error) {
	config, err := k8s.GetConfig(kubeConfig, "")
	if err != nil {
		return nil, fmt.Errorf("error configuring Kubernetes API client: %w", err)
	}
	return InitializeMetadataAPIForConfig(config, resources...)
}
```
Exciting!
Left some non-critical comments above. The one thing that might be annoying, though, is the sporadic test failures.
Also, fyi, EndpointSlices entered GA in k8s 1.21, which is our oldest supported version, at which point the EndpointSlices feature gate defaults to true. And in more recent versions the feature gate is gone altogether. So you could just assume that's enabled to simplify things, but leaving that at your discretion 😉
## edge-23.8.2

This edge release adds improvements to Linkerd's multi-cluster features as part of the [flat network support] planned for Linkerd stable-2.14.0. In addition, it fixes an issue ([#10764]) where warnings about an invalid metric were logged frequently by the Destination controller.

* Added a new `remoteDiscoverySelector` field to the multicluster `Link` CRD, which enables a service mirroring mode where the control plane performs discovery for the mirrored service from the remote cluster, rather than creating Endpoints for the mirrored service in the source cluster ([#11190], [#11201], [#11220], and [#11224])
* Fixed missing "Services" menu item in the Spanish localization for the `linkerd-viz` web dashboard ([#11229]) (thanks @mclavel!)
* Replaced `server_port_subscribers` Destination controller gauge metric with `server_port_subscribes` and `server_port_unsubscribes` counter metrics ([#11206]; fixes [#10764])
* Replaced deprecated `failure-domain.beta.kubernetes.io` labels in Helm charts with `topology.kubernetes.io` labels ([#11148]; fixes [#11114]) (thanks @piyushsingariya!)

[#10764]: #10764
[#11114]: #11114
[#11148]: #11148
[#11190]: #11190
[#11201]: #11201
[#11206]: #11206
[#11220]: #11220
[#11224]: #11224
[#11229]: #11229
[flat network support]: https://linkerd.io/2023/07/20/enterprise-multi-cluster-at-scale-supporting-flat-networks-in-linkerd/
Currently, the control plane does not support indexing and discovering resources across cluster boundaries. In a multicluster set-up, it is desirable to have access to endpoint data (and by extension, any configuration associated with that endpoint) to support pod-to-pod communication. Linkerd's destination service should be extended to support reading discovery information from multiple sources (i.e. API Servers).
This change introduces an `EndpointsWatcher` cache. On creation, the cache registers a pair of event handlers:

* one that watches for `multicluster` secrets that embed kubeconfig files; for each such secret, the cache creates a corresponding `EndpointsWatcher` to read remote discovery information;
* one that tears the watcher down when its `multicluster` secret is deleted.

Additionally, a set of tests has been added that asserts the behaviour of the cache when secrets are created and/or deleted. The cache will be used by the destination service to do look-ups for services that belong to another cluster and that are running in a "remote discovery" mode.
Signed-off-by: Matei David [email protected]