[Connect] Add Server Features to GetCluster #17497
Conversation
To enable feature detection in the Connect application, we need to ping the auth server to understand which features are enabled. Previously, we could get by with the cluster information stored in the cluster profile, but now a proxy dial is necessary to get an auth ping response.
    // User is the cluster access control list of the logged-in user
    LoggedInUser logged_in_user = 7;
    // features describes the auth servers features
    Features features = 8;
Any reason not to include the entire ping response? It seems like knowing things like the version, whether or not it's a FIPS server, and if there are any license warnings might also be useful.
No reason besides wanting to keep the API concise and tidy to what is needed for this PR. Not opposed to adding in the extra, unused responses to prevent extra PRs in the future.
I actually recommended Michael to only include what he absolutely needs; it's easy to add to APIs, but hard to remove from them.
codingllama left a comment:
LGTM, but get someone from Connect to have a look too.
    defer proxyClient.Close()

    authPingResponse, err = proxyClient.CurrentCluster().Ping(ctx)
nit: keep err close to whatever deals with it (ifs, returns)
    return trace.Wrap(err)
    })
    @@ -76,7 +76,7 @@ func (s *Handler) RemoveCluster(ctx context.Context, req *api.RemoveClusterReque

    // GetCluster returns a cluster
    func (s *Handler) GetCluster(ctx context.Context, req *api.GetClusterRequest) (*api.Cluster, error) {
This is a longer comment about some of the intricacies of the frontend app.
I think this handler is a good place to fetch features but we need to consider the implications for the frontend app. The endpoint used to make no network calls and now it'll make one.
The logout problem
We call this endpoint on logout. It seems it's a remnant of times when logging out of a cluster and removing it from a cluster list were two separate actions. So you'd log out, then the Electron app would sync the cluster info, just so that connected flips from true to false.
Merging this PR alone would mean that the user is no longer able to log out from a cluster that's offline or if they don't have an internet connection. To be honest, it looks like it might already be a problem in very specific scenarios as I see you added code that fetches assumed requests.
At this point I think we can just remove the call to syncClusterInfo. This gets rid of that problem.
Bundling multiple calls into one
You already saw that there are a bunch of comments in the frontend app mentioning how this endpoint will not return an error if the cert has expired, because it makes no network calls.
Fortunately, other than on logout this endpoint is called only in one other place, syncRootCluster, where it's always coupled with a call to fetch the list of leaf clusters. This one makes a network request anyway and will fail if the cert has expired.
I also just noticed while writing this comment that syncClusterInfo now calls fetchClusterAssumedRequests.
On top of returning a list of features, GetCluster could also return info about leaf clusters and assumed requests. This way we save ourselves a lot of additional work on the frontend side.
On a related note, the comment in syncRootCluster mentions how there's a race condition. It seems I was wrong at the time of writing it; there's no race condition, as we wait for both promises to finish.
The flashing UI problem
On app startup, we call clustersService.syncRootClusters. This fetches a list of root clusters (tshd makes no network requests here, just reads data from the disk) and resets the state of each cluster in ClustersService. Once the list is obtained, ClustersService proceeds to sync more details for each root cluster but doesn't wait for results.
What it means for us here is that on each app startup, the list of features would be simply wiped out. If there are UI elements which are supposed to be shown only if a feature is enabled, this means those UI elements will show up only once we get a response from GetCluster.
This kind of behavior might cause problems because, for example, a component might check features before they're actually available and show some kind of error. We've seen this kind of problem in the past.
Ultimately I think we can skip dealing with this problem for now and see how it all pans out.
A digression on request status
As you can see in the ClustersServiceState type, we track sync status for resources like kubes, apps, servers, and dbs, but not for root and leaf clusters.
In the past, we ran into a problem (gravitational/webapps#920) where <DocumentTerminal> would call clustersService.findCluster(clusterUri). This worked fine for root clusters since they're always available in ClustersService once the app boots up. But it was a problem for leaf clusters since they're fetched only later.
As a result, <DocumentTerminal> would crash because this piece of info wasn't available yet.
At this point, we considered tracking sync status for leaf and root clusters. This way <DocumentTerminal> could simply wait for the leaf clusters to finish syncing and only then proceed with the rest of the logic.
However, we realized that the only reason we make this call to ClustersService is to obtain information which could be extracted from the URI anyway, without any additional calls. The fix meant that <DocumentTerminal> could start operating immediately, without having to wait for other requests to finish.
So we just have to be cognizant of situations like that one where perhaps not waiting is a suitable solution.
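The fix described in that digression relied on the fact that the cluster name can be derived from the URI itself, with no call to ClustersService. A rough Go sketch of that idea follows; the "/clusters/&lt;root&gt;/leaves/&lt;leaf&gt;/..." shape is an assumption for illustration, as the real code has its own uri package:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNamesFromURI extracts root and leaf cluster names from a resource
// URI. It walks the path segments pairwise (kind, name) and picks out the
// "clusters" and "leaves" components; unknown kinds are ignored.
func clusterNamesFromURI(uri string) (root, leaf string) {
	parts := strings.Split(strings.TrimPrefix(uri, "/"), "/")
	for i := 0; i+1 < len(parts); i += 2 {
		switch parts[i] {
		case "clusters":
			root = parts[i+1]
		case "leaves":
			leaf = parts[i+1]
		}
	}
	return root, leaf
}

func main() {
	root, leaf := clusterNamesFromURI("/clusters/example.com/leaves/leaf1/servers/node1")
	fmt.Println(root, leaf)
}
```

Because this is pure string manipulation, a component holding only the URI can start working immediately, without waiting for any sync to finish.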
Going forward
From this long writeup, the only problem requiring immediate attention is the logout one which seems to be easy to fix.
Bundling multiple calls into one would massively help with reducing the complexity of the whole flow. But with the coming v11 release I understand there might not be time to get it in there. Could you file an issue for that?
The flashing UI problem can be left alone for now; in theory we could experiment with sync status there, but again it might not be necessary. And it wouldn't help with the flashing UI either, since we need to fetch available features from the cluster anyway.
We could treat those additional details as a separate resource but I'm yet to consider this option in more detail.
At this point I think we can just remove the call to syncClusterInfo. This gets rid of that problem.
The logout problem is handled in the frontend complement to this PR, the same way you suggest here. I must admit, when I came across the 'bug' of not being able to log out when I initially started this, I was thrown way off. "How the heck am I not able to log out?!" Reading your background gives a lot of context. Thank you!
The flashing UI problem
This is one thing that has bothered me a bit, even before this PR. We would always show a blank UI and just wait for the cluster to sync (I suppose showing that syncing state would alleviate this but that isn't finished). I think in general a "cluster init" type of loading state would be nice. Separate from a syncing state, but only for startup. I think this is something we can do in the future for sure.
    return nil, trace.Wrap(err)
    }

    cluster.Features = features
This introduces a field that sometimes holds a value and sometimes not, depending on which methods were called.
I think it might make more sense to create a new type and embed clusters.Cluster within it. Something like, idk, DetailedCluster? It could even be ClusterWithFeatures, and then we can rename it if we need to add more of those "details".
Then GetCluster could return this type while other places could continue to work with just clusters.Cluster.
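The embedding idea could look roughly like this in Go. Cluster and Features here are simplified stand-ins for clusters.Cluster and the real feature set, and ClusterWithDetails is the hypothetical name floated in this thread:

```go
package main

import "fmt"

// Cluster is a simplified stand-in for clusters.Cluster.
type Cluster struct {
	Name string
}

// Features is a simplified stand-in for the auth server feature set.
type Features struct {
	AdvancedAccessWorkflows bool
}

// ClusterWithDetails embeds Cluster, so GetCluster can return this richer
// type while every other call site keeps working with the plain Cluster.
type ClusterWithDetails struct {
	Cluster
	Features Features
}

func main() {
	c := ClusterWithDetails{
		Cluster:  Cluster{Name: "example.com"},
		Features: Features{AdvancedAccessWorkflows: true},
	}
	// Embedding promotes Cluster's fields, so c.Name works directly.
	fmt.Println(c.Name, c.Features.AdvancedAccessWorkflows)
}
```

The appeal of this shape is that the type signature itself documents which endpoints return details and which don't.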
My 2c:
This introduces a field that sometimes holds a value and sometimes not, depending on which methods were called.
This is fine, just document the field and move on. A type with optional/transient fields is nothing new and it won't explode your API with various combinations of ClusterWithThis and ClusterWithThat.
Doesn't it make it harder to understand which fields are available on the struct when looking at a specific line in the codebase?
I can see that there might be problems with exploding combinations in the future and the resource view pattern is something I did see in JSON APIs. But here we have a pretty clear cut situation where in most cases we always want to use a simple version of the object and one endpoint returns all details.
So it seems to me that having two separate structs & messages would make it clearer as to what to expect from specific methods & endpoints.
Doesn't it make it harder to understand which fields are available on the struct when looking at a specific line in the codebase?
Usually it's not a problem - the caller of RPC X knows what they did and thus which fields are available. It's also common for gRPC clients to deal with missing fields due to version skew (a problem Connect mostly doesn't have).
In this particular case, if the field is important, make it available via more RPCs; otherwise an "optional" approach works.
Good documentation is paramount, though. We need to easily know which RPCs return this and which don't.
I agree that this field wouldn't be needed on cluster.Cluster and wouldn't mind removing it. It doesn't need its own struct anywhere tbh. I also don't want to create new structs like ClusterWithDetails and in this case, I don't think anything new is warranted. We have GetClusterFeatures to return the features when needed.
However, to model an already existing pattern: just as LoggedInUser is included in the Cluster message in the proto package, Features should be as well. LoggedInUser also only reads from the disk-stored profile, but this will soon change, as we will need to make a network request for AccessCapabilities. The only time LoggedInUser and Features are relevant is for the root cluster, as both of these come from the auth server, and they can be missing on a leaf cluster query (the only other place this message would be returned from).
Merging any cluster information, logged-in user information, and auth server info into a single GetCluster method (similar to the proposed idea) would be the best place for this. It would also lend itself to a nice cluster init flow in the app when we implement that. I think changes should be made to syncRootClusters on the frontend to basically run a Promise.allSettled over a map of GetCluster calls. If we are doing pings to the auth server, I don't want to run those in a loop like the current ListClusters available in the handler now (and we could probably just remove ListClusters after that).
In summary: I'm in favor of removing Features from cluster.Cluster and keeping the proposed features on the proto Cluster message. We can just update newApiRootCluster to include this information. I think it's a good middle ground. Thoughts?
edit: The AddCluster handler also returns a Cluster message, but I think this information above is relevant to that call and should be added there too, so it doesn't detract.
In summary: I'm in favor of removing Features from cluster.Cluster and keeping the proposed features on the proto Cluster message. We can just update newApiRootCluster to include this information. I think it's a good middle ground. Thoughts?
That sounds fine to me.
edit: The AddCluster handler also returns a Cluster message, but I think this information above is relevant to that call and should be added there too, so it doesn't detract.
Yeah, otherwise you'd need to add the root cluster and then sync it immediately; that makes sense. Once we go down that path, we might realize that it'd actually be beneficial to add a Features field on cluster.Cluster, so do what you think is best.
After fiddling with the AddCluster side of things, I think it's best to add the Features field on cluster.Cluster as Alan suggested (apologies for the back and forth). I documented the field, the area in newApiRootCluster where it applies, and the proto message. AddCluster doesn't (and can't) get any auth server information like features because the user hasn't logged in yet, having only added the cluster information to a profile, so the returned message will not include the features field (documented).
I can see GetCluster being used quite extensively in the near future with tshd-initiated communication and persistent auth/web clients. Using it to fetch all the relevant information now is a bit wonky, but once those features are in, will be the new paradigm imo.
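The approach settled on here, a documented sometimes-empty field, might look like this sketch. The types are simplified stand-ins and the nil-pointer convention is an assumption for illustration, not the actual struct definition:

```go
package main

import "fmt"

// Features is a simplified stand-in for the auth server feature set.
type Features struct {
	AdvancedAccessWorkflows bool
}

// Cluster is a simplified stand-in for clusters.Cluster. Features is nil
// unless the cluster was fetched through a handler that pings the auth
// server (GetCluster); AddCluster cannot populate it because the user
// hasn't logged in yet.
type Cluster struct {
	Name     string
	Features *Features
}

func main() {
	// As returned by AddCluster: no network request, so no features.
	added := Cluster{Name: "example.com"}
	fmt.Println(added.Features == nil)

	// As returned by GetCluster after a successful auth ping.
	fetched := Cluster{
		Name:     "example.com",
		Features: &Features{AdvancedAccessWorkflows: true},
	}
	fmt.Println(fetched.Features.AdvancedAccessWorkflows)
}
```

The documentation on the field is what carries the contract here: callers must know which RPCs populate it and guard against the nil case everywhere else.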
    // User is the cluster access control list of the logged-in user
    LoggedInUser logged_in_user = 7;
    // features describes the auth servers features
    Features features = 8;
Just like with the struct in the clusters package, I don't think it's a good idea to have a field that's empty most of the time, with the exception of one particular endpoint.
It might be better to introduce a new message, extending Cluster fields or copying them over (I don't know what capabilities protobufs offer in this area).
Same remarks as above, I think this is alright.
A few counter-examples in our codebase:
- teleport/api/proto/teleport/devicetrust/v1/device.proto, lines 57 to 61 in 35352aa
- teleport/api/proto/teleport/devicetrust/v1/device.proto, lines 71 to 76 in 35352aa
- teleport/api/proto/teleport/devicetrust/v1/devicetrust_service.proto, lines 162 to 163 in 35352aa
- https://cloud.google.com/apis/design/design_patterns#resource_view (pattern)
    return nil, trace.Wrap(err)
    }

    return authPingResponse.ServerFeatures, nil
Is it possible for the cluster to have access requests enabled but the user to not have access to them?
In the web app, I see that the features are scoped per user rather than per cluster which I believe is a good choice. Ultimately in our case we can use the same endpoint for both (it already returns info about currently logged in user), it's just a matter of how we approach that on the backend.
I was working on LoggedInUser ACL related things but decided to separate them into their own PR. Reading the comments here gave me a bit to think about, so the state of that PR was pushed back a bit.
For access requests specifically, on the webui all users can see the tab and the UI even if they have no access to Access Requests. Although I believe this is in more of a "TODO" state than a conscious design.
The current change I have proposed in the frontend UI wouldn't be mutually exclusive to this idea. If the cluster doesn't have Access Requests enabled, then the UI shouldn't show the feature at all.
Additionally, we can have a more granular user ACL for different aspects of access requests; this would be tied into the smaller components of the UI (such as disabling the DELETE button if they do not have access to DELETE).
    // Does not include detailed cluster information that would require a network request.
    rpc ListRootClusters(ListClustersRequest) returns (ListClustersResponse);
    // ListLeafClusters lists leaf clusters
    // Does not include detailed cluster information that would require a network request.
Could you add a comment to GetCluster that it makes a network request and returns detailed cluster information?
This will set the foundation for feature detection use in the Connect application. Previously, we could get all the information we needed just from the cluster profile (stored on disk) but now, with the demand to understand which features may be available on the current Auth server, we must ping. If there is a better way to handle this, I'm all ears! Currently, the only relevant information needed for Connect is AdvancedAccessWorkflows (access requests), but we can add keys to the Connect API as needed.