[Connect] Add Server Features to GetCluster #17497
Conversation
To enable feature detection in the Connect application, we need to ping the auth server to understand which features are enabled. Previously, we could get by with the cluster information stored in the cluster profile, but now a proxy dial is necessary to get an auth ping response.
    // User is the cluster access control list of the logged-in user
    LoggedInUser logged_in_user = 7;
    // features describes the auth servers features
    Features features = 8;
Any reason not to include the entire ping response? It seems like knowing things like the version, whether or not it's a FIPS server, and if there are any license warnings might also be useful.
No reason besides wanting to keep the API concise and tidy to what is needed for this PR. Not opposed to adding in the extra, unused responses to prevent extra PRs in the future.
I actually recommended Michael to only include what he absolutely needs; it's easy to add to APIs, but hard to remove from them.
codingllama left a comment:
LGTM, but get someone from Connect to have a look too.
    defer proxyClient.Close()

    authPingResponse, err = proxyClient.CurrentCluster().Ping(ctx)
nit: keep err close to whatever deals with it (ifs, returns)
    return trace.Wrap(err)
    })
    @@ -76,7 +76,7 @@ func (s *Handler) RemoveCluster(ctx context.Context, req *api.RemoveClusterReque

    // GetCluster returns a cluster
    func (s *Handler) GetCluster(ctx context.Context, req *api.GetClusterRequest) (*api.Cluster, error) {
This is a longer comment about some of the intricacies of the frontend app.
I think this handler is a good place to fetch features but we need to consider the implications for the frontend app. The endpoint used to make no network calls and now it'll make one.
The logout problem
We call this endpoint on logout. It seems it's a remnant of times when logging out of a cluster and removing it from a cluster list were two separate actions. So you'd log out, then the Electron app would sync the cluster info, just so that connected flips from true to false.
Merging this PR alone would mean that the user is no longer able to log out from a cluster that's offline or if they don't have an internet connection. To be honest, it looks like it might already be a problem in very specific scenarios as I see you added code that fetches assumed requests.
At this point I think we can just remove the call to syncClusterInfo. This gets rid of that problem.
Bundling multiple calls into one
You already saw that there are a bunch of comments in the frontend app mentioning how this endpoint will not return an error if the cert has expired, because it makes no network calls.
Fortunately, other than on logout this endpoint is called only in one other place, syncRootCluster, where it's always coupled with a call to fetch the list of leaf clusters. This one makes a network request anyway and will fail if the cert has expired.
I also just noticed while writing this comment that syncClusterInfo now calls fetchClusterAssumedRequests.
On top of returning a list of features, GetCluster could also return info about leaf clusters and assumed requests. This way we save ourselves a lot of additional work on the frontend side.
On a related note, the comment in syncRootCluster mentions how there's a race condition. It seems I was wrong at the time of writing it; there's no race condition, as we wait for both promises to finish.
The flashing UI problem
On app startup, we call clustersService.syncRootClusters. This fetches a list of root clusters (tshd makes no network requests here, just reads data from the disk) and resets the state of each cluster in ClustersService. Once the list is obtained, ClustersService proceeds to sync more details for each root cluster but doesn't wait for results.
What it means for us here is that on each app startup, the list of features would be simply wiped out. If there are UI elements which are supposed to be shown only if a feature is enabled, this means those UI elements will show up only once we get a response from GetCluster.
This kind of behavior might cause problems because, for example, a component might check features before they're actually available and show some kind of error. We've seen this kind of problem in the past.
Ultimately I think we can skip dealing with this problem for now and see how it all pans out.
A digression on request status
As you can see in the ClustersServiceState type, we track sync status for resources like kubes, apps, servers, and dbs, but not for root and leaf clusters.
In the past, we ran into a problem (gravitational/webapps#920) where <DocumentTerminal> would call clustersService.findCluster(clusterUri). This worked fine for root clusters since they're always available in ClustersService once the app boots up. But it was a problem for leaf clusters since they're fetched only later.
As a result, <DocumentTerminal> would crash because this piece of info wasn't available yet.
At this point, we considered tracking sync status for leaf and root clusters. This way <DocumentTerminal> could simply wait for the leaf clusters to finish syncing and only then proceed with the rest of the logic.
However, we realized that the only reason we make this call to ClustersService is to obtain information which could be extracted from the URI anyway, without any additional calls. The fix meant that <DocumentTerminal> could start operating immediately, without having to wait for other requests to finish.
So we just have to be cognizant of situations like that one where perhaps not waiting is a suitable solution.
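The fix described in that digression relied on the fact that the cluster name can be derived from the URI itself, with no call to ClustersService. A rough Go sketch of that idea follows; the "/clusters/&lt;root&gt;/leaves/&lt;leaf&gt;/..." shape is an assumption for illustration, as the real code has its own uri package:

```go
package main

import (
	"fmt"
	"strings"
)

// clusterNamesFromURI extracts root and leaf cluster names from a resource
// URI. It walks the path segments pairwise (kind, name) and picks out the
// "clusters" and "leaves" components; unknown kinds are ignored.
func clusterNamesFromURI(uri string) (root, leaf string) {
	parts := strings.Split(strings.TrimPrefix(uri, "/"), "/")
	for i := 0; i+1 < len(parts); i += 2 {
		switch parts[i] {
		case "clusters":
			root = parts[i+1]
		case "leaves":
			leaf = parts[i+1]
		}
	}
	return root, leaf
}

func main() {
	root, leaf := clusterNamesFromURI("/clusters/example.com/leaves/leaf1/servers/node1")
	fmt.Println(root, leaf)
}
```

Because this is pure string manipulation, a component holding only the URI can start working immediately, without waiting for any sync to finish.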
Going forward
From this long writeup, the only problem requiring immediate attention is the logout one which seems to be easy to fix.
Bundling multiple calls into one would massively help with reducing the complexity of the whole flow. But with the coming v11 release I understand there might not be time to get it in there. Could you file an issue for that?
The flashing UI problem can be left alone for now; in theory we could experiment with sync status there, but again it might not be necessary. And it wouldn't help with the flashing UI either, since we need to fetch available features from the cluster anyway.
We could treat those additional details as a separate resource but I'm yet to consider this option in more detail.
At this point I think we can just remove the call to syncClusterInfo. This gets rid of that problem.
The logout problem is handled in the frontend complement to this PR, the same way you suggest here. I must admit, when I came across the 'bug' of not being able to log out when I initially started this, I was thrown way off. "How the heck am I not able to log out?!" Reading your background gives a lot of context. Thank you!
The flashing UI problem
This is one thing that has bothered me a bit, even before this PR. We would always show a blank UI and just wait for the cluster to sync (I suppose showing that syncing state would alleviate this but that isn't finished). I think in general a "cluster init" type of loading state would be nice. Separate from a syncing state, but only for startup. I think this is something we can do in the future for sure.
    return nil, trace.Wrap(err)
    }

    cluster.Features = features
This introduces a field that sometimes holds a value and sometimes not, depending on which methods were called.
I think it might make more sense to create a new type and embed clusters.Cluster within it. Something like, idk, DetailedCluster? It could even be ClusterWithFeatures, and then we can rename it if we need to add more of those "details".
Then GetCluster could return this type while other places could continue to work with just clusters.Cluster.
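The embedding idea could look roughly like this in Go. Cluster and Features here are simplified stand-ins for clusters.Cluster and the real feature set, and ClusterWithDetails is the hypothetical name floated in this thread:

```go
package main

import "fmt"

// Cluster is a simplified stand-in for clusters.Cluster.
type Cluster struct {
	Name string
}

// Features is a simplified stand-in for the auth server feature set.
type Features struct {
	AdvancedAccessWorkflows bool
}

// ClusterWithDetails embeds Cluster, so GetCluster can return this richer
// type while every other call site keeps working with the plain Cluster.
type ClusterWithDetails struct {
	Cluster
	Features Features
}

func main() {
	c := ClusterWithDetails{
		Cluster:  Cluster{Name: "example.com"},
		Features: Features{AdvancedAccessWorkflows: true},
	}
	// Embedding promotes Cluster's fields, so c.Name works directly.
	fmt.Println(c.Name, c.Features.AdvancedAccessWorkflows)
}
```

The appeal of this shape is that the type signature itself documents which endpoints return details and which don't.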
My 2c:
This introduces a field that sometimes holds a value and sometimes not, depending on which methods were called.
This is fine, just document the field and move on. A type with optional/transient fields is nothing new and it won't explode your API with various combinations of ClusterWithThis and ClusterWithThat.
Doesn't it make it harder to understand which fields are available on the struct when looking at a specific line in the codebase?
I can see that there might be problems with exploding combinations in the future and the resource view pattern is something I did see in JSON APIs. But here we have a pretty clear cut situation where in most cases we always want to use a simple version of the object and one endpoint returns all details.
So it seems to me that having two separate structs & messages would make it clearer as to what to expect from specific methods & endpoints.
Doesn't it make it harder to understand which fields are available on the struct when looking at a specific line in the codebase?
Usually it's not a problem - the caller of RPC X knows what they did and thus which fields are available. It's also common for gRPC clients to deal with missing fields due to version skew (a problem Connect mostly doesn't have).
In this particular case, if the field is important, make it available via more RPCs; otherwise an "optional" approach works.
Good documentation is paramount, though. We need to easily know which RPCs return this and which don't.
I agree that this field wouldn't be needed on cluster.Cluster and wouldn't mind removing it. It doesn't need its own struct anywhere tbh. I also don't want to create new structs like ClusterWithDetails and in this case, I don't think anything new is warranted. We have GetClusterFeatures to return the features when needed.
However, to model an already existing pattern: just as LoggedInUser is included in the Cluster message in the proto package, Features should be as well. LoggedInUser also only reads from the disk-stored profile, but this will soon change, as we will need to make a network request for AccessCapabilities. The only time LoggedInUser and Features are relevant is for the root cluster, as both of these come from the auth server, and they can be missing on a leaf cluster query (the only other place this message would be returned from).
Merging any cluster information, logged-in user information, and auth server info into a single GetCluster method (similar to the proposed idea) would be the best place for this. It would also lend itself to a nice cluster init flow in the app when we implement that. I think changes should be made to syncRootClusters on the frontend to basically run a Promise.allSettled over a map of GetCluster calls. If we are doing pings to the auth server, I don't want to run those in a loop like the current ListClusters available in the handler now (and we could probably just remove ListClusters after that).
In summary: I'm in favor of removing Features from cluster.Cluster and keeping the proposed features on the proto Cluster message. We can just update newApiRootCluster to include this information. I think it's a good middle ground. Thoughts?
edit: The AddCluster handler also returns a Cluster message, but I think this information above is relevant to that call and should be added there too, so it doesn't detract.
In summary: I'm in favor of removing Features from cluster.Cluster and keeping the proposed features on the proto Cluster message. We can just update newApiRootCluster to include this information. I think it's a good middle ground. Thoughts?
That sounds fine to me.
edit: The AddCluster handler also returns a Cluster message, but I think this information above is relevant to that call and should be added there too, so it doesn't detract.
Yeah, otherwise you'd need to add the root cluster and then sync it immediately; that makes sense. Once we go down that path, we might realize that it'd actually be beneficial to add a Features field on cluster.Cluster, so do what you think is best.
After fiddling with the AddCluster side of things, I think it's best to add the Features field on cluster.Cluster as Alan suggested (apologies for the back and forth). I documented the field, the area in newApiRootCluster where it applies, and the proto message. AddCluster doesn't (and can't) get any auth server information like features because the user hasn't logged in yet, having only added the cluster information to a profile, so the returned message will not include the features field (documented).
I can see GetCluster being used quite extensively in the near future with tshd-initiated communication and persistent auth/web clients. Using it to fetch all the relevant information now is a bit wonky, but once those features are in, will be the new paradigm imo.
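The approach settled on here, a documented sometimes-empty field, might look like this sketch. The types are simplified stand-ins and the nil-pointer convention is an assumption for illustration, not the actual struct definition:

```go
package main

import "fmt"

// Features is a simplified stand-in for the auth server feature set.
type Features struct {
	AdvancedAccessWorkflows bool
}

// Cluster is a simplified stand-in for clusters.Cluster. Features is nil
// unless the cluster was fetched through a handler that pings the auth
// server (GetCluster); AddCluster cannot populate it because the user
// hasn't logged in yet.
type Cluster struct {
	Name     string
	Features *Features
}

func main() {
	// As returned by AddCluster: no network request, so no features.
	added := Cluster{Name: "example.com"}
	fmt.Println(added.Features == nil)

	// As returned by GetCluster after a successful auth ping.
	fetched := Cluster{
		Name:     "example.com",
		Features: &Features{AdvancedAccessWorkflows: true},
	}
	fmt.Println(fetched.Features.AdvancedAccessWorkflows)
}
```

The documentation on the field is what carries the contract here: callers must know which RPCs populate it and guard against the nil case everywhere else.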
    // User is the cluster access control list of the logged-in user
    LoggedInUser logged_in_user = 7;
    // features describes the auth servers features
    Features features = 8;
Just like with the struct in the clusters package, I don't think it's a good idea to have a field that's empty most of the time, with the exception of one particular endpoint.
It might be better to introduce a new message, extending Cluster fields or copying them over (I don't know what capabilities protobufs offer in this area).
Same remarks as above, I think this is alright.
A few counter-examples in our codebase:
- teleport/api/proto/teleport/devicetrust/v1/device.proto, lines 57 to 61 in 35352aa
- teleport/api/proto/teleport/devicetrust/v1/device.proto, lines 71 to 76 in 35352aa
- teleport/api/proto/teleport/devicetrust/v1/devicetrust_service.proto, lines 162 to 163 in 35352aa
- https://cloud.google.com/apis/design/design_patterns#resource_view (pattern)
    return nil, trace.Wrap(err)
    }

    return authPingResponse.ServerFeatures, nil
Is it possible for the cluster to have access requests enabled but the user to not have access to them?
In the web app, I see that the features are scoped per user rather than per cluster which I believe is a good choice. Ultimately in our case we can use the same endpoint for both (it already returns info about currently logged in user), it's just a matter of how we approach that on the backend.
I was working on LoggedInUser ACL related things but decided to separate them into their own PR. Reading the comments here gave me a bit to think about, so the state of that PR was pushed back a bit.
For access requests specifically, on the webui all users can see the tab and the UI even if they have no access to Access Requests. Although I believe this is in more of a "TODO" state than a conscious design.
The current change I have proposed in the frontend UI wouldn't be mutually exclusive to this idea. If the cluster doesn't have Access Requests enabled, then the UI shouldn't show the feature at all.
Additionally, we can have a more granular user ACL for different aspects of access requests; this would be tied into the smaller components of the UI (such as disabling the DELETE button if they do not have access to DELETE).
    // Does not include detailed cluster information that would require a network request.
    rpc ListRootClusters(ListClustersRequest) returns (ListClustersResponse);
    // ListLeafClusters lists leaf clusters
    // Does not include detailed cluster information that would require a network request.
Could you add a comment to GetCluster that it makes a network request and returns detailed cluster information?
This will set the foundation for feature detection use in the Connect application. Previously, we could get all the information we needed just from the cluster profile (stored on disk) but now, with the demand to understand which features may be available on the current Auth server, we must ping. If there is a better way to handle this, I'm all ears! Currently, the only relevant information needed for Connect is AdvancedAccessWorkflows (access requests), but we can add keys to the Connect API as needed.