
Conversation

@amrc-benmorrow
Contributor

This is not fully tested yet.

Add an Auth interface to RxClient. This tracks ACLs via change-notify, caching the ACL for a given principal for half an hour. Calls to the existing Auth interface methods will now go via this cache instead of the HTTP request cache.

Update existing Auth clients to use RxClient. In most cases this is all that is required to support change-notify. In some cases additional updates to use the ConfigDB notify interface are also included.
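A rough sketch of the caching behaviour this describes (not the actual RxClient code; the `notify.watch` interface and the `AuthCache`/`acl_for` names are assumptions for illustration):

```javascript
import { ReplaySubject, share, timer } from "rxjs";

const ACL_CACHE_TIME = 30 * 60 * 1000;   // half an hour

class AuthCache {
    constructor (notify) {
        this.notify = notify;    // assumed change-notify WATCH interface
        this.acls = new Map();   // principal -> shared ACL sequence
    }

    // Return a shared, replayed sequence of ACL updates for a
    // principal. The underlying WATCH stays open for half an hour
    // after the last consumer unsubscribes, then is dropped.
    acl_for (principal) {
        let seq = this.acls.get(principal);
        if (!seq) {
            seq = this.notify.watch(`v2/acl/kerberos/${principal}`).pipe(
                share({
                    connector: () => new ReplaySubject(1),
                    resetOnRefCountZero: () => timer(ACL_CACHE_TIME),
                }));
            this.acls.set(principal, seq);
        }
        return seq;
    }
}
```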

@amrc-benmorrow force-pushed the bmz/auth-rx-client branch 4 times, most recently from 91f04a0 to 0ca1421 on October 1, 2025 14:19
@amrc-benmorrow
Contributor Author

Causes failures in cluster-manager on initial install. The failures are caused by subscriptions to the ConfigDB being refused. I thought this was to do with permissions being cached inappropriately in RxClient.Auth, but this doesn't appear to be the cause. The failures go away once auth and configdb have been restarted.

@amrc-benmorrow
Contributor Author

amrc-benmorrow commented Oct 3, 2025

The cause here is that adding a new principal UPN mapping is not causing the permissions for that principal to be correctly recalculated internally. This problem lies entirely inside the Auth service.

The ConfigDB is performing a WATCH on v2/acl/kerberos/sv1clustermanager. Initially this returns a 403 response as the sv1configdb UPN isn't mapped to an account yet. (This is why this only appears on initial install.) This causes the ConfigDB to respond 503 to cluster manager subscriptions, as we can't check ACLs. So far this is all correct. However when the sv1configdb account mapping is created (by service-setup, in this case, but krbkeys would be the same) the WATCH request should update with the ACL information, but instead it sends another 403 response.

This can be reliably reproduced (see the sketch after this list) by:

  • Creating an account with ReadACL permission.
  • Deleting the UPN mapping for that account.
  • Subscribing to WATCH v2/acl/....
  • Recreating the UPN mapping.
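The same steps as a hypothetical script (`create_account`, `delete_upn_mapping` and `create_upn_mapping` are illustrative stand-ins, not the real admin API):

```javascript
const upn = "test@REALM";

// An account with ReadACL permission, initially mapped to this UPN.
const account = await auth.create_account({ permissions: ["ReadACL"] });
await auth.create_upn_mapping(upn, account);
await auth.delete_upn_mapping(upn);

// While the mapping is missing the WATCH correctly reports 403...
notify.watch(`v2/acl/kerberos/${upn}`)
    .subscribe(res => console.log(res.status));

// ...but once the mapping is recreated the WATCH should emit the ACL.
// Instead it sends another 403.
await auth.create_upn_mapping(upn, account);
```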

This overrides the core acl-fetch method in the Auth interface with one
which subscribes to notifications. This gives us immediate updates when
permissions change. We subscribe to the ACL for a given principal for
half an hour; if we have no more requests in that time the subscription
will be dropped.
This gives push authentication.
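
A minimal sketch of that override, reusing the hypothetical `acl_for` cache from the sketch above (`fetch_acl` is the real method name mentioned below; the rest is assumed):

```javascript
import { firstValueFrom } from "rxjs";

class NotifyAuth extends Auth {
    // Resolve from the notify-driven sequence rather than an HTTP
    // request; later emissions keep the cached ACL current.
    fetch_acl (principal) {
        return firstValueFrom(this.acl_for(principal));
    }
}
```
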
Code which had moved from acs-configdb to js-service-api was not moved across correctly.

Skip the Directory for now, as that needs direct access to the raw ACL.

The Directory needs access to the raw ACL, so expose this as a public method. It's cleaner to override a public method in any case.

This should be more reliable, as the Sparkplug notifications do not sync well with the HTTP caching.
The ServiceClient already had correct code to return a 503 when we got
permission denied from the Auth service. Use this instead of overriding
it.

Switch to Response in place of Maybe as this preserves the HTTP response
code.
I even created a sequence function especially for this purpose...
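
To illustrate the difference (a hypothetical sequence, not the actual js-service-api code): a Maybe collapses every failure into "nothing", while a Response keeps the status code, so a failure can be propagated faithfully.

```javascript
// Hypothetical: combine an array of Response-like values into a single
// Response. On failure we keep the offending status code; a Maybe
// would have thrown that information away.
function sequence (responses) {
    const failed = responses.find(r => r.status >= 300);
    if (failed)
        return { status: failed.status };
    return { status: 200, body: responses.map(r => r.body) };
}
```
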
Rationalise our error returns in ServiceClient so that we only return ServiceError for a problem with the remote service. Ordinary parameter errors now throw other errors. This means that a ServiceError thrown during processing of a request should always result in a 503 response.
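
A sketch of the resulting convention, assuming Express-style handlers (the handler itself is illustrative):

```javascript
// ServiceError now always means "a remote service we depend on
// failed", so the only sensible answer to our own client is 503.
function error_handler (err, req, res, next) {
    if (err instanceof ServiceError) {
        res.status(503).end();
        return;
    }
    next(err);    // parameter errors and the like take their own path
}
```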

Remove the hack in Auth.fetch_acl of throwing a ServiceError with a 503 status.

The original logic here did not retry failed updates and left this to the 10-minute reconcile loop. Changing it to retry individual actions causes a retrying update to block all other updates, including an update to delete the cluster.

Group actions by cluster. Attach retry logic to each action and then
switch through the actions for each cluster. This ensures we abandon a
retrying action if we should be doing something else instead.
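
A rough rxjs sketch of that shape, assuming the actions arrive as a stream tagged with their cluster (`actions` and `perform` are illustrative):

```javascript
import { from, groupBy, mergeMap, retry, switchMap, timer } from "rxjs";

const handled = actions.pipe(
    // One independent sub-stream per cluster, so one cluster's
    // retries cannot block another cluster's updates.
    groupBy(action => action.cluster),
    mergeMap(cluster => cluster.pipe(
        // Within a cluster a newer action cancels the current one,
        // even mid-retry; a delete-cluster action wins immediately.
        switchMap(action => from(perform(action)).pipe(
            retry({ delay: () => timer(5000) }))))));
```
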
Instead of duck-typing, use a proper class to distinguish between 'a service we contacted returned an error', 'we need to return a specific error', and other failures which happen to result in an error object with a `status` property.
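
An illustrative sketch of the distinction (not the real class hierarchy):

```javascript
// With an explicit class, `err instanceof ServiceError` is a
// deliberate signal that a remote service failed, where a bare
// `status` property on an error object might mean anything.
class ServiceError extends Error {
    constructor (service, status, message) {
        super(message);
        this.service = service;   // the remote service which failed
        this.status  = status;    // the status it returned
    }
}
```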