Skip to content
74 changes: 74 additions & 0 deletions docs/proposals/respect-rbac-for-resource-exclusions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
title: Respect RBAC for Resource Inclusions/Exclusions

authors:
- "@gdsoumya"
- "@alexmt"

sponsors:
- TBD

reviewers:
- @jannfis

approvers:
- @jannfis

creation-date: 2023-05-03

---

# Enhancement Idea

This is a proposal to provide the ability to configure argocd controller, to respect the current RBAC permissions
when handling resources besides the already existing resource inclusions and exclusions.

## Summary

Argo CD administrator will be able to configure in `argocd-cm`, whether to enable or disable(default) the feature where the controller will
only monitor resources that the current service account allows it to read.

## Motivation

Some users restrict the access of the argocd to specific resources using rbac and this feature will enable them to continue
using argocd without having to manually configure resource exclusions for all the resources that they don't want argocd to be managing.

## Proposal

The configuration for this will be present in the `argocd-cm`, we will add new boolean field `resource.respectRBAC` in the
cm which can be set to `true` to enable this feature, by default the feature is disabled.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this be safe to enable by default? If the controller doesn't have access to a resource, respecting RBAC will only help it avoid 403s.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep it enabled by default, but that might suddenly change the behavior of the controller when a user upgrades which I thought might not be welcomed by some users.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! I think we'd want to everywhere target enabling by default. But agreed, best to move cautiously.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also be voting for kind of a cautious move here. Make it a feature toggle, test it extensively, and then move forward to enable it by default.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think by default for now at least this would be disabled and opt in for the user.


For the implementation there are 3 proposals :

1. Modify `gitops-engine` pkg to make a `SelfSubjectAccessReview` request before adding any resource to the watch list, in this approach we are making an extra
api server call to check if controller has access to the resource, this does increase the no. of kubeapi calls made but is more accurate.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: The number of kubeapi calls. In clusters with a lot of APIs (e.g. OpenShift comes with 200 APIs installed by default), that's already at least a burst of 200 API calls in a very short time frame. The default QPS for the client-side rate limiter in the kube API client is 50 with a burst of 100 (2 * QPS).

With additional SelfSubjectAccessReview, this numbers would increase to a burst of 400 calls.

Things get even worse when you consider namespace-scoped mode. Because the watches would be established for each of the APIs per managed namespaces. Considering the 200+ APIs example, this would result in at least 400 calls per managed namespace.

This seems to be numbers that can easily break the API server.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default QPS for the client-side rate limiter in the kube API client is 50 with a burst of 100 (2 * QPS).

The default in kube has actually been increased to either 300 or 500. But Argo's default is currently set lower. We should set ours to the new default

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed.

50/100 is a very conservative number. I know that in the OpenShift client, this has been changed for a while (I think to 350 for the burst at least), and if upstream K8s has adopted a similar increase by default, we should do it as well. People can still fine-tune it using ARGOCD_K8S_CLIENT_QPS and ARGOCD_K8S_CLIENT_BURST when they want more conservative rate limits, but I guess the better user experience is to have them increased by default.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. Modify `gitops-engine` pkg to check for forbidden/unauthorized errors when listing for resources, this is more efficient approach as the
no. of kubeapi calls made does not change, but there is a chance of false positives as similar errors can be returned from kubeapi server or env specific proxies in other situations
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what kind of false positives that would be?

And, if there's a false-positive, chances are that SelfSubjectAccessReview will return positive, but the API server will reply with a 403 when trying to establish the watch, then again resulting in a hard error?

3. Combine approaches 1 and 2, in this controller will check the api response for the list call, and if it receives forbidden/unauthorized it will make the `SelfSubjectAccessReview` call.
This approach is accurate and at the same time, only makes extra api calls if the list calls fail in the first place.
Comment on lines +47 to +48
Copy link
Copy Markdown
Member

@jannfis jannfis Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strongly depends on the use-case of the user and what kind of permissions they are willing to give to a specific Argo CD instance. For example, if there's an instance that is only allowed to manage 10 out of 200 APIs, you would still have 190 additional calls to the SelfSubjectAccessReview.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe make the use of SubjectSelfAccessReview configurable? Then people can have the trade-off they want - on clusters with a low amount of APIs, they can configure the mechanism to the highest precision while on clusters with high amount of APIs, they trade precision against speed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I think we can have another config option that enables the stricter check. Will update the proposal with this

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


In all solutions, once controller determines that it does not have access to the resource it will stop monitoring it.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user adds/removes access later ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A controller restart will be necessary

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Controller automatically does full "resync" every day. So it will auto-discover RBAC change once a day

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes ^^ though if you want it instantly a restart is needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexmt The watches will be re-established every 10 minutes as well, right? I think part of that is also rediscovery changes in available APIs. I guess with that, auto-discovery of RBAC changes would also happen all 10 minutes instead of only after cache expiry?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think controller is retrying only in-progress watches every 10 minutes. So it will notice if RBAC no longer allows accessing resource but won't notice if RBAC is allowing new resources

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you add new CRDs to the cluster, Argo CD will pick those up without the need for a restart. Is that due to an informer on the CRD API?

As for the restart, I think a cluster-cache resync (which can be triggered via Argo CD's API) should be sufficient? Maybe we can add a new functionality that Argo CD will also re-try (probe) the APIs it received a 401 for previously? In a configurable interval, which defaults to the same duration as the watch timeout (10 minutes)?

Anyway, I think this is implementation detail and could even be added later on. Just keeping this here as a potential idea.


### Implementation decision

It was decided that we will go with approach 3 from the above list, but instead of a boolean flag we will have the `resource.respectRBAC` take 3 configuration options for the users :
- `strict` : This will perform both the checks i.e. whether the list call response is forbidden/unauthorized and if it is make the `SelfSubjectAccessReview` call to confirm.
- `normal` : This will only check whether the list call response is forbidden/unauthorized and skip `SelfSubjectAccessReview` call.
- unset/empty : This will disable the feature and controller will continue to monitor all resources.

NOTE: By default `resource.respectRBAC` will be unset or `""` which disables the feature

Users who are okay with an increase in kube api server calls can opt for strict option while users who are concerned with higher api calls can compromise on the accuracy and opt for the normal option.

## Security Considerations and Risks

There are no particular security risks associated with this change, this proposal rather improves the argocd controller
to not access/monitor resources that it does not have permission to access.

## Upgrade / Downgrade Strategy

There is no special upgrade strategy needed, all existing argocd configmaps will continue to work
and old configs without the `resource.respectRBAC` config will cause no change in argocd controllers behavior.

While downgrading to older version, if the user had configured `resource.respectRBAC` previously this would be ignored completely
and argocd would revert to its default behavior of trying to monitor all resources.