Conversation

@njhale (Contributor) commented Sep 19, 2019

Proposal for simplifying OLM's APIs.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 19, 2019
@njhale (Contributor Author) commented Sep 19, 2019

@deads2k (Contributor) commented Sep 19, 2019

/assign @jwforres

@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 24, 2019

A Cluster Admin can generate `ServiceAccounts` with different installation privileges as needed; e.g. namespace list, pod create, etc. To allow a user to reference a `ServiceAccount`, a Cluster Admin can bind a `Role/ClusterRole` with the `use` custom verb to the user.

This behavior will be enforced by a `ValidatingAdmissionWebhook`. In order to configure serving certs for the webhook
Member:

By "This behavior" do you mean the user's access to the ServiceAccount? Ther big difference vs OperatorGroups is the user supplying the SA for the Operator CR. With OperatorGroups admins could setup what SA is used and not allow users to manage OperatorGroups at all. For OSD, my concern is a user could reference a SA with higher privilege than dedicated-admins have by simply knowing what strings to plug into the Operator CR.
Is there any RBAC check on the SA ref?

Contributor Author:

By "This behavior" do you mean the user's access to the ServiceAccount?

Yes.

> For OSD, my concern is a user could reference a SA with higher privilege than dedicated-admins have by simply knowing what strings to plug into the Operator CR.
> Is there any RBAC check on the SA ref?

This is what the `use` custom verb and `ValidatingAdmissionWebhook` will ensure. `Operator` create/update requests will be rejected if the user does not have `use` on the referenced `ServiceAccount`.
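As a rough sketch (names below are placeholders, not from the proposal), the webhook could decide this by posting a SubjectAccessReview for the requesting user and rejecting the create/update if the review is denied:

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: alice                     # requesting user, taken from the admission request
  resourceAttributes:
    group: ""                     # core API group
    resource: serviceaccounts
    namespace: my-operator-ns     # placeholder
    name: my-operator-installer   # the ServiceAccount referenced by the Operator
    verb: use                     # the custom verb described above
```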

> The big difference vs OperatorGroups is the user supplying the SA for the Operator CR. With OperatorGroups, admins could set up what SA is used and not allow users to manage OperatorGroups at all.

Fair point. We've been working on an amendment to this proposal that shifts away from requiring a ServiceAccount. So, let me take the opportunity to get some early feedback.

Essentially, we want an "android-esque" permission approval flow. For us, this means that if the user that creates an Operator has the required permissions, OLM will install it with only the subset of permissions required. If an update comes through that requires greater permissions than the original subset, a user of greater or equal privileges must approve a permission expansion before OLM will install the update (and so on). The resource that will gate permission expansion will be InstallPlan.
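If the gate looks anything like today's InstallPlan approval flow (purely illustrative; names and versions are placeholders), an update that needs extra permissions would simply sit unapproved until someone with sufficient privileges flips it:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
metadata:
  name: install-abc123           # placeholder
  namespace: my-operator-ns      # placeholder
spec:
  approval: Manual               # held for manual approval because the update expands permissions
  approved: false                # a user with at least the expanded permissions must set this to true
  clusterServiceVersionNames:
  - my-operator.v0.2.0           # placeholder update that requires additional permissions
```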

Member:

> For OSD, my concern is a user could reference a SA with higher privilege than dedicated-admins have by simply knowing what strings to plug into the Operator CR.
> Is there any RBAC check on the SA ref?
>
> This is what the `use` custom verb and `ValidatingAdmissionWebhook` will ensure. `Operator` create/update requests will be rejected if the user does not have `use` on the referenced `ServiceAccount`.

Do you have some docs on the use verb? I'd like to understand it in the context of this enhancement and for use in OSD, where we're starting to develop some ValidatingAdmissionWebhooks. I see a discrete list of operations allowed but it doesn't include "USE": https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration

> The big difference vs OperatorGroups is the user supplying the SA for the Operator CR. With OperatorGroups, admins could set up what SA is used and not allow users to manage OperatorGroups at all.
>
> Fair point. We've been working on an amendment to this proposal that shifts away from requiring a ServiceAccount. So, let me take the opportunity to get some early feedback.
>
> Essentially, we want an "android-esque" permission approval flow. For us, this means that if the user that creates an Operator has the required permissions, OLM will install it with only the subset of permissions required. If an update comes through that requires greater permissions than the original subset, a user of greater or equal privileges must approve a permission expansion before OLM will install the update (and so on). The resource that will gate permission expansion will be InstallPlan.

Seems reasonable assuming the following:

  • Auto-approval will not auto-approve an InstallPlan if the additional permissions required are outside the permissions available to the user who created the Operator, as of the time that Operator was created.
  • Approving an InstallPlan checks the approver's permissions as though that user were the one creating the Operator.

Contributor Author:

> Do you have some docs on the use verb?

It's a made-up RBAC verb, which would go in a (Cluster)Role; i.e., not a webhook verb/action.
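For example, a Role granting it might look like the following (names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-operator-installer              # placeholder
  namespace: my-operator-ns                 # placeholder
rules:
- apiGroups: [""]
  resources: ["serviceaccounts"]
  resourceNames: ["my-operator-installer"]  # the ServiceAccount the bound user may reference
  verbs: ["use"]                            # not enforced by kube itself; only checked by OLM's admission webhook
```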

> Seems reasonable assuming the following: ...

You've pretty much outlined the idea -- there are a few slight differences that will hopefully be clear once we push the updates to this enhancement.

```yaml
lastTransitionTime: "2019-09-16T22:26:29Z"
```

Not all components will need to have status conditions surfaced. Initially, it should be sufficient to add conditions for:
Member:

Is there a material cost / downside to adding all components and all status conditions? If there is a good reason to limit this I'm all for it, otherwise it will be a great convenience to have everything roll up.

Contributor Author:

Not every resource has status conditions, and we'll most likely need to do a mapping when we implement this (since under the covers they are different Go types). Other than that, I was thinking that it didn't make sense to have conditions for resources like ServiceAccounts and RBAC, etc. That being said, we can always add more as we find we need them.
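To make that concrete, here's a rough sketch of the kind of rolled-up component condition I have in mind (shape is illustrative only, mirroring the excerpt above; names are placeholders):

```yaml
status:
  components:
    refs:
    - apiVersion: apps/v1
      kind: Deployment
      namespace: my-operator-ns      # placeholder
      name: my-operator-controller   # placeholder
      conditions:
      - type: Available              # mapped from the Deployment's own Available condition
        status: "True"
        reason: MinimumReplicasAvailable
        lastTransitionTime: "2019-09-16T22:26:29Z"
```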


- A new API resource is required
- Listing operators becomes a privileged operation that leaks info about existing namespaces and resources
- Implicit agreement that an operator is a cluster singleton
Member:

I think this has some drawbacks; see previous comment.


## Alternatives

We can define a namespace-scoped operator resource in addition to the cluster-scoped one. We can use the cluster-scoped resource to indicate an operator may span multiple namespaces, and a namespace-scoped operator to indicate it should be installed only in a specific namespace.
Member:

What are the use cases for operators that "span multiple namespaces"? An operator can watch resources across the cluster with sufficient RBAC, so it's not about what it can touch. And you could use multiple instances of Operator to install an operator more than once.

Contributor Author:

> What are the use cases for operators that "span multiple namespaces"? An operator can watch resources across the cluster with sufficient RBAC, so it's not about what it can touch.

We're trying to set the stage for a better operator author user story. An operator author knows best how their operator works, and it's come into question whether enforcing the use of a single namespace is a reasonable constraint. For example, some operators are written with a hardcoded namespace, and others with RoleBindings in kube-system. Rather than "what it can touch", I'd say it's a "native configuration" story. Along with this, we also need to add knobs for admins to map those native namespaces.

> And you could use multiple instances of Operator to install an operator more than once.

I don't think that multiple instances of an operator are a tenable configuration for a cluster. This comes down to the fact that conversion webhooks and extension api-servers are effectively singletons and require exactly one backing service to operate. While we can come up with creative (and probably complex) ways of intercepting and routing traffic, I think that any real solution will require better first-class multitenancy in Kubernetes.
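For context on the singleton point: a CRD's conversion webhook can only point at a single backing Service, e.g. (illustrative names):

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com          # placeholder CRD
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
  - name: v1beta1
    served: true
    storage: true
  - name: v1alpha1
    served: true
    storage: false
  conversion:
    strategy: Webhook
    webhookClientConfig:
      service:                        # exactly one Service can back conversion for this CRD
        namespace: my-operator-ns     # placeholder
        name: my-operator-conversion  # placeholder
        path: /convert
```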

Member:

When I think of operator installation, I think mostly about the Deployment and associated Roles and RoleBindings (permissions in the CSV currently). Anything outside of this is managed as part of the operator's bundle or managed by the operator's controller directly. I see bundled and controller-managed resources as outside the scope of what Operator will manage. It is possible today to create a bundle with Roles and RoleBindings outside of the namespace in which the Deployment is created, but Subscription is namespaced. What is changing about the operator story that changes the scope of operator installation to imply operators are a cluster-scoped resource? At the end of the day, the core of an operator is the controller deployed as a pod by a Deployment, which is namespaced.

Member:

I don't think the operator story is changing; the implementation is evolving to better match the reality of the platform. As @njhale points out, having multiple instances of the same controller running concurrently is not a good fit for the way k8s is evolving its CRD story. As I understand it, the outcome is that you'll have one instance of a controller running in the cluster, but then you might be able to choose whether that controller watches one namespace, a list of namespaces, or all namespaces.


#### Constraining Installation

In a fashion similar to `OperatorGroups`, we can constrain the installation of an operator by associating it with a `ServiceAccount`. For an `Operator` resource, the user's desired `ServiceAccount` will be specified by the `spec.install.as` field. The install namespace, where the namespaced contents of an operator will be installed, is specified by the `install.namespace` field.
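To make the two fields concrete, a hypothetical `Operator` might look like the following (the apiVersion and overall layout are illustrative assumptions; only the `install.as` and `install.namespace` fields come from the text above):

```yaml
apiVersion: operators.coreos.com/v2alpha1   # hypothetical group/version for the proposed resource
kind: Operator
metadata:
  name: my-operator                         # placeholder; cluster-scoped, so no namespace
spec:
  install:
    as: my-operator-installer               # ServiceAccount the installation runs as (spec.install.as)
    namespace: my-operator-ns               # where namespaced content is installed (install.namespace)
```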
Member:

What is driving this use case? Most admins are used to the idea of running a package manager as "root" as long as they can:

  1. have an opportunity to see what it's going to do (in this case, just list the resources that will be created)
  2. are able to limit the sources from which content is installed to sources they trust

The far more interesting concern is what permissions the operator itself runs with over the long term. That is a better fit for your android update analogy below. It's also observable from the operator bundle, so it's mostly a solved problem (unless you want to influence auto-upgrade gating based on RBAC changes).

When installing an operator, won't that in most cases involve creating a ServiceAccount, ClusterRole, and ClusterRoleBinding for that operator? If the installing user has the ability to create those things, don't they already have the keys to the kingdom? What value will we get from tracking resource-by-resource what that user can do vs what it takes to install the operator? Would it be sufficient to just assume that installing operators takes a lot of privilege, and gate that with RBAC for the Operator resource?

Member:

The user or SA creating a ClusterRoleBinding cannot do so unless they already have the permissions being granted. For managed services (OSD, ARO), a customer does not run as a cluster-admin. In this case we cannot allow customers to install any arbitrary operator, as they could elevate privileges.

Member:

What does ARO stand for?

@njhale (Contributor Author) commented Oct 2, 2019:

> What is driving this use case? Most admins are used to the idea of running a package manager as "root"...

As @jewzaam points out, the admin installing an operator is not always a full-blown cluster-admin.

> The far more interesting concern is what permissions the operator itself runs with over the long term... It's also observable from the operator bundle, so it's mostly a solved problem (unless you want to influence auto-upgrade gating based on RBAC changes).

We plan on doing this. The recent proposal updates (hopefully) make this more clear.

> If the installing user has the ability to create those things, don't they already have the keys to the kingdom?

kube's RBAC model prevents a user from creating/updating (Cluster)Roles/RoleBindings if they don't already have matching privileges at the same or greater scope (or have the escalate verb).
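(For reference, that escalate exception is itself an ordinary RBAC grant, e.g.:)

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rbac-escalator                      # placeholder
rules:
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "clusterroles"]
  verbs: ["escalate"]                       # allows granting permissions the grantor doesn't already hold
```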

> What value will we get from tracking resource-by-resource what that user can do vs what it takes to install the operator?

We prevent users from escalating their privilege via OLM.

Member:

> What does ARO stand for?

Azure Red Hat OpenShift



@njhale njhale changed the title WIP Enhancement: Simplify OLM APIs Simplify OLM APIs Oct 2, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2019
@ecordell (Contributor) left a comment:

/approve

I think this looks good and goes into the right amount of detail. We'll probably want a couple of proposals in the OLM repo (particularly around the ServiceAccount generation and the Install approval process).

@dmesser @jwforres @shawn-hurley Please do a final pass on this so we can merge. Thanks!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 3, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2019
@openshift-merge-robot openshift-merge-robot merged commit 0e06c9d into openshift:master Oct 3, 2019