
Support generic additional resources #9

Open
damemi opened this issue Sep 20, 2021 · 8 comments

Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

damemi commented Sep 20, 2021

Custom schedulers could need any number of additional resources -- ConfigMaps, CRDs, RBAC policies, etc. Users should either be able to provide these to the operator, or we should document that they are responsible for creating any additional resources on their own.

damemi commented Dec 13, 2021

x-ref #12

ffromani commented Dec 13, 2021

A possible way to solve this issue could be the following.

General principles I'm trying to follow in the proposal:

  • do not introduce new types (e.g. new CRDs) unless clearly needed, for simplicity's sake
  • provide a clear, auditable trail between the scheduler image and the extra resources. IOW, it should always be possible to verify that scheduler image ABC really requested extra resources RR version VV
  • be as simple as possible to implement, for both the scheduler image supplier and the secondary scheduler operator
  • be opt-in and completely transparent for secondary schedulers which do not care about or need extra resources

Proposal:

  1. Add a field in the API to opt in. This way the new feature is completely transparent for secondary schedulers which do not care about or need extra resources.
  2. If the secondary scheduler does NOT opt in, bail out and use the existing flow.
  3. If the secondary scheduler opts in, its image MUST provide an additional /bin/get-manifests entrypoint.
  4. The secondary scheduler operator (SSOP) will create a $SCHEDULER_NAME-resources deployment in the same namespace as the main deployment, using the /bin/get-manifests entrypoint.
  5. Should the $SCHEDULER_NAME-resources pod end up in a non-running state, the process will abort and the state will be reported to the cluster (conditions, events): the secondary scheduler deployment will FAIL, kubernetes events will be emitted and the secondaryScheduler CR’s condition will be updated accordingly.
  6. The /bin/get-manifests entrypoint MUST emit on stdout the extra resources the scheduler needs, in YAML format, and then sleep forever, consuming as few resources as possible while doing so (see the sketch after this list). We do it this way to avoid introducing extra kubernetes objects (extra CRDs) and to keep the solution simple to implement and to consume. I'll elaborate later about alternatives.
  7. SSOP will consume all the output of /bin/get-manifests, validate the objects for basic sanity, and create them AFTER it creates its builtin, default objects. Initially, only these kubernetes objects will be supported:
     • ConfigMap
     • ClusterRole
     • ClusterRoleBinding
  8. Should unsupported objects be requested, the process will abort and the state will be reported to the cluster (conditions, events): the secondary scheduler deployment will FAIL, kubernetes events will be emitted and the secondaryScheduler CR’s condition will be updated accordingly.
  9. All the extra objects WILL be created in the same namespace as SSOP and the secondary scheduler. The namespace, if present, WILL be overridden (a log entry will be emitted). Kubernetes events will be emitted to signal the creation of each extra object.
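
To make step 6 concrete, here is a minimal sketch (in Go) of what a /bin/get-manifests binary could look like; the embedded file path is hypothetical and actual scheduler images may implement this differently:

package main

import (
	_ "embed"
	"fmt"
)

// The manifests path is a hypothetical example; each scheduler image would
// embed whatever extra resources it needs.
//go:embed manifests/extra-resources.yaml
var extraResources string

func main() {
	// Emit the extra resources (ConfigMap, ClusterRole, ClusterRoleBinding)
	// as multi-document YAML on stdout; the operator reads them from the
	// $SCHEDULER_NAME-resources pod logs.
	fmt.Print(extraResources)

	// Sleep forever so the auxiliary deployment stays in a Running state
	// while consuming virtually no resources.
	select {}
}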

Alternatives:

Encode the extra resources in the CRD somehow.

We have two main options: add the extra resources as a blob (a string field in the API), which is extensible but opaque, or add explicit types in the spec, which is hard to read and to extend (both options are sketched below). In either case this is doable, but IMO it increases the maintenance cost because it creates a disconnect between the extra resources definition and the scheduler which requires them. With the proposed approach there is a clear link and a clear responsibility chain between the scheduler and the definition of the extra resources.
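
Purely for illustration, the two rejected encodings could look roughly like this (type and field names are hypothetical, not the actual SecondaryScheduler API):

package v1

// Option A: extra resources as an opaque blob. Extensible, but the manifests
// are detached from the scheduler image that requires them.
type ExtraResourcesBlob struct {
	// Raw multi-document YAML, applied as-is by the operator.
	ManifestsYAML string `json:"manifestsYAML,omitempty"`
}

// Option B: explicit typed fields in the spec. Easier to validate, but every
// newly supported kind requires an API change. Schematic only: the slices
// could hold names of pre-created objects or inline definitions.
type ExtraResourcesTyped struct {
	ConfigMaps          []string `json:"configMaps,omitempty"`
	ClusterRoles        []string `json:"clusterRoles,omitempty"`
	ClusterRoleBindings []string `json:"clusterRoleBindings,omitempty"`
}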

POC PR: #17

ffromani added a commit to ffromani/secondary-scheduler-operator that referenced this issue Dec 14, 2021
Add support in the reconciler code to fetch
secondary-scheduler extra resources using the
`/bin/get-manifests` entrypoint and an auxiliary
deployment.

For details, see: openshift#9 (comment)

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to ffromani/secondary-scheduler-operator that referenced this issue Feb 21, 2022
Add support in the reconciler code to fetch
secondary-scheduler extra resources using the
`/bin/get-manifests` entrypoint and an auxiliary
deployment.

For details, see: openshift#9 (comment)

Signed-off-by: Francesco Romani <[email protected]>
@ingvagabund

After a discussion with @fromanirh on Slack:

  • The secondary scheduler operator needs to be aware of any extra resource it reconciles
  • When the extra resources are updated (e.g. during upgrades) the SSO needs to make sure to properly delete any extra resource that is no longer needed (e.g. through annotations and/or revisions)
  • We need to limit the set of extra resources the SSO will support to a minimum so we do not end up supporting everything (the set of supported extra resources can be extended later if needed). We can start with a ConfigMap, ClusterRole and ClusterRoleBinding.
  • I suggest presenting the extra resources through a ConfigMap to give the end users a flexible way of choosing how the resources are going to be populated (and updated), e.g. by an external component or by hand (a sketch of such a ConfigMap follows this list)
  • Only a user with enough privileges (RBAC rules) will be able to CRUD the ConfigMap
  • In case extra manifest(s) are part of the scheduler image, the SSO can be instructed to render a deployment that executes a binary inside the image, which will be required to inject any extra manifests into the ConfigMap
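
A minimal sketch of what the hand-over ConfigMap could look like, built with client-go types; the name, namespace and keys are hypothetical:

package extras

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// extraResourcesConfigMap sketches the ConfigMap suggested above: one key per
// manifest the SSO is allowed to reconcile.
func extraResourcesConfigMap() *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "secondary-scheduler-extra-resources",
			Namespace: "openshift-secondary-scheduler-operator",
		},
		Data: map[string]string{
			// Populated by hand, by an external component, or by the
			// injector deployment rendered by the SSO.
			"configmap.yaml":          "<plugin configuration manifest>",
			"clusterrole.yaml":        "<get/list/watch-only ClusterRole>",
			"clusterrolebinding.yaml": "<binding to the secondary-scheduler SA>",
		},
	}
}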

@ingvagabund

The secondary-scheduler-extra-resources deployment as you implemented it in #17 might follow the same approach as the installer pods use these days. E.g.

$ oc logs -n openshift-kube-scheduler installer-6-ip-10-0-157-21.ec2.internal
...
I0223 09:02:43.958146       1 cmd.go:92] (*installerpod.InstallOptions)(0xc000300ea0)({
 KubeConfig: (string) "",
 ...
 Namespace: (string) (len=24) "openshift-kube-scheduler",
 PodConfigMapNamePrefix: (string) (len=18) "kube-scheduler-pod",
})
...
I0223 09:03:15.010909       1 cmd.go:646] Writing config file "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-6/configmaps/kube-scheduler-pod/version" ...
I0223 09:03:15.010952       1 cmd.go:646] Writing config file "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-6/configmaps/kube-scheduler-pod/forceRedeploymentReason" ...
...
I0223 09:03:15.383131       1 cmd.go:638] Writing static pod manifest "/etc/kubernetes/manifests/kube-scheduler-pod.yaml" ...
{"kind":"Pod","apiVersion":"v1","metadata":{"name":"openshift-kube-scheduler","namespace":"openshift-kube-scheduler","creationTimestamp":null,"labels":{"app":"openshift-kube-scheduler","revision":"6","scheduler":"true"},..."status":{}}

Where, instead of Writing config file, the "installer" would log Storing ClusterRole manifest NAME into secondary-scheduler-extra-resources-configmap or similar. The SSO repository might provide a predefined pkg/extra-resources-injector implementation which any end user would "just" vendor, providing the extra manifests as assets (a rough sketch follows).
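
A rough sketch of what such a vendored injector could look like, using standard client-go calls; the package name, ConfigMap name and log wording are hypothetical:

package injector

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

const configMapName = "secondary-scheduler-extra-resources"

// InjectAssets stores the vendored extra manifests into the shared ConfigMap,
// logging each step in the same spirit as the installer pods.
func InjectAssets(ctx context.Context, client kubernetes.Interface, namespace string, assets map[string]string) error {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, configMapName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: configMapName, Namespace: namespace},
			Data:       map[string]string{},
		}
		if cm, err = client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
			return fmt.Errorf("creating %s: %w", configMapName, err)
		}
	} else if err != nil {
		return err
	}

	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	for name, manifest := range assets {
		klog.Infof("Storing manifest %q into %s", name, configMapName)
		cm.Data[name] = manifest
	}

	_, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}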

ingvagabund commented Mar 21, 2022

In more detail:

  • Some plugins like coscheduling own a CRD and require at least the RBAC to update the CR status (besides the basic get/list/watch). However, the SSO can not reconcile a CRD of which it knows nothing in advance. E.g. when a CRD version is updated, the SSO does not know how to migrate it to a newer version (the CRD upgrade path is unknown). It is the responsibility of an admin to migrate the CR data before the CRD can be updated. Meaning the SSO can not provide a full reconciliation.
  • Some plugins require additional RBAC (besides what's already granted by the system:kube-scheduler and system:volume-scheduler cluster roles). On the other hand the SSO can not reconcile every extra clusterrole/clusterrolebinding that's provided. Allowing this creates a security hole for updating/deleting already existing cluster roles, which might (un)intentionally change the permissions of other components, either by removing permissions (e.g. removing the ability of the default scheduler to bind pods) or by extending permissions (e.g. allowing a user application to read secrets from another namespace). Thus, the ability to reconcile clusterroles/clusterrolebindings has to be limited to creating only a single clusterrole which is always explicitly bound to the openshift-secondary-scheduler-operator/secondary-scheduler SA and explicitly named (e.g. secondary-scheduler-extra-cluster-role). Also, all the verbs have to be limited to only get, list and watch; optionally also the crd/status resource with the update verb (to be debated). A validation sketch follows this list. Even with only the get/list/watch cluster role permissions it is possible to intentionally create a plugin which can dump every object in the cluster, so the documentation/instructions need to mention that the permission to read the secondary scheduler container's logs should be given only to SSO admins.
  • The cluster role permissions can be set in the same way as in the OLM case, by specifying clusterPermissions in the operator CR. For better automation, the clusterrole permissions could be specified through a configmap when clusterPermissions is not specified.
  • In case some of the scheduling plugins require a CRD or clusterrole/clusterrolebinding which is not supported by the SSO, a different mechanism for deploying those is required. The SSO has limited support because the secondary scheduler image is unknown in advance.
  • With the extra resources limited to only a single clusterrole (explicitly named) and the clusterrolebinding (managed by the SSO), it is safe to delete those after the operator gets uninstalled, without any risk to other components. Any other resource which is required by the secondary scheduler and is explicitly reconciled by a different mechanism needs to be deleted separately.
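
A sketch of how the SSO could enforce the ClusterRole restrictions described above; the allowed name is the example one mentioned in this comment, and the function name is hypothetical:

package validation

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

const allowedClusterRoleName = "secondary-scheduler-extra-cluster-role"

var allowedVerbs = sets.NewString("get", "list", "watch")

// ValidateExtraClusterRole accepts only a single, explicitly named ClusterRole
// whose rules are limited to read-only verbs, so the secondary scheduler can
// neither widen nor shrink the permissions of other components.
func ValidateExtraClusterRole(cr *rbacv1.ClusterRole) error {
	if cr.Name != allowedClusterRoleName {
		return fmt.Errorf("extra ClusterRole must be named %q, got %q", allowedClusterRoleName, cr.Name)
	}
	for _, rule := range cr.Rules {
		for _, verb := range rule.Verbs {
			if !allowedVerbs.Has(verb) {
				return fmt.Errorf("verb %q is not allowed in the extra ClusterRole (only get/list/watch)", verb)
			}
		}
	}
	return nil
}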

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jun 19, 2022
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 19, 2022
@ffromani

/lifecycle frozen

openshift-ci bot added the lifecycle/frozen label and removed the lifecycle/rotten label on Jul 19, 2022