
Support generic additional resources #9

Open
damemi opened this issue Sep 20, 2021 · 8 comments

Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

damemi commented Sep 20, 2021

Custom schedulers could need any number of additional resources -- ConfigMaps, CRDs, RBAC policies, etc. Users should either be able to provide these to the operator, or we should document that they are responsible for creating any additional resources on their own.

damemi commented Dec 13, 2021

x-ref #12

ffromani commented Dec 13, 2021

A possible way to solve this issue could be the following.

General principles I'm trying to follow in the proposal:

  • do not introduce new types (e.g. new CRDs) unless clearly needed, for simplicity's sake
  • provide a clear, auditable trail between the scheduler image and the extra resources. IOW, it should always be possible to verify that scheduler image ABC really requested extra resources RR version VV
  • be as simple as possible to implement, for both the scheduler image supplier and the secondary scheduler operator
  • be opt-in and completely transparent for secondary schedulers which do not care about or need extra resources

Proposal:

  1. Add a field in the API to opt in. This way the new feature is completely transparent for secondary schedulers which do not care about or need extra resources.
  2. If the secondary scheduler does NOT opt in, bail out and use the existing flow.
  3. If the secondary scheduler opts in, its image MUST provide an additional /bin/get-manifests entrypoint.
  4. The secondary scheduler operator (SSOP) will create a $SCHEDULER_NAME-resources deployment in the same namespace as the main deployment, using the /bin/get-manifests entrypoint.
  5. Should the $SCHEDULER_NAME-resources pod end up in a non-running state, the process will abort and the state will be reported to the cluster (conditions, events): the secondary scheduler deployment will FAIL, kubernetes events will be emitted and the secondaryScheduler CR’s condition will be updated accordingly.
  6. The /bin/get-manifests entrypoint MUST emit on stdout the extra resources the scheduler needs, in YAML format, and then sleep forever, consuming as few resources as possible while doing so (see the sketch after this list). We do it this way to avoid introducing extra kubernetes objects (extra CRDs) and to keep the solution simple to implement and to consume. I'll elaborate later about alternatives.
  7. SSOP will consume all the output of /bin/get-manifests, validate the objects for basic sanity, and create them AFTER it creates its builtin, default objects. Initially, only these kubernetes objects will be supported:
     • ConfigMap
     • ClusterRole
     • ClusterRoleBinding
  8. Should unsupported objects be requested, the process will abort and the state will be reported to the cluster (conditions, events): the secondary scheduler deployment will FAIL, kubernetes events will be emitted and the secondaryScheduler CR’s condition will be updated accordingly.
  9. All the extra objects WILL be created in the same namespace as SSOP and the secondary scheduler. The namespace, if present, WILL be overridden (a log entry will be emitted). Kubernetes events will be emitted to signal the creation of each extra object.
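
To make step 6 concrete, here is a minimal sketch (in Go) of what a /bin/get-manifests binary could look like; the embedded file path is hypothetical and actual scheduler images may implement this differently:

package main

import (
	_ "embed"
	"fmt"
)

// The manifests path is a hypothetical example; each scheduler image would
// embed whatever extra resources it needs.
//go:embed manifests/extra-resources.yaml
var extraResources string

func main() {
	// Emit the extra resources (ConfigMap, ClusterRole, ClusterRoleBinding)
	// as multi-document YAML on stdout; the operator reads them from the
	// $SCHEDULER_NAME-resources pod logs.
	fmt.Print(extraResources)

	// Sleep forever so the auxiliary deployment stays in a Running state
	// while consuming virtually no resources.
	select {}
}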

Alternatives:

Encode the extra resources in the CRD somehow.

We have two main options: add the extra resources as a blob (a string field in the API), which is extensible but opaque, or add explicit types in the spec, which is hard to read and to extend (both options are sketched below). In either case this is doable, but IMO it increases the maintenance cost because it creates a disconnect between the extra resources definition and the scheduler which requires them. With the proposed approach there is a clear link and a clear responsibility chain between the scheduler and the definition of the extra resources.
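
Purely for illustration, the two rejected encodings could look roughly like this (type and field names are hypothetical, not the actual SecondaryScheduler API):

package v1

// Option A: extra resources as an opaque blob. Extensible, but the manifests
// are detached from the scheduler image that requires them.
type ExtraResourcesBlob struct {
	// Raw multi-document YAML, applied as-is by the operator.
	ManifestsYAML string `json:"manifestsYAML,omitempty"`
}

// Option B: explicit typed fields in the spec. Easier to validate, but every
// newly supported kind requires an API change. Schematic only: the slices
// could hold names of pre-created objects or inline definitions.
type ExtraResourcesTyped struct {
	ConfigMaps          []string `json:"configMaps,omitempty"`
	ClusterRoles        []string `json:"clusterRoles,omitempty"`
	ClusterRoleBindings []string `json:"clusterRoleBindings,omitempty"`
}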

POC PR: #17

ffromani added a commit to ffromani/secondary-scheduler-operator that referenced this issue Dec 14, 2021
Add support in the reconciler code to fetch
secondary-scheduler extra resources using the
`/bin/get-manifests` entrypoint and an auxiliary
deployment.

For details, see: openshift#9 (comment)

Signed-off-by: Francesco Romani <[email protected]>
ffromani added a commit to ffromani/secondary-scheduler-operator that referenced this issue Feb 21, 2022
Add support in the reconciler code to fetch
secondary-scheduler extra resources using the
`/bin/get-manifests` entrypoint and an auxiliary
deployment.

For details, see: openshift#9 (comment)

Signed-off-by: Francesco Romani <[email protected]>
@ingvagabund

After a discussion with @fromanirh on Slack:

  • The secondary scheduler operator needs to be aware of any extra resource it reconciles
  • When the extra resources are updated (e.g. during upgrades) the SSO needs to make sure to properly delete any extra resource that is no longer needed (e.g. through annotations and/or revisions)
  • We need to limit the set of extra resources the SSO will support to a minimum so we do not end up supporting everything (the set of supported extra resources can be extended later if needed). We can start with a ConfigMap, ClusterRole and ClusterRoleBinding.
  • I suggest presenting the extra resources through a ConfigMap to give the end users a flexible way of choosing how the resources are going to be populated (and updated), e.g. by an external component or by hand (a sketch of such a ConfigMap follows this list)
  • Only a user with enough privileges (RBAC rules) will be able to CRUD the ConfigMap
  • In case extra manifest(s) are part of the scheduler image, the SSO can be instructed to render a deployment that executes a binary inside the image, which will be required to inject any extra manifests into the ConfigMap
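
A minimal sketch of what the hand-over ConfigMap could look like, built with client-go types; the name, namespace and keys are hypothetical:

package extras

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// extraResourcesConfigMap sketches the ConfigMap suggested above: one key per
// manifest the SSO is allowed to reconcile.
func extraResourcesConfigMap() *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "secondary-scheduler-extra-resources",
			Namespace: "openshift-secondary-scheduler-operator",
		},
		Data: map[string]string{
			// Populated by hand, by an external component, or by the
			// injector deployment rendered by the SSO.
			"configmap.yaml":          "<plugin configuration manifest>",
			"clusterrole.yaml":        "<get/list/watch-only ClusterRole>",
			"clusterrolebinding.yaml": "<binding to the secondary-scheduler SA>",
		},
	}
}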

@ingvagabund

The secondary-scheduler-extra-resources deployment as you implemented it in #17 might follow the same approach as the installer pods use these days. E.g.

$ oc logs -n openshift-kube-scheduler installer-6-ip-10-0-157-21.ec2.internal
...
I0223 09:02:43.958146       1 cmd.go:92] (*installerpod.InstallOptions)(0xc000300ea0)({
 KubeConfig: (string) "",
 ...
 Namespace: (string) (len=24) "openshift-kube-scheduler",
 PodConfigMapNamePrefix: (string) (len=18) "kube-scheduler-pod",
})
...
I0223 09:03:15.010909       1 cmd.go:646] Writing config file "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-6/configmaps/kube-scheduler-pod/version" ...
I0223 09:03:15.010952       1 cmd.go:646] Writing config file "/etc/kubernetes/static-pod-resources/kube-scheduler-pod-6/configmaps/kube-scheduler-pod/forceRedeploymentReason" ...
...
I0223 09:03:15.383131       1 cmd.go:638] Writing static pod manifest "/etc/kubernetes/manifests/kube-scheduler-pod.yaml" ...
{"kind":"Pod","apiVersion":"v1","metadata":{"name":"openshift-kube-scheduler","namespace":"openshift-kube-scheduler","creationTimestamp":null,"labels":{"app":"openshift-kube-scheduler","revision":"6","scheduler":"true"},..."status":{}}

Where, instead of Writing config file, the "installer" would log Storing ClusterRole manifest NAME into secondary-scheduler-extra-resources-configmap or similar. The SSO repository might provide a predefined pkg/extra-resources-injector implementation which any end user would "just" vendor, providing the extra manifests as assets (a rough sketch follows).
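
A rough sketch of what such a vendored injector could look like, using standard client-go calls; the package name, ConfigMap name and log wording are hypothetical:

package injector

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

const configMapName = "secondary-scheduler-extra-resources"

// InjectAssets stores the vendored extra manifests into the shared ConfigMap,
// logging each step in the same spirit as the installer pods.
func InjectAssets(ctx context.Context, client kubernetes.Interface, namespace string, assets map[string]string) error {
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, configMapName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		cm = &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: configMapName, Namespace: namespace},
			Data:       map[string]string{},
		}
		if cm, err = client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
			return fmt.Errorf("creating %s: %w", configMapName, err)
		}
	} else if err != nil {
		return err
	}

	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	for name, manifest := range assets {
		klog.Infof("Storing manifest %q into %s", name, configMapName)
		cm.Data[name] = manifest
	}

	_, err = client.CoreV1().ConfigMaps(namespace).Update(ctx, cm, metav1.UpdateOptions{})
	return err
}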

ingvagabund commented Mar 21, 2022

In more detail:

  • Some plugins like coscheduling own a CRD and require at least the RBAC to update the CR status (besides the basic get/list/watch). However, the SSO can not reconcile a CRD of which it knows nothing in advance. E.g. when a CRD version is updated, the SSO does not know how to migrate it to a newer version (the CRD upgrade path is unknown). It is the responsibility of an admin to migrate the CR data before the CRD can be updated. Meaning the SSO can not provide a full reconciliation.
  • Some plugins require additional RBAC (besides what's already granted by the system:kube-scheduler and system:volume-scheduler cluster roles). On the other hand the SSO can not reconcile every extra clusterrole/clusterrolebinding that's provided. Allowing this creates a security hole for updating/deleting already existing cluster roles, which might (un)intentionally change the permissions of other components, either by removing permissions (e.g. removing the ability of the default scheduler to bind pods) or by extending permissions (e.g. allowing a user application to read secrets from another namespace). Thus, the ability to reconcile clusterroles/clusterrolebindings has to be limited to creating only a single clusterrole which is always explicitly bound to the openshift-secondary-scheduler-operator/secondary-scheduler SA and explicitly named (e.g. secondary-scheduler-extra-cluster-role). Also, all the verbs have to be limited to only get, list and watch; optionally also the crd/status resource with the update verb (to be debated). A validation sketch follows this list. Even with only the get/list/watch cluster role permissions it is possible to intentionally create a plugin which can dump every object in the cluster, so the documentation/instructions need to mention that the permission to read the secondary scheduler container's logs should be given only to SSO admins.
  • The cluster role permissions can be set in the same way as in the OLM case, by specifying clusterPermissions in the operator CR. For better automation, the clusterrole permissions could be specified through a configmap when clusterPermissions is not specified.
  • In case some of the scheduling plugins require a CRD or clusterrole/clusterrolebinding which is not supported by the SSO, a different mechanism for deploying those is required. The SSO has limited support because the secondary scheduler image is unknown in advance.
  • With the extra resources limited to only a single clusterrole (explicitly named) and the clusterrolebinding (managed by the SSO), it is safe to delete those after the operator gets uninstalled, without any risk to other components. Any other resource which is required by the secondary scheduler and is explicitly reconciled by a different mechanism needs to be deleted separately.
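
A sketch of how the SSO could enforce the ClusterRole restrictions described above; the allowed name is the example one mentioned in this comment, and the function name is hypothetical:

package validation

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

const allowedClusterRoleName = "secondary-scheduler-extra-cluster-role"

var allowedVerbs = sets.NewString("get", "list", "watch")

// ValidateExtraClusterRole accepts only a single, explicitly named ClusterRole
// whose rules are limited to read-only verbs, so the secondary scheduler can
// neither widen nor shrink the permissions of other components.
func ValidateExtraClusterRole(cr *rbacv1.ClusterRole) error {
	if cr.Name != allowedClusterRoleName {
		return fmt.Errorf("extra ClusterRole must be named %q, got %q", allowedClusterRoleName, cr.Name)
	}
	for _, rule := range cr.Rules {
		for _, verb := range rule.Verbs {
			if !allowedVerbs.Has(verb) {
				return fmt.Errorf("verb %q is not allowed in the extra ClusterRole (only get/list/watch)", verb)
			}
		}
	}
	return nil
}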

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jun 19, 2022
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 19, 2022
@ffromani

/lifecycle frozen

openshift-ci bot added the lifecycle/frozen label and removed the lifecycle/rotten label on Jul 19, 2022