OCPBUGS-58313: Admit sysctls based on the worker node kernel version instead of the current node kernel version#151
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jubittajohn The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
1e59723 to
64fa496
Compare
| // The kernel version here refers to that of the worker nodes, since sysctl | ||
| // settings apply to pods running on worker nodes rather than control plane nodes |
There was a problem hiding this comment.
What guarantees are there that a pod admitted by SCC will only be scheduled to a worker node?
There was a problem hiding this comment.
I have changed it to compute the minimum kernel version across all the nodes in the cluster. Is that what was intended?
There was a problem hiding this comment.
I want to confirm what was intended in the original comment:
Was the goal to compute the minimum kernel version across all nodes in the cluster rather than just the worker nodes, or was the question about how SCC-admitted pods are guaranteed to run only on worker nodes?
My understanding is that SCC does not control scheduling, and that placement is determined by schedulers using taints/tolerations. Could you clarify which interpretation is correct?
64fa496 to
0805e19
Compare
|
@jubittajohn: This pull request references Jira Issue OCPBUGS-58313, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@benluddy |
0805e19 to
b605621
Compare
| func (c *constraint) SetExternalKubeInformerFactory(informers informers.SharedInformerFactory) { | ||
| c.namespaceLister = informers.Core().V1().Namespaces().Lister() | ||
| c.nodeLister = informers.Core().V1().Nodes().Lister() | ||
| c.listersSynced = append(c.listersSynced, informers.Core().V1().Namespaces().Informer().HasSynced) |
There was a problem hiding this comment.
| c.listersSynced = append(c.listersSynced, informers.Core().V1().Namespaces().Informer().HasSynced) | |
| c.listersSynced = append( | |
| c.listersSynced, | |
| informers.Core().V1().Namespaces().Informer().HasSynced, | |
| informers.Core().V1().Nodes.().Informer().HasSynced, | |
| ) |
There was a problem hiding this comment.
You must add the node-lister to the list of listers that we need to wait for, before computing admission.
| } | ||
|
|
||
| providers, errs := sccmatching.CreateProvidersFromConstraints(ctx, a.GetNamespace(), constraints, c.namespaceLister) | ||
| providers, errs := sccmatching.CreateProvidersFromConstraints(ctx, a.GetNamespace(), constraints, c.namespaceLister, c.nodeLister) |
There was a problem hiding this comment.
From a code / maintenance perspective, I must say that handing the nodeLister 6 levels down doesn't look good. We are coupling all those functions with the nodeLister. E.g.:
func NewSimpleProvider(
scc *securityv1.SecurityContextConstraints,
nodeLister corev1listers.NodeLister,
) (SecurityContextConstraintsProvider, error)It reads well if you transform a scc into a provider, but now you have a scc and a nodeLister?!
Couldn't we check the legit sysctls and extend the constraints.AllowedUnsafeSysctls or adjust the SimpleProvider to hold those specificly, so that we don't pass c.nodeLister down? It would read better like so:
func NewSimpleProvider(
scc *securityv1.SecurityContextConstraints,
availableSysCtls []string,
) (SecurityContextConstraintsProvider, error)or
// Though this taints the clear meaning of what is defined by the user and what is possible by the system.
NewSimpleProvider(
updateSysctlsBasedOnSafeWhitelist(scc),
)Everything else might be more effort, like some Factory or so.
WDYT?
There was a problem hiding this comment.
Thank you for suggestion.
I've refactored NewSimpleProvider to take availableSysCtls []string instead of nodeLister.
929529d to
fedc70e
Compare
…s instead of the current node kernel version Signed-off-by: jubittajohn <jujohn@redhat.com>
fedc70e to
a39660d
Compare
|
@jubittajohn: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
|
||
| // Create the provider | ||
| provider, err := NewSimpleProvider(constraint) | ||
| provider, err := NewSimpleProvider(constraint, sysctl.SafeSysctlAllowlist(nodeLister)) |
There was a problem hiding this comment.
This calculation doesn't change per request, right? So we could move it out of the for-loop, right?
Due to the fact that CreateProviderFromConstraint is in a for loop that iterates on sccs, we have:
O(SCCs * Nodes)
But we could have
O(SCCs) + O(Nodes)
If we move the check out of the for-loop.
It would be good to figure out how many nodes big clusters have as we could improve the performance drastically on clusters with plenty of Nodes.
If we cache the result and add an event listener that is being updated on Add, Update and Delete events of the nodeInformer.
This would create a O(SCCs) + O(1) situation per request.
With 10k Nodes and 10 SCCs those numbers change drastically:
- Current: 100k
- Outside the for-loop: 10,010
- Cache: 10 + per Node change with "per Node change" being a lot smaller than "per Pod admission"
| } | ||
|
|
||
| if minVersion == nil { | ||
| return nil, fmt.Errorf("no worker nodes found") |
There was a problem hiding this comment.
This could happen, if parsing the kernal for all nodes fail, right?
We check also all nodes, not just "worker nodes".
So we could add a check for the len(nodes) != 0 in the error handling to distinguish between "no nodes found" and "couldn't parse kernel version(s)"
The sysctls should be admitted based on the worker node kernel version instead of the current node kernel version(kernel version of the machine running api server), to avoid SCC admitting a pod with a sysctl param that is unsafe on the worker's kernel.