OCPBUGS-58313: Admit sysctls based on the worker node kernel version instead of the current node kernel version #151

Open

jubittajohn wants to merge 1 commit into openshift:master from jubittajohn:fix-sysctl-kernel-mismatch

Conversation

@jubittajohn
Contributor

The sysctls should be admitted based on the worker node kernel version instead of the current node kernel version (the kernel version of the machine running the API server), to avoid the SCC admission plugin admitting a pod with a sysctl parameter that is unsafe on the worker's kernel.
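
For illustration, the intended gate looks roughly like this minimal sketch; the isSafeForKernel helper and the gated-sysctl table are hypothetical, and the net.ipv4.ip_local_reserved_ports entry only mirrors the kubelet's safe-sysctl list:

package sysctl

import (
    utilversion "k8s.io/apimachinery/pkg/util/version"
)

// kernelGatedSysctls maps a sysctl to the minimum kernel version that makes it
// safe to allow; the single entry is illustrative only.
var kernelGatedSysctls = map[string]*utilversion.Version{
    "net.ipv4.ip_local_reserved_ports": utilversion.MustParseGeneric("3.16"),
}

// isSafeForKernel reports whether a sysctl is safe on the given kernel version.
// clusterKernel should be derived from the nodes' status.nodeInfo.kernelVersion,
// not from the kernel of the machine running the API server.
func isSafeForKernel(name string, clusterKernel *utilversion.Version) bool {
    minVer, gated := kernelGatedSysctls[name]
    if !gated {
        return true // not kernel-gated, always considered safe
    }
    return clusterKernel.AtLeast(minVer)
}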

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2025
@openshift-ci
Contributor

openshift-ci bot commented Sep 29, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Contributor

openshift-ci bot commented Sep 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jubittajohn
Once this PR has been reviewed and has the lgtm label, please assign ibihim for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 88 to 89
// The kernel version here refers to that of the worker nodes, since sysctl
// settings apply to pods running on worker nodes rather than control plane nodes

What guarantees are there that a pod admitted by SCC will only be scheduled to a worker node?

Contributor Author

I have changed it to compute the minimum kernel version across all the nodes in the cluster. Is that what was intended?

Contributor Author

I want to confirm what was intended in the original comment:
Was the goal to compute the minimum kernel version across all nodes in the cluster rather than just the worker nodes, or was the question about how SCC-admitted pods are guaranteed to run only on worker nodes?

My understanding is that SCC does not control scheduling, and that placement is determined by schedulers using taints/tolerations. Could you clarify which interpretation is correct?
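
For reference, a rough sketch of what "minimum kernel version across all nodes" could look like via the node lister; the function name and error message are hypothetical, not the exact code in this PR:

import (
    "fmt"

    "k8s.io/apimachinery/pkg/labels"
    utilversion "k8s.io/apimachinery/pkg/util/version"
    corev1listers "k8s.io/client-go/listers/core/v1"
)

// minimumKernelVersion returns the lowest kernel version reported by any node,
// so that sysctl admission is as conservative as the oldest kernel in the cluster.
func minimumKernelVersion(nodeLister corev1listers.NodeLister) (*utilversion.Version, error) {
    nodes, err := nodeLister.List(labels.Everything())
    if err != nil {
        return nil, err
    }
    var minVersion *utilversion.Version
    for _, node := range nodes {
        v, err := utilversion.ParseGeneric(node.Status.NodeInfo.KernelVersion)
        if err != nil {
            continue // skip nodes whose kernel version cannot be parsed
        }
        if minVersion == nil || v.LessThan(minVersion) {
            minVersion = v
        }
    }
    if minVersion == nil {
        return nil, fmt.Errorf("no node kernel versions could be determined")
    }
    return minVersion, nil
}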

@jubittajohn jubittajohn force-pushed the fix-sysctl-kernel-mismatch branch from 64fa496 to 0805e19 on November 5, 2025 at 18:43
@jubittajohn jubittajohn changed the title Admit sysctls based on the worker node kernel version instead of the current node kernel version OCPBUGS-58313: Admit sysctls based on the worker node kernel version instead of the current node kernel version Nov 5, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 5, 2025
@openshift-ci-robot

@jubittajohn: This pull request references Jira Issue OCPBUGS-58313, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Verified instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

The sysctls should be admitted based on the worker node kernel version instead of the current node kernel version (the kernel version of the machine running the API server), to avoid the SCC admission plugin admitting a pod with a sysctl parameter that is unsafe on the worker's kernel.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jubittajohn
Contributor Author

jubittajohn commented Nov 5, 2025

@benluddy
I have a question regarding the kernel check. The check was added only for the newly added sysctls in apiserver-library-go (to avoid regressions; related comment: #148 (comment)), but the kubelet performs kernel checks for the older sysctls as well (https://github.com/openshift/kubernetes/blob/master/pkg/kubelet/sysctl/safe_sysctls.go#L35-L72). Could this be a problem, since we are allowing the older sysctls here irrespective of the supported kernel version?

@jubittajohn jubittajohn requested a review from benluddy November 7, 2025 18:28
@jubittajohn jubittajohn marked this pull request as ready for review November 18, 2025 17:08
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2025
@openshift-ci openshift-ci bot requested review from ibihim and liouk November 18, 2025 17:16
@jubittajohn jubittajohn force-pushed the fix-sysctl-kernel-mismatch branch from 0805e19 to b605621 on November 20, 2025 at 21:01
func (c *constraint) SetExternalKubeInformerFactory(informers informers.SharedInformerFactory) {
c.namespaceLister = informers.Core().V1().Namespaces().Lister()
c.nodeLister = informers.Core().V1().Nodes().Lister()
c.listersSynced = append(c.listersSynced, informers.Core().V1().Namespaces().Informer().HasSynced)
Contributor

Suggested change
c.listersSynced = append(c.listersSynced, informers.Core().V1().Namespaces().Informer().HasSynced)
c.listersSynced = append(
    c.listersSynced,
    informers.Core().V1().Namespaces().Informer().HasSynced,
    informers.Core().V1().Nodes().Informer().HasSynced,
)

Contributor

You must add the node lister to the list of listers we need to wait for before computing admission.
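
For context, a hedged sketch of the waiting step, assuming listersSynced is a []cache.InformerSynced and using a hypothetical helper name; without the node informer's HasSynced in that slice, admission could run against an empty node cache:

import (
    "fmt"

    "k8s.io/client-go/tools/cache"
)

// waitForListersToSync blocks until every registered lister (namespaces and,
// with this change, nodes) has synced, so the kernel-version check does not
// run against an empty node cache.
func (c *constraint) waitForListersToSync(stopCh <-chan struct{}) error {
    if !cache.WaitForCacheSync(stopCh, c.listersSynced...) {
        return fmt.Errorf("informer caches failed to sync")
    }
    return nil
}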

}

providers, errs := sccmatching.CreateProvidersFromConstraints(ctx, a.GetNamespace(), constraints, c.namespaceLister)
providers, errs := sccmatching.CreateProvidersFromConstraints(ctx, a.GetNamespace(), constraints, c.namespaceLister, c.nodeLister)
Contributor

@ibihim ibihim Nov 26, 2025

From a code / maintenance perspective, I must say that handing the nodeLister 6 levels down doesn't look good. We are coupling all those functions with the nodeLister. E.g.:

func NewSimpleProvider(
  scc *securityv1.SecurityContextConstraints,
  nodeLister corev1listers.NodeLister,
) (SecurityContextConstraintsProvider, error)

It reads well if you transform a scc into a provider, but now you have a scc and a nodeLister?!

Couldn't we check the legit sysctls and extend the constraints.AllowedUnsafeSysctls, or adjust the SimpleProvider to hold those specifically, so that we don't pass c.nodeLister down? It would read better like so:

func NewSimpleProvider(
  scc *securityv1.SecurityContextConstraints,
  availableSysCtls []string,
) (SecurityContextConstraintsProvider, error)

or

// Though this taints the clear meaning of what is defined by the user and what is possible by the system.
NewSimpleProvider(
  updateSysctlsBasedOnSafeWhitelist(scc),
)

Everything else might be more effort, like some Factory or so.

WDYT?

Contributor Author

Thank you for the suggestion.

I've refactored NewSimpleProvider to take availableSysCtls []string instead of nodeLister.

@jubittajohn jubittajohn force-pushed the fix-sysctl-kernel-mismatch branch 2 times, most recently from 929529d to fedc70e, on December 1, 2025 at 20:44
…s instead of the current node kernel version

Signed-off-by: jubittajohn <jujohn@redhat.com>
@jubittajohn jubittajohn force-pushed the fix-sysctl-kernel-mismatch branch from fedc70e to a39660d on December 1, 2025 at 21:07
@openshift-ci
Contributor

openshift-ci bot commented Dec 1, 2025

@jubittajohn: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jubittajohn jubittajohn requested a review from ibihim December 2, 2025 06:44

// Create the provider
provider, err := NewSimpleProvider(constraint)
provider, err := NewSimpleProvider(constraint, sysctl.SafeSysctlAllowlist(nodeLister))
Contributor

@ibihim ibihim Jan 6, 2026

This calculation doesn't change per request, right? So we could move it out of the for-loop, right?

Because CreateProviderFromConstraint is called in a for-loop that iterates over the SCCs, we have:

O(SCCs * Nodes)

But we could have

O(SCCs) + O(Nodes)

If we move the check out of the for-loop.

It would be good to figure out how many nodes big clusters have, as we could improve performance drastically on clusters with plenty of nodes.
If we cache the result and update it via an event handler registered for the nodeInformer's Add, Update, and Delete events, this would create an O(SCCs) + O(1) situation per request.

With 10k Nodes and 10 SCCs those numbers change drastically:

  1. Current: 100k
  2. Outside the for-loop: 10,010
  3. Cache: 10 + one recomputation per node change, with "per node change" happening far less often than "per pod admission"
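
A rough sketch of the caching idea, with hypothetical type and function names rather than a concrete proposal for this PR: compute the allowlist once and invalidate it from node informer events, so each admission request stays O(SCCs).

import (
    "sync"

    "k8s.io/client-go/tools/cache"
)

// cachedAllowlist holds the computed safe-sysctl allowlist and a staleness flag.
type cachedAllowlist struct {
    mu        sync.Mutex
    allowlist []string
    stale     bool
}

func (c *cachedAllowlist) markStale() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.stale = true
}

// get returns the cached allowlist, recomputing it only when a node event has
// marked it stale.
func (c *cachedAllowlist) get(compute func() []string) []string {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.allowlist == nil || c.stale {
        c.allowlist = compute()
        c.stale = false
    }
    return c.allowlist
}

// registerNodeEventHandlers invalidates the cache on any node Add/Update/Delete.
func registerNodeEventHandlers(nodeInformer cache.SharedIndexInformer, cached *cachedAllowlist) {
    nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { cached.markStale() },
        UpdateFunc: func(oldObj, newObj interface{}) { cached.markStale() },
        DeleteFunc: func(obj interface{}) { cached.markStale() },
    })
}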

}

if minVersion == nil {
return nil, fmt.Errorf("no worker nodes found")
Contributor

This could happen if parsing the kernel version for all nodes fails, right?
We also check all nodes, not just "worker nodes".

So we could add a check for len(nodes) != 0 in the error handling, to distinguish between "no nodes found" and "couldn't parse kernel version(s)".
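
For illustration, building on the hypothetical minimumKernelVersion sketch earlier in this conversation, the error handling could distinguish the two cases along these lines (a fragment, not a concrete suggestion):

if minVersion == nil {
    if len(nodes) == 0 {
        return nil, fmt.Errorf("no nodes found")
    }
    return nil, fmt.Errorf("could not parse the kernel version of any node")
}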
