Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4800: Split UnCoreCache awareness #4810

Merged
merged 2 commits into from
Oct 10, 2024

Conversation

ajcaldelas
Copy link
Contributor

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 27, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @ajcaldelas!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @ajcaldelas. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 27, 2024
@kannon92
Copy link
Contributor

/cc

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 27, 2024

###### How can this feature be enabled / disabled in a live cluster?

Cannot be dynamically enabled/disabled because of the CPU Manager state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? So the only way to enable this feature is to create a node with these settings?

Could I not set this policy and restart kubelet?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree @kannon92 phrasing seems more correct and easier to follow. We can totally enable the feature but this require a kubelet restart, and possibly the deletion of the cpu manager state file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like this clear it up?

This feature depends on the CPU Manager State file found in /var/lib/kubelet/cpu_manager_state because of this feature the deletion of this file and restart of the Kubelet with the feature enabled again.

This is basically the reasoning maybe @wongchar could help me out here with the phrasing and more reasoning as to why this happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Need to drain the node, enable/disable the feature, remove the cpu_manager_state file, reload daemons, and restart kubelet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd write
"disabling the feature requires a kubelet config change and a kubelet restart" or so.

The state file deletion is a a side effect inherited by how cpumanager works. Nothing specific in this feature seems to imply or require a deletion is warranted.

Actually, deletions of state files should be avoided entirely and they are a smell, but that's another story for another day.

Copy link
Contributor Author

@ajcaldelas ajcaldelas Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this sound?
From our testing this is a requirement going from none to static cannot be done dynamically because of the cpu_manager_state file. The node needs to be drained and the policy checkpoint file need to be removed before restart Kubelet. This requirement stems from a CPUManager and has nothing to do with this feature specifically.

@kad
Copy link
Member

kad commented Oct 8, 2024

@kad @klueska howdy! I think this is ready for your review, PTAL

all my comments seems to be incorporated, so looks ok for me.

@kannon92
Copy link
Contributor

kannon92 commented Oct 8, 2024

/assign @klueska

According to the KEP board you are marked as the sig-node approver for this.

@kannon92
Copy link
Contributor

kannon92 commented Oct 8, 2024

@haircommander can you review this and see if your "request changes" is still needed?

Copy link
Contributor

@haircommander haircommander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2024
@ffromani
Copy link
Contributor

ffromani commented Oct 8, 2024

LGTM

@ffromani
Copy link
Contributor

ffromani commented Oct 9, 2024

/lgtm

(formally this time)


- Modify the "Allocate" static policy to check for the option, prefer-align-cpus-by-uncorecache. For platforms where SMT is enabled, prefer-align-cpus-by-uncorecache will continue to follow default behaviour and try to allocate full cores when possible. prefer-align-cpus-by-uncorecache can be enabled along with full-pcpus-only and enforce full core assignment with uncorecache alignment.

- prefer-align-cpus-by-uncorecache will be compatible with the default CPU allocation logic with future plans to be compatible with the options distribute-cpus-across-numa and distribute-cpus-across-cores.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that for now you cannot combine these options together? What happens if you do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, for now when the mentioned flags are enabled will cause a failure and node will not be schedulable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reached agreement that expanding/improving compatibility between options is a Beta graduation blocker.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2024
@ffromani
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2024
@wongchar
Copy link
Contributor

@klueska please let us know if there's anything else, thanks!

@klueska
Copy link
Contributor

klueska commented Oct 10, 2024

/approve
/cc @mrunalp for final approval

@k8s-ci-robot
Copy link
Contributor

@klueska: GitHub didn't allow me to request PR reviews from the following users: for, final, approval.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/approve
/cc @mrunalp for final approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@soltysh
Copy link
Contributor

soltysh commented Oct 10, 2024

/approve
for some reason the bot didn't pick the previous one 🤷‍♂️

@mrunalp
Copy link
Contributor

mrunalp commented Oct 10, 2024

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajcaldelas, klueska, mrunalp, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit c3b0e15 into kubernetes:master Oct 10, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.