
Conversation

@sairameshv
Member

@sairameshv sairameshv commented Mar 14, 2022

- What I did
Added support to update the kubelet configuration based on node.config.openshift.io spec changes.

- How to verify it
Create or update the nodes.config.openshift.io CR with a relevant/valid worker latency profile and expect the corresponding kubelet configuration changes to be rolled out to all the worker nodes by the MCO, leveraging the KubeletConfig CR.

- Description for the changelog
Based on the enhancement, users can create or modify the node.config.openshift.io resource, which results in the corresponding kubelet configuration changes on all the worker nodes.
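
To make the verification step concrete, here is a minimal sketch (not part of this PR) of switching the cluster-scoped node config to the medium worker latency profile via openshift/client-go; the kubeconfig path is a placeholder and the exact `WorkerLatencyProfile` constant name is an assumption.

```go
package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	// Fetch the singleton nodes.config.openshift.io object (named "cluster").
	nodeCfg, err := client.ConfigV1().Nodes().Get(context.TODO(), "cluster", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Switch to the medium worker latency profile (constant name assumed from
	// openshift/api); the MCO should then roll out a generated KubeletConfig/
	// MachineConfig with the matching nodeStatusUpdateFrequency to worker nodes.
	nodeCfg.Spec.WorkerLatencyProfile = configv1.MediumUpdateAverageReaction
	if _, err := client.ConfigV1().Nodes().Update(context.TODO(), nodeCfg, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("worker latency profile updated")
}
```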

@sairameshv sairameshv marked this pull request as draft March 14, 2022 16:08
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 14, 2022
@sairameshv
Member Author

sairameshv commented Mar 14, 2022

/hold
Based on the discussions that happened regarding #2959, these changes are made as part of an alternative approach.
cc: @harche @rphillips @yuqi-zhang @swghosh

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2022
@sairameshv sairameshv force-pushed the node_object branch 2 times, most recently from 997467f to f362e38 Compare March 17, 2022 17:28
@sairameshv sairameshv force-pushed the node_object branch 3 times, most recently from 5f77c6d to 0d10faf Compare March 29, 2022 12:37
@sairameshv sairameshv marked this pull request as ready for review March 30, 2022 08:55
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2022
@openshift-ci openshift-ci bot requested a review from sinnykumari March 30, 2022 08:57
@sairameshv
Member Author

/retest-required

@sairameshv sairameshv changed the title [OCPNODE-852] updating the kubelet configuration based on the nodes.config.openshift.io resource [OCPNODE-852] workerlatency profiles - updating the kubelet configuration based on the nodes.config.openshift.io resource Mar 30, 2022
@rphillips
Contributor

test/e2e-shared-tests/helpers.go:167:19: ambiguous selector cs.Nodes (typecheck)
		node, err := cs.Nodes().Get(context.TODO(), target.Name, metav1.GetOptions{})
		                ^

@sairameshv
Member Author

Did a pass and left some comments. The general direction is looking much better.

Also, as a general note, would it be possible to structure your commits into two: (1) the API/client-go bump, and (2) your code for the node object management?

As a side note, this may also be a good candidate for the bootstrap-unit test, which was designed to test bootstrapping kubelet/features. But that doesn't have to be part of this PR.

Addressed, re-arranged the commits.

@harche
Contributor

harche commented Apr 13, 2022

/test e2e-aws

Contributor

@yuqi-zhang yuqi-zhang left a comment

So I did some testing with this PR:

  1. create a nodes.config.openshift.io object with latency profile medium - result: created 97-generated-kubelet with "nodeStatusUpdateFrequency": "20s", ✔️
  2. create a kubeletconfig, generating 99-generated-kubelet with both the config and "nodeStatusUpdateFrequency": "20s", ✔️
  3. create another kubeletconfig, generating 99-generated-kubelet-1 with the new config and "nodeStatusUpdateFrequency": "20s", ✔️
  4. attempt to delete nodes.config.openshift.io/cluster - can't, which I think is expected ✔️
  5. attempt to change nodes.config.openshift.io/cluster back to latency profile default - 97-generated-kubelet does work and changes to "nodeStatusUpdateFrequency": "10s", BUT 99-generated-kubelet/99-generated-kubelet-1 still have "nodeStatusUpdateFrequency": "20s", meaning they will overwrite it (and thus the change will not take effect), i.e. if you have any kubeletconfigs on the node, you can't modify the profile anymore, which I think is incorrect ❌
  6. attempt to create a new nodes.config.openshift.io/test object. Creation is successful, but nothing happens. I assume only nodes.config.openshift.io/cluster is used? ❔

Re point 5: I think we have to somehow trigger the kubeletconfig object sync again with the latest kubeletconfig object such that it regenerates the config... which is problematic. Actually, now that I think about it, featuregates may also have this problem...? But that doesn't come up as often, since you create them and don't generally change them, whereas the latency profile is a multi-step transition from default->medium->low.

@yuqi-zhang
Contributor

One option for the above could be to have the node.config.openshift.io sync loop also check each kubeletconfig-generated MachineConfig (i.e. 99-generated-worker-kubelet, 99-generated-worker-kubelet-1, etc.) and update all of them.

@sairameshv
Member Author

So I did some testing with this PR:

  1. create a nodes.config.openshift.io object with latency profile medium - result: created 97-generated-kubelet with "nodeStatusUpdateFrequency": "20s", ✔️
  2. create a kubeletconfig, generating 99-generated-kubelet with both the config and "nodeStatusUpdateFrequency": "20s", ✔️
  3. create another kubeletconfig, generating 99-generated-kubelet-1 with the new config and "nodeStatusUpdateFrequency": "20s", ✔️
  4. attempt to delete nodes.config.openshift.io/cluster - can't, which I think is expected ✔️
  5. attempt to change nodes.config.openshift.io/cluster back to latency profile default - 97-generated-kubelet does work and changes to "nodeStatusUpdateFrequency": "10s", BUT 99-generated-kubelet/99-generated-kubelet-1 still have "nodeStatusUpdateFrequency": "20s", meaning they will overwrite it (and thus the change will not take effect), i.e. if you have any kubeletconfigs on the node, you can't modify the profile anymore, which I think is incorrect ❌
  6. attempt to create a new nodes.config.openshift.io/test object. Creation is successful, but nothing happens. I assume only nodes.config.openshift.io/cluster is used? ❔

Re point 5: I think we have to somehow trigger the kubeletconfig object sync again with the latest kubeletconfig object such that it regenerates the config... which is problematic. Actually, now that I think about it, featuregates may also have this problem...? But that doesn't come up as often, since you create them and don't generally change them, whereas the latency profile is a multi-step transition from default->medium->low.

Hello @yuqi-zhang , Thanks for bringing out this scenario.
Point 5:
The issue has been resolved by listing all the existing kubeletconfigs and calling the syncKubeletConfig inside the syncNodeConfig function.
Yeah, this is the issue with the existing featuregate sync as well. (I can add the fix as part of a new PR if required).

Point 6:
Reinforced the code to reject the addition/update of a new nodes.config.openshift.io object with a name other than "cluster". Tested and verified this functionality.
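
For illustration, a minimal self-contained sketch of the flow described above; all names here (nodeConfig, kubeletConfig, syncKubeletConfig) are hypothetical stand-ins for the controller's real types and functions, not the MCO implementation.

```go
package main

import "fmt"

type nodeConfig struct{ Name, WorkerLatencyProfile string }
type kubeletConfig struct{ Name string }

// syncKubeletConfig stands in for the controller's per-KubeletConfig sync.
func syncKubeletConfig(kc kubeletConfig, profile string) {
	fmt.Printf("re-rendering MachineConfig for %s with profile %s\n", kc.Name, profile)
}

// syncNodeConfig sketches the fix: reject objects not named "cluster", then
// re-sync every existing KubeletConfig so the generated MachineConfigs pick up
// the latest latency profile.
func syncNodeConfig(nc nodeConfig, existing []kubeletConfig) error {
	if nc.Name != "cluster" {
		return fmt.Errorf("only the node config object named %q is supported, got %q", "cluster", nc.Name)
	}
	for _, kc := range existing {
		syncKubeletConfig(kc, nc.WorkerLatencyProfile)
	}
	return nil
}

func main() {
	existing := []kubeletConfig{{Name: "99-generated-kubelet"}, {Name: "99-generated-kubelet-1"}}
	if err := syncNodeConfig(nodeConfig{Name: "cluster", WorkerLatencyProfile: "Default"}, existing); err != nil {
		fmt.Println(err)
	}
}
```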

@sairameshv
Member Author

/retest-required

1 similar comment
@sairameshv
Member Author

/retest-required

Contributor

@yuqi-zhang yuqi-zhang left a comment

Holding until #2868 merges. For consistency purposes, I left a few nits below, as well as some other comments.

Contributor

Ok yeah, this should work. Fortunately we keep that 1-to-1 mapping, so we won't accidentally generate new MCs.

Member Author

Yes

Contributor

Not a problem right now, but if we ever move to multiple pools, I think this should instead be done outside of the for loop and gated on whether any pools changed. Just a minor note.

Member Author

Sure, addressed.

Contributor

I assume statements like this were used for debug? If not, could you either remove or format this? (The error gets returned anyways)

Member Author

Addressed

Contributor

For error statements, could you use %w instead of %v to be consistent with #2868?
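
For reference, a tiny self-contained example of the pattern being requested (the fetchNodeConfig helper is hypothetical, used only for illustration):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("node.config.openshift.io/cluster not found")

// fetchNodeConfig is a hypothetical helper used only to demonstrate wrapping.
func fetchNodeConfig() error { return errNotFound }

func main() {
	if err := fetchNodeConfig(); err != nil {
		// %w wraps the error so callers can still match it with errors.Is/As;
		// %v would flatten it into a plain string.
		wrapped := fmt.Errorf("could not fetch the node config: %w", err)
		fmt.Println(errors.Is(wrapped, errNotFound)) // true
	}
}
```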

Member Author

Done

Contributor

Will not list these separately, but again, if you want to log the error, please do so via error logs at regular verbosity, wrap the error with %w, or use V(4) for debug statements.
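
A short sketch of the logging conventions being asked for, assuming k8s.io/klog/v2; the messages themselves are illustrative, not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"

	"k8s.io/klog/v2"
)

func main() {
	err := errors.New("example failure")

	// Real errors: log at regular verbosity, or wrap with %w and return.
	klog.Errorf("failed to sync node config: %v", err)
	_ = fmt.Errorf("failed to sync node config: %w", err)

	// Debug-only detail: gate behind V(4) so it stays quiet by default.
	klog.V(4).Infof("observed workerLatencyProfile update, requeueing")
}
```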

Member Author

Addressed

Contributor

So, just to make sure, all other node.config.openshift.io objects that are not cluster should be ignored? Will this be in the documentation?

Contributor

Oh, I see, in the create/update syncs you just degrade if the node object is not named cluster.

Member Author

Yeah

Contributor

Hmm ok so this is the main check for whether the update is allowable or not.

So, I thought through the possibilities, and I think this will work so long as we only have 3 latency profiles.

i.e. this check is only for updates. So if I had 4 profiles that can only go from a->b->c->d and I want to skip b/c for example, I think I can just do:
a->c (degrade, nothing happens)
c->d -> this actually works, because even though the actual config is on a, we are looking only at our object (on c) and desired (on d). So we essentially would have done a->d right? And this would let it through

In a more perfect scenario I think the controller creating the MC (sync function) should do the final check between current and desired. This should just enqueue and have the sync handle this.

I'm not against merging as is, but just for future, if we have more things to handle, the sync loop should be the place, and not individual informer checks.
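
To illustrate the transition check being discussed, here is a small hypothetical sketch that only allows moves between adjacent profiles; the profile names, their ordering, and the function itself are assumptions for illustration, not the PR's actual validation code.

```go
package main

import "fmt"

// Hypothetical ordering of latency profiles; only moves between adjacent
// entries are allowed by this check, mirroring the a->b->c->d discussion.
var profileOrder = map[string]int{
	"Default":                     0,
	"MediumUpdateAverageReaction": 1,
	"LowUpdateSlowReaction":       2,
}

func transitionAllowed(current, desired string) bool {
	cur, okCur := profileOrder[current]
	des, okDes := profileOrder[desired]
	if !okCur || !okDes {
		return false
	}
	diff := cur - des
	if diff < 0 {
		diff = -diff
	}
	return diff <= 1 // reject skips such as Default -> LowUpdateSlowReaction
}

func main() {
	fmt.Println(transitionAllowed("Default", "MediumUpdateAverageReaction")) // true
	fmt.Println(transitionAllowed("Default", "LowUpdateSlowReaction"))       // false (would degrade)
}
```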

Member Author

The idea here is to automate this transition in the future (as mentioned in the TODO comment) without rejecting the user request. That automatic transition needs thorough design and testing, hence the check is implemented this way for now.

@harche
Contributor

harche commented Apr 18, 2022

/retest

@openshift-ci
Contributor

openshift-ci bot commented Apr 18, 2022

@sairameshv: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.12-upgrade-from-stable-4.11-images | 90faaa9 | link | true | /test 4.12-upgrade-from-stable-4.11-images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@harche
Contributor

harche commented Apr 18, 2022

/test e2e-gcp-op

@rphillips
Contributor

Thank you for all the reviews and hard work on this.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 18, 2022
Contributor

@yuqi-zhang yuqi-zhang left a comment

There are a few more locations where there seem to be extra, unneeded debug statements, as well as %v instead of %w in the error formatting, but let's fix those up as a follow-up PR; no need to block the actual functionality on that.

Thanks for all the fixups. I think we are good to go.

@openshift-ci
Contributor

openshift-ci bot commented Apr 19, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rphillips, sairameshv, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
