removing unnecessary condition #29

julian3xl · 2022-10-11T13:39:10Z

This condition is supposed to be necessary if you install the node feature discovery chart bundled with this chart, but in the real world a production cluster already has that set up and running. In that case having the proper node selector is mandatory, and this condition in the chart prevents it from being set.

Signed-off-by: julian3xl <[email protected]>

julian3xl · 2022-10-11T13:42:33Z

@St0rmingBr4in or @y2kenny-amd, I'm asking you to review this pull request because you are the latest ones with activity on this repository that seems to be really abandoned :(

y2kenny · 2022-10-11T13:48:35Z

My understanding is the NFD is an add-on. Do you have any documentation to indicate otherwise?

julian3xl · 2022-10-11T13:51:45Z

Yeah, the problem is that if I don't want to install the included NFD because I already have one running on my cluster (that's going to be 99.9% of the use cases), I have to set nfd.enabled to false. Because of this condition in the chart, I'm then no longer able to set node selectors, and they are really important: without them I can't prevent the AMD GPU device plugin from being deployed onto every machine in the cluster.
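To make the complaint concrete, here is a minimal sketch of the kind of guard being described. It is reconstructed from this thread rather than copied from the chart, so the surrounding DaemonSet fields and the exact indentation are assumptions.

```yaml
# templates/deviceplugin-daemonset.yaml (illustrative excerpt, not the actual file)
spec:
  template:
    spec:
      {{- if .Values.nfd.enabled }}
      # Rendered only when the bundled NFD subchart is enabled, so with
      # nfd.enabled=false any node_selector values are ignored.
      nodeSelector:
        {{- toYaml .Values.node_selector | nindent 8 }}
      {{- end }}
```

With a guard like this, a cluster that runs its own NFD (and therefore sets nfd.enabled to false) ends up with no nodeSelector on the DaemonSet at all.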

julian3xl · 2022-10-11T13:53:03Z

What I'm trying to say is that node selectors are a Kubernetes feature, and they should not depend on whether I'm installing an add-on or not.

y2kenny · 2022-10-11T13:55:51Z

I don't understand. If you use NFD on your cluster, you set nfd.enabled to on here to use the node selector. This device plugin does not install NFD for you.

julian3xl · 2022-10-11T13:59:29Z

Yeah man, it does: https://github.com/RadeonOpenCompute/k8s-device-plugin/blob/abd271eac61268e8735227b9efb5af5578f3e872/helm/amd-gpu/Chart.yaml#L21

y2kenny · 2022-10-11T14:03:36Z

Ah ok, maybe that's the real issue. My intention was for that dependency to be optional (or maybe I can make that dependency 'softer' somehow).

julian3xl · 2022-10-11T14:05:50Z

Well, that's fine. I've seen that in other charts, and it's fine to have it installed for you in case you don't want to do it yourself, but in my opinion standard Kubernetes fields should always be settable.

julian3xl · 2022-10-11T14:08:41Z

The documentation says that the node selector is set to feature.node.kubernetes.io/pci-0300_1002.present: "true" by default, and that's totally fine and acceptable. If someone wants to keep it empty, they just need to set Values.node_selector to "", and then this chart would fit the standard.

y2kenny · 2022-10-11T15:16:44Z

Um.... I will need to think about this. Not necessarily disagreeing here just need to context switch my brain back into this because I don't work with Helm every day.

boniek83 · 2023-03-09T09:58:01Z

Will the node selector issue be resolved anytime soon? Kustomizing helm charts for basic features is a pain.

julian3xl · 2023-03-09T10:12:54Z

To be honest, I don't see the maintainers of this repo being really interested in fixing this... 🤷

y2kenny · 2023-03-09T16:17:26Z

Is this still an issue now that the helm dependency uses >= instead of = for nfd when nfd.enabled is true?

julian3xl · 2023-03-09T22:15:11Z

nodeSelector is a standard helm chart property, it exists on every chart, so this discussion doesn't make sense. Regardless of whether you're installing NFD or not, you should be able to set node selectors, man! That's the chart standard. Go to artifacthub.io, check some charts, and you will see that almost all of them have the option to set nodeSelectors. I see your point, you want to ensure:

node_selector:
  feature.node.kubernetes.io/pci-0300_1002.present: "true"

ok, put some helper to merge that selector to .Values.node_selector so users could add their custom nodeSelectors too without having to install nfd from your chart.
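One way such a helper could look is sketched below. This is not what the chart ships: the helper name amd-gpu.nodeSelector and the file layout are hypothetical, and the sketch relies on the Sprig `merge` and `dict` functions that Helm exposes to templates.

```yaml
# templates/_helpers.tpl (hypothetical helper, for illustration only)
{{- define "amd-gpu.nodeSelector" -}}
{{- /* Default label that NFD applies to nodes with an AMD GPU. */ -}}
{{- $default := dict "feature.node.kubernetes.io/pci-0300_1002.present" "true" -}}
{{- /* merge gives precedence to earlier arguments, so user values win over the default. */ -}}
{{- $merged := merge (dict) (.Values.node_selector | default (dict)) $default -}}
{{- toYaml $merged -}}
{{- end -}}

# templates/deviceplugin-daemonset.yaml (illustrative excerpt)
      nodeSelector:
        {{- include "amd-gpu.nodeSelector" . | nindent 8 }}
```

The trade-off of this design is that the default label is always applied, so users who run without NFD labels would need some other way to opt out.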

y2kenny · 2023-03-09T22:59:02Z

> ok, put some helper to merge that selector to .Values.node_selector so users could add their custom nodeSelectors too without having to install nfd from your chart.

Can you update this pull request to reflect what you mean by this?

rptaylor · 2024-04-04T02:05:11Z

I agree, this is an issue. The situation is that this chart installs NFD, if nfd.enabled=true :
https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/Chart.yaml#L25

However the default value is nfd.enabled=false, and in scenarios discussed here, which would be the case for me too, it must remain false, because I need to install and configure NFD separately myself, not using this chart.

That being the case, with nfd disabled, it unnecessarily blocks the use of nodeselectors: https://github.com/ROCm/k8s-device-plugin/blob/master/helm/amd-gpu/templates/deviceplugin-daemonset.yaml#L19

However nodeselectors are a normal built-in k8s feature that should generally always be configurable by chart users if they want. Nodeselectors do not depend on NFD per se, though may rely on labels set by NFD.

Unfortunately this chart overloads the nfd.enabled property for two different purposes: whether to install NFD, and whether to apply nodeselector labels. That being the case, I think it is not possible to address this without some form of breaking change (it would just require users to read the release notes and update their values, not a big deal).

IIUC, the change proposed here would result in the default behaviour changing from:
NFD not installed and no nodeselector (so device plugin runs on all nodes)
to:
NFD not installed and default nodeselector labels being applied (so potentially the plugin would not run on any nodes if NFD is not applying labels). Users should simply adjust .Values.node_selector to an empty list to preserve the previous behaviour.

An alternative approach would be to add a node_selectors_enabled variable (default false). People who were relying on the previous behaviour of setting nfd.enabled=true in their values would need to also set node_selectors_enabled: true in order to maintain the same behaviour as before. If they upgrade without reading the release notes the failure scenario would be that the device plugin would run on all nodes instead of only the selected nodes.
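As a rough sketch of this alternative, the values file might look like the following. The flag name node_selectors_enabled is taken from the paragraph above and is a proposal, not an existing chart option at the time of this comment; the template would then guard nodeSelector on this flag instead of on nfd.enabled.

```yaml
# values.yaml (sketch of the proposed split, names as used in this comment)
nfd:
  enabled: false                # only decides whether the bundled NFD subchart is installed
node_selectors_enabled: true    # proposed flag: decides whether node_selector is applied
node_selector:
  feature.node.kubernetes.io/pci-0300_1002.present: "true"
```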

It seems to me the safer and more conservative approach is to avoid unintentionally running on undesired nodes?

All things considered, if I have understood the situation correctly, IMHO the simplest and safest way to fix this would be to merge this PR and add a release note to the chart docs saying "if you want to keep the default behaviour of running the plugin on all nodes set .Values.node_selector to an empty list".
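A possible form of that override is sketched below, assuming the PR is merged so that node_selector is always rendered. Since a node selector is a map rather than a list, and since Helm deep-merges user-supplied maps onto the chart defaults, an empty map may not clear the default label; setting the key to null is the documented way to delete a default key. Whether the chart template renders cleanly with no keys left is an assumption that should be checked with `helm template`.

```yaml
# my-values.yaml (sketch): keep the previous behaviour of running on all nodes
nfd:
  enabled: false
node_selector:
  # null removes the chart's default key during value merging
  feature.node.kubernetes.io/pci-0300_1002.present: null
```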

@y2kenny what do you think?

y2kenny · 2024-04-11T04:20:59Z

@rptaylor, thanks for the detailed explanation. I think others have written similar things, but it didn't 'click' in my head until I read what you wrote, so that is much appreciated.

As to the solution to this problem, I actually don't think running the plugin on every single node is an issue, because:

1. the plugin checks for the existence of the driver (k8s-device-plugin/cmd/k8s-device-plugin/main.go, line 327 in 0a820b8):
   if _, err := os.Stat(path); err == nil {
2. if not present, the plugin is just a process blocked by the golang select statement:
   https://github.com/kubevirt/device-plugin-manager/blob/16a4b8a71a689ccf89f6cacd5ac3dbea84ffadb1/pkg/dpm/manager.go#L69

If the choice is between running the plugin on all nodes vs no nodes as the default behaviour, I would prefer all because the user is installing the Helm Chart for a reason. Ideally I would have a context sensitive default value but sounds like that is also not welcome because nodeselector is well used/customized by cluster admins.

The remaining issue is how to communicate the feature.node.kubernetes.io/pci-0300_1002.present: "true". Should I just set a variable of some sort but leave it unused or just document it in the README?

rptaylor · 2024-04-11T17:06:57Z

> 2. if not present, the plugin is just a blocked process by the golang select statement
>    https://github.com/kubevirt/device-plugin-manager/blob/16a4b8a71a689ccf89f6cacd5ac3dbea84ffadb1/pkg/dpm/manager.go#L69

Meaning that it will remain in interruptible sleep indefinitely?

> If the choice is between running the plugin on all nodes vs no nodes as the default behaviour, I would prefer all because the user is installing the Helm Chart for a reason. Ideally I would have a context sensitive default value but sounds like that is also not welcome because nodeselector is well used/customized by cluster admins.

I am not sure how the chart could detect the right default, it depends on what the admin wants to do and if NFD is installed independently. Anyway I'll make a MR.

> The remaining issue is how to communicate the feature.node.kubernetes.io/pci-0300_1002.present: "true". Should I just set a variable of some sort but leave it unused or just document it in the README?

I think the values of node_selector could just remain the same. People can modify it if they need.

y2kenny-amd · 2024-04-18T15:56:33Z

Resolved by #58

removing unnecessary condition

64ad88a

Signed-off-by: julian3xl <[email protected]>

julian3xl mentioned this pull request Oct 11, 2022

Issues in helm chart #19

Open

Merge branch 'master' into master

54dde91

rptaylor mentioned this pull request Apr 5, 2024

Remove reference to deprecated allow-privileged flag for kubelet #55

Merged

rptaylor mentioned this pull request Apr 11, 2024

add node_selector_enabled variable #58

Merged

y2kenny-amd closed this Apr 18, 2024
