MCO-650: Implement custom pool booting #5361
base: main
Conversation
Skipping CI for Draft Pull Request.
yuqi-zhang
left a comment
The general workflow sounds fine to me, and definitely aligns with our general discussions around how to make this work.
The only real "functionality" we're adding is having the node request additional labels via annotations, which I feel should be safe, since if the CSR went through to make it a node, we should already trust it.
I guess technically the workflow today probably sees users adding labels via Machine/MachineSet objects, so at worst we'd duplicate that part? Either way, it shouldn't cause an error.
nodeAnnotations := map[string]string{
	daemonconsts.CurrentMachineConfigAnnotationKey:    conf,
	daemonconsts.DesiredMachineConfigAnnotationKey:    conf,
	daemonconsts.FirstPivotMachineConfigAnnotationKey: conf,
Oh, for some reason I forgot we did this, neat
Hello, just for information: we started to use the managed user-data in our MachineSet definition (to get the correct final config at startup). It works, however we rapidly faced issues with clusters that were created with a previous version of OpenShift.
Hi @jlhuilier-1a! You're correct, the reason you're running into that is that your boot images are likely out of date. The *-user-data stub needs to be compatible with the boot image referenced by your MachineSet. Some more context can be found here. The boot image update mechanism described in this document also attempts to upgrade these secrets to the newest version for the same reason. As for this PR, the instructions described were just the easiest path to test 😄 You could also copy an existing user-data secret (say, call it infra-user-data) and then edit the MCS endpoint within the stub to target the infra pool. Then edit the MachineSet to reference that instead; this should work in a similar fashion. When we do eventually bring this feature to GA, we will be sure to make a note of this in the workflow. Thanks for bringing this up, appreciate your review!
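(For context on the manual step above: the *-user-data stub is just a small Ignition config whose merge source points at the Machine Config Server endpoint for a given pool. The Go sketch below shows roughly what such a stub looks like; the cluster domain, the infra pool name, and the CA placeholder are illustrative assumptions, and the real values should come from the existing secret you copy.)

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Build a minimal Ignition stub that merges in the full config served by the
// Machine Config Server (MCS) for a custom pool. The cluster domain, pool
// name, and CA placeholder below are illustrative assumptions, not values
// taken from this PR.
func main() {
	stub := map[string]any{
		"ignition": map[string]any{
			"version": "3.2.0",
			"config": map[string]any{
				"merge": []map[string]any{
					// This is the endpoint you would retarget from /config/worker
					// to /config/infra when hand-copying a user-data secret.
					{"source": "https://api-int.example.cluster.local:22623/config/infra"},
				},
			},
			"security": map[string]any{
				"tls": map[string]any{
					"certificateAuthorities": []map[string]any{
						{"source": "data:text/plain;charset=utf-8;base64,<MCS-CA-bundle>"},
					},
				},
			},
		},
	}
	out, _ := json.MarshalIndent(stub, "", "  ")
	fmt.Println(string(out))
}
```

The edited JSON then goes back into the copied secret's userData key, which is the "edit the MCS endpoint within the stub" step described above.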
djoshy force-pushed from 7977608 to 5939c02
@djoshy: This pull request explicitly references no jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@djoshy: This pull request references MCO-650 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
djoshy force-pushed from 5939c02 to 4892de8
/retest-required
klog.Infof("Node %s was booted into custom pool %s; applying node selector labels: %v", node.Name, poolName, labelsToApply)
// Apply the labels to the node and add annotation indicating custom pool labels were applied
Question for my curiosity: is there any concern around users deleting the annotation key or changing its value? (I'd hope they wouldn't, but anything's possible)
Yeah, anything's possible; a user could potentially mess with this annotation to cause strange behavior (but nothing we can't undo). The same could apply to most of our MCO-side annotations, though. I guess I'd hope it's an informed user, since a good bit of this workflow is manual (new MachineSet & secret setup) 😅
isabella-janssen
left a comment
/lgtm
Code looks clean to me, and the added unit tests look to cover the appropriate cases and are passing.
As a Jira nit, I'd imagine this will need some doc updates? If so, can you add the mco_doc_required label to track that need, please?
Done, thanks for the callout! Also added an e2e to the disruptive suite in the last commit; PTAL when you have a sec. It's not merge-ready since I'll want to rebase on helpers from #5391, but I just wanted to get the flow figured out.
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/961b9040-cb01-11f0-9b98-8f7d91bc0c87-0
o.Expect(err).NotTo(o.HaveOccurred())
})

g.AfterEach(func(ctx context.Context) {
I like the idea of handling the cleanup this way & I think the comments in here are nicely written. 🤩
Initial read-through of the tests looks good to me. Once you rebase with the helpers, I'll take a closer look.
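(For readers skimming the thread: the cleanup pattern being discussed is sketched below. The helper names are placeholders, not helpers from this repository or from #5391; the sketch only illustrates the record-then-tear-down flow in Ginkgo, under those assumptions.)

```go
package e2e_sketch

import (
	"context"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

// A minimal sketch of the cleanup pattern: remember what the spec created,
// then tear it down in AfterEach so a failed spec still cleans up after itself.
var _ = g.Describe("custom pool booting", func() {
	var createdMachineSet string

	g.It("labels nodes booted into the custom pool", func(ctx context.Context) {
		name, err := createInfraMachineSet(ctx) // hypothetical helper
		o.Expect(err).NotTo(o.HaveOccurred())
		createdMachineSet = name
		// ... scale up, wait for the node, assert the pool labels were applied ...
	})

	g.AfterEach(func(ctx context.Context) {
		if createdMachineSet == "" {
			return
		}
		// Teardown runs whether the spec passed or failed.
		o.Expect(deleteMachineSet(ctx, createdMachineSet)).To(o.Succeed()) // hypothetical helper
	})
})

// Placeholder implementations so the sketch compiles on its own.
func createInfraMachineSet(ctx context.Context) (string, error) { return "infra-machineset", nil }
func deleteMachineSet(ctx context.Context, name string) error   { return nil }
```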
djoshy force-pushed from ecb45d3 to 63898f5
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/54dcc290-cee2-11f0-8059-14dd4cd0f9c9-0
djoshy force-pushed from 63898f5 to ca839ba
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive
@djoshy: trigger 11 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/95786450-cf94-11f0-8330-730ebea93444-0
Disruptive runs look good, except for the metal jobs. It seems like there is some sort of validation on that platform that rejects this step. These sorts of errors are typical for a webhook - the solution here might be to have an alternative flow for the metal platform for this specific step. I had hoped the JSON patch would've saved us some trouble by making this step platform agnostic, but oh well, we got 90% of the way there 😄
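(A rough illustration of the "platform agnostic" JSON patch idea, not the PR's actual test code: the patch path below assumes an AWS-style providerSpec, and the MachineSet name is a placeholder. A platform whose admission webhook validates the whole object, as metal appears to, can still reject such a patch.)

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

// Send a JSON patch to a MachineSet through the dynamic client, so the test
// never has to decode each platform's providerSpec. The patch path is an
// assumption based on the AWS provider spec layout; other platforms may differ.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	machineSets := schema.GroupVersionResource{
		Group: "machine.openshift.io", Version: "v1beta1", Resource: "machinesets",
	}
	patch := []byte(`[{"op":"replace","path":"/spec/template/spec/providerSpec/value/userDataSecret/name","value":"infra-user-data-managed"}]`)

	// "my-infra-machineset" is a placeholder name for illustration.
	_, err = dyn.Resource(machineSets).Namespace("openshift-machine-api").
		Patch(context.TODO(), "my-infra-machineset", types.JSONPatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```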
Nodes that boot with a custom pool's rendered MachineConfig were being incorrectly moved to the worker pool after scale-up because they lacked the pool's node selector labels. This change detects such nodes via their first pivot MachineConfig and applies the appropriate labels.
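In code terms, the fix amounts to something like the following sketch. It is a simplified illustration rather than the PR's implementation; the annotation keys are taken from the description above, while the function shape and the example label set are assumptions.

```go
package main

import "fmt"

// Annotation keys as described in the PR text.
const (
	firstPivotAnnotation    = "machineconfiguration.openshift.io/firstPivotConfig"
	labelsAppliedAnnotation = "machineconfiguration.openshift.io/customPoolLabelsApplied"
)

// reconcileCustomPoolLabels merges the pool's node selector labels into the
// node's labels if (a) the node's first pivot config is this pool's rendered
// config and (b) the labels have not been applied before, then records the
// bookkeeping annotation so the labels are not re-applied if another actor
// later removes them. It returns true when it changed anything.
func reconcileCustomPoolLabels(nodeAnnotations, nodeLabels map[string]string, renderedConfig string, poolSelector map[string]string) bool {
	if nodeAnnotations[firstPivotAnnotation] != renderedConfig {
		return false // node did not boot into this pool's config
	}
	if _, done := nodeAnnotations[labelsAppliedAnnotation]; done {
		return false // already labeled once; respect later removals by other actors
	}
	for k, v := range poolSelector {
		nodeLabels[k] = v
	}
	nodeAnnotations[labelsAppliedAnnotation] = ""
	return true
}

func main() {
	ann := map[string]string{firstPivotAnnotation: "rendered-infra-abc123"}
	lbl := map[string]string{}
	changed := reconcileCustomPoolLabels(ann, lbl, "rendered-infra-abc123",
		map[string]string{"node-role.kubernetes.io/infra": ""})
	fmt.Println(changed, lbl) // true map[node-role.kubernetes.io/infra:]
}
```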
@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
djoshy force-pushed from e8e0474 to 2bfe5d7
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive
@djoshy: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ff547460-d127-11f0-8ec0-171c403bfd7e-0
djoshy force-pushed from 2bfe5d7 to 2030623
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive
Added a skip for the metal cases, since scaling them isn't currently automated in CI.
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf9020b0-d1f2-11f0-9b83-bf8c8d4b2b07-0
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-metal-ipi-ovn-ipv4-mco-disruptive
PIS tests don't look happy, but I don't think I touched any of their test code in the last commit 🤔
@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c55d78a0-d443-11f0-8410-e434b5567f3d-0
isabella-janssen
left a comment
/lgtm
All previous conversations were addressed and the new test looks to be passing.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: djoshy, isabella-janssen, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
- What I did
- Adds a new node annotation (machineconfiguration.openshift.io/firstPivotConfig) that stores the very first config that was served to the node.
- Applies the custom pool's node selector labels to nodes whose first pivot config belongs to that pool, then adds another annotation (machineconfiguration.openshift.io/customPoolLabelsApplied) for bookkeeping purposes, so it doesn't attempt to label the node again if the labels were removed by some other actor.
- How to verify it
- Create a custom MCP, e.g. one named infra.
- Copy an existing MachineSet, then:
  - Edit its name to be unique.
  - Change the MachineSet's userDataSecret field to infra-user-data-managed if you used the infra MCP I provided in the previous step. If you used another MCP name, use $MCP_NAME-user-data-managed instead.
  - I also changed the cluster-api-machineset labels to match, but I am unsure if it is needed.
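Once the scaled-up node joins, one rough way to sanity-check the result is to read the annotations and labels back off the node. This is an illustrative sketch, not part of the PR: the node name is a placeholder, the annotation keys are taken from the description above, and the infra role label is assumed from the example pool name.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Fetch the new node and print the annotations/labels the change is expected
// to leave behind after the node boots into the custom pool's config.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// "my-new-infra-node" is a placeholder for the node created by the scale-up.
	node, err := client.CoreV1().Nodes().Get(context.TODO(), "my-new-infra-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("firstPivotConfig:", node.Annotations["machineconfiguration.openshift.io/firstPivotConfig"])
	fmt.Println("customPoolLabelsApplied:", node.Annotations["machineconfiguration.openshift.io/customPoolLabelsApplied"])
	_, hasInfraRole := node.Labels["node-role.kubernetes.io/infra"]
	fmt.Println("has infra role label:", hasInfraRole)
}
```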