Add kernel-64k kernelType #3903
|
Hi @jbtrystram. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with [...]. Once the patch is verified, the new status will be reflected by the [...]. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Not an MCO expert but this extension is more a "kernel-rt" extension than a classic extension so it should likely follow a similar code path. |
|
Found:
So we should likely introduce a new value for |
|
Updated the PR with a first round of changes, following the [...]. Should I duplicate the e2e test TestKernelType to also test the 64k kernel? Or is this test enough to validate that switching kernels works, without testing both cases? Also, please wait a bit while I go through a deployment of this to test on a real cluster :) |
Force-pushed from 4c84f86 to 77b6c2e
|
/ok-to-test |
|
/retest |
jlebon
left a comment
I think we also need to have something similar to queueRevertRTKernel() but for this new 64k-hugepages kernel.
|
I just got this to work nicely on an ARM64 cluster after creating a MachineConfig that sets the new kernelType. The rendered MachineConfig is updated successfully, and the nodes get updated through the MCD. |
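A MachineConfig of roughly the following shape should exercise this path. Only `spec.kernelType: 64k-pages` comes from this PR's commit message; the object name and the `worker` role label are illustrative assumptions, not the exact manifest applied in this test.

```yaml
# Sketch of a MachineConfig that selects the 64k-pages kernel for a pool.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  # Hypothetical name; any unique name works.
  name: 99-worker-64k-pages
  labels:
    # Assumed target pool: the default worker pool.
    machineconfiguration.openshift.io/role: worker
spec:
  # Value introduced by this PR; mutually exclusive with `realtime`.
  kernelType: 64k-pages
```

Once applied, the pool should pick up a new rendered config, which the MCD then rolls out to the nodes in that pool.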
|
/retest |
|
Here is the recap of some testing I've done today :
A new MCD log : [...]
All the nodes are up with the new MachineConfig:
MCD log
All the nodes are back to the previously rendered MachineConfig: |
Force-pushed from 394fd58 to 5a59a93
|
/retest |
|
/retest |
install/0000_80_machine-config-operator_01_machineconfig.crd.yaml
sinnykumari
left a comment
Please link the Jira story associated with this PR.
This will need e2e testing added as well; the existing kernelType-related test is at https://github.com/openshift/machine-config-operator/blob/master/test/e2e/mcd_test.go#L195 .
One question: are we going to have a 64K kernelType for the RT kernel as well in the future, or is that irrelevant?
Putting this on hold for QE pre-merge testing.
/hold
|
Looking at epic https://issues.redhat.com/browse/COS-2402 and the PR description, it seems this is only relevant to aarch64. I am curious: are we allowing users to switch to kernel-64k on all arches or only on aarch64? If only on aarch64, mentioning this explicitly in the docs would avoid accidental use of this field on other arches. Perhaps we could also log an error message in the daemon log on non-aarch64 nodes. |
|
this is now merged :) |
|
/retest |
|
Hi, this feature cannot be supported on the GCP platform, as their VMs do not allow booting kernels with a 64k pagesize. We'll document this, but we wonder if we could raise an error when a user tries to apply an MC for this on GCP. I'd see it as a nice-to-have for improving the UX of this feature. |
  This feature is available with OCP 4.4 and onward releases as both `day 1` and `day 2` operation. It allows to choose between traditional and Real Time (RT) kernel on an RHCOS node. Supported values are
- `""` or `default` for traditional kernel and `realtime` for RT kernel.
+ `""` or `default` for traditional kernel, `realtime` for RT kernel and `64k-pages` for 64k memory pages on aarch64.
+ Note that `64k-pages` and `realtime` cannot be selected at the same time. Also, 64k pages support is limited to aarch64 architecture.
is there some way we can limit this change to apply only on aarch64 nodes? and possibly throw some sort of error if the associated machine pool is not aarch64 based?
As far as I know, the RT kernel is not supported on ARM64 clusters, so applying kernelType realtime + 64k-pages on an ARM64 cluster is an invalid case.
If we apply kernelType 64k-pages on an AMD64 cluster, the machine config pool will be degraded with the error message below:
- lastTransitionTime: '2023-10-18T06:55:53Z'
  message: 'Failed to render configuration for pool worker: kernelType=64k-pages is
    invalid'
  reason: ''
  status: 'True'
  type: RenderDegraded
oops..did not notice that ...thanks
also does this error mean that the machine will be unusable when the update fails? i am thinking of a case where we have a multi-arch compute cluster with x86+arm64 compute nodes? would the x86 machines error out and be in a "NotReady" state?
@Prashanth684 I'll try it out today and followup
update: Clusterbot is not working for me today so I can't try that quickly without going through the process of building a whole release payload, which I won't be able to do today.
> Note that `64k-pages` and `realtime` cannot be selected at the same time. Also, 64k pages support is limited to aarch64 architecture.
Should we also say that realtime cannot be selected on aarch64 to make things clearer (not sure for P/Z)?
> also does this error mean that the machine will be unusable when the update fails? i am thinking of a case where we have a multi-arch compute cluster with x86+arm64 compute nodes? would the x86 machines error out and be in a "NotReady" state?

If the realtime kernel is applied on an arm64 node, the machine will be degraded, i.e. `machineconfiguration.openshift.io/state: Degraded`, and the annotation `machineconfiguration.openshift.io/reason` will show that the RT packages are not available.
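On a node in that state, the two annotations would look roughly like the fragment below. The annotation keys are the ones named above; the reason text is a placeholder, since the exact message is not quoted in this thread.

```yaml
# Partial sketch of a degraded Node object; all other fields are omitted.
metadata:
  annotations:
    machineconfiguration.openshift.io/state: Degraded
    # Placeholder: the real value is the MCD's error text for the failed update.
    machineconfiguration.openshift.io/reason: "<error reporting that the RT packages are not available>"
```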
If a user applies an unsupported combination, the node will go degraded. MCO applies config per pool, so it is the admin's responsibility to make sure they apply a generic config when a pool has nodes of multiple architectures. The good thing is that if one node fails to apply the config, MCD marks the node degraded, and hence it won't start updating another node (when maxUnavailable: 1) and cascade the issue. On the degraded node, whatever is already running would keep working, but in order to schedule a new pod, the issue on the node must first be resolved so that the update can complete and the node is marked schedulable again.
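The per-pool setting mentioned above lives on the MachineConfigPool. The fragment below is a partial sketch showing only that field; the pool name `worker` is an assumption, and the selectors a real pool carries are omitted.

```yaml
# Partial MachineConfigPool sketch; machineConfigSelector and nodeSelector are omitted.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  # Assumed pool name.
  name: worker
spec:
  # With one node updated at a time, a node that degrades stops the rollout
  # before the next node in the pool is touched.
  maxUnavailable: 1
```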
|
Update latest test result
EFI stub: ERROR: This 64 KB granular kernel is not supported by your CPU
Failed to boot both default and fallback entries.
Press any key to continue...
MCO QE will execute cases on other platforms later |
|
Update latest test result |
Interesting. Do you have a link about this with more information? Is it for all machine types or only the smaller ones?
If indeed it's not supported at all, my vote would also be to give a nicer error here rather than wait until the user hits a likely much more obscure error down the line. That said, I wouldn't necessarily block this PR on this if @jbtrystram would rather do that as a follow-up. |
No machine size can boot a 64k pagesize Linux kernel on Google Cloud. We have had some internal discussions about this, and there should be BZs. I'll try to dig them up again. |
|
Update latest test result Summarize the test status
|
|
/unhold |
|
@rioliu-rh: The label(s) [...] |
|
/label qe-approved |
Since RHEL 9 reverted to 4K memory pages for aarch64, add a way to switch to a hugepage kernel.
The MachineConfig should contain the following to trigger the kernel switch:
spec:
  kernelType: 64k-pages
This is exclusive with the `realtime` kernel option.
xref https://issues.redhat.com/browse/COS-2402
This requires openshift/os#1351
Signed-off-by: jbtrystram <jbtrystram@redhat.com>
Force-pushed from f627a5c to ddd975a
|
@jbtrystram: The following test failed [...] |
|
QE testing has been done. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cgwalters, jbtrystram, sinnykumari |