COS-1926, MCO-116, OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari #3604
When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written to the new location because:
1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.
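The sequence above suggests a check that runs after the reboot regardless of what the "latest config" comparison says. A minimal Go sketch of such a check follows, assuming caller-supplied placeholder paths; `migrateSSHKeys` and `runDemo` are hypothetical names for illustration, not the MCO's actual functions, and the real RHCOS 8 and RHCOS 9 key locations are not spelled out here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// migrateSSHKeys moves the authorized-keys file from oldPath to newPath if it
// exists only at oldPath. It reports whether a move happened. Both paths are
// placeholders chosen by the caller, not the real RHCOS locations.
func migrateSSHKeys(oldPath, newPath string) (bool, error) {
	if _, err := os.Stat(newPath); err == nil {
		return false, nil // keys already live at the new location
	} else if !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	if _, err := os.Stat(oldPath); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return false, nil // nothing to migrate
		}
		return false, err
	}
	if err := os.MkdirAll(filepath.Dir(newPath), 0o700); err != nil {
		return false, err
	}
	if err := os.Rename(oldPath, newPath); err != nil {
		return false, err
	}
	return true, nil
}

// runDemo exercises migrateSSHKeys twice inside a temporary directory.
func runDemo() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "sshkeys")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	oldPath := filepath.Join(dir, "old", "authorized_keys")
	newPath := filepath.Join(dir, "new", "keys.d", "authorized_keys")
	if err = os.MkdirAll(filepath.Dir(oldPath), 0o700); err != nil {
		return false, false, err
	}
	if err = os.WriteFile(oldPath, []byte("ssh-ed25519 AAAA example\n"), 0o600); err != nil {
		return false, false, err
	}
	if first, err = migrateSSHKeys(oldPath, newPath); err != nil {
		return first, false, err
	}
	second, err = migrateSSHKeys(oldPath, newPath)
	return first, second, err
}

func main() {
	first, second, err := runDemo()
	fmt.Println(first, second, err)
}
```

The check is idempotent: the second call finds the keys already at the new path and does nothing, so it is safe to run on every daemon start.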
ref: https://issues.redhat.com/browse/COS-1983
We introduced a new `rhel-coreos` that is RHEL 9 to help make the switch an atomic operation. After design discussion we realized it's easier to have an "unversioned" image though, so this drops the `-8`.
Unfortunately rpm-ostree requires this right now; we have an issue and code to provide a better API in coreos/rpm-ostree#2542. But using that will require shipping the updated rpm-ostree in RHEL 8.6.z or at least OCP 4.12.z, which is problematic. Because we know the new MCD will always be upgrading to RHEL9, for now let's update this hardcoded list. In the future we can detect when the running host has `--remove-installed-kernel` and use it instead.
Rapid file changes triggering the path unit can start the service here frequently, and then this can cause the start limit to be hit, and then systemd will refuse further activations (unless we bumped the limit). I don't think we need to synchronize the iptables rules more than once every 3 seconds.
We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113 where the MCD will get stuck if deploying the RT kernel fails, because the switch to the RT kernel operates from the *booted* deployment state, but by default rpm-ostree wants to operate from pending. Move up the "cleanup pending deployment on failure" `defer` to right before we do anything else.
The RT kernel switch logic operates from the *booted* deployment, not pending. I had in my head that the MCO always cleaned up pending, but due to another bug we didn't. There's no reason to leave this cleanup to a defer; do it before we do anything else. (But keep the defer because it's cleaner to *also* cleanup if we fail)
This fixes a regression with the previous commit openshift@8ac5bee where we would simply fail to roll out on RT node systems any further MachineConfig changes.
@cgwalters: This pull request references Jira Issue OCPBUGS-9951, which is valid. 6 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker.
/payload 4.13 nightly blocking
@cgwalters: trigger 6 job(s) of type blocking for the nightly release of OCP 4.13
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f8d10e50-bee2-11ed-9323-2e30db52b368-0
Last minute change in openshift#3496 resulted in the number being removed from the end of the `rhel-coreos-8/9` image; it is now simply in there as `rhel-coreos`, and as a result the regex that was scraping out the extensions images (because fcos/scos don't ship them) no longer works. This adjusts the sed command in the Dockerfile so it matches again now that the number is missing, and the extensions are properly removed. (cherry picked from commit cb2958d)
Let's assign @sdodson to do the cherry-pick-approved label.
Assigning to scott for approval
/label cherry-pick-approved
/label cherry-pick-approved
@cgwalters: Jira Issue OCPBUGS-9951: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-9951 has been moved to the MODIFIED state.
This rolls together the following PRs:
- `rhel-coreos` image name change #3597
daemon: Clean up `switchKernel` a bit
De-duplicate calls to `canonicalizeKernelType` to make the logic easier to read. Also add a few comments.
(cherry picked from commit b75c7af)
vendor: Bump coreos/rpm-ostree-client-go
In prep for usage in MCD.
(cherry picked from commit cae67a6)
daemon: Make switchKernel less stateful
This is prep for fixing RHEL9 upgrades while maintaining `kernel-rt`.
Previously the `switchKernel` logic tried to carefully handle all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).
But the last one (rt -> rt) was not quite right because the previous `rpm-ostree rebase` command already preserved the previous kernel. In fact it was pretty expensive to do things this way because we'd e.g. regenerate the initramfs twice.
To say this another way: when doing a RHEL9 update, it's actually the first `rpm-ostree rebase` command which fails before we even get to `switchKernel`.
The reason is the introduction of a new `-core` subpackage; xref https://issues.redhat.com/browse/OCPBUGS-8113
So here's the new logic to handle this:
- Before the `rebase` operation to the new OS, we detect any previous overrides of any packages starting with `kernel-rt` and we remove them. Notably this avoids hardcoding any specific kernel subpackages; we just remove everything starting with `kernel-rt`, which should be more robust to subpackage changes in the future.
- The `rebase` operation will hence start out by deploying the stock image, i.e. with the throughput kernel (though note we are carefully preserving other local overrides).
- The `switchKernel` function no longer needs to take the previous machineconfig state into account (except for logging). Instead, we just detect if the target is RT, and if so we then apply the latest packages.
This significantly simplifies the logic in `switchKernel`, and will help fix RHEL9 upgrades.
(cherry picked from commit 8ac5bee)
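The prefix-based removal described in this commit message can be sketched in a few lines of Go. This is an illustration only: `rtOverridesToRemove` is a hypothetical helper, and the package names in `main` are sample input rather than data read from rpm-ostree.

```go
package main

import (
	"fmt"
	"strings"
)

// rtPrefix is the package-name prefix named in the commit message.
const rtPrefix = "kernel-rt"

// rtOverridesToRemove returns every locally overridden package whose name
// starts with "kernel-rt". No subpackage name is hardcoded, so a newly
// introduced subpackage is picked up as well.
func rtOverridesToRemove(overrides []string) []string {
	var out []string
	for _, pkg := range overrides {
		if strings.HasPrefix(pkg, rtPrefix) {
			out = append(out, pkg)
		}
	}
	return out
}

func main() {
	// Sample input: two RT kernel packages plus an unrelated local override.
	overrides := []string{"kernel-rt-core", "kernel-rt-modules", "usbguard"}
	fmt.Println(rtOverridesToRemove(overrides))
}
```

Unrelated overrides such as the `usbguard` sample are left out of the result, which matches the note about preserving other local overrides.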
Merge pull request #3595 from cgwalters/backport-switchkernel-4.13
OCPBUGS-8703: Backport switchkernel 4.13
ensures that RHCOS 9 SSH keys are in the right place
OKD release controller is out-of-date
ensures SSH keys get moved to the correct location
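When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written to the new location because:
1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.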
teaches TestIgn3Cfg about the new RHCOS 9 key path
checks perms for SSH key path dirs as well
Switch to rhel-coreos (9)
ref: https://issues.redhat.com/browse/COS-1983
We introduced a new `rhel-coreos` that is RHEL 9 to aid having a switch be an atomic operation. After design discussion we realized it's easier to have an "unversioned" image though, so this drops the `-8`.
daemon: Also override `kernel-modules-core`
Unfortunately rpm-ostree requires this right now; we have an issue and code to provide a better API in coreos/rpm-ostree#2542.
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z or at least OCP 4.12.z, which is problematic.
Because we know the new MCD will always be upgrading to RHEL9, for now let's update this hardcoded list. In the future we can detect when the running host has `--remove-installed-kernel` and use it instead.
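To make the "hardcoded list" concrete, here is a rough Go sketch of how an `rpm-ostree override remove ... --install ...` argument list could be assembled. Apart from `kernel-modules-core`, which this commit names, the package lists are assumptions for illustration and may not match what the MCD actually ships; `rtSwitchArgs` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed package lists; only kernel-modules-core is taken from the commit
// message, the rest are illustrative.
var defaultKernelPkgs = []string{
	"kernel", "kernel-core", "kernel-modules",
	"kernel-modules-core", "kernel-modules-extra",
}

var rtKernelPkgs = []string{
	"kernel-rt-core", "kernel-rt-modules",
	"kernel-rt-modules-extra", "kernel-rt-kvm",
}

// rtSwitchArgs builds the arguments for replacing the default kernel
// packages with the RT ones in a single rpm-ostree invocation.
func rtSwitchArgs(remove, install []string) []string {
	args := []string{"override", "remove"}
	args = append(args, remove...)
	for _, pkg := range install {
		args = append(args, "--install", pkg)
	}
	return args
}

func main() {
	args := rtSwitchArgs(defaultKernelPkgs, rtKernelPkgs)
	fmt.Println("rpm-ostree " + strings.Join(args, " "))
}
```

Every base kernel subpackage has to appear in the removal list, which is why a new subpackage in RHEL 9 means touching the list by hand until a flag like `--remove-installed-kernel` can be relied on.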
openshift-azure-routes: Avoid synchronizing too quickly
Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).
I don't think we need to synchronize the iptables
rules more than once every 3 seconds.
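The actual change here lives in the systemd unit and script, which are not shown in this PR text. Purely to illustrate the idea of "at most once every 3 seconds", here is a small Go sketch of a minimum-interval throttle; the `throttle` type and `simulate` helper are hypothetical and not part of openshift-azure-routes.

```go
package main

import (
	"fmt"
	"time"
)

// throttle allows an action at most once per minInterval.
type throttle struct {
	minInterval time.Duration
	last        time.Time
	ran         bool
}

// allow reports whether the action may run at time now, and records the run.
func (t *throttle) allow(now time.Time) bool {
	if t.ran && now.Sub(t.last) < t.minInterval {
		return false // too soon after the previous run
	}
	t.last = now
	t.ran = true
	return true
}

// simulate feeds trigger times (milliseconds after an arbitrary start) into a
// 3-second throttle and returns the decision for each trigger.
func simulate(offsetsMillis []int) []bool {
	start := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC)
	t := &throttle{minInterval: 3 * time.Second}
	out := make([]bool, 0, len(offsetsMillis))
	for _, ms := range offsetsMillis {
		out = append(out, t.allow(start.Add(time.Duration(ms)*time.Millisecond)))
	}
	return out
}

func main() {
	// Rapid file changes at 0s, 0.5s and 2.9s collapse into one sync.
	fmt.Println(simulate([]int{0, 500, 2900, 3000, 3100, 6000}))
}
```

Spacing activations this way keeps a burst of path-unit triggers from hitting the start limit.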
daemon: Move cleanup of pending deployment earlier
We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.
Move up the "cleanup pending deployment on failure" `defer` to right before we do anything else.
daemon: Always remove pending deployment before we do updates
The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.
There's no reason to leave this cleanup to a defer; do it
before we do anything else.
(But keep the defer because it's cleaner to also cleanup if
we fail)
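The "clean up first, and also on failure" pattern from the two commits above might look roughly like this in Go. The function names and the callback-based shape are assumptions made for the sketch; the MCD's real update path is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// update removes any pending deployment before doing anything else, then
// applies the change. If applying fails, the deferred function cleans up the
// pending deployment again so the next attempt starts from the booted state.
func update(cleanupPending, apply func() error) (retErr error) {
	if err := cleanupPending(); err != nil {
		return fmt.Errorf("removing pending deployment: %w", err)
	}
	defer func() {
		if retErr == nil {
			return
		}
		if err := cleanupPending(); err != nil {
			retErr = fmt.Errorf("%w (cleanup also failed: %v)", retErr, err)
		}
	}()
	return apply()
}

// runDemo counts how often cleanup runs for a succeeding or failing apply.
func runDemo(fail bool) (cleanups int, err error) {
	cleanup := func() error { cleanups++; return nil }
	apply := func() error {
		if fail {
			return errors.New("deploying RT kernel failed")
		}
		return nil
	}
	err = update(cleanup, apply)
	return cleanups, err
}

func main() {
	fmt.Println(runDemo(false))
	fmt.Println(runDemo(true))
}
```

On success the cleanup runs once up front; on failure it runs a second time from the `defer`, which is the "also cleanup if we fail" case.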
daemon: Only switchkernel if we are doing an OS update or kernel change
This fixes a regression with the previous commit 8ac5bee, where we would simply fail to roll out any further MachineConfig changes on RT node systems.
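The guard named in this commit title can be sketched as a simple predicate. The `nodeState` type, its field names, and `canonicalKernelType` are hypothetical stand-ins written for this example; the MCD has its own `canonicalizeKernelType`, whose exact behavior is assumed here.

```go
package main

import "fmt"

// nodeState holds the two inputs relevant to the decision (assumed fields).
type nodeState struct {
	OSImage    string
	KernelType string
}

// canonicalKernelType treats anything other than "realtime" as "default".
// This is an assumption about how kernel types are normalized.
func canonicalKernelType(k string) string {
	if k == "realtime" {
		return "realtime"
	}
	return "default"
}

// shouldSwitchKernel is true only for an OS update or a kernel type change,
// so unrelated MachineConfig changes skip the kernel switch entirely.
func shouldSwitchKernel(current, desired nodeState) bool {
	osChanged := current.OSImage != desired.OSImage
	kernelChanged := canonicalKernelType(current.KernelType) !=
		canonicalKernelType(desired.KernelType)
	return osChanged || kernelChanged
}

func main() {
	rt := nodeState{OSImage: "image-a", KernelType: "realtime"}
	fmt.Println(shouldSwitchKernel(rt, rt))
	fmt.Println(shouldSwitchKernel(rt, nodeState{OSImage: "image-b", KernelType: "realtime"}))
}
```

With a guard like this, an RT node receiving only a file or unit change would not re-enter the kernel switch path.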
Dockerfile: Fix removing extensions for fcos/scos
Last minute change in #3496 resulted in the number being removed from the end of the `rhel-coreos-8/9` image; it is now simply in there as `rhel-coreos`, and as a result the regex that was scraping out the extensions images (because fcos/scos don't ship them) no longer works.
This adjusts the sed command in the Dockerfile so it matches again now that the number is missing, and the extensions are properly removed.
(cherry picked from commit cb2958d)
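The sed expression itself is not quoted in this PR text, so the following Go sketch only illustrates the class of bug: a pattern that requires a version number stops matching once the number is dropped. Both patterns and the image names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	// versionedPattern requires a numeric suffix such as -8 or -9 (assumed).
	versionedPattern = `^rhel-coreos-[0-9]+-extensions$`
	// relaxedPattern makes the numeric suffix optional (assumed).
	relaxedPattern = `^rhel-coreos(-[0-9]+)?-extensions$`
)

// matches reports whether name matches the given regular expression.
func matches(pattern, name string) bool {
	return regexp.MustCompile(pattern).MatchString(name)
}

func main() {
	for _, name := range []string{"rhel-coreos-8-extensions", "rhel-coreos-extensions"} {
		fmt.Println(name, matches(versionedPattern, name), matches(relaxedPattern, name))
	}
}
```

A pattern that tolerates both forms keeps working whether or not the image name carries a version.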