Skip to content

Conversation

@cgwalters
Copy link
Member

@cgwalters cgwalters commented Mar 10, 2023

This rolls together the following PRs:


daemon: Clean up switchKernel a bit

De-duplicate calls to canonicalizeKernelType to make the
logic easier to read. Also add a few comments.

(cherry picked from commit b75c7af)


vendor: Bump coreos/rpm-ostree-client-go

In prep for usage in MCD.

(cherry picked from commit cae67a6)


daemon: Make switchKernel less stateful

This is prep for fixing RHEL9 upgrades while maintaining kernel-rt.

Previously the switchKernel logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But, the last one (rt -> rt) was not quite right because
the previous rpm-ostree rebase command already preserved the previous
kernel. In fact it was pretty expensive to do things this way
because we'd e.g. regenerate the initramfs twice.

To say this another way: when doing a RHEL9 update, it's actually
the first rpm-ostree rebase command which fails before we
even get to switchKernel.

And the reason is due to the introduction of a new -core subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

  • Before we do the rebase operation to the new OS, we detect
    any previous overrides of any packages starting with kernel-rt
    and we remove them. Notably this avoids hardcoding any specific
    kernel subpackages; we just remove everything starting with
    kernel-rt which should be more robust to subpackage changes
    in the future.
  • Consequently the rebase operation will hence start out by deploying the
    stock image i.e. with throughput kernel (though note we are
    carefully preserving other local overrides)
  • The switchKernel function now longer needs to take the previous
    machineconfig state into account (except for logging).
    Instead, we just detect if the target is RT, and if so we then we
    apply the latest packages.

This significantly simplifies the logic in switchKernel, and will
help fix RHEL9 upgrades.

(cherry picked from commit 8ac5bee)


Merge pull request #3595 from cgwalters/backport-switchkernel-4.13

OCPBUGS-8703: Backport switchkernel 4.13

ensures that RHCOS 9 SSH keys are in the right place


OKD release controller is out-of-date


ensures SSH keys get moved to the correct location

When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

  1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
  2. The node reboots into RHCOS 9 to complete the upgrade.
  3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.

teaches TestIgn3Cfg about the new RHCOS 9 key path


checks perms for SSH key path dirs as well


Switch to rhel-coreos (9)

ref: https://issues.redhat.com/browse/COS-1983

We introduced a new rhel-coreos that is RHEL 9 to aid having a switch be
an atomic operation. After design discussion we realized it's easier
to have an "unversioned" image though, so this drops the -8.


daemon: Also override kernel-modules-core

Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list. In the future we can
detect when the running host has --remove-installed-kernel and
use it instead.


openshift-azure-routes: Avoid synchronizing too quickly

Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.


daemon: Move cleanup of pending deployment earlier

We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.

Move up the "cleanup pending deployment on failure" defer to
right before we do anything else.


daemon: Always remove pending deployment before we do updates

The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.

There's no reason to leave this cleanup to a defer; do it
before we do anything else.

(But keep the defer because it's cleaner to also cleanup if
we fail)


daemon: Only switchkernel if we are doing an OS update or kernel change

This fixes a regression with the previous commit
8ac5bee
where we would simply fail to roll out on RT node systems any further MachineConfig
changes.


Dockerfile: Fix removing extensions for fcos/scos

Last minute change in #3496 resulted in the number being removed from
the end of the rhel-coros-8/9 image, it is now just simply in there as
rhel-coreos, and as a result the regex that was scraping out the
extensions images (because fcos/scos dont' ship them) no longer works.

This adjusts the sed command in the Dockerfile so it matches again now
that the number is missing, and the extensions are properly removed.

(cherry picked from commit cb2958d)


cheesesashimi and others added 11 commits March 10, 2023 01:19
When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.
ref: https://issues.redhat.com/browse/COS-1983

We introduced a new `rhel-coreos` that is RHEL 9 to aid having a switch be
an atomic operation.  After design discussion we realized it's easier
to have an "unversioned" image though, so this drops the `-8`.
Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list.  In the future we can
detect when the running host has `--remove-installed-kernel` and
use it instead.
Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.
We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the *booted* deployment
state, but by default rpm-ostree wants to operate from pending.

Move up the "cleanup pending deployment on failure" `defer` to
right before we do anything else.
The RT kernel switch logic operates from the *booted* deployment,
not pending.  I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.

There's no reason to leave this cleanup to a defer; do it
before we do anything else.

(But keep the defer because it's cleaner to *also* cleanup if
 we fail)
This fixes a regression with the previous commit
openshift@8ac5bee
where we would simply fail to roll out on RT node systems any further MachineConfig
changes.
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2023

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Details

In response to this:

rhel coreos 9 4.13 katamari

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from jkyros and sinnykumari March 10, 2023 01:27
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2023
@cgwalters cgwalters changed the title rhel coreos 9 4.13 katamari OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari Mar 10, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2023

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Details

In response to this:

OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Mar 10, 2023
@openshift-ci-robot
Copy link
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-9951, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-9685 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-9685 targets the "4.14.0" version, which is one of the valid target versions: 4.14.0
  • bug has dependents

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This rolls together 3 PRs:


daemon: Clean up switchKernel a bit

De-duplicate calls to canonicalizeKernelType to make the
logic easier to read. Also add a few comments.

(cherry picked from commit b75c7af)


vendor: Bump coreos/rpm-ostree-client-go

In prep for usage in MCD.

(cherry picked from commit cae67a6)


daemon: Make switchKernel less stateful

This is prep for fixing RHEL9 upgrades while maintaining kernel-rt.

Previously the switchKernel logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But, the last one (rt -> rt) was not quite right because
the previous rpm-ostree rebase command already preserved the previous
kernel. In fact it was pretty expensive to do things this way
because we'd e.g. regenerate the initramfs twice.

To say this another way: when doing a RHEL9 update, it's actually
the first rpm-ostree rebase command which fails before we
even get to switchKernel.

And the reason is due to the introduction of a new -core subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

  • Before we do the rebase operation to the new OS, we detect
    any previous overrides of any packages starting with kernel-rt
    and we remove them. Notably this avoids hardcoding any specific
    kernel subpackages; we just remove everything starting with
    kernel-rt which should be more robust to subpackage changes
    in the future.
  • Consequently the rebase operation will hence start out by deploying the
    stock image i.e. with throughput kernel (though note we are
    carefully preserving other local overrides)
  • The switchKernel function now longer needs to take the previous
    machineconfig state into account (except for logging).
    Instead, we just detect if the target is RT, and if so we then we
    apply the latest packages.

This significantly simplifies the logic in switchKernel, and will
help fix RHEL9 upgrades.

(cherry picked from commit 8ac5bee)


Merge pull request #3595 from cgwalters/backport-switchkernel-4.13

OCPBUGS-8703: Backport switchkernel 4.13

ensures that RHCOS 9 SSH keys are in the right place


OKD release controller is out-of-date


ensures SSH keys get moved to the correct location

When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

  1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
  2. The node reboots into RHCOS 9 to complete the upgrade.
  3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.

teaches TestIgn3Cfg about the new RHCOS 9 key path


checks perms for SSH key path dirs as well


Switch to rhel-coreos (9)

ref: https://issues.redhat.com/browse/COS-1983

We introduced a new rhel-coreos that is RHEL 9 to aid having a switch be
an atomic operation. After design discussion we realized it's easier
to have an "unversioned" image though, so this drops the -8.


daemon: Also override kernel-modules-core

Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list. In the future we can
detect when the running host has --remove-installed-kernel and
use it instead.


openshift-azure-routes: Avoid synchronizing too quickly

Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.


daemon: Move cleanup of pending deployment earlier

We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.

Move up the "cleanup pending deployment on failure" defer to
right before we do anything else.


daemon: Always remove pending deployment before we do updates

The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.

There's no reason to leave this cleanup to a defer; do it
before we do anything else.

(But keep the defer because it's cleaner to also cleanup if
we fail)


daemon: Only switchkernel if we are doing an OS update or kernel change

This fixes a regression with the previous commit
8ac5bee
where we would simply fail to roll out on RT node systems any further MachineConfig
changes.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from rioliu-rh March 10, 2023 01:28
@cgwalters
Copy link
Member Author

/payload 4.13 nightly blocking

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2023

@cgwalters: trigger 6 job(s) of type blocking for the nightly release of OCP 4.13

  • periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial
  • periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.13-e2e-metal-ipi-sdn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f8d10e50-bee2-11ed-9323-2e30db52b368-0

@cgwalters cgwalters changed the title OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari COS-1926, OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari Mar 10, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2023

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Retaining the bugzilla/valid-bug label as it was manually added.

Details

In response to this:

COS-1926, OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters cgwalters changed the title COS-1926, OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari COS-1926, MCO-116, OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari Mar 10, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 10, 2023

@cgwalters: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Retaining the bugzilla/valid-bug label as it was manually added.

Details

In response to this:

COS-1926, MCO-116, OCPBUGS-8703, OCPBUGS-9951: rhel coreos 9 4.13 katamari

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Last minute change in openshift#3496 resulted in the number being removed from
the end of the `rhel-coros-8/9` image, it is now just simply in there as
`rhel-coreos`, and as a result the regex that was scraping out the
extensions images (because fcos/scos dont' ship them) no longer works.

This adjusts the sed command in the Dockerfile so it matches again now
that the number is missing, and the extensions are properly removed.

(cherry picked from commit cb2958d)
@cgwalters
Copy link
Member Author

Let's assign @sdodson to do the cherry-pick-approved label.

@sinnykumari
Copy link
Contributor

Assigning to scott for approval
/assign @sdodson

@rioliu-rh
Copy link

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 13, 2023
@sdodson
Copy link
Member

sdodson commented Mar 13, 2023

/label cherry-pick-approved
someone got here first, but for formality lets go.

@sdodson sdodson merged commit 02d3527 into openshift:release-4.13 Mar 13, 2023
@openshift-ci-robot
Copy link
Contributor

@cgwalters: Jira Issue OCPBUGS-9951: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-9951 has been moved to the MODIFIED state.

Details

In response to this:

This rolls together the following PRs:


daemon: Clean up switchKernel a bit

De-duplicate calls to canonicalizeKernelType to make the
logic easier to read. Also add a few comments.

(cherry picked from commit b75c7af)


vendor: Bump coreos/rpm-ostree-client-go

In prep for usage in MCD.

(cherry picked from commit cae67a6)


daemon: Make switchKernel less stateful

This is prep for fixing RHEL9 upgrades while maintaining kernel-rt.

Previously the switchKernel logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But, the last one (rt -> rt) was not quite right because
the previous rpm-ostree rebase command already preserved the previous
kernel. In fact it was pretty expensive to do things this way
because we'd e.g. regenerate the initramfs twice.

To say this another way: when doing a RHEL9 update, it's actually
the first rpm-ostree rebase command which fails before we
even get to switchKernel.

And the reason is due to the introduction of a new -core subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

  • Before we do the rebase operation to the new OS, we detect
    any previous overrides of any packages starting with kernel-rt
    and we remove them. Notably this avoids hardcoding any specific
    kernel subpackages; we just remove everything starting with
    kernel-rt which should be more robust to subpackage changes
    in the future.
  • Consequently the rebase operation will hence start out by deploying the
    stock image i.e. with throughput kernel (though note we are
    carefully preserving other local overrides)
  • The switchKernel function now longer needs to take the previous
    machineconfig state into account (except for logging).
    Instead, we just detect if the target is RT, and if so we then we
    apply the latest packages.

This significantly simplifies the logic in switchKernel, and will
help fix RHEL9 upgrades.

(cherry picked from commit 8ac5bee)


Merge pull request #3595 from cgwalters/backport-switchkernel-4.13

OCPBUGS-8703: Backport switchkernel 4.13

ensures that RHCOS 9 SSH keys are in the right place


OKD release controller is out-of-date


ensures SSH keys get moved to the correct location

When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

  1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
  2. The node reboots into RHCOS 9 to complete the upgrade.
  3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.

teaches TestIgn3Cfg about the new RHCOS 9 key path


checks perms for SSH key path dirs as well


Switch to rhel-coreos (9)

ref: https://issues.redhat.com/browse/COS-1983

We introduced a new rhel-coreos that is RHEL 9 to aid having a switch be
an atomic operation. After design discussion we realized it's easier
to have an "unversioned" image though, so this drops the -8.


daemon: Also override kernel-modules-core

Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list. In the future we can
detect when the running host has --remove-installed-kernel and
use it instead.


openshift-azure-routes: Avoid synchronizing too quickly

Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.


daemon: Move cleanup of pending deployment earlier

We hit a confusing failure in https://issues.redhat.com/browse/OCPBUGS-8113
where the MCD will get stuck if deploying the RT kernel fails, because
the switch to the RT kernel operates from the booted deployment
state, but by default rpm-ostree wants to operate from pending.

Move up the "cleanup pending deployment on failure" defer to
right before we do anything else.


daemon: Always remove pending deployment before we do updates

The RT kernel switch logic operates from the booted deployment,
not pending. I had in my head that the MCO always cleaned up
pending, but due to another bug we didn't.

There's no reason to leave this cleanup to a defer; do it
before we do anything else.

(But keep the defer because it's cleaner to also cleanup if
we fail)


daemon: Only switchkernel if we are doing an OS update or kernel change

This fixes a regression with the previous commit
8ac5bee
where we would simply fail to roll out on RT node systems any further MachineConfig
changes.


Dockerfile: Fix removing extensions for fcos/scos

Last minute change in #3496 resulted in the number being removed from
the end of the rhel-coros-8/9 image, it is now just simply in there as
rhel-coreos, and as a result the regex that was scraping out the
extensions images (because fcos/scos dont' ship them) no longer works.

This adjusts the sed command in the Dockerfile so it matches again now
that the number is missing, and the extensions are properly removed.

(cherry picked from commit cb2958d)


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Copy link
Member Author

Followup in

openshift-merge-robot pushed a commit to openshift/release that referenced this pull request Mar 14, 2023
phani0041gs pushed a commit to phani0041gs/release that referenced this pull request Mar 14, 2023
bmanzari pushed a commit to bmanzari/release that referenced this pull request Mar 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.