OCPBUGS-32519: Fix appliance CI jobs by zaneb · Pull Request #8297 · openshift/installer

zaneb · 2024-04-22T12:07:54Z

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.

openshift-ci-robot · 2024-04-22T12:07:59Z

@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8242, which broke the appliance CI jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

zaneb · 2024-04-22T12:08:56Z

/jira refresh

openshift-ci-robot · 2024-04-22T12:09:02Z

@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

zaneb · 2024-04-22T12:13:28Z

I'm not sure if this will work, but we'll see.

zaneb · 2024-04-22T20:28:00Z

It didn't break the non-appliance jobs, so that's something 😄

openshift-ci-robot · 2024-04-23T05:08:27Z

@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

Details

In response to this:

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

zaneb · 2024-04-23T10:54:09Z

I am attempting to take advantage of the pre-204c4ae implementation here, which still exists. This is preferable to a trivial fix in the ignition because we'll need it for AGENT-838.

However, despite the service now getting enabled as expected:

Apr 23 18:11:51 master-0 load-config-iso.sh[1400]: Service start-cluster-installation.service is disabled, enabling it
Apr 23 18:11:51 master-0 systemctl[1467]: Created symlink /etc/systemd/system/multi-user.target.wants/start-cluster-installation.service → /etc/systemd/system/start-cluster-installation.service.
Apr 23 18:11:51 master-0 systemctl[1467]: Created symlink /etc/systemd/system/multi-user.target.wants/agent-register-cluster.service → /etc/systemd/system/agent-register-cluster.service.
Apr 23 18:11:51 master-0 systemd[1]: Reloading.

agent-register-cluster.service never starts.

@rwsu did you have this working at one point? If so, what was the trick? Or did you make the switch in 204c4ae because it didn't work at all?

rwsu · 2024-04-23T19:58:23Z

agent-register-cluster.service never starts because it was removed as default enabled service in one of the day2 PRs: 9716c1f#diff-51bfa26054a56aa6871696bdc34e443b2ea944e5df9420c2e96db85595099ea2L302-L312

Since they were enabled by default before in the unconfigured ignition, I'm not sure what affect re-enabling the service in load-config-iso.sh was having. Perhaps that code block should be removed unless @bfournie remembers another reason for it.

The switch was having us not rely on enabling/disabling services but have the services be conditional on files being present on the file system. After /etc/assisted/rendezvous-host.env is copied from the config ISO, node-zero.service goes to success and writes out /etc/assisted/node0 which then triggers the other agent services to start.

This can be helpful in debugging, especially in CI but occasionally in the field.

Enabling a systemd service required by a target that is already in progress doesn't actually result in it starting, because systemd does not rebuild its run queue from the existing transaction. To get around this, explicitly start the start-cluster-installation service when we enable it in the appliance.

There are two pairs of services that are always needed together in the agent ISO. For installing a cluster, agent-register-cluster.service followed (later) by start-cluster-installation.service. For adding a node to a cluster, agent-import-cluster.service followed (later) by agent-add-node.service. Reflect these dependencies in the systemd units' Install sections, so that we only need to enable either start-cluster-installation.service or agent-add-node.service to ensure all of the required services are enabled. This will simplify the implementation of adding a node via the appliance, where one flow or the other will need to be triggered in response to the config ISO being attached. Do not make either unit a requirement of multi-user.target, as they conflict. That allows us to enable both units (i.e. execute their Install sections) in the ignition. This ensures that when we start start-cluster-installation.service upon seeing a config ISO attached to the appliance, agent-register-cluster.service also gets started. This service was previously inadvertantly disabled by 9716c1f. Up to then, we also relied on enabling start-cluster-installation.service in the unconfigured ignition. However, due to the remnants of an implementation that existed prior to 204c4ae, there is still code in load-config-iso.sh to enable the service after the config drive is attached. This would be needed for an interactive ISO, but for the applicance we changed to enabling the service in the ignition. In future we will need to choose whether to start start-cluster-installation.service or agent-add-node.service based on the contents of the config drive, so continue to do this at runtime rather than simply re-enabling the former in the unconfigured ignition.

We either want to import a cluster and add a node, or register a cluster and start installation. Never both. Record this in the systemd units.

In systemd unit files, After= only affects ordering and can reference units that don't exist or exist but are not enabled, so there is no reason to use templating. In the appliance use case we will need the agent image that is installed to disk to contain systemd units that work for either installing or adding a node (when the appropriate config is added), so both units must have the correct ordering.

Don't attempt to start agent-register-infraenv or apply-host-config unless either start-cluster-installation or agent-add-node is enabled. This prevents any races that could cause these units to start early and complain about dependencies having failed.

zaneb · 2024-04-24T06:43:01Z

Enabling a service that is wanted by the current target on the fly is not sufficient to get systemd to start it, because it does not regenerate its run queue when a service is enabled (see systemd/systemd#23034 (comment)). The service must be explicitly started, which will also start dependencies if they are configured correctly.

In future it might be tidier to have an agent-install-cluster.target that all the installation services are WantedBy and we could systemctl isolate agent-install-cluster.target to start them up. For now, this should be sufficient.

zaneb · 2024-04-24T10:16:23Z

/retest

zaneb · 2024-04-24T11:09:11Z

/cc @andfasano

openshift-ci · 2024-04-24T12:50:45Z

@zaneb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/okd-e2e-agent-compact-ipv4	`cefe677`	link	false	`/test okd-e2e-agent-compact-ipv4`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

andfasano · 2024-04-24T16:59:24Z

/lgtm

Thanks @zaneb for the fix. I've also quickly verified the add-node workflow and it seems ok. @rwsu could you please have a run on the patch with the add-node workflow? Thanks

rwsu

I verified the appliance workflow is working again and the day-2 add nodes workflow is good with this patch.

/approve

openshift-ci · 2024-04-25T16:40:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rwsu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~data/data/agent/OWNERS~~ [rwsu]
~~pkg/asset/agent/OWNERS~~ [rwsu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-04-25T22:59:16Z

/retest-required

Remaining retests: 0 against base HEAD 388a35c and 2 for PR HEAD cefe677 in total

openshift-ci-robot · 2024-04-25T23:50:50Z

@zaneb: Jira Issue OCPBUGS-32519: All pull requests linked via external trackers have merged:

openshift/installer#8297

Jira Issue OCPBUGS-32519 has been moved to the MODIFIED state.

Details

In response to this:

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-merge-robot · 2024-04-26T22:08:04Z

Fix included in accepted release 4.16.0-0.nightly-2024-04-26-145258

openshift-merge-robot · 2024-04-29T22:28:59Z

Fix included in accepted release 4.16.0-0.nightly-2024-04-29-154406

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 22, 2024

openshift-ci bot requested review from mhanss, pawanpinjarkar and rwsu April 22, 2024 12:09

zaneb force-pushed the appliance-register-cluster branch from e1388cb to 5922f49 Compare April 23, 2024 05:05

Include systemd units in agent-gather

c902739

This can be helpful in debugging, especially in CI but occasionally in the field.

zaneb force-pushed the appliance-register-cluster branch 2 times, most recently from 90876aa to 09f0bf4 Compare April 23, 2024 21:22

zaneb added 4 commits April 24, 2024 18:19

Prevent conflicting services starting

8c792a9

We either want to import a cluster and add a node, or register a cluster and start installation. Never both. Record this in the systemd units.

zaneb force-pushed the appliance-register-cluster branch from 09f0bf4 to 7a30df7 Compare April 24, 2024 06:26

openshift-ci bot requested a review from andfasano April 24, 2024 11:09

openshift-ci bot assigned andfasano Apr 24, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 24, 2024

rwsu approved these changes Apr 25, 2024

View reviewed changes

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2024

openshift-merge-bot bot merged commit 4bbb027 into openshift:master Apr 25, 2024

Conversation

zaneb commented Apr 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci-robot commented Apr 22, 2024

Uh oh!

zaneb commented Apr 22, 2024

Uh oh!

openshift-ci-robot commented Apr 22, 2024

Uh oh!

zaneb commented Apr 22, 2024

Uh oh!

zaneb commented Apr 22, 2024

Uh oh!

openshift-ci-robot commented Apr 23, 2024

Uh oh!

zaneb commented Apr 23, 2024

Uh oh!

rwsu commented Apr 23, 2024

Uh oh!

zaneb commented Apr 24, 2024

Uh oh!

zaneb commented Apr 24, 2024

Uh oh!

zaneb commented Apr 24, 2024

Uh oh!

openshift-ci bot commented Apr 24, 2024

Uh oh!

andfasano commented Apr 24, 2024

Uh oh!

rwsu left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Apr 25, 2024

Uh oh!

openshift-ci-robot commented Apr 25, 2024

Uh oh!

openshift-ci-robot commented Apr 25, 2024

Uh oh!

openshift-merge-robot commented Apr 26, 2024

Uh oh!

openshift-merge-robot commented Apr 29, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zaneb commented Apr 22, 2024 •

edited

Loading