Skip to content

OCPBUGS-32519: Fix appliance CI jobs#8297

Merged
openshift-merge-bot[bot] merged 6 commits intoopenshift:masterfrom
zaneb:appliance-register-cluster
Apr 25, 2024
Merged

OCPBUGS-32519: Fix appliance CI jobs#8297
openshift-merge-bot[bot] merged 6 commits intoopenshift:masterfrom
zaneb:appliance-register-cluster

Conversation

@zaneb
Copy link
Member

@zaneb zaneb commented Apr 22, 2024

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 22, 2024
@openshift-ci-robot
Copy link
Contributor

@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8242, which broke the appliance CI jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zaneb
Copy link
Member Author

zaneb commented Apr 22, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 22, 2024
@openshift-ci-robot
Copy link
Contributor

@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zaneb
Copy link
Member Author

zaneb commented Apr 22, 2024

I'm not sure if this will work, but we'll see.

@zaneb
Copy link
Member Author

zaneb commented Apr 22, 2024

It didn't break the non-appliance jobs, so that's something 😄

@zaneb zaneb force-pushed the appliance-register-cluster branch from e1388cb to 5922f49 Compare April 23, 2024 05:05
@openshift-ci-robot
Copy link
Contributor

@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @mhanss

Details

In response to this:

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@zaneb
Copy link
Member Author

zaneb commented Apr 23, 2024

I am attempting to take advantage of the pre-204c4ae implementation here, which still exists. This is preferable to a trivial fix in the ignition because we'll need it for AGENT-838.

However, despite the service now getting enabled as expected:

Apr 23 18:11:51 master-0 load-config-iso.sh[1400]: Service start-cluster-installation.service is disabled, enabling it
Apr 23 18:11:51 master-0 systemctl[1467]: Created symlink /etc/systemd/system/multi-user.target.wants/start-cluster-installation.service → /etc/systemd/system/start-cluster-installation.service.
Apr 23 18:11:51 master-0 systemctl[1467]: Created symlink /etc/systemd/system/multi-user.target.wants/agent-register-cluster.service → /etc/systemd/system/agent-register-cluster.service.
Apr 23 18:11:51 master-0 systemd[1]: Reloading.

agent-register-cluster.service never starts.

@rwsu did you have this working at one point? If so, what was the trick? Or did you make the switch in 204c4ae because it didn't work at all?

@rwsu
Copy link
Contributor

rwsu commented Apr 23, 2024

agent-register-cluster.service never starts because it was removed as default enabled service in one of the day2 PRs: 9716c1f#diff-51bfa26054a56aa6871696bdc34e443b2ea944e5df9420c2e96db85595099ea2L302-L312

Since they were enabled by default before in the unconfigured ignition, I'm not sure what affect re-enabling the service in load-config-iso.sh was having. Perhaps that code block should be removed unless @bfournie remembers another reason for it.

The switch was having us not rely on enabling/disabling services but have the services be conditional on files being present on the file system. After /etc/assisted/rendezvous-host.env is copied from the config ISO, node-zero.service goes to success and writes out /etc/assisted/node0 which then triggers the other agent services to start.

This can be helpful in debugging, especially in CI but occasionally in
the field.
@zaneb zaneb force-pushed the appliance-register-cluster branch 2 times, most recently from 90876aa to 09f0bf4 Compare April 23, 2024 21:22
zaneb added 4 commits April 24, 2024 18:19
Enabling a systemd service required by a target that is already in
progress doesn't actually result in it starting, because systemd does
not rebuild its run queue from the existing transaction. To get around
this, explicitly start the start-cluster-installation service when we
enable it in the appliance.
There are two pairs of services that are always needed together in the
agent ISO. For installing a cluster, agent-register-cluster.service
followed (later) by start-cluster-installation.service. For adding a
node to a cluster, agent-import-cluster.service followed (later) by
agent-add-node.service.

Reflect these dependencies in the systemd units' Install sections, so
that we only need to enable either start-cluster-installation.service or
agent-add-node.service to ensure all of the required services are
enabled. This will simplify the implementation of adding a node via the
appliance, where one flow or the other will need to be triggered in
response to the config ISO being attached. Do not make either unit a
requirement of multi-user.target, as they conflict. That allows us to
enable both units (i.e. execute their Install sections) in the ignition.

This ensures that when we start start-cluster-installation.service upon
seeing a config ISO attached to the appliance,
agent-register-cluster.service also gets started. This service was
previously inadvertantly disabled by
9716c1f.

Up to then, we also relied on enabling
start-cluster-installation.service in the unconfigured ignition.
However, due to the remnants of an implementation that existed prior to
204c4ae, there is still code in
load-config-iso.sh to enable the service after the config drive is
attached. This would be needed for an interactive ISO, but for the
applicance we changed to enabling the service in the ignition.

In future we will need to choose whether to start
start-cluster-installation.service or agent-add-node.service based on
the contents of the config drive, so continue to do this at runtime
rather than simply re-enabling the former in the unconfigured ignition.
We either want to import a cluster and add a node, or register a cluster
and start installation. Never both. Record this in the systemd units.
In systemd unit files, After= only affects ordering and can reference
units that don't exist or exist but are not enabled, so there is no
reason to use templating.

In the appliance use case we will need the agent image that is installed
to disk to contain systemd units that work for either installing or
adding a node (when the appropriate config is added), so both units must
have the correct ordering.
@zaneb zaneb force-pushed the appliance-register-cluster branch from 09f0bf4 to 7a30df7 Compare April 24, 2024 06:26
Don't attempt to start agent-register-infraenv or apply-host-config
unless either start-cluster-installation or agent-add-node is enabled.

This prevents any races that could cause these units to start early and
complain about dependencies having failed.
@zaneb
Copy link
Member Author

zaneb commented Apr 24, 2024

Enabling a service that is wanted by the current target on the fly is not sufficient to get systemd to start it, because it does not regenerate its run queue when a service is enabled (see systemd/systemd#23034 (comment)). The service must be explicitly started, which will also start dependencies if they are configured correctly.

In future it might be tidier to have an agent-install-cluster.target that all the installation services are WantedBy and we could systemctl isolate agent-install-cluster.target to start them up. For now, this should be sufficient.

@zaneb
Copy link
Member Author

zaneb commented Apr 24, 2024

/retest

@zaneb
Copy link
Member Author

zaneb commented Apr 24, 2024

/cc @andfasano

@openshift-ci openshift-ci bot requested a review from andfasano April 24, 2024 11:09
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 24, 2024

@zaneb: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-e2e-agent-compact-ipv4 cefe677 link false /test okd-e2e-agent-compact-ipv4

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@andfasano
Copy link
Contributor

/lgtm

Thanks @zaneb for the fix. I've also quickly verified the add-node workflow and it seems ok. @rwsu could you please have a run on the patch with the add-node workflow? Thanks

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 24, 2024
Copy link
Contributor

@rwsu rwsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified the appliance workflow is working again and the day-2 add nodes workflow is good with this patch.

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 25, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rwsu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2024
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 388a35c and 2 for PR HEAD cefe677 in total

@openshift-merge-bot openshift-merge-bot bot merged commit 4bbb027 into openshift:master Apr 25, 2024
@openshift-ci-robot
Copy link
Contributor

@zaneb: Jira Issue OCPBUGS-32519: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32519 has been moved to the MODIFIED state.

Details

In response to this:

The agent-register-cluster systemd service is disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-04-26-145258

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.16.0-0.nightly-2024-04-29-154406

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants