OCPBUGS-32519: Fix appliance CI jobs #8297
openshift-merge-bot[bot] merged 6 commits into openshift:master from
Conversation
|
@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
I'm not sure if this will work, but we'll see. |
|
It didn't break the non-appliance jobs, so that's something 😄 |
Force-pushed from e1388cb to 5922f49
|
@zaneb: This pull request references Jira Issue OCPBUGS-32519, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
I am attempting to take advantage of the pre-204c4ae implementation here, which still exists. This is preferable to a trivial fix in the ignition because we'll need it for AGENT-838. However, despite the service now getting enabled as expected:
@rwsu did you have this working at one point? If so, what was the trick? Or did you make the switch in 204c4ae because it didn't work at all? |
|
agent-register-cluster.service never starts because it was removed as a default-enabled service in one of the day-2 PRs: 9716c1f#diff-51bfa26054a56aa6871696bdc34e443b2ea944e5df9420c2e96db85595099ea2L302-L312. Since the services were previously enabled by default in the unconfigured ignition, I'm not sure what effect re-enabling the service in load-config-iso.sh was having. Perhaps that code block should be removed, unless @bfournie remembers another reason for it.

The point of the switch was to stop relying on enabling/disabling services and instead make the services conditional on files being present on the file system. After /etc/assisted/rendezvous-host.env is copied from the config ISO, node-zero.service goes to success and writes out /etc/assisted/node0, which then triggers the other agent services to start. |
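For illustration, here is one way systemd can express that file-triggered behaviour: a path unit that fires once the marker file appears. This is only a sketch of the general mechanism, not necessarily how the installer's units implement it.

```bash
# Sketch only: a systemd path unit that starts agent-register-cluster.service
# once node-zero.service has written its marker file. The installer may gate
# its services differently; this just illustrates file-conditional activation.
cat > /etc/systemd/system/agent-register-cluster.path <<'EOF'
[Unit]
Description=Start cluster registration once node zero is identified (sketch)

[Path]
# Fires when the marker written by node-zero.service appears.
PathExists=/etc/assisted/node0

[Install]
WantedBy=multi-user.target
EOF

# A .path unit activates the service of the same name by default.
systemctl enable --now agent-register-cluster.path
```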
This can be helpful in debugging, especially in CI but occasionally in the field.
Force-pushed from 90876aa to 09f0bf4
Enabling a systemd service required by a target that is already in progress doesn't actually result in it starting, because systemd does not rebuild its run queue from the existing transaction. To get around this, explicitly start the start-cluster-installation service when we enable it in the appliance.
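As a minimal sketch of the distinction (using the unit name from this PR; the surrounding commands are hypothetical, not the PR's actual script):

```bash
# Enabling only creates the .wants/ symlink for future boots; because the
# target's transaction is already in flight, nothing new is scheduled to run.
systemctl enable start-cluster-installation.service

# Explicitly starting the unit queues a fresh job, pulling in its
# dependencies as well.
systemctl start start-cluster-installation.service
```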
There are two pairs of services that are always needed together in the agent ISO. For installing a cluster: agent-register-cluster.service followed (later) by start-cluster-installation.service. For adding a node to a cluster: agent-import-cluster.service followed (later) by agent-add-node.service.

Reflect these dependencies in the systemd units' Install sections, so that we only need to enable either start-cluster-installation.service or agent-add-node.service to ensure all of the required services are enabled. This will simplify the implementation of adding a node via the appliance, where one flow or the other will need to be triggered in response to the config ISO being attached.

Do not make either unit a requirement of multi-user.target, as they conflict. That allows us to enable both units (i.e. execute their Install sections) in the ignition. This ensures that when we start start-cluster-installation.service upon seeing a config ISO attached to the appliance, agent-register-cluster.service also gets started. This service was previously inadvertently disabled by 9716c1f.

Up to then, we also relied on enabling start-cluster-installation.service in the unconfigured ignition. However, due to the remnants of an implementation that existed prior to 204c4ae, there is still code in load-config-iso.sh to enable the service after the config drive is attached. This would be needed for an interactive ISO, but for the appliance we changed to enabling the service in the ignition. In future we will need to choose whether to start start-cluster-installation.service or agent-add-node.service based on the contents of the config drive, so continue to do this at runtime rather than simply re-enabling the former in the unconfigured ignition.
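A minimal sketch of how such pairing can be expressed through [Install] sections. The ExecStart paths are hypothetical and the real unit files in the installer may differ; this only illustrates the systemd mechanism being described.

```bash
cat > /etc/systemd/system/agent-register-cluster.service <<'EOF'
[Unit]
Description=Register cluster with assisted-service (sketch)

[Service]
ExecStart=/usr/local/bin/agent-register-cluster.sh

[Install]
# Enabling this unit links it into start-cluster-installation.service.wants/,
# so starting start-cluster-installation.service also starts this unit.
WantedBy=start-cluster-installation.service
EOF

cat > /etc/systemd/system/start-cluster-installation.service <<'EOF'
[Unit]
Description=Start cluster installation (sketch)

[Service]
ExecStart=/usr/local/bin/start-cluster-installation.sh

[Install]
# Enabling this unit also enables its partner, so one "systemctl enable"
# covers the whole install flow. Note there is no WantedBy=multi-user.target.
Also=agent-register-cluster.service
EOF

systemctl enable start-cluster-installation.service
# The Also= line means agent-register-cluster.service is enabled too, i.e.
# linked into start-cluster-installation.service.wants/.
```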
We either want to import a cluster and add a node, or register a cluster and start installation. Never both. Record this in the systemd units.
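For illustration, the mutual exclusion could be recorded with Conflicts=, shown here as a hypothetical drop-in rather than the PR's actual unit contents:

```bash
# Sketch only: starting one flow stops the other (Conflicts= works both ways).
mkdir -p /etc/systemd/system/start-cluster-installation.service.d
cat > /etc/systemd/system/start-cluster-installation.service.d/10-conflicts.conf <<'EOF'
[Unit]
Conflicts=agent-add-node.service
EOF
systemctl daemon-reload
```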
In systemd unit files, After= only affects ordering and can reference units that don't exist, or that exist but are not enabled, so there is no reason to use templating. In the appliance use case we will need the agent image that is installed to disk to contain systemd units that work for either installing or adding a node (when the appropriate config is added), so both units must have the correct ordering.
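A sketch of what ordering-only dependencies look like, using a hypothetical drop-in for apply-host-config.service; units listed in After= that are missing or never started are simply ignored:

```bash
mkdir -p /etc/systemd/system/apply-host-config.service.d
cat > /etc/systemd/system/apply-host-config.service.d/10-ordering.conf <<'EOF'
[Unit]
# After= orders this unit behind whichever of the two runs; it does not pull
# either of them in, and it is harmless if one of them never starts.
After=start-cluster-installation.service
After=agent-add-node.service
EOF
systemctl daemon-reload
```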
Force-pushed from 09f0bf4 to 7a30df7
Don't attempt to start agent-register-infraenv or apply-host-config unless either start-cluster-installation or agent-add-node is enabled. This prevents any races that could cause these units to start early and complain about dependencies having failed.
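One way such a guard could look in a shell script (hypothetical; it only illustrates the check described above, not the PR's exact code):

```bash
# Only start the follow-on services once one of the two flows is enabled.
if systemctl is-enabled --quiet start-cluster-installation.service ||
   systemctl is-enabled --quiet agent-add-node.service; then
    systemctl start agent-register-infraenv.service apply-host-config.service
fi
```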
|
Enabling a service that is wanted by the current target on the fly is not sufficient to get systemd to start it, because it does not regenerate its run queue when a service is enabled (see systemd/systemd#23034 (comment)). The service must be explicitly started, which will also start dependencies if they are configured correctly. In future it might be tidier to have an |
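For illustration, the effect can be checked by inspecting the unit's dependency tree and then starting it explicitly (assuming the unit names used in this PR):

```bash
# Show what starting the unit will pull in.
systemctl list-dependencies --no-pager start-cluster-installation.service

# Explicitly start it; correctly configured Wants=/Requires= dependencies
# (e.g. agent-register-cluster.service) are started as part of the same job.
systemctl start start-cluster-installation.service
systemctl is-active start-cluster-installation.service agent-register-cluster.service
```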
|
/retest |
|
/cc @andfasano |
|
@zaneb: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
rwsu left a comment
I verified the appliance workflow is working again and the day-2 add nodes workflow is good with this patch.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: rwsu. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@zaneb: Jira Issue OCPBUGS-32519: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-32519 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in accepted release 4.16.0-0.nightly-2024-04-26-145258 |
|
Fix included in accepted release 4.16.0-0.nightly-2024-04-29-154406 |
The agent-register-cluster systemd service has been disabled in the unconfigured ignition since #8093, which broke the appliance CI jobs.