Skip to content

Conversation

@cgwalters
Copy link
Member

This is the mirror/dual of
#1802

With this the MCD (when run as a pod) stops using machine-config-daemon-host.service.
and creates a dynamic unit instead.

With the combination of both, machine-config-daemon-host.service is on the
path to not being used by default and migrating to a "4.1 bootimage aid".

The systemd-run model of creating a unit dynamically is much clearer for what we want here;
conceptually the service is just a dynamic child of this pod (if we could we'd
tie the lifecycle together). Further:

  • Let's shorten our systemd unit names by using the mco- prefix
  • Inject the RPMOSTREE_CLIENT_ID, see coreos/rpm-ostree@016c1c5

For example, one weird semantic of systemctl start is that it "joins" if
somehow the service is started for another reason. But here, if somehow
two instances of the MCD were running then systemd-run will say e.g.:
Failed to start transient service unit: Unit mco-pivot.service already exists.

This is the mirror/dual of
openshift#1802

With this the MCD (when run as a pod) stops using `machine-config-daemon-host.service`.
and creates a dynamic unit instead.

With the combination of both, `machine-config-daemon-host.service` is on the
path to not being used by default and migrating to a "4.1 bootimage aid".

The systemd-run model of creating a unit dynamically is much clearer for what we want here;
conceptually the service is just a dynamic child of this pod (if we could we'd
tie the lifecycle together).  Further:

- Let's shorten our systemd unit names by using the `mco-` prefix
- Inject the `RPMOSTREE_CLIENT_ID`, see coreos/rpm-ostree@016c1c5

For example, one weird semantic of `systemctl start` is that it "joins" if
somehow the service is started for another reason.  But here, if somehow
two instances of the MCD were running then `systemd-run` will say e.g.:
`Failed to start transient service unit: Unit mco-pivot.service already exists.`
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2020
@cgwalters
Copy link
Member Author

Another reason to do this is that the current
Before=kubelet.service in machine-config-daemon-host.service is just wrong when we're running as part of the cluster. systemd doesn't complain but still - we want to clearly separate the flow of firstboot vs node updates while having them share code. So basically, separate systemd units running the same binaries.

@cgwalters
Copy link
Member Author

/hold
I sent this up as a cleanup, and I want to have CI test it, but let's focus on #1804

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 9, 2020
@cgwalters
Copy link
Member Author

Looking at the logs for this one...we have no coverage of this in current CI 😢 (the upgrade test is testing from an older revision, so we won't see the new logs).

// tie the lifecycle together). Further, let's shorten our systemd unit names
// by using the mco- prefix, and we also inject the RPMOSTREE_CLIENT_ID now.
err := exec.Command("systemd-run", "--wait", "--collect", "--unit=mco-pivot",
"-E", "RPMOSTREE_CLIENT_ID=mco", "/usr/libexec/machine-config-daemon", "pivot", osImageURL).Run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, so we are directly calling m-c-d pivot here which means in case of an update pivot will run systemctl reboot (https://github.com/openshift/machine-config-operator/blob/master/cmd/machine-config-daemon/pivot.go#L253) . Can this cause system to reboot without executing remaining control flow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original theory that somehow the pivot code was rebooting turned out to be wrong. That code only reboots if a CLI argument is passed or if a stamp file exists, but just above in this function we unlink it to be sure:

	// This is written by code injected by the MCS, but we always want the MCD to be in control of reboots
	if err := os.Remove("/run/pivot/reboot-needed"); err != nil && !os.IsNotExist(err) {
		return errors.Wrap(err, "deleting pivot reboot-needed file")
	}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think we could just remove that pivot reboot code now too. Nothing should be using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it won't have /run/pivot/reboot-needed file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, what is the use of passing osImageURL to m-c-d pivot, does it get processed?

func followPivotJournalLogs(stopCh <-chan time.Time) {
cmd := exec.Command("journalctl", "-f", "-b", "-o", "cat",
"-u", "rpm-ostreed",
"-u", "pivot",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are removing pivot service, does it make sense to remove from logging?

"-u", "rpm-ostreed",
"-u", "pivot",
"-u", "mco-pivot",
"-u", "machine-config-daemon-host")
Copy link
Contributor

@sinnykumari sinnykumari Jun 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this since mco-pivot replaces machine-config-daemon-host service

@sinnykumari
Copy link
Contributor

Since we are directly calling machine-config-daemon pivot, I think m-c-d-host service is never going to run. Does that mean we can remove machine-config-daemon-host.service as well as part of this PR?

@cgwalters
Copy link
Member Author

Since we are directly calling machine-config-daemon pivot, I think m-c-d-host service is never going to run. Does that mean we can remove machine-config-daemon-host.service as well as part of this PR?

Not without landing #1802 first right?

I'm thinking that we should keep all of these "cleanup firstboot" PRs on hold until we land #1766 - or we can pile them all together there so that way we get CI coverage too.

@ashcrow ashcrow requested review from runcom and removed request for ashcrow June 10, 2020 16:09
@sinnykumari
Copy link
Contributor

Since we are directly calling machine-config-daemon pivot, I think m-c-d-host service is never going to run. Does that mean we can remove machine-config-daemon-host.service as well as part of this PR?

Not without landing #1802 first right?

ummm, it should be ideally optional because right now RunPivot() is called during both firstboot and subsequent boot.

I'm thinking that we should keep all of these "cleanup firstboot" PRs on hold until we land #1766 - or we can pile them all together there so that way we get CI coverage too.

I agree with this, will be nice to club all related commits in a single PR and get them well tested together.

@cgwalters
Copy link
Member Author

Closing in favor of piling onto doing this in "one shot" with #1766
(And thanks Sinny for reviewing the code here!)

@cgwalters cgwalters closed this Jun 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants