Skip to content

Conversation

@cgwalters
Copy link
Member

@cgwalters cgwalters commented Jun 9, 2020

I haven't verified this yet but I think this is the core fix
for https://bugzilla.redhat.com/show_bug.cgi?id=1842906

Basically the MCS injects both /etc/pivot/image-pullspec which
4.1 bootimages need, and we get /etc/ignition-machine-config-encapsulated.json
which starts machine-config-daemon-firstboot.service.

We should only have one running. I think we have a setup like this:

4.1: Uses pivot.service which lives on the host already
4.2: Uses machine-config-daemon-firstboot-v42.service from
#1706
≥ 4.3: Uses machine-config-daemon-firstboot.service

So let's stop starting this by default.

I haven't verified this yet but I think this is the core fix
for https://bugzilla.redhat.com/show_bug.cgi?id=1842906

Basically the MCS injects both `/etc/pivot/image-pullspec` which
4.1 bootimages need, *and* we get `/etc/ignition-machine-config-encapsulated.json`
which starts `machine-config-daemon-firstboot.service`.

We should only have one running.  I think we have a setup like this:

4.1: Uses `pivot.service` which lives on the host already
4.2: Uses `machine-config-daemon-firstboot-v42.service` from
     openshift#1706
≥4.3: Uses `machine-config-daemon-firstboot.service`

So let's stop starting this by default.
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2020
@cgwalters
Copy link
Member Author

I'll also say here I am hopeful for #1766 - basically rather than creating a mess of systemd units written by Ignition with tricky conditionals ("programming with systemd units") and where we exec a separate copy of the MCD, we dynamically inject the same code and run it e.g. as a systemd generator and have it generate units at runtime.

That way e.g. we can do things like "detect if this is a 4.1 or 4.2 or 4.3 bootimage" at early boot on the node, rather than using ConditionPathExists=/sysroot/.coreos-aleph-version.json and things like that.

@runcom
Copy link
Member

runcom commented Jun 9, 2020

Let’s get this in after tests pass I think, I would be highly surprised if we were relying on this service always starting and this effectively helps us narrow down or fix the issue we’re seeing

/approve
/lgtm

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2020
@cgwalters
Copy link
Member Author

cgwalters commented Jun 10, 2020

e2e-aws is https://bugzilla.redhat.com/show_bug.cgi?id=1835371
e2e-gcp-op failure is https://bugzilla.redhat.com/show_bug.cgi?id=1827712
(And we're still on track to land that fix)

(Also, thanks to whoever created https://search.svc.ci.openshift.org/ !)

Didn't bother to look at scaleup since it can't be related.

/retest

@cgwalters
Copy link
Member Author

OK now to be honest I'm super confused because in current 4.4 (before this PR)

[root@osiris-s625c-master-0 ~]# ls -al /etc/systemd/system/multi-user.target.wants/machine-config-daemon-host.service
ls: cannot access '/etc/systemd/system/multi-user.target.wants/machine-config-daemon-host.service': No such file or directory
[root@osiris-s625c-master-0 ~]# systemctl is-enabled machine-config-daemon-host
disabled
[root@osiris-s625c-master-0 ~]# 

Yet...we have that [Install] the same as the others so it should be there...

But, it turned out https://bugzilla.redhat.com/show_bug.cgi?id=1842906 was something entirely different so we can chase this down at our leisure...

ExecStart=/usr/libexec/machine-config-daemon pivot
[Install]
WantedBy=multi-user.target
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to remove this as it may not adding up in this service.

Speaking out loud:
From the documentation https://www.freedesktop.org/software/systemd/man/systemd.unit.html , it says WantedBy= option may be used more than once, or a space-separated list of unit names may be given. A symbolic link is created in the .wants/ or .requires/ directory of each of the listed units when this unit is installed by systemctl enable. By default m-c-d-host.service is in disabled state, so I hope this wasn't causing any issue so far.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhhhh I had completely missed that we had enabled: false. Thanks! That's indeed why the services aren't clashing.

@runcom
Copy link
Member

runcom commented Jun 10, 2020

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 09a5ebc link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-aws 09a5ebc link /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters
Copy link
Member Author

Based on #1804 (comment) this is just a cosmetic fix. Eh. Probably not worth the CI resources to land this, we'll rework the whole thing more later.

@cgwalters cgwalters closed this Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants