Conversation

@ashcrow (Member) commented Apr 30, 2020

Backport of #1706

This is aiming to fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1829642
AKA
#1215 (comment)

Basically we have our systemd units dynamically differentiate between
"4.2" and "4.3 or above" by looking at the aleph version.

@ashcrow ashcrow added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Apr 30, 2020
@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Apr 30, 2020
@openshift-ci-robot (Contributor)

@ashcrow: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.


In response to this:

templates: Add a special machine-config-daemon-firstboot-v42.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2020
@kikisdeliveryservice kikisdeliveryservice changed the title templates: Add a special machine-config-daemon-firstboot-v42.service Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service Apr 30, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2020
@openshift-ci-robot (Contributor)

@ashcrow: This pull request references Bugzilla bug 1830102, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "4.4.z" instead
  • expected dependent Bugzilla bug 1829642 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is POST instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrunalp mrunalp added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Apr 30, 2020
@ashcrow ashcrow added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2020
@kikisdeliveryservice (Contributor)

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 30, 2020
@openshift-ci-robot (Contributor)

@kikisdeliveryservice: This pull request references Bugzilla bug 1830102, which is invalid:

  • expected dependent Bugzilla bug 1829642 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is POST instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice kikisdeliveryservice added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2020
@kikisdeliveryservice (Contributor)

Fixed the underlying BZ to target 4.4.0 and re-added bugzilla/valid-bug.

@kikisdeliveryservice (Contributor)

Failures seem a bit flaky (timeouts, EBS volumes that can't be deleted); looking into it, but also:

/test e2e-aws

@kikisdeliveryservice kikisdeliveryservice changed the title Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service [release-4.4] Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service May 1, 2020
@jianzhangbjz (Member)

/retest

@kikisdeliveryservice (Contributor)

This e2e-aws test is definitely flaking.

@kikisdeliveryservice (Contributor)

It seems to be regularly hitting: https://bugzilla.redhat.com/show_bug.cgi?id=1829241

@sinnykumari (Contributor)

Tested this PR with the following steps, and the scaled-up node came up:

  1. Launched an OCP 4.2 cluster
  2. Created an OCP 4.4 payload with this PR ([release-4.4] Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service #1707) included
  3. Upgraded the 4.2 cluster to the OCP payload created in step 2
  4. Scaled up a machine set to create one more worker node (see the sketch after this list)
  5. Scale-up worked fine and the new worker node joined the cluster
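For reference, the scale-up in step 4 can be done with standard oc commands; <machineset-name> below is a placeholder, not a name from this test:

# List the worker machinesets, then bump the replica count by one.
oc -n openshift-machine-api get machinesets
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=2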

Reading through the journal log from one of the scaled-up worker nodes:

  1. To easily tell which log is from the first boot:
sh-4.4# last reboot
reboot   system boot  4.18.0-193.el8.x Fri May  1 07:47   still running
reboot   system boot  4.18.0-80.11.2.e Fri May  1 07:45 - 07:47  (00:02)
  2. Shows that the upgrade went directly from 4.2 to 4.4:
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:ee887a93b1b274ad486e3044d0701ec0fe1310d5263aa612bb2d0bf7871618b6
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202004281956-0 (2020-04-28T20:01:58Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19813cabd2ab94c16eb1f9485270b6ab3f5bfc8c182361e744f02b804b4f3669
              CustomOrigin: Managed by machine-config-operator
                   Version: 42.81.20200420.0 (2020-04-20T05:00:25Z)
  3. Logs of the services relevant to this issue. Only machine-config-daemon-firstboot-v42.service ran, which is what we want (the post-reboot "Dependency failed" lines are explained after these logs):
sh-4.4# journalctl -u machine-config-daemon-host.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 07:55:35 UTC. --
-- No entries --

sh-4.4# journalctl -u machine-config-daemon-firstboot.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 07:56:14 UTC. --
-- No entries --

sh-4.4# journalctl -u machine-config-daemon-firstboot-v42.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 07:55:48 UTC. --
May 01 07:45:27 ip-10-0-156-191.ec2.internal systemd[1]: Starting Machine Config Daemon Firstboot (4.2 bootimage)...
May 01 07:45:32 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:45:32.093996    1255 rpm-ostree.go:356] Running captured: rpm-ostree status --json
May 01 07:45:32 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:45:32.187637    1255 rpm-ostree.go:152] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cc71fbd134f063d9fc0ccc78933b89c8dd2b1418b7a7b85bb70de87bc>
May 01 07:45:32 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:45:32.188406    1255 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:ee887a93b1b27>
May 01 07:45:39 ip-10-0-156-191 podman[1427]: 2020-05-01 07:45:39.258760312 +0000 UTC m=+2.604446879 system refresh
May 01 07:46:01 ip-10-0-156-191 podman[1427]: 2020-05-01 07:46:01.293168992 +0000 UTC m=+24.638855608 image pull  
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: d8fb1b06991285f988dee36f2bf89b1e20ebd8f2ee593000c097298742a099e6
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.306952    1255 rpm-ostree.go:356] Running captured: podman inspect --type=image registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:ee887a93b1b274ad486e304>
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.516940    1255 rpm-ostree.go:356] Running captured: podman create --net=none --name ostree-container-pivot registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha>
May 01 07:46:01 ip-10-0-156-191 podman[1527]: 2020-05-01 07:46:01.750859421 +0000 UTC m=+0.210134825 container create 7136728a5e0dc558e2b1cc0488cb70ab93b2c3ac5fd396dc9e48e78534d6ad57 (image=registry.svc.ci.openshift.org/ci-ln-x2x0vm2/sta>
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.768963    1255 rpm-ostree.go:356] Running captured: podman mount 7136728a5e0dc558e2b1cc0488cb70ab93b2c3ac5fd396dc9e48e78534d6ad57
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.833957    1255 rpm-ostree.go:234] Pivoting to: 44.81.202004281956-0 (e09f8755c9c7a5b95cd884610132af062b264e7c62dd66c8924f1c79acd7394b)
May 01 07:46:50 ip-10-0-156-191 podman[1600]: 2020-05-01 07:46:50.952999305 +0000 UTC m=+0.057878883 container remove 7136728a5e0dc558e2b1cc0488cb70ab93b2c3ac5fd396dc9e48e78534d6ad57 (image=registry.svc.ci.openshift.org/ci-ln-x2x0vm2/sta>
May 01 07:46:51 ip-10-0-156-191 podman[1612]: 2020-05-01 07:46:51.496406572 +0000 UTC m=+0.504185406 image remove d8fb1b06991285f988dee36f2bf89b1e20ebd8f2ee593000c097298742a099e6 registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:>
May 01 07:46:51 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:51.519226    1255 pivot.go:247] Rebooting due to /run/pivot/reboot-needed
May 01 07:46:51 ip-10-0-156-191 systemd[1]: Stopped Machine Config Daemon Firstboot (4.2 bootimage).
May 01 07:46:51 ip-10-0-156-191 systemd[1]: machine-config-daemon-firstboot-v42.service: Consumed 37.975s CPU time
-- Reboot --
May 01 07:47:53 localhost systemd[1]: machine-config-daemon-firstboot-v42.service: Bound to unit ignition-firstboot-complete.service, but unit isn't active.
May 01 07:47:53 localhost systemd[1]: Dependency failed for Machine Config Daemon Firstboot (4.2 bootimage).
May 01 07:47:53 localhost systemd[1]: machine-config-daemon-firstboot-v42.service: Job machine-config-daemon-firstboot-v42.service/start failed with result 'dependency'.
  4. crio.service still seems to be failing during firstboot:
sh-4.4# journalctl -u crio.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 08:24:04 UTC. --
May 01 07:45:35 ip-10-0-156-191 systemd[1]: Starting Open Container Initiative Daemon...
May 01 07:45:35 ip-10-0-156-191 crio[1444]: time="2020-05-01 07:45:35.294078226Z" level=fatal msg="config validation: invalid runtime_path for runtime 'runc': "stat : no such file or directory""
May 01 07:45:35 ip-10-0-156-191 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
May 01 07:45:35 ip-10-0-156-191 systemd[1]: crio.service: Failed with result 'exit-code'.
May 01 07:45:35 ip-10-0-156-191 systemd[1]: Failed to start Open Container Initiative Daemon.
May 01 07:45:35 ip-10-0-156-191 systemd[1]: crio.service: Consumed 91ms CPU time
-- Reboot --
May 01 07:47:55 ip-10-0-156-191 systemd[1]: Starting Open Container Initiative Daemon...
May 01 07:47:55 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:55.467526814Z" level=info msg="using conmon executable \"/usr/libexec/crio/conmon\""
May 01 07:47:55 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:55.470391978Z" level=info msg="Update default CNI network name to "
May 01 07:47:55 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:55.537900371Z" level=info msg="no seccomp profile specified, using the internal default"
May 01 07:47:55 ip-10-0-156-191 systemd[1]: Started Open Container Initiative Daemon.
May 01 07:47:56 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:56.605135278Z" level=warning msg="imageStatus: can't find k8s.gcr.io/pause:3.1" id=7eb784bb-9f64-45c7-8b6a-94b7f75a878f
May 01 07:48:09 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:48:09.710930543Z" level=info msg="attempting to run pod sandbox with infra container: openshift-cluster-node-tuning-operator/tuned-vdds7/POD" id=9573cb42-ef4c-4138-bdbb-9841>
May 01 07:48:09 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:48:09.733625238Z" level=info msg="attempting to run pod sandbox with infra container: openshift-monitoring/node-exporter-jlkv8/POD" id=2bd0d01a-cd35-4007-a967-e62355b0470e
May 01 07:48:09 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:48:09.784773416Z" level=info msg="attempting to run pod sandbox with infra container: openshift-multus/multus-xrvtm/POD" id=6d8e4f68-768a-4e42-95c4-d5befac97d79
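
A note on the "Dependency failed" lines in item 3: that is standard systemd behavior for a unit bound to the Ignition firstboot unit, which is only active on the very first boot. A hypothetical [Unit] excerpt that would produce exactly those messages (a sketch for illustration, not the unit's actual contents):

# Hypothetical excerpt; illustrates the skip-on-later-boots behavior.
[Unit]
Description=Machine Config Daemon Firstboot (4.2 bootimage)
# BindsTo= makes startup fail with "Dependency failed" whenever
# ignition-firstboot-complete.service is not active, i.e. on every
# boot after the first one.
BindsTo=ignition-firstboot-complete.service
After=ignition-firstboot-complete.service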

@runcom (Member) commented May 1, 2020

crio.service still seems to be failing during firstboot

Is it actually working fine after reboot? From the logs it says started, so I'm assuming it works after the reboot; can we basically check that the node is healthy?

@sinnykumari (Contributor)

crio.service still seems to be failing during firstboot

Is it actually working fine after reboot? From the logs it says started, so I'm assuming it works after the reboot; can we basically check that the node is healthy?

Yes, crio is running fine after reboot. My concern was whether the failure during firstboot impacts anything.

crio status looks good after reboot.

sh-4.4# systemctl status crio.service
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: active (running) since Fri 2020-05-01 07:47:55 UTC; 35min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 1251 (crio)
    Tasks: 18
   Memory: 2.1G
      CPU: 3min 28.866s
   CGroup: /system.slice/crio.service
           └─1251 /usr/bin/crio --enable-metrics=true --metrics-port=9537

May 01 08:23:30 ip-10-0-156-191 crio[1251]: time="2020-05-01 08:23:30.811730102Z" level=info msg="exec'd [/bin/bash -c #!/bin/bash\n/usr/share/openvswitch/scripts/ovs-ctl status > /dev/null &&\n/usr/bin/ovs-appctl -T 5 ofproto/list

The node seems healthy to me: all nodes are in Ready state, and MCO reports all 5 worker nodes in the same pool, which matches the current rendered config for worker.

$ oc get node
NAME                           STATUS   ROLES    AGE    VERSION
ip-10-0-133-3.ec2.internal     Ready    master   102m   v1.17.1
ip-10-0-135-104.ec2.internal   Ready    worker   96m    v1.17.1
ip-10-0-136-196.ec2.internal   Ready    worker   96m    v1.17.1
ip-10-0-143-18.ec2.internal    Ready    master   102m   v1.17.1
ip-10-0-144-94.ec2.internal    Ready    master   102m   v1.17.1
ip-10-0-147-230.ec2.internal   Ready    worker   23m    v1.17.1
ip-10-0-156-191.ec2.internal   Ready    worker   25m    v1.17.1
ip-10-0-158-7.ec2.internal     Ready    worker   96m    v1.17.1
$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
...
rendered-master-2420c5eb28b3cda1898276f4b49d89f4            fac5a23280e6467ec95bb0237fae3ce387d04f0b   2.2.0             114m
rendered-master-36c0a4c50e0a59559bb8c23f604278c3            2e4a79075b751796250e1745df42480989556869   2.2.0             60m
rendered-worker-65b5bc648ff87c0ca04099c9b92268fe            2e4a79075b751796250e1745df42480989556869   2.2.0             60m
rendered-worker-777e5f98f837db367d0dfa587d8cfc88            fac5a23280e6467ec95bb0237fae3ce387d04f0b   2.2.0             114m
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-36c0a4c50e0a59559bb8c23f604278c3   True      False      False      3              3                   3                     0                      114m
worker   rendered-worker-65b5bc648ff87c0ca04099c9b92268fe   True      False      False      5              5                   5                     0                      114m

Did I miss anything to check for a node to be considered healthy?

@runcom (Member) commented May 1, 2020

Alrighty, so I think crio is started no matter what before the pivot, and it correctly fails. After the reboot it's all fine. I think it's OK; the only thing we could do is prevent crio from starting, as we do for kubelet, but I don't see it as a big deal. cc @cgwalters
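
For comparison, gating crio the way kubelet is gated could look something like the drop-in below; the path, filename, and condition are assumptions for illustration, not something the MCO ships today:

# Hypothetical drop-in, e.g. /etc/systemd/system/crio.service.d/10-mcd-firstboot.conf
[Unit]
# Skip crio while the firstboot machine config is still pending; the
# encapsulated config file is consumed and removed once firstboot finishes.
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json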

@cgwalters (Member)

Regarding crio: yeah, that came up on the call, and I also noted it here: #1706 (comment). My thinking is that this is a bug in all branches, one we should fix; I was just trying to do the most minimal/obvious fix. Since this gets us nodes joining the cluster, let's do the other things as a follow-up?

@sinnykumari (Contributor)

Yup, if it doesn't impact the cluster, this fix should be sufficient to move forward!

@runcom (Member) commented May 1, 2020

Not sure what's going on with e2e-aws, though.

@runcom (Member) commented May 1, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom (Member) commented May 1, 2020

/retest

@ashcrow (Member, Author) commented May 1, 2020

/override ci/prow/e2e-aws

Per discussion with @sdodson and the team, we are overriding since this job is flaking on networking.

This supersedes the need for #1700

@openshift-ci-robot (Contributor)

@ashcrow: Overrode contexts on behalf of ashcrow: ci/prow/e2e-aws


In response to this:

/override ci/prow/e2e-aws

Per discussion with @sdodson and team we are overriding as this job is flaking on network.

This supersedes the need for #1700

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

@ashcrow: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1707. Bugzilla bug 1830102 has been moved to the MODIFIED state.


In response to this:

[release-4.4] Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
