Conversation

@ashcrow (Member) commented Apr 30, 2020

Backport of #1706

This is aiming to fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1829642
AKA
#1215 (comment)

Basically we have our systemd units dynamically differentiate between
"4.2" and "4.3 or above" by looking at the aleph version.

@ashcrow ashcrow added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Apr 30, 2020
@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Apr 30, 2020
@openshift-ci-robot (Contributor)

@ashcrow: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.


In response to this:

templates: Add a special machine-config-daemon-firstboot-v42.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2020
@kikisdeliveryservice kikisdeliveryservice changed the title templates: Add a special machine-config-daemon-firstboot-v42.service Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service Apr 30, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2020
@openshift-ci-robot (Contributor)

@ashcrow: This pull request references Bugzilla bug 1830102, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "4.4.z" instead
  • expected dependent Bugzilla bug 1829642 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is POST instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mrunalp mrunalp added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Apr 30, 2020
@ashcrow ashcrow added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2020
@kikisdeliveryservice (Contributor)

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 30, 2020
@openshift-ci-robot (Contributor)

@kikisdeliveryservice: This pull request references Bugzilla bug 1830102, which is invalid:

  • expected dependent Bugzilla bug 1829642 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is POST instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.


In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kikisdeliveryservice kikisdeliveryservice added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 30, 2020
@kikisdeliveryservice (Contributor)

Fixed the underlying BZ to target 4.4.0 and re-added bugzilla/valid-bug.

@kikisdeliveryservice (Contributor)

Failures seem a bit flaky (timeouts, EBS volumes that can't be deleted); looking into it, but also:

/test e2e-aws

@kikisdeliveryservice kikisdeliveryservice changed the title Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service [release-4.4] Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service May 1, 2020
@jianzhangbjz (Member)

/retest

@kikisdeliveryservice (Contributor)

This e2e-aws test is definitely flaking.

@kikisdeliveryservice (Contributor)

It seems to be regularly hitting: https://bugzilla.redhat.com/show_bug.cgi?id=1829241

@sinnykumari (Contributor)

Tested this PR with the following steps, and the scaled-up node came up:

  1. Launched an OCP 4.2 cluster
  2. Created an OCP 4.4 payload with this PR ([release-4.4] Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service #1707) included
  3. Upgraded the 4.2 cluster to the OCP payload created in step 2
  4. Scaled up a machine set to create one more worker node (see the sketch after this list)
  5. Scale-up worked fine and the new worker node joined the cluster
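For reference, the scale-up in step 4 can be done with standard oc commands; <machineset-name> below is a placeholder, not a name from this test:

# List the worker machinesets, then bump the replica count by one.
oc -n openshift-machine-api get machinesets
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=2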

Reading through the journal log from one of the scaled-up worker nodes:

  1. To easily tell which log is from the first boot:
sh-4.4# last reboot
reboot   system boot  4.18.0-193.el8.x Fri May  1 07:47   still running
reboot   system boot  4.18.0-80.11.2.e Fri May  1 07:45 - 07:47  (00:02)
  2. Shows that the upgrade went directly from 4.2 to 4.4:
sh-4.4# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:ee887a93b1b274ad486e3044d0701ec0fe1310d5263aa612bb2d0bf7871618b6
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202004281956-0 (2020-04-28T20:01:58Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19813cabd2ab94c16eb1f9485270b6ab3f5bfc8c182361e744f02b804b4f3669
              CustomOrigin: Managed by machine-config-operator
                   Version: 42.81.20200420.0 (2020-04-20T05:00:25Z)
  3. Logs of the services relevant to this issue. Only machine-config-daemon-firstboot-v42.service ran, which is what we want (the post-reboot "Dependency failed" lines are explained after these logs):
sh-4.4# journalctl -u machine-config-daemon-host.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 07:55:35 UTC. --
-- No entries --

sh-4.4# journalctl -u machine-config-daemon-firstboot.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 07:56:14 UTC. --
-- No entries --

sh-4.4# journalctl -u machine-config-daemon-firstboot-v42.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 07:55:48 UTC. --
May 01 07:45:27 ip-10-0-156-191.ec2.internal systemd[1]: Starting Machine Config Daemon Firstboot (4.2 bootimage)...
May 01 07:45:32 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:45:32.093996    1255 rpm-ostree.go:356] Running captured: rpm-ostree status --json
May 01 07:45:32 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:45:32.187637    1255 rpm-ostree.go:152] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cc71fbd134f063d9fc0ccc78933b89c8dd2b1418b7a7b85bb70de87bc>
May 01 07:45:32 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:45:32.188406    1255 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:ee887a93b1b27>
May 01 07:45:39 ip-10-0-156-191 podman[1427]: 2020-05-01 07:45:39.258760312 +0000 UTC m=+2.604446879 system refresh
May 01 07:46:01 ip-10-0-156-191 podman[1427]: 2020-05-01 07:46:01.293168992 +0000 UTC m=+24.638855608 image pull  
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: d8fb1b06991285f988dee36f2bf89b1e20ebd8f2ee593000c097298742a099e6
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.306952    1255 rpm-ostree.go:356] Running captured: podman inspect --type=image registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:ee887a93b1b274ad486e304>
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.516940    1255 rpm-ostree.go:356] Running captured: podman create --net=none --name ostree-container-pivot registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha>
May 01 07:46:01 ip-10-0-156-191 podman[1527]: 2020-05-01 07:46:01.750859421 +0000 UTC m=+0.210134825 container create 7136728a5e0dc558e2b1cc0488cb70ab93b2c3ac5fd396dc9e48e78534d6ad57 (image=registry.svc.ci.openshift.org/ci-ln-x2x0vm2/sta>
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.768963    1255 rpm-ostree.go:356] Running captured: podman mount 7136728a5e0dc558e2b1cc0488cb70ab93b2c3ac5fd396dc9e48e78534d6ad57
May 01 07:46:01 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:01.833957    1255 rpm-ostree.go:234] Pivoting to: 44.81.202004281956-0 (e09f8755c9c7a5b95cd884610132af062b264e7c62dd66c8924f1c79acd7394b)
May 01 07:46:50 ip-10-0-156-191 podman[1600]: 2020-05-01 07:46:50.952999305 +0000 UTC m=+0.057878883 container remove 7136728a5e0dc558e2b1cc0488cb70ab93b2c3ac5fd396dc9e48e78534d6ad57 (image=registry.svc.ci.openshift.org/ci-ln-x2x0vm2/sta>
May 01 07:46:51 ip-10-0-156-191 podman[1612]: 2020-05-01 07:46:51.496406572 +0000 UTC m=+0.504185406 image remove d8fb1b06991285f988dee36f2bf89b1e20ebd8f2ee593000c097298742a099e6 registry.svc.ci.openshift.org/ci-ln-x2x0vm2/stable@sha256:>
May 01 07:46:51 ip-10-0-156-191 machine-config-daemon[1255]: I0501 07:46:51.519226    1255 pivot.go:247] Rebooting due to /run/pivot/reboot-needed
May 01 07:46:51 ip-10-0-156-191 systemd[1]: Stopped Machine Config Daemon Firstboot (4.2 bootimage).
May 01 07:46:51 ip-10-0-156-191 systemd[1]: machine-config-daemon-firstboot-v42.service: Consumed 37.975s CPU time
-- Reboot --
May 01 07:47:53 localhost systemd[1]: machine-config-daemon-firstboot-v42.service: Bound to unit ignition-firstboot-complete.service, but unit isn't active.
May 01 07:47:53 localhost systemd[1]: Dependency failed for Machine Config Daemon Firstboot (4.2 bootimage).
May 01 07:47:53 localhost systemd[1]: machine-config-daemon-firstboot-v42.service: Job machine-config-daemon-firstboot-v42.service/start failed with result 'dependency'.
  4. crio.service still seems to be failing during firstboot:
sh-4.4# journalctl -u crio.service
-- Logs begin at Fri 2020-05-01 07:44:55 UTC, end at Fri 2020-05-01 08:24:04 UTC. --
May 01 07:45:35 ip-10-0-156-191 systemd[1]: Starting Open Container Initiative Daemon...
May 01 07:45:35 ip-10-0-156-191 crio[1444]: time="2020-05-01 07:45:35.294078226Z" level=fatal msg="config validation: invalid runtime_path for runtime 'runc': "stat : no such file or directory""
May 01 07:45:35 ip-10-0-156-191 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
May 01 07:45:35 ip-10-0-156-191 systemd[1]: crio.service: Failed with result 'exit-code'.
May 01 07:45:35 ip-10-0-156-191 systemd[1]: Failed to start Open Container Initiative Daemon.
May 01 07:45:35 ip-10-0-156-191 systemd[1]: crio.service: Consumed 91ms CPU time
-- Reboot --
May 01 07:47:55 ip-10-0-156-191 systemd[1]: Starting Open Container Initiative Daemon...
May 01 07:47:55 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:55.467526814Z" level=info msg="using conmon executable \"/usr/libexec/crio/conmon\""
May 01 07:47:55 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:55.470391978Z" level=info msg="Update default CNI network name to "
May 01 07:47:55 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:55.537900371Z" level=info msg="no seccomp profile specified, using the internal default"
May 01 07:47:55 ip-10-0-156-191 systemd[1]: Started Open Container Initiative Daemon.
May 01 07:47:56 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:47:56.605135278Z" level=warning msg="imageStatus: can't find k8s.gcr.io/pause:3.1" id=7eb784bb-9f64-45c7-8b6a-94b7f75a878f
May 01 07:48:09 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:48:09.710930543Z" level=info msg="attempting to run pod sandbox with infra container: openshift-cluster-node-tuning-operator/tuned-vdds7/POD" id=9573cb42-ef4c-4138-bdbb-9841>
May 01 07:48:09 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:48:09.733625238Z" level=info msg="attempting to run pod sandbox with infra container: openshift-monitoring/node-exporter-jlkv8/POD" id=2bd0d01a-cd35-4007-a967-e62355b0470e
May 01 07:48:09 ip-10-0-156-191 crio[1251]: time="2020-05-01 07:48:09.784773416Z" level=info msg="attempting to run pod sandbox with infra container: openshift-multus/multus-xrvtm/POD" id=6d8e4f68-768a-4e42-95c4-d5befac97d79
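
A note on the "Dependency failed" lines in item 3: that is standard systemd behavior for a unit bound to the Ignition firstboot unit, which is only active on the very first boot. A hypothetical [Unit] excerpt that would produce exactly those messages (a sketch for illustration, not the unit's actual contents):

# Hypothetical excerpt; illustrates the skip-on-later-boots behavior.
[Unit]
Description=Machine Config Daemon Firstboot (4.2 bootimage)
# BindsTo= makes startup fail with "Dependency failed" whenever
# ignition-firstboot-complete.service is not active, i.e. on every
# boot after the first one.
BindsTo=ignition-firstboot-complete.service
After=ignition-firstboot-complete.service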

@runcom (Member) commented May 1, 2020

crio.service still seems to be failing during firstboot

Is it actually working fine after reboot? From the logs it says started, so I'm assuming it works after the reboot; can we basically check that the node is healthy?

@sinnykumari (Contributor)

crio.service still seems to be failing during firstboot

Is it actually working fine after reboot? From the logs it says started, so I'm assuming it works after the reboot; can we basically check that the node is healthy?

Yes, crio is running fine after reboot. My concern was whether the failure during firstboot impacts anything.

crio status looks good after reboot.

sh-4.4# systemctl status crio.service
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-default-env.conf
   Active: active (running) since Fri 2020-05-01 07:47:55 UTC; 35min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 1251 (crio)
    Tasks: 18
   Memory: 2.1G
      CPU: 3min 28.866s
   CGroup: /system.slice/crio.service
           └─1251 /usr/bin/crio --enable-metrics=true --metrics-port=9537

May 01 08:23:30 ip-10-0-156-191 crio[1251]: time="2020-05-01 08:23:30.811730102Z" level=info msg="exec'd [/bin/bash -c #!/bin/bash\n/usr/share/openvswitch/scripts/ovs-ctl status > /dev/null &&\n/usr/bin/ovs-appctl -T 5 ofproto/list

The node seems healthy to me: all nodes are in Ready state, and MCO reports all 5 worker nodes in the same pool, which matches the current rendered config for worker.

$ oc get node
NAME                           STATUS   ROLES    AGE    VERSION
ip-10-0-133-3.ec2.internal     Ready    master   102m   v1.17.1
ip-10-0-135-104.ec2.internal   Ready    worker   96m    v1.17.1
ip-10-0-136-196.ec2.internal   Ready    worker   96m    v1.17.1
ip-10-0-143-18.ec2.internal    Ready    master   102m   v1.17.1
ip-10-0-144-94.ec2.internal    Ready    master   102m   v1.17.1
ip-10-0-147-230.ec2.internal   Ready    worker   23m    v1.17.1
ip-10-0-156-191.ec2.internal   Ready    worker   25m    v1.17.1
ip-10-0-158-7.ec2.internal     Ready    worker   96m    v1.17.1
$ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
...
rendered-master-2420c5eb28b3cda1898276f4b49d89f4            fac5a23280e6467ec95bb0237fae3ce387d04f0b   2.2.0             114m
rendered-master-36c0a4c50e0a59559bb8c23f604278c3            2e4a79075b751796250e1745df42480989556869   2.2.0             60m
rendered-worker-65b5bc648ff87c0ca04099c9b92268fe            2e4a79075b751796250e1745df42480989556869   2.2.0             60m
rendered-worker-777e5f98f837db367d0dfa587d8cfc88            fac5a23280e6467ec95bb0237fae3ce387d04f0b   2.2.0             114m
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-36c0a4c50e0a59559bb8c23f604278c3   True      False      False      3              3                   3                     0                      114m
worker   rendered-worker-65b5bc648ff87c0ca04099c9b92268fe   True      False      False      5              5                   5                     0                      114m

Did I miss anything to check for a node to be considered healthy?

@runcom (Member) commented May 1, 2020

Alrighty, so I think crio is started no matter what before the pivot, and it correctly fails. After the reboot it's all fine. I think it's OK; the only thing we could do is prevent crio from starting, as we do for kubelet, but I don't see it as a big deal. cc @cgwalters
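
For comparison, gating crio the way kubelet is gated could look something like the drop-in below; the path, filename, and condition are assumptions for illustration, not something the MCO ships today:

# Hypothetical drop-in, e.g. /etc/systemd/system/crio.service.d/10-mcd-firstboot.conf
[Unit]
# Skip crio while the firstboot machine config is still pending; the
# encapsulated config file is consumed and removed once firstboot finishes.
ConditionPathExists=!/etc/ignition-machine-config-encapsulated.json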

@cgwalters (Member)

Regarding crio: yeah, that came up on the call, and I also noted it here: #1706 (comment). My thinking is that this is a bug in all branches, one we should fix; I was just trying to do the most minimal/obvious fix. Since this gets us nodes joining the cluster, let's do the other things as a follow-up?

@sinnykumari (Contributor)

Yup, if it doesn't impact the cluster, this fix should be sufficient to move forward!

@runcom (Member) commented May 1, 2020

Not sure what's going on with e2e-aws, though.

@runcom (Member) commented May 1, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom (Member) commented May 1, 2020

/retest

@ashcrow (Member, Author) commented May 1, 2020

/override ci/prow/e2e-aws

Per discussion with @sdodson and the team, we are overriding since this job is flaking on networking.

This supersedes the need for #1700

@openshift-ci-robot (Contributor)

@ashcrow: Overrode contexts on behalf of ashcrow: ci/prow/e2e-aws


In response to this:

/override ci/prow/e2e-aws

Per discussion with @sdodson and team we are overriding as this job is flaking on network.

This supersedes the need for #1700

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

@ashcrow: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1707. Bugzilla bug 1830102 has been moved to the MODIFIED state.


In response to this:

[release-4.4] Bug 1830102: templates: Add a special machine-config-daemon-firstboot-v42.service

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
