
Conversation

@yuqi-zhang
Contributor

@yuqi-zhang yuqi-zhang commented Nov 26, 2020

Adds to/supersedes #2254

This is the overall PR for the reboot epic work, which aims to allow "rebootless updates" when certain changes happen.

Currently supported:
No action (drain, update files, uncordon)

  • ssh keys
  • pull secret

Reload crio

  • registries.conf changes

Reboot

  • all other scenarios

Also add corresponding unit and e2e tests

Today the MCO defaults to rebooting the node in order to successfully
apply changes. This is the safest way to apply a change on the
node, since we don't have to reason about which changes are safe
to apply without a reboot.

In certain environments rebooting is expensive, and with complex
hardware setups a reboot can occasionally fail, leaving the node unable to boot.

This change avoids rebooting nodes in those cases where the
MCO knows it is safe to skip the reboot.

See - openshift/enhancements#159
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2020
Add functionality to calculate the set of diffs between the existing
and new rendered config when an update happens, and take action
based on the diff. Currently supported actions are:

reboot
 - default behaviour
reload crio
 - when registries.conf is changed
none
 - when pull secret or ssh key is changed

Also rename rebootAction to postConfigChangeActionNone, although
that name is up for debate.
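The diff→action mapping described in the commit message above can be sketched roughly as follows. This is a simplified, hypothetical illustration: the type name configDiff, the action constants, and the function name are ours, not the MCO's actual identifiers.

```go
package main

import "fmt"

// Illustrative action names modeled on the PR's postConfigChangeAction* naming.
const (
	actionNone       = "postConfigChangeActionNone"
	actionReloadCrio = "postConfigChangeActionReloadCrio"
	actionReboot     = "postConfigChangeActionReboot"
)

// configDiff is a hypothetical summary of the differences between the
// existing and new rendered configs.
type configDiff struct {
	sshKeysChanged    bool
	pullSecretChanged bool
	changedFiles      []string
	otherChanges      bool // kernel args, OS image, units, etc.
}

// calculateAction picks the least disruptive action that is still safe:
// any unrecognized change falls through to the default, a reboot.
func calculateAction(d configDiff) string {
	if d.otherChanges {
		return actionReboot
	}
	for _, f := range d.changedFiles {
		switch f {
		case "/etc/containers/registries.conf":
			// picked up by reloading crio, no reboot needed
		default:
			return actionReboot
		}
	}
	if len(d.changedFiles) > 0 {
		return actionReloadCrio
	}
	// Only ssh keys and/or pull secret changed: files are written,
	// but no unit reload and no reboot are required.
	return actionNone
}

func main() {
	fmt.Println(calculateAction(configDiff{sshKeysChanged: true}))
	fmt.Println(calculateAction(configDiff{changedFiles: []string{"/etc/containers/registries.conf"}}))
	fmt.Println(calculateAction(configDiff{otherChanges: true}))
}
```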
dn.logSystem("Starting update from %s to %s: %+v", oldConfigName, newConfigName, diff)

// TODO: consider how we should honor "force" flag here. Maybe if force, always reboot?
// TODO: consider if we should not cordon if no action needs to be taken
Contributor

We do multiple things when we call updateFiles() irrespective of which rebootAction we perform, like deleting stale files, writing units from newIgnConfig, etc. Any of these steps can fail. Cordoning and draining ensure that we don't end up with a degraded node that still has running workloads.

Contributor Author

Hmm, ok. My consideration was that a drain on some workloads (e.g. CI) takes quite a while, so skipping it would be a nice value add. And if we were doing an update that does not require a reboot, it was unlikely to break the state of the actual node (at worst it would, e.g., fail a file write, but it would not commit any changes to disk, and the node itself should still be working even as we're degrading the pool).
For safety purposes, though, perhaps we should keep the drain for now and consider improving this in the future?

Contributor

You are right; I am also tempted to skip the drain, which would work fine on the happy path. This is probably an extra precaution to avoid corner-case bugs that bite us later on.
+1 on revisiting the drain optimization later on (probably in 4.8?)

deleted accidentally while splitting checkStateOnFirstRun()
@sinnykumari
Contributor

Tested all scenarios we are handling in this PR, including node scale-up; everything works fine.

@sinnykumari sinnykumari requested a review from runcom November 26, 2020 17:05
@sinnykumari
Contributor

/retest

@yuqi-zhang
Contributor Author

Fixed the above error case and added some initial unit tests. Will expand.

Manual testing looks good so far.

@yuqi-zhang yuqi-zhang force-pushed the selective-reboot branch 2 times, most recently from aa60564 to 8f4a38e, on November 27, 2020 21:17
Add unit tests for different changes in the MachineConfig, to
test if the calculated post config change action is expected.
@sinnykumari
Contributor

Added e2e test

@sinnykumari sinnykumari force-pushed the selective-reboot branch 2 times, most recently from 8886dfe to aa995ba, on November 30, 2020 19:44
Member

What change would result in no action?

Contributor Author

SSH keys + pull secret. I'll add some comments here.

"No action" is perhaps not the right way to frame it: we still cordon, drain, and write files/units/etc. We just don't reload any unit and don't reboot.
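To illustrate the ordering discussed here, a hypothetical sketch of the update flow; the function name and step strings are ours, and the real MCO code is considerably more involved (e.g. after a reboot, uncordon only happens once the node rejoins):

```go
package main

import "fmt"

// stepsFor lists the steps taken for a given post-config-change action.
// Even the "none" action still cordons, drains, and writes files/units;
// only the final step varies.
func stepsFor(action string) []string {
	steps := []string{"cordon", "drain", "update files/units"}
	switch action {
	case "reloadCrio":
		steps = append(steps, "reload crio")
	case "reboot":
		// uncordon happens after the node comes back up, so it is
		// not part of this pre-reboot sequence
		return append(steps, "reboot")
	}
	return append(steps, "uncordon")
}

func main() {
	fmt.Println(stepsFor("none"))
	fmt.Println(stepsFor("reloadCrio"))
}
```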

Contributor

maybe postConfigChangeActionSkipReboot? Though I think none also works.

Contributor Author

I think e.g. reloading crio also "skips reboot", so I defaulted to "none" to mean "no extra action needed"

Contributor

good point, I'm fine with none 👍

Member

One of the massive benefits of systemd over baseline Unix is that by using cgroups, it applies a rigorous structure to the prior total Wild West of Unix processes you could identify by grepping their name or maybe racy PID files.

In particular, this invocation will send SIGHUP to e.g. a process running in a regular user pod if it happens to be named "crio".

What you want instead is systemctl reload crio - systemd is actively tracking the PID that it launched and can send the signal in a race-free way to exactly the correct process.
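From Go, the race-free reload described here would be invoked via systemctl rather than by signaling a PID found by name. A minimal sketch, assuming the helper name reloadUnit (the MCO's actual wrapper differs):

```go
package main

import (
	"fmt"
	"os/exec"
)

// reloadUnit builds a command asking systemd to reload a unit. systemd
// tracks the PID it launched, so the reload signal reaches exactly the
// right process, never a lookalike in a user pod.
func reloadUnit(unit string) *exec.Cmd {
	return exec.Command("systemctl", "reload", unit)
}

func main() {
	cmd := reloadUnit("crio.service")
	fmt.Println(cmd.Args) // the argv that would be executed
	// To actually reload (requires root and a running systemd):
	//   if err := cmd.Run(); err != nil { /* handle failure, emit event */ }
}
```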

Contributor Author

Ah, for some reason I thought crio didn't have it defined. I can confirm that it indeed does the same thing:

# systemctl cat crio.service
...
ExecReload=/bin/kill -s HUP $MAINPID
...

And running a reload:

Nov 30 23:23:17 ip-10-0-131-15 systemd[1]: Reloading Open Container Initiative Daemon.
Nov 30 23:23:17 ip-10-0-131-15 crio[1507]: time="2020-11-30 23:23:17.304012464Z" level=info msg="Reloading configuration"
Nov 30 23:23:17 ip-10-0-131-15 systemd[1]: Reloaded Open Container Initiative Daemon.
Nov 30 23:23:17 ip-10-0-131-15 crio[1507]: time="2020-11-30 23:23:17.309633738Z" level=info msg="Updating config from file /etc/crio/crio.conf.d/0-default"
Nov 30 23:23:17 ip-10-0-131-15 crio[1507]: time="2020-11-30 23:23:17.309845337Z" level=info msg="Updating config from path /etc/crio/crio.conf.d"
Nov 30 23:23:17 ip-10-0-131-15 crio[1507]: time="2020-11-30 23:23:17.310093463Z" level=info msg="Applied new registry configuration: &{Registries:[] UnqualifiedSearchRegistries:[registry.access.redhat.com docker.io]}"
Nov 30 23:23:17 ip-10-0-131-15 crio[1507]: time="2020-11-30 23:23:17.310131904Z" level=info msg="No seccomp profile specified, using the internal default"
Nov 30 23:23:17 ip-10-0-131-15 crio[1507]: time="2020-11-30 23:23:17.310144510Z" level=info msg="Set config seccomp_profile to \"\""

Seems like it's doing what we want. I'll defer to @sinnykumari since she is more knowledgeable on whether that is correct.

Contributor

Ah, I didn't know crio reload already does the same thing; definitely a nicer way to go.
Useful learning and something I will keep in mind :)

@yuqi-zhang
Contributor Author

/retest

uptimeOld, err := strconv.ParseFloat(oldTime, 64) seems to have failed in the run but works locally, trying again just in case

@sinnykumari
Contributor

/retest

uptimeOld, err := strconv.ParseFloat(oldTime, 64) seems to have failed in the run but works locally, trying again just in case

yeah, I don't know why it is failing in CI and works fine when running make test-e2e on a local cluster. This is making debugging difficult. From one of the CI runs where I added a debug message, it printed:

 mcd_test.go:193: DEBUG: Uptime file content: I1130 18:29:05.151521    8238 request.go:645] Throttling request took 1.019506334s, request: GET:https://api.ci-op-fpxq6893-1354f.origin-ci-int-gce.dev.openshift.com:6443/apis/storage.k8s.io/v1beta1?timeout=32s
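The debug output above suggests client-side log lines leaked into the captured uptime content, which would make strconv.ParseFloat fail. A defensive sketch of parsing /proc/uptime-style content; the function name parseUptime is ours, not the test's actual helper:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseUptime extracts the first whitespace-separated field of
// /proc/uptime-style content ("12345.67 45678.90") as seconds of uptime.
// If unexpected text (e.g. a klog throttling message) pollutes the
// captured output, ParseFloat returns an error instead of a bogus value.
func parseUptime(content string) (float64, error) {
	fields := strings.Fields(content)
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty uptime content")
	}
	return strconv.ParseFloat(fields[0], 64)
}

func main() {
	v, err := parseUptime("12345.67 45678.90")
	fmt.Println(v, err)
}
```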

@sinnykumari
Contributor

/retest

@sinnykumari
Contributor

/retest

…esent

Also, added an event for crio service failure and updated the event
reason from Reboot to SkipReboot
@sinnykumari
Contributor

flake due to VpcLimitExceeded
/retest

@sinnykumari
Contributor

Need a final round of review and approval to get it merged.
/assign @cgwalters @runcom

Member

@cgwalters cgwalters left a comment

I gave this a medium-level review and I think it looks good, but there are a lot of subtleties here and it touches code paths for which we don't necessarily have CI coverage.

On the flip side, I think this is pretty safe to land because it will only kick in under very specialized circumstances. We're very unlikely to break any default installs or CI.

That said, I do have one request (I wouldn't call this a blocker but a very nice to have): Let's add a new machineconfiguration.openshift.io/bootedConfig annotation that is set the first time we do a "live" update.

If you look at the recent rpm-ostree livefs work which is quite analogous to this, if you type rpm-ostree status you can clearly see that a "live" update was applied, as distinct from the booted commit.

Now currently nothing in rpm-ostree tries to capture the full "live history", and I don't think we need to do that here either. The goal is that it should be immediately obvious looking at a node if it's in a "live applied" state. That will help us debug problems in the future.

To clarify...anyone else should feel free to add a LGTM as is, but hopefully you agree with the above and we can try to add a separate bootedConfig as a followup soon.

@cgwalters
Member

cgwalters commented Dec 2, 2020

If you haven't tried rpm-ostree ex livefs, here's a demo in current Fedora CoreOS. There's a strong analogy between:

  • "booted deployment" == "currentConfig"
  • "staged deployment" == "desiredConfig"

Then when a livefs is applied, there's a new "livefs commit" (which is the same as staged, but note that staged could change without live changing later, so they need to be separately tracked).

walters@toolbox /v/s/w/b/fcos> cosa run
...
[root@cosa-devsh ~]# rpm-ostree install strace
...
Will download: 1 package (1.2 MB)
...
Added:
  strace-5.9-1.fc33.x86_64
Run "systemctl reboot" to start a reboot
[root@cosa-devsh ~]# rpm-ostree status
State: idle
Deployments:
  ostree://fedora:fedora/x86_64/coreos/testing-devel
                   Version: 33.20201119.dev.0 (2020-11-19T22:04:26Z)
                BaseCommit: 6d5b62bd02745cffbc4f3b470196eed8207fffc674c6602c428f632c226e5f2e
              GPGSignature: (unsigned)
                      Diff: 1 added
           LayeredPackages: strace

* ostree://fedora:fedora/x86_64/coreos/testing-devel
                   Version: 33.20201119.dev.0 (2020-11-19T22:04:26Z)
                    Commit: 6d5b62bd02745cffbc4f3b470196eed8207fffc674c6602c428f632c226e5f2e
              GPGSignature: (unsigned)
[root@cosa-devsh ~]# rpm-ostree ex livefs
[root@cosa-devsh ~]# rpm-ostree status
State: idle
Deployments:
  ostree://fedora:fedora/x86_64/coreos/testing-devel
                   Version: 33.20201119.dev.0 (2020-11-19T22:04:26Z)
                BaseCommit: 6d5b62bd02745cffbc4f3b470196eed8207fffc674c6602c428f632c226e5f2e
                    Commit: 0c398101141bfbf6c3661dd0d237622b865e2b85966007960e05a71cc7fcf1df
              GPGSignature: (unsigned)
                      Diff: 1 added
           LayeredPackages: strace

* ostree://fedora:fedora/x86_64/coreos/testing-devel
                   Version: 33.20201119.dev.0 (2020-11-19T22:04:26Z)
              BootedCommit: 6d5b62bd02745cffbc4f3b470196eed8207fffc674c6602c428f632c226e5f2e
                LiveCommit: 0c398101141bfbf6c3661dd0d237622b865e2b85966007960e05a71cc7fcf1df
                  LiveDiff: 1 added
              GPGSignature: (unsigned)
                  Unlocked: transient
[root@cosa-devsh ~]# 

@kikisdeliveryservice
Contributor

I think @cgwalters suggestion could be a good followup for debugability (is that a word?)

@runcom any final thoughts?

// move to desired state without additional validation. We will reboot the node in
// this case regardless of what the MachineConfig diff is.
if _, err := os.Stat(constants.MachineConfigDaemonForceFile); err == nil {
	if err := os.Remove(constants.MachineConfigDaemonForceFile); err != nil {
Contributor Author

So this does slightly change how the forcefile works, but I guess it's probably for the better, since this way it stays on the system until an update happens. Off the top of my head I don't think this should cause a problem, so +1.

@yuqi-zhang
Contributor Author

Ok, I think this is good to go! Let's aim to improve documentation and debuggability in follow-up PRs. Thanks everyone for the reviews!

/retest

@kikisdeliveryservice
Contributor

let's do this.

/lgtm

@cgwalters
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 2, 2020
@kikisdeliveryservice
Contributor

/skip

@kikisdeliveryservice
Contributor

/test e2e-aws

@kikisdeliveryservice
Contributor

/test e2e-aws-serial

1 similar comment
@kikisdeliveryservice
Contributor

/test e2e-aws-serial

@kikisdeliveryservice
Contributor

/test e2e-aws

@openshift-merge-robot
Contributor

openshift-merge-robot commented Dec 3, 2020

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests:

  • ci/prow/okd-e2e-aws (commit e25a613): /test okd-e2e-aws
  • ci/prow/e2e-aws-workers-rhel7 (commit e25a613): /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kikisdeliveryservice
Contributor

/test e2e-aws

@kikisdeliveryservice
Contributor

everything passed! but now... tide is gonna... retest 😐

@runcom
Member

runcom commented Dec 3, 2020

To clarify...anyone else should feel free to add a LGTM as is, but hopefully you agree with the above and we can try to add a separate bootedConfig as a followup soon.

That is indeed, as @kikisdeliveryservice commented, a great thing to have for debuggability - we'll still need must-gather in the cases where we're investigating our usual bugs, but it'll definitely help direct the investigation the right way. We can definitely follow up!

/lgtm

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, runcom, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice,runcom,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 42f98e6 into openshift:master Dec 3, 2020
@sinnykumari
Contributor

hopefully you agree with the above and we can try to add a separate bootedConfig as a followup soon.

Nice and useful suggestion; we will definitely discuss and work on adding it.


Labels

approved, lgtm, team-mco
