daemon: Run once mode #126

ashcrow · 2018-10-15T15:10:04Z

Non RHCOS nodes will need to apply an MC once and exit.

Requires: #139

/cc @aaronlevy @dustymabe @sdodson

Still todo:

- Rebase on daemon: use informers to check for updates #130
- Update remote http(s) onceFrom to request resources from the cluster
- Update local ignition file onceFrom to be able to run without making requests for state from the cluster
- Ensure kubernetes client doesn't attempt connection when there is no cluster

ashcrow · 2018-10-15T15:10:56Z

/cc @sdemos @jlebon @kikisdeliveryservice for visibility as active fellow MCD devs.

ashcrow · 2018-10-16T15:05:10Z

/cc @crawford

@aaronlevy @sdodson

If we can pin down the format of the remote service I think we can actually write a card and execute on this. If at all possible I'd like the remote file to be the same as the ignition config ... OR the same as the CRD. The fewer things we have to parse to execute the same/similar operations the better IMHO.

aaronlevy · 2018-10-16T17:14:20Z

I would lean toward the format being an ignition config (not CRD). My reasoning for this:

We will need to generate raw/complete ignition configs for bootstrap/master/worker to support the eventual "bring your own RHCOS" environments. For example, "I already run a DC and have my own PXE infra - I just need the ignition profiles to use."
The bootstrap ignition config (largest/most complex) is already generated only as raw ignition and served from S3 - so we'd be wrapping it in a CRD just to immediately unwrap it anyway.

If we don't already generate ignition configs for master/worker (and instead just their CRD forms) - we likely should add them as an output format of the installer asset generation phase (cc @abhinavdahiya )

ashcrow · 2018-10-16T17:16:12Z

@aaronlevy the MCD currently consumes MCs (machine config) which wrap an ignition config with more metadata used for defining what version the host should be.

Ref: https://github.com/openshift/machine-config-operator/blob/master/docs/MachineConfiguration.md#machineconfig-definition

I'm alright with ignition being the file we parse if it's a runFrom ... unless there is a use case for doing runFrom on an immutable system (off the top of my head, I can't think of one that isn't a corner case).

aaronlevy · 2018-10-16T17:25:20Z

That's a fair point - and would tip the scales the other direction :)

Thinking a bit more - MCD doesn't really need to do anything special for master or worker profiles - we should technically have an api endpoint for all of those (either real api, or bootstrap api). So I think I was incorrectly jumping to the assumption that it would need a non-api run-once mode for those profiles. Locally, it would just need to execute the bootstrap config.

So would this seem reasonable to everyone:

- installer generates bootstrap machineConfig CR (in addition to raw ignition config)
- MCD can execute a run-once mode either from api endpoint (machineConfig CR) or locally (machineConfig CR)

ashcrow · 2018-10-16T17:54:15Z

> So would this seem reasonable to everyone:
>
> installer generates bootstrap machineConfig CR (in addition to raw ignition config)

👍

> MCD can execute a run-once mode either from api endpoint (machineConfig CR) or locally (machineConfig CR)

That would sound reasonable to me. To reiterate, to make sure I'm understanding properly: either the api endpoint (full URI with scheme) or a full file system path (/something/like/this/$FILENAME) would be passed to --run-from. In either case the same parsing mechanics used when watching CRD endpoints would be utilized; the only difference is that it would parse and execute once, and then exit.

ashcrow · 2018-10-17T16:45:08Z

If my reiteration above is correct I think we have enough to groom a card and do this work.

/cc @aaronlevy @sdodson @crawford

aaronlevy · 2018-10-17T20:00:16Z

That sounds good to me. FWIW - naming of flags and such I have no strong feelings about

ashcrow · 2018-10-17T20:37:39Z

OK cool. I've created a card based on this and we'll start filling functionality in soon.

ashcrow · 2018-10-18T18:54:36Z

@sdemos When you have a few minutes, PTAL at what I have so far before I start wiring up a single process mode.

sdemos · 2018-10-18T19:19:09Z

pkg/daemon/daemon.go

does this mean it has to be an absolute path? is there a way we can support relative paths?

This does mean absolute. We could support relative ... I can add that in my next update.

either way is fine with me, but if it's just absolute we should make sure to be explicit about it. It might be a confusing error if you specify a relative file path and get told that you need to provide a file.

re: Errors, I'm wondering if we need to validate the URI before proceeding or if we think that the errors from the calls that read files/pull configs will be enough?
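One way the up-front validation discussed in this thread could look, as a sketch only: reject anything that is neither an absolute path nor an http(s) URL, and give relative paths their own explicit message. `validateOnceFrom` and the error wording are assumptions for illustration, not code from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"path"
)

// validateOnceFrom is a hypothetical check (not from the PR). It accepts an
// absolute path or an http(s) URL and explains why anything else is rejected.
func validateOnceFrom(value string) error {
	if value == "" {
		return errors.New("onceFrom value is empty")
	}
	if path.IsAbs(value) {
		return nil // absolute file system path
	}
	u, err := url.Parse(value)
	if err != nil {
		return fmt.Errorf("onceFrom value %q is not a valid URL: %w", value, err)
	}
	switch u.Scheme {
	case "http", "https":
		if u.Host == "" {
			return fmt.Errorf("onceFrom URL %q has no host", value)
		}
		return nil
	case "":
		// No scheme and not absolute: this is the relative-path case that
		// would otherwise surface as a confusing "provide a file" error.
		return fmt.Errorf("onceFrom value %q is a relative path; use an absolute path or an http(s) URL", value)
	default:
		return fmt.Errorf("onceFrom scheme %q is not supported; use an absolute path or an http(s) URL", u.Scheme)
	}
}

func main() {
	for _, v := range []string{"/etc/mc.json", "https://example.com/mc.json", "configs/mc.json", "file:///etc/mc.json"} {
		fmt.Printf("%-32s -> %v\n", v, validateOnceFrom(v))
	}
}
```

Validating first keeps the relative-path message explicit instead of relying on whatever error the later file read or HTTP request happens to return.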

sdemos · 2018-10-18T19:21:33Z

pkg/daemon/daemon.go

obviously you haven't started on this work yet, but it might be worth it to refactor the rest of the daemon a little so the main loop uses the same function internally as whatever is going to get called here so we can feel confident it's the same behavior.

It will take some serious refactoring I'm afraid. The process method utilizes a lot of external calls to get content, update status, etc., not all of which are available in a run once scenario. But agreed. My first attempt will be to try to decouple some of the functions/methods that process calls.

yeah, makes sense. plus there is the work in #130 which is going to change that logic up even further. maybe we can just work on unifying it moving forward.

@sdemos as in encapsulate the logic for runOnce on its own so as to make merging easy, and follow up later to unify them?

yeah, since it would take some significant work to fully unify them today
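The unification idea in this thread, where the main loop and run-once mode call the same function internally, can be sketched roughly as below. Every name here (`machineConfig`, `applier`, `apply`, `runOnce`, `runLoop`) is a hypothetical stand-in and not the daemon's real structure.

```go
package main

import "fmt"

// machineConfig is a hypothetical stand-in for the real MachineConfig type.
type machineConfig struct {
	Name string
}

// applier owns the single code path that applies a config to the host.
type applier struct {
	applied []string // records what was applied, for illustration only
}

// apply is the shared step that both modes go through.
func (a *applier) apply(mc machineConfig) error {
	if mc.Name == "" {
		return fmt.Errorf("machine config has no name")
	}
	a.applied = append(a.applied, mc.Name)
	return nil
}

// runOnce applies exactly one config and returns.
func (a *applier) runOnce(mc machineConfig) error {
	return a.apply(mc)
}

// runLoop applies each config it receives until the channel is closed.
func (a *applier) runLoop(updates <-chan machineConfig) error {
	for mc := range updates {
		if err := a.apply(mc); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	a := &applier{}
	if err := a.runOnce(machineConfig{Name: "bootstrap"}); err != nil {
		fmt.Println("run once failed:", err)
		return
	}
	fmt.Println("applied:", a.applied)
}
```

Because both entry points funnel into `apply`, a behavior change in one mode cannot silently diverge from the other, which is the confidence the review comment is asking for.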

sdemos · 2018-10-18T19:22:09Z

cmd/machine-config-daemon/start.go

docs only mention URI but it can be a straight path too (e.g. not file://)

Good catch. I'll clarify that this can be a path to a file or a URL. URI does make it sound like http://, ftp://, file://, etc. would be supported when really it's http[s]:// and /....

kikisdeliveryservice · 2018-10-18T19:42:25Z

pkg/daemon/daemon.go

Maybe we should add these ioutils to fsclient.go? :)

kikisdeliveryservice · 2018-10-18T19:47:31Z

cmd/machine-config-daemon/start.go

minor typo: "its"

ashcrow · 2018-10-19T13:36:53Z

(not starting a retest yet as this requires another PR which hasn't merged yet due to yesterday's bot outage. It's being re-reviewed now)

ashcrow · 2018-10-19T15:10:59Z

/retest

ashcrow · 2018-10-19T15:18:22Z

Will rebase with #134 later today.

kikisdeliveryservice · 2018-11-07T23:03:47Z

Seems like another flake? :(

abhinavdahiya · 2018-11-08T00:17:04Z

> Seems like another flake? :(

We are not seeing this specific error anywhere else yet

NAME                           STATUS                     ROLES     AGE       VERSION
ip-10-0-12-35.ec2.internal     Ready,SchedulingDisabled   master    1h        v1.11.0+d4cacc0
ip-10-0-137-214.ec2.internal   Ready,SchedulingDisabled   worker    1h        v1.11.0+d4cacc0
ip-10-0-145-104.ec2.internal   Ready,SchedulingDisabled   worker    1h        v1.11.0+d4cacc0
ip-10-0-175-235.ec2.internal   Ready,SchedulingDisabled   worker    1h        v1.11.0+d4cacc0
ip-10-0-23-254.ec2.internal    Ready,SchedulingDisabled   master    1h        v1.11.0+d4cacc0
ip-10-0-44-102.ec2.internal    Ready,SchedulingDisabled   master    1h        v1.11.0+d4cacc0
Waiting for router to be created ...

That makes it seem like it is not a flake.

kikisdeliveryservice · 2018-11-08T01:25:17Z

Ah interesting. Thanks @abhinavdahiya

ashcrow · 2018-11-08T15:16:26Z

Agreed, I don't think this is a flake. It does look like a flake we hit previously BUT @yuqi-zhang and I have found cases where nodes are set degraded when they shouldn't be. We're working on this now.

- prepUpdateFromCluster and executeUpdateFromCluster* pulled out of handleNodeUpdate for reuse
- triggerUpdateWithMachineConfig added for triggering with a provided desired config
- triggerUpdate forwards to triggerUpdateWithMachineConfig(nil)
- executeUpdateFromClusterWithMachineConfig added for updating with a provided desired config
- executeUpdateFromCluster forwards to executeUpdateFromClusterWithMachineConfig(nil)

Signed-off-by: Steve Milner <smilner@redhat.com>
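The forwarding pattern named in this commit message might look roughly like the sketch below. The two method names come from the commit message; the struct, its fields, and the reading of `nil` as "look the desired config up from the cluster" are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// machineConfig is a hypothetical stand-in for the real MachineConfig type.
type machineConfig struct {
	Name string
}

// daemon is a reduced, illustrative struct, not the real Daemon.
type daemon struct {
	// fetchDesired is a hypothetical hook that asks the cluster for the
	// desired config; it is nil when no cluster is available.
	fetchDesired func() (*machineConfig, error)
	lastApplied  string
}

// triggerUpdateWithMachineConfig updates to the provided config. Assumption:
// a nil argument means "fetch the desired config from the cluster".
func (d *daemon) triggerUpdateWithMachineConfig(desired *machineConfig) error {
	if desired == nil {
		if d.fetchDesired == nil {
			return errors.New("no desired config provided and no cluster source configured")
		}
		var err error
		desired, err = d.fetchDesired()
		if err != nil {
			return err
		}
	}
	d.lastApplied = desired.Name
	return nil
}

// triggerUpdate keeps the old call sites working by forwarding with nil.
func (d *daemon) triggerUpdate() error {
	return d.triggerUpdateWithMachineConfig(nil)
}

func main() {
	d := &daemon{fetchDesired: func() (*machineConfig, error) {
		return &machineConfig{Name: "from-cluster"}, nil
	}}
	if err := d.triggerUpdate(); err != nil {
		fmt.Println("update failed:", err)
		return
	}
	fmt.Println("applied:", d.lastApplied)
}
```

Forwarding with `nil` leaves the cluster-driven callers unchanged while giving run-once mode a way to pass in a config it loaded itself.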

ashcrow · 2018-11-08T19:01:51Z

/hold cancel

e2e now passes
onceFrom a local file verified to work

kikisdeliveryservice · 2018-11-08T20:04:08Z

pkg/daemon/daemon.go

 func (dn *Daemon) Run(stop <-chan struct{}) error {
+	// Catch quickly if we've been asked to run once.
+	if dn.onceFrom != "" {
+		glog.V(2).Info("Running once per request")


Is this logging clear enough in the flow of the logs? Would "daemon running once per request" be clearer? Thoughts?

I like your wording better. How about I rework some of the log strings in a follow up?

Sounds great!

kikisdeliveryservice · 2018-11-08T20:20:21Z

Also, thanks for adding comments in this PR @ashcrow , it makes it very easy to read through! 👍

yuqi-zhang · 2018-11-08T22:03:51Z

Some manual testing passes for me both for cluster operation and runOnce with qemu. Will LGTM after logs are cleaned up.

Thanks for the work @ashcrow !

yuqi-zhang · 2018-11-08T22:05:21Z

/lgtm

openshift-ci-robot · 2018-11-08T22:05:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, sdodson, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~cmd/machine-config-daemon/OWNERS~~ [ashcrow]
~~pkg/daemon/OWNERS~~ [ashcrow]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ashcrow · 2018-11-08T22:05:37Z

Thanks @yuqi-zhang and @kikisdeliveryservice! I'll do the log clean up in another PR post merge.

openshift-ci-robot requested a review from aaronlevy October 15, 2018 15:10

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2018

openshift-ci-robot requested review from dustymabe and sdodson October 15, 2018 15:10

openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 15, 2018

ashcrow force-pushed the run-once branch 2 times, most recently from e03c02e to 4281ce6 Compare October 16, 2018 14:58

openshift-ci-robot requested a review from crawford October 16, 2018 15:05

ashcrow force-pushed the run-once branch from 4281ce6 to fc652c4 Compare October 18, 2018 18:53

sdemos reviewed Oct 18, 2018

kikisdeliveryservice reviewed Oct 18, 2018

ashcrow force-pushed the run-once branch from fc652c4 to 8048700 Compare October 18, 2018 20:32

ashcrow force-pushed the run-once branch from 8048700 to e7a7f92 Compare October 19, 2018 18:33

openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 19, 2018

ashcrow and others added 12 commits November 8, 2018 12:58

daemon: Run once from a local or remote file

950af28

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: More methods for getting MCs

88c4a8b

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: Add runOnce for once-from usage

aa6d056

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: Add ReadAll to FsClient

e390ceb

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: validPath -> ValidPath

fb67e0e

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: Split Daemon constructors

007cb7c

- New: Base instance that works without the cluster. Used in NewClusterDrivenDaemon.
- NewClusterDrivenDaemon: Builds on top of New. Works with cluster resources.

Signed-off-by: Steve Milner <smilner@redhat.com>
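A rough sketch of the constructor split this commit describes, where the cluster-aware constructor builds on the base one. `New` and `NewClusterDrivenDaemon` are the names from the commit message; the parameters, the `nodeGetter` interface, and the struct fields are simplified assumptions and do not reflect the real signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeGetter is a hypothetical stand-in for the cluster client the daemon uses.
type nodeGetter interface {
	NodeExists(name string) bool
}

// Daemon is a reduced, illustrative struct, not the real one.
type Daemon struct {
	name     string
	onceFrom string
	nodes    nodeGetter // nil when running without a cluster
}

// New returns a base instance that does not need the cluster.
func New(name, onceFrom string) (*Daemon, error) {
	if name == "" {
		return nil, errors.New("node name must not be empty")
	}
	return &Daemon{name: name, onceFrom: onceFrom}, nil
}

// NewClusterDrivenDaemon builds on New and attaches cluster resources.
func NewClusterDrivenDaemon(name string, nodes nodeGetter) (*Daemon, error) {
	if nodes == nil {
		return nil, errors.New("cluster client must not be nil")
	}
	dn, err := New(name, "")
	if err != nil {
		return nil, err
	}
	dn.nodes = nodes
	return dn, nil
}

// fakeNodes is a trivial nodeGetter used only for the demonstration.
type fakeNodes struct{}

func (fakeNodes) NodeExists(string) bool { return true }

func main() {
	base, _ := New("node-a", "/etc/mc.json")
	clustered, _ := NewClusterDrivenDaemon("node-b", fakeNodes{})
	fmt.Println(base.nodes == nil, clustered.nodes != nil)
}
```

Keeping all cluster wiring in the second constructor means the run-once path never has to build a client it cannot use.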

mcd/start.go: Restructure for cluster/non-cluster use

c4b9349

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: Start informers after CheckStateOnBoot

7291b61

Split out the informers creation/start into StartInformer. Idea from @yuqi-zhang.

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon/update: Allow reconcile skip

1005730

When we are in runOnce mode AND the previous MachineConfig does not have a Kind we can assume that there was no previous config to check against.

Signed-off-by: Steve Milner <smilner@redhat.com>
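The skip condition in this commit message reduces to a small predicate. The sketch below assumes a simplified `machineConfig` with only a `Kind` field; `canSkipReconcile` is a hypothetical name, not the function the PR adds.

```go
package main

import "fmt"

// machineConfig is a hypothetical stand-in; only Kind matters here.
type machineConfig struct {
	Kind string
}

// canSkipReconcile reports whether the reconcile check can be skipped:
// only in run-once mode, and only when the previous config has no Kind,
// which is taken to mean there was no previous config at all.
func canSkipReconcile(runOnce bool, previous machineConfig) bool {
	return runOnce && previous.Kind == ""
}

func main() {
	fmt.Println(canSkipReconcile(true, machineConfig{}))
	fmt.Println(canSkipReconcile(true, machineConfig{Kind: "MachineConfig"}))
	fmt.Println(canSkipReconcile(false, machineConfig{}))
}
```

Both conditions are required: a cluster-driven daemon with an empty previous config still goes through the normal reconcile check.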

daemon/update: Don't drain the node without cluster

2db59b5

Signed-off-by: Steve Milner <smilner@redhat.com>

daemon: revert to old workflow for informer

d1e7165

Remove StartInformer function, as the creation and start must follow the creation - chroot - check state - start workflow. Modify ClientBuilder creation to use old workflow as well.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>

ashcrow force-pushed the run-once branch from f4304b7 to d1e7165 Compare November 8, 2018 17:59

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2018

kikisdeliveryservice reviewed Nov 8, 2018

openshift-ci-robot assigned yuqi-zhang Nov 8, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 8, 2018

openshift-merge-robot merged commit 393cc44 into openshift:master Nov 8, 2018

ashcrow mentioned this pull request Nov 9, 2018

Misc clean up #162

Merged

osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021

Merge pull request openshift#126 from gabemontero/local-upstream-123-…

f7ae76d

…copy manifests: specify system-cluster-critical priority and update pull p…
