Skip to content

Conversation

@ashcrow
Copy link
Member

@ashcrow ashcrow commented Oct 15, 2018

Non RHCOS nodes will need to apply an MC once and exit.

Requires: #139

/cc @aaronlevy @dustymabe @sdodson

Still todo:

  • Rebase on daemon: use informers to check for updates #130
  • Update remote http(s) onceFrom to request resources from the cluster
  • Update local ignition file onceFrom to be able to run without making requests for state from the cluster
  • Ensure kubernetes client doesn't attempt connection when there is no cluster

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2018
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 15, 2018
@ashcrow
Copy link
Member Author

ashcrow commented Oct 15, 2018

/cc @sdemos @jlebon @kikisdeliveryservice for visibility as active fellow MCD devs.

@ashcrow ashcrow force-pushed the run-once branch 2 times, most recently from e03c02e to 4281ce6 Compare October 16, 2018 14:58
@ashcrow
Copy link
Member Author

ashcrow commented Oct 16, 2018

/cc @crawford

@aaronlevy @sdodson

If we can pin down the format of the remote service I think we can actually write a card and execute on this. If at all possible I'd like the remote file to be the same as the ignition config ... OR the same as the CRD. The fewer things we have to parse to execute the same/similar operations the better IMHO.

@aaronlevy
Copy link

I would lean toward the format being an ignition config (not CRD). My reasoning for this:

  • We will need to generate raw/complete ignition configs for bootstrap/master/worker to support the eventual "bring your own RHCOS" environments. For example, "I already run a DC and have my own PXE infra - I just need the ignition profiles to use.

  • The bootstrap ignition config (largest/most complex) is already generated only as raw ignition and served from S3 - so we'd be wrapping it in a CRD just to immediately unwrap it anyway.

If we don't already generate ignition configs for master/worker (and instead just their CRD forms) - we likely should add them as an output format of the installer asset generation phase (cc @abhinavdahiya )

@ashcrow
Copy link
Member Author

ashcrow commented Oct 16, 2018

@aaronlevy the MCD currently consumes MCs (machine config) which wrap an ignition config with more metadata used for defining what version the host should be.

Ref: https://github.com/openshift/machine-config-operator/blob/master/docs/MachineConfiguration.md#machineconfig-definition

I'm alright with ignition being the file we parse if it's a runFrom ... unless there is a use case for doing runFrom on an immutable system (which I can't think of a non corner case one off the top of my head).

@aaronlevy
Copy link

aaronlevy commented Oct 16, 2018

That's a fair point - and would tip the scales the other direction :)

Thinking a bit more - MCD doesn't need to really do anything special for master or worker profiles - we should technically have an api endpoint for all of those (either real api, or bootstrap api). So I think I was jumping incorrectly to assumption that it would need a non-api run-once mode for those profiles. Locally, it would just need to execute bootstrap config.

So would this seem reasonable to everyone:

  • installer generates bootstrap machineConfig CR (in addition to raw ignition config)
  • MCD can execute a run-once mode either from api endpoint (machineConfig CR) or locally (machineConfig CR)

@ashcrow
Copy link
Member Author

ashcrow commented Oct 16, 2018

So would this seem reasonable to everyone:

  • installer generates bootstrap machineConfig CR (in addition to raw ignition config)

👍

  • MCD can execute a run-once mode either from api endpoint (machineConfig CR) or locally (machineConfig CR)

That would sound reasonable to me. To reiterate to make sure I'm understanding properly, the api endpoint (full URI with scheme) would be passed to --run-from or a full file system path (/something/like/this/$FILENAME) would be passed to --run-from. In either case the same parsing mechanics used when watching CRD endpoints would be utilized, the only difference is it would parse and execute once, and then exit.

@ashcrow
Copy link
Member Author

ashcrow commented Oct 17, 2018

If my reiteration above is correct I think we have enough to groom a card and do this work.

/cc @aaronlevy @sdodson @crawford

@aaronlevy
Copy link

That sounds good to me. FWIW - naming of flags and such I have no strong feelings about

@ashcrow
Copy link
Member Author

ashcrow commented Oct 17, 2018

OK cool. I've created a card based on this and we'll start filling functionality in soon.

@ashcrow
Copy link
Member Author

ashcrow commented Oct 18, 2018

@sdemos When you have a few PTAL at what I have so far before I start wiring up a single process mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean it has to be an absolute path? is there a way we can support relative paths?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does mean absolute. We could support relative ... I can add that in my next update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either way is fine with me, but if it's just absolute we should make sure to be explicit about it, it might be a confusing error if you specify a relative file path and get told that you need to provide a file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: Errors, I'm wondering if we need to validate the URI before proceeding or if we think that the errors from the calls read files/pull configs will be enough?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

obviously you haven't started on this work yet, but it might be worth it to refactor the rest of the daemon a little so the main loop uses the same function internally as whatever is going to get called here so we can feel confident it's the same behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will take some serious refactoring I'm afraid. The process method utilizes a lot of external calls to get content, update status, etc.. all of which are not all available in a run once scenario. But agreed. My first attempt will be to try to decouple some of the functions/methods process calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, makes sense. plus there is the work in #130 which is going to change that logic up even further. maybe we can just work on unifying it moving forward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdemos as in encapsulate the logic for runOnce on it's own as to make merging easy and follow up later to unify them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, since it would take some significant work to fully unify them today

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs only mention URI but it can be a straight path too (e.g. not file://)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I'll clarify that this can be a path to a file or a URL. URI does make it sound like http://, ftp://, file://, etc.. would be supported when really it's http[s]:// and /....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add these ioutils to fsclient.go?:)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor typo: "its"

@ashcrow
Copy link
Member Author

ashcrow commented Oct 19, 2018

(not starting a retest yet as this requires another PR which hasn't merged yet due to yesterdays bot outage. It's being re-reviewed now)

@ashcrow
Copy link
Member Author

ashcrow commented Oct 19, 2018

/retest

@ashcrow
Copy link
Member Author

ashcrow commented Oct 19, 2018

Will rebase with #134 later today.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 19, 2018
@kikisdeliveryservice
Copy link
Contributor

Seems like another flake? :(

@abhinavdahiya
Copy link
Contributor

Seems like another flake? :(

We are not seeing this specific error anywhere else yet

NAME                           STATUS                     ROLES     AGE       VERSION
ip-10-0-12-35.ec2.internal     Ready,SchedulingDisabled   master    1h        v1.11.0+d4cacc0
ip-10-0-137-214.ec2.internal   Ready,SchedulingDisabled   worker    1h        v1.11.0+d4cacc0
ip-10-0-145-104.ec2.internal   Ready,SchedulingDisabled   worker    1h        v1.11.0+d4cacc0
ip-10-0-175-235.ec2.internal   Ready,SchedulingDisabled   worker    1h        v1.11.0+d4cacc0
ip-10-0-23-254.ec2.internal    Ready,SchedulingDisabled   master    1h        v1.11.0+d4cacc0
ip-10-0-44-102.ec2.internal    Ready,SchedulingDisabled   master    1h        v1.11.0+d4cacc0
Waiting for router to be created ...

That makes it seem like it is not a flake.

@kikisdeliveryservice
Copy link
Contributor

Ah interesting. Thanks @abhinavdahiya

@ashcrow
Copy link
Member Author

ashcrow commented Nov 8, 2018

Agreed, I don't think this is a flake. It does look like a flake we hit previously BUT @yuqi-zhang and I have found cases where nodes are set degraded when they shouldn't be. We're working on this now.

ashcrow and others added 12 commits November 8, 2018 12:58
Signed-off-by: Steve Milner <smilner@redhat.com>
Signed-off-by: Steve Milner <smilner@redhat.com>
Signed-off-by: Steve Milner <smilner@redhat.com>
Signed-off-by: Steve Milner <smilner@redhat.com>
- prepUpdateFromCluster and executeUpdateFromCluster* pulled out of handleNodeUpdate for reuse
- triggerUpdateWithMachineConfig added for triggering with a provided desired config
- triggerUpdate forwards to triggerUpdateWithMachineConfig(nil)
- executeUpdateFromClusterWithMachineConfig added for updateing with a provided desired config
- executeUpdateFromCluster forwards to executeUpdateFromClusterWithMachineConfig(nil)

Signed-off-by: Steve Milner <smilner@redhat.com>
Signed-off-by: Steve Milner <smilner@redhat.com>
- New: Base instance that works without the cluster. Used in NewClusterDrivenDaemon.
- NewClusterDrivenDaemon: Builds on top of New. Works with cluster resources.

Signed-off-by: Steve Milner <smilner@redhat.com>
Signed-off-by: Steve Milner <smilner@redhat.com>
Split out the informers creation/start into StartInformer.
Idea from @yuqi-zhang.

Signed-off-by: Steve Milner <smilner@redhat.com>
When we are in runOnce mode AND the previous MachineConfig does
not have a Kind we can assume that there was no previous config
to check against.

Signed-off-by: Steve Milner <smilner@redhat.com>
Signed-off-by: Steve Milner <smilner@redhat.com>
Remove StartInformer function, as the creation and start must
follow the creation - chroot - check state - start workflow.
Modify ClientBuilder creation to use old workflow as well.

Signed-off-by: Yu Qi Zhang <jerzhang@redhat.com>
@ashcrow
Copy link
Member Author

ashcrow commented Nov 8, 2018

/hold cancel

  • e2e now passes
  • onceFrom a local file verified to work

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2018
func (dn *Daemon) Run(stop <-chan struct{}) error {
// Catch quickly if we've been asked to run once.
if dn.onceFrom != "" {
glog.V(2).Info("Running once per request")
Copy link
Contributor

@kikisdeliveryservice kikisdeliveryservice Nov 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this logging clear enough in the flow of the logs? Would "daemon running once per request" be clearer? Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your wording better. How about I rework some of the log strings in a follow up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great!

@kikisdeliveryservice
Copy link
Contributor

Also, thanks for adding comments in this PR @ashcrow , it makes it very easy to read through! 👍

@yuqi-zhang
Copy link
Contributor

Some manual testing passes for me both for cluster operation and runOnce with qemu. Will LGTM after logs are cleaned up.

Thanks for the work @ashcrow !

@yuqi-zhang
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 8, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, sdodson, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ashcrow
Copy link
Member Author

ashcrow commented Nov 8, 2018

Thanks @yuqi-zhang and @kikisdeliveryservice! I'll do the log clean up in another PR post merge.

@openshift-merge-robot openshift-merge-robot merged commit 393cc44 into openshift:master Nov 8, 2018
@ashcrow ashcrow mentioned this pull request Nov 9, 2018
osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021
…copy

manifests: specify system-cluster-critical priority and update pull p…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants