Skip to content

Conversation

@runcom
Copy link
Member

@runcom runcom commented Oct 22, 2019

Mainly to avoid ppl to ship something which could override the MCO files.

The situation can still happen, in which case, the bad MC must be deleted and in order to rollback the maxUnavailable for the pool must be increased by at least 1

Signed-off-by: Antonio Murdaca [email protected]

@runcom runcom added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Oct 22, 2019
@openshift-ci-robot
Copy link
Contributor

@runcom: This pull request references Bugzilla bug 1764001, which is invalid:

  • expected the bug to target the "4.3.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1764001: pkg/daemon: validate on-disk when in desired config

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 22, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2019
@runcom runcom changed the title Bug 1764001: pkg/daemon: validate on-disk when in desired config Bug 1764116: pkg/daemon: validate on-disk when in desired config Oct 22, 2019
@openshift-ci-robot
Copy link
Contributor

@runcom: This pull request references Bugzilla bug 1764116, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Bug 1764116: pkg/daemon: validate on-disk when in desired config

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 22, 2019
@runcom runcom force-pushed the validate-indesired branch from c2260d0 to bf106a8 Compare October 22, 2019 09:45
@runcom
Copy link
Member Author

runcom commented Oct 22, 2019

@cgwalters wondering why this wasn't the case till the beginning :/

@runcom
Copy link
Member Author

runcom commented Oct 22, 2019

@cgwalters wondering why this wasn't the case till the beginning :/

the actual issue here is that the rendered machineconfigs can contain duplicate entries for e.g. a service or unit. If that's the case, the validate routine can validate and fail only the first entry but what we have written on disk is the second one - should we change the validate routine to always check the last entry if there's a duplicate? This PR just makes sure we can rollback but maybe the fix to validation is needed as well. This is related to the common templates fix that went in #1202

@ajeddeloh since a rendered mc can contain duplicates entries, should we validate only against the last one? right now, we're validating and bailing at the first entry, but the MCD applies the last when writing to disk.

The reproducer is as simple as creating this MC:

$ cat testmc.yaml
# This example MachineConfig replaces /etc/chrony.conf
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 50-examplecorp-chrony
spec:
  config:
    ignition:
      version: 2.2.0
    systemd:
      units:
      - name: "crio.service"
        dropins:
        - name: "10-default-env.conf"
          contents: "#foo"

The issue arises since we're already shipping 10-deafault-env.conf for the crio.service (and it's empty by default). So, the MCD applies the above, but when it reboots, it validates against the empty, first entry file but the ondisk one contains #foo now.

EDIT

This is trickier than what I've initilly thought - the situation here is:

  • we ship a dropin file directly as a file
  • we create a dropin systemd unit

When we validate on disk, the first file is checked (empty in our case) but it's written by the dropin systemd

@runcom
Copy link
Member Author

runcom commented Oct 22, 2019

(This PR isn't working as intended also, so keep holding)

@runcom
Copy link
Member Author

runcom commented Oct 22, 2019

maybe we should completely avoid having users override what we ship?

@cgwalters
Copy link
Member

@ajeddeloh since a rendered mc can contain duplicates entries, should we validate only against the last one? right now, we're validating and bailing at the first entry, but the MCD applies the last when writing to disk.

I believe it's a hard error to have duplicate files in Ignition spec 3. We should also indeed disallow this in the MCO I'd say.

But the clear shorter term fix is to change our validation to check the last one for consistency with what we actually write.

@runcom
Copy link
Member Author

runcom commented Oct 22, 2019

I believe it's a hard error to have duplicate files in Ignition spec 3. We should also indeed disallow this in the MCO I'd say.

uhm, so then the MCC has to learn to generate rendered MCs by always using the last entry in alphabetical order? 🤔 or just error out maybe

@cgwalters
Copy link
Member

I think we should error out.

@ajeddeloh
Copy link

@ajeddeloh since a rendered mc can contain duplicates entries, should we validate only against the last one? right now, we're validating and bailing at the first entry, but the MCD applies the last when writing to disk.

I believe it's a hard error to have duplicate files in Ignition spec 3. We should also indeed disallow this in the MCO I'd say.

Correct that duplicates are disallowed in spec 3.0.0+. Spec 2.x is more complicated. The last one listed "wins" but only if overwrite is true. Also they're created in the order: directories, then files, then links. This means a file can never "win" against a link. This is also fixed with spec 3.

@runcom
Copy link
Member Author

runcom commented Oct 22, 2019

The last one listed "wins" but only if overwrite is true

here the scenario is "we write a file at a dropin location" then "a dropin writes on that location again" ouch - does spec v3 disallow this?

@kikisdeliveryservice
Copy link
Contributor

I think we should error out.

Erroring out seems to be the cleanest choice, imo - it's predictable and clear to a user what they need to do going forward.

@ajeddeloh
Copy link

here the scenario is "we write a file at a dropin location" then "a dropin writes on that location again" ouch - does spec v3 disallow this?

Can you clarify? Do you mean a systemd drop in or an appended config?

@runcom
Copy link
Member Author

runcom commented Oct 23, 2019

Can you clarify? Do you mean a systemd drop in or an appended config?

yep, a systemd dropin. This is also not fully related to Ignition but rather, to how the MCD interprets the ignition config. So, in this bug case, we first write a normal file (at the dropin location) and then we overwirte it with a systemd dropin (from igniton pov). So, effectively, the both end up in the same directory causing this whole confusion and bug.

@ajeddeloh
Copy link

ajeddeloh commented Oct 23, 2019

Hmm yeah I think we ought to check for that in Ignition as well actually. Failing seems like the right thing to do since the user is asking for the impossible.

Filed: coreos/ignition#881

@runcom
Copy link
Member Author

runcom commented Oct 25, 2019

But the clear shorter term fix is to change our validation to check the last one for consistency with what we actually write.

I don't believe this is the right thing to do now as a short term hack - the reason here writing a file at a location and writing a dropin config which later writes there. Allowing the validation to pass means that any configuration already shipped (for things like crio and kubelet) can be firstly overridden and secondly skipped from validation.
I believe this a broader issue also, how does someone ship a crio dropin? should we allow that since crio is controlled by the CRC crd? hence it makes sense to just error out when someone provides a configuration that would overwrite anything specified before.

What do you all think?

@cgwalters
Copy link
Member

I believe this a broader issue also, how does someone ship a crio dropin?

I think the problem here is that the user chose the same name for the dropin 10-default-env.conf as we did. Perhaps it'd be clearer if we named our drop-ins like 10-mco-default-env.conf and hopefully users would not include -mco in their names.

@runcom
Copy link
Member Author

runcom commented Oct 25, 2019

and hopefully users would not include -mco in their names.

so if they do instead, I think we need to avoid rendering and communicate that, how does that sound?

@runcom
Copy link
Member Author

runcom commented Oct 25, 2019

I'm updating this PR to move to -mco- for our dropins meanwhile - still doesn't solve the issue tho :(

@runcom
Copy link
Member Author

runcom commented Oct 25, 2019

About this PR and Bug tho, the issue was mainly the inability to roll back and I think that's the case because the pools are hitting their maxUnavailable so bumping that to 2 will reconcile the cluster, I'm verifying that, we can then get the rename in and postpone any later discussion about validation when spec 3 will be in MCO maybe (?)

Mainly to avoid ppl to ship something which could override the MCO files.

Signed-off-by: Antonio Murdaca <[email protected]>
@runcom runcom force-pushed the validate-indesired branch from 7ef9636 to 0afa0d4 Compare October 25, 2019 15:20
@openshift-ci-robot openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 25, 2019
@runcom runcom changed the title Bug 1764116: pkg/daemon: validate on-disk when in desired config Bug 1764116: templates: rename our dropins to include the mco string Oct 25, 2019
@openshift-ci-robot
Copy link
Contributor

@runcom: This pull request references Bugzilla bug 1764116, which is valid.

Details

In response to this:

Bug 1764116: templates: rename our dropins to include the mco string

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Antonio Murdaca <[email protected]>
@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 25, 2019
@kikisdeliveryservice
Copy link
Contributor

/skip

@kikisdeliveryservice
Copy link
Contributor

/retest

@kikisdeliveryservice
Copy link
Contributor

/skip

@ashcrow
Copy link
Member

ashcrow commented Dec 9, 2019

/retest

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 9, 2019

@runcom: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 fef2281 link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-vsphere fef2281 link /test e2e-vsphere

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@runcom
Copy link
Member Author

runcom commented May 6, 2020

Closed by #1715

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants