Proactively detect config drift #2795

cheesesashimi · 2021-10-07T14:58:22Z

This enables the MCD to check if on-disk configuration has drifted from the currently applied MachineConfig. This work is being tracked in https://issues.redhat.com/browse/MCO-69

- What I did
I added a goroutine to the MCD to run a Config Drift Monitor. Under the hood, the Config Drift Monitor uses fsnotify to wire up handlers for all of the directories referenced in a given MachineConfig. If a write event is detected for any file (or files relating to a Systemd unit) defined in a MachineConfig, the Config Drift Monitor will run the validateOnDiskState function. If the on-disk configuration has drifted from what is specified in the MachineConfig, the node will be marked Degraded. Additionally, I added a validateOnDiskState check to the syncNode routine so that if the config has drifted prior to an update, it will prevent the cluster from getting into an inconsistent state.

Care was taken to ensure that the Config Drift Monitor only runs whenever no updates are occurring. When the MCD is booting, updates may still be pending. To avoid spurious config drift errors (since technically, the config will "drift" from one config to another), the Config Drift Monitor will only run when the MCD has finished booting. Additionally, prior to applying an update, the Config Drift Monitor is shut down. Again, this is done because the config will "drift" from one config to another while the update is being applied. If no reboot is required, the Config Drift Monitor will be started once the update is complete.

- How to verify it

Within a cluster, create a new MachineConfig, e.g.:

---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: etc-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,hello%20world%0A
        mode: 420
        path: /etc/etc-file

Verify that the Config Drift Monitor has started by reviewing the MCD logs:

I1203 15:28:31.366342    1883 daemon.go:647] Starting MachineConfigDaemon
I1203 15:28:31.366426    1883 daemon.go:654] Enabling Kubelet Healthz Monitor
I1203 15:28:38.461070    1883 daemon.go:1343] Updating Node ip-10-0-132-121.ec2.internal
I1203 15:28:38.461091    1883 daemon.go:1343] Updating Node ip-10-0-132-86.ec2.internal
I1203 15:28:38.461111    1883 daemon.go:1343] Updating Node ip-10-0-150-37.ec2.internal
I1203 15:28:38.461116    1883 daemon.go:1343] Updating Node ip-10-0-151-158.ec2.internal
I1203 15:28:38.461121    1883 daemon.go:1343] Updating Node ip-10-0-164-151.ec2.internal
I1203 15:28:38.461125    1883 daemon.go:1343] Updating Node ip-10-0-170-92.ec2.internal
I1203 15:28:39.462097    1883 daemon.go:402] Started syncing node "ip-10-0-132-121.ec2.internal" (2021-12-03 15:28:39.462086333 +0000 UTC m=+9.130821879)
I1203 15:28:39.462153    1883 daemon.go:404] Finished syncing node "ip-10-0-132-121.ec2.internal" (63.876µs)
I1203 15:28:39.462162    1883 daemon.go:402] Started syncing node "ip-10-0-132-86.ec2.internal" (2021-12-03 15:28:39.462160974 +0000 UTC m=+9.130896512)
I1203 15:28:39.462215    1883 daemon.go:394] Node ip-10-0-132-86.ec2.internal is not labeled node-role.kubernetes.io/master
I1203 15:28:39.468002    1883 daemon.go:910] Current config: rendered-infra-e40714a84270cd41458ddf52dd8c465a
I1203 15:28:39.468060    1883 daemon.go:911] Desired config: rendered-infra-fcc7352d9598d0d4441fdd00aac05739
I1203 15:28:39.474917    1883 update.go:1994] Disk currentConfig rendered-infra-fcc7352d9598d0d4441fdd00aac05739 overrides node's currentConfig annotation rendered-infra-e40714a84270cd41458ddf52dd8c465a
I1203 15:28:39.477776    1883 daemon.go:1193] Validating against pending config rendered-infra-fcc7352d9598d0d4441fdd00aac05739
I1203 15:28:39.495045    1883 daemon.go:1211] Validated on-disk state
I1203 15:28:39.495129    1883 event.go:282] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-132-86.ec2.internal", UID:"b9afb880-9b75-45ef-b454-a89eb37b6d6f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeDone' Setting node ip-10-0-132-86.ec2.internal, currentConfig rendered-infra-fcc7352d9598d0d4441fdd00aac05739 to Done
I1203 15:28:39.508451    1883 daemon.go:1343] Updating Node ip-10-0-132-86.ec2.internal
I1203 15:28:39.509827    1883 daemon.go:1262] Completing pending config rendered-infra-fcc7352d9598d0d4441fdd00aac05739
I1203 15:28:39.509839    1883 drain.go:44] Initiating uncordon on node (currently schedulable: false)
I1203 15:28:39.523698    1883 drain.go:62] RunCordonOrUncordon() succeeded but node is still not in uncordon state, retrying
I1203 15:28:39.524490    1883 daemon.go:1343] Updating Node ip-10-0-132-86.ec2.internal
I1203 15:28:39.558790    1883 daemon.go:1343] Updating Node ip-10-0-132-86.ec2.internal
I1203 15:28:49.523812    1883 drain.go:44] Initiating uncordon on node (currently schedulable: true)
I1203 15:28:49.523838    1883 drain.go:66] uncordon succeeded on node (currently schedulable: true)
I1203 15:28:49.523846    1883 update.go:1994] Update completed for config rendered-infra-fcc7352d9598d0d4441fdd00aac05739 and node has been successfully uncordoned
I1203 15:28:49.526114    1883 daemon.go:1278] In desired config rendered-infra-fcc7352d9598d0d4441fdd00aac05739
I1203 15:28:49.536226    1883 config_drift_monitor.go:240] Config Drift Monitor started

Once the new MachineConfig has been applied to all pool nodes, modify the contents on-disk of this file:

$ oc debug node/<node-name> -- printf "not-the-data" > /host/etc/etc-file

Check the MCD logs for the node this targeted. You should see something that resembles:

E1203 15:40:28.247846    1883 on_disk_validation.go:221] content mismatch for file "/etc/etc-file" (-want +got):
  bytes.Join({
- 	"hello world\n",
+ 	"not-the-data",
  }, "")
E1203 15:40:28.247952    1883 daemon.go:733] content mismatch for file "/etc/etc-file"
E1203 15:40:28.247984    1883 writer.go:135] Marking Degraded due to: content mismatch for file "/etc/etc-file"
I1203 15:40:28.248939    1883 event.go:282] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-132-86.ec2.internal", UID:"b9afb880-9b75-45ef-b454-a89eb37b6d6f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'ConfigDriftDetected' content mismatch for file "/etc/etc-file"
I1203 15:40:28.272864    1883 daemon.go:1343] Updating Node ip-10-0-132-86.ec2.internal
I1203 15:40:29.273087    1883 daemon.go:402] Started syncing node "ip-10-0-132-86.ec2.internal" (2021-12-03 15:40:29.27307442 +0000 UTC m=+718.941809966)
E1203 15:40:29.286249    1883 on_disk_validation.go:221] content mismatch for file "/etc/etc-file" (-want +got):
  bytes.Join({
- 	"hello world\n",
+ 	"not-the-data",
  }, "")
E1203 15:40:29.286289    1883 daemon.go:506] Preflight config drift check failed: content mismatch for file "/etc/etc-file"
I1203 15:40:29.286296    1883 daemon.go:404] Finished syncing node "ip-10-0-132-86.ec2.internal" (13.220356ms)

To recover, either change the file contents back or oc debug node/<node-name> -- touch /host/run/machine-config-daemon-force. The latter will force the MCD to apply the update and will cause a reboot which may or may not be desirable.

- Description for the changelog
Proactively detect on-disk config drift

openshift-ci · 2021-10-07T14:58:23Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cheesesashimi · 2021-10-12T16:13:42Z

/test ?

openshift-ci · 2021-10-12T16:13:43Z

@cheesesashimi: The following commands are available to trigger required jobs:

/test cluster-bootimages
/test e2e-agnostic-upgrade
/test e2e-aws
/test e2e-gcp-op
/test images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test e2e-aws-disruptive
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-techpreview-featuregate
/test e2e-aws-upgrade
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel7
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-upgrade
/test e2e-gcp-op-single-node
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-metal-assisted
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-parallel
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-upgrade
/test e2e-vsphere-upi
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade
pull-ci-openshift-machine-config-operator-master-e2e-aws
pull-ci-openshift-machine-config-operator-master-e2e-aws-disruptive
pull-ci-openshift-machine-config-operator-master-e2e-aws-serial
pull-ci-openshift-machine-config-operator-master-e2e-aws-single-node
pull-ci-openshift-machine-config-operator-master-e2e-aws-techpreview-featuregate
pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade-single-node
pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7
pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel8
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node
pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
pull-ci-openshift-machine-config-operator-master-e2e-ovn-step-registry
pull-ci-openshift-machine-config-operator-master-e2e-vsphere-upgrade
pull-ci-openshift-machine-config-operator-master-images
pull-ci-openshift-machine-config-operator-master-okd-e2e-aws
pull-ci-openshift-machine-config-operator-master-okd-images
pull-ci-openshift-machine-config-operator-master-unit
pull-ci-openshift-machine-config-operator-master-verify

Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cheesesashimi · 2021-10-12T16:14:30Z

/test e2e-aws

cheesesashimi · 2021-10-12T20:00:40Z

/test e2e-gcp-op

cheesesashimi · 2021-10-14T20:56:46Z

There are still some TODOs:

General code cleanup and deduplication in pkg/daemon/daemon.go.
Better error handling for pkg/daemon/daemon.go. Specifically, how we wrap validateOnDiskState errors and detect them. Also, figure out a better error message for the cases when this does occur.
Wire up the poll period in pkg/daemon/daemon.go so that it can be artificially lowered for testing purposes.
Emit events using the dn.recorder object. TBD: Name(s) of events.

cheesesashimi · 2021-10-14T20:57:34Z

/test e2e-gcp-op
/test e2e-gcp-op-single-node

cheesesashimi · 2021-10-14T21:43:19Z

/test e2e-gcp-op
/test e2e-gcp-op-single-node

cheesesashimi · 2021-10-15T18:55:39Z

/test e2e-gcp-op
/test e2e-gcp-op-single-node

cheesesashimi · 2021-10-18T16:27:32Z

/test e2e-gcp-op
/test e2e-gcp-op-single-node

yuqi-zhang

Overall I think the direction looks good, although I have a few concerns, namely on what we do in the non-booting sync loops and how we handle "no-change forcefile updates". More details in comments below, and please do let me know if I am misunderstanding any of this, I am not sure if I have the details correct

pkg/daemon/daemon.go

sinnykumari · 2021-10-20T15:35:54Z

Overall logic looks fine, although I feel that there is potential to reuse existing code. Not replying to inline as there are already lot of inline comments and it can get confusing to understand. Trying to summarize here my chain of thought to keep things a bit simple and reusing already existing code wherever possible (helps to minimize new regression) :

Create a function e.g. pefromValidateOnDiskState() . This function has required logic and calls validateOnDiskState() whose logic is mainly in https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L1071-L1104 .
- In case content mismatch found, send an event and alert based on what we want
Create a goroutine that is tied with MCD pod life
- It calls pefromValidateOnDiskState() only iff (dn.booting is false) && (periodic interval at which we want this to run is true)

This should be ideally sufficient unless I missed any corner case. We don't need mutex because existing checkStateOnFirstRun() -> validateOnDiskState() is called only when dn.booting is set to true and that happens only when MCD pod is restarted or created. Once validateOnDiskState() finishes, it sets dn.booting to false

pkg/daemon/daemon.go

cheesesashimi · 2021-10-21T20:33:18Z

Here's where I am:

I reverted validateOnDiskState back to it's original scope of just validating the on-disk state with the addition of the diskValidatorLock. It will not emit any events on success or failure. That responsibility will fall on the caller.
I'm decoupling the forcefile detection logic from validateOnDiskState to make what's happening more explicit. That means no more maybeValidateOnDiskState.
I will keep the forceFileExists function because I think it's a bit cleaner and easier to read though I'm open to arguments against.
To be more explicit, I'm creating a function called checkForConfigDrift which runs validateOnDiskState against the current on-disk config and will only emit an event when it finds an error. However, it will log whenever it runs.
I've also added a runPeriodicConfigDriftCheck which executes the aforementioned function but emits events upon error with the specific context of being run periodically as opposed to on-demand.

To respond to @sinnykumari's comments:

Using dn.booting would only prevent the periodic check from running when the MCD is still booting. It won't prevent the periodic check from running while an update is in progress. If the MCD is performing an update and the periodic check runs, it could cause unexpected behavior.
I'm not sure if the additional logic found in https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L1071-L1104 is applicable for the context of checking for drift relative to the current on-disk configuration in both the periodic case as well as prior to applying an update.

pkg/daemon/daemon.go

test/helpers/mcd_config_drift.go

pkg/daemon/daemon.go

kikisdeliveryservice · 2021-10-21T22:58:13Z

Just as a note, when this is closer to ready, please rebase and get this down to ~3-4 commits.

Can you pull this out of draft (leaving WIP/hold), feels like we can have ci run at least to get it running your tests. :)

cheesesashimi · 2021-11-02T22:14:01Z

Update:

I adopted pretty much all of the changes @kikisdeliveryservice requested (thanks BTW!)
I also pulled this PR out of draft so all the tests can run.
I realized that I introduced an import cycle, so I fixed that. However, I'm not completely happy with my fix, so I'm completely open to feedback and suggestions.

TODO:

Get the import cycle fix into a state I and others are happy with.
Address lint / test failures.
Squash commits.

openshift-bot · 2021-12-11T03:17:53Z

/retest-required