Proactively detect config drift #2795
|
Skipping CI for Draft Pull Request. |
|
/test ? |
|
/test e2e-aws |
|
/test e2e-gcp-op |
|
There are still some TODOs:
|
|
/test e2e-gcp-op |
Force-pushed from cf98999 to ddb275c
|
/test e2e-gcp-op |
yuqi-zhang
left a comment
Overall I think the direction looks good, although I have a few concerns, namely what we do in the non-booting sync loops and how we handle "no-change forcefile updates". More details are in the comments below. Please let me know if I am misunderstanding any of this; I am not sure I have all the details correct.
|
Overall logic looks fine, although I feel there is potential to reuse existing code. I'm not replying inline, as there are already a lot of inline comments and it can get confusing. I'm summarizing my chain of thought here to keep things simple and to reuse existing code wherever possible (which helps minimize new regressions):
This should ideally be sufficient unless I missed a corner case. We don't need a mutex because the existing checkStateOnFirstRun() -> validateOnDiskState() path is called only when dn.booting is true, which happens only when the MCD pod is restarted or created. Once validateOnDiskState() finishes, it sets dn.booting to false. |
|
Here's where I am:
To respond to @sinnykumari's comments:
|
|
Just as a note: when this is closer to ready, please rebase and get this down to ~3-4 commits. Can you pull this out of draft (leaving WIP/hold)? It feels like we can at least have CI run your tests. :) |
|
Update:
TODO:
|
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
The Config Drift Monitor (openshift#2795) was previously unaware of compressed files. The MCD would decompress a compressed file payload and write the result to disk, but the Config Drift Monitor compared the still-compressed contents from the MachineConfig against the uncompressed contents on disk. As a result, the Config Drift Monitor would erroneously degrade the node / MCP. Fixes: #2032565
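To illustrate the mismatch described above, here is a minimal sketch in Go of a comparison that decompresses the expected payload before checking it against what is on disk. It is not the MCD's actual code: the helper names (maybeDecompress, contentsMatch, gzipBytes) are hypothetical, and detecting gzip by its magic bytes is an assumption made for this example. The real fix may instead rely on the compression information carried in the MachineConfig.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMagic is the two-byte header that starts every gzip stream.
var gzipMagic = []byte{0x1f, 0x8b}

// maybeDecompress (hypothetical helper) returns the gunzipped payload when
// data looks like a gzip stream, and returns data unchanged otherwise.
func maybeDecompress(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, gzipMagic) {
		return data, nil
	}
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, fmt.Errorf("opening gzip stream: %w", err)
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// contentsMatch (hypothetical helper) reports whether the on-disk bytes equal
// the expected payload after any gzip compression has been undone.
func contentsMatch(expected, onDisk []byte) (bool, error) {
	decoded, err := maybeDecompress(expected)
	if err != nil {
		return false, err
	}
	return bytes.Equal(decoded, onDisk), nil
}

// gzipBytes compresses s so the demo has a compressed payload to work with.
// Writes to a bytes.Buffer do not fail, so errors are ignored here.
func gzipBytes(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return buf.Bytes()
}

func main() {
	payload := gzipBytes("hello\n") // what the config would carry
	onDisk := []byte("hello\n")     // what was written after decompression

	naive := bytes.Equal(payload, onDisk) // the comparison that caused false drift
	ok, err := contentsMatch(payload, onDisk)
	fmt.Println(naive, ok, err) // prints: false true <nil>
}
```

The naive byte comparison reports a difference even though nothing has drifted, which is the false positive the commit message describes.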
This enables the MCD to check if on-disk configuration has drifted from the currently applied MachineConfig. This work is being tracked in https://issues.redhat.com/browse/MCO-69
- What I did
I added a goroutine to the MCD to run a Config Drift Monitor. Under the hood, the Config Drift Monitor uses fsnotify to wire up handlers for all of the directories referenced in a given MachineConfig. If a write event is detected for any file (or files relating to a systemd unit) defined in a MachineConfig, the Config Drift Monitor will run the validateOnDiskState function. If the on-disk configuration has drifted from what is specified in the MachineConfig, the node will be marked Degraded. Additionally, I added a validateOnDiskState check to the syncNode routine so that if the config has drifted prior to an update, it will prevent the cluster from getting into an inconsistent state.

Care was taken to ensure that the Config Drift Monitor only runs whenever no updates are occurring. When the MCD is booting, updates may still be pending. To avoid spurious config drift errors (since technically, the config will "drift" from one config to another), the Config Drift Monitor will only run when the MCD has finished booting. Additionally, prior to applying an update, the Config Drift Monitor is shut down. Again, this is done because the config will "drift" from one config to another while the update is being applied. If no reboot is required, the Config Drift Monitor will be started once the update is complete.
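The event-handling logic described above can be sketched roughly as follows. This is a simplified illustration and not the PR's implementation: the fsnotify watcher itself is left out so the example depends only on the standard library, and the names watchDirs, monitor, newMonitor, and handleWrite are hypothetical. The validate callback stands in for validateOnDiskState, and the file paths in main are made up.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"sort"
)

// watchDirs (hypothetical) returns the sorted, de-duplicated parent
// directories of the tracked files; these are the directories a filesystem
// watcher would be attached to.
func watchDirs(paths []string) []string {
	seen := map[string]struct{}{}
	for _, p := range paths {
		seen[filepath.Dir(p)] = struct{}{}
	}
	dirs := make([]string, 0, len(seen))
	for d := range seen {
		dirs = append(dirs, d)
	}
	sort.Strings(dirs)
	return dirs
}

// monitor (hypothetical) remembers which files belong to the current config
// and how to validate the on-disk state.
type monitor struct {
	tracked  map[string]struct{}
	validate func() error // stand-in for validateOnDiskState
}

func newMonitor(paths []string, validate func() error) *monitor {
	tracked := make(map[string]struct{}, len(paths))
	for _, p := range paths {
		tracked[filepath.Clean(p)] = struct{}{}
	}
	return &monitor{tracked: tracked, validate: validate}
}

// handleWrite is what a watcher callback would invoke for each write event.
// Writes to untracked files are ignored; writes to tracked files trigger a
// full validation, and any validation failure is reported as drift.
func (m *monitor) handleWrite(path string) error {
	if _, ok := m.tracked[filepath.Clean(path)]; !ok {
		return nil
	}
	if err := m.validate(); err != nil {
		return fmt.Errorf("config drift detected: %w", err)
	}
	return nil
}

func main() {
	files := []string{
		"/etc/etc-file",
		"/etc/other-file",
		"/etc/systemd/system/example.service",
	}
	fmt.Println(watchDirs(files))

	// Pretend the on-disk state no longer matches the config.
	m := newMonitor(files, func() error { return errors.New("content mismatch") })
	fmt.Println(m.handleWrite("/etc/unrelated")) // untracked file: no validation
	fmt.Println(m.handleWrite("/etc/etc-file"))  // tracked file: drift reported
}
```

Watching directories rather than individual files means events for unrelated files in the same directory also arrive, which is why the handler filters on the tracked set before running the comparatively expensive validation.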
- How to verify it
$ oc debug node/<node-name> -- sh -c 'printf "not-the-data" > /host/etc/etc-file'
To recover, either change the file contents back or run:
$ oc debug node/<node-name> -- touch /host/run/machine-config-daemon-force
The latter will force the MCD to apply the update and will cause a reboot, which may or may not be desirable.
- Description for the changelog
Proactively detect on-disk config drift