MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API #4267

djoshy · 2024-03-18T17:35:44Z

This PR does the following:

The operator now checks the MachineConfiguration global knob for any user-provided node disruption policies, merges them with the cluster defaults, and displays the result in the Status of the MachineConfiguration object.
The daemon looks at the status of the MachineConfiguration object during node updates. If the daemon is unable to find the status of MachineConfiguration for more than 2 minutes, it falls back to the legacy update path.
All of this is gated on the NodeDisruptionPolicy feature gate.

How to test:

Bring up a cluster in TechPreview. If you are testing on a cluster that started with the default FeatureSet and was later transitioned to TechPreview, please read this MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API #4267 (comment)
Check the status of the MachineConfiguration object named "cluster"; this should list the cluster default policies.

$ oc get MachineConfiguration/cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  creationTimestamp: "2024-04-16T15:02:37Z"
  generation: 4
  name: cluster
  resourceVersion: "261205"
  uid: 2c67b155-1898-452f-adbd-ed376afc0ea2
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
      - actions:
        - type: None
        path: /etc/mco/internal-registry-pull-secret.json
      - actions:
        - type: None
        path: /var/lib/kubelet/config.json
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/policy.json
      - actions:
        - type: Special
        path: /etc/containers/registries.conf
      sshkey:
        actions:
        - type: None
  readyReplicas: 0

Apply a MachineConfiguration named "cluster" with a valid NodeDisruptionPolicy. More information about the policy and possible actions can be found here. Here is a sample policy:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
spec:
  nodeDisruptionPolicy:
    files:
      - path: "/etc/my-file"
        actions:
          - type: None
    units:
      - name: "my.service"
        actions:
          - type: Restart
            restart:
              serviceName: crio.service
      - name: "test.service"
        actions:
          - type: DaemonReload
          - type: Drain
          - type: Reload
            reload:
              serviceName: crio.service
          - type: Restart
            restart:
              serviceName: my.service
    sshkey:
      actions:
      - type: Reboot

Check the status of the MachineConfiguration object; this should now list a merged policy, including the policies specified by you and the defaults. If there are any conflicts, the user-specified policies will override the cluster defaults.


status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
      - actions:
        - type: None
        path: /etc/mco/internal-registry-pull-secret.json
      - actions:
        - type: None
        path: /var/lib/kubelet/config.json
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/policy.json
      - actions:
        - type: Special
        path: /etc/containers/registries.conf
      - actions:
        - type: None
        path: /etc/my-file
      sshkey:
        actions:
        - type: Reboot
      units:
      - actions:
        - restart:
            serviceName: crio.service
          type: Restart
        name: my.service
      - actions:
        - type: DaemonReload
        - type: Drain
        - reload:
            serviceName: crio.service
          type: Reload
        - restart:
            serviceName: my.service
          type: Restart
        name: test.service

In this case, the sshkey section has been overridden.
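The merge behaviour shown in the two status dumps can be sketched roughly as follows. This is only an illustration under assumptions: `Action`, `FilePolicy` and `mergeFilePolicies` are hypothetical, simplified names, not the operator's real types or functions, and keying the merge on the file path is my reading of the output above rather than something the PR text states.

```go
package main

import "fmt"

// Action and FilePolicy are simplified, hypothetical stand-ins for the
// operator.openshift.io/v1 types; they are not the real API structs.
type Action struct {
	Type        string
	ServiceName string // only meaningful for Reload and Restart
}

type FilePolicy struct {
	Path    string
	Actions []Action
}

// mergeFilePolicies keeps every cluster default, lets a user policy with the
// same path replace the default, and appends user policies for new paths.
func mergeFilePolicies(defaults, user []FilePolicy) []FilePolicy {
	userByPath := make(map[string]FilePolicy, len(user))
	for _, p := range user {
		userByPath[p.Path] = p
	}
	merged := make([]FilePolicy, 0, len(defaults)+len(user))
	seen := make(map[string]bool, len(defaults))
	for _, d := range defaults {
		if u, ok := userByPath[d.Path]; ok {
			merged = append(merged, u) // user entry overrides the default
		} else {
			merged = append(merged, d)
		}
		seen[d.Path] = true
	}
	for _, u := range user {
		if !seen[u.Path] {
			merged = append(merged, u) // new path, not covered by defaults
			seen[u.Path] = true
		}
	}
	return merged
}

func main() {
	defaults := []FilePolicy{
		{Path: "/var/lib/kubelet/config.json", Actions: []Action{{Type: "None"}}},
		{Path: "/etc/containers/policy.json", Actions: []Action{{Type: "Reload", ServiceName: "crio.service"}}},
	}
	user := []FilePolicy{
		{Path: "/etc/my-file", Actions: []Action{{Type: "None"}}},
	}
	for _, p := range mergeFilePolicies(defaults, user) {
		fmt.Println(p.Path, p.Actions[0].Type)
	}
}
```

The same idea would apply to units (keyed by name) and to the single sshkey entry, where a user value simply replaces the default.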

Now, apply a new MachineConfig with a change that is covered by the currently defined node disruption policy. In this case, I applied a config which had changes to the test.service unit. Observe the daemon logs on the targeted node, and you should see the daemon step through the designated actions.

I0417 18:20:47.614397    2551 update.go:2578] Starting update from rendered-infra-5d1e3cebfddcee59ac7bde56152e0919 to rendered-infra-a17773854499b91cb5cc86087b54cfe8: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
I0417 18:20:47.624217    2551 update.go:717] Calculating node disruption actions
I0417 18:20:47.624251    2551 update.go:642] NodeDisruptionPolicy found for diff unit test.service!
I0417 18:20:47.624261    2551 update.go:1023] Calculated node disruption actions:
I0417 18:20:47.624271    2551 update.go:1030] DaemonReload
I0417 18:20:47.624284    2551 update.go:1030] Drain
I0417 18:20:47.624298    2551 update.go:1026] Reload - crio.service
I0417 18:20:47.624313    2551 update.go:1028] Restart - my.service
I0417 18:20:47.624320    2551 drain.go:121] Checking drain required for node disruption actions
I0417 18:20:47.624328    2551 update.go:1060] Drain calculated for node disruption: true
...
...
...
I0417 18:21:30.264732    2551 update.go:2578] Executing performPostConfigChangeNodeDisruptionAction(drain already complete/skipped) for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.266752    2551 update.go:2578] Executing postconfig action: DaemonReload for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.268793    2551 update.go:2563] Running: systemctl daemon-reload
I0417 18:21:30.645513    2551 update.go:2578] daemon-reload service reloaded successfully!
I0417 18:21:30.647826    2551 update.go:2578] Executing postconfig action: Drain for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.649623    2551 update.go:2578] Executing postconfig action: Reload for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.651303    2551 update.go:2563] Running: systemctl reload crio.service
I0417 18:21:30.688445    2551 update.go:2578] crio.service service reloaded successfully!
I0417 18:21:30.690534    2551 update.go:2578] Executing postconfig action: Restart for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.692379    2551 update.go:2563] Running: systemctl restart my.service
I0417 18:21:30.711817    2551 update.go:2578] my.service service restarted successfully!
I0417 18:21:30.720163    2551 daemon.go:1574] Previous boot ostree-finalize-staged.service appears successful
I0417 18:21:30.720187    2551 daemon.go:1697] Current config: rendered-infra-5d1e3cebfddcee59ac7bde56152e0919
I0417 18:21:30.720192    2551 daemon.go:1698] Desired config: rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.720202    2551 daemon.go:1706] state: Done
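For readers following the log, here is a rough sketch of how each action lines up with the command the daemon logged. It is not the daemon's code: `Action` and `commandFor` are hypothetical names, and the mapping is taken only from the "Running: systemctl ..." lines above.

```go
package main

import "fmt"

// Action is a hypothetical, simplified stand-in for the status action type.
type Action struct {
	Type        string
	ServiceName string
}

// commandFor returns the systemctl invocation seen in the log for an action,
// or nil when this sketch has no command for it.
func commandFor(a Action) []string {
	switch a.Type {
	case "DaemonReload":
		return []string{"systemctl", "daemon-reload"}
	case "Reload":
		return []string{"systemctl", "reload", a.ServiceName}
	case "Restart":
		return []string{"systemctl", "restart", a.ServiceName}
	default:
		// Drain runs no command here (the log shows it was handled
		// earlier); other action types are outside this sketch.
		return nil
	}
}

func main() {
	actions := []Action{
		{Type: "DaemonReload"},
		{Type: "Drain"},
		{Type: "Reload", ServiceName: "crio.service"},
		{Type: "Restart", ServiceName: "my.service"},
	}
	for _, a := range actions {
		if cmd := commandFor(a); cmd != nil {
			fmt.Println(a.Type, "->", cmd)
		} else {
			fmt.Println(a.Type, "-> no command in this step")
		}
	}
}
```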

Some things to note

If any of the changes result in a reboot action, all other policies will be ignored.
There is no dedup of the final actions list. We don't expect users to make MC changes that cause multiple policies to be in effect for a single update very often, so this didn't seem like a pressing need.
There is a special exclusion for files defined in the directory /etc/containers/registries.d/ to account for some recent changes from OCPNODE-1632: Support ClusterImagePolicy CRD #4160 (comment). In the future, the MCO plans to add directories and wildcard support. With that in place, this exception can be removed and made part of the standard policy.
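The first two notes can be illustrated with a small sketch. `calculateActions` is a hypothetical helper that works on plain strings, not the daemon's function; it only shows the two behaviours described: a reboot wins over everything else, and the remaining actions are concatenated without deduplication.

```go
package main

import "fmt"

// calculateActions concatenates the actions of every matched policy in order.
// If any action is "Reboot", only Reboot is returned. Duplicates are kept.
func calculateActions(matched [][]string) []string {
	var out []string
	for _, actions := range matched {
		for _, a := range actions {
			if a == "Reboot" {
				return []string{"Reboot"}
			}
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// Two policies match the same update; Drain appears twice in the result.
	fmt.Println(calculateActions([][]string{{"DaemonReload", "Drain"}, {"Drain", "Restart"}}))
	// One policy asks for a reboot, so the other actions are dropped.
	fmt.Println(calculateActions([][]string{{"Drain"}, {"Reboot"}}))
}
```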

openshift-ci · 2024-03-18T17:35:50Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2024-03-25T17:35:06Z

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

djoshy · 2024-04-10T18:40:52Z

/test unit
/test verify

This enables the Operator to read in user-defined node disruption policies, merge them with the cluster defaults, and then update the status on the MachineConfiguration object. This is only done when FeatureGateNodeDisruptionPolicy is enabled.

This change makes the daemon read in NodeDisruptionPolicyStatus during an update and queue up the appropriate actions that the policy requests. No deduping is currently done for the policies. This will not take place during firstboot, as the featureGateAccessor does not exist then.

djoshy · 2024-04-23T00:10:47Z

Made some additional fixes related to the transition to the TechPreview featureset. In my tests, I noticed that the operator would not get far enough to actually render the NodeDisruptionPolicyStatus, as it was "stuck" in a mid-feature-gate state while the entire cluster was transitioning. Meanwhile, the nodes would get assigned a new config to switch to, but with the 10 minute timeout I placed earlier, it took them a long time to transition to the new config, as they first had to hit the timeout and then fall back to the legacy update path. To clean up this UX, I did the following:

moved up syncMachineConfiguration in the operator's sync functions so NodeDisruptionPolicyStatus is rendered earlier in the operator lifecycle
dropped the timeout back to 2 minutes, after which the daemon will execute the old update path, not respecting node disruption policies.

With the above fixes, the only variable to account for is the operator's lease acquire time; once the lease is acquired, the status is populated and the daemons can use it to orchestrate updates. The two minute timeout is largely arbitrary. During the "feature gate" transition update, some nodes may have to fall back to the legacy path if the operator is not fast enough to populate the NodeDisruptionPolicyStatus. Once the cluster has gone through the "feature gate" transition update, all future MC updates should respect the node disruption policy of the cluster.

Note: This only applies when the feature gate was applied after the cluster was installed with a "Default" featureset. If the cluster featureset was TechPreview at install time, this issue would not happen; it is the transition that gunked up the process.

sinnykumari · 2024-04-23T09:17:08Z

pkg/daemon/update.go

+		for _, policyFile := range clusterPolicies.Files {
+			klog.V(4).Infof("comparing policy path %s to diff path %s", policyFile.Path, diffPath)
+			if policyFile.Path == diffPath {
+				klog.Infof("NodeDisruptionPolicy found for diff file %s!", diffPath)


nit: Perhaps we should drop the ! at the end, since it could be read as part of diffPath.

ack, will fix alongside anything QE might find on next pass. Thanks!

sinnykumari · 2024-04-23T11:23:53Z

pkg/daemon/update.go

+// If at any point an error occurs, we reboot the node so that node has correct configuration.
+func (dn *Daemon) performPostConfigChangeNodeDisruptionAction(postConfigChangeActions []opv1.NodeDisruptionPolicyStatusAction, configName string) error {
+
+	logSystem("Executing performPostConfigChangeNodeDisruptionAction(drain already complete/skipped) for config %s", configName)


Perhaps we can just say "Performing post config change action".
We can skip drain already complete/skipped since this is already captured in https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L793 if we drain.

ack, yeah can clean this up. The reason I kept that marker is this loop will also roll through the Drain action, but I can put in a no-op condition message for that (:

sinnykumari

Few minor nits which are optional.
Other than that this looks great. Nice work David!
/lgtm

openshift-ci · 2024-04-23T12:06:56Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [djoshy,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rioliu-rh · 2024-04-25T06:23:43Z

Pre-merge testing result:
Verified the following functionality:

files actions
- none
- reboot
- restart
- reload
- drain
- daemon reload
- multi-actions
units actions
- none
- reboot
- multi-actions
sshkey actions
- none
- reboot
- multi-actions

Issues:

OCPBUGS-32511 NodeDisruptionPolicyStatus was not ready context deadline exceeded
OCPBUGS-32739 MachineConfigurations is only effective with name
OCPBUGS-32783 NodeDisruptionPolicy action reload cannot take effect

Test cases are linked to epic MCO-507
No merge blockers. Issues can be fixed in new PRs.

/label qe-approved
/unhold

openshift-ci-robot · 2024-04-25T06:23:49Z

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

This PR does the following:

The operator now checks the MachineConfiguration global knob for any user-provided node disruption policies, merges them with the cluster defaults, and displays the result in the Status of the MachineConfiguration object.
The daemon looks at the status of the MachineConfiguration object during node updates. If the daemon cannot find the MachineConfiguration status for more than 2 minutes, it falls back to the legacy update path.
All of this is gated on the NodeDisruptionPolicy feature gate.
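
A minimal Go sketch of the fallback described above, for illustration only. The function name `waitForPolicyStatus`, the polling interval, and the `ready` probe are assumptions and do not come from the MCO code; only the idea of giving up after a fixed limit (2 minutes in this PR) and using the legacy path is taken from the description.

```go
package main

import (
	"fmt"
	"time"
)

// waitForPolicyStatus polls ready until it reports true or until another
// poll would overrun the timeout. A false result means the caller should
// use the legacy update path. (Hypothetical helper, not the daemon's API.)
func waitForPolicyStatus(timeout, interval time.Duration, ready func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if ready() {
			return true
		}
		// Stop if sleeping one more interval would pass the deadline.
		if time.Now().Add(interval).After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	// The PR uses a 2 minute limit; a short timeout keeps this demo fast.
	neverReady := func() bool { return false }
	if waitForPolicyStatus(50*time.Millisecond, 10*time.Millisecond, neverReady) {
		fmt.Println("status found: using node disruption policies")
	} else {
		fmt.Println("status not found: falling back to legacy update path")
	}
}
```

In the real daemon the probe would be a lookup of the MachineConfiguration status rather than a closure.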

How to test:

Bring up a cluster in TechPreview. If testing on a cluster that originally had the default FeatureSet and was then transitioned to TechPreview, please read this comment first: MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API #4267 (comment)
Check the status of the MachineConfiguration object named "cluster"; this should list the cluster default policies.

$ oc get MachineConfiguration/cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
 creationTimestamp: "2024-04-16T15:02:37Z"
 generation: 4
 name: cluster
 resourceVersion: "261205"
 uid: 2c67b155-1898-452f-adbd-ed376afc0ea2
spec:
 logLevel: Normal
 managementState: Managed
 operatorLogLevel: Normal
status:
 nodeDisruptionPolicyStatus:
   clusterPolicies:
     files:
     - actions:
       - type: None
       path: /etc/mco/internal-registry-pull-secret.json
     - actions:
       - type: None
       path: /var/lib/kubelet/config.json
     - actions:
       - reload:
           serviceName: crio.service
         type: Reload
       path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
     - actions:
       - reload:
           serviceName: crio.service
         type: Reload
       path: /etc/containers/policy.json
     - actions:
       - type: Special
       path: /etc/containers/registries.conf
     sshkey:
       actions:
       - type: None
 readyReplicas: 0

Apply a MachineConfiguration named "cluster" with a valid NodeDisruptionPolicy. More information about the policy and possible actions can be found here. Here is a sample policy:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
 name: cluster
 namespace: openshift-machine-config-operator
spec:
 nodeDisruptionPolicy:
   files:
     - path: "/etc/my-file"
       actions:
         - type: None
   units:
     - name: "my.service"
       actions:
         - type: Restart
           restart:
             serviceName: crio.service
     - name: "test.service"
       actions:
         - type: DaemonReload
         - type: Drain
         - type: Reload
           reload:
             serviceName: crio.service
         - type: Restart
           restart:
             serviceName: my.service
   sshkey:
     actions:
     - type: Reboot

Check the status of the MachineConfiguration object; this should now list a merged policy that includes both the policies you specified and the defaults. If there are any conflicts, the user-specified policies override the cluster defaults.


status:
 nodeDisruptionPolicyStatus:
   clusterPolicies:
     files:
     - actions:
       - type: None
       path: /etc/mco/internal-registry-pull-secret.json
     - actions:
       - type: None
       path: /var/lib/kubelet/config.json
     - actions:
       - reload:
           serviceName: crio.service
         type: Reload
       path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
     - actions:
       - reload:
           serviceName: crio.service
         type: Reload
       path: /etc/containers/policy.json
     - actions:
       - type: Special
       path: /etc/containers/registries.conf
     - actions:
       - type: None
       path: /etc/my-file
     sshkey:
       actions:
       - type: Reboot
     units:
     - actions:
       - restart:
           serviceName: crio.service
         type: Restart
       name: my.service
     - actions:
       - type: DaemonReload
       - type: Drain
       - reload:
           serviceName: crio.service
         type: Reload
       - restart:
           serviceName: my.service
         type: Restart
       name: test.service

In this case, the sshkey section has been overridden.
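
The merge behaviour shown in the status above can be sketched in Go as follows. This is an illustration under stated assumptions, not the operator's implementation: `FilePolicy` and `mergeFilePolicies` are hypothetical names, only the `files` list is modelled, and treating two entries with the same path as a conflict is an assumption.

```go
package main

import "fmt"

// FilePolicy is a simplified, hypothetical stand-in for one entry of the
// files list in a node disruption policy.
type FilePolicy struct {
	Path    string
	Actions []string
}

// mergeFilePolicies layers user policies over the cluster defaults.
// A user entry with the same path replaces the default entry in place;
// user entries with new paths are appended after the defaults.
func mergeFilePolicies(defaults, user []FilePolicy) []FilePolicy {
	merged := make([]FilePolicy, 0, len(defaults)+len(user))
	index := make(map[string]int, len(defaults)+len(user))
	for _, p := range defaults {
		index[p.Path] = len(merged)
		merged = append(merged, p)
	}
	for _, p := range user {
		if i, ok := index[p.Path]; ok {
			merged[i] = p // conflict: the user policy wins
			continue
		}
		index[p.Path] = len(merged)
		merged = append(merged, p)
	}
	return merged
}

func main() {
	defaults := []FilePolicy{
		{Path: "/etc/containers/policy.json", Actions: []string{"Reload"}},
		{Path: "/var/lib/kubelet/config.json", Actions: []string{"None"}},
	}
	user := []FilePolicy{
		{Path: "/etc/my-file", Actions: []string{"None"}},
		{Path: "/var/lib/kubelet/config.json", Actions: []string{"Reboot"}},
	}
	for _, p := range mergeFilePolicies(defaults, user) {
		fmt.Println(p.Path, p.Actions)
	}
}
```

Keeping the defaults first and appending new user paths matches the ordering in the status output above, where /etc/my-file is listed after the default entries.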

Now, apply a new MachineConfig with a change that is covered by the currently defined node disruption policy. In this case, I applied a config that changed the test.service unit. Observe the daemon logs on the targeted node; you should see the daemon step through the designated actions.

I0417 18:20:47.614397    2551 update.go:2578] Starting update from rendered-infra-5d1e3cebfddcee59ac7bde56152e0919 to rendered-infra-a17773854499b91cb5cc86087b54cfe8: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
I0417 18:20:47.624217    2551 update.go:717] Calculating node disruption actions
I0417 18:20:47.624251    2551 update.go:642] NodeDisruptionPolicy found for diff unit test.service!
I0417 18:20:47.624261    2551 update.go:1023] Calculated node disruption actions:
I0417 18:20:47.624271    2551 update.go:1030] DaemonReload
I0417 18:20:47.624284    2551 update.go:1030] Drain
I0417 18:20:47.624298    2551 update.go:1026] Reload - crio.service
I0417 18:20:47.624313    2551 update.go:1028] Restart - my.service
I0417 18:20:47.624320    2551 drain.go:121] Checking drain required for node disruption actions
I0417 18:20:47.624328    2551 update.go:1060] Drain calculated for node disruption: true
...
...
...
I0417 18:21:30.264732    2551 update.go:2578] Executing performPostConfigChangeNodeDisruptionAction(drain already complete/skipped) for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.266752    2551 update.go:2578] Executing postconfig action: DaemonReload for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.268793    2551 update.go:2563] Running: systemctl daemon-reload
I0417 18:21:30.645513    2551 update.go:2578] daemon-reload service reloaded successfully!
I0417 18:21:30.647826    2551 update.go:2578] Executing postconfig action: Drain for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.649623    2551 update.go:2578] Executing postconfig action: Reload for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.651303    2551 update.go:2563] Running: systemctl reload crio.service
I0417 18:21:30.688445    2551 update.go:2578] crio.service service reloaded successfully!
I0417 18:21:30.690534    2551 update.go:2578] Executing postconfig action: Restart for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.692379    2551 update.go:2563] Running: systemctl restart my.service
I0417 18:21:30.711817    2551 update.go:2578] my.service service restarted successfully!
I0417 18:21:30.720163    2551 daemon.go:1574] Previous boot ostree-finalize-staged.service appears successful
I0417 18:21:30.720187    2551 daemon.go:1697] Current config: rendered-infra-5d1e3cebfddcee59ac7bde56152e0919
I0417 18:21:30.720192    2551 daemon.go:1698] Desired config: rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.720202    2551 daemon.go:1706] state: Done
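
The post-config sequence in the log above can be sketched in Go as a mapping from each action to the command the daemon logs. This is a reading of the log output only: the `Action` type and `commandFor` are hypothetical, only the four action types that appear in the log are modelled, and Drain is treated as a no-op at this stage because the log shows no command for it.

```go
package main

import (
	"fmt"
	"strings"
)

// Action is a simplified, hypothetical stand-in for one policy action.
type Action struct {
	Type        string
	ServiceName string // only used by Reload and Restart
}

// commandFor returns the systemctl invocation seen in the log for an
// action, or nil when nothing is run at the post-config stage.
func commandFor(a Action) []string {
	switch a.Type {
	case "DaemonReload":
		return []string{"systemctl", "daemon-reload"}
	case "Reload":
		return []string{"systemctl", "reload", a.ServiceName}
	case "Restart":
		return []string{"systemctl", "restart", a.ServiceName}
	default:
		// Drain is handled earlier in the update, so no command here.
		// Other action types are not modelled in this sketch.
		return nil
	}
}

func main() {
	actions := []Action{
		{Type: "DaemonReload"},
		{Type: "Drain"},
		{Type: "Reload", ServiceName: "crio.service"},
		{Type: "Restart", ServiceName: "my.service"},
	}
	for _, a := range actions {
		cmd := commandFor(a)
		if cmd == nil {
			fmt.Printf("%s: nothing to run\n", a.Type)
			continue
		}
		fmt.Printf("%s: %s\n", a.Type, strings.Join(cmd, " "))
	}
}
```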

Some things to note

If any of the changes result in a reboot action, all other policies will be ignored.
There is no dedup of the final actions list. We don't expect users to make MC changes that cause multiple policies to be in effect for a single update very often, so this didn't seem like a pressing need.
There is a special exclusion for files defined in the directory /etc/containers/registries.d/ to account for some recent changes from OCPNODE-1632: Support ClusterImagePolicy CRD #4160 (comment). In the future, the MCO plans to add directory and wildcard support. With that in place, this exception can be removed and made part of the standard policy.
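
The first two notes can be illustrated with a small Go sketch. It is an interpretation of the notes rather than the daemon's code: `Action` and `calculateActions` are hypothetical names, and the input is assumed to be the action lists of every policy matched by the update.

```go
package main

import "fmt"

// Action names a node disruption action type.
type Action string

const (
	DaemonReload Action = "DaemonReload"
	Drain        Action = "Drain"
	Reload       Action = "Reload"
	Restart      Action = "Restart"
	Reboot       Action = "Reboot"
)

// calculateActions flattens the actions of all matched policies in order.
// If any policy asks for a reboot, every other action is dropped.
// Duplicates are kept on purpose, mirroring the "no dedup" note.
func calculateActions(matched [][]Action) []Action {
	var out []Action
	for _, policy := range matched {
		for _, a := range policy {
			if a == Reboot {
				return []Action{Reboot}
			}
			out = append(out, a)
		}
	}
	return out
}

func main() {
	unitPolicy := []Action{DaemonReload, Drain, Reload, Restart}
	filePolicy := []Action{Reload}
	sshPolicy := []Action{Reboot}

	// Two non-reboot policies: all actions are kept, including both Reloads.
	fmt.Println(calculateActions([][]Action{unitPolicy, filePolicy}))
	// Adding a policy that reboots collapses the list to a single Reboot.
	fmt.Println(calculateActions([][]Action{unitPolicy, filePolicy, sshPolicy}))
}
```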

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2024-04-25T08:41:37Z

/retest-required

Remaining retests: 0 against base HEAD d148f30 and 2 for PR HEAD 103bdc1 in total

djoshy · 2024-04-25T10:20:34Z

/retest-required

openshift-ci · 2024-04-25T11:54:50Z

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/okd-scos-e2e-aws-ovn	`103bdc1`	link	false	`/test okd-scos-e2e-aws-ovn`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

djoshy · 2024-04-25T11:55:47Z

/override ci/prow/e2e-hypershift

Hypershift failures are unrelated and being fixed. Since the PR has passed it in the past, I'm overriding it.

sinnykumari · 2024-04-25T11:55:57Z

Looks like openshift/hypershift#3938 doesn't help much with the HyperShift failure. The recent failure on this PR doesn't look MCO-specific. Also, we had a successful run on this PR with the current changes in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4267/pull-ci-openshift-machine-config-operator-master-e2e-hypershift/1782558388421398528 . This PR should be safe to go in. Overriding the e2e-hypershift test to get this feature work merged.
/override ci/prow/e2e-hypershift

openshift-ci · 2024-04-25T11:57:05Z

@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/e2e-hypershift


openshift-ci · 2024-04-25T11:57:09Z

@djoshy: Overrode contexts on behalf of djoshy: ci/prow/e2e-hypershift


openshift-bot · 2024-04-25T16:41:54Z

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.16.0-202404251110.p0.g9b99954.assembly.stream.el9 for distgit ose-machine-config-operator.
All builds following this will include this PR.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2024

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2024

djoshy force-pushed the parse-node-disrupt branch 2 times, most recently from 7afa70a to 1af0015 Compare March 19, 2024 20:37

djoshy changed the title ~~DNM: testing new CRDs~~ MCO-1009: Validate user-provided configuration and merge with cluster-defaults Mar 25, 2024

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 25, 2024

djoshy force-pushed the parse-node-disrupt branch 4 times, most recently from 5d9737d to 5fa5f95 Compare March 29, 2024 14:23

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2024

djoshy force-pushed the parse-node-disrupt branch 4 times, most recently from 25cf96d to ee568f2 Compare March 29, 2024 17:43

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2024

djoshy force-pushed the parse-node-disrupt branch from ee568f2 to 7f3d08b Compare April 10, 2024 16:19

djoshy changed the title ~~MCO-1009: Validate user-provided configuration and merge with cluster-defaults~~ MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy Apr 10, 2024

djoshy changed the title ~~MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy~~ MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API Apr 10, 2024

djoshy force-pushed the parse-node-disrupt branch 2 times, most recently from e415e94 to 4ec0517 Compare April 10, 2024 18:40

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2024

djoshy marked this pull request as ready for review April 10, 2024 19:13

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2024

djoshy force-pushed the parse-node-disrupt branch from 8bbfba2 to f3d90b0 Compare April 22, 2024 09:52

djoshy mentioned this pull request Apr 22, 2024

MCO-507: admin defined node disruption policy enhancement openshift/enhancements#1525

Merged

djoshy added 2 commits April 22, 2024 17:25

operator: read & merge node disruption policies

6c67149

This enables the Operator to read in user-defined node disruption policies, merge them with the cluster defaults, and then update the status on the MachineConfiguration object. This is only done when FeatureGateNodeDisruptionPolicy is enabled.

djoshy force-pushed the parse-node-disrupt branch from f3d90b0 to 103bdc1 Compare April 22, 2024 23:53

sinnykumari reviewed Apr 23, 2024


openshift-ci bot assigned sinnykumari Apr 23, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2024

openshift-ci bot added qe-approved Signifies that QE has signed off on this PR and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 25, 2024

openshift-merge-bot bot merged commit 9b99954 into openshift:master Apr 25, 2024

djoshy deleted the parse-node-disrupt branch April 29, 2024 15:57

djoshy mentioned this pull request May 15, 2024

MCO-1152: MCO-1146: Add e2e tests for NodeDisruptionPolicy #4365

Merged

djoshy mentioned this pull request Jul 30, 2024

MCO-1065: MCO-1171: API bump for ManagedBootImages and NodeDisruptionPolicy GA #4496

Merged
