@djoshy djoshy commented Mar 18, 2024

This PR does the following:

  • The operator now checks the MachineConfiguration global knob for any user-provided node disruption policies, merges them with the cluster defaults, and displays the result in the Status of the MachineConfiguration object.
  • The daemon looks at the status of the MachineConfiguration object during node updates. If the daemon is unable to find the status of MachineConfiguration for more than 2 minutes, it will fall back to the legacy update path.
  • All of this is gated on the NodeDisruption feature gate.

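How to test:

  • Bring up a cluster in TechPreview.
  • Check the status of the MachineConfiguration object named "cluster"; this should list the cluster default policies.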

$ oc get MachineConfiguration/cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  creationTimestamp: "2024-04-16T15:02:37Z"
  generation: 4
  name: cluster
  resourceVersion: "261205"
  uid: 2c67b155-1898-452f-adbd-ed376afc0ea2
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
      - actions:
        - type: None
        path: /etc/mco/internal-registry-pull-secret.json
      - actions:
        - type: None
        path: /var/lib/kubelet/config.json
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/policy.json
      - actions:
        - type: Special
        path: /etc/containers/registries.conf
      sshkey:
        actions:
        - type: None
  readyReplicas: 0

  • Apply a MachineConfiguration named "cluster" with a valid NodeDisruptionPolicy. More information about the policy and possible actions can be found here. Here is a sample policy:
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
spec:
  nodeDisruptionPolicy:
    files:
      - path: "/etc/my-file"
        actions:
          - type: None
    units:
      - name: "my.service"
        actions:
          - type: Restart
            restart:
              serviceName: crio.service
      - name: "test.service"
        actions:
          - type: DaemonReload
          - type: Drain
          - type: Reload
            reload:
              serviceName: crio.service
          - type: Restart
            restart:
              serviceName: my.service
    sshkey:
      actions:
      - type: Reboot
  • Check the status of the MachineConfiguration object; it should now list a merged policy that includes both the policies you specified and the defaults. If there are any conflicts, the user-specified policies override the cluster defaults.

status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
      - actions:
        - type: None
        path: /etc/mco/internal-registry-pull-secret.json
      - actions:
        - type: None
        path: /var/lib/kubelet/config.json
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/policy.json
      - actions:
        - type: Special
        path: /etc/containers/registries.conf
      - actions:
        - type: None
        path: /etc/my-file
      sshkey:
        actions:
        - type: Reboot
      units:
      - actions:
        - restart:
            serviceName: crio.service
          type: Restart
        name: my.service
      - actions:
        - type: DaemonReload
        - type: Drain
        - reload:
            serviceName: crio.service
          type: Reload
        - restart:
            serviceName: my.service
          type: Restart
        name: test.service

In this case, the sshkey section has been overridden.
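The override behavior shown in the status above can be sketched roughly as follows. This is only an illustration and not the operator's actual code: the types `FilePolicy`, `UnitPolicy`, `Policy` and the function `mergePolicies` are hypothetical stand-ins for the real API types, and actions are simplified to plain strings.

```go
package main

import "fmt"

// FilePolicy, UnitPolicy and Policy are simplified, hypothetical stand-ins
// for the real node disruption policy types; actions are plain strings here.
type FilePolicy struct {
	Path    string
	Actions []string
}

type UnitPolicy struct {
	Name    string
	Actions []string
}

type Policy struct {
	Files  []FilePolicy
	Units  []UnitPolicy
	SSHKey []string // nil means "not set by this source"
}

// mergePolicies layers user entries on top of the cluster defaults.
// A user entry replaces a default with the same path or unit name;
// user entries without a matching default are appended after the defaults.
func mergePolicies(defaults, user Policy) Policy {
	merged := Policy{SSHKey: defaults.SSHKey}
	if user.SSHKey != nil {
		merged.SSHKey = user.SSHKey
	}

	userFiles := make(map[string]FilePolicy, len(user.Files))
	for _, f := range user.Files {
		userFiles[f.Path] = f
	}
	usedFiles := make(map[string]bool)
	for _, f := range defaults.Files {
		if override, ok := userFiles[f.Path]; ok {
			f = override
			usedFiles[f.Path] = true
		}
		merged.Files = append(merged.Files, f)
	}
	for _, f := range user.Files {
		if !usedFiles[f.Path] {
			merged.Files = append(merged.Files, f)
		}
	}

	userUnits := make(map[string]UnitPolicy, len(user.Units))
	for _, u := range user.Units {
		userUnits[u.Name] = u
	}
	usedUnits := make(map[string]bool)
	for _, u := range defaults.Units {
		if override, ok := userUnits[u.Name]; ok {
			u = override
			usedUnits[u.Name] = true
		}
		merged.Units = append(merged.Units, u)
	}
	for _, u := range user.Units {
		if !usedUnits[u.Name] {
			merged.Units = append(merged.Units, u)
		}
	}
	return merged
}

func main() {
	defaults := Policy{
		Files:  []FilePolicy{{Path: "/etc/containers/registries.conf", Actions: []string{"Special"}}},
		SSHKey: []string{"None"},
	}
	user := Policy{
		Files:  []FilePolicy{{Path: "/etc/my-file", Actions: []string{"None"}}},
		Units:  []UnitPolicy{{Name: "test.service", Actions: []string{"DaemonReload", "Drain"}}},
		SSHKey: []string{"Reboot"},
	}
	fmt.Printf("%+v\n", mergePolicies(defaults, user))
}
```

In this sketch the user's sshkey actions replace the default, and `/etc/my-file` lands after the default file entries, which matches the ordering in the status output.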

  • Now, apply a new MachineConfig with a change that is covered by the currently defined node disruption policy. In this case, I applied a config which had changes to the test.service unit. Observe the daemon logs on the targeted node; you should see the daemon step through the designated actions.
I0417 18:20:47.614397    2551 update.go:2578] Starting update from rendered-infra-5d1e3cebfddcee59ac7bde56152e0919 to rendered-infra-a17773854499b91cb5cc86087b54cfe8: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
I0417 18:20:47.624217    2551 update.go:717] Calculating node disruption actions
I0417 18:20:47.624251    2551 update.go:642] NodeDisruptionPolicy found for diff unit test.service!
I0417 18:20:47.624261    2551 update.go:1023] Calculated node disruption actions:
I0417 18:20:47.624271    2551 update.go:1030] DaemonReload
I0417 18:20:47.624284    2551 update.go:1030] Drain
I0417 18:20:47.624298    2551 update.go:1026] Reload - crio.service
I0417 18:20:47.624313    2551 update.go:1028] Restart - my.service
I0417 18:20:47.624320    2551 drain.go:121] Checking drain required for node disruption actions
I0417 18:20:47.624328    2551 update.go:1060] Drain calculated for node disruption: true
...
...
...
I0417 18:21:30.264732    2551 update.go:2578] Executing performPostConfigChangeNodeDisruptionAction(drain already complete/skipped) for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.266752    2551 update.go:2578] Executing postconfig action: DaemonReload for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.268793    2551 update.go:2563] Running: systemctl daemon-reload
I0417 18:21:30.645513    2551 update.go:2578] daemon-reload service reloaded successfully!
I0417 18:21:30.647826    2551 update.go:2578] Executing postconfig action: Drain for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.649623    2551 update.go:2578] Executing postconfig action: Reload for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.651303    2551 update.go:2563] Running: systemctl reload crio.service
I0417 18:21:30.688445    2551 update.go:2578] crio.service service reloaded successfully!
I0417 18:21:30.690534    2551 update.go:2578] Executing postconfig action: Restart for config rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.692379    2551 update.go:2563] Running: systemctl restart my.service
I0417 18:21:30.711817    2551 update.go:2578] my.service service restarted successfully!
I0417 18:21:30.720163    2551 daemon.go:1574] Previous boot ostree-finalize-staged.service appears successful
I0417 18:21:30.720187    2551 daemon.go:1697] Current config: rendered-infra-5d1e3cebfddcee59ac7bde56152e0919
I0417 18:21:30.720192    2551 daemon.go:1698] Desired config: rendered-infra-a17773854499b91cb5cc86087b54cfe8
I0417 18:21:30.720202    2551 daemon.go:1706] state: Done

Some things to note

  • If any of the changes result in a reboot action, all other policies will be ignored.
  • There is no dedup of the final actions list. We don't expect users to often make MC changes that cause multiple policies to be in effect for a single update, so this didn't seem like a pressing need.
  • There is a special exclusion for files defined in directory /etc/containers/registries.d/ to account for some recent changes from OCPNODE-1632: Support ClusterImagePolicy CRD #4160 (comment). In the future, the MCO plans to add directories and wildcard support. With that in place, this exception can be removed and made part of the standard policy.
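
As a rough illustration of the three notes above, here is a minimal sketch of how the actions for a set of changed files might be collected. It is not the daemon's code: `calculateActions` and its parameters are hypothetical, actions are plain strings, and two points are assumptions rather than things stated in this PR: that a changed path with no matching policy falls back to a reboot, and which actions the registries.d exclusion maps to (so the sketch takes them as a parameter).

```go
package main

import (
	"fmt"
	"strings"
)

// registriesDir is the directory named in the special exclusion.
const registriesDir = "/etc/containers/registries.d/"

// calculateActions is a hypothetical helper. filePolicies maps an exact path
// to its actions; dirActions are the actions used for files under
// registriesDir (assumption: the caller decides what those are).
func calculateActions(filePolicies map[string][]string, dirActions []string, changedPaths []string) []string {
	var actions []string
	for _, p := range changedPaths {
		var matched []string
		if pol, ok := filePolicies[p]; ok {
			matched = pol
		} else if strings.HasPrefix(p, registriesDir) {
			// Special exclusion: prefix match instead of an exact path.
			matched = dirActions
		} else {
			// Assumption: no policy for this path means a reboot.
			return []string{"Reboot"}
		}
		for _, a := range matched {
			if a == "Reboot" {
				// A reboot makes every other policy irrelevant.
				return []string{"Reboot"}
			}
		}
		// No dedup: repeated actions are kept as-is.
		actions = append(actions, matched...)
	}
	return actions
}

func main() {
	policies := map[string][]string{
		"/etc/containers/policy.json": {"Reload"},
		"/etc/my-file":                {"None"},
	}
	changed := []string{"/etc/containers/policy.json", "/etc/containers/registries.d/example.yaml"}
	fmt.Println(calculateActions(policies, []string{"Reload"}, changed))
}
```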

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2024

openshift-ci bot commented Mar 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2024
@djoshy djoshy force-pushed the parse-node-disrupt branch 2 times, most recently from 7afa70a to 1af0015 Compare March 19, 2024 20:37
@djoshy djoshy changed the title DNM: testing new CRDs MCO-1009: Validate user-provided configuration and merge with cluster-defaults Mar 25, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 25, 2024

openshift-ci-robot commented Mar 25, 2024

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

@djoshy djoshy force-pushed the parse-node-disrupt branch 4 times, most recently from 5d9737d to 5fa5f95 Compare March 29, 2024 14:23
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2024
@djoshy djoshy force-pushed the parse-node-disrupt branch 4 times, most recently from 25cf96d to ee568f2 Compare March 29, 2024 17:43
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 29, 2024
@djoshy djoshy force-pushed the parse-node-disrupt branch from ee568f2 to 7f3d08b Compare April 10, 2024 16:19

openshift-ci-robot commented Apr 10, 2024

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

This PR does the following:

  • The operator now checks the MachineConfiguration global knob for any user provided node disruption policies, merges them with cluster defaults and displays the results in the Status of the MachineConfiguration object.
  • The daemon looks at the status of the MachineConfiguration object during node updates.
  • All of this is gated on NodeDisruption feature gate

Note: For the daemon, I've added a special exclusion for this case for #4160 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy djoshy changed the title MCO-1009: Validate user-provided configuration and merge with cluster-defaults MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy Apr 10, 2024
@djoshy djoshy changed the title MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy MCO-1009: MCO-1008: MCO-905: Implement NodeDisruptionPolicy API Apr 10, 2024
@djoshy djoshy force-pushed the parse-node-disrupt branch 2 times, most recently from e415e94 to 4ec0517 Compare April 10, 2024 18:40
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2024

djoshy commented Apr 10, 2024

/test unit
/test verify

@djoshy djoshy marked this pull request as ready for review April 10, 2024 19:13
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2024
@djoshy djoshy force-pushed the parse-node-disrupt branch from 8bbfba2 to f3d90b0 Compare April 22, 2024 09:52

openshift-ci-robot commented Apr 22, 2024

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

This PR does the following:

  • The operator now checks the MachineConfiguration global knob for any user provided node disruption policies, merges them with cluster defaults and displays the results in the Status of the MachineConfiguration object.
  • The daemon looks at the status of the MachineConfiguration object during node updates.
  • All of this is gated on NodeDisruption feature gate

How to test:

  • Bring up a cluster in TechPreview.
  • Check the status of the MachineConfiguration object named "cluster"; this should list the cluster default policies.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

djoshy added 2 commits April 22, 2024 17:25
This enables the Operator to read in user defined node disruption
policies, merge them with cluster defaults and then update the status
on the MachineConfiguration object. This is only done when
FeatureGateNodeDisruptionPolicy is enabled.

This change makes the daemon read in NodeDisruptionPolicyStatus during
an update and queue up the appropriate actions that the policy requests.
No deduping is being done for the policies currently. This will not take
place during firstboot as the featureGateAccessor does not exist then.
@djoshy djoshy force-pushed the parse-node-disrupt branch from f3d90b0 to 103bdc1 Compare April 22, 2024 23:53

djoshy commented Apr 23, 2024

Made some additional fixes related to the transition to the TechPreview featureset. In my tests, I noticed that the operator would not get far enough to actually render the NodeDisruptionPolicyStatus, as it was "stuck" in a mid feature gate state while the entire cluster was transitioning. Meanwhile the nodes would get assigned a new config to switch to, but with the 10 minute timeout I placed earlier, it took them a long time to transition to the new config, as they first had to hit the timeout and then fall back to the legacy update path. To clean up this UX, I did the following:

  • moved up syncMachineConfiguration in the operator's sync functions so NodeDisruptionPolicyStatus is rendered earlier in the operator lifecycle
  • dropped the timeout back to 2 minutes, after which the daemon will execute the old update path, not respecting node disruption policies.

With the above fixes, the only variable to account for is the operator's lease acquire time; once the lease is acquired, the status will be populated and the daemons can use it to orchestrate updates. The two minute timeout is largely arbitrary. During the "feature gate" transition update, some nodes may have to fall back to the legacy path if the operator is not fast enough to populate the NodeDisruptionPolicyStatus. Once the cluster has gone through the "feature gate" transition update, all future MC updates should respect the node disruption policy of the cluster.

Note: This is only applicable for cases where the feature gate was applied after the cluster was installed with a "Default" featureset. If the cluster featureset was TechPreview at install time, this issue would not happen; it is the transition that gunked up the process.
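
For readers who want a picture of the timeout behavior described above, here is a minimal sketch using only the Go standard library. It is not the daemon's implementation: `waitForStatus`, the `fetch` callback, and the string status are all hypothetical, and the short durations in `main` merely stand in for the two minute timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForStatus polls fetch until it reports a populated status or the
// timeout expires. fetch is a hypothetical callback standing in for a read
// of the MachineConfiguration status.
func waitForStatus(fetch func() (string, bool), interval, timeout time.Duration) (string, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if status, ok := fetch(); ok {
			return status, true
		}
		select {
		case <-ctx.Done():
			// Timed out: the caller should take the legacy update path.
			return "", false
		case <-ticker.C:
			// Try again on the next tick.
		}
	}
}

func main() {
	// A status that is never populated, as during the feature gate transition.
	neverReady := func() (string, bool) { return "", false }

	if _, ok := waitForStatus(neverReady, 10*time.Millisecond, 50*time.Millisecond); !ok {
		fmt.Println("status not found in time, using legacy update path")
	}
}
```

A shorter timeout only bounds how long a node waits before falling back; it does not make the operator populate the status any sooner.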


openshift-ci-robot commented Apr 23, 2024

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

for _, policyFile := range clusterPolicies.Files {
	klog.V(4).Infof("comparing policy path %s to diff path %s", policyFile.Path, diffPath)
	if policyFile.Path == diffPath {
		klog.Infof("NodeDisruptionPolicy found for diff file %s!", diffPath)

@sinnykumari sinnykumari Apr 23, 2024

nit: Perhaps we should drop the ! at the end, since it could be misread as part of diffPath.

Contributor Author

ack, will fix alongside anything QE might find on next pass. Thanks!
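
A minimal sketch of the exact-path lookup discussed in this thread follows. `FilePolicy` and `findFilePolicy` are hypothetical names for illustration only, and the log line ends the path with a colon instead of "!" as one possible way to address the nit; the actual fix may differ.

```go
package main

import "fmt"

// FilePolicy is a hypothetical, simplified policy entry: a file path and
// the names of the actions to run when that file changes.
type FilePolicy struct {
	Path    string
	Actions []string
}

// findFilePolicy returns the actions of the first policy whose path is
// exactly equal to diffPath. There is no directory or wildcard matching.
func findFilePolicy(policies []FilePolicy, diffPath string) ([]string, bool) {
	for _, p := range policies {
		if p.Path == diffPath {
			return p.Actions, true
		}
	}
	return nil, false
}

func main() {
	policies := []FilePolicy{
		{Path: "/etc/containers/policy.json", Actions: []string{"Reload crio.service"}},
		{Path: "/etc/my-file", Actions: []string{"None"}},
	}

	diffPath := "/etc/my-file"
	if actions, ok := findFilePolicy(policies, diffPath); ok {
		// A colon after the path keeps punctuation from being read as part of it.
		fmt.Printf("NodeDisruptionPolicy found for diff file %s: %v\n", diffPath, actions)
	}
}
```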

// If at any point an error occurs, we reboot the node so that node has correct configuration.
func (dn *Daemon) performPostConfigChangeNodeDisruptionAction(postConfigChangeActions []opv1.NodeDisruptionPolicyStatusAction, configName string) error {

	logSystem("Executing performPostConfigChangeNodeDisruptionAction(drain already complete/skipped) for config %s", configName)

@sinnykumari sinnykumari Apr 23, 2024

Perhaps we can just say "Performing post config change action".
We can skip drain already complete/skipped since this is already captured in https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/update.go#L793 if we drain.

Contributor Author

ack, yeah can clean this up. The reason I kept that marker is this loop will also roll through the Drain action, but I can put in a no-op condition message for that (:
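
The loop described in this reply can be sketched as follows, assuming a simplified `Action` type and a hypothetical `planPostConfigSteps` helper (neither is the MCO's real code). Instead of running anything, the sketch returns the steps it would take, with Drain handled as a no-op because draining happens earlier in the update. The `systemctl` invocations mirror the ones shown in the daemon logs in the PR description.

```go
package main

import "fmt"

// Action is a hypothetical, simplified node disruption action.
type Action struct {
	Type    string
	Service string
}

// planPostConfigSteps walks the calculated actions in order and returns a
// description of what each one would do. Drain is a no-op here because the
// drain, if required, is assumed to have completed before this loop runs.
func planPostConfigSteps(actions []Action) []string {
	var steps []string
	for _, a := range actions {
		switch a.Type {
		case "Drain":
			steps = append(steps, "drain already handled; nothing to do")
		case "DaemonReload":
			steps = append(steps, "systemctl daemon-reload")
		case "Reload":
			steps = append(steps, "systemctl reload "+a.Service)
		case "Restart":
			steps = append(steps, "systemctl restart "+a.Service)
		case "None":
			steps = append(steps, "no action required")
		default:
			steps = append(steps, "unhandled action type: "+a.Type)
		}
	}
	return steps
}

func main() {
	// The action list from the test.service example in the PR description.
	actions := []Action{
		{Type: "DaemonReload"},
		{Type: "Drain"},
		{Type: "Reload", Service: "crio.service"},
		{Type: "Restart", Service: "my.service"},
	}
	for _, step := range planPostConfigSteps(actions) {
		fmt.Println(step)
	}
}
```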


@sinnykumari sinnykumari left a comment

A few minor nits, which are optional.
Other than that, this looks great. Nice work David!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 23, 2024
@openshift-ci

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [djoshy,sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rioliu-rh

Pre-merge testing result:
Verified the following functionality:

  • files actions
    • none
    • reboot
    • restart
    • reload
    • drain
    • daemon reload
    • multi-actions
  • units actions
    • none
    • reboot
    • multi-actions
  • sshkey actions
    • none
    • reboot
    • multi-actions

Issues:

  • OCPBUGS-32511 NodeDisruptionPolicyStatus was not ready context deadline exceeded
  • OCPBUGS-32739 MachineConfigurations is only effective with name
  • OCPBUGS-32783 NodeDisruptionPolicy action reload cannot take effect

Test cases are linked to epic MCO-507
No merge blockers. The issues can be fixed in new PRs.

/label qe-approved
/unhold

@openshift-ci openshift-ci bot added qe-approved Signifies that QE has signed off on this PR and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 25, 2024
@openshift-ci-robot

openshift-ci-robot commented Apr 25, 2024

@djoshy: This pull request references MCO-1009 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d148f30 and 2 for PR HEAD 103bdc1 in total

@djoshy

djoshy commented Apr 25, 2024

/retest-required

@openshift-ci

openshift-ci bot commented Apr 25, 2024

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Details  Required  Rerun command
ci/prow/okd-scos-e2e-aws-ovn  103bdc1  link     false     /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@djoshy

djoshy commented Apr 25, 2024

/override ci/prow/e2e-hypershift

Hypershift failures are unrelated and being fixed. Since the PR has passed it in the past, I'm overriding it.

@sinnykumari

It looks like openshift/hypershift#3938 doesn't help much with the HyperShift failure. The recent failure on this PR doesn't look MCO-specific. Also, we had a successful run on this PR with the current changes in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4267/pull-ci-openshift-machine-config-operator-master-e2e-hypershift/1782558388421398528 . This PR should be safe to go in. Overriding the e2e-hypershift test to get this feature work merged.
/override ci/prow/e2e-hypershift

@openshift-ci

openshift-ci bot commented Apr 25, 2024

@sinnykumari: Overrode contexts on behalf of sinnykumari: ci/prow/e2e-hypershift


@openshift-ci

openshift-ci bot commented Apr 25, 2024

@djoshy: Overrode contexts on behalf of djoshy: ci/prow/e2e-hypershift


@openshift-merge-bot openshift-merge-bot bot merged commit 9b99954 into openshift:master Apr 25, 2024
@openshift-bot

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.16.0-202404251110.p0.g9b99954.assembly.stream.el9 for distgit ose-machine-config-operator.
All builds following this will include this PR.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
