Skip to content

Conversation

@sebsoto
Copy link
Contributor

@sebsoto sebsoto commented Apr 21, 2020

This PR includes various commits to clean up the how the operator handles errors both in the creation of the controller, and the creation of nodes.

@aravindhp aravindhp changed the title Operator error handling [wmco] Improve error handling Apr 21, 2020
@sebsoto sebsoto force-pushed the operator_changes branch 2 times, most recently from 8ee7ea7 to 3251715 Compare April 22, 2020 14:45
Copy link
Contributor

@VaishnaviHire VaishnaviHire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @sebsoto . Please address my comments

Copy link
Contributor

@aravindhp aravindhp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @sebsoto. Please address my comments.

windowsVMs := make(map[types.WindowsVM]bool)
vmTracker, err := tracker.NewTracker(clientset, windowsVMs)
if err != nil {
return nil, errors.Wrap(err, "tracker instantiation failed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failed to instantiate tracker

Copy link
Member

@mansikulkarni96 mansikulkarni96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @sebsoto. PTAL at my comments.

}

// Terminate the instance via its instance id
id := vm.GetCredentials().GetInstanceId()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename id to instancedID for specific naming

sebsoto added 2 commits April 22, 2020 17:07
This commit changes the creation of the WMC controller to be more
diligent about handling errors. This commit changes two things
* If the kubernetes clientset cannot be created, that error will be
handled instead of just being logged.
* Error messages around the previous addition have been made more verbose
This commit moves the creation of our node tracker out of the reconcile
function to where we create the reconciler. This is being done to create
a distinction between the setup of the controller and the functionality
of the controller, allowing for cleaner error handling.
@aravindhp
Copy link
Contributor

/retest

Testing openshift/release#8323

@aravindhp
Copy link
Contributor

aravindhp commented Apr 22, 2020

@sebsoto when you get a chance please introduce a failure in one of the e2e tests. I want to see if we will get to see the test output in that case. It looks like with the step registry e2e test output is not show on success.

@sebsoto sebsoto changed the title [wmco] Improve error handling [WIP] [wmco] Improve error handling Apr 23, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2020
@sebsoto sebsoto force-pushed the operator_changes branch 3 times, most recently from e3569ae to 7f0f8ba Compare April 23, 2020 20:24
@sebsoto sebsoto changed the title [WIP] [wmco] Improve error handling [wmco] Improve error handling Apr 23, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2020
@sebsoto sebsoto force-pushed the operator_changes branch 2 times, most recently from 8671abb to 260d1e8 Compare April 24, 2020 14:23
@mansikulkarni96
Copy link
Member

LGTM


// deleteWindowsVMs deletes the required number of Windows VMs from the cluster and returns a bool indicating the
// status of deletion. This method will return false if any of the VMs fail to get deleted.
// remove removes the given vm from the list of VMs and terminates the underlying VM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be pedantic I would change order. i.e. terminate first and then remove from the list.

How about renaming the function to removeWorkerNode() or removeWorker and sync with rest of the functions? addWorker addWorkers etc.

I also think we don't need to call out that it is Windows as it obvious given our operator is WMCO :-) But I am open to suggestions.

// TODO: This method should return a slice of errors that we collected.
// Jira story: https://issues.redhat.com/browse/WINC-266
func (r *ReconcileWindowsMachineConfig) deleteWindowsVMs(count int) bool {
func (r *ReconcileWindowsMachineConfig) removeWindowsWorkerNodes(count int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the previous comment this would become removeWorkerNodes

}

// createWindowsVMs creates the required number of windows Windows VM and configure them to make
// addWindowsWorker creates a new Windows VM and configures it, adding it as a node object to the cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addWorkerNode()

if err != nil {
return nil, errors.Wrap(err, "error creating windows VM")
}
log.V(1).Info("created a new Windows VM", "ID", vm.GetCredentials().GetInstanceId())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this as the next debug message implies the VM was created.

// TODO: This method should return a slice of errors that we collected.
// Jira story: https://issues.redhat.com/browse/WINC-266
func (r *ReconcileWindowsMachineConfig) createWindowsWorkerNodes(count int) bool {
func (r *ReconcileWindowsMachineConfig) addWindowsWorkerNodes(count int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addWorkerNode

go.mod Outdated
Comment on lines 35 to 36
github.com/openshift/api v3.9.1-0.20190924102528-32369d4db2ad+incompatible
github.com/openshift/client-go v0.0.0-20190923180330-3b6373338c9b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot be pinned to 3.9. That version is too old.

// tracker is used to track all the Windows nodes created via WMCO
tracker *tracker.Tracker
// nodeConfigurer is what is used to configure the created Windows VMs
nodeConfigurer nodeconfig.Interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please call this nodeConfig

)

type Interface interface {
Configure(types.WindowsVM) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to pass in types.WindowsVM? Can't you create a mocked version of nodeConfig?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do create a mocked version, but in order to use the mocked version within the reconciler we need to pass in the nodeconfig entity in some way. As it was before the regular nodeconfig was being created within the reconciliation process. That doesn't give us any way to switch to using the mocked one.

By making the nodeconfig part of the reconciler we can switch between using the regular and mocked version. Passing in the types.WindowsVM becomes necessary as we will now be using the same nodeconfig for multiple VMs, instead of a new one for every VM. We need to pass in the VM to select which VM will be configured.

k8sclientset: clientset,
tracker: vmTracker,
windowsVMs: windowsVMs,
nodeConfigurer: nodeconfig.NewNodeConfig(clientset),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not able to understand why you need to do this. I don't see the utility of having generic node configuration entities

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I explained this in the above message, let me know if this is a separate issue.

Copy link
Contributor

@aravindhp aravindhp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, @sebsoto. Just a few changes in the error messages.

Currently VMs that fail to become a node are left orphaned. This commit
changes that, so that if a VM fails to become a node via bootstrapping
it is deleted properly. This is being done so that we can properly track
the status of the VMs we create.
Copy link
Contributor

@aravindhp aravindhp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Thanks for adding this, @sebsoto

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 28, 2020
@aravindhp
Copy link
Contributor

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 28, 2020
@aravindhp
Copy link
Contributor

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2020
Copy link
Contributor

@ravisantoshgudimetla ravisantoshgudimetla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks for working on this PR @sebsoto

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 28, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aravindhp, ravisantoshgudimetla, sebsoto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [aravindhp,ravisantoshgudimetla]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sebsoto
Copy link
Contributor Author

sebsoto commented Apr 28, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 28, 2020
@openshift-merge-robot openshift-merge-robot merged commit 52a7d27 into openshift:master Apr 28, 2020
wgahnagl pushed a commit to wgahnagl/windows-machine-config-operator that referenced this pull request Mar 6, 2023
[logging] Install Fluentd on Windows instance
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants