This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Reload deployment state #610

Merged
merged 6 commits into intel:operator from reload-deployment-state on Apr 24, 2020

Conversation

avalluri
Contributor

This group of commits implements reconciling deployment changes for both known (cached) deployments and unknown deployments.

@avalluri force-pushed the reload-deployment-state branch from f164cb7 to 2c5cd30 on April 21, 2020 23:52
@@ -88,7 +91,18 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
// Deployment successful, so no more reconcile needed for this deployment
return false, nil
case api.DeploymentPhaseRunning:
requeue, err := d.reconcileDeploymentChanges(r, oldDeployment, changes)
if !foundInCache {
Contributor

Why is this "not found in cache" logic applied only to "phase == running"? Consider the following sequence of events:

  • new deployment object
  • sub-objects created
  • setting status fails
  • operator gets stopped
  • deployment object gets updated
  • operator gets restarted

Contributor Author

I made changes to handle this case. Now we handle all phases the same and pass them through the same reconcile logic, so that we "refresh" the sub-objects regardless of their phase.
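As an illustration of that approach (a sketch only, not the actual diff; redeployObjects is a hypothetical helper standing in for the operator's per-object create/update loop):

func (d *PmemCSIDriver) reconcileAnyPhase(r *ReconcileDeployment) (bool, error) {
    // Regardless of the recorded phase, rebuild the desired sub-objects and
    // apply them; objects that already match the spec are effectively a no-op,
    // so this also covers a deployment the restarted operator has never cached.
    if err := d.redeployObjects(r); err != nil {
        return true, err // requeue and retry
    }
    return false, nil
}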


for _, obj := range objects {
	// Services need special treatment as they have some immutable field(s),
	// so we cannot refresh the existing one with a new service object.
Contributor

Which fields trigger this problem? Do we need to modify them?

Contributor Author

> Which fields trigger this problem? Do we need to modify them?

Service.ClusterIP, which is auto-generated and cannot be changed by an update.

Contributor

I see. Okay, so let's keep this code and worry about a more generic solution later.
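For reference, a minimal sketch of that special treatment (a hypothetical helper written against a plain controller-runtime client, not the operator's actual code): only the mutable parts of the existing Service are overwritten, so the auto-assigned ClusterIP stays untouched.

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// refreshService copies only the fields we own onto the existing object;
// Spec.ClusterIP is left as-is, so the update is not rejected.
func refreshService(ctx context.Context, c client.Client, desired *corev1.Service) error {
    existing := &corev1.Service{}
    key := client.ObjectKey{Namespace: desired.Namespace, Name: desired.Name}
    if err := c.Get(ctx, key, existing); err != nil {
        return err
    }
    existing.Spec.Ports = desired.Spec.Ports
    existing.Spec.Selector = desired.Spec.Selector
    return c.Update(ctx, existing)
}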

}
existingService.Spec.Selector = s.Spec.Selector
klog.Infof("updating service '%s' service ports and selector", s.GetName())
if err := r.Update(existingService); err != nil {
Contributor

I'm worried that "create from scratch" and "update existing" now do not lead to the same end state.

Contributor Author

@pohly I don't think I fully understand your concern; what would you expect me to do here?

Contributor

I'm worried that this update path needs to be maintained manually and may end up missing some change. But don't worry about it now.


By("Wait till driver objects get updated")
// FIXME(avalluri): find a better way of ensuring the sub-objects got updated
// We can't use validateDriverDeployment() inside an Eventually() as it
Contributor

This is partly why using assertions in sub-functions is an anti-pattern: it prevents reuse of the function.

The other reason is that assertions are attributed to the sub-function instead of the test itself unless extra care is taken to skip stack frames.

Contributor

You can use validateDriverDeployment unchanged here by adding a defer which handles the failure:

Eventually(func() (success bool) {
    defer func() {
        // A failing assertion inside validateDriverDeployment panics;
        // recover turns that into a false return value instead of
        // aborting the whole test.
        if r := recover(); r != nil {
            success = false
        }
    }()

    validateDriverDeployment()
    return true
})

This is a stop-gap measure until we have a better validateDriverDeployment.

The problem however will be that if Eventually times out, we don't have any record of why it timed out, i.e. what the result of the last validateDriverDeployment() was.

Contributor Author

Thanks @pohly, nice workaround for the issue. I made the suggested change.

@@ -731,6 +826,8 @@ func validateDriverDeployment(f *framework.Framework, d *deploy.Deployment, expe
	Expect(crb.Subjects[0].Name).Should(BeEquivalentTo(saName), "cluster role binding should have a valid service account")
	_, err = f.ClientSet.RbacV1().Roles(d.Namespace).Get(context.Background(), rb.RoleRef.Name, metav1.GetOptions{})
	Expect(err).ShouldNot(HaveOccurred(), "roles should have been defined")

	return true
Contributor

This isn't used anywhere, right?

Contributor Author

Yes, right. It was leftover from the trial & error phase. I will revert.

@@ -81,6 +81,8 @@ type DeploymentStatus struct {

	// Phase indicates the state of the deployment
	Phase DeploymentPhase `json:"phase,omitempty"`
	// LastUpdated time of the deployment status
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
Contributor

Long term we should reconsider what we put into our status and align with kubernetes/enhancements#1624.

Instead of relying on time, observedGeneration (mentioned in that KEP discussion without further explanation) may be better because it is immune to time shifts.

Okay for now.
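To illustrate that direction (a sketch, not the current API), a status that records the last reconciled generation instead of a timestamp could look like:

type DeploymentStatus struct {
    // Phase indicates the state of the deployment
    Phase DeploymentPhase `json:"phase,omitempty"`
    // ObservedGeneration is the metadata.generation most recently reconciled
    // by the operator; unlike a timestamp it is immune to clock shifts.
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}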

@avalluri force-pushed the reload-deployment-state branch 3 times, most recently from c8f862d to b6fad9d on April 23, 2020 14:01
@pohly
Contributor

pohly commented Apr 24, 2020

@avalluri please rebase. I've not done an in-depth review again, but it looks like my concerns have been addressed, so let's merge it.

Added a new status field 'lastUpdated' that holds the timestamp of when the
deployment was last updated, in other words, when the deployment was last
reconciled.
When reconciling a running/failed deployment after an operator restart we might
not have any previous revision of that deployment in our cache to compute a
diff against. One thing we could do in this case is force a refresh of all
sub-objects of that deployment with the new revision.

But we might miss incompatible changes made to that deployment while the
operator was absent. To find those changes we could retrieve the values from
the pre-deployed driver objects of that deployment.

We chose the "refresh" strategy to support operator upgrades, which might
change sub-objects in ways that are not visible in the deployment API. One
example is the command-line arguments of the sidecar containers.
This is useful for configuring the operator to use a specific image as the
default driver image. One use case is running the operator as a standalone
binary for quick testing during development.
client.Get() fails if the namespace is set for a cluster-scoped object. Hence
we set the namespace only in the case of namespace-scoped objects.
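
A minimal sketch of that rule (the helper name is made up; client is
sigs.k8s.io/controller-runtime/pkg/client):

// objectKey sets the namespace only for namespace-scoped objects;
// for cluster-scoped objects the key carries just the name.
func objectKey(name, namespace string, namespaced bool) client.ObjectKey {
    key := client.ObjectKey{Name: name}
    if namespaced {
        key.Namespace = namespace
    }
    return key
}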
While updating an existing object we copy the metadata from the read object
into our new copy so that UpdateOrCreate() can be type agnostic. This does not
reset the status fields of the existing object, as client.Update() does not
reset the status.

This also changes the semantics of the Create() call such that it no longer
checks whether the object being created already exists, so it now returns an
error if the object already exists. This revealed an issue in the unit tests,
which reused an existing CSIDriver that had been created for a deleted
deployment.

Took the opportunity to make Create()/Update() more verbose.
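
Roughly, the helper described above could look like this sketch (written
against a recent controller-runtime; the name and the exact set of copied
metadata fields are illustrative, not the operator's implementation):

import (
    "context"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func updateOrCreate(ctx context.Context, c client.Client, obj client.Object) error {
    existing := obj.DeepCopyObject().(client.Object)
    err := c.Get(ctx, client.ObjectKeyFromObject(obj), existing)
    if apierrors.IsNotFound(err) {
        // Create() no longer checks for existence itself, so it returns an
        // error if the object already exists.
        return c.Create(ctx, obj)
    }
    if err != nil {
        return err
    }
    // Copy metadata from the read object so that client.Update() accepts our
    // modified copy; the status of the existing object is left untouched.
    obj.SetResourceVersion(existing.GetResourceVersion())
    obj.SetUID(existing.GetUID())
    return c.Update(ctx, obj)
}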
This adds the missing certificates check to the reconcile loop, so that it
updates the secrets if any such change is detected.
@avalluri force-pushed the reload-deployment-state branch from b6fad9d to d2bc2f1 on April 24, 2020 07:40
@avalluri
Contributor Author

> @avalluri please rebase. I've not done an in-depth review again, but it looks like my concerns have been addressed, so let's merge it.

Done. Shall we wait for the final test results?

@pohly
Contributor

pohly commented Apr 24, 2020

> Done. Shall we wait for the final test results?

GitHub still shows conflicts. Did you pull before rebasing?

@pohly
Contributor

pohly commented Apr 24, 2020

> GitHub still shows conflicts. Did you pull before rebasing?

Never mind. Now it no longer does.

@pohly merged commit 4ea70c5 into intel:operator on Apr 24, 2020
@avalluri deleted the reload-deployment-state branch on October 20, 2020 20:34