This repository has been archived by the owner on Oct 22, 2024. It is now read-only.

Reload deployment state #610

Merged
merged 6 commits into intel:operator from reload-deployment-state on Apr 24, 2020

Conversation

avalluri
Contributor

This group of commits implements reconciling deployment changes for both known (cached) deployments and unknown deployments.

@avalluri force-pushed the reload-deployment-state branch from f164cb7 to 2c5cd30 on April 21, 2020 23:52
@@ -88,7 +91,18 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
// Deployment successful, so no more reconcile needed for this deployment
return false, nil
case api.DeploymentPhaseRunning:
requeue, err := d.reconcileDeploymentChanges(r, oldDeployment, changes)
if !foundInCache {
Contributor

Why is this "not found in cache" logic applied only to "phase == running"? Consider the following sequence of events:

  • new deployment object
  • sub-objects created
  • setting status fails
  • operator gets stopped
  • deployment object gets updated
  • operator gets restarted

Contributor Author

I made changes to handle this case. Now we handle all phases the same and pass them through the same reconcile logic, so that we "refresh" the sub-objects regardless of their phase.
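As an illustration of that approach (a sketch only, not the actual diff; redeployObjects is a hypothetical helper standing in for the operator's per-object create/update loop):

func (d *PmemCSIDriver) reconcileAnyPhase(r *ReconcileDeployment) (bool, error) {
    // Regardless of the recorded phase, rebuild the desired sub-objects and
    // apply them; objects that already match the spec are effectively a no-op,
    // so this also covers a deployment the restarted operator has never cached.
    if err := d.redeployObjects(r); err != nil {
        return true, err // requeue and retry
    }
    return false, nil
}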


for _, obj := range objects {
	// Services need special treatment as they have some immutable field(s),
	// so we cannot refresh the existing one with a new service object.
Contributor

Which fields trigger this problem? Do we need to modify them?

Contributor Author

> Which fields trigger this problem? Do we need to modify them?

Service.ClusterIP, which is auto-generated and cannot be changed by an update.

Contributor

I see. Okay, so let's keep this code and worry about a more generic solution later.
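For reference, a minimal sketch of that special treatment (a hypothetical helper written against a plain controller-runtime client, not the operator's actual code): only the mutable parts of the existing Service are overwritten, so the auto-assigned ClusterIP stays untouched.

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// refreshService copies only the fields we own onto the existing object;
// Spec.ClusterIP is left as-is, so the update is not rejected.
func refreshService(ctx context.Context, c client.Client, desired *corev1.Service) error {
    existing := &corev1.Service{}
    key := client.ObjectKey{Namespace: desired.Namespace, Name: desired.Name}
    if err := c.Get(ctx, key, existing); err != nil {
        return err
    }
    existing.Spec.Ports = desired.Spec.Ports
    existing.Spec.Selector = desired.Spec.Selector
    return c.Update(ctx, existing)
}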

}
existingService.Spec.Selector = s.Spec.Selector
klog.Infof("updating service '%s' service ports and selector", s.GetName())
if err := r.Update(existingService); err != nil {
Contributor

I'm worried that "create from scratch" and "update existing" now do not lead to the same end state.

Contributor Author

@pohly I don't think I fully understand your concern; what would you expect me to do here?

Contributor

I'm worried that this update path needs to be maintained manually and may end up missing some change. But don't worry about it now.


By("Wait till driver objects get updated")
// FIXME(avalluri): find a better way of ensuring the sub-objects got updated
// We can't use validateDriverDeployment() inside an Eventually() as it
Contributor

This is partly why using assertions in sub-functions is an anti-pattern: it prevents reuse of the function.

The other reason is that assertions are attributed to the sub-function instead of the test itself unless extra care is taken to skip stack frames.

Contributor

You can use validateDriverDeployment unchanged here by adding a defer which handles the failure:

Eventually(func() (success bool) {
    defer func() {
        // A failing assertion inside validateDriverDeployment panics;
        // recover turns that into a false return value instead of
        // aborting the whole test.
        if r := recover(); r != nil {
            success = false
        }
    }()

    validateDriverDeployment()
    return true
})

This is a stop-gap measure until we have a better validateDriverDeployment.

The problem however will be that if Eventually times out, we don't have any record of why it timed out, i.e. what the result of the last validateDriverDeployment() was.

Contributor Author

Thanks @pohly, nice workaround for the issue. I made the suggested change.

@@ -731,6 +826,8 @@ func validateDriverDeployment(f *framework.Framework, d *deploy.Deployment, expe
	Expect(crb.Subjects[0].Name).Should(BeEquivalentTo(saName), "cluster role binding should have a valid service account")
	_, err = f.ClientSet.RbacV1().Roles(d.Namespace).Get(context.Background(), rb.RoleRef.Name, metav1.GetOptions{})
	Expect(err).ShouldNot(HaveOccurred(), "roles should have been defined")

	return true
Contributor

This isn't used anywhere, right?

Contributor Author

Yes, right. It was leftover from the trial & error phase. I will revert.

@@ -81,6 +81,8 @@ type DeploymentStatus struct {

	// Phase indicates the state of the deployment
	Phase DeploymentPhase `json:"phase,omitempty"`
	// LastUpdated time of the deployment status
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
Contributor

Long term we should reconsider what we put into our status and align with kubernetes/enhancements#1624.

Instead of relying on time, observedGeneration (mentioned in that KEP discussion without further explanation) may be better because it is immune to time shifts.

Okay for now.
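To illustrate that direction (a sketch, not the current API), a status that records the last reconciled generation instead of a timestamp could look like:

type DeploymentStatus struct {
    // Phase indicates the state of the deployment
    Phase DeploymentPhase `json:"phase,omitempty"`
    // ObservedGeneration is the metadata.generation most recently reconciled
    // by the operator; unlike a timestamp it is immune to clock shifts.
    ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}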

@avalluri force-pushed the reload-deployment-state branch 3 times, most recently from c8f862d to b6fad9d on April 23, 2020 14:01
@pohly
Contributor

pohly commented Apr 24, 2020

@avalluri please rebase. I've not done an in-depth review again, but it looks like my concerns have been addressed, so let's merge it.

Added a new status field 'lastUpdated' that holds the timestamp of when the
deployment was last updated, in other words, when the deployment was last
reconciled.
When reconciling a running/failed deployment after an operator restart we might
not have any previous revision of that deployment in our cache to compute a
diff against. One thing we could do in this case is force a refresh of all
sub-objects of that deployment with the new revision.

But we might miss incompatible changes made to that deployment while the
operator was absent. To find those changes we could retrieve the values from
the pre-deployed driver objects of that deployment.

We chose the "refresh" strategy to support operator upgrades, which might
change sub-objects in ways that are not visible in the deployment API. One
example is the command-line arguments of the sidecar containers.
This is useful for configuring the operator to use a specific image as the
default driver image. One use case is running the operator as a standalone
binary for quick testing during development.
client.Get() fails if the namespace is set for a cluster-scoped object. Hence
we set the namespace only in the case of namespace-scoped objects.
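
A minimal sketch of that rule (the helper name is made up; client is
sigs.k8s.io/controller-runtime/pkg/client):

// objectKey sets the namespace only for namespace-scoped objects;
// for cluster-scoped objects the key carries just the name.
func objectKey(name, namespace string, namespaced bool) client.ObjectKey {
    key := client.ObjectKey{Name: name}
    if namespaced {
        key.Namespace = namespace
    }
    return key
}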
While updating an existing object we copy the metadata from the read object
into our new copy so that UpdateOrCreate() can be type agnostic. This does not
reset the status fields of the existing object, as client.Update() does not
reset the status.

This also changes the semantics of the Create() call such that it no longer
checks whether the object being created already exists, so it now returns an
error if the object already exists. This revealed an issue in the unit tests,
which reused an existing CSIDriver that had been created for a deleted
deployment.

Took the opportunity to make Create()/Update() more verbose.
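
Roughly, the helper described above could look like this sketch (written
against a recent controller-runtime; the name and the exact set of copied
metadata fields are illustrative, not the operator's implementation):

import (
    "context"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func updateOrCreate(ctx context.Context, c client.Client, obj client.Object) error {
    existing := obj.DeepCopyObject().(client.Object)
    err := c.Get(ctx, client.ObjectKeyFromObject(obj), existing)
    if apierrors.IsNotFound(err) {
        // Create() no longer checks for existence itself, so it returns an
        // error if the object already exists.
        return c.Create(ctx, obj)
    }
    if err != nil {
        return err
    }
    // Copy metadata from the read object so that client.Update() accepts our
    // modified copy; the status of the existing object is left untouched.
    obj.SetResourceVersion(existing.GetResourceVersion())
    obj.SetUID(existing.GetUID())
    return c.Update(ctx, obj)
}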
This adds the missing certificates check to the reconcile loop, so that it
updates the secrets if any such change is detected.
@avalluri force-pushed the reload-deployment-state branch from b6fad9d to d2bc2f1 on April 24, 2020 07:40
@avalluri
Contributor Author

> @avalluri please rebase. I've not done an in-depth review again, but it looks like my concerns have been addressed, so let's merge it.

Done. Shall we wait for the final test results?

@pohly
Contributor

pohly commented Apr 24, 2020

> Done. Shall we wait for the final test results?

GitHub still shows conflicts. Did you pull before rebasing?

@pohly
Contributor

pohly commented Apr 24, 2020

> GitHub still shows conflicts. Did you pull before rebasing?

Never mind. Now it no longer does.

@pohly merged commit 4ea70c5 into intel:operator on Apr 24, 2020
@avalluri deleted the reload-deployment-state branch on October 20, 2020 20:34