Conversation
Force-pushed from f164cb7 to 2c5cd30
@@ -88,7 +91,18 @@ func (d *PmemCSIDriver) Reconcile(r *ReconcileDeployment) (bool, error) {
		// Deployment successfull, so no more reconcile needed for this deployment
		return false, nil
	case api.DeploymentPhaseRunning:
		requeue, err := d.reconcileDeploymentChanges(r, oldDeployment, changes)
		if !foundInCache {
Why is this "not found in cache" logic applied only to "phase == running"? Consider the following sequence of events:
- new deployment object
- sub-objects created
- setting status fails
- operator gets stopped
- deployment object gets updated
- operator gets restarted
I made changes to handle this case. Now we handle all phases similarly and pass them through the same reconcile logic, so that we "refresh" the sub-objects regardless of their phase.
	for _, obj := range objects {
		// Services needs special treatment as they have some immutable field(s)
		// So, we cannot refresh the existing one with new service object.
Which fields trigger this problem? Do we need to modify them?
Service.ClusterIP, which is auto-generated and immutable across updates.
I see. Okay, so let's keep this code and worry about a more generic solution later.
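As an illustration, a minimal sketch of how such an update could carry over the immutable field, assuming a plain controller-runtime client; the function name and parameters are illustrative, not the operator's actual helpers:

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateService carries over the auto-generated, immutable ClusterIP from the
// Service that already exists in the cluster before issuing the update, so
// that the API server does not reject the request. Sketch only.
func updateService(ctx context.Context, c client.Client, desired *corev1.Service) error {
	existing := &corev1.Service{}
	key := client.ObjectKey{Name: desired.Name, Namespace: desired.Namespace}
	if err := c.Get(ctx, key, existing); err != nil {
		return err
	}
	desired.Spec.ClusterIP = existing.Spec.ClusterIP // immutable once assigned
	desired.ResourceVersion = existing.ResourceVersion
	return c.Update(ctx, desired)
}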
		}
		existingService.Spec.Selector = s.Spec.Selector
		klog.Infof("updating service '%s' service ports and selector", s.GetName())
		if err := r.Update(existingService); err != nil {
I'm worried that "create from scratch" and "update existing" now do not lead to the same end state.
@pohly I'm not sure I fully understand your concern. What would you expect me to do here?
I'm worried that this update path needs to be maintained manually and may end up missing some change. But don't worry about it now.
test/e2e/operator/deployment_api.go
By("Wait till driver objects get updated") | ||
// FIXME(avalluri): find better way of ensuring the sub-objects got updated | ||
// We can't use validateDriverDeployment() inside a Eventually() as it |
This is partly why using assertions in sub-functions is an anti-pattern: it prevents reuse of the function.
The other reason is that assertions are attributed to the sub-function instead of the test itself unless extra care is taken to skip stack frames.
You can use validateDriverDeployment unchanged here by adding a defer which handles the failure:

Eventually(func() (success bool) {
	defer func() {
		if r := recover(); r != nil {
			success = false
		}
	}()
	validateDriverDeployment()
	return true
})
This is a stop-gap measure until we have a better validateDriverDeployment.
The problem however will be that if Eventually times out, we don't have any record of why it timed out, i.e. what the result of the last validateDriverDeployment() was.
Thanks @pohly, nice workaround for the issue. I made the suggested change.
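For reference, a hedged variant of that workaround which also keeps a record of why validation kept failing, so an Eventually timeout is not completely silent; framework.Logf and the timeout/poll values are assumptions, not what the test actually uses:

Eventually(func() (success bool) {
	defer func() {
		if r := recover(); r != nil {
			// Log the recovered assertion failure so the last reason is
			// visible in the test output even if we eventually time out.
			framework.Logf("driver deployment not valid yet: %v", r)
			success = false
		}
	}()
	validateDriverDeployment()
	return true
}, "3m", "5s").Should(BeTrue(), "driver sub-objects should get updated")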
test/e2e/operator/deployment_api.go
@@ -731,6 +826,8 @@ func validateDriverDeployment(f *framework.Framework, d *deploy.Deployment, expe
	Expect(crb.Subjects[0].Name).Should(BeEquivalentTo(saName), "cluster role binding should have a valid service account")
	_, err = f.ClientSet.RbacV1().Roles(d.Namespace).Get(context.Background(), rb.RoleRef.Name, metav1.GetOptions{})
	Expect(err).ShouldNot(HaveOccurred(), "roles should have been defined")

	return true
This isn't used anywhere, right?
Yes, right. It was leftover from the trial & error phase. I will revert.
@@ -81,6 +81,8 @@ type DeploymentStatus struct {

	// Phase indicates the state of the deployment
	Phase DeploymentPhase `json:"phase,omitempty"`
	// LastUpdated time of the deployment status
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
Long term we should reconsider what we put into our status and align with kubernetes/enhancements#1624.
Instead of relying on time, observedGeneration (mentioned in that KEP discussion without further explanation) may be better because it is immune to time shifts.
Okay for now.
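For illustration, a hedged sketch of what an observedGeneration based status could look like; the field follows the KEP's convention and is not part of the current PMEM-CSI API:

// Sketch: record which metadata.generation of the spec was last reconciled.
// A consumer can compare it with the object's current generation to decide
// whether the status is up to date, independent of clock skew.
type DeploymentStatus struct {
	// Phase indicates the state of the deployment
	Phase DeploymentPhase `json:"phase,omitempty"`
	// ObservedGeneration is the generation last processed by the operator
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

In the reconcile loop the operator would then set Status.ObservedGeneration from metadata.generation after a successful pass.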
Force-pushed from c8f862d to b6fad9d
@avalluri please rebase. I've not done an in-depth review again, but it looks like my concerns have been addressed, so let's merge it.
Added a new status field 'lastUpdated' that holds the timestamp of when the deployment was last updated, in other words the last time the deployment got reconciled.
When reconciling a running/failed deployment after an operator restart we might not have any previous revision of that deployment in our cache to diff against. One thing we could do in this case is force a refresh of all sub-objects of that deployment with the new revision, but then we might miss incompatible changes made to the deployment while the operator was absent. To find those changes we could retrieve the values from the driver objects already deployed for that deployment. We chose the "refresh" strategy to support operator upgrades, which might change sub-objects in ways that are not visible in the deployment API; one example is the command-line arguments of sidecar containers.
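A rough sketch of that refresh strategy, assuming a standard controller-runtime client; refreshObjects and its parameters are illustrative names, not the operator's real code:

package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// refreshObjects rewrites every sub-object from the current deployment spec
// instead of diffing against a cached revision that may no longer exist
// (e.g. after an operator restart or upgrade). Sketch only.
func refreshObjects(ctx context.Context, c client.Client, objects []client.Object) error {
	for _, obj := range objects {
		// Try to update the object in place; if it disappeared, recreate it.
		if err := c.Update(ctx, obj); err != nil {
			if !apierrors.IsNotFound(err) {
				return err
			}
			if err := c.Create(ctx, obj); err != nil {
				return err
			}
		}
	}
	return nil
}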
This is useful for configuring the operator to use a specific image as the default driver image. One use case is running the operator as a standalone binary for quick testing during development.
client.Get() fails if the namespace is set for a cluster-scoped object. Hence we set the namespace only in the case of namespace-scoped objects.
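A small sketch of that distinction, with an explicit namespaced flag standing in for whatever scope detection the real code uses:

package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getObject only puts the namespace into the lookup key for namespace-scoped
// objects; cluster-scoped objects (CSIDriver, ClusterRoleBinding, ...) are
// looked up by name alone, since Get() fails when a namespace is supplied for
// them. The namespaced parameter is an assumption for this sketch.
func getObject(ctx context.Context, c client.Client, obj client.Object, namespace string, namespaced bool) error {
	key := client.ObjectKey{Name: obj.GetName()}
	if namespaced {
		key.Namespace = namespace
	}
	return c.Get(ctx, key, obj)
}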
While updating an existing object we copy the metadata from the object we read into our new copy, so that UpdateOrCreate() can stay type agnostic. This will not reset the status fields of the existing object, as client.Update() does not reset the status. This also changes the semantics of the Create() call: it no longer checks whether the object being created exists, so it now returns an error if the object already exists. This revealed an issue in the unit tests, which reused an existing CSIDriver created for a deleted deployment. Took the opportunity to make Create()/Update() more verbose.
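A type-agnostic sketch of that pattern, using only the generic object interfaces; names are illustrative and the operator's own wrappers differ:

package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateOrCreate reads the current object, copies the server-set metadata into
// the desired copy and updates it; if the object does not exist yet it is
// created, and a plain Create() error surfaces if it already exists after all.
func updateOrCreate(ctx context.Context, c client.Client, obj client.Object) error {
	existing := obj.DeepCopyObject().(client.Object)
	key := client.ObjectKey{Name: obj.GetName(), Namespace: obj.GetNamespace()}
	err := c.Get(ctx, key, existing)
	if apierrors.IsNotFound(err) {
		return c.Create(ctx, obj)
	}
	if err != nil {
		return err
	}
	// Carry over metadata needed for the update; the existing status is left
	// alone because client.Update() does not touch the status subresource.
	obj.SetResourceVersion(existing.GetResourceVersion())
	obj.SetUID(existing.GetUID())
	return c.Update(ctx, obj)
}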
This adds a missing certificates check in the reconcile loop, so that the deployment gets updated with new secrets if any of the certificates changed.
Force-pushed from b6fad9d to d2bc2f1
Done. Shall we wait for the final test results?
GitHub still shows conflicts. Did you pull before rebasing?
Never mind. Now it no longer does.
This group of commits implements reconciling deployment changes for both known (cached) deployments and unknown deployments.