-
Notifications
You must be signed in to change notification settings - Fork 382
add alpha asynchronous binding operation support #1512
add alpha asynchronous binding operation support #1512
Conversation
bf70c8e
to
ed8ba30
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks like the biz logic between this and async instance should be similar and sure would be nice to just define the logic in one place and just customize the resources being operated on. Don't really have a good suggestion off the top of my head however :)
test/integration/controller_test.go
Outdated
@@ -1333,7 +1565,7 @@ func newTestController(t *testing.T, config fakeosb.FakeClientConfiguration) ( | |||
informerFactory := scinformers.NewSharedInformerFactory(catalogClient, 10*time.Second) | |||
serviceCatalogSharedInformers := informerFactory.Servicecatalog().V1beta1() | |||
|
|||
fakeRecorder := record.NewFakeRecorder(10) | |||
fakeRecorder := record.NewFakeRecorder(20) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a comment here on why this was necessary so if people in the future get stuck they might know what to do?
Also, why not something larger like 100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. And I changed it to 50, which I expect should never be hit by our tests.
pkg/controller/controller_binding.go
Outdated
@@ -385,6 +401,12 @@ func (c *controller) reconcileServiceBinding(binding *v1beta1.ServiceBinding) er | |||
BindResource: &osb.BindResource{AppGUID: &appGUID}, | |||
} | |||
|
|||
if serviceClass.Spec.BindingRetrievable && | |||
utilfeature.DefaultFeatureGate.Enabled(scfeatures.AsyncBindingOperations) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we log the case where the user has not enabled the feature? Or, how would they know to turn this flag on? Or does it matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should at least add some docs explaining the flag (don't see them in this PR). More logs couldn't hurt, but IMO the docs are paramount.
pkg/controller/controller_binding.go
Outdated
return err | ||
} | ||
|
||
c.recorder.Eventf(instance, corev1.EventTypeNormal, asyncBindingReason, asyncBindingMessage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be for binding not instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pkg/controller/controller_binding.go
Outdated
return err | ||
} | ||
|
||
c.recorder.Eventf(instance, corev1.EventTypeNormal, asyncUnbindingReason, asyncUnbindingMessage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, binding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Fixed.
@vaikas-google I made some design sketches in #1209 , and the response I got was to duplicate the logic for now. I'm all for refactoring to make common across both resources at a later date! |
66dd327
to
5f3a2d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks reasonable @kibbles-n-bytes
only question here is don't we still need to prefix the field names with Alpha
?
pkg/controller/controller_binding.go
Outdated
@@ -385,6 +401,12 @@ func (c *controller) reconcileServiceBinding(binding *v1beta1.ServiceBinding) er | |||
BindResource: &osb.BindResource{AppGUID: &appGUID}, | |||
} | |||
|
|||
if serviceClass.Spec.BindingRetrievable && | |||
utilfeature.DefaultFeatureGate.Enabled(scfeatures.AsyncBindingOperations) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should at least add some docs explaining the flag (don't see them in this PR). More logs couldn't hurt, but IMO the docs are paramount.
@arschles I don't think that kube uses the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks like a good start. My main concern is with our logic for handling errors after getting a "succeeded" response for the async operation.
pkg/controller/controller_test.go
Outdated
func getTestServiceBindingAsyncBinding(operation string) *v1beta1.ServiceBinding { | ||
binding := getTestServiceBinding() | ||
if operation != "" { | ||
binding.Status.LastOperation = &operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not used as it's overwritten by the new ServiceBindingStatus
a few lines later.
pkg/controller/controller.go
Outdated
@@ -177,7 +175,8 @@ func (c *controller) Run(workers int, stopCh <-chan struct{}) { | |||
createWorker(c.servicePlanQueue, "ClusterServicePlan", maxRetries, true, c.reconcileClusterServicePlanKey, stopCh, &waitGroup) | |||
createWorker(c.instanceQueue, "ServiceInstance", maxRetries, true, c.reconcileServiceInstanceKey, stopCh, &waitGroup) | |||
createWorker(c.bindingQueue, "ServiceBinding", maxRetries, true, c.reconcileServiceBindingKey, stopCh, &waitGroup) | |||
createWorker(c.pollingQueue, "Poller", maxRetries, false, c.requeueServiceInstanceForPoll, stopCh, &waitGroup) | |||
createWorker(c.instancePollingQueue, "InstancePoller", maxRetries, false, c.requeueServiceInstanceForPoll, stopCh, &waitGroup) | |||
createWorker(c.bindingPollingQueue, "BindingPoller", maxRetries, false, c.requeueServiceBindingForPoll, stopCh, &waitGroup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should only create the worker if the feature is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, true. I had added the feature gate after the implementation, so some of these slipped through.
pkg/controller/controller.go
Outdated
Tags: svc.Tags, | ||
Description: svc.Description, | ||
Requires: svc.Requires, | ||
BindingRetrievable: svc.BindingRetrievable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since BindingRetrievable
is an alpha addition to the API, it must not be set when the feature is disabled, according to the kube guidelines at https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#adding-unstable-features-to-stable-versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pkg/controller/controller_binding.go
Outdated
// deleting or mitigating an orphan; this is more readable than | ||
// checking the timestamps in various places. | ||
mitigatingOrphan := binding.Status.OrphanMitigationInProgress | ||
creating := binding.Status.CurrentOperation == v1beta1.ServiceBindingOperationBind && !mitigatingOrphan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't strictly need the creating
boolean yet since there is no updating state. So creating
is just !deleting
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. I had added it originally in attempt to keep a bit more parity with the async instances. Though I ended up only using it once in this function, so I'm happy to remove it.
pkg/controller/controller_binding.go
Outdated
c.recorder.Event(binding, corev1.EventTypeWarning, errorPollingLastOperationReason, s) | ||
|
||
if err := c.checkPollingServiceBindingForReconciliationRetryTimeout(binding); err != nil { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot return nil here in all cases. If there was an error encountered while doing the work in checkPollingServiceBindingForReconciliationRetryTimeout
, then we need to return that error. For example, if there was an error updating the binding status, we need to return that error here.
The same goes for the other four places where checkPollingServiceBindingForReconciliationRetryTimeout
is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I mixed up the error flow for polling in my head. Fixed.
} | ||
|
||
getBindingResponse, err := brokerClient.GetBinding(getBindingRequest) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right logic here. When we fail to get the binding after the async operation succeeds, then we cannot continue to poll for the async operation. The proposed OSB API spec states
A response with "state": "succeeded" or "state": "failed" MUST cause the platform to cease polling.
I think we need to either retry getting the binding and enter orphan mitigation if we fail to get the binding after a sufficient period of time. We probably need another state that the binding can enter into after the async create was successful and before the get of the binding is successful. I would suggest that we even wait until the next reconciliation before we attempt to get the new binding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof... I didn't notice the "MUST" there. I had intended to long-term move the GET outside of the polling since it is completely independent of the last_operation
endpoint, but had hoped it could be a follow-up. Orphan mitigation would probably be the easiest short-term solution.
As an aside about polling in general, it seems problematic to explicitly say the platform MUST stop polling upon receiving a succeeded/failed state. Does that mean we should expect brokers to nuke their last_operation
endpoint as soon as one GET that returns "succeeded" is returned? If so, then isn't our current instance polling also a bit broken? If something goes wrong while trying to store the state of the succeeded
/failed
response, we try again from the top, attempting another poll.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative is that the broker would need to retain every operation_key
indefinitely, though, right? Is there a happy medium where the brokers agree to retain the operation_key
for a certain period of time or until the platform explicitly acknowledges that it no longer needs the operation_key
?
Yes, I agree that our current polling is a bit broken. It is not clear (at least to me) from the OSB spec how the broker is supposed to respond when it gets a last operation. Either the broker will respond with an error code and the controller will continue polling until the reconciliation retry duration expires. Or the broker will respond with "succeeded" and the controller will proceed along merrily as though it is the first time that it received "succeeded". In the first case, there are quite a few edge cases that make that an undesirable state for the resource to be in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me as well like the spec is unclear. Rereading, the wording seems to imply to me that a platform should only expect one "succeeded"
/"failed"
, but as you said any error code other than 400 will cause the platform to keep polling, and the description of 400 doesn't capture this case. I'll bring it up to the OSB working group to see if we can get clarification.
For the time being, I made this situation trigger orphan mitigation.
pkg/controller/controller_binding.go
Outdated
return c.continuePollingServiceBinding(binding) | ||
} | ||
|
||
if err := c.injectServiceBinding(binding, getBindingResponse.Credentials); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern here. We cannot continue polling if there was a failure to inject the service binding.
@kibbles-n-bytes @staebler this is different than schemas and originating identity, though, isn't it? the field names in the resources may change name or disappear altogether. was that the case with schemas/orig. ident? |
edaafa3
to
80bcea2
Compare
1395a1f
to
e502e2a
Compare
Setting to "in-progress" while I address PR comments. |
ef51ede
to
7276c20
Compare
@arschles That was the case with schemas; for example, I think you're thinking of the resources from the OSB client, which were prefixed with Alpha inconsistently; the |
@staebler I made the changes based off your feedback, please take a look! |
7276c20
to
f45cdeb
Compare
pkg/controller/controller_binding.go
Outdated
@@ -1439,21 +1450,17 @@ func (c *controller) pollServiceBinding(binding *v1beta1.ServiceBinding) error { | |||
|
|||
setServiceBindingCondition( | |||
binding, | |||
v1beta1.ServiceBindingConditionReady, | |||
v1beta1.ConditionFalse, | |||
v1beta1.ServiceBindingConditionFailed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make the behavior inconsistent between sync and async binding. In the sync case, if there is an error injecting the binding, the reconciliation is retried until the retry duration is exceeded. In the async case, if there is an error injecting the binding, the reconciler immediately sets the state of the binding to Failed. One way or another, the two scenarios need to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the synchronous behavior is the ideal one. I would like to rework the logic to be able to retry the fetching and injection after we know the operation succeeded, but would like to do that in a follow-up. I think it's okay to have the slight inconsistency for the time being.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I can accept that.
Adds alpha support for the proposed asynchronous binding flow coming to the OSB spec (PR #334).