Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

add alpha asynchronous binding operation support #1512

Merged
merged 16 commits into from
Nov 7, 2017

Conversation

kibbles-n-bytes
Copy link
Contributor

@kibbles-n-bytes kibbles-n-bytes commented Nov 1, 2017

Adds alpha support for the proposed asynchronous binding flow coming to the OSB spec (PR #334).

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 1, 2017
Copy link
Contributor

@vaikas vaikas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks like the biz logic between this and async instance should be similar and sure would be nice to just define the logic in one place and just customize the resources being operated on. Don't really have a good suggestion off the top of my head however :)

@@ -1333,7 +1565,7 @@ func newTestController(t *testing.T, config fakeosb.FakeClientConfiguration) (
informerFactory := scinformers.NewSharedInformerFactory(catalogClient, 10*time.Second)
serviceCatalogSharedInformers := informerFactory.Servicecatalog().V1beta1()

fakeRecorder := record.NewFakeRecorder(10)
fakeRecorder := record.NewFakeRecorder(20)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a comment here on why this was necessary so if people in the future get stuck they might know what to do?
Also, why not something larger like 100?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. And I changed it to 50, which I expect should never be hit by our tests.

@@ -385,6 +401,12 @@ func (c *controller) reconcileServiceBinding(binding *v1beta1.ServiceBinding) er
BindResource: &osb.BindResource{AppGUID: &appGUID},
}

if serviceClass.Spec.BindingRetrievable &&
utilfeature.DefaultFeatureGate.Enabled(scfeatures.AsyncBindingOperations) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log the case where the user has not enabled the feature? Or, how would they know to turn this flag on? Or does it matter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least add some docs explaining the flag (don't see them in this PR). More logs couldn't hurt, but IMO the docs are paramount.

return err
}

c.recorder.Eventf(instance, corev1.EventTypeNormal, asyncBindingReason, asyncBindingMessage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be for binding not instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

return err
}

c.recorder.Eventf(instance, corev1.EventTypeNormal, asyncUnbindingReason, asyncUnbindingMessage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, binding?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Fixed.

@kibbles-n-bytes
Copy link
Contributor Author

@vaikas-google I made some design sketches in #1209 , and the response I got was to duplicate the logic for now. I'm all for refactoring to make common across both resources at a later date!

Copy link
Contributor

@arschles arschles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks reasonable @kibbles-n-bytes

only question here is don't we still need to prefix the field names with Alpha?

@@ -385,6 +401,12 @@ func (c *controller) reconcileServiceBinding(binding *v1beta1.ServiceBinding) er
BindResource: &osb.BindResource{AppGUID: &appGUID},
}

if serviceClass.Spec.BindingRetrievable &&
utilfeature.DefaultFeatureGate.Enabled(scfeatures.AsyncBindingOperations) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should at least add some docs explaining the flag (don't see them in this PR). More logs couldn't hurt, but IMO the docs are paramount.

@staebler
Copy link
Contributor

staebler commented Nov 3, 2017

@arschles I don't think that kube uses the Alpha and Beta prefix. See api_changes.md. The originating identity headers fields are feature gated and alpha and do not use the Alpha label either.

Copy link
Contributor

@staebler staebler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks like a good start. My main concern is with our logic for handling errors after getting a "succeeded" response for the async operation.

func getTestServiceBindingAsyncBinding(operation string) *v1beta1.ServiceBinding {
binding := getTestServiceBinding()
if operation != "" {
binding.Status.LastOperation = &operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not used as it's overwritten by the new ServiceBindingStatus a few lines later.

@@ -177,7 +175,8 @@ func (c *controller) Run(workers int, stopCh <-chan struct{}) {
createWorker(c.servicePlanQueue, "ClusterServicePlan", maxRetries, true, c.reconcileClusterServicePlanKey, stopCh, &waitGroup)
createWorker(c.instanceQueue, "ServiceInstance", maxRetries, true, c.reconcileServiceInstanceKey, stopCh, &waitGroup)
createWorker(c.bindingQueue, "ServiceBinding", maxRetries, true, c.reconcileServiceBindingKey, stopCh, &waitGroup)
createWorker(c.pollingQueue, "Poller", maxRetries, false, c.requeueServiceInstanceForPoll, stopCh, &waitGroup)
createWorker(c.instancePollingQueue, "InstancePoller", maxRetries, false, c.requeueServiceInstanceForPoll, stopCh, &waitGroup)
createWorker(c.bindingPollingQueue, "BindingPoller", maxRetries, false, c.requeueServiceBindingForPoll, stopCh, &waitGroup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should only create the worker if the feature is enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, true. I had added the feature gate after the implementation, so some of these slipped through.

Tags: svc.Tags,
Description: svc.Description,
Requires: svc.Requires,
BindingRetrievable: svc.BindingRetrievable,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since BindingRetrievable is an alpha addition to the API, it must not be set when the feature is disabled, according to the kube guidelines at https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#adding-unstable-features-to-stable-versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// deleting or mitigating an orphan; this is more readable than
// checking the timestamps in various places.
mitigatingOrphan := binding.Status.OrphanMitigationInProgress
creating := binding.Status.CurrentOperation == v1beta1.ServiceBindingOperationBind && !mitigatingOrphan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't strictly need the creating boolean yet since there is no updating state. So creating is just !deleting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. I had added it originally in attempt to keep a bit more parity with the async instances. Though I ended up only using it once in this function, so I'm happy to remove it.

c.recorder.Event(binding, corev1.EventTypeWarning, errorPollingLastOperationReason, s)

if err := c.checkPollingServiceBindingForReconciliationRetryTimeout(binding); err != nil {
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot return nil here in all cases. If there was an error encountered while doing the work in checkPollingServiceBindingForReconciliationRetryTimeout, then we need to return that error. For example, if there was an error updating the binding status, we need to return that error here.

The same goes for the other four places where checkPollingServiceBindingForReconciliationRetryTimeout is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I mixed up the error flow for polling in my head. Fixed.

}

getBindingResponse, err := brokerClient.GetBinding(getBindingRequest)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right logic here. When we fail to get the binding after the async operation succeeds, then we cannot continue to poll for the async operation. The proposed OSB API spec states

A response with "state": "succeeded" or "state": "failed" MUST cause the platform to cease polling.

I think we need to either retry getting the binding and enter orphan mitigation if we fail to get the binding after a sufficient period of time. We probably need another state that the binding can enter into after the async create was successful and before the get of the binding is successful. I would suggest that we even wait until the next reconciliation before we attempt to get the new binding.

Copy link
Contributor Author

@kibbles-n-bytes kibbles-n-bytes Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof... I didn't notice the "MUST" there. I had intended to long-term move the GET outside of the polling since it is completely independent of the last_operation endpoint, but had hoped it could be a follow-up. Orphan mitigation would probably be the easiest short-term solution.

As an aside about polling in general, it seems problematic to explicitly say the platform MUST stop polling upon receiving a succeeded/failed state. Does that mean we should expect brokers to nuke their last_operation endpoint as soon as one GET that returns "succeeded" is returned? If so, then isn't our current instance polling also a bit broken? If something goes wrong while trying to store the state of the succeeded/failed response, we try again from the top, attempting another poll.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative is that the broker would need to retain every operation_key indefinitely, though, right? Is there a happy medium where the brokers agree to retain the operation_key for a certain period of time or until the platform explicitly acknowledges that it no longer needs the operation_key?

Yes, I agree that our current polling is a bit broken. It is not clear (at least to me) from the OSB spec how the broker is supposed to respond when it gets a last operation. Either the broker will respond with an error code and the controller will continue polling until the reconciliation retry duration expires. Or the broker will respond with "succeeded" and the controller will proceed along merrily as though it is the first time that it received "succeeded". In the first case, there are quite a few edge cases that make that an undesirable state for the resource to be in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me as well like the spec is unclear. Rereading, the wording seems to imply to me that a platform should only expect one "succeeded"/"failed", but as you said any error code other than 400 will cause the platform to keep polling, and the description of 400 doesn't capture this case. I'll bring it up to the OSB working group to see if we can get clarification.

For the time being, I made this situation trigger orphan mitigation.

return c.continuePollingServiceBinding(binding)
}

if err := c.injectServiceBinding(binding, getBindingResponse.Credentials); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern here. We cannot continue polling if there was a failure to inject the service binding.

@kibbles-n-bytes
Copy link
Contributor Author

kibbles-n-bytes commented Nov 3, 2017

@arschles Following what @staebler said, for schemas we didn't prefix with Alpha, so I also didn't prefix here.

(Originating identities should actually be migrated out of an alpha feature now that they're in 2.13 though 😅 )

@arschles
Copy link
Contributor

arschles commented Nov 3, 2017

@kibbles-n-bytes @staebler this is different than schemas and originating identity, though, isn't it? the field names in the resources may change name or disappear altogether. was that the case with schemas/orig. ident?

@kibbles-n-bytes kibbles-n-bytes force-pushed the async_bind branch 6 times, most recently from edaafa3 to 80bcea2 Compare November 3, 2017 17:19
@kibbles-n-bytes kibbles-n-bytes force-pushed the async_bind branch 3 times, most recently from 1395a1f to e502e2a Compare November 3, 2017 17:42
@kibbles-n-bytes
Copy link
Contributor Author

Setting to "in-progress" while I address PR comments.

@kibbles-n-bytes kibbles-n-bytes force-pushed the async_bind branch 4 times, most recently from ef51ede to 7276c20 Compare November 3, 2017 19:07
@kibbles-n-bytes
Copy link
Contributor Author

@arschles That was the case with schemas; for example, ServiceInstanceCreateParameterSchema didn't have Alpha as part of its resource name even before 2.13 support (link).

I think you're thinking of the resources from the OSB client, which were prefixed with Alpha inconsistently; the osb.Plan object had the field AlphaParameterSchemas of type *AlphaParameterSchema, but the request objects had the field OriginatingIdentity of type AlphaOriginatingIdentity. But @pmorie suggested when I made the changes to the OSB client to not prefix with Alpha.

@kibbles-n-bytes
Copy link
Contributor Author

@staebler I made the changes based off your feedback, please take a look!

@@ -1439,21 +1450,17 @@ func (c *controller) pollServiceBinding(binding *v1beta1.ServiceBinding) error {

setServiceBindingCondition(
binding,
v1beta1.ServiceBindingConditionReady,
v1beta1.ConditionFalse,
v1beta1.ServiceBindingConditionFailed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make the behavior inconsistent between sync and async binding. In the sync case, if there is an error injecting the binding, the reconciliation is retried until the retry duration is exceeded. In the async case, if there is an error injecting the binding, the reconciler immediately sets the state of the binding to Failed. One way or another, the two scenarios need to be consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the synchronous behavior is the ideal one. I would like to rework the logic to be able to retry the fetching and injection after we know the operation succeeded, but would like to do that in a follow-up. I think it's okay to have the slight inconsistency for the time being.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I can accept that.

@vaikas vaikas added the LGTM1 label Nov 6, 2017
@staebler staebler added the LGTM2 label Nov 7, 2017
@staebler staebler merged commit 4309a0e into kubernetes-retired:master Nov 7, 2017
@kibbles-n-bytes kibbles-n-bytes deleted the async_bind branch November 9, 2017 23:01
@kibbles-n-bytes kibbles-n-bytes added this to the 0.1.3 milestone Nov 16, 2017
@carolynvs carolynvs mentioned this pull request Apr 18, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. LGTM1 LGTM2 size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants