add alpha asynchronous binding operation support #1512

kibbles-n-bytes · 2017-11-01T23:07:45Z

Adds alpha support for the proposed asynchronous binding flow coming to the OSB spec (PR #334).

vaikas

Overall, it looks like the business logic here and in the async instance handling should be similar, and it sure would be nice to define that logic in one place and just customize the resources being operated on. I don't really have a good suggestion off the top of my head, however :)

vaikas · 2017-11-02T17:05:19Z

test/integration/controller_test.go

@@ -1333,7 +1565,7 @@ func newTestController(t *testing.T, config fakeosb.FakeClientConfiguration) (
 	informerFactory := scinformers.NewSharedInformerFactory(catalogClient, 10*time.Second)
 	serviceCatalogSharedInformers := informerFactory.Servicecatalog().V1beta1()

-	fakeRecorder := record.NewFakeRecorder(10)
+	fakeRecorder := record.NewFakeRecorder(20)


Can you put a comment here explaining why this was necessary, so that people who get stuck in the future know what to do?
Also, why not something larger, like 100?

Done. And I changed it to 50, which I expect should never be hit by our tests.
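
For context on why the buffer size matters: a fake recorder of this kind hands events to the test through a buffered channel, so once the buffer is full the next event blocks the code under test. Below is a minimal stdlib-only model of that behavior. The `fakeRecorder` type and its helpers are hypothetical stand-ins, not the client-go `record.FakeRecorder`, and the non-blocking send is used only so the sketch can report the overflow instead of hanging.

```go
package main

import "fmt"

// fakeRecorder is a hypothetical stand-in for a channel-backed fake event
// recorder. It only models the buffering, nothing else.
type fakeRecorder struct {
	events chan string
}

func newFakeRecorder(bufferSize int) *fakeRecorder {
	return &fakeRecorder{events: make(chan string, bufferSize)}
}

// tryEvent records an event without blocking and reports whether the
// buffer had room. A plain blocking send would stall the caller instead.
func (r *fakeRecorder) tryEvent(msg string) bool {
	select {
	case r.events <- msg:
		return true
	default:
		return false
	}
}

// countAccepted emits n events and returns how many fit in the buffer.
func countAccepted(bufferSize, n int) int {
	r := newFakeRecorder(bufferSize)
	accepted := 0
	for i := 0; i < n; i++ {
		if r.tryEvent(fmt.Sprintf("event %d", i)) {
			accepted++
		}
	}
	return accepted
}

func main() {
	fmt.Println("buffer 10, 15 events:", countAccepted(10, 15))
	fmt.Println("buffer 50, 15 events:", countAccepted(50, 15))
}
```

A test that emits more events than the buffer holds, and never drains the channel, is the situation a larger buffer avoids.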

vaikas · 2017-11-02T17:08:20Z

pkg/controller/controller_binding.go

@@ -385,6 +401,12 @@ func (c *controller) reconcileServiceBinding(binding *v1beta1.ServiceBinding) er
 			BindResource: &osb.BindResource{AppGUID: &appGUID},
 		}

+		if serviceClass.Spec.BindingRetrievable &&
+			utilfeature.DefaultFeatureGate.Enabled(scfeatures.AsyncBindingOperations) {


Should we log the case where the user has not enabled the feature? Or, how would they know to turn this flag on? Or does it matter?

We should at least add some docs explaining the flag (don't see them in this PR). More logs couldn't hurt, but IMO the docs are paramount.
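
One way to surface the disabled-gate case is to return a log message alongside the decision. This is only a sketch: `featureGate`, `asyncBindAllowed`, and the gate name string are assumptions made for illustration, not the code in this PR.

```go
package main

import "fmt"

// featureGate is a hypothetical stand-in for the real feature-gate lookup.
type featureGate map[string]bool

func (g featureGate) Enabled(name string) bool { return g[name] }

// asyncBindingOperations is an assumed gate name used only in this sketch.
const asyncBindingOperations = "AsyncBindingOperations"

// asyncBindAllowed reports whether an asynchronous bind may be attempted.
// When the class supports it but the gate is off, it also returns a message
// the caller could log so that users learn the flag exists.
func asyncBindAllowed(bindingRetrievable bool, gate featureGate) (bool, string) {
	switch {
	case !bindingRetrievable:
		return false, ""
	case !gate.Enabled(asyncBindingOperations):
		return false, fmt.Sprintf(
			"class supports retrievable bindings, but feature gate %q is disabled",
			asyncBindingOperations)
	default:
		return true, ""
	}
}

func main() {
	ok, msg := asyncBindAllowed(true, featureGate{})
	fmt.Println(ok, msg)
}
```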

vaikas · 2017-11-02T17:09:56Z

pkg/controller/controller_binding.go

+				return err
+			}
+
+			c.recorder.Eventf(instance, corev1.EventTypeNormal, asyncBindingReason, asyncBindingMessage)


Should this be for binding not instance?

vaikas · 2017-11-02T17:10:51Z

pkg/controller/controller_binding.go

+				return err
+			}
+
+			c.recorder.Eventf(instance, corev1.EventTypeNormal, asyncUnbindingReason, asyncUnbindingMessage)


same here, binding?

Nice catch. Fixed.

kibbles-n-bytes · 2017-11-02T17:20:08Z

@vaikas-google I made some design sketches in #1209, and the response I got was to duplicate the logic for now. I'm all for refactoring to share the logic across both resources at a later date!

arschles

looks reasonable @kibbles-n-bytes

only question here is don't we still need to prefix the field names with Alpha?

arschles · 2017-11-02T23:56:37Z

pkg/controller/controller_binding.go

@@ -385,6 +401,12 @@ func (c *controller) reconcileServiceBinding(binding *v1beta1.ServiceBinding) er
 			BindResource: &osb.BindResource{AppGUID: &appGUID},
 		}

+		if serviceClass.Spec.BindingRetrievable &&
+			utilfeature.DefaultFeatureGate.Enabled(scfeatures.AsyncBindingOperations) {


We should at least add some docs explaining the flag (don't see them in this PR). More logs couldn't hurt, but IMO the docs are paramount.

staebler · 2017-11-03T12:12:01Z

@arschles I don't think that kube uses the Alpha and Beta prefix. See api_changes.md. The originating identity headers fields are feature gated and alpha and do not use the Alpha label either.

staebler

Overall it looks like a good start. My main concern is with our logic for handling errors after getting a "succeeded" response for the async operation.

staebler · 2017-11-03T14:14:27Z

pkg/controller/controller_test.go

+func getTestServiceBindingAsyncBinding(operation string) *v1beta1.ServiceBinding {
+	binding := getTestServiceBinding()
+	if operation != "" {
+		binding.Status.LastOperation = &operation


This is not used as it's overwritten by the new ServiceBindingStatus a few lines later.
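
The problem can be reproduced in isolation. In this sketch the `binding` and `bindingStatus` types are simplified, hypothetical stand-ins for the real API types; only the ordering of the assignments matters.

```go
package main

import "fmt"

// bindingStatus and binding are simplified stand-ins for the API types.
type bindingStatus struct {
	LastOperation          *string
	AsyncOpInProgress      bool
}

type binding struct {
	Status bindingStatus
}

// overwritten sets LastOperation and then replaces the whole status,
// which silently discards the field that was just set.
func overwritten(operation string) *binding {
	b := &binding{}
	if operation != "" {
		b.Status.LastOperation = &operation
	}
	b.Status = bindingStatus{AsyncOpInProgress: true}
	return b
}

// preserved assigns the status first and sets the optional field afterwards.
func preserved(operation string) *binding {
	b := &binding{Status: bindingStatus{AsyncOpInProgress: true}}
	if operation != "" {
		b.Status.LastOperation = &operation
	}
	return b
}

func main() {
	fmt.Println(overwritten("op-key").Status.LastOperation == nil)
	fmt.Println(*preserved("op-key").Status.LastOperation)
}
```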

staebler · 2017-11-03T14:28:48Z

pkg/controller/controller.go

@@ -177,7 +175,8 @@ func (c *controller) Run(workers int, stopCh <-chan struct{}) {
 		createWorker(c.servicePlanQueue, "ClusterServicePlan", maxRetries, true, c.reconcileClusterServicePlanKey, stopCh, &waitGroup)
 		createWorker(c.instanceQueue, "ServiceInstance", maxRetries, true, c.reconcileServiceInstanceKey, stopCh, &waitGroup)
 		createWorker(c.bindingQueue, "ServiceBinding", maxRetries, true, c.reconcileServiceBindingKey, stopCh, &waitGroup)
-		createWorker(c.pollingQueue, "Poller", maxRetries, false, c.requeueServiceInstanceForPoll, stopCh, &waitGroup)
+		createWorker(c.instancePollingQueue, "InstancePoller", maxRetries, false, c.requeueServiceInstanceForPoll, stopCh, &waitGroup)
+		createWorker(c.bindingPollingQueue, "BindingPoller", maxRetries, false, c.requeueServiceBindingForPoll, stopCh, &waitGroup)


We should only create the worker if the feature is enabled.

Ah, true. I had added the feature gate after the implementation, so some of these slipped through.
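
A minimal sketch of gating the new worker, assuming a map-backed `featureGate` and a `pollerNames` helper that are both hypothetical and exist only for this illustration:

```go
package main

import "fmt"

// featureGate is a hypothetical stand-in for the real feature-gate lookup.
type featureGate map[string]bool

func (g featureGate) Enabled(name string) bool { return g[name] }

// asyncBindingOperations is an assumed gate name used only in this sketch.
const asyncBindingOperations = "AsyncBindingOperations"

// pollerNames lists the polling workers the controller would start.
// The binding poller is only included when the alpha feature is enabled.
func pollerNames(gate featureGate) []string {
	names := []string{"InstancePoller"}
	if gate.Enabled(asyncBindingOperations) {
		names = append(names, "BindingPoller")
	}
	return names
}

func main() {
	fmt.Println(pollerNames(featureGate{}))
	fmt.Println(pollerNames(featureGate{asyncBindingOperations: true}))
}
```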

staebler · 2017-11-03T14:30:48Z

pkg/controller/controller.go

-				Tags:          svc.Tags,
-				Description:   svc.Description,
-				Requires:      svc.Requires,
+				BindingRetrievable: svc.BindingRetrievable,


Since BindingRetrievable is an alpha addition to the API, it must not be set when the feature is disabled, according to the kube guidelines at https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#adding-unstable-features-to-stable-versions.
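
In practice this means the conversion from the broker's catalog entry should copy the field only when the gate is on. A sketch under that assumption follows; `brokerService`, `classSpec`, and `convertService` are simplified, hypothetical names rather than the types in this repository.

```go
package main

import "fmt"

// brokerService is a simplified stand-in for a service in the broker catalog.
type brokerService struct {
	Name               string
	BindingRetrievable bool
}

// classSpec is a simplified stand-in for the stored service class spec.
type classSpec struct {
	ExternalName       string
	BindingRetrievable bool
}

// convertService copies the alpha field only when the feature is enabled,
// so the field stays at its zero value while the gate is off.
func convertService(svc brokerService, asyncBindingEnabled bool) classSpec {
	spec := classSpec{ExternalName: svc.Name}
	if asyncBindingEnabled {
		spec.BindingRetrievable = svc.BindingRetrievable
	}
	return spec
}

func main() {
	svc := brokerService{Name: "example-service", BindingRetrievable: true}
	fmt.Printf("%+v\n", convertService(svc, false))
	fmt.Printf("%+v\n", convertService(svc, true))
}
```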

staebler · 2017-11-03T14:36:27Z

pkg/controller/controller_binding.go

+	// deleting or mitigating an orphan; this is more readable than
+	// checking the timestamps in various places.
+	mitigatingOrphan := binding.Status.OrphanMitigationInProgress
+	creating := binding.Status.CurrentOperation == v1beta1.ServiceBindingOperationBind && !mitigatingOrphan


We don't strictly need the creating boolean yet since there is no updating state. So creating is just !deleting.

That's true. I had added it originally in an attempt to keep a bit more parity with the async instances. Though I ended up only using it once in this function, so I'm happy to remove it.

staebler · 2017-11-03T14:49:36Z

pkg/controller/controller_binding.go

+		c.recorder.Event(binding, corev1.EventTypeWarning, errorPollingLastOperationReason, s)
+
+		if err := c.checkPollingServiceBindingForReconciliationRetryTimeout(binding); err != nil {
+			return nil


We cannot return nil here in all cases. If there was an error encountered while doing the work in checkPollingServiceBindingForReconciliationRetryTimeout, then we need to return that error. For example, if there was an error updating the binding status, we need to return that error here.

The same goes for the other four places where checkPollingServiceBindingForReconciliationRetryTimeout is called.

Ah, I mixed up the error flow for polling in my head. Fixed.
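
The difference between the two error flows can be shown in isolation. This sketch assumes the helper returns a non-nil error only when its own work (for example, a status update) fails; `checkRetryTimeout` and both callers are hypothetical stand-ins, not the functions in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

var errStatusUpdate = errors.New("failed to update binding status")

// checkRetryTimeout is a hypothetical stand-in for the timeout check.
// It fails only when its own bookkeeping fails.
func checkRetryTimeout(updateFails bool) error {
	if updateFails {
		return errStatusUpdate
	}
	return nil
}

// pollSwallowing mirrors the reviewed snippet: the helper's error is dropped,
// so the caller never learns that the status update failed.
func pollSwallowing(updateFails bool) error {
	if err := checkRetryTimeout(updateFails); err != nil {
		return nil
	}
	return nil
}

// pollPropagating returns the helper's error so the work can be retried.
func pollPropagating(updateFails bool) error {
	if err := checkRetryTimeout(updateFails); err != nil {
		return err
	}
	return nil
}

func main() {
	fmt.Println("swallowing: ", pollSwallowing(true))
	fmt.Println("propagating:", pollPropagating(true))
}
```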

staebler · 2017-11-03T14:57:29Z

pkg/controller/controller_binding.go

+		}
+
+		getBindingResponse, err := brokerClient.GetBinding(getBindingRequest)
+		if err != nil {


I don't think this is the right logic here. When we fail to get the binding after the async operation succeeds, then we cannot continue to poll for the async operation. The proposed OSB API spec states

A response with "state": "succeeded" or "state": "failed" MUST cause the platform to cease polling.

I think we need to retry getting the binding, and enter orphan mitigation if we still fail to get it after a sufficient period of time. We probably need another state that the binding can enter after the async create has succeeded and before the get of the binding succeeds. I would suggest that we even wait until the next reconciliation before we attempt to get the new binding.

Oof... I didn't notice the "MUST" there. I had intended to long-term move the GET outside of the polling since it is completely independent of the last_operation endpoint, but had hoped it could be a follow-up. Orphan mitigation would probably be the easiest short-term solution.

As an aside about polling in general, it seems problematic to explicitly say the platform MUST stop polling upon receiving a succeeded/failed state. Does that mean we should expect brokers to nuke their last_operation endpoint as soon as one GET that returns "succeeded" is returned? If so, then isn't our current instance polling also a bit broken? If something goes wrong while trying to store the state of the succeeded/failed response, we try again from the top, attempting another poll.

The alternative is that the broker would need to retain every operation_key indefinitely, though, right? Is there a happy medium where the brokers agree to retain the operation_key for a certain period of time or until the platform explicitly acknowledges that it no longer needs the operation_key?

Yes, I agree that our current polling is a bit broken. It is not clear (at least to me) from the OSB spec how the broker is supposed to respond when it gets a last operation request after it has already reported "succeeded". Either the broker will respond with an error code and the controller will continue polling until the reconciliation retry duration expires, or the broker will respond with "succeeded" and the controller will proceed along merrily as though it were the first time that it received "succeeded". In the first case, there are quite a few edge cases that make that an undesirable state for the resource to be in.

The spec seems unclear to me as well. Rereading it, the wording seems to imply that a platform should only expect one "succeeded"/"failed" response, but as you said, any error code other than 400 will cause the platform to keep polling, and the description of 400 doesn't capture this case. I'll bring it up with the OSB working group to see if we can get clarification.

For the time being, I made this situation trigger orphan mitigation.
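
The resulting decision flow can be sketched as a small function of the last operation state and the outcome of the follow-up GET. The `action` values and `nextAction` are hypothetical names for this illustration, and treating any state other than "succeeded" or "failed" as "keep polling" is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

type action string

const (
	continuePolling       action = "continue polling"
	markReady             action = "mark binding ready"
	markFailed            action = "mark binding failed"
	startOrphanMitigation action = "start orphan mitigation"
)

var errGetBinding = errors.New("GET binding failed")

// nextAction decides what to do after a last operation response.
// A terminal state always stops polling; if the operation succeeded but the
// binding cannot be fetched, orphan mitigation starts instead of more polls.
func nextAction(state string, getBindingErr error) action {
	switch state {
	case "succeeded":
		if getBindingErr != nil {
			return startOrphanMitigation
		}
		return markReady
	case "failed":
		return markFailed
	default:
		return continuePolling
	}
}

func main() {
	fmt.Println(nextAction("in progress", nil))
	fmt.Println(nextAction("succeeded", nil))
	fmt.Println(nextAction("succeeded", errGetBinding))
	fmt.Println(nextAction("failed", nil))
}
```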

staebler · 2017-11-03T14:59:11Z

pkg/controller/controller_binding.go

+			return c.continuePollingServiceBinding(binding)
+		}
+
+		if err := c.injectServiceBinding(binding, getBindingResponse.Credentials); err != nil {


Same concern here. We cannot continue polling if there was a failure to inject the service binding.

kibbles-n-bytes · 2017-11-03T16:33:57Z

@arschles Following what @staebler said, for schemas we didn't prefix with Alpha, so I also didn't prefix here.

(Originating identities should actually be migrated out of an alpha feature now that they're in 2.13 though 😅 )

arschles · 2017-11-03T16:37:11Z

@kibbles-n-bytes @staebler this is different than schemas and originating identity, though, isn't it? the field names in the resources may change name or disappear altogether. was that the case with schemas/orig. ident?

kibbles-n-bytes · 2017-11-03T17:44:51Z

Setting to "in-progress" while I address PR comments.

kibbles-n-bytes · 2017-11-03T20:07:47Z

@arschles That was the case with schemas; for example, ServiceInstanceCreateParameterSchema didn't have Alpha as part of its resource name even before 2.13 support (link).

I think you're thinking of the resources from the OSB client, which were prefixed with Alpha inconsistently; the osb.Plan object had the field AlphaParameterSchemas of type *AlphaParameterSchema, but the request objects had the field OriginatingIdentity of type AlphaOriginatingIdentity. But @pmorie suggested when I made the changes to the OSB client to not prefix with Alpha.

kibbles-n-bytes · 2017-11-03T20:08:25Z

@staebler I made the changes based off your feedback, please take a look!

staebler · 2017-11-06T14:33:52Z

pkg/controller/controller_binding.go

@@ -1439,21 +1450,17 @@ func (c *controller) pollServiceBinding(binding *v1beta1.ServiceBinding) error {

 			setServiceBindingCondition(
 				binding,
-				v1beta1.ServiceBindingConditionReady,
-				v1beta1.ConditionFalse,
+				v1beta1.ServiceBindingConditionFailed,


This will make the behavior inconsistent between sync and async binding. In the sync case, if there is an error injecting the binding, the reconciliation is retried until the retry duration is exceeded. In the async case, if there is an error injecting the binding, the reconciler immediately sets the state of the binding to Failed. One way or another, the two scenarios need to be consistent.

I think the synchronous behavior is the ideal one. I would like to rework the logic to be able to retry the fetching and injection after we know the operation succeeded, but would like to do that in a follow-up. I think it's okay to have the slight inconsistency for the time being.

OK. I can accept that.

k8s-ci-robot added the cncf-cla: yes and size/XXL labels Nov 1, 2017

kibbles-n-bytes force-pushed the async_bind branch from bf70c8e to ed8ba30 Compare November 1, 2017 23:13

vaikas reviewed Nov 2, 2017


kibbles-n-bytes force-pushed the async_bind branch from 66dd327 to 5f3a2d7 Compare November 2, 2017 21:19

arschles reviewed Nov 2, 2017


staebler reviewed Nov 3, 2017


kibbles-n-bytes force-pushed the async_bind branch 6 times, most recently from edaafa3 to 80bcea2 Compare November 3, 2017 17:19

Michael Kibbe added 13 commits November 3, 2017 10:22

add types changes

536d5f6

add generated files

78283a6

add binding polling queue

4a82456

add async bind conditions

c9a9358

add polling handling to current reconciliation functions

ffa9369

add polling functions

d840645

fix existing tests

e0161a2

add tests for async binding operations

aea4cb1

add validation

6b60c74

add integration test support

287d865

add feature gate

4657989

rebase

214c5b6

fixup int test

b93bc8c

Michael Kibbe added 2 commits November 3, 2017 10:22

address comments

3f28d50

rebase

c8fcf51

kibbles-n-bytes force-pushed the async_bind branch 3 times, most recently from 1395a1f to e502e2a Compare November 3, 2017 17:42

kibbles-n-bytes added the in-progress label Nov 3, 2017

kibbles-n-bytes force-pushed the async_bind branch 4 times, most recently from ef51ede to 7276c20 Compare November 3, 2017 19:07

kibbles-n-bytes removed the in-progress label Nov 3, 2017

address comments

f45cdeb

kibbles-n-bytes force-pushed the async_bind branch from 7276c20 to f45cdeb Compare November 3, 2017 22:12

staebler reviewed Nov 6, 2017


vaikas added the LGTM1 label Nov 6, 2017

staebler added the LGTM2 label Nov 7, 2017

staebler merged commit 4309a0e into kubernetes-retired:master Nov 7, 2017

staebler mentioned this pull request Nov 8, 2017

Adding UnbindStatus to ServiceBindings #1526

Closed

kibbles-n-bytes deleted the async_bind branch November 9, 2017 23:01

staebler mentioned this pull request Nov 10, 2017

TestPollServiceBinding is not preserving the original binding for assertions #1549

Closed

kibbles-n-bytes added this to the 0.1.3 milestone Nov 16, 2017

carolynvs mentioned this pull request Apr 18, 2018

Async Binding Design #1209

Closed
