Skip to content

Conversation

@staebler
Copy link
Contributor

@staebler staebler commented Aug 30, 2019

Use expectations to ensure that clusterDeployment controller does not create multiple overlapping clusterprovisions.

There is currently a race condition where the controller will create a second clusterprovision if the first clusterprovision has not been added to the local cache by the time the clusterdeployment is reconciled again after creating the first clusterprovision. To prevent this, these changes use expectations to block the controller from creating the second clusterprovision unitl the controller sees that the first clusterprovision has been created.

When the clusterDeployment controller creates a clusterprovision, the controller will increase the number of creation expectations for the clusterdeployment. When the controller sees an add event for the clusterprovision, the controller will reduce the number of creation expectations for the clusterdeployment. During a reconcile loop, the controller will not look at clusterprovisions while there are outstanding creation expectations for the clusterdeployment being reconciled.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 30, 2019
@dgoodwin
Copy link
Contributor

Working through trying to understand this but initial thought, could you update the git commit msg on the second commit with some heavier details on what this is doing, how it works and why it's needed in this specific situation.

r.expectations.ExpectCreations(types.NamespacedName{Namespace: cd.Namespace, Name: cd.Name}.String(), 1)
if err := r.Create(context.TODO(), provision); err != nil {
cdLog.WithError(err).Error("could not create provision")
r.expectations.CreationObserved(types.NamespacedName{Namespace: cd.Namespace, Name: cd.Name}.String())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that's confusing me a little bit here, it looks like the intent here is to expect the creation of the cluster provision, but the expectation itself is using the cluster deployment's name. Should this be the provision's name? Or is that an issue with the random component of their names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectations are set up to expect a creation on behalf of the clusterdeployment. The key is always the resource being reconciled, and not the resource being created.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha thanks.

@staebler
Copy link
Contributor Author

staebler commented Aug 30, 2019

Working through trying to understand this but initial thought, could you update the git commit msg on the second commit with some heavier details on what this is doing, how it works and why it's needed in this specific situation.

Do you mean the first commit? Or do you mean why I need to vendor in the package that contains the controller expectations code?

@staebler staebler changed the title fix clusterprovision creation race [WIP] fix clusterprovision creation race Aug 30, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2019
@staebler
Copy link
Contributor Author

I need to copy the controller expectations code from k8s.io/kubernetes/pkg/controller into our codebase. That package is bringing in too much other stuff.

@dgoodwin
Copy link
Contributor

+1 copying sounds good.

I meant the commit with your code :) Whichever it was.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 30, 2019
@staebler
Copy link
Contributor Author

+1 copying sounds good.

I meant the commit with your code :) Whichever it was.

Sure. I will improve on the commit message.

@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 30, 2019
@staebler staebler changed the title [WIP] fix clusterprovision creation race fix clusterprovision creation race Aug 30, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 30, 2019
@dgoodwin
Copy link
Contributor

This is great, thank you.

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 30, 2019
create multiple overlapping clusterprovisions.

There is currently a race condition where the controller will create a
second clusterprovision if the first clusterprovision has not been added
to the local cache by the time the clusterdeployment is reconciled again
after creating the first clusterprovision. To prevent this, these changes
use expectations to block the controller from creating the second clusterprovision
unitl the controller sees that the first clusterprovision has been created.

When the clusterDeployment controller creates a clusterprovision, the controller
will increase the number of creation expectations for the clusterdeployment.
When the controller sees an add event for the clusterprovision, the controller
will reduce the number of creation expectations for the clusterdeployment.
During a reconcile loop, the controller will not look at clusterprovisions
while there are outstanding creation expectations for the clusterdeployment
being reconciled.
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 30, 2019
@dgoodwin
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2019
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 339584a into openshift:master Aug 30, 2019
wking added a commit to wking/hive that referenced this pull request Sep 10, 2019
... and security groups.  Pulling in openshift/installer@55630a304
(pkg/destroy/aws: Delete security groups by VPC,
2019-08-13, openshift/installer#2214).

Generated with:

  $ sed -i 's/ae2baf820f22b9bf1ed40932e7b702d790087574/6fdffbb9c21d90ba97fc79dff3ccbeac22fb2a0e/' Gopkg.toml
  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.1
   build date  : 2019-03-20
   git hash    : faa61893
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

The k8s.io/apimachinery/pkg/util/clock addition in Gopkg.lock is
probably catching up with c759bfa (Use expectations to ensure that
clusterDeployment controller does not create multiple overlapping
clusterprovisions, 2019-08-23, openshift#518).
@staebler staebler deleted the expectations branch December 7, 2019 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants