Skip to content

Conversation

@2uasimojo
Copy link
Member

@2uasimojo 2uasimojo commented Jul 23, 2021

To avoid timing issues and properly count assigned ClusterDeployments and ClusterClaims, this commit moves the assignment of ClusterClaims (updating their reference to their assigned CD) from the clusterclaim_controller into the clusterpool_controller, right next to the assignment of ClusterDeployments (updating their reference to their assigned ClusterClaim).

It also adds code to double-check that assignments are in sync (claims assigned to the clusters that ref them, and vice versa) before proceeding with pool math and further assignments. This should help eliminate assignment conflicts, and "heal" partial updates (e.g. assigning the claim succeeds, but assigning the CD fails).

HIVE-1599

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 23, 2021
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2021
@2uasimojo
Copy link
Member Author

First pass at the code side. TODO:

  • Clean up ordering of methods in collections.go
  • Fix UT
  • Add moar UT
  • Live test

@2uasimojo 2uasimojo force-pushed the consolidate-claiming branch 3 times, most recently from 9f27bcc to 8c0ecb5 Compare July 27, 2021 21:18
@codecov
Copy link

codecov bot commented Jul 27, 2021

Codecov Report

Merging #1474 (d8180cb) into master (3de9e02) will increase coverage by 1.02%.
The diff coverage is 69.59%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1474      +/-   ##
==========================================
+ Coverage   41.23%   42.26%   +1.02%     
==========================================
  Files         334      335       +1     
  Lines       30265    31012     +747     
==========================================
+ Hits        12481    13108     +627     
- Misses      16722    16800      +78     
- Partials     1062     1104      +42     
Impacted Files Coverage Δ
...g/controller/clusterpool/clusterpool_controller.go 56.56% <58.82%> (-2.93%) ⬇️
pkg/controller/clusterpool/collections.go 70.76% <70.76%> (ø)
...controller/clusterclaim/clusterclaim_controller.go 60.65% <100.00%> (+0.06%) ⬆️
pkg/controller/utils/errorscrub.go 100.00% <0.00%> (ø)
pkg/controller/utils/utils.go 79.47% <0.00%> (+0.32%) ⬆️
.../clusterdeployment/clusterdeployment_controller.go 69.18% <0.00%> (+9.26%) ⬆️

@2uasimojo 2uasimojo changed the title WIP: Consolidate claim assignment into clusterpool_controller Consolidate claim assignment into clusterpool_controller Jul 27, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 27, 2021
@2uasimojo 2uasimojo force-pushed the consolidate-claiming branch from 8c0ecb5 to f1f7e59 Compare July 27, 2021 21:57
Copy link
Member Author

@2uasimojo 2uasimojo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes for reviewers inline

@2uasimojo
Copy link
Member Author

/assign @abhinavdahiya
/cc @dgoodwin @jnpacker

Copy link
Contributor

@abhinavdahiya abhinavdahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a first pass.

  • the collection is feeling a little weird, because side-effect functions leave the object not actionable again as they become stale
  • we are treating both claims and cds as source of truth when for safety we should only treat the cds as that. (claims are user editable)
  • View functions update the collection (sorting) when makes them incompatible with concurrent calls.

// ByName returns the named ClusterDeployment from the cdLookup, or nil if no CD by that name exists.
ByName(string) *hivev1.ClusterDeployment
// Installing returns the list of ClusterDeployments in the process of being installed. These are
// not available for claim assignment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not available for claim assignment.

doesn't seem like this function should say what is assignable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's totally its job. The implication is that you couldn't, for example, do:

for cd in cds.Installing() {
    cd.Assign()
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, installing shouldn't say what you can do with the cd that are in this state.. that the user does with these is completely up to them. and in your example Assign() functions defines what type of cd can be assigned. Assign function should say that it should pass only assignable cd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These structs/methods are 100% designed to conform to and enforce the architecture of the clusterpool controller, and that architecture dictates that we don't assign CDs that are Installing. If future-me is reading, say, assignClustersToClaims and says, "Whoah, we're only using Assignable() CDs -- was that a mistake??" I will go look at the comments for Installing and at least see that it was done on purpose.

I've added some words to the docstrings for cdCollection.Assign and assignClustersToClaims. But I'd like to keep these words here.

Comment on lines 46 to 50
// be assigned first to minimize claim response time.
// - Running CDs, oldest first
// - Resuming CDs, in order of least recently resumed (soonest likely to become Running)
// - Hibernating CDs, oldest first
// - Anything else, oldest first. (These should probably be skipped. TODO?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem like a good place for interface to define this specific behavior. the implementation should be the one in control, otherwise it's too strict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interfaces are gone. I'd like to keep this in the docstring of the impl itself.


// SyncCDAssignments makes sure each claim which purports to be assigned has the correct CD
// assigned to it, updating the CD and/or claim on the server as necessary.
func (claims *claimCollection) SyncCDAssignments(c client.Client, cds cdLookup, logger log.FieldLogger) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this not return an error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best-effort. It logs problems it encounters, but we don't want those problems to short-circuit the Reconcile. (I could have bubbled the errors up and ignored them in the caller, but didn't see the need.)


// SyncClaimAssignments makes sure each ClusterDeployment which purports to be assigned has the
// correct claim assigned to it, updating the CD and/or claim on the server as necessary.
func (cds *cdCollection) SyncClaimAssignments(c client.Client, claims claimLookup, logger log.FieldLogger) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this not return an error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best-effort. It logs problems it encounters, but we don't want those problems to short-circuit the Reconcile. (I could have bubbled the errors up and ignored them in the caller, but didn't see the need.)

@2uasimojo
Copy link
Member Author

  • the collection is feeling a little weird, because side-effect functions leave the object not actionable again as they become stale

I don't think they become stale. I responded to one inline comment about this; were there other places this was of concern?

  • we are treating both claims and cds as source of truth when for safety we should only treat the cds as that. (claims are user editable)

I think this was addressed in an inline comment, yes?

  • View functions update the collection (sorting) when makes them incompatible with concurrent calls.

Moved sorting out of getters per this comment.

To avoid timing issues and properly count assigned ClusterDeployments
and ClusterClaims, this commit moves the assignment of ClusterClaims
(updating their reference to their assigned CD) from the
clusterclaim_controller into the clusterpool_controller, right next to
the assignment of ClusterDeployments (updating their reference to their
assigned ClusterClaim).

It also adds code to double-check that assignments are in sync (claims
assigned to the clusters that ref them, and vice versa) before
proceeding with pool math and further assignments. This should help
eliminate assignment conflicts, and "heal" partial updates (e.g.
assigning the claim succeeds, but assigning the CD fails).

HIVE-1599
@2uasimojo 2uasimojo force-pushed the consolidate-claiming branch from 2852c47 to d8180cb Compare August 5, 2021 22:56
@2uasimojo
Copy link
Member Author

Note: This is now stacked on #1489. Doesn't matter to me if we merge both commits here or separately.

@abhinavdahiya
Copy link
Contributor

Note: This is now stacked on #1489. Doesn't matter to me if we merge both commits here or separately.

might want to rebase as 1489 merged?

Copy link
Contributor

@abhinavdahiya abhinavdahiya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 10, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [2uasimojo,abhinavdahiya]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci openshift-ci bot merged commit ed4d2b3 into openshift:master Aug 10, 2021
@2uasimojo 2uasimojo deleted the consolidate-claiming branch August 11, 2021 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants