Skip to content

Improve communications between controller and provisioner#761

Merged
metal3-io-bot merged 15 commits intometal3-io:masterfrom
zaneb:semantic-return
Jan 15, 2021
Merged

Improve communications between controller and provisioner#761
metal3-io-bot merged 15 commits intometal3-io:masterfrom
zaneb:semantic-return

Conversation

@zaneb
Copy link
Copy Markdown
Member

@zaneb zaneb commented Jan 12, 2021

Use semantic functions (retryAfterDelay(), operationContinuing(), operationComplete(), operationFailed(), transientError()) to generate return values from ironic provisioner. This is a first step toward changing the provisioner.Result type to include richer information (e.g. to be able to distinguish between a request to retry after a delay and the operation simply continuing).

Stop writing to the BareMetalHost Status subresource in the provisioner, and require all changes to be communicated back to the controller via return values from the Provisioner interface.

Only update the BareMetalHost Status subresource when something has changed, instead of on ~every reconcile.

@zaneb zaneb requested a review from andfasano January 12, 2021 20:53
@metal3-io-bot metal3-io-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 12, 2021
@zaneb
Copy link
Copy Markdown
Member Author

zaneb commented Jan 12, 2021

/test-integration

@zaneb zaneb marked this pull request as ready for review January 13, 2021 18:18
@metal3-io-bot metal3-io-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 13, 2021
@zaneb
Copy link
Copy Markdown
Member Author

zaneb commented Jan 13, 2021

/test-integration

@zaneb
Copy link
Copy Markdown
Member Author

zaneb commented Jan 13, 2021

/cc @dhellmann

zaneb added 3 commits January 13, 2021 17:27
Add a set of functions that lend semantic meaning to the various Result
values that can be returned - some of which result in identical Result
structures being returned to the caller. This is the first step toward
improving the provisioner API to allow the controller more insight into
how it should respond to events in the provisioner.

The six possible events are:

* A conflict that should be retried with a delay
* A change to the Host Status that requires it to be written back to the
  k8s API
* Waiting for an ongoing process
* Successful completion of the current operation
* A failure of the current operation
* A transient error (e.g. network connection failure)
Denote instances where we are returning due to a 409 Conflict in Ironic
by calling the retryAfterDelay() function to get a Result.

Note that in some cases this introduces a delay where previously there
was none. Delaying after a 409 response is probably always the right
thing to do.
@zaneb
Copy link
Copy Markdown
Member Author

zaneb commented Jan 13, 2021

/test-integration

Copy link
Copy Markdown
Member

@andfasano andfasano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart a couple of minor points/questions, changes look good to me.

I think this approach is much more neat and readable - as the semantic functions express better the intention in the code, and definitely helps in promoting a better separation of the host update logic outside of the Provisioner.

Comment thread main.go
provisionerFactory := func(host metal3iov1alpha1.BareMetalHost, bmcCreds bmc.Credentials, publish provisioner.EventPublisher) (provisioner.Provisioner, error) {
isUnmanaged := host.Spec.ExternallyProvisioned && !host.HasBMCDetails()

hostCopy := host.DeepCopy()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the requirement for this copy?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's enforcing that no provisioner ever gets to make changes to the Host, since it only gets a copy and the controller never sees the copy.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fine

Comment thread controllers/metal3.io/baremetalhost_controller.go Outdated
Comment thread controllers/metal3.io/baremetalhost_controller.go Outdated
Comment thread pkg/provisioner/ironic/ironic.go Outdated
@metal3-io-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zaneb added 10 commits January 14, 2021 17:03
Denote instances where we are returning due to data in the Host's Status
structure being updated by calling the hostUpdated() function to get a
Result.
The provisioner interface currently requires provisioners to write to the
host's Status in order to set the ID. This is undesirable. Return the
provisioner's ID for the host from ValidateManagementAccess() and write it
to the Status in the controller.
The provisioner interface currently requires provisioners to write to the
host's Status in order to set the power status. This is undesirable. Return
the power status (if known) for the host from UpdateHardwareState(), rather
than a Result structure from which only the Dirty flag was used. Write the
updated power state to the Status in the controller.
The fixtureProvisioner object gets recreated for each Reconcile() call, so
it cannot be used to store state that is used to make different things
happen on each reconciliation. Some of this state was instead being stored
in the host Status, which could potentially confound some of the tests:
only the controller should modify the status, and it is the thing we are
trying to test.

Instead, create a Fixture structure that contains persistent state for a
particular Host and the provisioner Factory to create a provisioner that
has access to that state. Stop writing to the Host's Status from the
fixture provisioner.
There is no longer any reason for provisioners to modify the BareMetalHost
object itself; they should communicate any changes explicitly through
return values from the Provisioner API.

To prevent this starting to happen again in future, pass only a copy of the
BareMetalHost in the provisioner factory, so that no changes made by the
provisioner will ever make their way back to the controller nor be written
to the k8s API.
This function is now called on basically every reconcile, so don't log
information about things that haven't changed.
zaneb added 2 commits January 14, 2021 17:03
When the provisioner could write to the Host's Status subresource and
expect it to be persisted, we had to always treat a Dirty flag returned
from the provisioner as indicating that the Status was dirty and required a
write. (Oddly this meant the Dirty flag was treated with almost the inverse
of its intended meaning: when it was returned true the Status was rarely
actually dirty, while when it was false the action was complete which
always requires a write.)

Now that provisioners are no longer allowed to write to the Status, we only
need to request a write when the controller has updated the Status. Add a
new actionUpdate to cover the couple of cases that are not already covered
by actionComplete, and those where ClearError() returns dirty. The
actionUpdate type comprises actionContinue so that type checks for the
latter handle it correctly.

Make actionContinue return false from the Dirty() method, meaning that it
will not cause a write to the k8s API. The actionContinueNoWrite type is
removed, since it would now be identical to actionContinue.
All of the action<StateName>() methods of the reconciler are the handlers
used by the host state machine to carry out the action in the current
provisioning state, except for actionRegistering(). actionRegistering() is
called in every state (by the state machine's ensureRegistered() method)
prior to calling the action method for the current state.

Rename actionRegistering() to registerHost() to eliminate the implication
that it is an action specific to the Registering state.

Simplify the return handling by returning nil when validating management
access succeeded and there were no changes required to the Status, since
this is what we want to return from ensureRegistered() (indicating no need
to return early before executing the state's action method). This will
occur both in general operation when there is no re-registration required,
and also potentially at the conclusion of a re-registration where there was
no credentials change and errors had already been cleared (by a previous
invocation that returned actionUpdate).

At the conclusion of a registration where changes need to be written to the
Status, e.g. by clearing a previous error state or updating the known good
credentials, actionComplete is returned. This is later translated to
actionUpdate by ensureRegistered().

This enables that scenario to be distinguished from that of a change
needing to be written while registration is still in progress, which
results in actionUpdate.

This means there is no longer a need in ensureRegistered() to guess when
the registerHost() method might have resulted in a change to the Status.
@zaneb
Copy link
Copy Markdown
Member Author

zaneb commented Jan 14, 2021

/test-integration

@andfasano
Copy link
Copy Markdown
Member

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2021
@metal3-io-bot metal3-io-bot merged commit ea587bb into metal3-io:master Jan 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants