Add APIs for machine phases #531

hardikdr · 2018-10-10T14:23:30Z

What this PR does / why we need it: This PR adds the necessary machine phases and states to the machine-api stack. Machine phases and states are a way of representing the lifecycle of machines. The PR is based on the following proposal doc: proposal

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

Added APIs for machine phases

ggaurav10 · 2018-10-10T14:29:12Z

pkg/apis/cluster/common/consts.go

+	// MachineRunning should be set when machine has joined the cluster and running successfully.
+	MachineRunning MachinePhase = "Running"
+
+	// MachineRunning should be set when machine is being deleted.


Replace MachineRunning with MachineTerminating

done, thanks for pointing out.
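
For readers following the fix, here is a minimal sketch of how the two constants might read after the rename. Only `MachineRunning` and its value `"Running"` appear in the diff above; the string value `"Terminating"` is an assumption, since the hunk is cut off before the declaration.

```go
package main

import "fmt"

// MachinePhase is a label for a point in a machine's lifecycle.
type MachinePhase string

const (
	// MachineRunning should be set when the machine has joined the cluster
	// and is running successfully.
	MachineRunning MachinePhase = "Running"

	// MachineTerminating should be set when the machine is being deleted.
	// The value "Terminating" is assumed; the diff hunk does not show it.
	MachineTerminating MachinePhase = "Terminating"
)

func main() {
	fmt.Println(MachineRunning, MachineTerminating)
}
```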

scruplelesswizard · 2018-10-11T08:31:53Z

/ok-to-test

scruplelesswizard

Just a nit from me

scruplelesswizard · 2018-10-11T08:40:10Z

pkg/apis/cluster/common/consts.go

+)
+
+// MachineState is the current status of the last performed operation.
+type MachineState string


nit: MachineOperationState would be a better name as it represents the state of the operation, not of the machine

Yes, that makes sense, thanks.
Updating the occurrences wherever needed.

alvaroaleman · 2018-10-11T09:36:32Z

Background for the failing tests: The timeout for the test controlplane is too low, so it often fails to start: 2018/10/11 09:27:09 timeout waiting for process kube-apiserver to start

This already happened multiple times on different PRs. For the time being the workaround is to hit /retest until they pass.

I've created a PR in kubebuilder to get this configurable so we can increase the timeout: kubernetes-sigs/controller-runtime#169 It will however take some time until we can use that, because even when that PR is merged we have to wait for the next release.

hardikdr · 2018-10-11T10:02:02Z

/retest

hardikdr · 2018-10-11T10:08:22Z

thanks, @alvaroaleman for the information, I will then retry after some time.

scruplelesswizard · 2018-10-11T21:43:41Z

/retest

alvaroaleman · 2018-10-11T21:57:15Z

@hardikdr Actually, the failures here are real; check out the logs.

hardikdr · 2018-10-12T14:27:33Z

@alvaroaleman Yes, I just corrected it and test-check seems to be passing. Thanks for the help.

davidewatson · 2018-10-14T17:31:31Z

/lgtm

alvaroaleman · 2018-10-14T19:15:02Z

@hardikdr Can you quickly explain what this allows us to do that we currently cannot do?

Because:

- The machine-controller itself should only take the .Spec into account, not the .Status
- Other controllers like the machineSet controller basically care only about one condition: Is the associated Node existing and/or ready? They can check that via .Status.NodeRef already: https://github.com/kubernetes-sigs/cluster-api/blob/master/pkg/controller/machineset/status.go#L55
- For humans, there are events

Is there anything I missed in the above list? Is there anything I got wrong in the above list?

hardikdr · 2018-10-17T06:14:41Z

Yes sure.
The idea is to basically define the life-cycle of the machines in a descriptive manner.
I see use cases mainly from 2 broad categories: Expressiveness and Consumption.

1. Expressiveness:

It would be nice to have a framework that expresses the different lifecycle events of machines as accurately as possible.
Taking a subtle example:
A machine gets into the Failed phase. There could be multiple reasons behind it, for instance:
- The kubelet has not contacted the API server for the last 10 minutes.
- Or disk pressure has been high for the last ~30 minutes, due to which apps are suffering.
- Or machine creation itself has failed, due to an error from the cloud provider or, say, a misconfiguration of the software stack on the machine.

It seems it will be great to express all three scenarios in a structured fashion.

For the first two cases, we could set MStatus.LastOperation to Health-check and use MStatus.LastOperation.State/Description to state clearly that it failed because the kubelet did not respond in the first case and because disk pressure was high in the second.
The third case is different in that creation itself failed; LastOperation could then be set to Create, and LO.State/Description could be set:
- to say that the cloud provider could not create the machine,
- or that the cloud provider created the machine but the kubelet could not register itself.
Essentially, the proposed framework offers a means of expressing the current state of the machine. On top of that, the LastOperation field also records the most recent operation performed on the machine.

2. Consumption:

A machine could go through many different phases during its lifespan, and the MachineSet controller or a user might want to take different actions based on the exact phase.
For instance:
MachineSet controller might not want to delete the machine with Unknown phase, but rather give some more time to machine to connect back within timeout period.
MachineSet controller might want to delete the machine immediately and replace it with a newer one if it's in the Failed phase, assuming the machine has already been given enough time to join back.
MachineSet controller might want to ignore machines in the Standby phase in the case of bare metal, and more.
Essentially, the set of phases should be well defined so that the MachineSet controller or other higher-level controllers can take the right actions.

To answer the question: Can you quickly explain what this allows us to do that we currently cannot do?

I would say, in current model we do not have means of expressing the phases in detailed and structured way - as described in use case 1 above.
And also we do not have well-defined set of inputs for machineSet/other-controllers to take the right actions based on the current status of the machine. - use case 2.

The machine-shared-controller and machine-external-controller would still require a contract for updating the machineStatus, but that's a separate thread. This turned out to be a long one, but I hope it makes sense :) Feedback is most welcome.
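
The consumption use case in the comment above can be sketched roughly as follows. This is only an illustration of the idea, not code from the PR: the `Action` type, the `decide` function, and the exact set of phase constants are hypothetical names, chosen to mirror the three examples given (Unknown, Failed, Standby).

```go
package main

import "fmt"

// MachinePhase mirrors the phase type discussed in this PR.
type MachinePhase string

// Phase values below follow the names used in the discussion; the exact
// constants in the PR may differ.
const (
	MachineRunning MachinePhase = "Running"
	MachineUnknown MachinePhase = "Unknown"
	MachineFailed  MachinePhase = "Failed"
	MachineStandby MachinePhase = "Standby"
)

// Action is a hypothetical description of what a MachineSet-like
// controller would do with a machine.
type Action string

const (
	ActionKeep    Action = "keep"    // nothing to do
	ActionWait    Action = "wait"    // give the machine time to reconnect
	ActionReplace Action = "replace" // delete and create a new machine
	ActionIgnore  Action = "ignore"  // leave it alone
)

// decide maps a phase to an action, following the three examples above.
func decide(phase MachinePhase) Action {
	switch phase {
	case MachineUnknown:
		return ActionWait
	case MachineFailed:
		return ActionReplace
	case MachineStandby:
		return ActionIgnore
	default:
		return ActionKeep
	}
}

func main() {
	for _, p := range []MachinePhase{MachineRunning, MachineUnknown, MachineFailed, MachineStandby} {
		fmt.Printf("%s -> %s\n", p, decide(p))
	}
}
```

The point of the sketch is that the controller switches on a single well-defined input rather than inferring the situation from several other fields.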

alvaroaleman · 2018-10-17T09:37:37Z

Thanks for the detailed answer, @hardikdr !

It seems it will be great to express all three scenarios in a structured fashion.

We can derive all this information from .Status.NodeRef: failed machine creation from the fact that the .Status.NodeRef does not exist or does not point to a valid node, the two other examples you mentioned are available as conditions on the Node object.

MachineSet controller might not want to delete the machine with Unknown phase, but rather give some more time to machine to connect back within timeout period.

I don't think it should be the concern of the MachineSet controller to re-create cloud provider instances when the node gets lost or never connects, that is something the Machine controller does/should do. And again, all this info here is available from the Machine object or the referenced Node object.

I would say, in current model we do not have means of expressing the phases in detailed and structured way - as described in use case 1 above

Can you explain why this is needed? For users we can use events, for other controllers do you have an example use-case where the information available via the .Status.NodeRef is not sufficient? Because to me it feels like we are duplicating the Node object and its features here (conditions, heartbeats)

And also we do not have well-defined set of inputs for machineSet/other-controllers to take the right actions based on the current status of the machine

Basically the same question as above: do you have an example of a controller that needs this? The machineSet controller only needs to know whether a node exists and whether it's healthy; this works perfectly fine with the existing .Status.NodeRef
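
To make the NodeRef-based alternative concrete, here is a rough sketch of deriving a machine's health without a phase field. The `Machine`, `Node`, and `NodeCondition` types are simplified stand-ins defined only for this example (the real objects are Kubernetes API types), and `deriveHealth` is a hypothetical helper, not a function from the repository.

```go
package main

import "fmt"

// NodeCondition and Node are simplified stand-ins for the Kubernetes
// Node types, kept local so the example is self-contained.
type NodeCondition struct {
	Type   string
	Status string
}

type Node struct {
	Name       string
	Conditions []NodeCondition
}

// Machine is a simplified stand-in; NodeRef holds the name of the
// referenced node, or nil if no node has been associated yet.
type Machine struct {
	NodeRef *string
}

// Health is a hypothetical summary derived from NodeRef and conditions.
type Health string

const (
	HealthNoNode   Health = "NoNode"   // NodeRef missing or not pointing to a known node
	HealthNotReady Health = "NotReady" // node exists but is not Ready
	HealthReady    Health = "Ready"    // node exists and reports Ready=True
)

// deriveHealth inspects only the NodeRef and the referenced node's conditions.
func deriveHealth(m Machine, nodes map[string]Node) Health {
	if m.NodeRef == nil {
		return HealthNoNode
	}
	node, ok := nodes[*m.NodeRef]
	if !ok {
		return HealthNoNode
	}
	for _, c := range node.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return HealthReady
		}
	}
	return HealthNotReady
}

func main() {
	name := "node-a"
	nodes := map[string]Node{
		"node-a": {Name: "node-a", Conditions: []NodeCondition{{Type: "Ready", Status: "True"}}},
	}
	fmt.Println(deriveHealth(Machine{}, nodes))
	fmt.Println(deriveHealth(Machine{NodeRef: &name}, nodes))
}
```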

hardikdr · 2018-10-22T19:55:51Z

thanks for the comments @alvaroaleman .

We can derive all this information from .Status.NodeRef: failed machine creation from the fact that the .Status.NodeRef does not exist or does not point to a valid node,

Yes, NodeRef is essential and provides a lot of information, but I am not sure there is a way to tell from NodeRef's existence whether a creation, deletion, or other operation has failed or succeeded on a particular machine. Inferring such outcomes only from whether NodeRef is present, or whether it points to the wrong machine, seems less desirable.

I don't think it should be the concern of the MachineSet controller to re-create cloud provider instances when the node gets lost or never connects, that is something the Machine controller does/should do.

I would rather expect the machine-controller to be mostly responsible for creating and deleting machines and for reporting the right health status via the MachineStatus field. The MachineSet controller should then look for Failed machines and try to replace them. This is from the perspective that the MachineSet controller should only ensure the desired number of healthy machines and take the necessary steps, and should not race with the MachineController over recreating machines.

Can you explain why this is needed? For users we can use events, for other controllers do you have an example use-case where the information available via the .Status.NodeRef is not sufficient?

Fundamentally, NodeRef is bound by the possible values available in the Node object. We would definitely want to expand our use cases to include new phases such as Draining, Standby (for bare metal), and more.

davidewatson · 2018-10-23T13:51:55Z

MachineOperationType and MachineOperationState can be useful for GUIs built on top of the ClusterAPI. In order to determine if an upgrade is complete, we currently compare the Version.Kubelet and Version.ControlPlane with the version reported by the Node. This is somewhat limited since kubelet versions are not the only reason for an upgrade.

Others have suggested using Events instead. However, can events be lost, for example if a controller reboots?

From a user perspective I am not sure there is a more reliable way to answer these questions. Is the machine still upgrading, is it being deleted, etc? This is valuable for user feedback within a GUI.

roberthbailey · 2018-10-31T17:18:52Z

As discussed during the meeting today, we will merge this in ~24 hours unless there are objections before then.

/approve
/hold

scruplelesswizard · 2018-11-22T10:13:58Z

/retest

pkg/apis/cluster/v1alpha1/machine_types.go

hardikdr · 2018-11-22T12:00:50Z

@chaosaffe thanks for the suggestions. All of them look good to me. I also made the necessary changes.

scruplelesswizard · 2018-11-22T12:20:01Z

/lgtm

alvaroaleman · 2018-11-22T12:43:56Z

pkg/apis/cluster/v1alpha1/machine_types.go

+	// specific machine. It should also convey the state of the latest-operation for example if 
+	// it is still on-going, failed or completed successfully.
+	// +optional
+	LastOperation LastOperation `json:"lastOperation,omitempty"`


Please make this a pointer, as it's an optional struct.

yes sure, done.
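
For context on why a pointer matters here: with Go's `encoding/json`, `omitempty` never omits a struct value, only a nil pointer. The sketch below shows the difference. The two wrapper types are hypothetical and exist only for the comparison; `LastOperation` is reduced to a single `description` field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LastOperation is trimmed down to one field for this example.
type LastOperation struct {
	Description string `json:"description,omitempty"`
}

// statusWithValue embeds the struct by value, as in the original diff.
type statusWithValue struct {
	LastOperation LastOperation `json:"lastOperation,omitempty"`
}

// statusWithPointer uses a pointer, as requested in the review.
type statusWithPointer struct {
	LastOperation *LastOperation `json:"lastOperation,omitempty"`
}

// marshal is a small helper that returns the JSON encoding as a string.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(statusWithValue{}))   // {"lastOperation":{}}
	fmt.Println(marshal(statusWithPointer{})) // {}
}
```

With the value type, an unset operation still serializes as an empty object; with the pointer, the field is absent, which matches the `+optional` marker.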

alvaroaleman · 2018-11-22T12:59:08Z

/lgtm

hardikdr · 2018-11-22T13:05:43Z

/retest

roberthbailey · 2018-11-25T05:18:12Z

It looks like we've pretty much reached consensus on this PR. Let's make sure there are no further comments or objections at the next meeting and then merge it if all looks good.

sidharthsurana

minor naming suggestion, otherwise lgtm

sidharthsurana · 2018-11-26T19:27:09Z

config/crds/cluster_v1alpha1_machine.yaml

+              properties:
+                description:
+                  type: string
+                lastUpdateTime:


Nit: Can we change the name lastUpdateTime to lastUpdated just to be consistent with the similar fields in other places and objects.

done, thanks.

sidharthsurana · 2018-11-28T18:51:17Z

/LGTM

k8s-ci-robot · 2018-11-28T18:51:25Z

@sidharthsurana: changing LGTM is restricted to assignees, and only kubernetes-sigs/cluster-api repo collaborators may be assigned issues.

In response to this:

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

scruplelesswizard · 2018-11-28T19:30:56Z

/lgtm


roberthbailey · 2018-11-28T19:45:38Z

/approve

roberthbailey · 2018-11-28T19:45:49Z

/hold cancel

k8s-ci-robot · 2018-11-28T19:45:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hardikdr, roberthbailey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [roberthbailey]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

derekwaynecarr · 2018-12-01T18:32:33Z

I apologize for the late question, but is there a write-up that summarizes a response to Daniel's feedback here:
https://docs.google.com/document/d/12TsBPn1lfMk50_yydzbXNZ9PT-8-88o4tSNi7eqPCVg/edit?disco=AAAACWAG9QI

Similar to him, I worry about the use of a single phase, based on experience with pods.

roberthbailey · 2019-01-24T00:14:37Z

@hardikdr -- can you answer @derekwaynecarr's question? I know that you had a chance to connect with @lavalamp at KubeCon and chat about phases / conditions / etc.

k8s-ci-robot requested review from medinatiger and roberthbailey October 10, 2018 14:23

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 10, 2018

ggaurav10 reviewed Oct 10, 2018

hardikdr force-pushed the machine-phase-api branch from e3a39cf to f4c713c Compare October 10, 2018 15:14

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 11, 2018

scruplelesswizard suggested changes Oct 11, 2018

hardikdr force-pushed the machine-phase-api branch 2 times, most recently from 5f36ff3 to 7426220 Compare October 11, 2018 09:24

hardikdr force-pushed the machine-phase-api branch from 7426220 to d58060c Compare October 12, 2018 14:22

k8s-ci-robot assigned davidewatson Oct 14, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 14, 2018

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 31, 2018

scruplelesswizard suggested changes Nov 22, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2018

alvaroaleman reviewed Nov 22, 2018

hardikdr force-pushed the machine-phase-api branch from 45812f1 to 1e19979 Compare November 22, 2018 12:55

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2018

k8s-ci-robot assigned alvaroaleman Nov 22, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2018

sidharthsurana reviewed Nov 26, 2018

Apply suggestions from code review

a6c4f92

Co-Authored-By: hardikdr <[email protected]>

hardikdr force-pushed the machine-phase-api branch from 1e19979 to a6c4f92 Compare November 27, 2018 08:28

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 28, 2018

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018

k8s-ci-robot merged commit 5a27d98 into kubernetes-sigs:master Nov 28, 2018

davidewatson mentioned this pull request Jun 10, 2019

Machine States & Preboot Bootstrapping #997

Merged

jayunit100 pushed a commit to jayunit100/cluster-api that referenced this pull request Jan 31, 2020

Merge pull request kubernetes-sigs#531 from ykakarap/getting_started_…

8bf015d

…doc_update note about capv master
