Skip to content

📖 Update KCP proposal with scale in#3857

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:masterfrom
Nordix:kcp-scale-in-proposal
Jan 7, 2021
Merged

📖 Update KCP proposal with scale in#3857
k8s-ci-robot merged 1 commit into
kubernetes-sigs:masterfrom
Nordix:kcp-scale-in-proposal

Conversation

@jan-est
Copy link
Copy Markdown
Contributor

@jan-est jan-est commented Oct 23, 2020

Currently, KCP scale out during the upgrade, and no other method is available. However, this is a very restrictive implementation in environments where the amount of resources is limited and scale out might not be possible. The main motivation for this KCP proposal update is to be able to add scale in feature to KCP implementation when this PR is approved.

This proposal is discussed in google doc and in issue 3512

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 23, 2020
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Hi @jan-est. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 23, 2020
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 23, 2020
@jan-est
Copy link
Copy Markdown
Contributor Author

jan-est commented Oct 23, 2020

/assign @fabriziopandini

@vincepri
Copy link
Copy Markdown
Member

vincepri commented Nov 2, 2020

/ok-to-test
/milestone v0.4.0

@k8s-ci-robot k8s-ci-robot added this to the v0.4.0 milestone Nov 2, 2020
@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 2, 2020
@vincepri
Copy link
Copy Markdown
Member

vincepri commented Nov 2, 2020

/kind proposal

@k8s-ci-robot k8s-ci-robot added the kind/proposal Issues or PRs related to proposals. label Nov 2, 2020
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
- Rollout operations rely on scale up and scale down which are be blocked based on Etcd and control plane health checks.

- The algorithm is the following:
- Verify control plane replica count is >= 3
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we think to make it visible to the user the fact that the rollout can be performed due to this reason?

Copy link
Copy Markdown
Contributor Author

@jan-est jan-est Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a little rethinking with the logic here. I think we don't need to be worried about the replica counts at all. Since current implementation does not even support the deployment of KCP with even replica count and returns on error:

The KubeadmControlPlane "controlplane-name" is invalid: spec.replicas: Forbidden: cannot be an even number when using managed etcd

When using stacked etcd KCP does not upgrade if replica count is for example 2. So, we can rely on the fact that replica count should be uneven when user is trying to upgrade. So I would say we only make sure that we scale up when replica count is 1 and KubeadmControlPlane.Spec.UpgradeStrategy is set. Any thoughts?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should have an explicit rule checking that current replicas >=3.

The main reason for having this explicit rule is that even do it is not allowed to specify an even number desired replica count, then in the current KCP implementation the rollout logic takes precedence on creating the missing machines, and as a result, it could happen that:

  • Set desired replica count = 3
  • Create first machine (desired replicas = 3, current replicas =1)
  • Change machine spec
  • KCP start to rollout existing machines, stopping to provision the missing ones
  • if UpgradeStrategy == scale-in, without the explicit rule above, the only existing replica will be deleted 😞

So the rule is ok for me, might be worth to specify current machines; it is only important to inform the user when this rule is preventing rollout to happen

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is ok for me. Do we have any preferences how we should inform the user?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also prevent users to do initial deployment of the KCP with:

  • UpgradeStrategy == scale-in
  • replica count < 3

For this we could add something as follows into kubeadm_control_plane_webhook.go :

	if in.Spec.UpgradeStrategy.Type == "ScaleIn" && *in.Spec.Replicas < 3 {
		allErrs = append(
			allErrs,
			field.Forbidden(
				field.NewPath("spec", "replicas"),
				"cannot set scaleIn rollout stragegy with less than three initial replicas",
			),
		)
	} 

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any preferences how we should inform the user?

We are converging on using conditions for all the user-facing messages, so this could a warning on the MachinesSpecUpToDate condition

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely opt for webhook validation in this case, which is immediate and informs the user before reconciliation starts

Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch from e7f4bae to 75c41d1 Compare November 9, 2020 11:25
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 9, 2020
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch from 75c41d1 to d95ddf4 Compare November 9, 2020 12:00
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Copy link
Copy Markdown
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pass, I like how it is shaping out

@jan-est jan-est force-pushed the kcp-scale-in-proposal branch 3 times, most recently from 8551a4a to 53665cc Compare November 11, 2020 08:58
Copy link
Copy Markdown
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last nits, otherwise lgtm for me

Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch 2 times, most recently from 3fdb787 to c5ef384 Compare November 13, 2020 11:18
Copy link
Copy Markdown
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm, few small nits not blocking

Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch from 1412fd2 to 0016ffe Compare December 8, 2020 10:40
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 8, 2020
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch from 0016ffe to 48737df Compare December 8, 2020 13:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 8, 2020
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch from 48737df to a480513 Compare December 8, 2020 13:25
@vincepri
Copy link
Copy Markdown
Member

vincepri commented Jan 4, 2021

/lgtm

/assign @detiber @CecileRobertMichon
for final approval

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2021
Copy link
Copy Markdown
Contributor

@detiber detiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few items that I think should be addressed as followup, but otherwise this lgtm.

/approve

Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated
- The kubeadmConfigSpec used by each machine at creation time is stored in annotations at machine level.
- If the annotation is not present (machine is either old or adopted), we won't roll out on any possible changes made in KCP's ClusterConfiguration given that we don't have enough information to make a decision.
Users should use KCP.Spec.UpgradeAfter field to force a rollout in this case.
Setting `MaxUnavailable` to 1 and `MaxSurge` to 0 is recommended for resource constrained environment like bare-metal, OpenStack or vSphere resource pools, etc. The algorithm is the following:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to block progress on this, but as a followup can we refine this to avoid saying that it is recommended to use these settings, but rather that one could use these settings if they are operating in a resource constrained environment and do not have sufficient capacity to use a surge based rollout?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: detiber

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 4, 2021
Comment thread docs/proposals/20191017-kubeadm-based-control-plane.md Outdated

- The controller should tolerate the manual or automatic removal of a replica during the upgrade process. A replica that fails during the upgrade may block the completion of the upgrade. Removal or other remedial action may be necessary to allow the upgrade to complete.
- KubeadmControlPlane verifies that control plane replica count is >= 3
- If replica count is less than 3 KubeadmControlPlane fallback to default way of rolling out and try to scale up.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the user's settings would be silently ignored? What if there is no available quota to scale up when the control plane count is 1, MaxUnavailable is 1, and MaxSurge is 0? Wouldn't it be better to avoid ever getting into this situation in the first place by validating the spec and rejecting this combination of values?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CecileRobertMichon I agree that using spec validation would make more sense than fallback to default rollout here.

- The kubeadmConfigSpec used by each machine at creation time is stored in annotations at machine level.
- If the annotation is not present (machine is either old or adopted), we won't roll out on any possible changes made in KCP's ClusterConfiguration given that we don't have enough information to make a decision.
Users should use KCP.Spec.UpgradeAfter field to force a rollout in this case.
Setting `MaxUnavailable` to 1 and `MaxSurge` to 0 is recommended for resource constrained environment like bare-metal, OpenStack or vSphere resource pools, etc. The algorithm is the following:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Currently KubeadmControlPlane supports only one rollout strategy type the `RollingUpdateStrategyType`. Rolling upgrade strategy's behavior can be modified by using `MaxUnavailable` and `MaxSurge` fields. Both field values can be an absolute number 0 or 1 with following rules:

- If `MaxUnavailable` is set to 0 `MaxSurge` needs to be 1 (default values)
- If `MaxUnavailable` is set to 1 `MaxSurge` needs to be 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure this validation is enforced when the proposal is implemented

@jan-est jan-est force-pushed the kcp-scale-in-proposal branch from a480513 to 65663a1 Compare January 7, 2021 07:15
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2021
@jan-est jan-est force-pushed the kcp-scale-in-proposal branch 3 times, most recently from b299d38 to 05dfd93 Compare January 7, 2021 07:36
@vincepri
Copy link
Copy Markdown
Member

vincepri commented Jan 7, 2021

/lgtm

Over to @CecileRobertMichon to unhold

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 7, 2021
@CecileRobertMichon
Copy link
Copy Markdown
Contributor

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 7, 2021
@k8s-ci-robot k8s-ci-robot merged commit 0c7d0a2 into kubernetes-sigs:master Jan 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/proposal Issues or PRs related to proposals. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants