📖 Update KCP proposal with scale in #3857
Hi @jan-est. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @fabriziopandini
/ok-to-test
/kind proposal
- Rollout operations rely on scale up and scale down, which are blocked based on etcd and control plane health checks.
- The algorithm is the following:
- Verify control plane replica count is >= 3
How do we plan to make it visible to the user that the rollout cannot be performed for this reason?
I rethought the logic here a little. I think we don't need to worry about replica counts at all, since the current implementation does not even allow deploying KCP with an even replica count and returns an error:
The KubeadmControlPlane "controlplane-name" is invalid: spec.replicas: Forbidden: cannot be an even number when using managed etcd
With stacked etcd, KCP does not upgrade if the replica count is, for example, 2. So we can rely on the replica count being odd when the user tries to upgrade. I would say we only need to make sure that we scale up when the replica count is 1 and KubeadmControlPlane.Spec.UpgradeStrategy is set. Any thoughts?
IMO we should have an explicit rule checking that current replicas >= 3.
The main reason for this explicit rule is that, even though an even desired replica count is not allowed, in the current KCP implementation the rollout logic takes precedence over creating the missing machines. As a result, the following could happen:
- Set desired replica count = 3
- Create first machine (desired replicas = 3, current replicas = 1)
- Change machine spec
- KCP starts to roll out existing machines and stops provisioning the missing ones
- If UpgradeStrategy == scale-in, without the explicit rule above, the only existing replica will be deleted 😞
So the rule is OK for me; it might be worth specifying that it applies to current machines. The only important thing is to inform the user when this rule is preventing the rollout from happening.
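The scenario above can be sketched as a small decision function. This is only an illustration of the proposed guard, not KCP code: the `rolloutAction` type, its constants, and `nextRolloutAction` are hypothetical names, and the sketch assumes the check is made against the number of existing machines rather than the desired replica count.

```go
package main

import "fmt"

// rolloutAction is a hypothetical type for this sketch; it is not a KCP type.
type rolloutAction string

const (
	actionScaleUp   rolloutAction = "scale-up"   // default rollout: add a machine first
	actionScaleDown rolloutAction = "scale-down" // scale-in rollout: remove a machine first
	actionBlocked   rolloutAction = "blocked"    // rollout must wait; the user should be informed
)

// nextRolloutAction picks the first step of a rollout.
// currentReplicas is the number of existing control plane machines,
// which may be lower than the desired replica count while provisioning.
func nextRolloutAction(scaleIn bool, currentReplicas int) rolloutAction {
	if !scaleIn {
		return actionScaleUp
	}
	// Explicit rule from the discussion: never scale in below three
	// existing machines, otherwise the only replica could be deleted.
	if currentReplicas < 3 {
		return actionBlocked
	}
	return actionScaleDown
}

func main() {
	// desired replicas = 3, current replicas = 1, machine spec changed
	fmt.Println(nextRolloutAction(true, 1))
	// all three machines exist
	fmt.Println(nextRolloutAction(true, 3))
	// default strategy
	fmt.Println(nextRolloutAction(false, 1))
}
```

With the guard in place, the "desired = 3, current = 1" case ends up blocked instead of deleting the only machine, which is the point where a user-facing message would be needed.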
That is OK for me. Do we have any preferences for how we should inform the user?
Should we also prevent users from doing an initial deployment of KCP with:
- UpgradeStrategy == scale-in
- replica count < 3
For this we could add something like the following to kubeadm_control_plane_webhook.go:
    // Reject the scale-in strategy when fewer than three replicas are requested.
    // The nil check avoids dereferencing an unset Replicas pointer.
    if in.Spec.UpgradeStrategy.Type == "ScaleIn" && in.Spec.Replicas != nil && *in.Spec.Replicas < 3 {
        allErrs = append(
            allErrs,
            field.Forbidden(
                field.NewPath("spec", "replicas"),
                "cannot set scaleIn rollout strategy with fewer than three initial replicas",
            ),
        )
    }
Do we have any preferences how we should inform the user?
We are converging on using conditions for all the user-facing messages, so this could be a warning on the MachinesSpecUpToDate condition.
We should definitely opt for webhook validation in this case, which is immediate and informs the user before reconciliation starts
Force-pushed from e7f4bae to 75c41d1
Force-pushed from 75c41d1 to d95ddf4
fabriziopandini left a comment:
Another pass, I like how it is shaping up.
Force-pushed from 8551a4a to 53665cc
fabriziopandini left a comment:
Last nits, otherwise lgtm for me.
Force-pushed from 3fdb787 to c5ef384
fabriziopandini left a comment:
Overall lgtm, a few small non-blocking nits.
Force-pushed from c5ef384 to e4ab181
Force-pushed from 1412fd2 to 0016ffe
Force-pushed from 0016ffe to 48737df
Force-pushed from 48737df to a480513
/lgtm
/assign @detiber @CecileRobertMichon
detiber left a comment:
A few items that I think should be addressed as followup, but otherwise this lgtm.
/approve
- The kubeadmConfigSpec used by each machine at creation time is stored in annotations at machine level.
- If the annotation is not present (machine is either old or adopted), we won't roll out on any possible changes made in KCP's ClusterConfiguration given that we don't have enough information to make a decision.
Users should use the KCP.Spec.UpgradeAfter field to force a rollout in this case.
Setting `MaxUnavailable` to 1 and `MaxSurge` to 0 is recommended for resource-constrained environments like bare-metal, OpenStack or vSphere resource pools, etc. The algorithm is the following:
I don't want to block progress on this, but as a followup can we refine this to avoid saying that it is recommended to use these settings, but rather that one could use these settings if they are operating in a resource constrained environment and do not have sufficient capacity to use a surge based rollout?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: detiber
- The controller should tolerate the manual or automatic removal of a replica during the upgrade process. A replica that fails during the upgrade may block the completion of the upgrade. Removal or other remedial action may be necessary to allow the upgrade to complete.
- KubeadmControlPlane verifies that control plane replica count is >= 3
- If the replica count is less than 3, KubeadmControlPlane falls back to the default rollout behavior and tries to scale up.
Does this mean the user's settings would be silently ignored? What if there is no available quota to scale up when the control plane count is 1, MaxUnavailable is 1, and MaxSurge is 0? Wouldn't it be better to avoid ever getting into this situation in the first place by validating the spec and rejecting this combination of values?
@CecileRobertMichon I agree that using spec validation would make more sense than falling back to the default rollout here.
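A minimal sketch of what such a spec validation could look like, assuming the rejected combination is "no surge allowed and fewer than three replicas". The function name `validateScaleInReplicas` and the error text are hypothetical; a real webhook would work on the KubeadmControlPlane types and return field errors instead.

```go
package main

import "fmt"

// validateScaleInReplicas rejects a spec that could never roll out:
// with maxSurge == 0 the controller may not add a machine first, and with
// fewer than three replicas it must not remove one either.
// Both parameters are plain ints here to keep the sketch self-contained.
func validateScaleInReplicas(replicas, maxSurge int) error {
	if maxSurge == 0 && replicas < 3 {
		return fmt.Errorf(
			"spec.replicas: Forbidden: maxSurge 0 requires at least 3 replicas, got %d",
			replicas,
		)
	}
	return nil
}

func main() {
	// The combination raised in the review: 1 replica, no surge allowed.
	fmt.Println(validateScaleInReplicas(1, 0))
	// Three replicas with no surge is accepted.
	fmt.Println(validateScaleInReplicas(3, 0))
}
```

Rejecting the spec up front means the user's settings are never silently ignored, which was the concern with the fallback behavior.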
- The kubeadmConfigSpec used by each machine at creation time is stored in annotations at machine level.
- If the annotation is not present (machine is either old or adopted), we won't roll out on any possible changes made in KCP's ClusterConfiguration given that we don't have enough information to make a decision.
Users should use the KCP.Spec.UpgradeAfter field to force a rollout in this case.
Setting `MaxUnavailable` to 1 and `MaxSurge` to 0 is recommended for resource-constrained environments like bare-metal, OpenStack or vSphere resource pools, etc. The algorithm is the following:
Currently KubeadmControlPlane supports only one rollout strategy type, `RollingUpdateStrategyType`. The rolling upgrade strategy's behavior can be modified using the `MaxUnavailable` and `MaxSurge` fields. Both field values can be an absolute number, 0 or 1, with the following rules:
- If `MaxUnavailable` is set to 0, `MaxSurge` needs to be 1 (default values)
- If `MaxUnavailable` is set to 1, `MaxSurge` needs to be 0
Let's make sure this validation is enforced when the proposal is implemented
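As a rough illustration of the two rules quoted above, the check could be as small as the following. `validateRollingUpdate` is a hypothetical helper taking plain integers; the actual fields and their types in the implementation may differ.

```go
package main

import "fmt"

// validateRollingUpdate accepts only the two combinations listed in the
// proposal text: (MaxUnavailable=0, MaxSurge=1) and (MaxUnavailable=1, MaxSurge=0).
func validateRollingUpdate(maxUnavailable, maxSurge int) error {
	switch {
	case maxUnavailable == 0 && maxSurge == 1:
		return nil // default values: surge-based rollout
	case maxUnavailable == 1 && maxSurge == 0:
		return nil // scale-in rollout
	default:
		return fmt.Errorf(
			"invalid rollout strategy: maxUnavailable=%d, maxSurge=%d",
			maxUnavailable, maxSurge,
		)
	}
}

func main() {
	fmt.Println(validateRollingUpdate(0, 1))
	fmt.Println(validateRollingUpdate(1, 0))
	fmt.Println(validateRollingUpdate(1, 1))
}
```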
Force-pushed from a480513 to 65663a1
Force-pushed from b299d38 to 05dfd93
Force-pushed from 05dfd93 to 2563145
/lgtm
Over to @CecileRobertMichon to unhold
/lgtm
Currently, KCP scales out during an upgrade, and no other method is available. This is very restrictive in environments where resources are limited and scaling out might not be possible. The main motivation for this KCP proposal update is to make it possible to add a scale-in feature to the KCP implementation once this PR is approved.
This proposal is discussed in a Google doc and in issue 3512.