Skip to content

fix: wait for overprovisioned nodes to be removed from MachinePool#1367

Closed
chewong wants to merge 1 commit into
kubernetes-sigs:masterfrom
chewong:overprovisioned-nodes
Closed

fix: wait for overprovisioned nodes to be removed from MachinePool#1367
chewong wants to merge 1 commit into
kubernetes-sigs:masterfrom
chewong:overprovisioned-nodes

Conversation

@chewong
Copy link
Copy Markdown
Member

@chewong chewong commented May 5, 2021

Signed-off-by: Ernest Wong chuwon@microsoft.com

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Fixes #1361

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

NONE

Signed-off-by: Ernest Wong <chuwon@microsoft.com>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 5, 2021
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vincepri after the PR has been reviewed.
You can assign the PR to them by writing /assign @vincepri in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels May 5, 2021
@k8s-ci-robot k8s-ci-robot requested review from devigned and shysank May 5, 2021 17:20
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 5, 2021
Comment thread scripts/ci-entrypoint.sh
kubectl wait --for=condition=Ready node --all --timeout=5m
if [[ "${EXP_MACHINE_POOL:-}" == "true" ]]; then
echo "Waiting for nodes overprovisioned by the MachinePool to be removed"
while kubectl get nodes | grep 'NotReady' > /dev/null; do
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how come kubectl wait --for=condition=Ready node --all doesn't cover this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-file-machinepool-1-19/1388742406299455488, I am suspecting that the overprovisioned node was spawned after kubectl wait --for=condition=Ready node --all is finished running

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't the same thing happen here if that's the case? It would have no NotReady nodes (== all nodes are Ready as checked above) and it would proceed to the next step, making it possible for a NotReady node to appear after

Copy link
Copy Markdown
Contributor

@devigned devigned May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, we should add a condition to Azure machine pool to indicate when it reaches a stable state. In #1332, we introduce a ScaleSetDesiredReplicasCondition which is true when the desired number of ready nodes is reached. We could add another condition which would describe a stable state with only ready machines which match the desired replica count.

@chewong
Copy link
Copy Markdown
Member Author

chewong commented May 18, 2021

Closing this PR for now since I don't have bandwidth to work on it

@chewong chewong closed this May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MachinePool has more worker nodes than the desired value

4 participants