Wait for control plane node to be ready after joining the cluster #598

Closed

aojea
Contributor

@aojea aojea commented Jun 7, 2019

It can happen that the control plane node is not completely ready after joining the cluster.
If a worker node tries to join against a control plane node that's not ready, the join fails and so does the cluster creation.
This is a workaround that waits until the control plane node is ready after it joins the cluster, before joining new nodes.

Fixes: #588
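
A minimal sketch of the kind of bounded wait described above, not the code from this PR. The names `waitForReady` and `kubectlNodeReady` are hypothetical, and the kubectl-based check assumes `kubectl` is on the PATH and already pointed at the cluster. `main` uses a fake check so the example runs without a cluster.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForReady (hypothetical helper) calls check up to attempts times,
// sleeping intervalMs milliseconds between tries. It returns nil as soon
// as check reports true without an error.
func waitForReady(check func() (bool, error), attempts int, intervalMs int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ready, err := check()
		if err == nil && ready {
			return nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(time.Duration(intervalMs) * time.Millisecond)
		}
	}
	if lastErr != nil {
		return fmt.Errorf("node not ready after %d attempts: %w", attempts, lastErr)
	}
	return errors.New("node not ready: timed out")
}

// kubectlNodeReady (hypothetical helper) returns a check that reads the
// status of the node's Ready condition through kubectl's jsonpath output.
// Assumption: kubectl is installed and configured for the target cluster.
func kubectlNodeReady(name string) func() (bool, error) {
	return func() (bool, error) {
		out, err := exec.Command("kubectl", "get", "node", name,
			"-o", `jsonpath={.status.conditions[?(@.type=="Ready")].status}`).Output()
		if err != nil {
			return false, err
		}
		return strings.TrimSpace(string(out)) == "True", nil
	}
}

func main() {
	// Fake check that becomes ready on the third poll; swap in
	// kubectlNodeReady("<node name>") to query a real cluster.
	polls := 0
	fake := func() (bool, error) {
		polls++
		return polls >= 3, nil
	}
	if err := waitForReady(fake, 10, 10); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("node ready after", polls, "polls")
}
```

Bounding the number of attempts means a node that never becomes ready surfaces as an error instead of hanging cluster creation.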

@k8s-ci-robot k8s-ci-robot added the labels cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) Jun 7, 2019
@aojea aojea changed the title from "Wait for ready after join a control plane node" to "Wait for control plane node to be ready after joining the cluster" Jun 7, 2019
@aojea aojea mentioned this pull request Jun 7, 2019
@@ -211,6 +212,16 @@ func runKubeadmJoinControlPlane(
return errors.Wrap(err, "failed to join a control plane node with kubeadm")
}

// Wait for the node to be Ready
// TODO: remove once https://github.com/kubernetes-sigs/kind/issues/588 is fixed
// kubeadm join should guarantee that the cluster is ready
Contributor Author

@fabriziopandini @neolit123 I don't know if this is true 😅. Should kubeadm join guarantee that?

Member

@aojea see my comments on the issue #588 (comment)

In a nutshell, kubeadm is not responsible; the kubelet is. Additionally, it seems that the API server does not properly detect when the etcd instance is ready.

Member

> kubeadm join should guarantee that the cluster is ready

well kubeadm can look at pod and endpoint state, but the cluster as a whole - a bit tricky.
kinder currently waits for these pods + the node ready status, as @fabriziopandini mentioned:
https://github.com/kubernetes/kubeadm/blob/62556834c87e34004ac84c17b2f2c68b5c4f3b22/kinder/pkg/actions/waiter.go#L32-L44
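
As a rough illustration of what checking "these pods + the node ready status" can involve, here is a sketch that scans the JSON output of `kubectl get ... -o json` for items whose Ready condition is not "True". It is not kinder's implementation: the `notReady` function, the struct names, and the sample pod names are assumptions made for the example. Only the `metadata.name` and `status.conditions[].type/status` fields come from the Kubernetes object format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition mirrors the type/status pair found in .status.conditions of
// both Node and Pod objects.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// object keeps only the fields this sketch needs from a Node or Pod.
type object struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// objectList matches the top-level shape of a `kubectl get ... -o json` list.
type objectList struct {
	Items []object `json:"items"`
}

// notReady (hypothetical helper) returns the names of items whose Ready
// condition is missing or has a status other than "True".
func notReady(raw []byte) ([]string, error) {
	var list objectList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	var pending []string
	for _, item := range list.Items {
		ready := false
		for _, c := range item.Status.Conditions {
			if c.Type == "Ready" && c.Status == "True" {
				ready = true
			}
		}
		if !ready {
			pending = append(pending, item.Metadata.Name)
		}
	}
	return pending, nil
}

func main() {
	// Sample data with made-up pod names; a real caller would pass the
	// output of kubectl instead.
	sample := []byte(`{"items":[
	  {"metadata":{"name":"etcd-a"},"status":{"conditions":[{"type":"Ready","status":"True"}]}},
	  {"metadata":{"name":"kube-apiserver-a"},"status":{"conditions":[{"type":"Ready","status":"False"}]}}
	]}`)
	pending, err := notReady(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("still waiting on:", pending)
}
```

Returning the names that are still pending, rather than a single boolean, makes it easier to see which component is holding up a join.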

given HA join consistently does not fail using kind 0.2.0 as seen here:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master
i'm trying to get to the bottom of the problem instead - i.e. finding a change in kind that helped the problem surface.

Contributor Author

kind is much faster in the latest versions. The fact that adding delays solves the problem, or at least reduces the failures, makes me think it's tightly related to that.

@aojea
Contributor Author

aojea commented Jun 7, 2019

/assign @neolit123
/assign @fabriziopandini
/assign @BenTheElder

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aojea
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: bentheelder

If they are not already assigned, you can assign the PR to them by writing /assign @bentheelder in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@BenTheElder
Member

Code looks fine FWIW but I'm not convinced we've fully root-caused the issue we're attempting to address yet, and I'm not a fan of adding arbitrary busy waits.

@aojea
Contributor Author

aojea commented Jun 9, 2019

Definitely not the right approach

@aojea aojea closed this Jun 9, 2019
@aojea aojea deleted the controlplane branch August 9, 2020 08:45
Development

Successfully merging this pull request may close these issues.

create HA cluster is flaky
6 participants