
Machine stuck in WaitingForNodeRef state when the IPv4 public network is disabled #632

Closed
Heyoxe opened this issue Mar 2, 2023 · 5 comments



Heyoxe commented Mar 2, 2023

/kind bug

What steps did you take and what happened:

  • publicNetwork.enableIPv4 is set to false in the HCloudMachineTemplate of the control plane
  • hcloudNetwork.enabled is set to true in the HetznerCluster
    When creating a cluster, the control plane machines get stuck in the WaitingForNodeRef state at the NodeHealthy step. Switching enableIPv4 back to true works as expected.
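For reference, the two settings above look roughly like this (a minimal sketch; the field paths follow the cluster-api-provider-hetzner API, but the resource names and instance type are illustrative):

```yaml
# Sketch of the reported configuration; metadata names and the
# instance type are illustrative, not from the actual cluster.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudMachineTemplate
metadata:
  name: control-plane            # illustrative name
spec:
  template:
    spec:
      type: cpx31                # illustrative instance type
      publicNetwork:
        enableIPv4: false        # disable the public IPv4 address
        enableIPv6: true
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
metadata:
  name: my-cluster               # illustrative name
spec:
  hcloudNetwork:
    enabled: true                # attach machines to a private network
```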

What did you expect to happen:
The machine should not get stuck in this state.

Anything else you would like to add:

  • The cluster was provisioned by following the quick start instructions with the hcloud flavor and applying the steps above.
  • When executing kubectl describe hcloudmachine, one of the addresses is:
    Address:  <nil>
    Type:     ExternalIP
    
  • I initially discovered the issue while using the Talos bootstrap provider. At some point during the bootstrapping, the TalosControlPlane attempts to communicate with the private IP of the machine instead of the public IPv6 address; I assume the issue here is similar.

Environment:

  • cluster-api-provider-hetzner version: v1.0.0-beta.14
  • Kubernetes version: 1.23.4 (kind)
  • OS: debian-11
@batistein (Contributor)

Hi @Heyoxe, thank you very much for the issue report!
Could you please provide us with the logs of CACPK? This sounds like an issue with the control-plane provider, which is not able to determine whether the node is ready. Just for the record: I assume you had a public IPv6 address enabled?

In general, creating a cluster without any public IP is currently not supported, as CACPK performs some checks on the control-plane node.


Heyoxe commented Mar 7, 2023

Hello,
After looking at it a bit more in depth, the main issue was that github.com does not appear to support IPv6. Using DNS64/NAT64 led to another issue where the kubelet was trying to reach the IPv4 address of the LB. Using a NAT would probably solve those issues, but I haven't tried.
IPv6 was enabled on all machines (LB, control plane, worker) and the management cluster was correctly configured for dual-stack.

(For reference, what I was ultimately trying to do is create a cluster where the only public IP is the LB's, but I don't think that is currently possible.)


rbjorklin commented Apr 10, 2023

I think I'm seeing a variation of this when trying to spin up a cluster with the worker machine deployment only having private networking.

❯ k get hcloudmachines
NAME                                CLUSTER         IMAGE                                                   PLACEMENT GROUP   TYPE    STATE     READY   INSTANCEID          MACHINE
rbjorklin-com-control-plane-8m64w   rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   control-plane     cpx11   running   true    hcloud://31026421   rbjorklin-com-control-plane-2hngz
rbjorklin-com-control-plane-pt8lz   rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   control-plane     cpx11   running   true    hcloud://31025771   rbjorklin-com-control-plane-9l682
rbjorklin-com-control-plane-rsq5n   rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   control-plane     cpx11   running   true    hcloud://31025328   rbjorklin-com-control-plane-8f2t8
rbjorklin-com-md-0-n6vdq            rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   md-0              cpx21   running   true    hcloud://31027652   rbjorklin-com-md-0-5959b7f5b8xd96g4-vhdbz

❯ k get machines
NAME                                        CLUSTER         NODENAME                            PROVIDERID          PHASE         AGE    VERSION
rbjorklin-com-control-plane-2hngz           rbjorklin-com   rbjorklin-com-control-plane-8m64w   hcloud://31026421   Running       88m    v1.26.3
rbjorklin-com-control-plane-8f2t8           rbjorklin-com   rbjorklin-com-control-plane-rsq5n   hcloud://31025328   Running       124m   v1.26.3
rbjorklin-com-control-plane-9l682           rbjorklin-com   rbjorklin-com-control-plane-pt8lz   hcloud://31025771   Running       104m   v1.26.3
rbjorklin-com-md-0-5959b7f5b8xd96g4-vhdbz   rbjorklin-com                                       hcloud://31027652   Provisioned   19m    v1.26.3

❯ k get hcloudmachines rbjorklin-com-md-0-n6vdq -o yaml | yq .status
addresses:
  - address: <nil>
    type: ExternalIP
  - address: 10.0.0.4
    type: InternalIP
conditions:
  - lastTransitionTime: "2023-04-10T01:58:53Z"
    status: "True"
    type: InstanceBootstrapReady
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    status: "True"
    type: InstanceReady
instanceState: running
ready: true

❯ k get machine rbjorklin-com-md-0-5959b7f5b8xd96g4-vhdbz -o yaml | yq .status
addresses:
  - address: <nil>
    type: ExternalIP
  - address: 10.0.0.4
    type: InternalIP
bootstrapReady: true
conditions:
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-04-10T01:58:51Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    reason: NodeProvisioning
    severity: Warning
    status: "False"
    type: NodeHealthy
infrastructureReady: true
lastUpdated: "2023-04-10T02:00:38Z"
observedGeneration: 3
phase: Provisioned

EDIT: Some further digging shows that my issue is not the same: my workers are failing to join due to a missing kubelet config.

Apr 10 04:00:19 rbjorklin-com-md-0-rb5nh kubelet[1072]: E0410 04:00:19.806507    1072 run.go:74] "command failed" err="failed to load kubelet config file, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory, path: /var/lib/kubelet/config.yaml"

EDIT2: Enabling public addresses made the workers join the cluster correctly. Is there a way to fix this or is using private networking not supported?

EDIT3: Not sure if it was setting bind-address: 0.0.0.0 on the API server, spinning up a NAT gateway, or a combination of the two that fixed it, but I've got it working now.
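For what it's worth, the bind-address tweak mentioned above can be expressed through Cluster API's KubeadmControlPlane roughly like this (a sketch only; the resource name is illustrative, and as noted above it is unclear whether this setting alone was the fix):

```yaml
# Sketch: make the kube-apiserver listen on all interfaces via
# kubeadm's ClusterConfiguration (resource name is illustrative).
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          bind-address: "0.0.0.0"  # listen on all interfaces
```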


BlakeB415 commented Aug 21, 2023

Hello,
This issue should be reopened, as it prevents IPv6-only clusters.

I believe it should only use the LB IPv4 when enableIPv4 is set to true. If enableIPv4 is explicitly false and enableIPv6 is true, it should use the IPv6 address of the LB.

I think the relevant code is located here:

defaultHost := hetznerCluster.Status.ControlPlaneLoadBalancer.IPv4

Thank you.

@BlakeB415

@batistein Is there a way we can get this fixed? Also, I noticed that the make target install-dev-prerequisites no longer exists, though it's still in the development documentation.
