
Machine stuck in WaitingForNodeRef state when the IPv4 public network is disabled #632

Closed
Heyoxe opened this issue Mar 2, 2023 · 5 comments



Heyoxe commented Mar 2, 2023

/kind bug

What steps did you take and what happened:

  • publicNetwork.enableIPv4 is set to false in the HCloudMachineTemplate of the control plane
  • hcloudNetwork.enabled is set to true in the HetznerCluster
    When creating a cluster, the control plane machines get stuck in the WaitingForNodeRef state at the NodeHealthy step. Switching enableIPv4 back to true works as expected.
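For reference, the two settings above look roughly like this (a minimal sketch; the field paths follow the cluster-api-provider-hetzner API, but the resource names and instance type are illustrative):

```yaml
# Sketch of the reported configuration; metadata names and the
# instance type are illustrative, not from the actual cluster.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HCloudMachineTemplate
metadata:
  name: control-plane            # illustrative name
spec:
  template:
    spec:
      type: cpx31                # illustrative instance type
      publicNetwork:
        enableIPv4: false        # disable the public IPv4 address
        enableIPv6: true
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
metadata:
  name: my-cluster               # illustrative name
spec:
  hcloudNetwork:
    enabled: true                # attach machines to a private network
```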

What did you expect to happen:
The machine should not get stuck in this state.

Anything else you would like to add:

  • The cluster was provisioned by following the quick start instructions with the hcloud flavor and applying the steps above.
  • When executing kubectl describe hcloudmachine, one of the addresses is:
    Address:  <nil>
    Type:     ExternalIP
    
  • I initially discovered the issue while using the Talos bootstrap provider. At some point during the bootstrapping, the TalosControlPlane attempts to communicate with the private IP of the machine instead of the public IPv6 address; I assume the issue here is similar.

Environment:

  • cluster-api-provider-hetzner version: v1.0.0-beta.14
  • Kubernetes version: 1.23.4 (kind)
  • OS: debian-11
@batistein (Contributor)

Hi @Heyoxe, thank you very much for the issue report!
Could you please provide us with the logs of CACPK? This sounds like an issue with the control-plane provider, which is not able to determine whether the node is ready. Just for the record: I assume you had a public IPv6 address enabled?

In general, creating a cluster without any public IP is currently not supported, as CACPK performs some checks on the control-plane node.


Heyoxe commented Mar 7, 2023

Hello,
After looking at it a bit more in depth, the main issue was that github.com does not appear to support IPv6. Using DNS64/NAT64 led to another issue where the kubelet was trying to reach the IPv4 address of the LB. Using a NAT would probably solve those issues, but I haven't tried.
IPv6 was enabled on all machines (LB, control plane, worker) and the management cluster was correctly configured for dual-stack.

(For reference, what I was ultimately trying to do is create a cluster where the only public IP is the LB's, but I don't think that is currently possible.)


rbjorklin commented Apr 10, 2023

I think I'm seeing a variation of this when trying to spin up a cluster with the worker machine deployment only having private networking.

❯ k get hcloudmachines
NAME                                CLUSTER         IMAGE                                                   PLACEMENT GROUP   TYPE    STATE     READY   INSTANCEID          MACHINE
rbjorklin-com-control-plane-8m64w   rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   control-plane     cpx11   running   true    hcloud://31026421   rbjorklin-com-control-plane-2hngz
rbjorklin-com-control-plane-pt8lz   rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   control-plane     cpx11   running   true    hcloud://31025771   rbjorklin-com-control-plane-9l682
rbjorklin-com-control-plane-rsq5n   rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   control-plane     cpx11   running   true    hcloud://31025328   rbjorklin-com-control-plane-8f2t8
rbjorklin-com-md-0-n6vdq            rbjorklin-com   packer-fedora-37-1.26.3-containerd-2023-04-08T165557Z   md-0              cpx21   running   true    hcloud://31027652   rbjorklin-com-md-0-5959b7f5b8xd96g4-vhdbz

❯ k get machines
NAME                                        CLUSTER         NODENAME                            PROVIDERID          PHASE         AGE    VERSION
rbjorklin-com-control-plane-2hngz           rbjorklin-com   rbjorklin-com-control-plane-8m64w   hcloud://31026421   Running       88m    v1.26.3
rbjorklin-com-control-plane-8f2t8           rbjorklin-com   rbjorklin-com-control-plane-rsq5n   hcloud://31025328   Running       124m   v1.26.3
rbjorklin-com-control-plane-9l682           rbjorklin-com   rbjorklin-com-control-plane-pt8lz   hcloud://31025771   Running       104m   v1.26.3
rbjorklin-com-md-0-5959b7f5b8xd96g4-vhdbz   rbjorklin-com                                       hcloud://31027652   Provisioned   19m    v1.26.3

❯ k get hcloudmachines rbjorklin-com-md-0-n6vdq -o yaml | yq .status
addresses:
  - address: <nil>
    type: ExternalIP
  - address: 10.0.0.4
    type: InternalIP
conditions:
  - lastTransitionTime: "2023-04-10T01:58:53Z"
    status: "True"
    type: InstanceBootstrapReady
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    status: "True"
    type: InstanceReady
instanceState: running
ready: true

❯ k get machine rbjorklin-com-md-0-5959b7f5b8xd96g4-vhdbz -o yaml | yq .status
addresses:
  - address: <nil>
    type: ExternalIP
  - address: 10.0.0.4
    type: InternalIP
bootstrapReady: true
conditions:
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-04-10T01:58:51Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2023-04-10T02:00:38Z"
    reason: NodeProvisioning
    severity: Warning
    status: "False"
    type: NodeHealthy
infrastructureReady: true
lastUpdated: "2023-04-10T02:00:38Z"
observedGeneration: 3
phase: Provisioned

EDIT: Some further digging shows that my issue is not the same: my workers are failing to join due to a missing kubelet config.

Apr 10 04:00:19 rbjorklin-com-md-0-rb5nh kubelet[1072]: E0410 04:00:19.806507    1072 run.go:74] "command failed" err="failed to load kubelet config file, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory, path: /var/lib/kubelet/config.yaml"

EDIT2: Enabling public addresses made the workers join the cluster correctly. Is there a way to fix this or is using private networking not supported?

EDIT3: Not sure if it was setting bind-address: 0.0.0.0 on the API server, spinning up a NAT gateway, or a combination of the two that fixed it, but I've got it working now.
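For what it's worth, the bind-address tweak mentioned above can be expressed through Cluster API's KubeadmControlPlane roughly like this (a sketch only; the resource name is illustrative, and as noted above it is unclear whether this setting alone was the fix):

```yaml
# Sketch: make the kube-apiserver listen on all interfaces via
# kubeadm's ClusterConfiguration (resource name is illustrative).
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # illustrative name
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          bind-address: "0.0.0.0"  # listen on all interfaces
```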


BlakeB415 commented Aug 21, 2023

Hello,
This issue should be reopened, as it prevents IPv6-only clusters.

I believe it should only use the LB IPv4 when enableIPv4 is set to true. If enableIPv4 is explicitly false and enableIPv6 is true, it should use the IPv6 address of the LB.

I think the relevant code is located here:

defaultHost := hetznerCluster.Status.ControlPlaneLoadBalancer.IPv4

Thank you.

@BlakeB415

@batistein Is there a way we can get this fixed? Also, I noticed that the make target install-dev-prerequisites no longer exists, though it's still in the development documentation.
