
EKS node do not get IPv6 address from wicked-dhcp6 #3143

Closed
Tracked by #3210
sdomme opened this issue May 24, 2023 · 10 comments

Labels
status/research: This issue is being researched
type/bug: Something isn't working

Comments

@sdomme

sdomme commented May 24, 2023

Image I'm using: Bottlerocket OS 1.14.0 (aws-k8s-1.24)

What I expected to happen:
In an EKS IPv6 cluster I expect the nodes to get an IPv6 address on the eth0 interface (primary ENI), which also happens most of the time. For example:

bash-5.1# wicked show all
lo              device-unconfigured
      link:     #1, state up
      type:     loopback
      addr:     ipv4 127.0.0.1/8
      addr:     ipv6 ::1/128

eth0            up
      link:     #2, state up, mtu 9001
      type:     ethernet, hwaddr 0a:ff:ec:87:a6:e8
      config:   wicked:xml:/etc/wicked/ifconfig/eth0.xml
      leases:   ipv4 dhcp granted
      leases:   ipv6 dhcp granted
      addr:     ipv4 10.33.254.4/24 [dhcp]
      addr:     ipv6 2a05:d014:dad:dc05::9cc/64 [dhcp]
      route:    ipv4 default via 10.33.254.1 [dhcp]
      route:    ipv6 default via fe80::850:32ff:fe76:84 metric 1024 proto ra

What actually happened:

It happens on multiple EKS clusters (nodes are provisioned dynamically by Karpenter) that nodes don't get an IPv6 address on eth0 even though the ENI has the address configured (visible in the AWS console). Tools that use IPv6 to talk to the Kubernetes API (in our case Cilium) run into timeouts, which breaks the whole network stack.

What we could find so far:

The IPv6 lease request is stuck in "requesting"; on a working node it says "granted".

bash-5.1# wicked show eth0
eth0            up
      link:     #2, state up, mtu 9001
      type:     ethernet, hwaddr 02:ca:73:3a:e3:d0
      config:   wicked:xml:/etc/wicked/ifconfig/eth0.xml
      leases:   ipv4 dhcp granted
      leases:   ipv6 dhcp requesting
      addr:     ipv4 10.32.36.132/24 [dhcp]
      route:    ipv4 default via 10.32.36.1 [dhcp]
      route:    ipv6 default via fe80::b5:76ff:fec1:2b40 metric 1024 proto ra

Trying to get an IP manually leads to a timeout; it gives up after 10 seconds:

bash-5.1# wicked test dhcp6 eth0
wicked: eth0: Request to acquire DHCPv6 lease with UUID b53e6e64-540d-0d00-007b-020001000000 in mode auto
bash-5.1#

More verbose:

bash-5.1# /usr/libexec/wicked/bin/wickedd-dhcp6 --test --test-mode auto --log-level debug --debug most eth0
::: Reading config file /etc/wicked/common.xml
::: eth0: Found usable link-local IPv6 address: fe80::ca:73ff:fe3a:e3d0
Notice: eth0: Request to acquire DHCPv6 lease with UUID f31a6e64-d8de-0100-4864-010001000000 in mode auto
::: eth0: setting fsm timeout to 10.000 sec
::: eth0: timeout in state INIT (failure)
::: ni_dhcp6_tester_protocol_event(ev=3, dev=eth0[2], config-uuid=f31a6e64-d8de-0100-4864-010001000000)
::: no sockets left to watch
::: eth0: Deleting dhcp6 device with index 2

Interestingly, restarting wicked altogether helps to get an IP:

bash-5.1# systemctl restart wicked
bash-5.1# wicked show eth0
eth0            up
      link:     #2, state up, mtu 9001
      type:     ethernet, hwaddr 02:ca:73:3a:e3:d0
      config:   wicked:xml:/etc/wicked/ifconfig/eth0.xml
      leases:   ipv4 dhcp granted
      leases:   ipv6 dhcp granted
      addr:     ipv4 10.32.36.132/24 [dhcp]
      addr:     ipv6 2a05:d014:396:cd03::d8c3/64 [dhcp]
      route:    ipv4 default via 10.32.36.1 proto dhcp
      route:    ipv6 default via fe80::b5:76ff:fec1:2b40 metric 1024 proto ra

How to reproduce the problem:
We couldn't find a pattern for why this happens. It happens more often in clusters where we scale dynamically and replace nodes many times per day.

We were advised in aws/amazon-vpc-cni-k8s#2391 to open this issue here.

@zmrow
Contributor

zmrow commented May 24, 2023

Thanks for the issue @sdomme ! This is an interesting failure case. Appreciate the debug work you've done.

Do you have an estimate of how often it happens in your clusters? Something like "x failures in y instance launches".

On an instance that fails to get a DHCP6 lease, does wicked, wickedd, or wickedd-nanny have any errors in logs? journalctl -u wicked

I'd also be curious to see the output of the following (debug mode set to all)
wicked --log-level debug --debug all test dhcp6 eth0

Not sure wickedd-dhcp6 will provide different output, but it's worth a shot:
/usr/libexec/wicked/bin/wickedd-dhcp6 --test --test-mode auto --log-level debug --debug all eth0

@sdomme
Author

sdomme commented May 24, 2023

Hi @zmrow. Unfortunately (#2254) there is not much in the logs except for "service is starting..." etc. At the moment there is no broken node that I have access to. As soon as I have the next one, I can provide the logs and the output of the commands you mentioned.

As for the estimate, it's hard to say; it depends entirely on the cluster. Currently there is one cluster with 16 Karpenter-provisioned nodes, the oldest of which is 32h old, and one node is in a broken state. My estimate would be roughly one affected node per 12-16 nodes every 24h.

@zmrow added the status/research (This issue is being researched) label and removed the status/needs-triage (Pending triage or re-evaluation) label on May 24, 2023
@zmrow
Contributor

zmrow commented May 24, 2023

Thanks @sdomme. I've launched a bunch of instances and seem to have reproduced the same behavior you're seeing.

I'm digging into it and am looking for a fix. I'll provide additional details here when I have them.

@zmrow
Contributor

zmrow commented May 26, 2023

Update: I turned on some additional debug logging in wicked and built an image and AMI.

I had an initial theory about the timing of the accept_ra sysctl being set, but that's not really panning out. I added this sysctl to the defaults in my custom image and was still able to get an instance that failed to get an IPv6 address.

I'm digging through the additional logs of this instance and will follow up tomorrow as I'm launching more as we speak.

@zmrow
Contributor

zmrow commented May 26, 2023

Update: Continuing the search!

I found that in all cases where the instance failed to get an IPv6 address, wicked logged a message:

May 25 23:40:15 localhost wickedd-dhcp6[978]: eth0: setting initial transmit delay of 0 .. 0.000s

In code, if the transmit delay is set to 0, the function cancels the timer and doesn't continue. A DHCPv6 Solicit message is never sent and everything grinds to a halt.
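
To make that concrete, here is a minimal standalone C sketch of the failure pattern; it is not the actual wicked source, and the type and function names are invented for illustration. The point is that a guard which treats a delay of 0 as "cancel" rather than "send immediately" leaves the state machine stuck in INIT.

/*
 * Hypothetical sketch only -- not the wicked implementation. If the
 * randomized initial transmit delay comes out as 0, a guard like this
 * cancels the timer and returns early, so the first DHCPv6 Solicit is
 * never scheduled and the FSM later times out in INIT.
 */
#include <stdio.h>

struct solicit_timer { int armed; };

static void arm_solicit_timer(struct solicit_timer *t, double delay_sec)
{
    if (delay_sec <= 0.0) {   /* bug: 0 should mean "send now", not "give up" */
        t->armed = 0;
        return;
    }
    t->armed = 1;
    printf("Solicit scheduled in %.3fs\n", delay_sec);
}

int main(void)
{
    struct solicit_timer t = { 0 };
    /* mirrors the "setting initial transmit delay of 0 .. 0.000s" log line */
    arm_solicit_timer(&t, 0.0);
    if (!t.armed)
        printf("no Solicit sent; state machine stalls in INIT\n");
    return 0;
}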

@markusboehme wrote a few exploratory patches today to adjust the solicitation delay. In my testing of these patches, I was still able to produce 1 node in 2000 launches that failed to get a DHCPv6 address. The interface never seems to transition to "managed" mode, which we see in successful instances:

May 24 21:13:35 localhost wickedd-dhcp6[960]: eth0: updated dhcp6 mode from auto to managed
May 24 21:13:35 localhost wickedd-dhcp6[960]: eth0: fsm start in mode managed

We'll continue to dig into this.

@zmrow
Contributor

zmrow commented Jun 1, 2023

@sdomme good news! I think we found the culprit(s) and have a good bead on a fix. I have a PR out now for one half of the fix, and @markusboehme is working on a PR for some patches to wicked which make up the other half of the fix.

We went pretty far down the rabbit hole on this one.

As previously mentioned, we came across an issue where wicked's state machine would grind to a halt if the solicitation delay was set to 0. @markusboehme came up with a patch to fix that, but we were still able to reproduce the failure fairly reliably, with a couple of nodes per 2000 launches. With some additional digging into the wicked code, he discovered another patch in upstream wicked that polls for updates to the IPv6-related flags rather than waiting for a NEWLINK event from the kernel. This also helped the situation, but we could still reproduce the issue.
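
For illustration, the polling approach looks roughly like the sketch below. This is an assumption about the general shape of that upstream change, not the patch itself; read_ra_flags() is a hypothetical stand-in for however the flags are actually obtained (netlink, in practice).

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical helper: reports whether a router advertisement has been seen */
static bool read_ra_flags(const char *ifname)
{
    (void)ifname;
    return false;   /* stubbed out for this sketch */
}

/* re-check the flags on an interval instead of waiting for one NEWLINK event */
static bool wait_for_ra(const char *ifname, int timeout_ms, int interval_ms)
{
    for (int waited = 0; waited < timeout_ms; waited += interval_ms) {
        if (read_ra_flags(ifname))
            return true;    /* flags updated: safe to start the DHCPv6 FSM */
        usleep((useconds_t)interval_ms * 1000);
    }
    return false;           /* no RA observed before the deadline */
}

int main(void)
{
    printf("RA seen on eth0: %s\n", wait_for_ra("eth0", 1000, 100) ? "yes" : "no");
    return 0;
}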

After digging a bit deeper in wicked/kernel code we discovered my initial suspicion about the accept_ra sysctl timing was also a piece of the puzzle. The kernel sets a flag indicating a router advertisement has been received (IF_RA_RCVD), but only after it completes duplicate address detection and decides whether to send a router solicitation. If the sysctl isn't set by the time duplicate address detection completes, the solicitation doesn't happen and the IF_RA_RCVD flag doesn't get set. wicked uses this flag to decide whether or not to kick off the state machine that handles DHCP6. We are now allowing wicked to manage the "accept RA" setting by adding it to the config of the primary interface.
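
As a rough illustration of that ordering constraint (this is an assumption about the mechanics, not the actual fix, which instead lets wicked own the setting through the primary interface's config), the sysctl simply has to be in place before duplicate address detection finishes. A minimal C program writing it directly:

#include <stdio.h>

/* write a value to a sysctl file under /proc; returns 0 on success */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    int rc = (fputs(value, f) == EOF) ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    /* "2" accepts router advertisements even with forwarding enabled,
     * which matters on Kubernetes nodes that forward pod traffic */
    return write_sysctl("/proc/sys/net/ipv6/conf/eth0/accept_ra", "2") ? 1 : 0;
}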

Using my PR and the aforementioned patches to wicked I've launched 6000 instances into an IPv6 EKS cluster successfully. That makes us feel pretty confident we've nailed this down.

The wicked patches will be in PR soon.

@sdomme
Author

sdomme commented Jun 5, 2023

Hello @zmrow. That is indeed very good news. Thank you and @markusboehme for this awesome debugging.

@ajaykumarmandapati

Yet again, thanks for your in-depth debugging and for resolving this quite fast 🏆. Could you also let us know when this will be released? @zmrow and @markusboehme

@zmrow
Contributor

zmrow commented Jun 7, 2023

We don't have an exact release date yet, but want to include this in the next release. We'll follow up here when we have more details.

@stmcginnis
Contributor

Fixes merged to develop and 1.14.x branches.
