
EKS node do not get IPv6 address from wicked-dhcp6 #3143

Closed
Tracked by #3210
sdomme opened this issue May 24, 2023 · 10 comments

Labels
status/research: This issue is being researched
type/bug: Something isn't working

Comments

@sdomme

sdomme commented May 24, 2023

Image I'm using: Bottlerocket OS 1.14.0 (aws-k8s-1.24)

What I expected to happen:
In an EKS IPv6 cluster I expect the nodes to get an IPv6 address on the eth0 interface (primary ENI), which also happens most of the time. For example:

bash-5.1# wicked show all
lo              device-unconfigured
      link:     #1, state up
      type:     loopback
      addr:     ipv4 127.0.0.1/8
      addr:     ipv6 ::1/128

eth0            up
      link:     #2, state up, mtu 9001
      type:     ethernet, hwaddr 0a:ff:ec:87:a6:e8
      config:   wicked:xml:/etc/wicked/ifconfig/eth0.xml
      leases:   ipv4 dhcp granted
      leases:   ipv6 dhcp granted
      addr:     ipv4 10.33.254.4/24 [dhcp]
      addr:     ipv6 2a05:d014:dad:dc05::9cc/64 [dhcp]
      route:    ipv4 default via 10.33.254.1 [dhcp]
      route:    ipv6 default via fe80::850:32ff:fe76:84 metric 1024 proto ra

What actually happened:

It happens on multiple EKS clusters (nodes are provisioned dynamically by Karpenter) that nodes don't get an IPv6 address on eth0 even though the ENI has the address configured (visible in the AWS console). Tools that use IPv6 to talk to the Kubernetes API (in our case Cilium) run into timeouts, which breaks the whole network stack.

What we could find so far:

The IPv6 lease request is stuck in "requesting"; on a working node it says "granted".

bash-5.1# wicked show eth0
eth0            up
      link:     #2, state up, mtu 9001
      type:     ethernet, hwaddr 02:ca:73:3a:e3:d0
      config:   wicked:xml:/etc/wicked/ifconfig/eth0.xml
      leases:   ipv4 dhcp granted
      leases:   ipv6 dhcp requesting
      addr:     ipv4 10.32.36.132/24 [dhcp]
      route:    ipv4 default via 10.32.36.1 [dhcp]
      route:    ipv6 default via fe80::b5:76ff:fec1:2b40 metric 1024 proto ra

Trying to get an IP manually leads to a timeout; it gives up after 10 seconds:

bash-5.1# wicked test dhcp6 eth0
wicked: eth0: Request to acquire DHCPv6 lease with UUID b53e6e64-540d-0d00-007b-020001000000 in mode auto
bash-5.1#

More verbose:

bash-5.1# /usr/libexec/wicked/bin/wickedd-dhcp6 --test --test-mode auto --log-level debug --debug most eth0
::: Reading config file /etc/wicked/common.xml
::: eth0: Found usable link-local IPv6 address: fe80::ca:73ff:fe3a:e3d0
Notice: eth0: Request to acquire DHCPv6 lease with UUID f31a6e64-d8de-0100-4864-010001000000 in mode auto
::: eth0: setting fsm timeout to 10.000 sec
::: eth0: timeout in state INIT (failure)
::: ni_dhcp6_tester_protocol_event(ev=3, dev=eth0[2], config-uuid=f31a6e64-d8de-0100-4864-010001000000)
::: no sockets left to watch
::: eth0: Deleting dhcp6 device with index 2

Interestingly, restarting wicked altogether helps to get an IP:

bash-5.1# systemctl restart wicked
bash-5.1# wicked show eth0
eth0            up
      link:     #2, state up, mtu 9001
      type:     ethernet, hwaddr 02:ca:73:3a:e3:d0
      config:   wicked:xml:/etc/wicked/ifconfig/eth0.xml
      leases:   ipv4 dhcp granted
      leases:   ipv6 dhcp granted
      addr:     ipv4 10.32.36.132/24 [dhcp]
      addr:     ipv6 2a05:d014:396:cd03::d8c3/64 [dhcp]
      route:    ipv4 default via 10.32.36.1 proto dhcp
      route:    ipv6 default via fe80::b5:76ff:fec1:2b40 metric 1024 proto ra

How to reproduce the problem:
We couldn't find a pattern for why this happens. It happens more often in clusters where we scale dynamically and replace nodes many times per day.

We were advised in aws/amazon-vpc-cni-k8s#2391 to open this issue here.

@zmrow
Contributor

zmrow commented May 24, 2023

Thanks for the issue @sdomme ! This is an interesting failure case. Appreciate the debug work you've done.

Do you have an estimate of how often it happens in your clusters? Something like "x failures in y instance launches".

On an instance that fails to get a DHCP6 lease, does wicked, wickedd, or wickedd-nanny have any errors in logs? journalctl -u wicked

I'd also be curious to see the output of the following (debug mode set to all)
wicked --log-level debug --debug all test dhcp6 eth0

Not sure wickedd-dhcp6 will provide different output, but it's worth a shot:
/usr/libexec/wicked/bin/wickedd-dhcp6 --test --test-mode auto --log-level debug --debug all eth0

@sdomme
Author

sdomme commented May 24, 2023

Hi @zmrow. Unfortunately (#2254) there is not much in the logs except for "service is starting..." etc. At the moment there is no broken node that I have access to. As soon as I have the next one, I can provide the logs and the output of the commands you mentioned.

As for the estimate, it's hard to say; it depends entirely on the cluster. Currently there is one cluster with 16 Karpenter-provisioned nodes, the oldest of which is 32h old, and one node is in a broken state. My estimate would be roughly one affected node per 12-16 nodes every 24h.

@zmrow added the status/research (This issue is being researched) label and removed the status/needs-triage (Pending triage or re-evaluation) label on May 24, 2023
@zmrow
Contributor

zmrow commented May 24, 2023

Thanks @sdomme. I've launched a bunch of instances and seem to have reproduced the same behavior you're seeing.

I'm digging into it and am looking for a fix. I'll provide additional details here when I have them.

@zmrow
Contributor

zmrow commented May 26, 2023

Update: I turned on some additional debug logging in wicked and built an image and AMI.

I had an initial theory about the timing of the accept_ra sysctl being set, but that's not really panning out. I added this sysctl to the defaults in my custom image and was still able to get an instance that failed to get an IPv6 address.

I'm digging through the additional logs of this instance and will follow up tomorrow as I'm launching more as we speak.

@zmrow
Contributor

zmrow commented May 26, 2023

Update: Continuing the search!

I found that in all cases where the instance failed to get an IPv6 address, wicked logged a message:

May 25 23:40:15 localhost wickedd-dhcp6[978]: eth0: setting initial transmit delay of 0 .. 0.000s

In code, if the transmit delay is set to 0, the function cancels the timer and doesn't continue. A DHCPv6 Solicit message is never sent and everything grinds to a halt.
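
To make that concrete, here is a minimal standalone C sketch of the failure pattern; it is not the actual wicked source, and the type and function names are invented for illustration. The point is that a guard which treats a delay of 0 as "cancel" rather than "send immediately" leaves the state machine stuck in INIT.

/*
 * Hypothetical sketch only -- not the wicked implementation. If the
 * randomized initial transmit delay comes out as 0, a guard like this
 * cancels the timer and returns early, so the first DHCPv6 Solicit is
 * never scheduled and the FSM later times out in INIT.
 */
#include <stdio.h>

struct solicit_timer { int armed; };

static void arm_solicit_timer(struct solicit_timer *t, double delay_sec)
{
    if (delay_sec <= 0.0) {   /* bug: 0 should mean "send now", not "give up" */
        t->armed = 0;
        return;
    }
    t->armed = 1;
    printf("Solicit scheduled in %.3fs\n", delay_sec);
}

int main(void)
{
    struct solicit_timer t = { 0 };
    /* mirrors the "setting initial transmit delay of 0 .. 0.000s" log line */
    arm_solicit_timer(&t, 0.0);
    if (!t.armed)
        printf("no Solicit sent; state machine stalls in INIT\n");
    return 0;
}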

@markusboehme wrote a few exploratory patches today to adjust the solicitation delay. In my testing of these patches, I was still able to produce 1 node in 2000 launches that failed to get a DHCPv6 address. The interface never seems to transition to "managed" mode, which we see in successful instances:

May 24 21:13:35 localhost wickedd-dhcp6[960]: eth0: updated dhcp6 mode from auto to managed
May 24 21:13:35 localhost wickedd-dhcp6[960]: eth0: fsm start in mode managed

We'll continue to dig into this.

@zmrow
Contributor

zmrow commented Jun 1, 2023

@sdomme good news! I think we found the culprit(s) and have a good bead on a fix. I have a PR out now for one half of the fix, and @markusboehme is working on a PR for some patches to wicked which make up the other half of the fix.

We went pretty far down the rabbit hole on this one.

As previously mentioned, we came across an issue where wicked's state machine would grind to a halt if the solicitation delay was set to 0. @markusboehme came up with a patch to fix that, but we were still able to reproduce the failure fairly reliably, with a couple of nodes per 2000 launches. With some additional digging into the wicked code, he discovered another patch in upstream wicked that polls for updates to the IPv6-related flags rather than waiting for a NEWLINK event from the kernel. This also helped the situation, but we could still reproduce the issue.
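
For illustration, the polling approach looks roughly like the sketch below. This is an assumption about the general shape of that upstream change, not the patch itself; read_ra_flags() is a hypothetical stand-in for however the flags are actually obtained (netlink, in practice).

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical helper: reports whether a router advertisement has been seen */
static bool read_ra_flags(const char *ifname)
{
    (void)ifname;
    return false;   /* stubbed out for this sketch */
}

/* re-check the flags on an interval instead of waiting for one NEWLINK event */
static bool wait_for_ra(const char *ifname, int timeout_ms, int interval_ms)
{
    for (int waited = 0; waited < timeout_ms; waited += interval_ms) {
        if (read_ra_flags(ifname))
            return true;    /* flags updated: safe to start the DHCPv6 FSM */
        usleep((useconds_t)interval_ms * 1000);
    }
    return false;           /* no RA observed before the deadline */
}

int main(void)
{
    printf("RA seen on eth0: %s\n", wait_for_ra("eth0", 1000, 100) ? "yes" : "no");
    return 0;
}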

After digging a bit deeper in wicked/kernel code we discovered my initial suspicion about the accept_ra sysctl timing was also a piece of the puzzle. The kernel sets a flag indicating a router advertisement has been received (IF_RA_RCVD), but only after it completes duplicate address detection and decides whether to send a router solicitation. If the sysctl isn't set by the time duplicate address detection completes, the solicitation doesn't happen and the IF_RA_RCVD flag doesn't get set. wicked uses this flag to decide whether or not to kick off the state machine that handles DHCP6. We are now allowing wicked to manage the "accept RA" setting by adding it to the config of the primary interface.
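
As a rough illustration of that ordering constraint (this is an assumption about the mechanics, not the actual fix, which instead lets wicked own the setting through the primary interface's config), the sysctl simply has to be in place before duplicate address detection finishes. A minimal C program writing it directly:

#include <stdio.h>

/* write a value to a sysctl file under /proc; returns 0 on success */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    int rc = (fputs(value, f) == EOF) ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    /* "2" accepts router advertisements even with forwarding enabled,
     * which matters on Kubernetes nodes that forward pod traffic */
    return write_sysctl("/proc/sys/net/ipv6/conf/eth0/accept_ra", "2") ? 1 : 0;
}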

Using my PR and the aforementioned patches to wicked I've launched 6000 instances into an IPv6 EKS cluster successfully. That makes us feel pretty confident we've nailed this down.

The wicked patches will be in PR soon.

@sdomme
Author

sdomme commented Jun 5, 2023

Hello @zmrow. That is indeed very good news. Thank you and @markusboehme for this awesome debugging.

@ajaykumarmandapati

Yet again, thanks for your in-depth debugging and for resolving this quite fast 🏆. Could you also let us know when this will be released? @zmrow and @markusboehme

@zmrow
Contributor

zmrow commented Jun 7, 2023

We don't have an exact release date yet, but want to include this in the next release. We'll follow up here when we have more details.

@stmcginnis
Contributor

Fixes merged to develop and 1.14.x branches.
