IPAM: Adds AWS ENI IPv6 Prefix Delegation Support#32352
IPAM: Adds AWS ENI IPv6 Prefix Delegation Support#32352danehans wants to merge 2 commits intocilium:mainfrom
Conversation
9c8e75c to
984e02a
Compare
|
@danehans The main commit here is quite long, would it be possible to break this down into a couple smaller commits to make the changes easier to review/understand? |
| func (n *Node) GetMaximumAllocatableIP(family ipam.Family) int { | ||
| if family == ipam.IPv6 { | ||
| // not implemented | ||
| return 0 | ||
| } |
There was a problem hiding this comment.
It doesn't seem like Azure distinguishes between ip4/ip6 for this limit, so we can probably just leave this as is and make a note for the follow metrics work.
There was a problem hiding this comment.
If the family parameter is removed, the Azure Node type will not not implement the IPAM NodeOperations interface.
@tommyp1ckles thanks for reviewing the PR. I originally broke the PR into ~4 commits but the |
asauber
left a comment
There was a problem hiding this comment.
Let me know if you have any suggestions on the best way to break the PR up into multiple commits that will pass this job.
As an example of something that could be broken down into its own commit, you have some new logic in ipam.go which determine if a new masquerade option should be set. I'm not entirely sure how self contained this concept is, but it looks like it could be broken down into its own commit, which adds that new logic and produces a version of Cilium that can build and pass tests with the new functionality.
A PR that adds over 3200 lines of new code I would expect to contain at least 6 - 12 commits with detailed descriptions, or the PR will be quite difficult to review.
|
Some example PRs from recent large-size change sets: https://github.com/cilium/cilium/pull/32336/commits In these case I believe the reviewers also held some conversations to explain the changes to the reviewers. |
|
@asauber thanks for the review and guidance. Let me look into refactoring the PR into multiple commits. |
|
To verify the e2e functionality of this PR, perform the following steps: Create an EC2 cluster: Note: Patch VPC CNI: Before installing Cilium, a few dependencies must be met. First, create an IAM policy doc required for the operator to function: Create an IAM policy using the above policy document: Attach the IAM policy to the node instance role created by eksctl. First, extract the node role from the node group: Attach the IAM policy to the node group: Next, the Cilium Get the IPv6 CIDR for the VPC and save it as an environment variable: Build and push images: agent build/tag/push: aws-operator build/tag/push: Due to the following Cilium daemon error, set the VPC IPv6 CIDR range for each node: Annotate each node with Install Cilium using Helm: If you built your own agent and operator images, replace the Wait for Cilium to be ready: I used my CiliumCon IPv6 demo to verify functionality. Note that the kernel version for instance |
|
/test |
|
@tommyp1ckles Once this has been approved and merged in, can you advise how frequently you release new minor versions? We're trying to gauge how far off this feature is from being GA. TIA |
| log.Warn("IPSec encrypted overlay is enabled but IPSec is not. Ignoring option.") | ||
| } | ||
|
|
||
| // AWS does not support dualstack: https://docs.aws.amazon.com/eks/latest/userguide/cni-ipv6.html |
There was a problem hiding this comment.
nit: should this "AWS does not support" or "EKS does not support"?
There was a problem hiding this comment.
(I definitely don't understand the whole configuration space here, so feel free to tell me I'm wrong).
| // GetMinimumAllocatableIPv4 returns the minimum amount of IPv4 addresses that | ||
| // must be allocated to the instance. | ||
| func (n *Node) GetMinimumAllocatableIPv4() int { | ||
| // GetMinimumAllocatableIPv4 returns the minimum amount of addresses that |
There was a problem hiding this comment.
nit: docblock is wrong here.
| // in the allocation pool. It also returns the number of IPs required and | ||
| // available. | ||
| func (n *nodeStore) hasMinimumIPsInPool(localNodeStore *node.LocalNodeStore) (minimumReached bool, required, numAvailable int) { | ||
| func (n *nodeStore) autoDetectIPv6NativeRoutingCIDR(localNodeStore *node.LocalNodeStore) bool { |
There was a problem hiding this comment.
This seems copy-pasty from autoDetectIPv4NativeRoutingCIDR, can those be combined?
gandro
left a comment
There was a problem hiding this comment.
Thanks a lot for tackling this! I know this has been a heavily requested feature.
We (@bimmlerd and I) started reviewing the code (noticing some nits), but stopped half way, as we think overall there is a discussion to be had if this is the right approach to implement IPv6 support on top of PD.
I'd rather have us re-think how the prefix delegation control plane works, see comment below for more.
|
|
||
| // IPv6 is not supported for the Azure IPAM plugin. | ||
| if option.Config.IPAM == ipamOption.IPAMAzure && option.Config.IPv6Enabled() { | ||
| log.Fatalf("%s ENI IPAM plugin does not support IPv6", ipamOption.IPAMAzure) |
There was a problem hiding this comment.
| log.Fatalf("%s ENI IPAM plugin does not support IPv6", ipamOption.IPAMAzure) | |
| log.Fatalf("%s IPAM mode does not support IPv6", ipamOption.IPAMAzure) |
There was a problem hiding this comment.
Yes, although Alibaba Cloud does call their network interfaces "ENI"s, Azure does not call them "ENI"s, so best to leave that out here.
|
|
||
| // IPv6 is not supported for the Alibaba IPAM plugin. | ||
| if option.Config.IPAM == ipamOption.IPAMAlibabaCloud && option.Config.IPv6Enabled() { | ||
| log.Fatalf("%s ENI IPAM plugin does not support IPv6", ipamOption.IPAMAlibabaCloud) |
There was a problem hiding this comment.
| log.Fatalf("%s ENI IPAM plugin does not support IPv6", ipamOption.IPAMAlibabaCloud) | |
| log.Fatalf("%s ENI IPAM mode does not support IPv6", ipamOption.IPAMAlibabaCloud) |
| // by the caller. | ||
| // This will also cause disruption of networking until all endpoints | ||
| // have been regenerated. | ||
| primaryFam := ipam.IPv4 |
There was a problem hiding this comment.
This looks off: If we are restoring a router IP from before a cilium-agent restart, allocateDatapathIPs is called with fromFS being a valid IPv6. Thus the family of the allocated IP would be IPv6. We ought to check the family type of fromK8s/fromFS here
| if d.healthEndpointRouting, err = parseRoutingInfo(result); err != nil { | ||
| log.WithError(err).Warn("Unable to allocate health information for ENI") | ||
| } | ||
| } |
There was a problem hiding this comment.
Given the amount of duplication here, have you considered refactoring out the body of allocateHealthIPs into a family-agnostic function similar to allocateDatapathIPs?
| if family == ipam.IPv6 { | ||
| numPrefixes := ip.PrefixCeil(a.IPv6.AvailableForAllocation, option.ENIPDBlockSizeIPv6) | ||
| if err := n.manager.api.AssignENIIPv6Prefixes(ctx, a.InterfaceID, int32(numPrefixes)); err != nil { | ||
| return err |
There was a problem hiding this comment.
In contrast to the isSubnetAtPrefixCapacity logic below (which returns early if there is no error), this will not return early if there is no error. This means that if AssignENIIPv6Prefixes succeeded, we'll still call n.manager.api.AssignPrivateIpAddresses at the end of the function, which I think we don't want.
| // ENIPDBlockSizeIPv6 is the number of IPs available on an ENI IPv6 prefix. AWS assigns a /80 prefixto an IPv6 ENI | ||
| // which contains 2^(128-80) IP addresses, so the number of IPv6 addresses is scaled down to this value. | ||
| // See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics for more details | ||
| const ENIPDBlockSizeIPv6 = 64 |
There was a problem hiding this comment.
The downside of limiting each PD to only 64 IPs means that we're having to allocate multiple PDs per node (assuming ~100-300 pods per node), even though a single PD contains more than enough IPs we'll ever need.
While I understand that building on top of the existing PD infrastructure is the easiest way to add IPv6 support, I fear that we're painting ourselves in a corner here. I'd rather have us re-think how the PD control plane is supposed to work. I think it would simplify things a lot if it worked more like cluster-pool or multi-pool IPAM, where the operator assigns a full CIDR to a node, and then node itself then carves out IPs from that CIDR. This also reduces the amount of CiliumNode updates, as we no longer have to update the resource for every IP allocation/release or enumerate every single pod IP in the CiliumNode CRD.
I understand that this is a larger refactor than the approach presented in this PR, but doing these refactors before we add a feature like IPv6 gives us the chance to fix such shortcomings in the controlplane, as any changes after the feature has been merged needs to be backwards compatible.
There was a problem hiding this comment.
Agreed. Offering more of the IPv6 space to each ENI (at by least an order of magnitude) allows for a new range of use cases. There are applications of which I'm aware that use an IPv6 address per connection.
There was a problem hiding this comment.
AWS Node allocates /80 prefix from the private subnet exclusively for pods. Should the cilium IPAM/ENI not just do the same?
asauber
left a comment
There was a problem hiding this comment.
This is clearly a feature that we need at some point. This is a great first pass at the implementation. There are ways that it could be modified to more fully take advantage of the IPv6 address space, so let's discuss that.
Reviewed for CLI and Config changes, and did one pass looking for logic problems.
I will leave a more detailed IPAM review to others.
|
|
||
| // AWS does not support dualstack: https://docs.aws.amazon.com/eks/latest/userguide/cni-ipv6.html | ||
| if option.Config.IPAM == ipamOption.IPAMENI && option.Config.IsDualStack() { | ||
| log.Fatalf("Both %s and %s options cannot be set since AWS does not support dualstack", |
There was a problem hiding this comment.
If we decide to change the comment above, this error message should also say "EKS" rather than "AWS".
|
|
||
| // IPv6 is not supported for the Azure IPAM plugin. | ||
| if option.Config.IPAM == ipamOption.IPAMAzure && option.Config.IPv6Enabled() { | ||
| log.Fatalf("%s ENI IPAM plugin does not support IPv6", ipamOption.IPAMAzure) |
There was a problem hiding this comment.
Yes, although Alibaba Cloud does call their network interfaces "ENI"s, Azure does not call them "ENI"s, so best to leave that out here.
| if option.Config.EnableIPv6Masquerade && | ||
| option.Config.IPAM == ipamOption.IPAMENI && | ||
| result != nil && | ||
| len(result.CIDRs) > 0 { | ||
| result.CIDRs = coalesceCIDRs(result.CIDRs) | ||
| } |
There was a problem hiding this comment.
Is this case not already handled by the block on line 284?
| InvalidParameterValueStr = "InvalidParameterValue" | ||
|
|
||
| // subnetAvailableIPv6Addrs is the available number of addresses for an EC2 IPv6 subnet. | ||
| subnetAvailableIPv6Addrs = 10000 |
There was a problem hiding this comment.
Where does the number 10000 come from? The /80 (per ENI) mentioned in your PR description would support billions of addresses per subnet.
Since the maximum number of addresses in an IPv4 subnet is 2^24 (10.0.0.0/8), I would expect IPv6 to support at least that many addresses as a maximum, and preferably more.
Per your tests, six m5.large instances could exceed 10,000 addresses.
| if iface.MacAddress != nil { | ||
| eni.MAC = aws.ToString(iface.MacAddress) | ||
| eni.MAC = *iface.MacAddress | ||
| } | ||
|
|
||
| if iface.NetworkInterfaceId != nil { | ||
| eni.ID = aws.ToString(iface.NetworkInterfaceId) | ||
| eni.ID = *iface.NetworkInterfaceId | ||
| } | ||
|
|
||
| if iface.Description != nil { | ||
| eni.Description = aws.ToString(iface.Description) | ||
| eni.Description = *iface.Description | ||
| } |
There was a problem hiding this comment.
Is there a reason here to remove the use of aws.ToString? It coalesces nil values to empty strings.
| if iface.PrivateIpAddress != nil { | ||
| eni.IP = *iface.PrivateIpAddress | ||
| for _, ip := range iface.PrivateIpAddresses { | ||
| if !usePrimary && ip.Primary != nil && aws.ToBool(ip.Primary) { |
There was a problem hiding this comment.
nit: aws.ToBool does the nil check
| return 0, unableToCreateENI, fmt.Errorf("%s: %w", errUnableToCreateENI, err) | ||
| scopedLog.WithField(logfields.Node, n.k8sObj.Name).Warning("Subnet might be out of IPv4 prefixes, Cilium will not allocate prefixes on this node anymore") | ||
| eniID, eni, err = n.manager.api.CreateNetworkInterface(ctx, int32(toAllocate), subnet.ID, desc, securityGroupIDs, false, family) | ||
| if err != nil { |
There was a problem hiding this comment.
Moving this error handling inside the isSubnetAtPrefixCapacity(err) block here removes the unableToCreateENI error handling for other cases where err != nil on line 598.
It should remain outside this block.
| // GetMinimumAllocatableIPv4 returns the minimum amount of IPv4 addresses that | ||
| // must be allocated to the instance. | ||
| func (n *Node) GetMinimumAllocatableIPv4() int { | ||
| // GetMinimumAllocatableIPv4 returns the minimum amount of IPa that must be allocated |
| // ENIPDBlockSizeIPv6 is the number of IPs available on an ENI IPv6 prefix. AWS assigns a /80 prefixto an IPv6 ENI | ||
| // which contains 2^(128-80) IP addresses, so the number of IPv6 addresses is scaled down to this value. | ||
| // See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics for more details | ||
| const ENIPDBlockSizeIPv6 = 64 |
There was a problem hiding this comment.
Agreed. Offering more of the IPv6 space to each ENI (at by least an order of magnitude) allows for a new range of use cases. There are applications of which I'm aware that use an IPv6 address per connection.
| // IPv6MaxAboveWatermark is the maximum number of IPv6 addresses to allocate | ||
| // beyond the addresses needed to reach the IPv6PreAllocate watermark. | ||
| // Going above the watermark can help reduce the number of API calls to | ||
| // allocate IPs, e.g. when a new ENI is allocated, as many secondary | ||
| // IPv6 adresses as possible are allocated. Limiting the amount can help reduce | ||
| // waste of IPs. | ||
| // | ||
| // +kubebuilder:validation:Minimum=0 | ||
| IPv6MaxAboveWatermark int `json:"ipv6-max-above-watermark,omitempty"` |
There was a problem hiding this comment.
This concept shouldn't be necessary with the IPv6 address space.
|
This pull request has been automatically marked as stale because it |
|
Heads up that #31145 will need to be also added back into this effort if someone revives this PR. |
|
@joestringer I talked to @danehans and he is not going to be able to move forward with this PR. Do you think a maintainer could take over his work ? |
|
This pull request has been automatically marked as stale because it |
|
This pull request has not seen any activity since it was marked stale. |
|
Hello! Is the work ever going to merged in the future? |
@joestringer is there anything you or one of cilium's maintainer could do ? |
|
Cilium relies on the contributions from volunteers in the community to implement functionality that they find important. I would encourage contributors to chat with one another, collaborate on what they see as the solutions, and make proposals to move this forward step by step. Personally I don't have the bandwidth to tackle this at the moment. |
|
Hi, I'm going to be working on this in a few months, I could do some PR on @danehans repo |
|
What work does the @danehans need to be acceptable? The above review looks to be mostly minor requests? |
|
I believe #32352 (comment) is the biggest concern which I think will need some serious engineering work to get resolved. |
Previously, ENI prefix delegation support was limited to IPv4.
This PR updates existing structs, methods, etc. to support IPv6
prefix delegation. The overall implementation follows the existing
approach taken for IPv4 prefix delegation.
Although AWS assigns a /80 prefix to an ENI, 64 addresses are the
maximum number of allocatable addresses. This restriction can be user
configurable in the future if needed.
Fixes: #19251
Fixes: #18405
Includes commit 3b20d9c from #31145. When #31145 is merged, I will rebase to remove this commit so ignore this commit from your review.