
Egress Policy v1alpha1 implementation #1924

Closed · 5 tasks done
tnqn opened this issue Mar 1, 2021 · 10 comments

tnqn (Member) commented Mar 1, 2021

Describe what you are trying to solve
This proposal summarizes the first alpha version of the Egress feature. Please see #667 for the complete proposal.

In v1alpha1, we require users to manually configure SNAT IPs on the Nodes. In an Egress, a particular SNAT IP can be specified for the selected Pods, and antrea-controller will publish each Egress's selected Pods to the Nodes on which those Pods run.

There will be some limitations in the first version: encap mode is the only supported traffic mode, and some features and scenarios, e.g. HA, dual-stack and Windows, are not supported.

Describe how your solution impacts user flows

  1. The user configures secondary IPs, which can be used as SNAT IPs, on the Nodes' network interfaces.
  2. The user configures an EgressPolicy (a CRD API), which selects specific Pods and the IP their traffic should be translated to when accessing external addresses.

Describe the main design/architecture of your solution

API change

A user-facing API will be introduced. The object schema will look like the following:

type EgressPolicy struct {
	metav1.TypeMeta `json:",inline"`
	// Standard metadata of the object.
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Specification of the desired behavior of EgressPolicy.
	Spec EgressPolicySpec `json:"spec"`
}

// EgressPolicySpec defines the desired state for EgressPolicy.
type EgressPolicySpec struct {
	// AppliedTo selects Pods to which the policy will be applied.
	AppliedTo AppliedTo
	// EgressIP specifies the SNAT IP address for the selected Pods.
	EgressIP string
}

// AppliedTo defines the workloads to which a policy is applied.
type AppliedTo struct {
	// Select Pods matched by this selector. If set with NamespaceSelector,
	// Pods are matched from Namespaces matched by the NamespaceSelector;
	// otherwise, Pods are matched from all Namespaces.
	// +optional
	PodSelector *metav1.LabelSelector `json:"podSelector,omitempty"`
	// Select all Pods from Namespaces matched by this selector, as
	// workloads in To/From fields. If set with PodSelector,
	// Pods are matched from Namespaces matched by the NamespaceSelector.
	// +optional
	NamespaceSelector *metav1.LabelSelector `json:"namespaceSelector,omitempty"`
}
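
For illustration, a minimal sketch of populating an EgressPolicy with the schema above; the name, labels and IP below are invented, and the snippet assumes the types above and the metav1 package are imported:

// Sketch only: SNAT traffic from Pods labeled app=web in Namespaces labeled
// env=prod to the Egress IP 10.10.0.100.
examplePolicy := EgressPolicy{
	ObjectMeta: metav1.ObjectMeta{Name: "web-egress"},
	Spec: EgressPolicySpec{
		AppliedTo: AppliedTo{
			PodSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "web"},
			},
			NamespaceSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"env": "prod"},
			},
		},
		EgressIP: "10.10.0.100",
	},
}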

An Egress's Pod selection is calculated by antrea-controller and transmitted to antrea-agent via a controlplane API, EgressGroup. This is mainly to avoid redundant Pod watching and group calculation when resolving "AppliedTo".
An Egress's corresponding EgressGroup will use the same name so the agent can identify it, just like the Service and Endpoints resources.

type EgressGroup struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	// GroupMembers is a list of resources selected by this group.
	GroupMembers []GroupMember
}

type EgressGroupPatch struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	AddedGroupMembers   []GroupMember
	RemovedGroupMembers []GroupMember
}
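
GroupMember is not reproduced in this proposal. As an assumption based on the existing controlplane types, what the agent needs from each member is essentially a Pod reference plus the Pod's IPs; the actual type carries more fields:

// Assumed shape only; the real GroupMember lives in the controlplane API group.
type GroupMember struct {
	// Pod references the selected Pod.
	Pod *PodReference
	// IPs are the member Pod's IP addresses (the discussion below notes that
	// the Egress group needs Pod IP information).
	IPs []IPAddress
}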

Control Plane

antrea-controller

antrea-controller watches the Egress resources from the Kubernetes API and creates the corresponding EgressGroup resources. The EgressGroup API in the controlplane API group will provide list, get, and watch interfaces for agents to consume.
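
A hypothetical sketch of this sync logic follows; the grouping and storage helpers are assumptions for illustration, not antrea-controller's actual code.

// Sketch only: resolve the Pods selected by AppliedTo and publish an
// EgressGroup with the same name as the Egress through the controlplane API.
type groupingInterface interface {
	GetGroupMembers(appliedTo *AppliedTo) []GroupMember
}

type egressGroupStore interface {
	Update(group *EgressGroup) error
}

type egressController struct {
	grouping groupingInterface // shared module resolving AppliedTo selectors
	store    egressGroupStore  // backs the controlplane list/get/watch API
}

func (c *egressController) syncEgress(egress *EgressPolicy) error {
	members := c.grouping.GetGroupMembers(&egress.Spec.AppliedTo)
	group := &EgressGroup{
		ObjectMeta:   metav1.ObjectMeta{Name: egress.Name},
		GroupMembers: members,
	}
	return c.store.Update(group)
}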

antrea-agent

antrea-agent watches the above EgressGroup API and the Egress API, then:

  1. For each Egress, it checks whether the EgressIP is configured on the Node it runs on. If yes, it allocates a locally-unique ID for this IP (its usage is described in the "Data Plane" section below) and configures the corresponding OpenFlow rules and iptables rules to enforce SNAT for the matching traffic. Otherwise it does nothing.
  2. For each Pod in an EgressGroup, it checks whether the associated EgressIP is local or not. If local, it configures OpenFlow rules to forward the traffic coming from the Pod to the gateway interface with a specific mark set. If remote, it configures OpenFlow rules to forward the traffic to the tunnel interface with a specific tunnel destination set.
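
A minimal sketch of the per-Egress decision on the agent side; the type, fields and helpers are hypothetical, not antrea-agent's actual code.

// Sketch only: install SNAT rules for an Egress whose IP is local to this Node.
type egressReconciler struct {
	localEgressIPs map[string]bool // SNAT IPs configured on this Node's interfaces
}

func (r *egressReconciler) reconcileEgress(egress *EgressPolicy) error {
	if !r.localEgressIPs[egress.Spec.EgressIP] {
		// Not the Egress Node for this IP: traffic from the selected local Pods
		// is simply tunneled to the Node that owns the IP (see Data Plane below).
		return nil
	}
	// Egress Node: allocate a locally-unique ID for the SNAT IP, then install
	// the OVS flows and the iptables SNAT rule keyed on that ID.
	id, err := r.allocateSNATID(egress.Spec.EgressIP)
	if err != nil {
		return err
	}
	return r.installSNATRules(egress.Spec.EgressIP, id)
}

// Assumed helpers, stubbed out for the sketch.
func (r *egressReconciler) allocateSNATID(ip string) (uint32, error) { return 0, nil }
func (r *egressReconciler) installSNATRules(ip string, id uint32) error { return nil }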

Data Plane

(Copied from #667 (comment))
On the Node, antrea-agent will realize the SNATPolicy with OVS flows and iptables rules. If the SNAT IP is not present on the local Node, the packets to be SNAT'd will be tunneled to the SNAT Node, using the SNAT IP as the tunnel destination IP. On the SNAT Node, the tunnel destination IP will be directly used as the SNAT IP.
On the SNAT Node, an iptables rule will be added to perform the SNAT with the specified SNAT IP, but which SNAT IP to use for a given packet is controlled by the OVS flows. The OVS flows will mark a packet that needs to be SNAT'd with the integer ID corresponding to its SNAT IP, and the matching iptables SNAT rule matches on that packet mark.

The OVS flow changes include:

table 31
// SNAT flows for Windows
- priority=210 ip,-new+trk,snatCTMARK,from_uplink macRewriteMark,goto:40 (SNAT return traffic)
+ priority=210 ip,-new+trk,snatCTMARK,from_uplink,nw_dst=localSubnet macRewriteMark,goto:40 (SNAT return traffic - remote packets will be handled by L3Fwd flows, so no need to set the macRewrite MAC)

table 70
// Reuse these Windows SNAT flows to skip packets that don't need SNAT
+priority=200 ip,from_local,nw_dst=localSubnet goto:80
+priority=200 ip,from_local,nw_dst=nodeIP goto:80
+priority=200 ip,from_local,nw_dst=gatewayCTMark goto:80

// Send packets for external network to the SNAT table
+priority=190 ip,from_local goto:71
+priority=190 ip,macRewriteMark mod_dl_dst:gw0_mac,goto:71 (traffic tunneled from remote Nodes)

+table 71 (snatTable. ttlDecTable is moved to table 72)
// Windows flows: load SNAT IP to a register (probably share the endpointIPReg and endpointIPv6XXReg)
priority=200 ip,+new+trk,in_port=local_pods snatRequiredMark(snat_ip),goto:80 (SNAT for local Pods, matching in_ports)
priority=200 ip,+new+trk,tun_dst=snat_ip snatRequiredMark(tun_dst),goto:80 (SNAT for remote Pods, matching tun_dst)
priority=190 ip,+new+trk snatRequiredMark(node_ip),goto:80 (default SNAT IP)

// Linux: mark the packet with an integer ID allocated for each SNAT IP
priority=200 ip,+new+trk,in_port=local_pods mark(snat_id),goto:80 (SNAT for local Pods)
priority=200 ip,+new+trk,tun_dst=snat_ip mark(snat_id),goto:80 (SNAT for remote Pods)

// common: tunnel packets that need SNAT on a remote Node, with the SNAT IP as the outer destination
priority=200 ip,in_port=local_pods mod_dl_src:gw0_mac,mod_dl_dst:vMAC,snat_ip->NXM_NX_TUN_IPV4_DST,goto:72
priority=0 goto_table:80

+table 72 (ttlDecTable)

table 105
// Windows: perform SNAT with the SNAT IP saved in the register
+priority=200 ip,+new+trk,snatRequiredMark ct(commit,table=110,zone=65520,nat(src=snat_ip),snatCTMark)

iptables rules:
iptables -t nat -A POSTROUTING -m mark --mark snat_id -j SNAT --to-source snat_ip

Work breakdown

Alternative solutions that you considered
NONE

Test plan
Add E2E tests to verify that specific Pods' traffic is translated to the specified IP when accessing an HTTP server deployed "outside" the cluster (it could be a host-network Pod running on a Node that is different from the Egress Node).

Additional context
Any other relevant information.

@tnqn tnqn added the kind/design Categorizes issue or PR as related to design. label Mar 1, 2021
@tnqn tnqn added this to the Antrea v0.14.0 release milestone Mar 1, 2021
tnqn (Member, Author) commented Mar 1, 2021

@jianjuns I created this issue to track the design changes and progress of the first version of the feature, and copied here some of the content we discussed in #667. Feel free to update it directly if you have any ideas on details or names.

jianjuns (Contributor) commented Mar 2, 2021

Thanks for the details of version 1.

A question for controlplane API: could we reuse AppliedToGroup instead of adding a new EgressGroup?

jianjuns (Contributor) commented Mar 2, 2021

And for AppliedTo, why not have ClusterGroup and Service references there? Is it for simplification of the 1st version?

vicky-liu commented:
cc @ceclinux to take a look at work breakdown.

tnqn (Member, Author) commented Mar 2, 2021

@jianjuns

A question for controlplane API: could we reuse AppliedToGroup instead of adding a new EgressGroup?

I thought about this but didn't find real benefits in doing so, so I switched to another approach that could reduce code redundancy and grouping calculation across all kinds of groups, including clustergroups, appliedtogroups, addressgroups and egressgroups.
Some cons of reusing AppliedToGroup I thought of:

  1. When AppliedToGroups for NetworkPolicy and EgressPolicy are mixed in a single API, the AppliedToGroups of one policy type will unnecessarily wake up the other policy type's event handlers.
  2. The AppliedToGroup for EgressPolicy may use a different strategy for dispatching to agents. Egress Nodes may need to know all members of a given group, instead of only the ones running on them (when we want to support noEncap mode, or the case where the SNAT IP is not reachable from non-Egress Nodes).
  3. Currently the AppliedToGroup is coupled with the NetworkPolicyController on both the controller and agent side; extracting it for another vertical to reuse is more complex than extracting the grouping logic into a separate module and having different group APIs consume it. In the latter way, the API is business-aware while the grouping process is generic.
    My PoC of the latter approach is close to finished, and I have verified that it can greatly improve the performance of the NetworkPolicy controller as well as reduce code redundancy. I may push the PR for review in 1 or 2 days.

And for AppliedTo, why not have ClusterGroup and Service references there? Is it for simplification of the 1st version?

I copied the struct from your PR. Supporting ClusterGroup and Service references should be OK; I don't expect them to add much effort. Feel free to add them to the design if you think they should be in the 1st version.

jianjuns (Contributor) commented Mar 2, 2021

But from an understanding/troubleshooting perspective, it is much better to use a single type, and map a single ClusterGroup to a single AddressGroup or AppliedToGroup.
If we think it is too much work to refactor the NetworkPolicyController, can we still reuse the same AppliedToGroup type, but create another set of AppliedToGroups in this release?

I think it is better to support ClusterGroup and Service references too. I can update my PR with my ideas.

tnqn (Member, Author) commented Mar 3, 2021

I think you mean having another API path but using the same struct, e.g. "/v1alpha1/egressgroups" would serve the new set of AppliedToGroups. However, clientset code is generated based on the name of the struct or the "resourceName" tag of the struct. I think it won't work if we use the same struct in the same API group, as the paths in the generated clientset would be exactly the same.

And what do you think about the first and second problems I mentioned above, especially the second? I think the group for EgressPolicy differs more from the AppliedToGroup for NetworkPolicy: it needs to include Pod IP information and be dispatched to all Egress Nodes, which makes it more like an AddressGroup for the Egress Node but an AppliedToGroup for non-Egress Nodes.
With these differences, do you think it's still worth considering the EgressGroup as an AppliedToGroup?

tnqn (Member, Author) commented Mar 17, 2021

@jianjuns Given that all agents need to watch all Egresses and there shouldn't be overlapping groups for Egress, I found there is not much value in having a controlplane Egress API, as we could just create an EgressGroup with the same name as the Egress resource (just like Service and Endpoints), then use the Egress's name to get its group on the agent side to save a lot of code (the controller in antrea-controller can focus on syncing EgressGroups, and antrea-agent can leverage the Egress Informer). Let me know if you have concerns about this. This is the code on the antrea-controller side: 178405b

jianjuns (Contributor) commented:
I am fine with watching Egresses directly from the K8s API for now. We can decide what to do later (when we have another solution to discover/assign SNAT IPs).

tnqn (Member, Author) commented Apr 7, 2021

All code changes have been merged, closing

@tnqn tnqn closed this as completed Apr 7, 2021