Replace KubeProxy Design Draft (Linux Only) #1931

Closed
hongliangl opened this issue Mar 2, 2021 · 10 comments
Labels: kind/design, lifecycle/stale


hongliangl commented Mar 2, 2021

Replace KubeProxy Design Draft (Linux Only)

original google doc

Why We Should Do This

Since AntreaProxy will soon provide NodePort Service support, we will then be able to remove KubeProxy entirely. This will save CPU cycles and memory, and give us more control over Service traffic.

Items to be Resolved

We still need the following abilities to do the replacement.

Direct APIServer Access

In the future, KubeProxy may not exist when Antrea is being installed, so Services will not work during the installation. Meanwhile, Antrea needs to connect to the Kubernetes Service to watch resources. To overcome this, both AntreaAgent and AntreaController must be able to connect to the APIServer Pod/Proxy directly.

[figure]

Node ClusterIP Service Support

The ability to serve ClusterIP Service on the Node. AntreaProxy does not have this ability yet.

[figures: current vs. expected]

LoadBalancer Service Support

Service Health Check Support

Bootstrap Order Re-arrange

Since AntreaProxy will take full charge of serving Services, Services will not be available until AntreaProxy's first sync. Thus, we need to make sure all sub-components that rely on Services wait until AntreaProxy is ready.

Detailed Design

Direct APIServer Access

Kubernetes Client in Both AntreaAgent and AntreaController

In the current implementation, we use the in-cluster kubeconfig to set up the Kubernetes client connection. In the replacement solution, we need to change the server address from the Service address to the IP address of the Kubernetes APIServer. PR #1735 supports this in the Agent; we need a similar PR for the Controller as well.

Antrea Client for Watching Internal NetworkPolicies

The Antrea client in AntreaAgent uses the Antrea Service to watch internal NetworkPolicies. We make the Antrea client initialize only after the first sync of AntreaProxy.

APIServer in Both AntreaAgent and AntreaController

The APIServers in the Antrea components need to connect to the Kubernetes APIServer to retrieve the Secret. The Secret is retrieved only once, and not again after the APIServer is up. Since the retrieval is hidden inside the Kubernetes library, we need to reimplement the function in Antrea and override the address of the Kubernetes Service.

Startup Pressure

Once we override the Kubernetes Service address, we lose its load balancing. This may put heavy pressure on the specified Kubernetes APIServer, since the watchers will keep using it to watch resources. To overcome this, we can make AntreaProxy able to notify when it has synced. Once the Service is ready, we can switch back to the Service to take advantage of its load balancing. Meanwhile, we should have a flag to control whether to do the switch, since users may use a custom out-of-cluster load balancer to serve the Kubernetes APIServer. Another way to reduce the pressure is to accept multiple override Endpoints; each Antrea component can then randomly pick one Endpoint from the list to connect to.

Host ClusterIP Access Support

We use iptables and IP rules to implement this, and one of the design goals is to keep runtime updates on the host as few as possible. Like the NodePort implementation, we use an ipset to match the ClusterIP address, protocol, and port.

Name: ANTREA-SVC-CLUSTER-SERVICE
Type: hash:ip,port
Revision: 5
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 600
References: 2
Number of entries: 8
Members:
10.96.0.10,udp:53
10.107.33.12,tcp:8080
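
A minimal sketch of how such a set could be created and populated (the set name matches the listing above; the two entries are the example Services shown):

# create a set keyed on (IP, protocol:port) and add one entry per ClusterIP Service port
ipset create ANTREA-SVC-CLUSTER-SERVICE hash:ip,port
ipset add ANTREA-SVC-CLUSTER-SERVICE 10.96.0.10,udp:53
ipset add ANTREA-SVC-CLUSTER-SERVICE 10.107.33.12,tcp:8080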

In the mangle table of iptables, we create a custom chain that marks packets matching the ipset with mark 0xf2. The chain is referenced from both the OUTPUT and PREROUTING chains.

Chain ANTREA-SVC-CLUSTER (2 references)
target     prot opt source               destination
MARK       all  --  0.0.0.0/0            0.0.0.0/0            match-set ANTREA-SVC-CLUSTER-SERVICE dst,dst MARK set 0xf2
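
A hedged sketch of the commands behind this listing (chain name and mark value follow the design above; the exact flags are illustrative):

# custom chain that marks ClusterIP Service traffic with 0xf2
iptables -t mangle -N ANTREA-SVC-CLUSTER
iptables -t mangle -A ANTREA-SVC-CLUSTER -m set --match-set ANTREA-SVC-CLUSTER-SERVICE dst,dst -j MARK --set-mark 0xf2
# reference the chain from both PREROUTING and OUTPUT
iptables -t mangle -A PREROUTING -j ANTREA-SVC-CLUSTER
iptables -t mangle -A OUTPUT -j ANTREA-SVC-CLUSTER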

In the nat table, we match packets with the 0xf2 mark and masquerade them.

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
ANTREA-SVC-MASQ  all  --  0.0.0.0/0            0.0.0.0/0
 
Chain ANTREA-SVC-MASQ (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            mark match 0xf2
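
A hedged sketch of the corresponding nat rules (chain name and mark follow the listing above):

iptables -t nat -N ANTREA-SVC-MASQ
iptables -t nat -A ANTREA-SVC-MASQ -m mark --mark 0xf2 -j MASQUERADE
iptables -t nat -A POSTROUTING -j ANTREA-SVC-MASQ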

To route packets carrying the 0xf2 mark, we add an IP rule that sends them to the antrea route table.

from all fwmark 0xf2 lookup antrea
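
A sketch of the command that would install this rule (assuming the antrea table name has already been registered, e.g. in /etc/iproute2/rt_tables):

ip rule add fwmark 0xf2 table antrea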

In the antrea route table, there is only one onlink default route, which forwards all packets to antrea-gw0.

default via 169.254.169.110 dev antrea-gw0 onlink
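
And a minimal sketch of the command that would add this route:

ip route add default via 169.254.169.110 dev antrea-gw0 onlink table antrea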

We also need to add a flow to the OVS pipeline to respond to ARP queries for the virtual IP address.

table=20, priority=200,arp,arp_tpa=169.254.169.110,arp_op=1 \
actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[], \
    set_field:aa:bb:cc:dd:ee:ff->eth_src, \
    load:0x2->NXM_OF_ARP_OP[], \
    move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[], \
    set_field:aa:bb:cc:dd:ee:ff->arp_sha, \
    move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[], \
    set_field:169.254.169.110->arp_spa, \
    IN_PORT 

Bootstrap Order Re-arrange

The sub-components of AntreaController rely only on the Kubernetes APIServer, so their bootstrap order does not matter once we use a direct connection instead of the Service. We do need to take care of the bootstrap order in AntreaAgent, because its sub-components rely on the antrea Service to set up the Antrea client. The dependency map of the AntreaAgent sub-components looks like the diagram below.

[figure]

According to the dependency map, the bootstrap order could be:

[figure]
Installation

We can still follow the single-YAML way to install Antrea, but users need to specify a Kubernetes APIServer Pod address in the config. Since Antrea inserts its iptables rules instead of appending them, KubeProxy's rules will no longer be hit once the Antrea components start (a rough illustration follows).
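
A rough illustration of the insert-versus-append point (the ANTREA-PREROUTING chain name is used purely for illustration here, not necessarily Antrea's actual chain):

# -I 1 inserts the Antrea jump rule at the top of the chain, above any existing rules
# (including KubeProxy's); -A would append it at the bottom instead
iptables -t nat -I PREROUTING 1 -j ANTREA-PREROUTING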

Upgrade

Since AntreaProxy installs its iptables rules before KubeProxy's, users can simply delete the KubeProxy deployment after the installation. There may be short connection breaks, since connections established through KubeProxy's rules will be interrupted.

Antctl as an installation wizard

We can also add install and upgrade commands to Antctl to simplify installation and upgrade. Instead of modifying a YAML file, specifying only the APIServer Endpoint through Antctl should be easier. Antctl can also take responsibility for cleanup work, e.g. removing KubeProxy rules, removing legacy rules, or updating Antrea's API endpoints.

KubeProxy Options Compatibility

To be considered for future support
  • --azure-container-registry-config string
    • Path to the file containing Azure container registry configuration information.
  • --hostname-override string
    • If non-empty, will use this string as identification instead of the actual hostname.
  • --cluster-cidr string
    • The CIDR range of pods in the cluster. When configured, traffic sent to a Service cluster IP from outside this range will be masqueraded and traffic sent from pods to an external LoadBalancer IP will be directed to the respective cluster IP instead
Supported through alternative options
  • --healthz-bind-address ip:port Default: 0.0.0.0:10256
    • The IP address with port for the health check server to serve on (set to '0.0.0.0:10256' for all IPv4 interfaces and '[::]:10256' for all IPv6 interfaces). Set empty to disable.
  • --masquerade-all
    • If using the pure iptables proxy, SNAT all traffic sent via Service cluster IPs (this not commonly needed)
Implemented
  • --master string
    • The address of the Kubernetes API server (overrides any value in kubeconfig)
  • --nodeport-addresses stringSlice
    • A string slice of values which specify the addresses to use for NodePorts. Values may be valid IP blocks (e.g. 1.2.3.0/24, 1.2.3.4/32). The default empty string slice ([]) means to use all local addresses.
  • --profiling
    • If true enables profiling via web interface on /debug/pprof handler.

Open Items

  • How to redirect Service traffic to antrea-gw0 in other ways? Not
  • What will the source IP address be when accessing the Service IP:Port from localhost? If the source IP address is antrea-gw0's, the MASQUERADE action for Service traffic in the POSTROUTING chain is not needed.
  • Whether Antrea Controller should connect to Kubernetes APIServer directly.
hongliangl added the kind/design label Mar 2, 2021
antoninbas commented:

@hongliangl could you clarify what happens for Host ClusterIP access when the endpoints for the Service are not in the Pod Network (they can be host IPs, e.g. for the antrea Service, or even external IPs)? How do we guarantee that the return traffic goes through AntreaProxy in this case (for reverse translation). I think @ruicao93 brought this up at the community meeting yesterday, as he had a similar issue on Windows (#1824).


tnqn commented Mar 5, 2021

Some updates on the open questions from the community meetings:

  1. Is the MASQUERADE action necessary? Yes, @hongliangl verified that the source address is selected based on the main route table before packets are processed by netfilter, where we match the ipset and set the mark to apply policy routing.
  2. Can antrea-controller rely on AntreaProxy's service proxying instead of adding a configuration to connect to kube-apiserver directly? Yes, @weiqiangt verified that antrea-controller can run well after antrea-agent sets up the necessary flows. Currently there is no explicit retry in antrea-controller to wait for Services to become available. The ConfigMap.Get call in authentication.ApplyTo in createAPIServerConfig has a 30-second timeout; I can see TCP retransmissions during it:
13:56:22.436193 IP 10.133.1.226.59076 > 10.96.0.1.443: Flags [S], seq 3973458840, win 64240, options [mss 1460,sackOK,TS val 4093858639 ecr 0,nop,wscale 7], length 0
13:56:23.446227 IP 10.133.1.226.59076 > 10.96.0.1.443: Flags [S], seq 3973458840, win 64240, options [mss 1460,sackOK,TS val 4093859649 ecr 0,nop,wscale 7], length 0
13:56:25.462214 IP 10.133.1.226.59076 > 10.96.0.1.443: Flags [S], seq 3973458840, win 64240, options [mss 1460,sackOK,TS val 4093861665 ecr 0,nop,wscale 7], length 0
13:56:29.718224 IP 10.133.1.226.59076 > 10.96.0.1.443: Flags [S], seq 3973458840, win 64240, options [mss 1460,sackOK,TS val 4093865921 ecr 0,nop,wscale 7], length 0
13:56:37.910231 IP 10.133.1.226.59076 > 10.96.0.1.443: Flags [S], seq 3973458840, win 64240, options [mss 1460,sackOK,TS val 4093874113 ecr 0,nop,wscale 7], length 0

And if the flows are not ready within 15 seconds (even though the timeout is 30 seconds, the backoff time goes beyond the window), the call will return an error and the program will exit and restart:

F0305 14:00:12.654096       1 main.go:59] Error running controller: error creating API server config: unable to load configmap based request-header-client-ca-file: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 10.96.0.1:443: i/o timeout
goroutine 1 [running]:
k8s.io/klog.stacks(0xc0003ca900, 0xc000db4000, 0x131, 0x1e9)
        /go/pkg/mod/k8s.io/[email protected]/klog.go:875 +0xb9
k8s.io/klog.(*loggingT).output(0x2b923a0, 0xc000000003, 0xc0005c7b20, 0x2ad03af, 0x7, 0x3b, 0x0)
        /go/pkg/mod/k8s.io/[email protected]/klog.go:826 +0x35f
k8s.io/klog.(*loggingT).printf(0x2b923a0, 0x3, 0x1d24f2b, 0x1c, 0xc000961d58, 0x1, 0x1)
        /go/pkg/mod/k8s.io/[email protected]/klog.go:707 +0x153
k8s.io/klog.Fatalf(...)
        /go/pkg/mod/k8s.io/[email protected]/klog.go:1276
main.newControllerCommand.func1(0xc0003ae280, 0xc000153a00, 0x0, 0x8)
        /antrea/cmd/antrea-controller/main.go:59 +0x1e7
github.com/spf13/cobra.(*Command).execute(0xc0003ae280, 0xc00004e0a0, 0x8, 0x8, 0xc0003ae280, 0xc00004e0a0)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc0003ae280, 0x0, 0x0, 0x0)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
        /go/pkg/mod/github.com/spf13/[email protected]/command.go:864
main.main()
        /antrea/cmd/antrea-controller/main.go:38 +0x52
  3. For the question @ruicao93 and @antoninbas asked above, I think not only external endpoints but also noEncap mode will have the issue that the reply packets are handled by the host directly and reset if we do nothing. I can imagine that with the current approach we will need to add a mark to the connection and use policy routing for the reply packets to be delivered to OVS. Maybe we should take this scenario into consideration when comparing the current approach with TC.

hongliangl (Author) commented:

> @hongliangl could you clarify what happens for Host ClusterIP access when the endpoints for the Service are not in the Pod Network (they can be host IPs, e.g. for the antrea Service, or even external IPs)? How do we guarantee that the return traffic goes through AntreaProxy in this case (for reverse translation). I think @ruicao93 brought this up at the community meeting yesterday, as he had a similar issue on Windows (#1824).

The implementation for host IPs is a little complicated, and I'll explain it later in detail.


jianjuns commented Mar 5, 2021

Again, I feel we should come up with a single approach to redirect matched host traffic to OVS. Later we will need that for Node security policies too (but that case can be different, in that most Node traffic should be directed to OVS).


hongliangl commented Mar 8, 2021

For an endpoint in a remote host network, e.g.:

  • SRC: 192.168.77.100:12345 DST: 10.96.0.1:8080. In host network, packets will be marked 0x21 in mangle table PREROUTING chain.
  • SRC: 10.0.1.1:12345 DST: 10.96.0.1:8080. In host network, packets will be masqueraded matching mark 0x21 in POSTROUTING chain on host network after routing to antrea-gw0. 10.0.1.1 is the IP address of antrea-gw0.
  • SRC: 10.0.1.1:12345 DST: 192.168.77.101:8080. In OVS, packets' destination will be modified with DNAT.
  • SRC: 169.254.169.111:12345 DST: 192.168.77.101:8080. In OVS, the packets' source is modified with a set-field action. If we don't modify the source, the return traffic will eventually be routed to antrea-gw0 by the host, not through the OVS pipeline.
  • SRC: 192.168.77.100:23456 DST: 192.168.77.101:8080. In the host network, packets are masqueraded matching mark 0x21 in the POSTROUTING chain. If we don't do this, the return traffic cannot come back to the current host.

The traffic is finally sent to the target endpoint. Now we handle the return traffic.

  • SRC: 192.168.77.101:8080 DST: 192.168.77.100:23456. In host network.
  • SRC: 192.168.77.101:8080 DST: 169.254.169.111:12345. In host network, we add an onlink route 'ip route add 169.254.169.111 via 169.254.169.111 dev antrea-gw0 onlink' on host network.
  • SRC: 192.168.77.101:8080 DST: 10.0.1.1:12345. In OVS, the packets' destination is modified with a set-field action.
  • SRC: 10.96.0.1:8080 DST: 10.0.1.1:12345. In OVS, packets' source will be recovered because of DNAT.
  • SRC: 10.96.0.1:8080 DST: 192.168.77.100:12345. In the host network, the packets' destination is recovered because of the masquerade.

cc @antoninbas
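
A hedged sketch of the host-side rules this walkthrough assumes (the 0x21 mark, the virtual IP 169.254.169.111, and the onlink route come from the steps above; the ipset match reuses the set from the earlier design purely for illustration):

# mark Service traffic in the host network (mangle PREROUTING), as in the first step
iptables -t mangle -A PREROUTING -m set --match-set ANTREA-SVC-CLUSTER-SERVICE dst,dst -j MARK --set-mark 0x21
# masquerade marked traffic (nat POSTROUTING), as in the second and fifth steps
iptables -t nat -A POSTROUTING -m mark --mark 0x21 -j MASQUERADE
# deliver replies destined to the virtual source address back into OVS via antrea-gw0
ip route add 169.254.169.111 via 169.254.169.111 dev antrea-gw0 onlink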


hongliangl commented Apr 18, 2021

The key to implementing NodePort is how to "redirect" NodePort traffic into OVS. The NodePort traffic can come from localhost or from remote hosts.

Traffic From Remote Hosts

Environment

This is the test environment. backend can be seen as an integration of OVS and Pods, and node is a VM. Note that this is different from the real Antrea environment, so some changes will be needed when integrating this design with Antrea.

[figure]

Design

[figure]

The destination of the NodePort traffic may be different IP addresses if the Kubernetes Node has multiple network adaptors.

Here we take eth0 as an example, assuming that the source IP address (src for short) arriving at eth0 is 192.168.2.135, the source port (sport) is 56789, and the source MAC address (smac) is 11:22:33:44:55:66.

In general,

  • The incoming NodePort traffic is redirected from the ingress of ethx (any network adaptor that can be accessed from remote hosts) to the egress of gw0 (when integrating with Antrea, this would be antrea-gw0). Note that a TC filter is used to select NodePort traffic, so that not all traffic is redirected.
  • The reply NodePort traffic is redirected from the ingress of gw0 to the egress of ethx. A TC filter is also used here, as above.

This is a detailed example:

  • TC filters the traffic based on the destination IP address (dst for short) and the destination port (dport) on eth0's ingress.
    • The dst is the IP address of the network adaptor. Here is 192.168.2.211.
    • The dport is the port of NodePort. Here assuming that the port is 30000.
    • Traffic status:
      • smac: 11:22:33:44:55:66, src: 192.168.2.135, sport: 56789
      • dmac: 00:00:92:68:02:11, dst: 192.168.2.211, dport: 30000
  • TC redirects the traffic filtered on eth0's ingress to gw0's egress. Note that:
    • The smac/dmac are not changed by the redirect, so they should be modified before redirecting to gw0's egress. In this test environment, smac/dmac should be changed to 00:00:10:10:10:01/00:00:10:10:10:02; otherwise the traffic will be dropped at gw1, as the dmac is not gw1's MAC address.
    • When integrating this design with Antrea, the OVS pipeline can also take over the work of modifying smac/dmac instead of TC.
    • Traffic status:
      • smac: 00:00:10:10:10:01, src: 192.168.2.135, sport: 56789
      • dmac: 00:00:10:10:10:02, dst: 192.168.2.211, dport: 30000
  • The traffic leaves gw0 and reaches gw1. We don't need to care about the follow-up processing here. When integrating this design with Antrea, the follow-up processing would be handled by the OVS pipeline.
  • After a while, the reply traffic reaches gw1, then it leaves gw1 and reaches gw0's ingress.
    • Traffic status:
      • smac: 00:00:10:10:10:02, src: 192.168.2.211, sport: 30000
      • dmac: 00:00:10:10:10:01, dst: 192.168.2.135, dport: 56789
  • The reply traffic reaches gw0's ingress. TC filters traffic here based on src and sport and redirects the filtered traffic to eth0's egress.
    • The src in the TC filter is the IP address of the network adaptor that originally received the traffic. Here it is 192.168.2.211.
    • The sport is the port of NodePort.
    • There may be multiple filters on the ingress of gw0. Traffic is redirected to a different network adaptor's egress according to its src.
    • The smac/dmac of the reply traffic should be modified before redirecting the filtered traffic to the target network adaptor's egress.
    • Traffic status:
      • smac: 00:00:92:68:02:11, src: 192.168.2.211, sport: 30000
      • dmac: 11:22:33:44:55:66, dst: 192.168.2.135, dport: 56789
  • The traffic leaves eth0.
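
A bare-bones sketch of the TC plumbing described above, before any hash tables are introduced (device names and addresses follow the test environment; MAC rewriting is omitted here and left to pedit or to the OVS pipeline, as noted above):

# ingress qdiscs so that filters can be attached
tc qdisc add dev eth0 handle ffff: ingress
tc qdisc add dev gw0 handle ffff: ingress

# request path: NodePort traffic arriving on eth0 is redirected to gw0's egress
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 u32 \
    match ip dst 192.168.2.211/32 match ip dport 30000 0xffff \
    action mirred egress redirect dev gw0

# reply path: matching traffic arriving on gw0 is redirected to eth0's egress
tc filter add dev gw0 parent ffff:0 protocol ip prio 99 u32 \
    match ip src 192.168.2.211/32 match ip sport 30000 0xffff \
    action mirred egress redirect dev eth0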

Hash Table

From the detailed example above, we can see that the key to redirecting NodePort traffic is the TC filter. By default, there is a hash table with only one bucket on each network adaptor's ingress and egress. Filters can be appended to a bucket as a list. We can also add additional hash tables, but with the restriction that a hash table can have at most 256 buckets. The hashkey must be set on the filter that links to a hash table.

For example,

  • The first command creates a hash table 1 with 256 buckets.
  • The second command creates a filter matching dst 192.168.2.211
    • The TC u32 filter doesn't have any syntactic sugar for the hashkey. hashkey mask 0x000000ff at 20 match ip dst 192.168.2.211 means that traffic whose dst is 192.168.2.211 will be distributed across the buckets of hash table 1, using the logical AND of the 32-bit word at packet byte offset 20 and 0x000000ff as the bucket index.
    • More precisely, byte offset 20 of the packet (assuming a 20-byte IP header) is where the TCP/UDP sport and dport start.
    • Why not use mask 0x0000ffff? That would use the full 16 bits of dport as the hashkey, but it gives the same result as using mask 0x000000ff (8 bits), since there are only 256 buckets and 8 bits are enough.
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 handle 1: u32 divisor 256
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 u32 link 1: hashkey mask 0x000000ff at 20 match ip dst 192.168.2.211

I have two designs for filtering NodePort traffic.

Hash Design 1: Hash Table + List

From the picture we can see that:

  • On the root hash table, I add a filter (Filter 0). This filter only matches traffic whose dst is 192.168.2.211.

  • Matched traffic will be processed by hash table 1. According to the hashkey (the logical AND of the word at byte offset 20 and mask 0x000000ff), the traffic is distributed to different buckets of hash table 1. For example, traffic whose dport is 256/512/1024 is sent to bucket 0, and traffic whose dport is 258 is sent to bucket 2. Here we can see that the hashkey is simply the bucket index. The time complexity is O(1) here.

  • A bucket may have a list of filters. We can append at most 4096 filters to a bucket. In this design, a bucket has at most 256 filters, since the 16-bit port space is spread over 256 buckets.

  • For example, if we want to redirect traffic whose dport is 256/512/1024 to gw0, we need to add filters to bucket 0 of hash table 1 to match related traffic and redirect it to gw0.

  • According to the TC documentation, it is not recommended to add filters to buckets manually. The sample keyword of the u32 filter syntax can be used instead to calculate the target bucket index.

    tc filter add dev eth0 parent ffff:0 protocol ip prio 99 u32 ht 1: sample ip dport 256 0x00ff \
     match ip dport 256 0xffff action mirred egress redirect dev gw0
[figure]
Hash Design 2: Nested Hash Table

From the picture we can see that:

  • In this design, I create another 256 sub hash tables, from 0x100 to 0x1FF. Every sub hash table also has 256 buckets. Due to space limitations, only hash tables 101 and 103 are drawn in the picture.

  • Attach a filter to every bucket in hash table 1. Every filter links to its corresponding sub hash table. For example, the filter in bucket 0 links to hash table 100; the filter in bucket 1 links to hash table 101, etc. Here the hashkey of these filters is the first 8 bits of dport.

  • For example, when filtering NodePort traffic whose dport is 257, the filter should be attached to bucket 1 of hash table 101. The reason is that:

    • The last 8 bits of 257 are 1 in decimal, so the traffic is distributed to bucket 1 of hash table 1.
    • The first 8 bits of 257 are 1 in decimal, so the traffic is distributed to bucket 1 of hash table 101, based on the filter in bucket 1 of hash table 1.
    • The traffic is redirected based on the filter in bucket 1 of hash table 101.
  • The time complexity is O(1).

[figure]
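
A hedged sketch of how the nested tables could be wired up with u32 (handles and hashkey masks follow the description above; the exact hashkey folding behavior should be verified against the tc-u32 documentation before relying on this):

# second-level hash table 101 with 256 buckets
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 handle 101: u32 divisor 256
# filter placed in bucket 1 of table 1: link to table 101, hashing on the first 8 bits of dport
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 u32 ht 1:1: \
    link 101: hashkey mask 0x0000ff00 at 20 match u32 0 0 at 0
# redirect filter for dport 257, placed in bucket 1 of table 101
tc filter add dev eth0 parent ffff:0 protocol ip prio 99 u32 ht 101:1: \
    match ip dport 257 0xffff action mirred egress redirect dev gw0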

Traffic From Localhost

Environment

[figure]

Design

[figure]

The design for handling NodePort traffic from localhost is different from the remote-host case. Here I take two examples:

  • The NodePort traffic to 127.0.0.1:30000.
  • The NodePort traffic to 192.168.2.211:30000.
For Traffic to 127.0.0.1:30000

In general,

  • The NodePort traffic is redirected from the egress of lo to the egress of gw0 (when integrating with Antrea, this would be antrea-gw0). Note that a TC filter is used to select NodePort traffic, so that not all traffic is redirected.

  • The reply NodePort traffic is redirected from the ingress of gw0 to the ingress of lo. A TC filter is also used here, as above.

  • Note that TC stateless SNAT/DNAT must be applied before redirecting, as traffic whose src/dst is 127.0.0.1 will not be routed. When integrating with Antrea, this may also be done by OVS.

This is a detailed example:

  • TC filters the traffic based on dst and dport on lo's egress.
    • dst is 127.0.0.1, and dport is 30000
    • Traffic status:
      • smac: 00:00:00:00:00:00, src: 127.0.0.1, sport: 56789
      • dmac: 00:00:00:00:00:00, dst: 127.0.0.1, dport: 30000
  • TC redirects the traffic filtered on lo's egress to gw0's egress. Note that
    • Modify the src/dst (src 127.0.0.1 -> 169.254.1.1, dst 127.0.0.1 -> 169.254.1.2) with TC stateless SNAT/DNAT.
    • Modify smac/dmac from all-zero to gw0’s/gw1's.
    • Traffic status:
      • smac: 00:00:10:10:10:01, src: 169.254.1.1, sport: 56789
      • dmac: 00:00:10:10:10:02, dst: 169.254.1.2, dport: 30000
  • The traffic leaves gw0 and reaches gw1. We don't need to care about the follow-up processing here. When integrating this design with Antrea, the follow-up processing would be handled by the OVS pipeline.
  • After a while, the reply traffic reaches gw1, then it leaves gw1 and reaches gw0's ingress.
  • Traffic status:
    • smac: 00:00:10:10:10:02, src: 169.254.1.2, sport: 30000
    • dmac: 00:00:10:10:10:01, dst: 169.254.1.1, dport: 56789
  • The reply traffic reaches gw0's ingress. TC filters traffic here based on src and sport and redirects the filtered traffic to lo's ingress.
    • Restore the src/dst (src 169.254.1.2 -> 127.0.0.1, dst 169.254.1.1 -> 127.0.0.1) with TC stateless SNAT/DNAT.
    • Restore the smac/dmac to all-zero.
    • Traffic status:
      • smac: 00:00:00:00:00:00, src: 127.0.0.1, sport: 30000
      • dmac: 00:00:00:00:00:00, dst: 127.0.0.1, dport: 56789
  • The traffic reaches lo and is passed to localhost.

For Traffic to 192.168.2.211:30000

  • TC filters the traffic based on dst and dport on lo's egress.

    • dst is 192.168.2.211, and dport is 30000
    • Traffic status:
      • smac: 00:00:00:00:00:00, src: 192.168.2.211, sport: 56789
      • dmac: 00:00:00:00:00:00, dst: 192.168.2.211, dport: 30000
  • TC redirects the traffic filtered on lo's egress to gw0's egress.

    • Modify smac/dmac from all-zero to gw0’s/gw1's.
    • Traffic status:
      • smac: 00:00:10:10:10:01, src: 192.168.2.211, sport: 56789
      • dmac: 00:00:10:10:10:02, dst: 192.168.2.211, dport: 30000
  • The traffic leaves gw0 and reaches gw1. We don't need to care about the follow-up processing here. When integrating this design with Antrea, the follow-up processing would be handled by the OVS pipeline.

  • After a while, the reply traffic reaches gw1, then it leaves gw1 and reaches gw0's ingress.

    • Traffic status:
      • smac: 00:00:10:10:10:02, src: 192.168.2.211, sport: 30000
      • dmac: 00:00:10:10:10:01, dst: 192.168.2.211, dport: 56789
  • The reply traffic reaches gw0's ingress. TC filters traffic based on src, dst, and sport, and redirects the filtered traffic to lo's ingress. Note that dst is the key condition: if TC doesn't filter on dst, it can't tell where the traffic should be redirected back to (lo or ethx).

    • The priority of this filter should be higher than the priority of the filter for remote NodePort traffic, to avoid redirecting NodePort traffic that originated from loopback out to eth0. If the dst is a local IP address (e.g., 192.168.2.211), the traffic should be redirected to lo's ingress.

    • Restore the smac/dmac to all-zero.

      • Traffic status:
        • smac: 00:00:00:00:00:00, src: 192.168.2.211, sport: 30000
        • dmac: 00:00:00:00:00:00, dst: 192.168.2.211, dport: 56789
  • The traffic reaches lo and is passed to localhost.
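
A hedged sketch of the lo-side plumbing for this second case (a clsact qdisc is assumed for the lo egress hook; MAC values follow the test environment, and the 127.0.0.1 case would additionally need TC's stateless nat action before the redirect):

# add egress/ingress hooks on lo
tc qdisc add dev lo clsact

# request path: NodePort traffic leaving lo gets gw0's/gw1's MACs and is redirected to gw0's egress
tc filter add dev lo egress protocol ip prio 98 u32 \
    match ip dst 192.168.2.211/32 match ip dport 30000 0xffff \
    action pedit ex munge eth src set 00:00:10:10:10:01 munge eth dst set 00:00:10:10:10:02 pipe \
    action mirred egress redirect dev gw0

# reply path: matching traffic arriving on gw0 has its MACs zeroed again and is redirected to lo's ingress
tc filter add dev gw0 parent ffff:0 protocol ip prio 98 u32 \
    match ip src 192.168.2.211/32 match ip dst 192.168.2.211/32 match ip sport 30000 0xffff \
    action pedit ex munge eth src set 00:00:00:00:00:00 munge eth dst set 00:00:00:00:00:00 pipe \
    action mirred ingress redirect dev lo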

Evaluation

Environment

The netperf server and client are two VMs.

  • Kernel version: Linux 4.15.0-140-generic
  • OS version: Ubuntu 18.04.3 LTS
  • CPU * 3:
    • model name : Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
    • cpu MHz : 2194.843
    • cache size : 19712 KB
  • Network: Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

All values are the average of 140 out of 160 results (the highest 10 and lowest 10 results are excluded).

TCP-STREAM

| Packet Size | iptables + ipset | tc_hash | tc_nested_hash |
| --- | --- | --- | --- |
| 64 | 656.217 | 674.558 | 653.977 |
| 256 | 2304.56 | 2325.16 | 2336.37 |
| 512 | 3541.46 | 3547.27 | 3533.27 |
| 1536 | 3754 | 3730.48 | 3667.02 |
| 2560 | 3713.58 | 3726.77 | 3693.61 |
| 16384 | 3704.44 | 3761.89 | 3675.48 |

TCP-RR

| Packet Size | iptables + ipset | tc_hash | tc_nested_hash |
| --- | --- | --- | --- |
| 1 | 2539.29 | 2569.74 | 2540.65 |
| 16 | 2635.26 | 2735.46 | 2692.66 |
| 256 | 2741.16 | 2842.26 | 2806.15 |
| 1024 | 2696.46 | 2765.82 | 2724.23 |
| 2048 | 1690.4 | 1765.74 | 1713.09 |

TCP-CRR

| Packet Size | iptables + ipset | tc_hash | tc_nested_hash |
| --- | --- | --- | --- |
| 1 | 2455.18 | 2578.86 | 2679.26 |
| 16 | 2625.26 | 2837.48 | 2900.07 |
| 256 | 2753.82 | 2986.17 | 2931.69 |
| 1024 | 2630.45 | 2801.36 | 2789.5 |
| 2048 | 1632.37 | 1803.37 | 1828.81 |

hongliangl (Author) commented:

Hello @antoninbas @jianjuns @tnqn, I have posted the new design and the test results.

jianjuns commented:

Hi @hongliangl, I do not understand the following:

> The traffic is sent by gw0 and arrives gw1. Here I add DNAT on the current network namespace. All traffic will be redirect to network namespace endpoint via DNAT.
>
> Traffic status:
> SRC IP: 192.168.2.135; MAC: 00:00:20:20:20:02; PORT: 56789
> DST IP: 20.20.20.1; MAC: 00:00:20:20:20:01; PORT: 30000
>
> The replied traffic arrives gw3 and unDNAT and is sent back via gw1
>
> Traffic status:
> SRC IP: 192.168.2.211; MAC: 00:00:10:10:10:02; PORT: 30000
> DST IP: 192.168.2.135; MAC: 00:00:10:10:10:01; PORT: 56789

What is gw1 and what is gw3?
"add DNAT on the current network namespace": how you do DNAT? OVS or iptables or TC?

antoninbas removed their assignment Sep 28, 2021
lzhecheng commented:

@hongliangl should this issue be closed?


github-actions bot commented Feb 9, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions bot added the lifecycle/stale label Feb 9, 2022