Support DSR mode for LoadBalancerIPs with AntreaProxy #5025

Closed
2 tasks done
tnqn opened this issue May 23, 2023 · 13 comments · Fixed by #5202
Labels
area/proxy: Issues or PRs related to proxy functions in Antrea
kind/design: Categorizes issue or PR as related to design.

Comments

@tnqn
Member

tnqn commented May 23, 2023

Describe what you are trying to solve

See #4956 for the original issue.

In DSR (Direct Server Return) mode, the load balancer routes packets to the backends, typically without changing the src/dst IPs. The backends process the requests and reply directly to the clients, without passing through the load balancer.

Pros:

  • lower latency
  • total output bandwidth is the sum of each backend bandwidth
  • preserved client IP

Use case: in-cluster LoadBalancers, e.g. MetalLB, Antrea ServiceExternalIP

Describe the solution you have in mind

The diagram shows how the traffic flows in DSR mode:
[diagram omitted]

The tricky part is how to persist the load-balancing result of the first packet of a connection on the ingress Node, considering the following caveats:

  • Connections will be marked as ”invalid” in conntrack on ingress Nodes
  • OVS doesn’t provide “ct_mark” and “ct_label” for invalid connections

Potential Solution

  • Leverage the learn action: however, @antoninbas pointed out that there is a latency before a learned flow is installed, which may cause a different selection result for subsequent packets. Need to check whether it could cause real problems.
  • Change the OVS behavior to get "ct_mark" and "ct_label" for invalid connections. Need to understand whether it's by design or a bug.
  • Use a deterministic algorithm to select the backend, to ensure the same backend will always be selected.

Example Flows with learn actions:

svc_ip=172.18.0.10
svc_port=80
pod_ip=0xaf40102
node_ip=0xac120003

# Select group: store the chosen endpoint IP (pod_ip) in REG3 and port 80 in REG4[0..15], then resubmit to ServiceLB.
ovs-ofctl add-group br-int "group_id=100,type=select,bucket=bucket_id:0,weight:100,actions=load:0xaf40102->NXM_NX_REG3[],load:0x50->NXM_NX_REG4[0..15],resubmit(,ServiceLB)"
# First packet of a Service connection to 172.18.0.10:80: select an endpoint via group 100.
ovs-ofctl add-flow br-int "cookie=0x1, table=ServiceLB,priority=205,tcp,reg4=0x10000/0x70000,nw_dst=172.18.0.10,tp_dst=80 actions=load:0x3->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[9],load:0x5->NXM_NX_REG7[],group:100" # removed reg4[21], ToExternalAddressRegMark
# After the group has selected an endpoint: learn a per-connection flow in SessionAffinity that replays the selection, then resubmit to EndpointDNAT.
ovs-ofctl add-flow br-int "cookie=0x1, table=ServiceLB, priority=195,tcp,reg4=0x30000/0x70000,nw_dst=172.18.0.10,tp_dst=80 actions=learn(table=SessionAffinity,idle_timeout=60,fin_idle_timeout=10,priority=200,delete_learned,cookie=0x1,eth_type=0x800,nw_proto=6,NXM_OF_TCP_SRC[],NXM_OF_TCP_DST[],NXM_OF_IP_DST[],NXM_OF_IP_SRC[],load:NXM_NX_REG3[]->NXM_NX_REG3[],load:NXM_NX_REG4[0..15]->NXM_NX_REG4[0..15],load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[9]),load:0x2->NXM_NX_REG4[16..18],resubmit(,EndpointDNAT)"
# Commit the connection to conntrack zone 65520 and set ct_mark bits, then continue to AntreaPolicyEgressRule.
ovs-ofctl add-flow br-int "cookie=0x1, table=EndpointDNAT, priority=220,tcp,reg3=0xaf40102,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,exec(load:0x1->NXM_NX_CT_MARK[4],move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))"
# Forward traffic for the selected remote endpoint over the tunnel to its Node (node_ip), rewriting the MAC addresses.
ovs-ofctl add-flow br-int "cookie=0x1, table=L3Forwarding, priority=210,ip,reg3=0xaf40102 actions=mod_dl_src:96:de:80:00:db:cf,mod_dl_dst:aa:bb:cc:dd:ee:ff,load:0xac120003->NXM_NX_TUN_IPV4_DST[],load:0x1->NXM_NX_REG0[4..7],resubmit(,L3DecTTL)"
# Do not drop "invalid" connections: resubmit them to PreRoutingClassifier (DSR connections are "invalid" in conntrack on the ingress Node).
ovs-ofctl add-flow br-int "cookie=0x1, table=ConntrackState, priority=220,ct_state=+inv+trk,ip actions=resubmit(,PreRoutingClassifier)"

Describe how your solution impacts user flows

Users can set the LoadBalancer mode to DSR in the antrea-agent config; LoadBalancerIPs of Services will then work in DSR mode.

Test plan

  • Performance evaluation: latency, throughput
  • Test with popular in-cluster LoadBalancers
tnqn added the kind/design label on May 23, 2023
tnqn added this to the Antrea v1.13 release milestone on May 23, 2023
tnqn added the area/proxy label on May 23, 2023
@tnqn
Member Author

tnqn commented May 29, 2023

@antoninbas I tried to reproduce the latency of the learned flow, but it seemed to work fine even when the first packet and the second packet had a very small interval:

13:34:30.967023 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [S], seq 3413610957, win 64240, options [mss 1460,sackOK,TS val 1136776094 ecr 0,nop,wscale 7], length 0
13:34:30.969892 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [.], ack 1131791762, win 502, options [nop,nop,TS val 1136776097 ecr 2554672012], length 0
13:34:30.969911 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [P.], seq 1131758189:1131758264, ack 1131791762, win 502, options [nop,nop,TS val 1136776097 ecr 2554672012], length 75: HTTP: GET / HTTP/1.1
13:34:30.971142 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [.], ack 1131792000, win 501, options [nop,nop,TS val 1136776099 ecr 2554672013], length 0
13:34:30.971352 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [.], ack 1131792615, win 501, options [nop,nop,TS val 1136776099 ecr 2554672014], length 0
13:34:30.971898 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [F.], seq 1131758264, ack 1131792615, win 501, options [nop,nop,TS val 1136776099 ecr 2554672014], length 0
13:34:30.972140 IP 172.18.0.1.36214 > 172.18.0.10.80: Flags [.], ack 1131792616, win 501, options [nop,nop,TS val 1136776100 ecr 2554672014], length 0

And the packet counter confirmed the subsequent packets always hit the learned flow:

cookie=0x1, duration=1.703s, table=SessionAffinity, n_packets=6, n_bytes=471, idle_timeout=10, priority=200,tcp,nw_src=172.18.0.1,nw_dst=172.18.0.10,tp_src=36214,tp_dst=80 actions=fin_timeout(idle_timeout=10),load:0xaf40004->NXM_NX_REG3[],load:0x50->NXM_NX_REG4[0..15],load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[9]
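
For reference, a dump like the one above can be obtained with something along these lines (bridge and table names as used elsewhere in this issue; matching on the table name works here because the flows already use named tables):

ovs-ofctl dump-flows br-int table=SessionAffinity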

10,000 connections succeeded 100% of the time with two different backends on different Nodes, even after I changed max-revalidator back to 500.

So I suspect the original issue has been fixed in OVS 2.17.6. Do you have a way to reproduce the original issue?

@antoninbas
Contributor

@tnqn My best guess is that the issue doesn't happen for packets belonging to the same connection?
My guess is that the microflow cache is updated for that specific connection by the datapath itself, so there is no issue for subsequent packets.
However, in the case of the SessionAffinity implementation, we are dealing with different connections (source port can be different). In that case, new connections do not hit the learned flow until the revalidator updates the datapath.

One way to confirm this would be to update your learned flow so that it doesn't match on the source port, then trigger several back-to-back connections.
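
For illustration, this is the learn flow from the example above with the NXM_OF_TCP_SRC[] match removed, so the learned flow no longer matches on the source port (a sketch for this experiment only, not a proposed flow):

ovs-ofctl add-flow br-int "cookie=0x1, table=ServiceLB, priority=195,tcp,reg4=0x30000/0x70000,nw_dst=172.18.0.10,tp_dst=80 actions=learn(table=SessionAffinity,idle_timeout=60,fin_idle_timeout=10,priority=200,delete_learned,cookie=0x1,eth_type=0x800,nw_proto=6,NXM_OF_TCP_DST[],NXM_OF_IP_DST[],NXM_OF_IP_SRC[],load:NXM_NX_REG3[]->NXM_NX_REG3[],load:NXM_NX_REG4[0..15]->NXM_NX_REG4[0..15],load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[9]),load:0x2->NXM_NX_REG4[16..18],resubmit(,EndpointDNAT)"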

@tnqn
Member Author

tnqn commented Jun 5, 2023

@antoninbas Yes, after I removed the source port from the learned flow, the 2nd or 3rd connection had a good chance of failing, which means the delay still exists. After checking the OVS doc about how OVS selects a bucket, I guess it's not due to the microflow cache but because the selection is based on the 5-tuple; dp_hash is the default selection method.

dp_hash
    Use a datapath computed hash value. The hash algorithm varies across different datapath implementations. dp_hash uses the upper 32 bits of the selection_method_param as the datapath hash algorithm selector. The supported values are 0 (corresponding to hash computation over the IP 5-tuple) and 1 (corresponding to a symmetric hash computation over the IP 5-tuple). Selecting specific fields with the fields option is not supported with dp_hash. The lower 32 bits are used as the hash basis.

For SessionAffinity, because the learned flow matches only the source IP but not the source port, a second connection using a different source port may get a different bucket before the learned flow is realized.

For DSR: when the learned flow is per src port, regardless of its existence, all packets of a connection will always get the same bucket; when the learned flow is per src IP, subsequent packets of a connection will get the same bucket as the first packet before the learned flow is realized and may change to another bucket after the learned flow is realized.

So there should be no problem as long as we keep source port in the learned flow.

I also found something more interesting: if I change the selection method from dp_hash to hash, the subsequent packets of a connection will always get the same bucket as the first packet even if the learned flow is per src IP. It seems there is no delay in this mode. I'm wondering whether switching to this mode for SessionAffinity could bring the following benefits:

  1. We don't need to change the default other_config:max-revalidator
  2. Session affinity will be enforced more strictly, even when the interval between connections is very small
  3. Performance may be better because we can choose which fields to hash in hash mode, while dp_hash always hashes the 5-tuple fields. For a Service, the protocol, dst IP, and dst port are the same, so we only need to hash the src IP and src port (a sketch of such a group follows this list).
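
A minimal sketch of what such a group could look like, assuming the selection_method=hash and fields() group properties are available via ovs-ofctl (they require OpenFlow 1.5); the group ID, bucket actions, and field choice are illustrative only, taken from the example flows above:

# Sketch only: hash on the client IP and source port instead of the dp_hash 5-tuple (assumes the bridge accepts OpenFlow 1.5).
ovs-ofctl -O OpenFlow15 add-group br-int "group_id=100,type=select,selection_method=hash,fields(ip_src,tcp_src),bucket=bucket_id:0,weight:100,actions=load:0xaf40102->NXM_NX_REG3[],load:0x50->NXM_NX_REG4[0..15],resubmit(,ServiceLB)"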

Besides, I found the reason why running a benchmark with new connections in DSR mode had worse performance than normal mode: inserting a learned flow into the datapath incurs more latency. After I reduced the number of learned flows by masking the src ports to ensure at most 64 flows would be learned, the performance became much better (a sketch of such a masked learn flow follows the numbers below):

  • ab (Do NOT use HTTP KeepAlive): 1607.16 rps (DSR) vs. 1070.32 rps (normal)
  • ab -k (Use HTTP KeepAlive): 8852.57 rps (DSR) vs. 5711.66 rps (normal)
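
A sketch of what the masked learn could look like: the learn flow from the original example, with the full source-port match replaced by its low 6 bits so that at most 64 flows are learned per client IP and Service (illustrative only, not necessarily the exact flows used for the numbers above):

# Sketch only: NXM_OF_TCP_SRC[0..5] makes the learned flow match just the low 6 bits of the source port.
ovs-ofctl add-flow br-int "cookie=0x1, table=ServiceLB, priority=195,tcp,reg4=0x30000/0x70000,nw_dst=172.18.0.10,tp_dst=80 actions=learn(table=SessionAffinity,idle_timeout=60,fin_idle_timeout=10,priority=200,delete_learned,cookie=0x1,eth_type=0x800,nw_proto=6,NXM_OF_TCP_SRC[0..5],NXM_OF_TCP_DST[],NXM_OF_IP_DST[],NXM_OF_IP_SRC[],load:NXM_NX_REG3[]->NXM_NX_REG3[],load:NXM_NX_REG4[0..15]->NXM_NX_REG4[0..15],load:0x2->NXM_NX_REG4[16..18],load:0x1->NXM_NX_REG0[9]),load:0x2->NXM_NX_REG4[16..18],resubmit(,EndpointDNAT)"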

I'm going to implement DSR with the revised flows if there is no other problem.

@antoninbas
Contributor

I also found something more interesting: if I change the selection method from dp_hash to hash, the subsequent packets of a connection will always get the same bucket as the first packet even if the learned flow is per src IP

I am not sure I understand this comment, unless you are talking about multiple connections, with different source ports.

I believe there is a significant difference between how hash and dp_hash work. hash has the advantage of letting you choose which fields are used to compute the hash. That means that we could omit the source port from the hash for SessionAffinity (that doesn't seem to be the default though).

This is from an old patch for the dp_hash implementation [1]:

The current default OpenFlow select group implementation sends every new L4 flow
to the slow path for the balancing decision and installs a 5-tuple "miniflow"
in the datapath to forward subsequent packets of the connection accordingly.
Clearly this has major scalability issues with many parallel L4 flows and high
connection setup rates.

The dp_hash selection method for the OpenFlow select group was added to OVS
as an alternative. It avoids the scalability issues for the price of an
additional recirculation in the datapath.

And the OVS manual entry for hash specifies:

Use a hash computed over the fields specified with the fields option, see below. If no hash fields are specified, hash defaults to a symmetric hash over the combination of MAC addresses, VLAN tags, Ether type, IP addresses and L4 port numbers. hash uses the selection_method_param as the hash basis.

Note that the hashed fields become exact matched by the datapath flows. For example, if the TCP source port is hashed, the created datapath flows will match the specific TCP source port value present in the packet received. Since each TCP connection generally has a different source port value, a separate datapath flow will need to be inserted for each TCP connection thus hashed to a select group bucket.

Also from the OVS manual, we have this for dp_hash:

This double-matching incurs a small additional latency cost for each packet, but this latency is orders of magnitude less than the latency of creating new datapath flows for new TCP connections.

Which seems to be a reference to one drawback of using hash.

My takeaways from all this are as follows:

  1. when using hash, we can choose to exclude the source port from the hash, eliminating the time window during which new connections using a different source port are sent to a different backend. However, by default, the hash should include the source port so we should still observe the same issue.
  2. hash seems to have scalability issues historically, which I am not sure we can ignore for our use case?
  3. hash may also incur higher latency during connection establishment?
  4. a lower revalidator delay may still be useful in case of Service endpoint update during the learning window: if we haven't observed issues so far (CPU usage), I think we can keep it as it is. A Service endpoint update means that traffic may be rebalanced.

With regards to the scalability issues of hash, they may not be relevant to us for the following reasons:

  • we plan on omitting fields from the hash, hence reducing the number of datapath flows
  • since we are using learning anyway, we are ok with increasing the number of datapath flows in the first place

The scalability impact is the same for hash and learning IMO. You observed it for learning in your experiment.

So only the increased latency may be of concern?

Footnotes

  1. https://dev.openvswitch.narkive.com/LGmF9C46/ovs-patch-v4-0-3-use-improved-dp-hash-select-group-by-default

@tnqn
Member Author

tnqn commented Jun 6, 2023

I also found something more interesting: if I change the selection method from dp_hash to hash, the subsequent packets of a connection will always get the same bucket as the first packet even if the learned flow is per src IP

I am not sure I understand this comment, unless you are talking about multiple connections, with different source ports.

I meant:

  • When using dp_hash and the learned flow is based on src IP, the 2nd connection could be disrupted by the learned flow of the 1st connection, because the 1st connection may select bucket A while the 2nd connection may select bucket B. Before the learned flow of the 1st connection is realized in the kernel, the first few packets of the 2nd connection will use bucket B; after the learned flow is realized, the subsequent packets of the 2nd connection will use bucket A.
  • When using hash and the learned flow is based on src IP, the 2nd connection won't be disrupted. My understanding is that the hash method forces the packet to be sent to OVS userspace, which enforces the learned flow without any delay.

I think the scalability issue and the higher latency refer to the fact that the hash method forces the packet to be processed by OVS userspace and a new datapath flow to be generated for every new connection. That's true for most cases, but probably not when learn is one of the actions, because the learn action already causes the packet to be sent to OVS userspace anyway. These are the results I got with different flows with ab -n 10000:

| Test | Requests per second | Time per request (ms) | Kernel flow number |
|---|---|---|---|
| 1. hash, no learn action | 1116.74 | 0.895 | 17537 |
| 2. hash, learn action based on source IP + port | 1169.29 | 0.855 | 16151 |
| 3. hash, learn action based on source IP + masked (6 bits) source port | 1671.39 | 0.598 | 142 |
| 4. hash, learn action based on source IP | 1723.53 | 0.580 | 14 |
| 5. dp_hash, no learn action | 1630.54 | 0.613 | 30 |
| 6. dp_hash, learn action based on source IP + source port | 1067.87 | 0.936 | 17347 |
| 7. dp_hash, learn action based on source IP + masked (6 bits) source port | - | - | - |
| 8. dp_hash, learn action based on source IP | - | - | - |
  • 1 and 5 don't work as they cannot guarantee consistent selection for a connection, but they validate that dp_hash is indeed more efficient when there is no learn action
  • 2 and 6 can work, but their performance is not the best, and they generate one userspace flow and one or two datapath flows for each connection, which are exactly the drawbacks of the hash method; however, dp_hash encounters them too when a learn action is present
  • 3's performance is close to the best and it generates a reasonable number of flows (which can be controlled via the bits of source port we include in the flows)
  • 4 has the best performance and generates the fewest datapath flows; however, it can cause uneven load balancing as one client IP can only access one backend
  • 7 and 8 don't have valid results because the delay of learned flows disrupts connections.

It seems to me that 3 is the most appropriate approach: it has nearly the best performance, generates a reasonable number of flows, and balances load relatively evenly.

@antoninbas
Contributor

That's good data. I agree that Option 3 looks like a good approach.

My understanding is that the hash method forces the packet to be sent to OVS userspace, which enforces the learned flow without any delay.

It feels to me that there are 2 flows being installed, one caused by the usage of hash and one because of the learn action. Is that correct? Given that you use 6 bits of the source port, one would expect 2^6 = 64 flows, but we instead have 128 + X flows (unless there is some recirculation?). If this is accurate, it feels like the learn action flow would still be subject to revalidator processing?

@tnqn
Member Author

tnqn commented Jun 7, 2023

There are indeed 2 flows being installed for each connection, but they are caused by two different ct_states.
For the first packet of a connection, the ct_state is "+new-inv+trk"; it will be upcalled due to the usage of hash, and the generated datapath flow will have a ct_state match like below:

recirc_id(0x4a3c),in_port(2),ct_state(+new-inv+trk),ct_mark(0/0x10),eth(),eth_type(0x0800),ipv4(src=172.18.0.1,dst=172.18.0.10,proto=6,frag=no),tcp(src=32/0x3f,dst=80), packets:0, bytes:0, used:never, actions:ct(commit,zone=65520,mark=0x12/0x1f),recirc(0x4a4b)

For the second packet of a connection, the ct_state is "-new+inv+trk"; it won't match the above flow and will be upcalled due to the usage of hash, and the generated datapath flow will look like:

recirc_id(0x4a3c),in_port(2),ct_state(+inv+trk),eth(),eth_type(0x0800),ipv4(src=172.18.0.1,dst=172.18.0.10,proto=6,frag=no),tcp(src=32/0x3f,dst=80), packets:3, bytes:198, used:1.153s, flags:F., actions:ct(commit,zone=65520,mark=0x12/0x1f),recirc(0x4a4b)

And because the second packet is upcalled, the learned flow of the first packet applies to it, so it's not subject to revalidator processing, and it triggers installing the learned flow into the datapath immediately.
The other subsequent packets of the connection will match the second datapath flow and won't be upcalled.
For other connections that have the same masked source port, their first packets will match the first datapath flow, and their subsequent packets will match the second datapath flow if it hasn't expired, or will be upcalled and match the learned flow.

In theory, we could change some flows to avoid generating datapath flows with ct_state matches; then even the second packet of the first connection wouldn't be upcalled and generate another datapath flow.

@antoninbas
Contributor

Thanks @tnqn, things are much clearer now.

Based on your explanation, if we replace dp_hash with hash for the SessionAffinity implementation, I am fine with reverting the max-revalidator configuration to its default.

However, I assume we should stick with dp_hash for the default Service case?

One more question: in your table, for the hash, do you always use src IP + masked port? I am surprised we have so many flows in the first case, "hash, no learn action" (17537).

@tnqn
Member Author

tnqn commented Jun 7, 2023

Based on your explanation, if we replace dp_hash with hash for the SessionAffinity implementation, I am fine with reverting the max-revalidator configuration to its default.

However, I assume we should stick with dp_hash for the default Service case?

Yes, dp_hash is more efficient when there is no learn action. I will evaluate the performance impact when making the change.

One more question: in your table, for the hash, do you always use src IP + masked port? I am surprised we have so many flows in the first case, "hash, no learn action" (17537).

All the tests in the table were executed without using a masked port in hash parameters. "hash, no learn action" had many datapath flows because the whole source port is used in the hash; "hash, learn action based on source IP + masked (6 bits) source port" and "hash, learn action based on source IP" had few datapath flows because most subsequent connections hit the learned flows or their datapath cache.

@antoninbas
Contributor

antoninbas commented Jun 7, 2023

All the tests in the table were executed without using a masked port in hash parameters.

That explains the large number of flows. I am surprised the latency is not a bit higher for 1 given that each new connection needs to be upcalled (btw, what's the unit for Time per request, is it ms?).

For our DSR use case (test 3 in the table), do you think it makes any difference to mask the source port in the hash? Maybe in the case where 2 connections (with the same masked source ports) are established almost "simultaneously", with the first packet of the second connection arriving before the upcall for the first packet of the first connection completes? That would be a total edge case though.

@tnqn
Member Author

tnqn commented Jun 8, 2023

That explains the large number of flows. I am surprised the latency is not a bit higher for 1 given that each new connection needs to be upcalled (btw, what's the unit for Time per request, is it ms?).

The unit is ms. The latency when each new connection needs to be upcalled (Cases 1, 2, 6) is indeed higher than the others (Cases 3, 4, 5).

For our DSR use case (test 3 in the table), do you think it makes any difference to mask the source port in the hash? Maybe in the case where 2 connections (with the same masked source ports) are established almost "simultaneously", with the first packet of the second connection arriving before the upcall for the first packet of the first connection completes? That would be a total edge case though.

There is no obvious performance difference according to the tests. I thought there could be issues if 2 connections with the same masked source port were established simultaneously when we don't mask the source port in hash, because one connection may map to bucket A and generate a learned flow with the masked source port, then another connection may map to bucket B and generate a learned flow with the same masked source port but a different action, overriding the first one. However, the issue didn't happen. I'm not sure whether OVS has some synchronization mechanism to avoid the problem, but I agree it's perhaps better to use the same masked source port in hash, for the following reasons:

  1. The selection will be consistent regardless of which flow a packet matches: the learned flow generated by preceding connections, or the hash flow.
  2. There will be fewer datapath flows, because the datapath flow that matches "+new" will match the masked source port instead of the whole source port.
  3. There will be a few fewer upcalls, because the first packet of a second connection with the same masked source port will hit the datapath flow instead of being upcalled, hitting the learned flow, and generating another datapath flow (which the 3rd connection could benefit from). A sketch of masking the port in the hash fields follows this list.
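
A rough sketch of masking the source port in the hash as well, assuming ovs-ofctl accepts a mask in the fields() property (the fields(field[=mask]...) form); the mask value is illustrative and the exact syntax should be double-checked against ovs-ofctl(8):

# Assumed syntax: hash on the client IP plus only the low 6 bits of the source port, mirroring the mask used in the learn flow.
ovs-ofctl -O OpenFlow15 add-group br-int "group_id=100,type=select,selection_method=hash,fields(ip_src,tcp_src=0x003f),bucket=bucket_id:0,weight:100,actions=load:0xaf40102->NXM_NX_REG3[],load:0x50->NXM_NX_REG4[0..15],resubmit(,ServiceLB)"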

@antoninbas
Contributor

There is no obvious performance difference according to the tests. I thought there could be issues if 2 connections with the same masked source port were established simultaneously when we don't mask the source port in hash, because one connection may map to bucket A and generate a learned flow with the masked source port, then another connection may map to bucket B and generate a learned flow with the same masked source port but a different action, overriding the first one.

That's also what I thought could happen

However, the issue didn't happen.

You mean, when running ab in concurrent mode?

Let's mask the source port in hash too then if there is no obvious downside.

@tnqn
Member Author

tnqn commented Jun 9, 2023

However, the issue didn't happen.

You mean, when running ab in concurrent mode?

Yes, I tried ab -c with 10 and 100.

Let's mask the source port in hash too then if there is no obvious downside.

Sure.
