
Cannot access ClusterIP service if the endpoint is on another Node when AntreaProxy is disabled #2319

Closed
tnqn opened this issue Jun 28, 2021 · 0 comments · Fixed by #2318
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.


tnqn commented Jun 28, 2021

Thanks @hxietkg for finding the issue.

Describe the bug

When AntreaProxy is disabled, a Pod cannot access the ClusterIP of a Service if the selected endpoint is on another Node.

For example, a DNS query against the kube-dns Service failed because the reply came from an unexpected source:

# kubectl exec -it client-87c5f74c4-tx98q -n dev2 -- nslookup www.google.com
;; reply from unexpected source: 192.168.1.162#53, expected 10.96.0.10#53
;; reply from unexpected source: 192.168.1.162#53, expected 10.96.0.10#53
;; reply from unexpected source: 192.168.1.162#53, expected 10.96.0.10#53
;; connection timed out; no servers could be reached

Accessing an HTTP Service timed out:

# kubectl exec -it client-87c5f74c4-tx98q -n dev2 -- curl 10.111.129.233
curl: (7) Failed to connect to 10.111.129.233 port 80: Connection timed out

The root cause is that, when the reply traffic of a connection that has been processed by kube-proxy's iptables/ipvs rules is received from the tunnel interface, its destination MAC is rewritten twice, because the packet has both gatewayCTMark and macRewriteMark set. The second rewrite overwrites the first one, so the packets are delivered to the destination Pod directly, without the reverse NAT being performed in the host netns.

table=0, priority=200,in_port="antrea-tun0" actions=load:0->NXM_NX_REG0[0..15],load:0x1->NXM_NX_REG0[19],resubmit(,30)
table=31, priority=200,ct_state=-new+trk,ct_mark=0x20,ip actions=mod_dl_dst:0e:6d:42:66:92:46,resubmit(,40)
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.0.34 actions=mod_dl_src:0e:6d:42:66:92:46,mod_dl_dst:3a:b4:4c:58:75:05,resubmit(,72)
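The interaction of the three flows above can be illustrated with a small simulation (hypothetical helper names and simplified table logic, not Antrea code): a reply packet that arrives on antrea-tun0 and carries gatewayCTMark (ct_mark=0x20) first gets the gateway MAC in table 31, then has it overwritten with the Pod MAC in table 70.

```python
# Minimal sketch of the double MAC rewrite described above.
# Function and field names are illustrative; only the MACs and marks
# come from the flow dumps in this issue.

GW_MAC = "0e:6d:42:66:92:46"   # antrea-gw0 MAC (from the table=31 flow)
POD_MAC = "3a:b4:4c:58:75:05"  # destination Pod MAC (from the table=70 flow)

def classifier(pkt):
    # table=0: traffic from the tunnel gets macRewriteMark (reg0[19])
    if pkt["in_port"] == "antrea-tun0":
        pkt["mac_rewrite_mark"] = True

def conntrack_state(pkt):
    # table=31: reply of a connection marked with gatewayCTMark (0x20)
    # is rewritten toward the gateway, so the host netns can reverse the NAT
    if pkt["ct_mark"] == 0x20:
        pkt["dl_dst"] = GW_MAC

def l3_forwarding(pkt):
    # table=70: any packet with macRewriteMark gets the destination Pod's
    # MAC, silently overwriting the rewrite done by table=31
    if pkt["mac_rewrite_mark"]:
        pkt["dl_dst"] = POD_MAC

pkt = {"in_port": "antrea-tun0", "ct_mark": 0x20,
       "mac_rewrite_mark": False, "dl_dst": None}
for table in (classifier, conntrack_state, l3_forwarding):
    table(pkt)

# The packet should go to the gateway for reverse NAT, but the second
# rewrite wins and it is delivered straight to the Pod:
print(pkt["dl_dst"])  # → 3a:b4:4c:58:75:05 (Pod MAC, not the gateway MAC)
```

Because the packet bypasses the gateway, the kube-proxy DNAT is never reversed, which is why the client sees replies from the backend's real IP (192.168.1.162) instead of the ClusterIP.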

To Reproduce

  1. Disable AntreaProxy
  2. Access a Service's ClusterIP from a Pod which is running on a Node different from the Nodes that the Service's backend Pods are running on

Expected
The access should succeed.
The failure should be caught by CI tests.

Actual behavior
The access failed.
No existing CI test can catch it reliably, because the upstream tests don't run with AntreaProxy disabled and the Antrea-specific e2e tests don't have a dedicated cross-Node Service access case.

Versions:

  • Antrea version (Docker image tag): v0.13.0 - v1.1.0
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 28, 2021
@tnqn tnqn added this to the Antrea v1.2 release milestone Jun 28, 2021
@tnqn tnqn added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 28, 2021
@tnqn tnqn self-assigned this Jun 28, 2021