
Cannot access ClusterIP service if the endpoint is on another Node when AntreaProxy is disabled #2319

Closed
tnqn opened this issue Jun 28, 2021 · 0 comments · Fixed by #2318
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.


tnqn commented Jun 28, 2021

Thanks @hxietkg for finding the issue.

Describe the bug

When AntreaProxy is disabled, a Pod cannot access the ClusterIP of a Service if the selected endpoint is on another Node.

For example, a DNS query against the kube-dns Service failed because the reply came from an unexpected source:

# kubectl exec -it client-87c5f74c4-tx98q -n dev2 -- nslookup www.google.com
;; reply from unexpected source: 192.168.1.162#53, expected 10.96.0.10#53
;; reply from unexpected source: 192.168.1.162#53, expected 10.96.0.10#53
;; reply from unexpected source: 192.168.1.162#53, expected 10.96.0.10#53
;; connection timed out; no servers could be reached

Accessing an HTTP Service timed out:

# kubectl exec -it client-87c5f74c4-tx98q -n dev2 -- curl 10.111.129.233
curl: (7) Failed to connect to 10.111.129.233 port 80: Connection timed out

The root cause is that, when the reply traffic of a connection that has been processed by kube-proxy's iptables/ipvs rules is received from the tunnel interface, its destination MAC is rewritten twice, because the packet has both gatewayCTMark and macRewriteMark set. The second rewrite overwrites the first one, so the packets are delivered to the destination Pod directly, without the reverse NAT being performed in the host netns.

table=0, priority=200,in_port="antrea-tun0" actions=load:0->NXM_NX_REG0[0..15],load:0x1->NXM_NX_REG0[19],resubmit(,30)
table=31, priority=200,ct_state=-new+trk,ct_mark=0x20,ip actions=mod_dl_dst:0e:6d:42:66:92:46,resubmit(,40)
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=192.168.0.34 actions=mod_dl_src:0e:6d:42:66:92:46,mod_dl_dst:3a:b4:4c:58:75:05,resubmit(,72)
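The interaction of the three flows above can be illustrated with a small simulation (hypothetical helper names and simplified table logic, not Antrea code): a reply packet that arrives on antrea-tun0 and carries gatewayCTMark (ct_mark=0x20) first gets the gateway MAC in table 31, then has it overwritten with the Pod MAC in table 70.

```python
# Minimal sketch of the double MAC rewrite described above.
# Function and field names are illustrative; only the MACs and marks
# come from the flow dumps in this issue.

GW_MAC = "0e:6d:42:66:92:46"   # antrea-gw0 MAC (from the table=31 flow)
POD_MAC = "3a:b4:4c:58:75:05"  # destination Pod MAC (from the table=70 flow)

def classifier(pkt):
    # table=0: traffic from the tunnel gets macRewriteMark (reg0[19])
    if pkt["in_port"] == "antrea-tun0":
        pkt["mac_rewrite_mark"] = True

def conntrack_state(pkt):
    # table=31: reply of a connection marked with gatewayCTMark (0x20)
    # is rewritten toward the gateway, so the host netns can reverse the NAT
    if pkt["ct_mark"] == 0x20:
        pkt["dl_dst"] = GW_MAC

def l3_forwarding(pkt):
    # table=70: any packet with macRewriteMark gets the destination Pod's
    # MAC, silently overwriting the rewrite done by table=31
    if pkt["mac_rewrite_mark"]:
        pkt["dl_dst"] = POD_MAC

pkt = {"in_port": "antrea-tun0", "ct_mark": 0x20,
       "mac_rewrite_mark": False, "dl_dst": None}
for table in (classifier, conntrack_state, l3_forwarding):
    table(pkt)

# The packet should go to the gateway for reverse NAT, but the second
# rewrite wins and it is delivered straight to the Pod:
print(pkt["dl_dst"])  # → 3a:b4:4c:58:75:05 (Pod MAC, not the gateway MAC)
```

Because the packet bypasses the gateway, the kube-proxy DNAT is never reversed, which is why the client sees replies from the backend's real IP (192.168.1.162) instead of the ClusterIP.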

To Reproduce

  1. Disable AntreaProxy
  2. Access a Service's ClusterIP from a Pod which is running on a Node different from the Nodes that the Service's backend Pods are running on

Expected
The access should succeed.
The failure should be caught by CI tests.

Actual behavior
The access failed.
No existing CI test can catch it reliably, because the upstream tests don't run with AntreaProxy disabled and the Antrea-specific e2e tests don't have a dedicated cross-Node Service access case.

Versions:

  • Antrea version (Docker image tag): v0.13.0 - v1.1.0
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 28, 2021
@tnqn tnqn added this to the Antrea v1.2 release milestone Jun 28, 2021
@tnqn tnqn added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 28, 2021
@tnqn tnqn self-assigned this Jun 28, 2021