
Flannel and kube-proxy race for postrouting chain #20391

Closed
bprashanth opened this issue Jan 31, 2016 · 36 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bprashanth
Contributor

More recent versions of flannel, when started with the --ip-masq flag, force a jump to the FLANNEL chain, where there's an 'ACCEPT all traffic from node subnet' rule, i.e. something like:

$ iptables -t nat -N FLANNEL
$ iptables -t nat -A POSTROUTING -s $NODE_CIDR -j FLANNEL
$ iptables -t nat -A FLANNEL -d $NODE_CIDR -j ACCEPT

$ sudo service kube-proxy start

From: https://github.com/coreos/flannel/blob/master/network/ipmasq.go#L32

Since ACCEPT is a built-in target like DROP, it'll stop processing any of the kube service rules. This manifests as a bug that looks like a misconfigured hairpin mode, i.e. if a pod gets load-balanced to itself when accessing the service DNS name, packets get dropped because of a martian source.

@kubernetes/goog-cluster How do we coordinate? Is it safe to always have kube-proxy prepend?
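
For illustration, the difference between "append" and "prepend" for the kube-proxy jump, sketched as raw iptables commands (a minimal sketch, not the exact rules kube-proxy programs; KUBE-POSTROUTING is the chain kube-proxy uses in iptables mode):

# Appended: evaluated only if no earlier POSTROUTING rule (e.g. flannel's jump to
# the FLANNEL chain and its ACCEPT) already terminated nat processing for the packet.
$ iptables -t nat -A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
# Prepended at position 1: evaluated before anything flannel appended, regardless of start order.
$ iptables -t nat -I POSTROUTING 1 -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING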

@ArtfulCoder
Contributor

Users are hitting this issue. (referenced issue #22717)

@ArtfulCoder
Contributor

Another option could be along these lines: #22717 (comment)

@mwhooker

We're hitting this, too. Does anyone have a workaround? Tried iptables -t nat -I FLANNEL -s 10.2.0.0/16 -d 10.2.0.0/16 -j MASQUERADE without any luck.

I think we could get by with --proxy-mode=userspace, but according to the invocation that's qualitatively worse than iptables.

@thockin
Member

thockin commented Apr 19, 2016

Can flannel maybe ONLY install this rule if the policy for POSTROUTING is DROP? @steveej what do you think? We need to coordinate a bit...
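
A rough sketch of what that conditional could look like, written as hypothetical shell rather than flannel's actual Go code (NODE_CIDR as in the rules quoted at the top of this issue):

# Hypothetical: only use the terminating ACCEPT when the built-in chain's policy is DROP;
# otherwise RETURN so later POSTROUTING rules (e.g. KUBE-POSTROUTING) still run.
if iptables -t nat -S POSTROUTING | grep -q '^-P POSTROUTING DROP'; then
  iptables -t nat -A FLANNEL -d "$NODE_CIDR" -j ACCEPT
else
  iptables -t nat -A FLANNEL -d "$NODE_CIDR" -j RETURN
fi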

@martynd

martynd commented Apr 22, 2016

@mwhooker
At a guess, did you try that command based on this comment?

If you are still experiencing the issue, try running . /var/run/flannel/subnet.env && iptables -t nat -I FLANNEL -s $FLANNEL_NETWORK -d $FLANNEL_NETWORK -j MASQUERADE to automatically add the rule using flannel's runtime config.

If that isn't where your env file is stored, you can just use the IP/subnet on the flannel.1 interface, as that should match. Just switch out the 10.2.0.0/16's in the original command, or edit the /var/run/flannel/subnet.env path in the previous command accordingly.
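
For convenience, the same workaround as a standalone snippet (assuming the /var/run/flannel/subnet.env path and FLANNEL_NETWORK variable mentioned above; adjust both to your setup):

# Load FLANNEL_NETWORK (and friends) from flannel's runtime config ...
. /var/run/flannel/subnet.env
# ... then insert a MASQUERADE rule at the top of the FLANNEL chain, so it is
# evaluated before flannel's own ACCEPT rule for intra-network traffic.
iptables -t nat -I FLANNEL -s "$FLANNEL_NETWORK" -d "$FLANNEL_NETWORK" -j MASQUERADE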

RE the userspace proxy, it would work as a short-term band-aid, but it won't scale as well, so I'd limit its use to testing.

@mwhooker

@martynd yes, that's the comment. I looked at /var/run/flannel/subnet.env and found FLANNEL_NETWORK=10.2.0.0/16, so I have to assume the result will be the same as when I tried it earlier.

I can try again but it will have to wait until later this week.

Thanks for the thought, but I'm still looking for alternate ideas.

@martynd

martynd commented Apr 26, 2016

I've come across a similar issue to this one.
Give this a go; it solved the other issue:
. /var/run/flannel/subnet.env && iptables -t nat -A POSTROUTING -s $FLANNEL_SUBNET -o docker0 -j ACCEPT

@thockin
Member

thockin commented Apr 30, 2016

@steveej ping?

@lxpollitt

cc @tomdee

@steveej

steveej commented May 26, 2016

@thockin

ping?

I'll put my head together with @tomdee to figure this out.
Some information up front would be good; this also seems related to the installation/configuration (documentation) of kubernetes. What is the official way of setting up flannel with kubernetes? Does kubernetes rely on flannel's --ip-masq behavior in any way?

@bprashanth
Contributor Author

What is the official way of setting up flannel with kubernetes?

There is no "official way", but there is a way that should just work: run kube-up with NETWORK_PROVIDER=flannel (https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/config-default.sh#L134).

Does kubernetes rely on flannel's --ip-masq behavior in any way?

Someone needs to add that masq rule (flannel-io/flannel#318)

@tomdee

tomdee commented May 31, 2016

I don't think it's right that flannel does an ACCEPT. A much better option would be a RETURN. As @thockin points out above, that could be an issue if the default policy for the POSTROUTING chain were DROP, but as far as I understand it that would be an extremely strange thing to do.

I'm going to put up a PR for flannel to change ACCEPT to RETURN and that should resolve this bug.
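
In iptables terms, the proposed change amounts to swapping the target on the rule quoted at the top of this issue (a sketch only; the actual change lives in flannel's Go code):

# ACCEPT is terminating: nat POSTROUTING processing stops here, so rules added after
# flannel's jump (such as kube-proxy's KUBE-POSTROUTING jump) are never evaluated.
iptables -t nat -A FLANNEL -d $NODE_CIDR -j ACCEPT
# RETURN only leaves the FLANNEL chain: POSTROUTING continues with its remaining rules.
iptables -t nat -A FLANNEL -d $NODE_CIDR -j RETURN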

@martynd

martynd commented May 31, 2016

Perhaps it could use ACCEPT/RETURN based on the default policy? Probably a better fit that way.

@changleicn

Any updates on this issue?

@tomdee

tomdee commented Nov 16, 2016

@changleicn It looks like I fixed this back in May in flannel, so maybe this isn't an issue any more?

@Hades32

Hades32 commented Nov 27, 2016

Happening for me in a fresh "kubeadm init" installation.

My workaround is finding the "RETURN" rule in "POSTROUTING" that "ignores" intra-pod traffic and simply deleting it. Works fine until k8s restarts...

Could someone explain why this rule is needed in the first place? Is it just an optimization?

# symptom is like this:

[root@pine64 ~]# docker exec -it 3c93 bash
root@bash-2239139833-xavbo:/# dig grafana
;; reply from unexpected source: 10.244.0.47#53, expected 10.96.0.10#53
# find NAT rule
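# (listing presumably produced with something like: iptables -t nat -L POSTROUTING -n -v --line-numbers)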
Chain POSTROUTING (policy ACCEPT 4 packets, 255 bytes)
num   pkts bytes target     prot opt in     out     source               destination
1    22421 1487K KUBE-POSTROUTING  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */
2      803 52102 RETURN     all  --  *      *       10.244.0.0/16        10.244.0.0/16
# delete it
iptables -t nat -D POSTROUTING 2

Versions: (system is arm64)

kubeadm/kubernetes-xenial,now 1.5.0-alpha.2-421-a6bea3d79b8bba-00 arm64 [installed]
kubectl/kubernetes-xenial,now 1.4.4-00 arm64 [installed]
kubelet/kubernetes-xenial,now 1.4.4-01 arm64 [installed]
kubernetes-cni/kubernetes-xenial,now 0.3.0.1-07a8a2-00 arm64 [installed]
[root@pine64 ~]# docker ps | grep flann
2b7a51958274        quay.io/coreos/flannel-git:v0.6.1-28-g5dde68d-arm64

@thockin
Member

thockin commented Nov 29, 2016

@tomdee I don't have a flannel install up right now to poke at, but maybe we can hash out a protocol for this between kube and flannel? Is flannel still installing rules into POSTROUTING unilaterally?

@tomdee

tomdee commented Dec 1, 2016

@thockin sounds good. I'd need to go and check but yes, I think flannel is still installing POSTROUTING rules unilaterally.

@thockin
Member

thockin commented Dec 2, 2016 via email

@abemusic

Has there been any progress on this? I think we may also be hitting this, but I'm not sure. All pod-to-pod interactions across nodes seem to be fine when using their service, but host-to-service works only until the service gives back a pod running on that host. If that makes sense :)

@phillydogg28

Bump this again. I am not able to use the service endpoints from inside a pod of the service.

PodA is part of ServiceA.
PodB is part of ServiceB.
In PodA, I can curl ServiceB. In PodB I can curl ServiceA.
In PodA, I *cannot* curl ServiceA. In PodB I *cannot* curl ServiceB.

Should I be able to access ServiceA from inside PodA?
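
In a correctly configured cluster a pod can reach a Service it backs; that failing is the hairpin-style symptom described at the top of this issue. One diagnostic worth running (a sketch that assumes a docker0 bridge like the setups earlier in this thread) is checking whether hairpin mode is enabled on the bridge ports:

# 1 = hairpin on; 0 = the bridge will not forward traffic back out the port it
# came in on, so a pod load-balanced to itself gets no reply.
for port in /sys/class/net/docker0/brif/*; do
  echo "$port: $(cat "$port/hairpin_mode")"
done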

@jl-dos

jl-dos commented Dec 11, 2017

Same issue. Any update?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2018
@codebreach

Hey @phillydogg28, @abemusic and @jl-dos

We are facing the same issue and were wondering if any of you had found a workaround or a resolution?

Happening on GKE with no plugins on both 1.9.4-gke.1 and 1.7.12-gke.1. Node/Master OS is ContainerOS

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 22, 2018
@guoshimin
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 22, 2018
@rikatz
Contributor

rikatz commented Apr 22, 2018

A tangential but related note: changing to a pure Calico stack (only Calico, without flannel) solved my race condition issue in a big environment.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 20, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@poidag-zz
Contributor

A tangential but related note: changing to a pure Calico stack (only Calico, without flannel) solved my race condition issue in a big environment.

Hi @rikatz, do you have any guidelines on what you did to migrate from Flannel to Calico?

@discostur

Just installed a k8s v1.24.7 cluster with flannel v0.20.1 and the issue still seems to exist:

;; reply from unexpected source: 10.244.0.45#53, expected 10.96.0.10#53
;; reply from unexpected source: 10.244.0.45#53, expected 10.96.0.10#53
;; reply from unexpected source: 10.244.0.45#53, expected 10.96.0.10#53

DNS only works if the CoreDNS pod runs on the same node as the client. Pod-to-pod communication via services is broken...

@aojea
Member

aojea commented Nov 8, 2022

But that is a flannel problem, isn't it?

It should be reported in its repo.

@discostur

@aojea you are right, it is specific to flannel, not to k8s. Please forget about my post.

@aojea
Member

aojea commented Nov 8, 2022

No worries, sometimes it's the other way around.
