Docker 1.12 swarm mode load balancing not consistently working #25325
Comments
I am at the moment a little bit baffled. If you have any tips on where I can look to provide more detailed information, please let me know and I will provide it. |
I ran into the same problem with a 3-node setup. I brought up a service with 5 replicas using the following command:
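(The actual command was not captured in this thread; the sketch below is a hypothetical reconstruction of such a service create, with an assumed image name, network name, and port mapping.)

```sh
# Hypothetical reconstruction -- image, network, and ports are assumed examples,
# not the reporter's actual values.
docker network create -d overlay webnet
docker service create \
  --name web \
  --replicas 5 \
  --network webnet \
  --publish 8080:80 \
  nginx
```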
My three nodes are located at AWS Tokyo, Vultr Tokyo, and DigitalOcean SGP1. |
@mschirrmeister Is the problem still there if you start a service with 3 replicas directly, instead of starting the service with one replica and then scaling up? |
@mrjana Yes, the problem is still there, even if I start the service with the option |
While creating/deleting and querying services today, I watched syslog on the host for errors and saw the following. Not sure how bad that is, or if it is helpful. Querying a service with curl:
Adding a service:
Deleting a service:
|
I wasn't able to reproduce it using
|
Hi all, I too have issues with the load balancer suddenly stopping work. My setup is a simple single server (CentOS) with Docker 1.12 installed. After a while, the following simple play-around actions caused it to stop working:
I tested it via 2 external servers with curl (watch -n1 "curl -s 10.3.x.x | grep -e 'My hostname|My address'"). The first symptom was that the load balancer stopped round-robining the containers; each curl stayed on the same container, on all curl tests on all servers, including curl on the server itself. Then the load balancer stopped altogether, with all curl tests resulting in timeouts. syslog at the time it stopped: Some time after this, I scaled the swarm service again to 0, and then back to 2. It then worked again. |
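For readability, the test loop from the comment above as a standalone command. The 10.3.x.x address is the masked published address from that report; substitute your own. Two `-e` patterns are used so both lines actually match, since plain grep does not treat `|` as alternation:

```sh
# Poll the published service address once per second and show which backend answers.
# 10.3.x.x is the masked host/VIP address from the report above.
watch -n1 "curl -s 10.3.x.x | grep -e 'My hostname' -e 'My address'"
```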
Result after node restart:
If I add 3rd container:
Result after adding 3rd node looks like that:
|
@thaJeztah this is one of the issues that is potentially solved via #25603. @mschirrmeister can you please confirm ? |
I have the same issue! I tested with CentOS and Ubuntu nodes; same issue. Usually, the issue occurs only on the node that was restarted. Running "systemctl restart docker" after the reboot apparently resolves the issue for a moment, but it returns after some minutes. |
I updated to 1.12.1-rc1. The package upgrade also restarted the service. I then did a full restart of the hosts and re-created the services. Access via curl works at the moment and all 3 backends on all 3 hosts responded. I will monitor the situation a little more to see if it breaks again. When it was not working after the upgrade, I connected to the container on the docker host that was always load balancing to the same backend and did a DNS lookup to |
@mschirrmeister I have seen that same symptom on several occasions... and I traced it to the IPVS table not being populated correctly (why, I don't know). To confirm the issue, cat /proc/net/ip_vs in the ingress-sbox namespace (that's the network namespace doing load balancing for requests coming from the outside), i.e. cd /var/run/docker/netns/ ; nsenter --net=5683f2b6e546 cat /proc/net/ip_vs. Those last 3 lines are the hex-encoded IPs of the containers being load balanced for the service. Also, if you have multiple services defined already... you'll have several of these entries. This one was for the service marked with FWM 0x01C2. Find what traffic this was originally for with: cd /var/run/docker/netns/ ; nsenter --net=5683f2b6e546 iptables -t mangle -L -n -v. So this service is the one with port 8080 published to the outside world. |
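A compact sketch of the inspection steps described in the comment above. The namespace file name (5683f2b6e546) is the one from that report and will differ on other hosts:

```sh
# Inspect the swarm ingress load-balancer state on a docker host.
# Look under /var/run/docker/netns/ for the namespace handling ingress
# traffic; the ID below is just the one from the report above.
cd /var/run/docker/netns/

# IPVS table: each FWM (firewall mark) entry lists the hex-encoded
# backend container IPs the service is balanced across.
nsenter --net=5683f2b6e546 cat /proc/net/ip_vs

# Mangle table: maps the published port to the firewall mark (e.g. 0x1c2),
# so you can tell which service an FWM entry belongs to.
nsenter --net=5683f2b6e546 iptables -t mangle -L -n -v
```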
@somejfn This is exactly the problem that was fixed in 1.12.1-rc1, i.e. incorrect backend information in IPVS. Are you using 1.12.1-rc1? |
@mrjana I was on 1.12.0 with pre-built binaries from https://get.docker.com/builds/Linux/x86_64/docker-latest.tgz. I'm on CoreOS (hence no package manager), so I guess I'd need to build from source to get 1.12.1-rc1 until 1.12.1 is GA? |
@somejfn Yes, that's right. |
I am definitely running 1.12.1-rc1.
I can confirm I still see the issue. Today I did another reboot of all 3 docker hosts, then started the docker daemon on all 3 hosts, and the swarm cluster was back up and running.
I then created my service again with 1 replica, scaled it to 3, and accessed it from my client. Host1:
Host2/Host3 look like this.
When I do a service remove, the entry in |
I have the same issue with 1.12.1-rc1. Environment:
Steps to reproduce:
At this point, I can access the service through just one host.
IPVSADM is forwarding correctly to IPs 10.255.0.12 and 10.255.0.32. NODE 2 (not working):
ipvsadm on node 2 continues forwarding connections to the old IPs (0.26, 0.29, 0.30, etc.); i.e., the scale-down didn't update ipvsadm. |
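The backend addresses in /proc/net/ip_vs are hex-encoded, as noted earlier in the thread. As an illustration, 10.255.0.12 from the working node corresponds to the hex entry 0AFF000C and can be decoded like this:

```sh
# Decode a hex-encoded IPVS backend address (e.g. 0AFF000C) to dotted-quad form.
# 0A=10, FF=255, 00=0, 0C=12  ->  10.255.0.12
printf '%d.%d.%d.%d\n' 0x0A 0xFF 0x00 0x0C
```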
@mschirrmeister @asmialoski I've run these scale up/down tests many times and haven't seen any issues. Can you please post the daemon logs from the nodes where you are having issues? |
@mrjana Please see the logs in the attachments. Let me explain the steps I performed. I have two nodes: 1 MASTER and 1 WORKER.
Tks, |
|
My logs are available here: https://gist.github.com/mschirrmeister/e1b86b93b4524066de7a06aee5bb80ef What I did was again:
|
I have similar experiences. When starting with a fresh swarm and a freshly deployed stack (using "docker stack deploy") it works. I don't do scaling, but I regularly redeploy services (using "docker stack deploy"). After each deploy of the same stack (with updated images) I get more problems accessing the containers. Sometimes I get connection refused, but mostly I get "Connection timed out". It might be of interest that I regularly restart docker on the node where the deploy command is issued. (Swarm-related commands start to give "Error response from daemon: rpc error: code = 4 desc = context deadline exceeded". Restarting docker is the only way I've found to recover from this.) |
@mschirrmeister You seem to be having a basic issue with load balancing. There seems to be something unique in your environment. I would have to take a look at your hosts to see what's different. I know you offered to provide access to your machines. Is that still possible? |
I have the same problem on 6 Raspberry Pi nodes on 1.12.1...
This looks like it's trying to do round-robin but it can't find its way to the other nodes...
Rebooting all nodes didn't help...
To me this looks like no network setup is being done...
Please let me know if I can provide more information. As an aside, why the difference between |
I would like to add that I created the service with 6 replicas from the outset and the networking never worked. This isn't a case of scaling up or down after the service launch. |
@DarkerMatter You most probably have a different problem. Since you are on an RPi, and if you are on the Raspbian distro, I would check if you have the vxlan module in your kernel by doing
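(The exact command was cut off above; a common way to check for the module, as an assumption and not necessarily the command intended here, is:)

```sh
# Check whether the vxlan kernel module is available/loaded.
# Hypothetical reconstruction -- the original command was truncated above.
lsmod | grep vxlan      # is it currently loaded?
sudo modprobe vxlan     # can it be loaded? fails if the kernel lacks vxlan support
```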
How is this closed? The OP @mschirrmeister never confirmed it is fixed. |
I will reopen the issue. This issue has become a kitchen-sink issue for various different problems. For example, the issue reported by @asmialoski in this thread is definitely fixed by 99c3968. I mentioned this issue in the commit logs of my PR, which automatically closed this issue. But the original issue as reported by @mschirrmeister is probably not resolved yet. We can keep it open until that is resolved. @asmialoski If you want, you can use a docker/docker master build to verify whether your issue is resolved now. |
@mrjana Thanks Jana, it's working now. I'll get in touch with Alex Ellis so he can update the Swarm tutorial. |
@mrjana Access is still possible. Wrote you an email |
@mschirrmeister I took a look at the logs from one of your nodes. It looks like you had a gossip failure and we did not recover from it. This problem with recovery is being fixed via moby/libnetwork#1446 and soon will be available in docker/docker. But we still need to understand how this cluster got into a gossip failure. Are these nodes in different regions or availability zones in the cloud? |
@mrjana No, the nodes are not in different regions. They are all in one Azure region, and also in an availability set, which makes sure they do not run on the same hardware under the hood. Whether Microsoft also spreads that across multiple datacenters under the hood, I do not know, but even if so, these can be treated like local DCs. I could reproduce it by rebooting all nodes at more or less the same time. Maybe it is somehow a timing thing? I have not tried leaving them alone for a while and starting docker and containers/services a few minutes later. |
@mschirrmeister One of the nodes seems to have gone through multiple leader timeouts, and re-election has happened, indicating that the underlying network partitioning issue is not isolated to gossip alone; even raft seems to have been affected. At this point I will assume there was some problem and that the recovery fix we have in moby/libnetwork#1446 will help solve it. Since you mentioned that you are seeing this consistently, one experiment you could try is to run only one of the nodes as a manager, instead of all 3 of them, to see if raft itself is contributing to the underlying network congestion issue. |
Note for 1.12.2: we might need to extract the fix from 99c3968. |
This issue is a duplicate of #26563 |
@vieux I guess #26563 would be a duplicate of #25325 if it is the same thing. But reading #26563, it is to me a different issue, since there it is reported that a node is not responding at all on the port, whereas the issue here is that the port is open on all nodes but traffic is not properly load balanced across all containers. @mrjana I tried it with only 1 leader. I demoted two nodes and rebooted them again; once they came up, I started docker and created a service. Same issue: 2 nodes balanced across 2 containers and 1 node balanced only to 1 container. (This was not the latest version; still my last compiled 1.13 dev version.) |
How is this a duplicate? #26563 seems to describe a different situation. |
FYI, running Docker 1.12.2 swarm with 4-6 nodes, this networking issue disappeared when I made sure the |
I had a similar problem; I am running 3 managers and 1 worker for testing (Docker 1.12.2, AWS EC2 with Ubuntu 14.04).
Now it's working. Here are my two services
|
I am locking this issue for commenting because (as already mentioned in #25325 (comment)), this issue has been collecting various unrelated (or "not-directly related") issues related to swarm networking. If you're still having issues, please open a new issue with details |
Hi,
I have a problem with the Docker 1.12 swarm mode load balancing. The setup has 3 hosts with Docker 1.12 on CentOS 7, running in Azure. Nothing really special about the hosts: a plain CentOS 7 setup, Docker 1.12 from the Docker yum repo, and btrfs as a data disk for /var/lib/docker.
If I create 2 services, scale them to 3, and then try to access them from a client, access occasionally does not work. What this means is that if you access a service via the docker host IP address(es) and exposed ports, some containers do not respond.
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.):
Current test environment is running on Microsoft Azure
Steps to reproduce the issue (a command-level sketch follows the list below):
Create overlay network
Create services and scale them
docker service ls
docker service ps service1
docker service ps service2
docker service inspect service1
Access service1 from a client against docker host 1
Access service2 from a client against docker host 1
Access service1 from a client against docker host 2
Access service2 from a client against docker host 2
Access service1 from a client against docker host 3
Access service2 from a client against docker host 3
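A minimal sketch of what the reproduction steps above might look like as commands. The network name (mynet), image (nginx), and published ports (8081/8082) are assumptions for illustration, not the values from the original report:

```sh
# Create the overlay network (name is an assumed example).
docker network create -d overlay mynet

# Create two services and scale them to 3 replicas each
# (image and published ports are assumed examples).
docker service create --name service1 --network mynet --publish 8081:80 nginx
docker service create --name service2 --network mynet --publish 8082:80 nginx
docker service scale service1=3 service2=3

# Check placement.
docker service ls
docker service ps service1
docker service ps service2

# From a client, hit each service through each docker host's public IP;
# with the routing mesh, every host should answer and round-robin
# across all 3 containers of the service.
for host in host1 host2 host3; do
  curl -s http://$host:8081
  curl -s http://$host:8082
done
```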
Describe the results you received:
Not all containers respond when accessing the service via the docker host ip addresses and exposed ports.
Describe the results you expected:
All containers from a service should respond no matter via which docker host the service is accessed.
Additional information you deem important (e.g. issue happens only occasionally):
The issue happens occasionally: sometimes if you delete and re-create the service, all containers respond; other times, containers on a different host do not respond.
It is at least consistent once a service is created. Let's say containers on host 2 and host 3 do not respond when accessed via docker host 1; then it stays like this for the lifetime of that service.