docker-compose hangs #6678

Closed
xcash opened this issue Apr 29, 2019 · 51 comments

xcash commented Apr 29, 2019

Description of the issue

In a new server installation (a VPS provided by OVH) running Ubuntu 18.x, the Docker engine itself works flawlessly and fast, but docker-compose sometimes hangs indefinitely for no apparent reason. This happens with every command: version, ps, up, logs, and so on.
I tried everything related to DNS and name resolution, even disabling systemd-resolved.
When it works it is fast. When it hangs I can see the /tmp/_Mxyz123123 directory being created.
Running with --verbose, it hangs before producing any output and then completes flawlessly.

Context information (for bug reports)

Output of docker-compose version

docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018

Output of docker version

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker-compose config
(Make sure to add the relevant -f and other flags)

... not related to a specific docker-compose file

Steps to reproduce the issue

  1. docker-compose

Observed result

The command sometimes hangs indefinitely.

Expected result

The command should run fast.

Stacktrace / full error message

No stack trace, but I managed to strace the behavior.

[...]
openat(AT_FDCWD, "/tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile> <0.000038>
fstat(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000020>
write(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, "# Generated automatically from M"..., 69632) = 69632 <0.000130>
fchmod(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, 0700) = 0 <0.000050>
write(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, "type.o   -o Modules/xxsubtype$(E"..., 43) = 43 <0.000029>
close(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>) = 0 <0.000021>
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2088645a10) = 3110 <0.000276>
rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 <0.000051>
rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 0x7ffe82f7d900, 8) = -1 EINVAL (Invalid argument) <0.000031>
rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 <0.000028>
wait4(-1, 

[... here it hangs and then after 114s it continues output ...]

wait4(-1, docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3110 <114.945565>
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3110, si_uid=0, si_status=0, si_utime=43, si_stime=6} ---
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 8) = 0 <0.000094>
rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 0x7ffe82f7d900, 8) = -1 EINVAL (Invalid argument) <0.000150>

[...]

Additional information

Linux vps[omit] 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 18.04.2 LTS

docker-compose installed with curl

I have a lot of other docker-compose environments working, and this is the first time I've seen this behavior. Any idea what to check or where to look?


jcsirot commented Apr 29, 2019

@xcash Thank you for the ticket. We are going to investigate. Do you have a minimal docker-compose.yaml that triggers the issue? Or does the compose file contain some specific elements (like volumes or a dedicated network)?


xcash commented Apr 30, 2019

@jcsirot No, it happens even when issuing a simple docker-compose version without any docker-compose.yaml present, so it doesn't seem related to the compose file. As the strace above shows, it appears to hang before even looking for it, while checking for the docker subsystem.

By the way, the compose file mounts local volumes, uses the default network, and is based on a bitnami/nginx image.

hschalamonek commented

This happens to me as well, running on a Google Compute Engine instance with the same docker versions as above. Also Ubuntu 18.04. Even docker-compose --help does this, so no docker-compose.yaml is involved.


bitbuck3t commented May 2, 2019

Same behavior for me as well. I'm running a Digital Ocean droplet with Ubuntu 18.04.2 LTS and the 4.15.0-48-generic x86_64 GNU/Linux kernel. docker-compose was installed via curl today.

will-emmerson commented

Same problem for me, Linode with Ubuntu 18.04.2. Going back to version 1.23.0 fixed it, although I had to recreate the containers.


xcash commented May 13, 2019

It seems that after the latest update of Docker itself the problem is gone... can you (@hschalamonek @bitbuck3t @will-emmerson) check too?


Viktova commented May 20, 2019

@xcash I'm running Docker version 18.09.6, build 481bc77, and it's still happening.

Ubuntu 18.04.2 LTS x64 with everything up to date.

As a workaround: call a docker-compose command once, and after that issue your desired command.
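
In other words, a hypothetical session illustrating the workaround (not taken from the thread):

    docker-compose version    # the first invocation may hang for a while
    docker-compose up -d      # subsequent invocations then run normally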


ijc commented May 20, 2019

@xcash was your strace capture (from the original issue) captured with -fff to follow forks/clones?

If I strace locally (running 1.24.0), which doesn't hang (there's a brief pause, ~1s), I do notice:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f07acbaea10) = 16880
rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 0x7ffdb3f8c1f0, 8) = -1 EINVAL (Invalid argument)
rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 16880
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=16880, si_uid=1000, si_status=0, si_utime=60, si_stime=8} ---
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 8) = 0
rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 0x7ffdb3f8c1f0, 8) = -1 EINVAL (Invalid argument)
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f07acbe88e0}, 8) = 0
openat(AT_FDCWD, "/tmp/_MEI9VTJSw", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3

e.g. at the point you have a hang I have a child which is spawned and then exits.

I think the /tmp/ stuff is the self-extracting stuff from https://github.com/pyinstaller/pyinstaller.

The first exec looks to be some sort of self-exec, and it then execs /sbin/ldconfig twice. What I saw was (please excuse my ASCII art...):

+-- /tmp/dc.16879:execve("./docker-compose-Linux-x86_64", ["./docker-compose-Linux-x86_64", "version"], 0x7fff52d107c0 /* 56 vars */) = 0
  ` /tmp/dc.16879:clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f07acbaea10) = 16880
    `-- /tmp/dc.16880:execve("/home/ijc/development/Docker/compose/compose/docker-compose-Linux-x86_64", ["./docker-compose-Linux-x86_64", "version"], 0xfa0530 /* 58 vars */) = 0
     |
     +- /tmp/dc.16880:clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb8f4c91a10) = 16881
     |  `-- /tmp/dc.16881:execve("/sbin/ldconfig", ["/sbin/ldconfig", "-p"], 0x7fb8f425ef00 /* 2 vars */) = 0
     |
     `- /tmp/dc.16880:clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb8f4c91a10) = 16882
        `-- /tmp/dc.16882:execve("/sbin/ldconfig", ["/sbin/ldconfig", "-p"], 0x7fb8efbbb168 /* 2 vars */) = 0

I wonder if there is something which is slowing /sbin/ldconfig on your system unacceptably? e.g. an nfs or autofs mount in the paths it is hitting or something like that.

FWIW I was running strace -s 4096 -o /tmp/dc -fff ./docker-compose-Linux-x86_64 version. You might also want to add -tt to timestamp all the calls and see if there is an obviously slow bit.
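
i.e. the full suggested invocation would look something like this (same flags as above, with -tt added):

    # follow forks into per-pid files, capture long strings, timestamp every call
    strace -s 4096 -fff -tt -o /tmp/dc ./docker-compose-Linux-x86_64 version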


xcash commented May 20, 2019

@xcash I'm running Docker version 18.09.6, build 481bc77, and it's still happening.

Ubuntu 18.04.2 LTS x64 with everything up to date.

On that machine (the only one I had with that problem) it disappeared, and we're running the same version :(

# docker-compose version
docker-compose version 1.24.0, build 0aa59064

# docker version
Client:
 Version:           18.09.6
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        481bc77
 Built:             Sat May  4 02:35:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.6
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       481bc77
  Built:            Sat May  4 01:59:36 2019
  OS/Arch:          linux/amd64
  Experimental:     false


xcash commented May 20, 2019

@xcash was your strace capture (from the original issue) captured with -fff to follow forks/clones?

Sadly not.

e.g. at the point you have a hang I have a child which is spawned and then exits.

I think the /tmp/ stuff is the self-extracting stuff from https://github.com/pyinstaller/pyinstaller.

I did see the self-extraction stuff before that line, in the output I trimmed from the post.

I wonder if there is something which is slowing /sbin/ldconfig on your system unacceptably? e.g. an nfs or autofs mount in the paths it is hitting or something like that.

Considering it is a VPS, it's entirely possible something like that is going on behind the scenes.
I tried issuing a docker command (not compose) while docker-compose was hanging and it worked seamlessly, but it could have been an operation that didn't hit the filesystem.

FWIW I was running strace -s 4096 -o /tmp/dc -fff ./docker-compose-Linux-x86_64 version. You might also want to add -tt to timestamp all the calls and see if there is an obviously slow bit.

Should this happen again, I will use your strace options. :)


ijc commented May 21, 2019

Thanks @xcash

Considering it is a VPS it's entirely possible something like that behind the scenes.

FWIW, one thing I considered mentioning before is that some of the lower-tier virtual machines on AWS (and I presume other clouds) are scheduled in such a way that if they use all their "credit" for a given period they essentially hang (the whole VM is descheduled) until they build up sufficient credit again.

I think this is probably not the case for you, since only docker-compose is hanging, not the whole VM (AIUI, e.g. you see docker CLI commands working while in the hung state). Also, hangs of 114s like you are seeing are so close to triggering Linux's default soft-lockup check (120s) that I think someone would have mentioned it if that were relevant.

So I think this is almost certainly not relevant to this case (at least to your particular instance), but worth mentioning just in case.

Should this happen again i will use your strace options. :)

Thank you!

BTW I think running /sbin/ldconfig while you are in the "bad" state should also take a long time, which would be interesting to validate.
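
Something along these lines would confirm or rule that out (a generic check, not specific to compose):

    # time the same ldconfig invocation compose spawns, while docker-compose is in the hung state
    time /sbin/ldconfig -p > /dev/null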


will-emmerson commented May 21, 2019

I am still having the same problem with everything upgraded, but I can confirm that running docker-compose on its own first does seem to fix it.


ijc commented May 21, 2019

@will-emmerson please could you try collecting some logs with strace (see my reply to @xcash above for the exact invocation) while the problem is exhibiting itself?

will-emmerson commented

I ran the following command
strace -s 4096 -o /tmp/dc -fff -tt docker-compose --version

and it created these files:

3220120 May 21 10:34 dc.29714
8457740 May 21 10:34 dc.29717
  27423 May 21 10:33 dc.29718
  27423 May 21 10:33 dc.29719

What should I do with them?

I also ran strace -c docker-compose --version and it output this:

    docker-compose version 1.24.0, build 0aa59064
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    62.83   0.046038       46038         1           wait4
    20.47   0.014998          88       170           write
    9.71    0.007118          82        87           unlink
    2.79    0.002041           8       248           read
    0.94    0.000689           4       191           openat
    0.83    0.000609           3       231        98 stat
    0.61    0.000447          56         8           munmap
    0.36    0.000266          22        12           rmdir
    0.31    0.000225           1       191           close
    0.23    0.000166           7        24           getdents
    0.22    0.000160          13        12           mkdir
    0.21    0.000153         153         1           clone
    0.20    0.000144           1       192           fstat
    0.19    0.000140           2        87           fchmod
    0.07    0.000053           1        90           lseek
    0.02    0.000015           2         9           brk
    0.01    0.000010           2         6         2 rt_sigaction
    0.01    0.000007           0        17           mmap
    0.00    0.000000           0         7           mprotect
    0.00    0.000000           0         5         5 access
    0.00    0.000000           0         1           getpid
    0.00    0.000000           0         1           execve
    0.00    0.000000           0         1           readlink
    0.00    0.000000           0         1           arch_prctl
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.073279                  1593       105 total

I've never used strace before.


ijc commented May 21, 2019

@will-emmerson the easiest is probably to tar (or zip) them up and attach them here, thanks.
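
e.g. something like this (a hypothetical command; adjust the path if strace wrote the files elsewhere):

    # bundle the per-pid trace files into one archive
    tar czf dc-traces.tar.gz /tmp/dc.*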

will-emmerson commented

I'm not sure I want to upload a bunch of logs to a public place when I don't really know what's in them.


ijc commented May 21, 2019

@will-emmerson hrm, yes, that's fair. please zip/tar them and mail them to ijc (at) docker.com.

will-emmerson commented

I've sent them.


ijc commented May 21, 2019

Received, thanks.


ijc commented May 21, 2019

So your process tree is:

29714 `docker-compose --version`
`-- 29717 self exec of `docker-compose --version`
    `-- 29718 exec `/sbin/ldconfig` → exec `/sbin/ldconfig.real`
    `-- 29719 exec `/sbin/ldconfig` → exec `/sbin/ldconfig.real`

The two processes 29718 and 29719 execute fully in <1s. e.g. dc.29718 is

6678-will-emmerson$ sed -n -e '1p;$p' dc.29718
10:33:59.478903 set_robust_list(0x7fa0b3052a20, 24) = 0
10:33:59.483282 +++ exited with 0 +++
6678-will-emmerson$ sed -n -e '1p;$p' dc.29719
10:33:59.597443 set_robust_list(0x7fa0b3052a20, 24) = 0
10:33:59.602154 +++ exited with 0 +++

Which I think rules out something slowing ldconfig down, at least in your case.

It's the two main processes which take some time:

6678-will-emmerson$ sed -n -e '1p;$p' dc.29714
10:33:58.821291 execve("/usr/local/bin/docker-compose", ["docker-compose", "--version"], 0x7ffea9630aa8 /* 23 vars */) = 0
10:34:49.283590 +++ exited with 0 +++
6678-will-emmerson$ sed -n -e '1p;$p' dc.29717
10:33:59.145695 execve("/usr/local/bin/docker-compose", ["docker-compose", "--version"], 0xa6d630 /* 25 vars */) = 0
10:34:49.268734 +++ exited with 0 +++

In dc.29714 (the parent) there is a gap of ~48s after 10:34:01.454543:

10:33:59.145224 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd813295a10) = 29717
10:33:59.145455 rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
10:33:59.145495 rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 0x7ffd2ab7dcd0, 8) = -1 EINVAL (Invalid argument)
10:33:59.145521 rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
10:33:59.145545 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.163587 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.163639 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.279720 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.279774 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.305847 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.305904 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.333863 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.333915 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.396106 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.396157 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.454491 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.454543 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29717
10:34:49.268835 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29717, si_uid=0, si_status=0, si_utime=44, si_stime=6} ---
10:34:49.268861 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 8) = 0
10:34:49.268891 rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 0x7ffd2ab7dcd0, 8) = -1 EINVAL (Invalid argument)
10:34:49.268916 rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 8) = 0

Which seems to be the parent waiting for the child.

In the child (dc.29717) the gap is after 10:34:01.454593:

10:33:59.629565 openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3
10:33:59.629613 poll([{fd=3, events=POLLIN}], 1, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.163675 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.163690 restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.279809 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.279823 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.305939 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.305952 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.333951 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.333965 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.396192 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.396205 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.454579 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.454593 restart_syscall(<... resuming interrupted restart_syscall ...>) = 1
10:34:49.023121 close(3)                = 0

Aha! So the process is blocked in a poll syscall (poll([{fd=3, events=POLLIN}], 1, -1)) where the only passed fd is fd 3, which is a recently opened fd onto /dev/random (openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3).

The /dev/random device is well known for blocking when there is insufficient entropy on the system. And VPSs/cloud VMs are well known for having a general shortage of entropy.

There is no use of /dev/random in compose (outside of a single unrelated test case), nor is there in docker-py, so this must be in one of our dependencies.

Although I cannot reproduce the hang I can see the use of /dev/random in my own strace (my system must have plenty of entropy I guess).

The maddest bit is that after blocking while waiting for /dev/random to become readable, it immediately closes it and opens /dev/urandom (the non-blocking one) instead, and reads from that.

Well, thank you for those logs, they have at least given me a string to pull on and see where it takes me...
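
For anyone who wants to check whether their own system is entropy-starved, a couple of generic commands (not taken from the traces above) will show it:

    # the kernel's entropy estimate; very low values (tens or less) mean reads from /dev/random may block
    cat /proc/sys/kernel/random/entropy_avail

    # on a starved system this read blocks until entropy arrives...
    head -c 16 /dev/random > /dev/null
    # ...while /dev/urandom always returns immediately
    head -c 16 /dev/urandom > /dev/null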

will-emmerson commented

No problem, it makes sense as it's a new VPS with very little use. And installing haveged fixes the problem.


ijc commented May 21, 2019

Not long before the /dev/random dance I described I see references to cryptography.hazmat.bindings._openssl.so in the traces.

With that knowledge this comment seems awfully telling:
https://github.com/pyca/cryptography/blob/master/src/_cffi_src/openssl/src/osrandom_engine.c#L123...L127

Which in turn seems to derive from:
pyca/cryptography#4645
pyca/cryptography#4656

So I feel pretty confident in declaring that this isn't a compose issue as such, but some combination of Python's behaviour and a lack of entropy on some systems. Installing haveged or some other entropy daemon, or otherwise ensuring you have sufficient entropy, seems like the right answer.
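
On Ubuntu/Debian that boils down to something like the following (assuming the distro haveged package; any other entropy daemon works as well):

    sudo apt-get update
    sudo apt-get install haveged
    sudo systemctl enable --now haveged
    # the entropy estimate should now sit comfortably in the thousands
    cat /proc/sys/kernel/random/entropy_avail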

I'm going to close since this does not appear to be a bug in compose, thanks for your help in getting to the bottom of the issue.

FWIW https://lwn.net/Articles/693189/ is an interesting article on a related subject (I'm not sure whether or not this issue is the one being described there, but the symptoms are certainly similar).

ijc closed this as completed May 21, 2019

xcash commented May 22, 2019

@ijc congrats on the troubleshooting, Ian! :)
It makes sense even for my setup, as the problem disappeared after that VPS went into production and thus started generating entropy.


gurland commented Jun 6, 2019

Yeah! I had a value of 6 in entropy_avail; after installing haveged the number increased to ~2500, and now docker-compose commands run pretty fast.

Very strange issue. @ijc thank you for your investigation; it was so unexpected, but interesting to get to the bottom of.

sebastialonso commented

I don't understand the conclusion of this issue. At the end of it all, what can I do to stop my docker-compose commands from hanging??


Phuker commented Aug 30, 2019

I don't understand the conclusion of this issue. At the end of it all, what can I do to stop my docker-compose commands from hanging??

You can try my patch here to keep away from the entropy issue, pynacl, libsodium, paramiko and anything else that you don't need: #6552 (comment).


lonix1 commented Sep 29, 2019

If this was a problem for you, please upvote #6931 so we can get better docs, which would have avoided a lot of wasted time chasing down this problem.

sebastialonso commented

No, it didn't. It turns out my entropy was already high enough before installing that package. I installed it anyway, but everything stayed the same.


lazarcf commented Oct 10, 2019

Why should "docker-compose --help" require system entropy?


sebastialonso commented Oct 10, 2019 via email

sillycube commented

Any update on this issue?


xrisk commented Dec 13, 2019

Installing haveged fixed this problem for me.

CentOS 7
docker 19.03.5
docker-compose 1.25.0

jackharrhy commented

I encountered this issue but realized I was inside a Python 3.8 virtualenv; I ran docker-compose up -d outside of it and no longer had any weird stalling issues.

rahmatnazali commented

I want to also confirm that installing haveged fixed our problem.

Using Digital Ocean's Droplet, with:

  • ubuntu 18 LTS
  • docker 19.03.1
  • docker-compose 1.26.2
  • a Python-based project that probably depends heavily on entropy (SSL, etc.)

We used this article to install haveged: https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged

I hope this helps anyone just in case.


eddyg commented Sep 29, 2020

FWIW, installing jitterentropy-rngd on a small server running Ubuntu 20.04 (via apt install jitterentropy-rngd) resolved this problem for me.

Now cat /proc/sys/kernel/random/entropy_avail shows a reasonable amount of entropy available (even after a fresh boot) and the hangs from docker-compose --help or docker-compose --version are significantly reduced.

rogierlommers commented

I'm having exactly the same problem :(

negative0 commented

Same problem


eddyg commented Apr 18, 2021

Have you tried checking available kernel entropy? Does increasing entropy generation, as described above, help?

rogierlommers commented

I'll reiterate for @rogierlommers and @negative0

You installed docker via snap most likely. Uninstall via snap, install via apt-get. The snap version hangs.

Hey @maidmariandev, I used apt-get to install docker, not snap.
I installed the docker-compose binary manually:

sudo curl -L https://github.com/docker/compose/releases/download/1.29.1/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose

averypmc commented

Installing haveged fixed my issue also. Thanks!


mloska commented May 17, 2021

When using docker-compose on DigitalOcean, I noticed this behaviour after trying to connect to the host network. After removing it, docker-compose up runs without any problem.

banagale commented

While this is a system issue, should docker-compose do some reasonable check of available entropy and at least issue a warning?

As it is, running a compose command locks up the system. So even though it's not a bug in compose, trying to use compose is a killjoy.


dubitoergo commented Jun 22, 2021

Check your path? You can also create a symbolic link.

sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

Worked for me.
