Relates to: #1
I'm trying to fix this issue on the live demo server. The tracker container restarts every 2 hours because of the healthcheck. I'm still trying to figure out what is happening. However, I've noticed a lot of zombie processes. This may or may not be related to the periodic restart.
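To see what the healthcheck is actually doing, a first step could be to inspect the configured healthcheck and its recent probe results. This is just a sketch; the container name tracker is taken from the docker ps output below:
# Show the configured healthcheck command, interval, timeout and retries
docker inspect --format '{{json .Config.Healthcheck}}' tracker
# Show the current health status and the last few probe results
docker inspect --format '{{json .State.Health}}' tracker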
A few minutes after restarting the server, you can already see a lot of zombie processes.
This is the server 3 hours after restarting the tracker container:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ef72b037bf26 nginx:mainline-alpine "/docker-entrypoint.…" 3 hours ago Up 3 hours 0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp proxy
d7618a22d425 torrust/index-gui:develop "/usr/local/bin/entr…" 3 hours ago Up 3 hours (unhealthy) 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp index-gui
3f34f41514bb torrust/index:develop "/usr/local/bin/entr…" 3 hours ago Up 3 hours (healthy) 0.0.0.0:3001->3001/tcp, :::3001->3001/tcp index
e938bf65ea02 torrust/tracker:develop "/usr/local/bin/entr…" 3 hours ago Up 3 hours (unhealthy) 0.0.0.0:1212->1212/tcp, :::1212->1212/tcp, 0.0.0.0:7070->7070/tcp, :::7070->7070/tcp, 1313/tcp, 0.0.0.0:6969->6969/udp, :::6969->6969/udp tracker
As you can see, the tracker is unhealthy. Running top gives this output:
top - 15:06:45 up 21:41, 1 user, load average: 9.53, 10.18, 10.44
Tasks: 212 total, 4 running, 121 sleeping, 0 stopped, 87 zombie
%Cpu(s): 3.0 us, 90.8 sy, 0.0 ni, 0.0 id, 1.0 wa, 0.0 hi, 4.6 si, 0.7 st
MiB Mem : 957.4 total, 80.0 free, 834.5 used, 43.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 25.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
601160 root 20 0 1022516 2644 0 R 41.0 0.3 0:10.81 snapd
85 root 20 0 0 0 0 S 17.7 0.0 172:28.45 kswapd0
13 root 20 0 0 0 0 S 7.5 0.0 23:28.68 ksoftirqd/0
709 root 20 0 1546016 38040 0 S 4.3 3.9 100:53.91 dockerd
494855 torrust 20 0 573052 12776 0 S 4.3 1.3 9:06.97 torrust-index
601209 root 20 0 724908 5716 0 R 3.9 0.6 0:01.20 node
494706 torrust 20 0 815552 538000 0 S 3.6 54.9 21:01.63 torrust-tracker
655 root 20 0 1357240 18052 0 S 3.3 1.8 13:14.69 containerd
494683 root 20 0 719640 3568 0 S 3.3 0.4 13:31.96 containerd-shim
601255 root 20 0 1237648 2796 0 S 2.6 0.3 0:00.17 runc
There are 87 zombie processes here, but I've seen more in other cases. That output was taken when the server was already too busy swapping. Before reaching that point, you get output like this:
top -U torrust
top - 14:59:08 up 21:33, 1 user, load average: 13.99, 13.41, 11.21
Tasks: 184 total, 5 running, 116 sleeping, 0 stopped, 63 zombie
%Cpu(s): 14.6 us, 73.5 sy, 0.0 ni, 0.0 id, 4.6 wa, 0.0 hi, 7.0 si, 0.3 st
MiB Mem : 957.4 total, 79.9 free, 823.5 used, 54.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 30.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
494706 torrust 20 0 815552 538000 0 S 7.6 54.9 20:33.65 torrust-tracker
495006 torrust 20 0 21.0g 30616 0 S 0.3 3.1 0:34.54 node
599470 torrust 20 0 11040 3136 2244 R 0.3 0.3 0:00.07 top
598211 torrust 20 0 17068 2580 856 S 0.0 0.3 0:00.29 systemd
598212 torrust 20 0 169404 4000 0 S 0.0 0.4 0:00.00 (sd-pam)
598290 torrust 20 0 17224 3100 548 S 0.0 0.3 0:01.17 sshd
598291 torrust 20 0 9980 4656 2108 S 0.0 0.5 0:00.83 bash
You can see how the number of zombie processes has increased.
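To measure how fast the zombies accumulate (and whether the rate matches the healthcheck interval), a periodic count is enough. A minimal sketch using standard ps and watch:
# Print the number of processes in state Z (zombie) once a minute
watch -n 60 'ps -eo stat | grep -c "^Z"'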
I have also listed the zombie processes:
ps aux | grep Z
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 589711 0.1 0.0 0 0 ? Z 14:19 0:04 [node] <defunct>
root 589967 0.0 0.0 0 0 ? Z 14:23 0:00 [node] <defunct>
root 590976 0.0 0.0 0 0 ? Z 14:25 0:00 [node] <defunct>
root 591046 0.0 0.0 0 0 ? Z 14:25 0:00 [node] <defunct>
root 591115 0.0 0.0 0 0 ? Z 14:26 0:00 [node] <defunct>
root 591182 0.0 0.0 0 0 ? Z 14:26 0:00 [node] <defunct>
root 591231 0.0 0.0 0 0 ? Z 14:26 0:00 [http_health_che] <defunct>
root 591255 0.0 0.0 0 0 ? Z 14:26 0:00 [node] <defunct>
root 591360 0.0 0.0 0 0 ? Z 14:27 0:00 [node] <defunct>
root 591644 0.0 0.0 0 0 ? Z 14:28 0:00 [node] <defunct>
root 591727 0.0 0.0 0 0 ? Z 14:28 0:00 [node] <defunct>
root 591867 0.0 0.0 0 0 ? Z 14:28 0:00 [node] <defunct>
root 591938 0.0 0.0 0 0 ? Z 14:29 0:00 [node] <defunct>
root 592020 0.0 0.0 0 0 ? Z 14:29 0:01 [node] <defunct>
root 592103 0.0 0.0 0 0 ? Z 14:29 0:00 [node] <defunct>
root 592455 0.0 0.0 0 0 ? Z 14:30 0:00 [node] <defunct>
root 592528 0.0 0.0 0 0 ? Z 14:30 0:00 [node] <defunct>
root 593183 0.0 0.0 0 0 ? Z 14:32 0:00 [node] <defunct>
root 593263 0.0 0.0 0 0 ? Z 14:33 0:00 [node] <defunct>
root 593704 0.0 0.0 0 0 ? Z 14:34 0:00 [node] <defunct>
root 593777 0.0 0.0 0 0 ? Z 14:34 0:00 [node] <defunct>
root 594501 0.0 0.0 0 0 ? Z 14:36 0:00 [node] <defunct>
root 594891 0.0 0.0 0 0 ? Z 14:37 0:00 [node] <defunct>
root 595260 0.0 0.0 0 0 ? Z 14:39 0:00 [node] <defunct>
root 595404 0.0 0.0 0 0 ? Z 14:39 0:00 [node] <defunct>
root 595494 0.0 0.0 0 0 ? Z 14:39 0:00 [node] <defunct>
root 595563 0.0 0.0 0 0 ? Z 14:40 0:00 [node] <defunct>
root 595641 0.0 0.0 0 0 ? Z 14:40 0:00 [node] <defunct>
root 595664 0.0 0.0 0 0 ? Z 14:40 0:00 [http_health_che] <defunct>
root 595708 0.0 0.0 0 0 ? Z 14:40 0:01 [node] <defunct>
root 595782 0.1 0.0 0 0 ? Z 14:40 0:01 [node] <defunct>
root 595856 0.0 0.0 0 0 ? Z 14:41 0:00 [node] <defunct>
root 595928 0.0 0.0 0 0 ? Z 14:41 0:00 [node] <defunct>
root 595999 0.0 0.0 0 0 ? Z 14:41 0:00 [node] <defunct>
root 596068 0.0 0.0 0 0 ? Z 14:42 0:00 [node] <defunct>
root 596135 0.1 0.0 0 0 ? Z 14:42 0:01 [node] <defunct>
root 596207 0.0 0.0 0 0 ? Z 14:42 0:01 [node] <defunct>
root 596278 0.1 0.0 0 0 ? Z 14:43 0:01 [node] <defunct>
root 596323 0.0 0.0 0 0 ? Z 14:43 0:00 [health_check] <defunct>
root 596325 0.0 0.0 0 0 ? Z 14:43 0:00 [http_health_che] <defunct>
root 596350 0.1 0.0 0 0 ? Z 14:43 0:01 [node] <defunct>
root 596421 0.0 0.0 0 0 ? Z 14:44 0:00 [node] <defunct>
root 596488 0.0 0.0 0 0 ? Z 14:44 0:00 [node] <defunct>
root 596555 0.0 0.0 0 0 ? Z 14:44 0:00 [node] <defunct>
root 596693 0.0 0.0 0 0 ? Z 14:45 0:00 [node] <defunct>
root 596761 0.0 0.0 0 0 ? Z 14:45 0:00 [node] <defunct>
root 596833 0.1 0.0 0 0 ? Z 14:45 0:01 [node] <defunct>
root 596911 0.3 0.0 0 0 ? Z 14:46 0:03 [node] <defunct>
root 597029 0.1 0.0 0 0 ? Z 14:47 0:01 [node] <defunct>
root 597099 0.1 0.0 0 0 ? Z 14:47 0:00 [node] <defunct>
root 597164 0.1 0.0 0 0 ? Z 14:47 0:00 [node] <defunct>
root 597234 0.1 0.0 0 0 ? Z 14:47 0:01 [node] <defunct>
root 597302 0.2 0.0 0 0 ? Z 14:48 0:01 [node] <defunct>
root 597375 0.0 0.0 0 0 ? Z 14:48 0:00 [node] <defunct>
root 597443 0.1 0.0 0 0 ? Z 14:49 0:00 [node] <defunct>
root 597475 0.0 0.0 0 0 ? Z 14:49 0:00 [health_check] <defunct>
root 597493 0.4 0.0 0 0 ? Z 14:49 0:03 [node] <defunct>
root 597567 0.3 0.0 0 0 ? Z 14:50 0:02 [node] <defunct>
root 597620 0.2 0.0 0 0 ? Z 14:51 0:01 [node] <defunct>
root 597693 0.2 0.0 0 0 ? Z 14:51 0:01 [node] <defunct>
root 597735 0.0 0.0 0 0 ? Z 14:51 0:00 [health_check] <defunct>
root 597742 0.0 0.0 0 0 ? Z 14:51 0:00 [http_health_che] <defunct>
root 597762 0.1 0.0 0 0 ? Z 14:52 0:00 [node] <defunct>
root 599434 2.3 0.0 0 0 ? Z 14:58 0:01 [node] <defunct>
root 599527 0.2 0.0 0 0 ? Z 14:59 0:00 [health_check] <defunct>
root 599593 1.3 0.0 0 0 ? Z 14:59 0:00 [node] <defunct>
root 599745 3.6 0.0 0 0 ? Z 14:59 0:00 [node] <defunct>
Those processes are children of the main torrust-tracker, torrust-index, and index-gui (node) processes.
ps -eo pid,ppid,state,command | grep Z
589711 495006 Z [node] <defunct>
589967 495006 Z [node] <defunct>
590976 495006 Z [node] <defunct>
591046 495006 Z [node] <defunct>
591115 495006 Z [node] <defunct>
591182 495006 Z [node] <defunct>
591231 494706 Z [http_health_che] <defunct>
591255 495006 Z [node] <defunct>
591360 495006 Z [node] <defunct>
591644 495006 Z [node] <defunct>
591727 495006 Z [node] <defunct>
591867 495006 Z [node] <defunct>
591938 495006 Z [node] <defunct>
592020 495006 Z [node] <defunct>
592103 495006 Z [node] <defunct>
592455 495006 Z [node] <defunct>
592528 495006 Z [node] <defunct>
593183 495006 Z [node] <defunct>
593263 495006 Z [node] <defunct>
593704 495006 Z [node] <defunct>
593777 495006 Z [node] <defunct>
594501 495006 Z [node] <defunct>
594891 495006 Z [node] <defunct>
595260 495006 Z [node] <defunct>
595404 495006 Z [node] <defunct>
595494 495006 Z [node] <defunct>
595563 495006 Z [node] <defunct>
595641 495006 Z [node] <defunct>
595664 494706 Z [http_health_che] <defunct>
595708 495006 Z [node] <defunct>
595782 495006 Z [node] <defunct>
595856 495006 Z [node] <defunct>
595928 495006 Z [node] <defunct>
595999 495006 Z [node] <defunct>
596068 495006 Z [node] <defunct>
596135 495006 Z [node] <defunct>
596207 495006 Z [node] <defunct>
596278 495006 Z [node] <defunct>
596323 494855 Z [health_check] <defunct>
596325 494706 Z [http_health_che] <defunct>
596350 495006 Z [node] <defunct>
596421 495006 Z [node] <defunct>
596488 495006 Z [node] <defunct>
596555 495006 Z [node] <defunct>
596693 495006 Z [node] <defunct>
596761 495006 Z [node] <defunct>
596833 495006 Z [node] <defunct>
596911 495006 Z [node] <defunct>
597029 495006 Z [node] <defunct>
597099 495006 Z [node] <defunct>
597164 495006 Z [node] <defunct>
597234 495006 Z [node] <defunct>
597302 495006 Z [node] <defunct>
597375 495006 Z [node] <defunct>
597443 495006 Z [node] <defunct>
597475 494855 Z [health_check] <defunct>
597493 495006 Z [node] <defunct>
597567 495006 Z [node] <defunct>
597620 495006 Z [node] <defunct>
597693 495006 Z [node] <defunct>
597735 494855 Z [health_check] <defunct>
597742 494706 Z [http_health_che] <defunct>
597762 495006 Z [node] <defunct>
599434 495006 Z [node] <defunct>
599527 494855 Z [health_check] <defunct>
599593 495006 Z [node] <defunct>
599745 495006 Z [node] <defunct>
599872 495006 Z [node] <defunct>
These are the parent processes:
ps -o pid,ppid,cmd -p 494706
PID PPID CMD
494706 494683 /usr/bin/torrust-tracker
ps -o pid,ppid,cmd -p 494855
PID PPID CMD
494855 494833 /usr/bin/torrust-index
ps -o pid,ppid,cmd -p 495006
PID PPID CMD
495006 494983 /nodejs/bin/node /app/.output/server/index.mjs
In the past, we had a similar problem:
And it was solved by adding timeouts. That could be the reason for the healthcheck zombies, but I'm not sure. The bigger problem, though, seems to be with the node webserver (495006 494983 /nodejs/bin/node /app/.output/server/index.mjs). I guess the webserver is launching child processes to handle requests, but they are not finishing correctly.
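Since a zombie is a child that has already exited but has not yet been reaped with wait() by its parent, grouping the zombies by parent PID shows which service is leaving them behind. A small sketch using ps and awk (the PIDs are the parent processes listed above):
# Count zombie processes per parent PID
ps -eo ppid,stat | awk '$2 ~ /^Z/ {count[$1]++} END {for (p in count) print p, count[p]}'
# Show which process each of those parents is
ps -o pid,ppid,cmd -p 494706,494855,495006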