Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runc running very slow #3464

Closed
ydcool opened this issue May 2, 2022 · 6 comments
Closed

runc running very slow #3464

ydcool opened this issue May 2, 2022 · 6 comments

Comments

@ydcool
Copy link

ydcool commented May 2, 2022

runc --systemd-cgroup run test takes 10 minutes long to execute, but no error or warning.

here're some system info:

root@compute20:~/dominic/testrunc# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc# uname -a
Linux compute20 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc# runc -v
runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev
root@compute20:~/dominic/testrunc#

I modified the source code of runc , added some trace codes, and got the go trace file:
block

and found the syscall were blocked:
image

I'm not familiar with the dbus system, does anybody knows what was happened on my OS? or how to debug more deeply?

@kolyshkin
Copy link
Contributor

What you found means runc waits for systemd (to provide a response over dbus, I guess).

Not sure why it's taking so long -- something is wrong with your systemd?

If there's a way to reproduce this from scratch -- please provide detailed steps.

@ydcool
Copy link
Author

ydcool commented May 4, 2022

@kolyshkin I haven't found the way to reproduce this, but this is the only server out of dozens in my k8s cluster that occurred twice in the past six months. I'll try to find it out.

@ydcool
Copy link
Author

ydcool commented May 14, 2022

update:
Today it happened again, we found there're too many multipathes (up to 12k) , while the multipathd using too many linux fd.
And there're lots of timeout error in dbus journal log. So I think it many be the root cause of slowly runc, but didn't found more
direct evidence that the leaked multipathes blocked the dbus.

As for the multipath leak, it's something wrong with the OpenStack running on that host.

root@compute20:~#
root@compute20:~# lsof -n 2>/dev/null |awk '{print $2}'|sort|uniq -c|sort -nr | head -n 10
1040438 147156
 279533 105889
  45087 61523
  39130 57905
  25595 2369946
   9398 61515
   5805 985191
   5805 3424373
   5805 3307072
   5805 3050402
root@compute20:~#
root@compute20:~# ps -ef | grep 147156
root      147156       1  0 Mar08 ?        13:31:31 /sbin/multipathd -d -s
root     3054681 2989740  0 16:08 pts/4    00:00:00 grep --color=auto 147156
root@compute20:~#
root@compute20:~#
root@compute20:~# multipath -ll | wc -l
8965
root@compute20:~#
root@compute20:~#
root@compute20:~# systemctl status dbus
● dbus.service - D-Bus System Message Bus
   Loaded: loaded (/lib/systemd/system/dbus.service; static; vendor preset: enabled)
   Active: active (running) since Tue 2022-03-08 22:19:15 CST; 2 months 5 days ago
     Docs: man:dbus-daemon(1)
 Main PID: 55898 (dbus-daemon)
    Tasks: 1 (limit: 599999)
   CGroup: /system.slice/dbus.service
           └─55898 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only

May 14 16:00:11 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:00:36 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:01 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:27 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:52 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:02:17 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:02:42 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:07 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:32 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:57 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
root@compute20:~#

After we cleaned up all fault multipathes, dbus syscall still blocking.

TBC...

@jasvinder1107
Copy link

Team,

Any progress on this ? We are seeing a similar issue occurring randomly on the production nodes.

Regards,
Jas

@cyphar
Copy link
Member

cyphar commented Aug 16, 2023

There were similar issues with systemd communication via dbus caused by the way runc was protecting the executable against attacks. #3931 should have fixed this issue -- you can test with that code to see if it resolves the issue.

@ydcool
Copy link
Author

ydcool commented Apr 4, 2024

There were similar issues with systemd communication via dbus caused by the way runc was protecting the executable against attacks. #3931 should have fixed this issue -- you can test with that code to see if it resolves the issue.

Thank you for your reply, I tried with this fix on another node which has similar issue, and seems works well. I'll close this issue.

@ydcool ydcool closed this as completed Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants