src/solaris/native/sun/management/LinuxOperatingSystem.c #51


UncleNine

Fix the dead loop
Summary: if the /proc/stat mount point changes in a container environment, the while loop may lead to 100% CPU usage.

@CLAassistant

CLAassistant commented Feb 6, 2021

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


ganjianxuan seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@joeyleeeeeee97
Contributor

@UncleNine, Could you give a test to reproduce that? Thanks

@UncleNine
Author

UncleNine commented Feb 7, 2021

> @UncleNine, Could you give a test to reproduce that? Thanks

In my environment, it was caused by an lxcfs (< 4.0.0) bug, but the JDK code (sun/management/OperatingSystemImpl.getSystemCpuLoad) affects all JDKs.

Below is our test code:

demo code

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#define DEC_64 "%lu" /* matches the unsigned long variables below */
static void next_line(FILE *f) {
    while (fgetc(f) != '\n'); // The key point of the bug: EOF (returned at end of file or on a read error) is never checked
}
#define SET_VALUE_GDB 10
int main(int argc,char* argv[])
{
  unsigned long i =1;
  unsigned long j=0;
  FILE * f=NULL;
  FILE         *fh;
  unsigned long        userTicks, niceTicks, systemTicks, idleTicks;
  unsigned long        iowTicks = 0, irqTicks = 0, sirqTicks= 0;
  int             n;
  if((fh = fopen("/proc/stat", "r")) == NULL) {
    return -1;
  }
    n = fscanf(fh, "cpu " DEC_64 " " DEC_64 " " DEC_64 " " DEC_64 " " DEC_64 " "
                   DEC_64 " " DEC_64,
           &userTicks, &niceTicks, &systemTicks, &idleTicks,
           &iowTicks, &irqTicks, &sirqTicks);
    // Move to next line
    while(i!=SET_VALUE_GDB) // If the mount point does not change, execution stays in this outer loop; one core sits at close to 40% CPU
    {
       next_line(fh); // Once the mount point changes, reads fail with ENOTCONN and execution stays in the inner loop; one core goes to nearly 100% CPU
       j++;
    }
   fclose(fh);
   return 0;
}
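
For comparison, here is a minimal sketch of a `next_line` that cannot spin when the read fails. It is an illustration only and not necessarily the exact change in this pull request; the `int` return value is an assumption added here so the caller can tell that no newline was found.

```c
#include <stdio.h>

/*
 * Skip the rest of the current line.
 * Returns 0 when a newline was consumed, -1 on end of file or read error.
 * fgetc() returns EOF in both of those cases, so checking for EOF is
 * enough to guarantee the loop terminates.
 */
static int next_line(FILE *f) {
    int c;
    do {
        c = fgetc(f);
    } while (c != '\n' && c != EOF);
    return (c == '\n') ? 0 : -1;
}
```

A caller that gets -1 back can stop parsing and report a failure instead of retrying on the same stream.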

test

#gcc -g -o caq.o caq.c
At first, running ./caq.o on its own shows the following CPU usage:
628957 root      20   0    8468    612    484 R  32.5  0.0  18:40.92 caq.o 
CPU usage is around 32.5%.
At this point the mount information is as follows:
crash> mount -n 628957 |grep lxcfs
ffff88a5a2619800 ffff88a1ab25f800 fuse   lxcfs     /rootfs/proc/stat
ffff88cf53417000 ffff88a4dd622800 fuse   lxcfs     /rootfs/var/lib/baymax/var/lib/baymax/lxcfs
ffff88a272f8c600 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/cpuinfo
ffff88a257b28900 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/meminfo
ffff88a5aff40300 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/uptime
ffff88a3db2bd680 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/stat/proc/stat
ffff88a2836ba400 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/diskstats
ffff88bcb361b600 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/swaps
ffff88776e623480 ffff88a4dd622800 fuse   lxcfs     /rootfs/sys/devices/system/cpu/online
Because the fd for /proc/stat has not been closed, the program is running the outer loop. Now restart lxcfs, which remounts it:
#systemctl restart lxcfs
After the restart, the mount information is as follows:
crash> mount -n 628957 |grep lxcfs
ffff88a5a2619800 ffff88a1ab25f800 fuse   lxcfs     /rootfs/proc/stat
ffff88a3db2bd680 ffff88a4dd622800 fuse   lxcfs     /rootfs/proc/stat/proc/stat------------this mount point cannot be unmounted because the fd is still open; note that its super_block is the one from before the restart
ffff887795a8f600 ffff88a53b6c6800 fuse   lxcfs     /rootfs/var/lib/baymax/var/lib/baymax/lxcfs
ffff88a25472ae80 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/cpuinfo
ffff88cf75ff1e00 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/meminfo
ffff88a257b2ad00 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/uptime
ffff88cf798f0d80 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/stat/proc/stat/proc/stat--------mounted in bind mode, which creates a new /proc/stat
ffff88cf36ff2880 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/diskstats
ffff88cf798f1f80 ffff88a53b6c6800 fuse   lxcfs     /rootfs/proc/swaps
ffff88a53f295980 ffff88a53b6c6800 fuse   lxcfs     /rootfs/sys/devices/system/cpu/online
CPU usage immediately jumps to nearly 100%:
628957 root      20   0    8468    612    484 R  98.8  0.0  18:40.92 caq.o

java-side

If an application uses the API "sun/management/OperatingSystemImpl.getSystemCpuLoad", this problem may occur.

I found several common frameworks that use it: micrometer, elasticsearch

@sanhong

sanhong commented Feb 7, 2021

Hi @UncleNine ,
Thanks for raising this issue on Dragonwell. Is it reproducible on a general OpenJDK build? If so, I think it would be better to discuss this in the OpenJDK community.
