
Segmentation fault (core dumped) #107

Closed
lamhoangtung opened this issue May 24, 2021 · 16 comments · Fixed by #108

Comments

@lamhoangtung

Hi, I'm installing nvtop on Ubuntu 18.04.5 LTS following the build instructions in this repo. The build went smoothly, with no warnings or errors.

But when trying to launch nvtop, I got the error: Segmentation fault (core dumped)

Here is my nvidia-smi output:

Mon May 24 04:28:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    39W / 250W |   3025MiB / 16280MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I also did some testing and found that the last working commit for me was 0ef51c9. From that point on I always get this error.

Is there anything I can do to help resolve this? Thanks ;)

@Syllo
Owner

Syllo commented May 24, 2021

Hey,

Could you please compile this way:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make
./src/nvtop 2> error.txt

and then post the contents of the file error.txt?

@TommyJerryMairo

I have a similar problem with a segfault on the latest master commit, where stderr prints nothing. However, I do have a short report from systemd-coredump:

[tjm@ArchPad tmp]$ coredumpctl info 235492
           PID: 235492 (nvtop)
           UID: 1000 (tjm)
           GID: 100 (users)
        Signal: 11 (SEGV)
     Timestamp: Mon 2021-05-24 05:38:30 PDT (42s ago)
  Command Line: nvtop
    Executable: /usr/bin/nvtop
 Control Group: /user.slice/user-1000.slice/session-3.scope
          Unit: session-3.scope
         Slice: user-1000.slice
       Session: 3
     Owner UID: 1000 (tjm)
       Boot ID: 3da7ccc2c46b4b619eeb7cf45882b3b8
    Machine ID: ffd680d0906946c29ee244fc8114ae2c
      Hostname: ArchPad
       Storage: /var/lib/systemd/coredump/core.nvtop.1000.3da7ccc2c46b4b619eeb7cf45882b3b8.235492.1621859910000000.zst (present)
     Disk Size: 79.1K
       Message: Process 235492 (nvtop) of user 1000 dumped core.
                
                Stack trace of thread 235492:
                #0  0x0000557be2f8d882 draw_processes (nvtop + 0x7882)
                #1  0x0000557be2f8a5cf main (nvtop + 0x45cf)
                #2  0x00007efddff30b25 __libc_start_main (libc.so.6 + 0x27b25)
                #3  0x0000557be2f8a8ee _start (nvtop + 0x48ee)
[tjm@ArchPad tmp]$ 

From the coredump file, we can see there is likely a null-pointer dereference at interface.c:1251:

[tjm@ArchPad tmp]$ gdb -q /usr/bin/nvtop core.nvtop.235492 
Reading symbols from /usr/bin/nvtop...
[New LWP 235492]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `nvtop'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000557be2f8d882 in draw_processes (interface=0x557be33e1080, devices=<optimized out>, devices_count=1)
    at /home/tjm/.cache/pikaur/build/nvtop-git/src/nvtop-git/src/interface.c:1251
1251	      all_procs.processes[interface->process.selected_row].process->pid;
(gdb) 
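The crash site suggests that selected_row can point at an entry whose process pointer was never populated. A minimal sketch of a guarded lookup that avoids the dereference (hypothetical struct and field names, not nvtop's actual types):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the crash pattern: a row's process pointer may
 * be NULL, e.g. when no process is running on any GPU. */
struct proc_info { int pid; };
struct proc_row  { struct proc_info *process; };

/* Guarded lookup: return -1 instead of dereferencing a NULL pointer
 * or indexing past the end of the array. */
static int selected_pid(const struct proc_row *rows, size_t count,
                        size_t selected_row) {
  if (selected_row >= count || rows[selected_row].process == NULL)
    return -1;
  return rows[selected_row].process->pid;
}
```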

@XuehaiPan
Contributor

XuehaiPan commented May 24, 2021

Same issue for me. On my machine, the issue arises when a single process is using multiple GPUs.

Here is the output of nvtop 2> debug.txt:

ASAN:DEADLYSIGNAL
=================================================================
==75378==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fa5a3bf77c6 bp 0x7ffdd51dfc80 sp 0x7ffdd51df3f8 T0)
    #0 0x7fa5a3bf77c5 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5)
    #1 0x7fa5a447febb  (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x3cebb)
    #2 0x40f09c in draw_processes /home/panxuehai/Projects/nvtop/src/interface.c:1257
    #3 0x411dea in draw_gpu_info_ncurses /home/panxuehai/Projects/nvtop/src/interface.c:1685
    #4 0x404e3d in main /home/panxuehai/Projects/nvtop/src/nvtop.c:341
    #5 0x7fa5a3b8c83f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #6 0x403c18 in _start (/home/panxuehai/Projects/nvtop/build/src/nvtop+0x403c18)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5) in strlen
==75378==ABORTING

Steps to reproduce:

  1. Run nvtop first.

  2. Run the following Python code:

$ ipython3
In [1]: import cupy as cp

In [2]: with cp.cuda.Device(0):
   ...:     x = cp.zeros((10000, 1000))
   ...:     

In [3]: with cp.cuda.Device(1):
   ...:     y = cp.zeros((10000, 1000))
   ...:     

If I reverse the order of step 1 and step 2, nvtop will run as expected.

@Syllo
Owner

Syllo commented May 24, 2021

I think that the patch in the branch fix_segfault should do the trick.

When writing this bit of code I assumed for some reason that there would always be at least one process running on the GPUs, which is not the case on a server.

Could any of you tell me if the patch solves this issue?

To test, check out the correct branch:

git pull
git checkout fix_segfault
# Build as usual

@XuehaiPan
Contributor

> Could any of you tell me if the patch solves this issue?

The issue still exists.

@Syllo
Owner

Syllo commented May 24, 2021

Can you please provide the error.txt output for this branch?

@XuehaiPan
Contributor

../src/interface.c:1266:25: runtime error: null pointer passed as argument 1, which is declared to never be null
AddressSanitizer:DEADLYSIGNAL
=================================================================
==29805==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f164627b7c6 bp 0x7ffe12c16ca0 sp 0x7ffe12c16448 T0)
==29805==The signal is caused by a READ memory access.
==29805==Hint: address points to the zero page.
    #0 0x7f164627b7c6 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6)
    #1 0x7f1647475cdc  (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x3fcdc)
    #2 0x41c740 in draw_processes ../src/interface.c:1266
    #3 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
    #4 0x407693 in main ../src/nvtop.c:341
    #5 0x7f164621083f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #6 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6) in strlen
==29805==ABORTING

@XuehaiPan
Contributor

strlen gets a NULL pointer for all_procs.processes[i].process->user_name here:

nvtop/src/interface.c

Lines 1264 to 1269 in 7b0d8e5

if (IS_VALID(gpuinfo_process_user_name_valid,
             all_procs.processes[i].process->valid)) {
  unsigned length = strlen(all_procs.processes[i].process->user_name);
  if (length > largest_username)
    largest_username = length;
}

I think this may be caused by the PID info cache for processes that are using multiple GPUs.

nvtop/src/extract_gpuinfo.c

Lines 132 to 148 in 7b0d8e5

pid_t current_pid = devices[i].processes[j].pid;
process_info_cache *cached_pid_info;

HASH_FIND_PID(cached_process_info, &current_pid, cached_pid_info);
if (!cached_pid_info) {
  // Newly encountered pid
  cached_pid_info = malloc(sizeof(*cached_pid_info));
  cached_pid_info->pid = current_pid;
  get_username_from_pid(current_pid, &cached_pid_info->user_name);
  get_command_from_pid(current_pid, &cached_pid_info->cmdline);
  cached_pid_info->last_total_consumed_cpu_time = -1.;
} else {
  // Already encountered so delete from cached list to avoid freeing
  // memory at the end of this function
  HASH_DEL(cached_process_info, cached_pid_info);
}
HASH_ADD_PID(updated_process_info, cached_pid_info);
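If an entry is moved from the cached list to the updated list on first touch, a second touch of the same PID in the same pass (the same process on another GPU) misses in the cached list and allocates a duplicate. A toy sketch of one way to avoid that (invented names, not the actual nvtop patch): check the updated set before treating a PID as new.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the PID caches: an entry moves from `cached` to
 * `updated` the first time its pid is seen in a refresh pass. */
#define MAX_ENTRIES 16

struct toy_cache { int pids[MAX_ENTRIES]; size_t count; };

static bool cache_contains(const struct toy_cache *c, int pid) {
  for (size_t i = 0; i < c->count; ++i)
    if (c->pids[i] == pid) return true;
  return false;
}

static void cache_add(struct toy_cache *c, int pid) {
  c->pids[c->count++] = pid;
}

/* Touch a pid during a refresh pass. Checking `updated` first means the
 * same pid seen on a second GPU reuses the existing entry instead of
 * creating a duplicate. Returns true only when a new entry is created. */
static bool touch_pid(struct toy_cache *cached, struct toy_cache *updated,
                      int pid) {
  if (cache_contains(updated, pid))
    return false;            /* already handled earlier in this pass */
  if (!cache_contains(cached, pid)) {
    cache_add(updated, pid); /* newly encountered pid */
    return true;
  }
  /* move from cached to updated (removal from cached omitted here) */
  cache_add(updated, pid);
  return false;
}
```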

@Syllo
Owner

Syllo commented May 24, 2021

Thank you.

Another patch has been pushed to fix this bug on the branch fix_segfault.

@XuehaiPan
Contributor

XuehaiPan commented May 24, 2021

=================================================================
==88252==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffeb142b2af at pc 0x7f5ef83ca1e6 bp 0x7ffeb142b060 sp 0x7ffeb142a810
WRITE of size 7 at 0x7ffeb142b2af thread T0
    #0 0x7f5ef83ca1e5 in __interceptor_vsnprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5)
    #1 0x7f5ef83ca3ee in __interceptor_snprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x603ee)
    #2 0x41aa61 in print_processes_on_screen ../src/interface.c:1168
    #3 0x41c900 in draw_processes ../src/interface.c:1273
    #4 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
    #5 0x407693 in main ../src/nvtop.c:341
    #6 0x7f5ef714483f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #7 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)

Address 0x7ffeb142b2af is located in stack of thread T0
SUMMARY: AddressSanitizer: dynamic-stack-buffer-overflow (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5) in __interceptor_vsnprintf
Shadow bytes around the buggy address:
  0x10005627d600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d630: ca ca ca ca 00 02 cb cb cb cb cb cb 00 00 00 00
  0x10005627d640: ca ca ca ca 07 cb cb cb cb cb cb cb 00 00 00 00
=>0x10005627d650: ca ca ca ca 00[07]cb cb cb cb cb cb 00 00 00 00
  0x10005627d660: ca ca ca ca 04 cb cb cb cb cb cb cb 00 00 00 00
  0x10005627d670: ca ca ca ca 00 cb cb cb cb cb cb cb 00 00 00 00
  0x10005627d680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==88252==ABORTING

With the newest patch, I get a dynamic-stack-buffer-overflow error when user_name is not NULL.

I get the same stack-overflow error when I simply add a user_name != NULL check to the previous patch:

if (IS_VALID(gpuinfo_process_user_name_valid,
              all_procs.processes[i].process->valid) &&
    all_procs.processes[i].process->user_name != NULL)
{
  unsigned length = strlen(all_procs.processes[i].process->user_name);
  if (length > largest_username)
    largest_username = length;
}

@Syllo
Owner

Syllo commented May 24, 2021

That one seems unrelated; it was in another part of the program, where sprintf could write past the end of the buffer.
Yet another patch is available.
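The class of fix here can be illustrated with a minimal sketch (hypothetical helper and buffer size, not nvtop's actual code): snprintf is told the buffer's capacity and truncates, instead of writing past the end the way an unbounded sprintf can.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a (possibly long) username into a fixed-size column buffer.
 * snprintf writes at most `cap` bytes, including the terminating NUL,
 * so a long name is truncated rather than overflowing the buffer. */
static size_t put_username(char *dst, size_t cap, const char *name) {
  snprintf(dst, cap, "%s", name);
  return strlen(dst);  /* length actually stored, after truncation */
}
```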

@XuehaiPan
Contributor

Error from the third patch:

nvtop: ../src/extract_gpuinfo.c:200: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.

The output after adding a fprintf before the assertion:

if (IS_VALID(gpuinfo_total_memory_valid, devices[i].dynamic_info.valid) &&
    IS_VALID(gpuinfo_process_gpu_memory_usage_valid,
              devices[i].processes[j].valid)) {
  float percentage =
      roundf(100.f * (float)devices[i].processes[j].gpu_memory_usage /
              (float)devices[i].dynamic_info.total_memory);
  devices[i].processes[j].gpu_memory_percentage = (unsigned)percentage;
  fprintf(stderr,
          "gpu_memory_usage=%llu  total_memory=%llu  percentage=%f  gpu_memory_percentage=%llu\n",
          devices[i].processes[j].gpu_memory_usage, devices[i].dynamic_info.total_memory,
          percentage, devices[i].processes[j].gpu_memory_percentage);
  assert(devices[i].processes[j].gpu_memory_percentage <= 100);
  SET_VALID(gpuinfo_process_gpu_memory_percentage_valid,
            devices[i].processes[j].valid);
}
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=7961837568  total_memory=11554717696  percentage=69.000000  gpu_memory_percentage=69
gpu_memory_usage=6277824512  total_memory=11554717696  percentage=54.000000  gpu_memory_percentage=54
gpu_memory_usage=1081081856  total_memory=11554717696  percentage=9.000000  gpu_memory_percentage=9
gpu_memory_usage=13744632839234567870  total_memory=11554717696  percentage=118952566784.000000  gpu_memory_percentage=2988449792
nvtop: ../src/extract_gpuinfo.c:204: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.

@Syllo
Owner

Syllo commented May 25, 2021

Thanks for finding this one.
I copied the struct definition from the header, and there was only one version, so I assumed backward compatibility.
I double-checked the other functions, and the types remained the same.

Is that all there was?
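The garbage gpu_memory_usage value above is consistent with a struct-layout mismatch: if the driver fills a struct with one layout and the program reads the same bytes through another, fields come back from the wrong offsets. A minimal illustration (invented struct and field names, not the real driver structs):

```c
#include <assert.h>
#include <string.h>

/* Two hypothetical versions of the same API struct with the fields in a
 * different order. Reading v1-filled memory through the v2 layout pulls
 * each field from the wrong offset. */
struct usage_v1 { unsigned long long used;  unsigned long long total; };
struct usage_v2 { unsigned long long total; unsigned long long used;  };

static unsigned long long read_used_as_v2(const struct usage_v1 *filled) {
  struct usage_v2 view;
  memcpy(&view, filled, sizeof view);  /* reinterpret the same bytes */
  return view.used;                    /* actually v1's `total` field */
}
```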

@XuehaiPan
Contributor

It works fine on my machine with driver version 430.64 / CUDA 10.1 on Ubuntu 16.04 LTS.

@Syllo
Owner

Syllo commented May 25, 2021

All right, I merged the patches into master.

Thanks a lot @XuehaiPan for your help fixing these and @TommyJerryMairo for providing the process dump.

Take care

@Syllo Syllo closed this as completed May 25, 2021
@lamhoangtung
Author

Thanks u guys for the help
