
Segmentation fault (core dumped) #107

Closed
lamhoangtung opened this issue May 24, 2021 · 16 comments · Fixed by #108

Comments

@lamhoangtung

Hi, I'm installing nvtop on Ubuntu 18.04.5 LTS following the build instructions in this repo. The build went smoothly, with no warnings or errors.

But when trying to launch nvtop, I got the error: Segmentation fault (core dumped)

Here is my nvidia-smi output:

Mon May 24 04:28:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    39W / 250W |   3025MiB / 16280MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I also did some testing and found that the last working commit for me was 0ef51c9. From that point on I always get this error.

Is there anything I can do to help resolve this? Thanks ;)

@Syllo
Owner

Syllo commented May 24, 2021

Hey,

Could you please compile this way:

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make
./src/nvtop 2> error.txt

and then post the contents of the file error.txt?

@TommyJerryMairo

I have a similar problem with a segfault on the latest master commit, where stderr prints nothing. However, I do have a short report from systemd-coredump:

[tjm@ArchPad tmp]$ coredumpctl info 235492
           PID: 235492 (nvtop)
           UID: 1000 (tjm)
           GID: 100 (users)
        Signal: 11 (SEGV)
     Timestamp: Mon 2021-05-24 05:38:30 PDT (42s ago)
  Command Line: nvtop
    Executable: /usr/bin/nvtop
 Control Group: /user.slice/user-1000.slice/session-3.scope
          Unit: session-3.scope
         Slice: user-1000.slice
       Session: 3
     Owner UID: 1000 (tjm)
       Boot ID: 3da7ccc2c46b4b619eeb7cf45882b3b8
    Machine ID: ffd680d0906946c29ee244fc8114ae2c
      Hostname: ArchPad
       Storage: /var/lib/systemd/coredump/core.nvtop.1000.3da7ccc2c46b4b619eeb7cf45882b3b8.235492.1621859910000000.zst (present)
     Disk Size: 79.1K
       Message: Process 235492 (nvtop) of user 1000 dumped core.
                
                Stack trace of thread 235492:
                #0  0x0000557be2f8d882 draw_processes (nvtop + 0x7882)
                #1  0x0000557be2f8a5cf main (nvtop + 0x45cf)
                #2  0x00007efddff30b25 __libc_start_main (libc.so.6 + 0x27b25)
                #3  0x0000557be2f8a8ee _start (nvtop + 0x48ee)
[tjm@ArchPad tmp]$ 

From the coredump file, we can see there is likely a null-pointer dereference at interface.c:1251:

[tjm@ArchPad tmp]$ gdb -q /usr/bin/nvtop core.nvtop.235492 
Reading symbols from /usr/bin/nvtop...
[New LWP 235492]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `nvtop'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000557be2f8d882 in draw_processes (interface=0x557be33e1080, devices=<optimized out>, devices_count=1)
    at /home/tjm/.cache/pikaur/build/nvtop-git/src/nvtop-git/src/interface.c:1251
1251	      all_procs.processes[interface->process.selected_row].process->pid;
(gdb) 
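The crash site suggests that selected_row can point at an entry whose process pointer was never populated. A minimal sketch of a guarded lookup that avoids the dereference (hypothetical struct and field names, not nvtop's actual types):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the crash pattern: a row's process pointer may
 * be NULL, e.g. when no process is running on any GPU. */
struct proc_info { int pid; };
struct proc_row  { struct proc_info *process; };

/* Guarded lookup: return -1 instead of dereferencing a NULL pointer
 * or indexing past the end of the array. */
static int selected_pid(const struct proc_row *rows, size_t count,
                        size_t selected_row) {
  if (selected_row >= count || rows[selected_row].process == NULL)
    return -1;
  return rows[selected_row].process->pid;
}
```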

@XuehaiPan
Contributor

XuehaiPan commented May 24, 2021

Same issue for me. On my machine, the issue arises when a single process is using multiple GPUs.

Here is the output of nvtop 2> debug.txt:

ASAN:DEADLYSIGNAL
=================================================================
==75378==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7fa5a3bf77c6 bp 0x7ffdd51dfc80 sp 0x7ffdd51df3f8 T0)
    #0 0x7fa5a3bf77c5 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5)
    #1 0x7fa5a447febb  (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x3cebb)
    #2 0x40f09c in draw_processes /home/panxuehai/Projects/nvtop/src/interface.c:1257
    #3 0x411dea in draw_gpu_info_ncurses /home/panxuehai/Projects/nvtop/src/interface.c:1685
    #4 0x404e3d in main /home/panxuehai/Projects/nvtop/src/nvtop.c:341
    #5 0x7fa5a3b8c83f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #6 0x403c18 in _start (/home/panxuehai/Projects/nvtop/build/src/nvtop+0x403c18)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c5) in strlen
==75378==ABORTING

Steps to reproduce:

  1. Run nvtop first.

  2. Run the following Python code:

$ ipython3
In [1]: import cupy as cp

In [2]: with cp.cuda.Device(0):
   ...:     x = cp.zeros((10000, 1000))
   ...:     

In [3]: with cp.cuda.Device(1):
   ...:     y = cp.zeros((10000, 1000))
   ...:     

If I reverse the order of step 1 and step 2, nvtop will run as expected.

@Syllo
Owner

Syllo commented May 24, 2021

I think that the patch in the branch fix_segfault should do the trick.

When writing this bit of code I assumed for some reason that there would always be at least one process running on the GPUs, which is not the case on a server.

Could any of you tell me if the patch solves this issue?

To test, check out the correct branch:

git pull
git checkout fix_segfault
# Build as usual

@XuehaiPan
Contributor

> Could any of you tell me if the patch solves this issue?

The issue still exists.

@Syllo
Owner

Syllo commented May 24, 2021

Can you please provide the error.txt output for this branch?

@XuehaiPan
Contributor

../src/interface.c:1266:25: runtime error: null pointer passed as argument 1, which is declared to never be null
AddressSanitizer:DEADLYSIGNAL
=================================================================
==29805==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f164627b7c6 bp 0x7ffe12c16ca0 sp 0x7ffe12c16448 T0)
==29805==The signal is caused by a READ memory access.
==29805==Hint: address points to the zero page.
    #0 0x7f164627b7c6 in strlen (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6)
    #1 0x7f1647475cdc  (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x3fcdc)
    #2 0x41c740 in draw_processes ../src/interface.c:1266
    #3 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
    #4 0x407693 in main ../src/nvtop.c:341
    #5 0x7f164621083f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #6 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/lib/x86_64-linux-gnu/libc.so.6+0x8b7c6) in strlen
==29805==ABORTING

@XuehaiPan
Contributor

strlen gets a NULL pointer for all_procs.processes[i].process->user_name here:

nvtop/src/interface.c

Lines 1264 to 1269 in 7b0d8e5

if (IS_VALID(gpuinfo_process_user_name_valid,
             all_procs.processes[i].process->valid)) {
  unsigned length = strlen(all_procs.processes[i].process->user_name);
  if (length > largest_username)
    largest_username = length;
}

I think this may be caused by the PID info cache for processes that are using multiple GPUs.

nvtop/src/extract_gpuinfo.c

Lines 132 to 148 in 7b0d8e5

pid_t current_pid = devices[i].processes[j].pid;
process_info_cache *cached_pid_info;

HASH_FIND_PID(cached_process_info, &current_pid, cached_pid_info);
if (!cached_pid_info) {
  // Newly encountered pid
  cached_pid_info = malloc(sizeof(*cached_pid_info));
  cached_pid_info->pid = current_pid;
  get_username_from_pid(current_pid, &cached_pid_info->user_name);
  get_command_from_pid(current_pid, &cached_pid_info->cmdline);
  cached_pid_info->last_total_consumed_cpu_time = -1.;
} else {
  // Already encountered so delete from cached list to avoid freeing
  // memory at the end of this function
  HASH_DEL(cached_process_info, cached_pid_info);
}
HASH_ADD_PID(updated_process_info, cached_pid_info);
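If an entry is moved from the cached list to the updated list on first touch, a second touch of the same PID in the same pass (the same process on another GPU) misses in the cached list and allocates a duplicate. A toy sketch of one way to avoid that (invented names, not the actual nvtop patch): check the updated set before treating a PID as new.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the PID caches: an entry moves from `cached` to
 * `updated` the first time its pid is seen in a refresh pass. */
#define MAX_ENTRIES 16

struct toy_cache { int pids[MAX_ENTRIES]; size_t count; };

static bool cache_contains(const struct toy_cache *c, int pid) {
  for (size_t i = 0; i < c->count; ++i)
    if (c->pids[i] == pid) return true;
  return false;
}

static void cache_add(struct toy_cache *c, int pid) {
  c->pids[c->count++] = pid;
}

/* Touch a pid during a refresh pass. Checking `updated` first means the
 * same pid seen on a second GPU reuses the existing entry instead of
 * creating a duplicate. Returns true only when a new entry is created. */
static bool touch_pid(struct toy_cache *cached, struct toy_cache *updated,
                      int pid) {
  if (cache_contains(updated, pid))
    return false;            /* already handled earlier in this pass */
  if (!cache_contains(cached, pid)) {
    cache_add(updated, pid); /* newly encountered pid */
    return true;
  }
  /* move from cached to updated (removal from cached omitted here) */
  cache_add(updated, pid);
  return false;
}
```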

@Syllo
Owner

Syllo commented May 24, 2021

Thank you.

Another patch has been pushed to fix this bug on the branch fix_segfault.

@XuehaiPan
Contributor

XuehaiPan commented May 24, 2021

=================================================================
==88252==ERROR: AddressSanitizer: dynamic-stack-buffer-overflow on address 0x7ffeb142b2af at pc 0x7f5ef83ca1e6 bp 0x7ffeb142b060 sp 0x7ffeb142a810
WRITE of size 7 at 0x7ffeb142b2af thread T0
    #0 0x7f5ef83ca1e5 in __interceptor_vsnprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5)
    #1 0x7f5ef83ca3ee in __interceptor_snprintf (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x603ee)
    #2 0x41aa61 in print_processes_on_screen ../src/interface.c:1168
    #3 0x41c900 in draw_processes ../src/interface.c:1273
    #4 0x42257c in draw_gpu_info_ncurses ../src/interface.c:1694
    #5 0x407693 in main ../src/nvtop.c:341
    #6 0x7f5ef714483f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
    #7 0x405848 in _start (/home/panxuehai/Projects/nvtop/cmake-build-debug/src/nvtop+0x405848)

Address 0x7ffeb142b2af is located in stack of thread T0
SUMMARY: AddressSanitizer: dynamic-stack-buffer-overflow (/home/panxuehai/.linuxbrew/lib/gcc/11/libasan.so.6+0x601e5) in __interceptor_vsnprintf
Shadow bytes around the buggy address:
  0x10005627d600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d630: ca ca ca ca 00 02 cb cb cb cb cb cb 00 00 00 00
  0x10005627d640: ca ca ca ca 07 cb cb cb cb cb cb cb 00 00 00 00
=>0x10005627d650: ca ca ca ca 00[07]cb cb cb cb cb cb 00 00 00 00
  0x10005627d660: ca ca ca ca 04 cb cb cb cb cb cb cb 00 00 00 00
  0x10005627d670: ca ca ca ca 00 cb cb cb cb cb cb cb 00 00 00 00
  0x10005627d680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005627d6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==88252==ABORTING

With the newest patch, I get a dynamic-stack-buffer-overflow error when user_name is not NULL.

I get the same stack-overflow error when I simply add a user_name != NULL check to the previous patch:

if (IS_VALID(gpuinfo_process_user_name_valid,
              all_procs.processes[i].process->valid) &&
    all_procs.processes[i].process->user_name != NULL)
{
  unsigned length = strlen(all_procs.processes[i].process->user_name);
  if (length > largest_username)
    largest_username = length;
}

@Syllo
Owner

Syllo commented May 24, 2021

That one seems unrelated; it was in another part of the program, where sprintf could write past the end of the buffer.
Yet another patch is available.
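The class of fix here can be illustrated with a minimal sketch (hypothetical helper and buffer size, not nvtop's actual code): snprintf is told the buffer's capacity and truncates, instead of writing past the end the way an unbounded sprintf can.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a (possibly long) username into a fixed-size column buffer.
 * snprintf writes at most `cap` bytes, including the terminating NUL,
 * so a long name is truncated rather than overflowing the buffer. */
static size_t put_username(char *dst, size_t cap, const char *name) {
  snprintf(dst, cap, "%s", name);
  return strlen(dst);  /* length actually stored, after truncation */
}
```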

@XuehaiPan
Contributor

Error from the third patch:

nvtop: ../src/extract_gpuinfo.c:200: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.

The output after adding a fprintf before the assertion:

if (IS_VALID(gpuinfo_total_memory_valid, devices[i].dynamic_info.valid) &&
    IS_VALID(gpuinfo_process_gpu_memory_usage_valid,
              devices[i].processes[j].valid)) {
  float percentage =
      roundf(100.f * (float)devices[i].processes[j].gpu_memory_usage /
              (float)devices[i].dynamic_info.total_memory);
  devices[i].processes[j].gpu_memory_percentage = (unsigned)percentage;
  fprintf(stderr,
          "gpu_memory_usage=%llu  total_memory=%llu  percentage=%f  gpu_memory_percentage=%llu\n",
          devices[i].processes[j].gpu_memory_usage, devices[i].dynamic_info.total_memory,
          percentage, devices[i].processes[j].gpu_memory_percentage);
  assert(devices[i].processes[j].gpu_memory_percentage <= 100);
  SET_VALID(gpuinfo_process_gpu_memory_percentage_valid,
            devices[i].processes[j].valid);
}
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=10585374720  total_memory=11554717696  percentage=92.000000  gpu_memory_percentage=92
gpu_memory_usage=7961837568  total_memory=11554717696  percentage=69.000000  gpu_memory_percentage=69
gpu_memory_usage=6277824512  total_memory=11554717696  percentage=54.000000  gpu_memory_percentage=54
gpu_memory_usage=1081081856  total_memory=11554717696  percentage=9.000000  gpu_memory_percentage=9
gpu_memory_usage=13744632839234567870  total_memory=11554717696  percentage=118952566784.000000  gpu_memory_percentage=2988449792
nvtop: ../src/extract_gpuinfo.c:204: gpuinfo_populate_process_infos: Assertion `devices[i].processes[j].gpu_memory_percentage <= 100' failed.

@Syllo
Owner

Syllo commented May 25, 2021

Thanks for finding this one.
I copied the struct definition from the header, and there was only one version, so I assumed backward compatibility.
I double-checked the other functions, and the types remained the same.

Is that all there was?
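The garbage gpu_memory_usage value above is consistent with a struct-layout mismatch: if the driver fills a struct with one layout and the program reads the same bytes through another, fields come back from the wrong offsets. A minimal illustration (invented struct and field names, not the real driver structs):

```c
#include <assert.h>
#include <string.h>

/* Two hypothetical versions of the same API struct with the fields in a
 * different order. Reading v1-filled memory through the v2 layout pulls
 * each field from the wrong offset. */
struct usage_v1 { unsigned long long used;  unsigned long long total; };
struct usage_v2 { unsigned long long total; unsigned long long used;  };

static unsigned long long read_used_as_v2(const struct usage_v1 *filled) {
  struct usage_v2 view;
  memcpy(&view, filled, sizeof view);  /* reinterpret the same bytes */
  return view.used;                    /* actually v1's `total` field */
}
```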

@XuehaiPan
Contributor

It works fine on my machine with driver version 430.64 / CUDA 10.1 on Ubuntu 16.04 LTS.

@Syllo
Owner

Syllo commented May 25, 2021

All right, I merged the patches into master.

Thanks a lot @XuehaiPan for your help fixing these and @TommyJerryMairo for providing the process dump.

Take care

@Syllo Syllo closed this as completed May 25, 2021
@lamhoangtung
Author

Thanks u guys for the help
