Segmentation fault (core dumped) #107
Hey, could you please compile this way:

```
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make
./src/nvtop 2> error.txt
```

and then post the content of the file error.txt |
I have a similar problem with the segfault on the latest master commit, where stderr didn't print anything. However, I do have short reports from systemd-coredump:
From the coredump file, we can see there might be a null pointer dereference at interface.c:1251
|
Same issue for me. On my device, this issue arises when a single process is using multiple GPUs. Here is the output of
Steps to reproduce:

```
$ ipython3
In [1]: import cupy as cp
In [2]: with cp.cuda.Device(0):
   ...:     x = cp.zeros((10000, 1000))
   ...:
In [3]: with cp.cuda.Device(1):
   ...:     y = cp.zeros((10000, 1000))
   ...:
```

If I reverse the order of step 1 and step 2, |
I think that the patch in the branch fix_segfault should do the trick. When writing this bit of code I assumed for some reason that there would always be at least one process running on the GPUs, which is not the case for a server. Could any of you tell me if the patch solves this issue? To test, you must check out the correct branch:

```
git pull
git checkout fix_segfault
# Build as usual
```

|
The issue still exists.
|
Can you please provide the error.txt output for this branch? |
|
Lines 1264 to 1269 in 7b0d8e5
I think this may be caused by the PID info cache for processes that are using multiple GPUs.
Lines 132 to 148 in 7b0d8e5
|
Thank you. Another patch has been pushed to fix this bug on the branch fix_segfault. |
I get the same stack overflow error when I simply add:

```c
if (IS_VALID(gpuinfo_process_user_name_valid,
             all_procs.processes[i].process->valid) &&
    all_procs.processes[i].process->user_name != NULL) {
  unsigned length = strlen(all_procs.processes[i].process->user_name);
  if (length > largest_username)
    largest_username = length;
}
```

|
It seems unrelated; it was in another part of the program, where the sprintf function could write outside of the buffer. |
Error of the third patch:
The output after adding:

```c
if (IS_VALID(gpuinfo_total_memory_valid, devices[i].dynamic_info.valid) &&
    IS_VALID(gpuinfo_process_gpu_memory_usage_valid,
             devices[i].processes[j].valid)) {
  float percentage =
      roundf(100.f * (float)devices[i].processes[j].gpu_memory_usage /
             (float)devices[i].dynamic_info.total_memory);
  devices[i].processes[j].gpu_memory_percentage = (unsigned)percentage;
  fprintf(stderr,
          "gpu_memory_usage=%llu total_memory=%llu percentage=%f "
          "gpu_memory_percentage=%u\n",
          devices[i].processes[j].gpu_memory_usage,
          devices[i].dynamic_info.total_memory, percentage,
          devices[i].processes[j].gpu_memory_percentage);
  assert(devices[i].processes[j].gpu_memory_percentage <= 100);
  SET_VALID(gpuinfo_process_gpu_memory_percentage_valid,
            devices[i].processes[j].valid);
}
```

|
Thanks for finding this one. Is that all there was? |
It works fine on my machine with driver version 430.64 / CUDA 10.1 on Ubuntu 16.04 LTS. |
All right, I merged the patches into master. Thanks a lot @XuehaiPan for your help fixing these and @TommyJerryMairo for providing the process dump. Take care |
Thanks guys for the help |
Hi, I'm installing nvtop on Ubuntu 18.04.5 LTS following the build instructions in this repo. The build went smoothly, with no warnings or errors.
But when trying to launch nvtop, I got the error:
Segmentation fault (core dumped)
Here is my nvidia-smi output:
Also, I did some testing and found out that the last working commit for me was 0ef51c9. From that point on I always get this error.
Is there anything I can do to help resolve this? Thanks ;)