Shared context test crashes on psm2 when it's built as a DL provider #3282

Closed
arn314 opened this issue Sep 13, 2017 · 7 comments · Fixed by #3291

arn314 commented Sep 13, 2017

Test: fi_shared_ctx -p "psm2"

I see the following crash with the psm2 provider when it is built as a separate dynamic library:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6a9d0ed in getenv () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6a9d0ed in getenv () from /lib64/libc.so.6
#1  0x00007ffff63cb470 in fini_ipath_backtrace () from /lib64/libinfinipath.so.4
#2  0x00007ffff7deab7a in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff6a9da69 in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007ffff6a9dab5 in exit () from /lib64/libc.so.6
#5  0x00007ffff6a86c0c in __libc_start_main () from /lib64/libc.so.6
#6  0x0000000000401759 in _start ()

The issue is present in v1.5.0 as well.

Member

shefty commented Sep 13, 2017

This may be of interest:
https://bugzilla.redhat.com/show_bug.cgi?id=1344529
Is this verbs related code?

Contributor

j-xiong commented Sep 13, 2017

No. It's the user space driver interface for psm.

Contributor

j-xiong commented Sep 13, 2017

It's weird that the segfault happened in the psm code path while the provider being used was psm2.

Member

shefty commented Sep 13, 2017

Maybe OpenMPI is calling psm code directly, resulting in a double use of the library?

Contributor

j-xiong commented Sep 13, 2017

But it is not an OpenMPI test.

The code is part of the library destructor, which runs when the program exits, so seeing it in the backtrace is probably just how the finalization process works. Still, a segfault inside getenv is quite unusual: either the getenv symbol points to an invalid address or memory has been corrupted somewhere.

Member

shefty commented Sep 13, 2017

Sorry - I need to stop following so many threads at once. Yeah, we may be looking at memory corruption from somewhere.

Contributor

j-xiong commented Sep 15, 2017

I think I have found the reason. Under certain conditions (which the fi_shared_ctx test satisfies), the provider calls putenv() to add an environment variable that automatically turns on PSM2 multi-EP support. This works fine when the provider is built in. However, if the provider is dynamically loaded, the environment becomes corrupted after the provider is unloaded.
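
For readers unfamiliar with why putenv() can do this, here is a minimal sketch of the general mechanism. It is not the provider's actual code. POSIX specifies that putenv() makes the caller's string itself part of the environment, whereas setenv() stores a copy. If the string passed to putenv() lives in memory owned by a dynamically loaded library (an assumption here; the thread does not say where the provider's string was stored), the environment is left holding a dangling pointer once the library is unloaded, and a later getenv() may fault. The variable names DEMO_PUTENV_VAR and DEMO_SETENV_VAR and both function names are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* expose putenv() and setenv() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string stored in a loadable library's data. */
static char demo_entry[] = "DEMO_PUTENV_VAR=1";

/* Returns 1 if the environment aliases demo_entry after putenv(). */
int putenv_aliases_caller_buffer(void)
{
    const char *val;

    if (putenv(demo_entry) != 0)
        return -1;

    /* Change the caller-owned buffer after the call: "...=1" becomes "...=2". */
    demo_entry[strlen(demo_entry) - 1] = '2';

    /* The environment sees the change, so it points at our buffer. */
    val = getenv("DEMO_PUTENV_VAR");
    return val != NULL && strcmp(val, "2") == 0;
}

/* Returns 1 if setenv() stored its own copy of the value. */
int setenv_copies_value(void)
{
    char value[] = "1";
    const char *val;

    if (setenv("DEMO_SETENV_VAR", value, 1) != 0)
        return -1;

    /* Changing the caller's buffer afterwards must not affect the environment. */
    value[0] = '2';

    val = getenv("DEMO_SETENV_VAR");
    return val != NULL && strcmp(val, "1") == 0;
}

/* Convenience check that exercises both behaviors. */
void check_env_behavior(void)
{
    assert(putenv_aliases_caller_buffer() == 1);
    assert(setenv_copies_value() == 1);
}
```

Under this reading, the aliasing is harmless for a built-in provider because the string stays mapped for the life of the process. Using setenv(), or passing putenv() a string that outlives the library, would avoid the dangling pointer. Whether the eventual fix took either approach is not stated in this thread.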
