Shared context test crashes on psm2 when it's built as a DL provider #3282
Comments
This may be of interest:
No. It's the user space driver interface for psm.
It's weird that the segfault happened in the psm code path while the provider being used was psm2.
Maybe OpenMPI is calling psm code directly, resulting in a double use of the library?
But it is not an OpenMPI test. The code is part of the library destructor, which is called when the program exits, so the code being called is probably just part of the normal finalization process. Anyway, a segfault inside getenv is quite unusual: either the getenv symbol points to an invalid address, or memory has been corrupted somewhere.
Sorry - I need to stop following so many threads at once. Yeah, we may be looking at a memory corruption from somewhere.
I think I have found the cause. Under certain conditions (which the fi_shared_ctx test satisfies), the provider calls putenv() to add an environment variable that automatically turns on PSM2 multi-EP support. This works fine when the provider is built in. However, if the provider is dynamically loaded, the environment becomes corrupted after the provider is unloaded.
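For anyone following along, here is a minimal sketch of why putenv() and unloading can interact badly. POSIX putenv() makes the caller's string itself part of the environment rather than copying it, so if that string lives in a shared object that is later unloaded, the environment is presumably left with a dangling pointer. setenv() copies its arguments instead. The helper names and the DEMO_* variables below are hypothetical and are not taken from the libfabric source; this is an illustration of the mechanism, not the provider's actual code or fix.

```c
#define _XOPEN_SOURCE 600   /* for putenv(), setenv(), unsetenv() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): returns 1 if the value that
 * getenv() reports for `name` lives inside the caller's buffer. */
static int env_points_into(const char *name, const char *buf, size_t len)
{
    const char *val = getenv(name);
    uintptr_t v = (uintptr_t)val, b = (uintptr_t)buf;

    return val != NULL && v >= b && v < b + len;
}

/* Hypothetical helper: set name=value without leaving a pointer to
 * caller-owned memory in the environment. setenv() copies both strings;
 * overwrite = 0 preserves a value the user has already exported. */
static int set_env_copied(const char *name, const char *value)
{
    assert(name != NULL && value != NULL);
    return setenv(name, value, 0);
}

/* Returns 0 if putenv() aliases the caller's buffer and setenv() does not. */
static int putenv_vs_setenv_demo(void)
{
    /* Stands in for a string stored in a loadable provider's data segment. */
    static char entry[] = "DEMO_PUTENV=1";
    char value[] = "1";

    if (putenv(entry) != 0)
        return -1;
    /* The environment now points straight into `entry`. */
    if (!env_points_into("DEMO_PUTENV", entry, sizeof(entry)))
        return -2;
    /* Remove the entry while the buffer is still valid. */
    unsetenv("DEMO_PUTENV");

    unsetenv("DEMO_SETENV");
    if (set_env_copied("DEMO_SETENV", value) != 0)
        return -3;
    value[0] = '0';   /* modifying the source must not affect the environment */
    if (strcmp(getenv("DEMO_SETENV"), "1") != 0)
        return -4;
    if (env_points_into("DEMO_SETENV", value, sizeof(value)))
        return -5;
    return 0;
}
```

Under these assumptions, a dynamically loaded provider could either use setenv() so nothing in the environment refers to its own memory, or remove the entry with unsetenv() before it is unloaded. Which of the two fits the psm2 provider best is a judgment call for the maintainers.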
Test: fi_shared_ctx -p "psm2"
I see the following crash on psm2 provider when it's built as a separate dynamic library:
The issue is present in v1.5.0 as well.