Thread safety issue when using the SIDM module with OPENMP enabled #6

Open
appy2806 opened this issue Oct 11, 2024 · 0 comments

I am running into an issue when running the SIDM implementation of Gizmo with OPENMP enabled. The issue seems to occur only on the Flatiron Institute's cluster (Chayward noted the same thing for the older hydro implementation of Gizmo). Runs on Frontera look reasonably okay.

When I start a simulation from new ICs or restart from a snapshot with the SIDM flag turned on and OPENMP enabled, some of the particles with a non-zero cross-section acquire unphysically large velocities at very early sync-points. This is not caused by division by zero or near-zero values, or by NaNs. Regular CDM + hydro works completely fine. SIDM also works fine if I simply don't use OPENMP or set OMP_NUM_THREADS=1.
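One way to pin down when the bad velocities first appear is to scan the particle velocities after each sync-point and count those above a threshold. The sketch below is a hypothetical helper, not part of Gizmo: the name `count_fast_particles`, the flat `vel` layout (three components per particle), and the `vmax` threshold are all assumptions made for illustration.

```c
/* Hypothetical diagnostic, not part of Gizmo.
 * vel holds 3*n doubles: (vx, vy, vz) for each particle.
 * Returns how many particles have a speed above vmax,
 * or a NaN velocity component. */
long count_fast_particles(const double *vel, long n, double vmax)
{
    long count = 0;
    for (long i = 0; i < n; i++) {
        double v2 = 0.0;
        for (int k = 0; k < 3; k++)
            v2 += vel[3 * i + k] * vel[3 * i + k];
        /* Written as !(<=) so that a NaN speed is counted too. */
        if (!(v2 <= vmax * vmax))
            count++;
    }
    return count;
}
```

Comparing this count at the same sync-point between a run with OMP_NUM_THREADS=1 and a multi-threaded run would show how early the two diverge.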

I have tried different optimization flags and module options, but the problem persists whenever SIDM runs with OPENMP.

Looking at the code a bit, the OpenMP parallelization seems to happen at a very high level (I'm having a hard time seeing exactly where through all the layers of macros). But it appears that at least everything inside AGSForce_evaluate() needs to be thread safe, and that function calls a lot of code.

Inside sidm_core_flux_computation.h (included inside AGSForce_evaluate()), it looks like "#pragma omp atomic" is being used to protect all the updates of P[j].Vel[k] and similar. I assume this is because multiple threads may be updating the same particle. But elsewhere in AGSForce_evaluate(), there are references to the same P[j], including P[j].Vel[k].

This doesn't look particularly safe: if one thread is updating P[j], even with "#pragma omp atomic", while another thread is reading it from another code location, that's a race (or at least a source of non-deterministic results).
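To make the concern concrete, here is a minimal sketch in C. It is not taken from the Gizmo source: the `sketch_particle` struct, both function names, the ring pairing `j = (i + 1) % n`, and the `eps` factor are hypothetical stand-ins. The first function mirrors the pattern described above (atomic writes to `Vel` mixed with plain reads of the same field in one parallel loop). The second shows one possible way to avoid it, by accumulating kicks in a scratch buffer and applying them after the loop. Whether that restructuring fits AGSForce_evaluate() is an open question.

```c
/* Hypothetical stand-in for a particle; NOT the real Gizmo struct. */
typedef struct { double Vel[3]; } sketch_particle;

/* Problem pattern (assumed): the writes are atomic, but the reads of
 * Vel are unprotected, so a thread may see a neighbour either before
 * or after another thread has kicked it. */
void kick_in_place(sketch_particle *P, long n, double eps)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        long j = (i + 1) % n;   /* made-up pairing rule */
        for (int k = 0; k < 3; k++) {
            double dv = eps * (P[j].Vel[k] - P[i].Vel[k]); /* racy reads */
            #pragma omp atomic
            P[i].Vel[k] += dv;
            #pragma omp atomic
            P[j].Vel[k] -= dv;
        }
    }
}

/* One possible alternative: Vel is only read during the pair loop, and
 * kicks go into a caller-provided scratch buffer dVel (3*n doubles). */
void kick_deferred(sketch_particle *P, double *dVel, long n, double eps)
{
    #pragma omp parallel for
    for (long i = 0; i < 3 * n; i++)
        dVel[i] = 0.0;

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        long j = (i + 1) % n;
        for (int k = 0; k < 3; k++) {
            double dv = eps * (P[j].Vel[k] - P[i].Vel[k]); /* Vel not written here */
            #pragma omp atomic
            dVel[3 * i + k] += dv;
            #pragma omp atomic
            dVel[3 * j + k] -= dv;
        }
    }

    /* Apply the accumulated kicks once all pairs have been processed. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        for (int k = 0; k < 3; k++)
            P[i].Vel[k] += dVel[3 * i + k];
}
```

In the deferred version every `dv` is computed from the same pre-kick velocities regardless of thread timing. The floating-point summation order inside `dVel` can still vary between runs in general, so this removes the read/write race but does not by itself guarantee bitwise-identical results.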

As for why this doesn't seem to cause an issue, or why things look reasonably correct, on Frontera/Stampede and other platforms:

I think it is common for thread safety issues to appear inconsistently across platforms, almost always because the platforms' different performance characteristics change the relative order of operations between threads.

Please let me know if I can help in any way; I am happy to re-run tests with OPENMP enabled.
