Static mesh refinement and GLMMHD fail on Delta GPU A40s #10
Comments
Nothing should evolve in this case, correct? So |
Does this happen with MPI or serial? |
Also, if this happens for glmmhd but not for euler, could you try to comment out the call to the dedner source term in if (hydro_pkg->Param<Fluid>("fluid") == Fluid::glmmhd) {
hydro_pkg->Param<GLMMHD::SourceFun_t>("glmmhd_source")(md, beta_dt);
} which is currently still called, even for |
Yes, nothing should evolve, it should remain static. I've only seen this on the A40s on Delta GPU, although I don't have access to any other A40s. The A40s on Delta GPU are PCIe cards, I've checked A100 PCIe cards but couldn't reproduce the issue.
Compiled with MPI but run on a single rank. I haven't seen this on any other architecture. A100s work fine for this problem.
Very good idea, unfortunately it doesn't prevent the error. However, I did go back and test this with Euler for a much longer time period. Eventually Euler hits this error as well, so it's not GLMMHD related as I originally thought. The larger data volume with GLMMHD might have led to the error sooner. I'm not convinced that this is an error with AthenaPK and not the A40s on Delta GPU. Since we're not tied down to using the A40s on Delta, we don't have to investigate this now. I'd like to see if this error appears on other A40s before we spend too much time on this. |
Out of curiosity, have you tried to compile plain parthenon on the A40 and stumbled across parthenon-hpc-lab/parthenon#659 (comment)? Another interesting (low effort) test would be to run AthenaPK with This would be an interesting second data point, but I agree, I'm not too concerned about this now (because when this shows up it means |
I've run into a bug on Delta GPU while trying out the A40s using static mesh refinement with GLMMHD. The code runs into negative densities in the conserved-to-primitive step even when fluid evolution and all source terms are disabled. The failure occurs at different times and on different blocks.
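Since nothing should evolve in this setup, one way to narrow down where the corruption first appears might be to scan the conserved density on every block after each step and to compare it against a saved copy of the initial state. The following is only a minimal host-side sketch of that idea, not AthenaPK or Parthenon code: `BlockDensity`, `BadCell`, `FindBadDensities`, and `FirstDifference` are hypothetical names, and the flat `std::vector<double>` stands in for whatever per-block data layout the code actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for one meshblock's conserved density field.
struct BlockDensity {
  int block_id;             // illustrative block identifier
  std::vector<double> rho;  // conserved density, one value per cell
};

// Location and value of a cell whose density is not strictly positive.
struct BadCell {
  int block_id;
  std::size_t cell;
  double value;
};

// Scan every block and collect cells with negative, zero, or NaN density.
std::vector<BadCell> FindBadDensities(const std::vector<BlockDensity> &blocks) {
  std::vector<BadCell> bad;
  for (const auto &b : blocks) {
    for (std::size_t i = 0; i < b.rho.size(); ++i) {
      // !(x > 0.0) is true for negative values, zero, and NaN.
      if (!(b.rho[i] > 0.0)) {
        bad.push_back({b.block_id, i, b.rho[i]});
      }
    }
  }
  return bad;
}

// Return the index of the first cell whose value changed between two
// snapshots of the same field, or -1 if all cells compare equal.
long FirstDifference(const std::vector<double> &before,
                     const std::vector<double> &after) {
  assert(before.size() == after.size());
  for (std::size_t i = 0; i < before.size(); ++i) {
    if (before[i] != after[i]) return static_cast<long>(i);
  }
  return -1;
}
```

Logging the block id and cell index of the first hit from several failing runs could show whether the bad cells cluster near the refinement boundary or are scattered at random, which would help separate a code bug from a hardware or driver problem.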
Example Failure
Using this parameter file with a linear wave:
glmmd_smr_linear_wave3d.txt
More outputs:
out_1.txt
out_2.txt
out_3.txt
With this environment (loading the anaconda3_gpu, openmpi/4.1.4, cuda/11.7.0, and hdf5 modules).
Smaller meshblocks seem to increase the frequency of the error. In the above example I used 8x8x8 meshblocks and it never ran to completion; using 16x16x16 meshblocks it failed 1 out of 5 times. Changing the fluid method to euler or removing the static mesh refinement zone seems to avoid the issue, although it might just reduce the likelihood of the issue.
The problem occurs with a RelWithDebInfo build and not with a Debug build, making this tricky to debug.