Static mesh refinement and GLMMHD fail on Delta GPU A40s #10

Closed
forrestglines opened this issue Nov 16, 2022 · 5 comments · Fixed by parthenon-hpc-lab/parthenon#800

I've run into a bug on Delta GPU while trying out the A40s with static mesh refinement and GLMMHD. The code runs into negative densities in the conserved-to-primitive step even when fluid evolution and all source terms are disabled. The failure occurs at different times and on different blocks.

Example Failure
cycle=28 time=1.3124988050904851e-01 dt=4.6874957324660183e-03 zone-cycles/wsec_step=7.51e+06 wsec_total=1.01e+00 wsec_step=3.87e-02
### PARTHENON ERROR
  Condition:   u_d > 0.0 || density_floor_ > 0.0
  Message:     Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.
  File:        /u/glines/code/athenapk-project/athenapk/src/eos/adiabatic_glmmhd.cpp
  Line number: 80
[the same PARTHENON ERROR block is printed seven more times]
:0: : block: [1096,0,0], thread: [0,2,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,3,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,14,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,15,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,146,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,147,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,158,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,159,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
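
For reference, the check tripped at adiabatic_glmmhd.cpp:80 is a device-side positivity assertion. Below is a minimal, self-contained sketch of that logic, reconstructed only from the logged condition and message; the real code lives inside a Kokkos kernel and uses a PARTHENON_REQUIRE-style macro, and the helper function here is hypothetical.

#include <cassert>
#include <cstdio>

// Hypothetical standalone sketch of the failing check, based only on the
// logged condition and message above; not copied from adiabatic_glmmhd.cpp.
static double apply_density_floor(double u_d, double density_floor) {
  // Fail loudly unless the conserved density is positive or a floor is set.
  assert(u_d > 0.0 || density_floor > 0.0);
  // With a floor configured, clamp instead of aborting.
  return u_d > density_floor ? u_d : density_floor;
}

int main() {
  // With the default floor of 0.0 (as in the failing runs), any non-positive
  // conserved density trips the assertion.
  std::printf("%g\n", apply_density_floor(1.0, 0.0));
  return 0;
}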

Using this parameter file with a linear wave:
glmmd_smr_linear_wave3d.txt

More outputs:

out_1.txt
out_2.txt
out_3.txt

With this environment (loading the anaconda3_gpu, openmpi/4.1.4, cuda/11.7.0, and hdf5 modules):

$ module list

Currently Loaded Modules:
  1) cue-login-env/1.0   2) gcc/11.2.0   3) modtree/gpu   4) default   5) cudnn/8.4.1.50   6) anaconda3_gpu/4.13.0   7) ucx/1.12.1   8) openmpi/4.1.4   9) cuda/11.7.0  10) hdf5/1.12.2

Smaller meshblocks seem to increase the frequency of the error. In the example above I used 8x8x8 meshblocks and it never ran to completion; with 16x16x16 meshblocks it failed 1 out of 5 times. Changing the fluid method to euler or removing the static mesh refinement zone seems to avoid the issue, although it might just reduce the likelihood of hitting it.
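
For reference, the meshblock size and the refined region are controlled by input blocks along these lines (a sketch only, assuming Parthenon's standard static-refinement syntax; the extents and level below are illustrative, not copied from glmmd_smr_linear_wave3d.txt):

<parthenon/mesh>
refinement = static    # other mesh parameters omitted

<parthenon/meshblock>
nx1 = 8    # 8x8x8 blocks fail most often; 16x16x16 fails about 1 in 5 runs
nx2 = 8
nx3 = 8

<parthenon/static_refinement0>
x1min = -0.5   # illustrative extents of the statically refined region
x1max = 0.5
x2min = -0.25
x2max = 0.25
x3min = -0.25
x3max = 0.25
level = 1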

The problem occurs with a RelWithDebInfo build but not with a Debug build, which makes this tricky to debug.
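
Roughly, the only difference between the two builds is CMAKE_BUILD_TYPE; the configure step looks something like this (a sketch, with the source path and the Kokkos_ARCH_AMPERE86 flag for the A40 assumed rather than taken verbatim from the job scripts):

# RelWithDebInfo build (fails intermittently on the A40)
cmake -S athenapk -B build-relwithdebinfo -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON

# Debug build (has not reproduced the failure)
cmake -S athenapk -B build-debug -DCMAKE_BUILD_TYPE=Debug \
  -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON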

forrestglines added the bug label on Nov 16, 2022

pgrete (Contributor) commented Nov 16, 2022

Nothing should evolve in this case, correct? So dt should be constant, but it isn't, so something is already going wrong in the very first cycle. Edit: I mixed up time and dt; dt is in fact constant to all digits, so it makes no sense why this would happen.
Did you see this on any other architecture?

pgrete (Contributor) commented Nov 16, 2022

Does this happen with MPI or serial?
I just ran this for 10000 cycles on an A100 and was not able to reproduce it.

pgrete (Contributor) commented Nov 16, 2022

Also, if this happens for glmmhd but not for euler, could you try commenting out the call to the Dedner source term in hydro.cpp:

  if (hydro_pkg->Param<Fluid>("fluid") == Fluid::glmmhd) {
    hydro_pkg->Param<GLMMHD::SourceFun_t>("glmmhd_source")(md, beta_dt);
  }

which is currently still called, even for Riemann::none, but shouldn't do anything for the standard source.
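
I.e., something like this (just a sketch of the suggested temporary change):

  // Temporarily disabled to rule out the Dedner/GLM source term as the origin
  // of the negative densities:
  // if (hydro_pkg->Param<Fluid>("fluid") == Fluid::glmmhd) {
  //   hydro_pkg->Param<GLMMHD::SourceFun_t>("glmmhd_source")(md, beta_dt);
  // }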

forrestglines (Contributor, Author) commented

Nothing should evolve in this case, correct? So dt should be constant, but it isn't, so something is already going wrong in the very first cycle. Edit: I mixed up time and dt; dt is in fact constant to all digits, so it makes no sense why this would happen. Did you see this on any other architecture?

Yes, nothing should evolve; it should remain static. I've only seen this on the A40s on Delta GPU, although I don't have access to any other A40s.

The A40s on Delta GPU are PCIe cards; I've checked A100 PCIe cards but couldn't reproduce the issue.

Does this happen with MPI or serial? I just ran this for 10000 cycles on an A100 and was not able to reproduce it.

Compiled with MPI but run on a single rank. I haven't seen this on any other architecture. A100s work fine for this problem.

Also, if this happens for glmmhd but not for euler, could you try commenting out the call to the Dedner source term in hydro.cpp

Very good idea; unfortunately, it doesn't prevent the error.

However, I did go back and test this with Euler for a much longer time period. Eventually Euler hits this error as well, so it's not GLMMHD-related as I originally thought. The larger data volume with GLMMHD might have led to the error sooner.

I'm not convinced that this is an error with AthenaPK and not the A40s on Delta GPU. Since we're not tied down to using the A40s on Delta, we don't have to investigate this now. I'd like to see if this error appears on other A40s before we spend too much time on this.

pgrete (Contributor) commented Nov 17, 2022

Out of curiosity, have you tried to compile plain Parthenon on the A40 and stumbled across parthenon-hpc-lab/parthenon#659 (comment)?

Another interesting (low-effort) test would be to run AthenaPK with parthenon/sparse/enable_sparse=false, as this does seem to fix the issue in Parthenon above.
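
That can be set either in the input file or (assuming the usual Parthenon-style block/param=value command-line overrides) directly on the command line; the binary path below is just an example:

<parthenon/sparse>
enable_sparse = false

./bin/athenaPK -i glmmd_smr_linear_wave3d.txt parthenon/sparse/enable_sparse=false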

This would be an interesting second data point, but I agree, I'm not too concerned about this now (because when this shows up it means zeros are introduced, which will always fail loudly).
