Static mesh refinement and GLMMHD fail on Delta GPU A40s #10

Closed
forrestglines opened this issue Nov 16, 2022 · 5 comments · Fixed by parthenon-hpc-lab/parthenon#800

I've run into a bug on Delta GPU while trying out the A40s with static mesh refinement and GLMMHD. The code runs into negative densities in the conserved-to-primitive step even when fluid evolution and all source terms are disabled. The failure occurs at different times and on different blocks.

Example Failure
cycle=28 time=1.3124988050904851e-01 dt=4.6874957324660183e-03 zone-cycles/wsec_step=7.51e+06 wsec_total=1.01e+00 wsec_step=3.87e-02
### PARTHENON ERROR
  Condition:   u_d > 0.0 || density_floor_ > 0.0
  Message:     Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.
  File:        /u/glines/code/athenapk-project/athenapk/src/eos/adiabatic_glmmhd.cpp
  Line number: 80
[the same PARTHENON ERROR block is printed seven more times]
:0: : block: [1096,0,0], thread: [0,2,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,3,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,14,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,15,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,146,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,147,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,158,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
:0: : block: [1096,0,0], thread: [0,159,0] Assertion `Got negative density. Consider enabling first-order flux correction or setting a reasonble density floor.` failed.
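
For reference, the check tripped at adiabatic_glmmhd.cpp:80 is a device-side positivity assertion. Below is a minimal, self-contained sketch of that logic, reconstructed only from the logged condition and message; the real code lives inside a Kokkos kernel and uses a PARTHENON_REQUIRE-style macro, and the helper function here is hypothetical.

#include <cassert>
#include <cstdio>

// Hypothetical standalone sketch of the failing check, based only on the
// logged condition and message above; not copied from adiabatic_glmmhd.cpp.
static double apply_density_floor(double u_d, double density_floor) {
  // Fail loudly unless the conserved density is positive or a floor is set.
  assert(u_d > 0.0 || density_floor > 0.0);
  // With a floor configured, clamp instead of aborting.
  return u_d > density_floor ? u_d : density_floor;
}

int main() {
  // With the default floor of 0.0 (as in the failing runs), any non-positive
  // conserved density trips the assertion.
  std::printf("%g\n", apply_density_floor(1.0, 0.0));
  return 0;
}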

Using this parameter file with a linear wave:
glmmd_smr_linear_wave3d.txt

More outputs:

out_1.txt
out_2.txt
out_3.txt

With this environment (loading the anaconda3_gpu, openmpi/4.1.4, cuda/11.7.0, and hdf5 modules):

$ module list

Currently Loaded Modules:
  1) cue-login-env/1.0   2) gcc/11.2.0   3) modtree/gpu   4) default   5) cudnn/8.4.1.50   6) anaconda3_gpu/4.13.0   7) ucx/1.12.1   8) openmpi/4.1.4   9) cuda/11.7.0  10) hdf5/1.12.2

Smaller meshblocks seem to increase the frequency of the error. In the example above I used 8x8x8 meshblocks and it never ran to completion; with 16x16x16 meshblocks it failed 1 out of 5 times. Changing the fluid method to euler or removing the static mesh refinement zone seems to avoid the issue, although it might just reduce the likelihood of hitting it.
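
For reference, the meshblock size and the refined region are controlled by input blocks along these lines (a sketch only, assuming Parthenon's standard static-refinement syntax; the extents and level below are illustrative, not copied from glmmd_smr_linear_wave3d.txt):

<parthenon/mesh>
refinement = static    # other mesh parameters omitted

<parthenon/meshblock>
nx1 = 8    # 8x8x8 blocks fail most often; 16x16x16 fails about 1 in 5 runs
nx2 = 8
nx3 = 8

<parthenon/static_refinement0>
x1min = -0.5   # illustrative extents of the statically refined region
x1max = 0.5
x2min = -0.25
x2max = 0.25
x3min = -0.25
x3max = 0.25
level = 1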

The problem occurs with a RelWithDebInfo build but not with a Debug build, which makes this tricky to debug.
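
Roughly, the only difference between the two builds is CMAKE_BUILD_TYPE; the configure step looks something like this (a sketch, with the source path and the Kokkos_ARCH_AMPERE86 flag for the A40 assumed rather than taken verbatim from the job scripts):

# RelWithDebInfo build (fails intermittently on the A40)
cmake -S athenapk -B build-relwithdebinfo -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON

# Debug build (has not reproduced the failure)
cmake -S athenapk -B build-debug -DCMAKE_BUILD_TYPE=Debug \
  -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE86=ON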

forrestglines added the bug label on Nov 16, 2022

pgrete (Contributor) commented Nov 16, 2022

Nothing should evolve in this case, correct? So dt should be constant, but it isn't, so something is already going wrong in the very first cycle. Edit: I mixed up time and dt; dt is in fact constant to all digits, so it makes no sense why this would happen.
Did you see this on any other architecture?

pgrete (Contributor) commented Nov 16, 2022

Does this happen with MPI or serial?
I just ran this for 10000 cycles on an A100 and was not able to reproduce it.

pgrete (Contributor) commented Nov 16, 2022

Also, if this happens for glmmhd but not for euler, could you try commenting out the call to the Dedner source term in hydro.cpp:

  if (hydro_pkg->Param<Fluid>("fluid") == Fluid::glmmhd) {
    hydro_pkg->Param<GLMMHD::SourceFun_t>("glmmhd_source")(md, beta_dt);
  }

which is currently still called, even for Riemann::none, but shouldn't do anything for the standard source.
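
I.e., something like this (just a sketch of the suggested temporary change):

  // Temporarily disabled to rule out the Dedner/GLM source term as the origin
  // of the negative densities:
  // if (hydro_pkg->Param<Fluid>("fluid") == Fluid::glmmhd) {
  //   hydro_pkg->Param<GLMMHD::SourceFun_t>("glmmhd_source")(md, beta_dt);
  // }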

forrestglines (Contributor, Author) commented

Nothing should evolve in this case, correct? So dt should be constant, but it isn't, so something is already going wrong in the very first cycle. Edit: I mixed up time and dt; dt is in fact constant to all digits, so it makes no sense why this would happen. Did you see this on any other architecture?

Yes, nothing should evolve; it should remain static. I've only seen this on the A40s on Delta GPU, although I don't have access to any other A40s.

The A40s on Delta GPU are PCIe cards; I've checked A100 PCIe cards but couldn't reproduce the issue.

Does this happen with MPI or serial? I just ran this for 10000 cycles on an A100 and was not able to reproduce it.

Compiled with MPI but run on a single rank. I haven't seen this on any other architecture. A100s work fine for this problem.

Also, if this happens for glmmhd but not for euler, could you try commenting out the call to the Dedner source term in hydro.cpp

Very good idea; unfortunately, it doesn't prevent the error.

However, I did go back and test this with Euler for a much longer time period. Eventually Euler hits this error as well, so it's not GLMMHD-related as I originally thought. The larger data volume with GLMMHD might have led to the error sooner.

I'm not convinced that this is an error with AthenaPK and not the A40s on Delta GPU. Since we're not tied down to using the A40s on Delta, we don't have to investigate this now. I'd like to see if this error appears on other A40s before we spend too much time on this.

pgrete (Contributor) commented Nov 17, 2022

Out of curiosity, have you tried to compile plain Parthenon on the A40 and stumbled across parthenon-hpc-lab/parthenon#659 (comment)?

Another interesting (low-effort) test would be to run AthenaPK with parthenon/sparse/enable_sparse=false, as this does seem to fix the issue in Parthenon above.
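
That can be set either in the input file or (assuming the usual Parthenon-style block/param=value command-line overrides) directly on the command line; the binary path below is just an example:

<parthenon/sparse>
enable_sparse = false

./bin/athenaPK -i glmmd_smr_linear_wave3d.txt parthenon/sparse/enable_sparse=false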

This would be an interesting second data point, but I agree, I'm not too concerned about this now (because when this shows up it means zeros are introduced, which will always fail loudly).
