Skip to content

Fix edge case where meshblocks can be improperly de-refined#1248

Merged
Yurlungur merged 16 commits into
developfrom
jmm/amr-fix
Apr 30, 2025
Merged

Fix edge case where meshblocks can be improperly de-refined#1248
Yurlungur merged 16 commits into
developfrom
jmm/amr-fix

Conversation

@Yurlungur
Copy link
Copy Markdown
Collaborator

@Yurlungur Yurlungur commented Apr 24, 2025

PR Summary

The last week or so, I've been chasing down an AMR bug in riot. With a Rayleigh-Taylor instability, I was seeing the solution become asymmetric. I chased down the problem to the mesh refining incorrectly. Here's a plot of asymmetry in the fluid solution, with the meshblocks labeled by gid right before the problem appears:
rt_amr_broken_before
On the next time step, a collection of blocks at the top derefines, while the symmetric block at the bottom does not. In particular, blocks 87, 88, 89, and 90 get coarsened, while blocks 2, 3, 4, and 5 do not.
rt_amr_broken_mid
One step later, the mesh re-refines:
rt_amr_broken_after
At this point, the top blocks have experienced restriction and prolongation, while the bottom blocks have not. This introduces asymmetric truncation error.

The problem turned out to be in mesh-amr_loadbalance.cpp. The code should only de-refine if all children of a parent block want to de-refine. However, the code that checks for this is quite brittle. It compares logical locations, rather than actual parent associativity. When we moved to a forest of octrees, rather than a single tree, we needed to add a comparison for tree as well as location within a tree, and that check was missing.

In this MR, I move to a slightly less brittle approach suggested by @lroberts36 which checks for "same level neighbor" rather than computing by hand. That said, this code is still quite brittle and we may wish to move away from the current logic, which assumes morton ordering, to logic that just tracks which blocks want to de-refine via, e.g., an unordered set.

While I was at it, I also couldn't resist cleaning up the memory management, but replacing the new and delete calls with a std::vector and adding some explanatory comments.

I also added a little documentation and added the script I used to check for symmetry.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur Yurlungur added the bug Something isn't working label Apr 24, 2025
@Yurlungur Yurlungur self-assigned this Apr 24, 2025
@Yurlungur Yurlungur enabled auto-merge April 24, 2025 16:55
@Yurlungur Yurlungur disabled auto-merge April 24, 2025 16:55
@Yurlungur
Copy link
Copy Markdown
Collaborator Author

@pgrete any idea what's going on with the CI here? This looks like a local issue with the CI machine?
https://github.com/parthenon-hpc-lab/parthenon/actions/runs/14647195866/job/41104266637?pr=1248

@Yurlungur Yurlungur enabled auto-merge April 24, 2025 18:12
@Yurlungur Yurlungur disabled auto-merge April 24, 2025 18:12
@@ -1,3 +1,4 @@
#!/usr/bin/env python
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shebang was just missing before.

@Yurlungur
Copy link
Copy Markdown
Collaborator Author

Thanks for the additional cleanup @lroberts36 ! Looks like builds are now failing?
https://github.com/parthenon-hpc-lab/parthenon/actions/runs/14650074098/job/41113472608?pr=1248

@pgrete
Copy link
Copy Markdown
Collaborator

pgrete commented Apr 25, 2025

@pgrete any idea what's going on with the CI here? This looks like a local issue with the CI machine? https://github.com/parthenon-hpc-lab/parthenon/actions/runs/14647195866/job/41104266637?pr=1248

For whatever reason some lock file was not cleaned up. A manual removal seems to have fixed the issue.

Copy link
Copy Markdown
Collaborator Author

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lroberts36 for the additional cleanup. Original bug is below. Luke's changes are non-functional cleanup that better isolate the tree machinery to prevent something like this from being re-introduced.

Comment on lines 571 to +574
if (r < tnderef) {
if ((lderef[n].lx1() + i) == lderef[r].lx1() &&
(lderef[n].lx2() + j) == lderef[r].lx2() &&
(lderef[n].lx3() + k) == lderef[r].lx3() &&
lderef[n].level() == lderef[r].level())
if (lderef[n].GetSameLevelNeighbor(oi, oj, ok) == lderef[r]) {
rr++;
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change here was the location of the bug.

@Yurlungur Yurlungur enabled auto-merge April 28, 2025 13:31
@Yurlungur Yurlungur disabled auto-merge April 28, 2025 20:05
@Yurlungur Yurlungur enabled auto-merge April 28, 2025 20:05
Copy link
Copy Markdown
Collaborator

@lroberts36 lroberts36 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Yurlungur
Copy link
Copy Markdown
Collaborator Author

Copy link
Copy Markdown
Collaborator

@pgrete pgrete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (as far as I was able to follow).

Comment thread doc/sphinx/src/outputs.rst Outdated
Comment thread doc/sphinx/src/outputs.rst
Comment thread scripts/python/packages/parthenon_tools/parthenon_tools/compute_asymmetry.py Outdated
Comment thread src/mesh/mesh-amr_loadbalance.cpp
Comment on lines +568 to +570
for (std::int64_t ok = 0; ok <= lk; ok++) {
for (std::int64_t oj = 0; oj <= lj; oj++) {
for (std::int64_t oi = 0; oi <= 1; oi++) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot say that I have fully internalized the updated mesh structure yet, but double checking for my own understanding:
Why are lk and lj tied to symmetry but i is not?
I see that our grids are always at least 1D (which I'd understand overall a different behavior), but what determines the use of symmetry?
IIRC I had a similar question in the past and it was tied to being periodic but I'm not that sure any more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lroberts36 should jump in but... as I understand it, symmetry is totally synonymous with dimension. It's essentially, "are we plane-symmetric in this direction?"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, symmetry(dir) being true means that we have translational symmetry in direction dir. Previously in many places in the code we just checked if the size of a block was one in a given direction, but this created some issues with multigrid when multiple block sizes were allowed. I think we could also probably remove the symmetry stuff and replace it with ndim checks.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer checking symmetry. It's more self-descriptive.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should be changed to TranslationalSymmetry or something because this isn't the first time someone has been confused about what that call implies.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reasonable suggestion for a later MR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, when I wrote "Maybe it should be changed..." I was thinking the change should occur at some (possibly infinite) time in the future

@pgrete
Copy link
Copy Markdown
Collaborator

pgrete commented Apr 30, 2025

@pgrete this error looks like the CI machine didn't see any GPUs?

https://github.com/parthenon-hpc-lab/parthenon/actions/runs/14716288427/job/41300312436#step:6:1

I think the machine is currently under heavy use. I just retriggerd the job (expecting that it'll go through).

Yurlungur and others added 4 commits April 30, 2025 10:43
Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>
Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>
@Yurlungur Yurlungur disabled auto-merge April 30, 2025 15:13
@Yurlungur Yurlungur enabled auto-merge April 30, 2025 15:13
@Yurlungur Yurlungur merged commit 8f4e994 into develop Apr 30, 2025
35 checks passed
@Yurlungur Yurlungur deleted the jmm/amr-fix branch June 20, 2025 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants