Meeting notes 10.31.2024
- Follow up on hackathon
- Individual/group updates
- Review non-WIP PRs
- Hierarchical parallelism issues
- Ben Prather worked on it. Well understood. Fix basically ready to go.
- Particle profiling/performance, fixing atomics
- Alex Long has open PR, basically ready to go
- Fusing comms
- Luke's solution works, but needs to be cleaned up and made production ready
- Strange MPI error Luke encountered, now fixed.
- Some issues remain for downstream use, e.g., with mesh partitioning and sparse variables.
- Very useful to have blocked off time to just focus.
- Engaged with NVIDIA + Forrest, especially getting set up with profiling tools. Tooling very powerful.
- Forrest expressed a lot of interest in looking at Artemis + Jaybenne. May lead to enhancements in parthenon codebase.
- Two PRs, one worth merging, one not
- AthenaK uses fused MPI buffers
- Parthenon only allocates enough teams to saturate the device if there are enough meshblocks available
- Solution is to move the index of the variable being synched from the outer loop to the inner loop (see the sketch after this list). Introduced a 25% performance improvement for AthenaPK, 1% in KHARMA
- Other option was to completely flatten all indices. Drop hierarchical parallelism and do single "flat" loop. Translates to launching a lot of threads that immediately exit.
- Doesn't seem to make sense for this kernel, but there may be other places where it does, so worth exploring as an experiment; it shouldn't be merged, though. The index-movement PR is simple, so it should be merged.
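A minimal Kokkos sketch of the index movement described above, assuming a hypothetical buffer-packing kernel (names and data layout are invented for illustration, not Parthenon's actual code). With the variable index in the outer loop, the league size is `nblocks * nvars` and few meshblocks can starve the device; fusing it into the inner loop keeps one team per block busy over all variables:

```cpp
#include <Kokkos_Core.hpp>

// [Hypothetical sketch] Teams map to meshblocks only; the variable index is
// fused into the inner thread-range loop rather than the outer (league) loop.
void PackBuffers(Kokkos::View<double ***> buf,   // [block][var][cell]
                 Kokkos::View<double ***> data,  // [block][var][cell]
                 int nblocks, int nvars, int ncells) {
  using policy_t = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "pack_buffers", policy_t(nblocks, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type &team) {
        const int b = team.league_rank();
        // Variable and cell indices share the inner loop, so the team has
        // nvars * ncells units of work regardless of the block count.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nvars * ncells),
                             [&](const int idx) {
                               const int v = idx / ncells;
                               const int c = idx % ncells;
                               buf(b, v, c) = data(b, v, c);
                             });
      });
}
```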
- Not too many updates. Still working on meshdata version of AMR refinement
- Original version had runtime switch.
- During review, most people in favor of just removing meshblock version. No runtime switch.
- Will do that. This is only a breaking change for users with their own AMR criteria.
- `CheckRefinement` will still work, but users have to register it as a callback (see the sketch below).
- Some discussion with Philipp on ScatterViews. Probably doesn't matter too much which one is used.
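For context, a sketch of what registering the refinement callback looks like, following the pattern in Parthenon's package examples (signatures paraphrased from memory; check the current headers before relying on them):

```cpp
#include <memory>
#include <parthenon/package.hpp>

using namespace parthenon;

// User-defined per-block refinement criterion. The body here is a
// placeholder; a real criterion would reduce over the block's cells.
AmrTag CheckRefinement(MeshBlockData<Real> *rc) {
  return AmrTag::same;  // or AmrTag::refine / AmrTag::derefine
}

// During package initialization, register the criterion as a callback.
std::shared_ptr<StateDescriptor> Initialize(ParameterInput *pin) {
  auto pkg = std::make_shared<StateDescriptor>("my_package");
  pkg->CheckRefinementBlock = CheckRefinement;
  return pkg;
}
```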
- Hackathon orchestration
- Coalescing communication.
- Chased down the MPI issue. If one rank gets far behind another, there is a possibility that some messages might not get through: iprobe can cover up the messages we needed to capture with messages we couldn't use yet. The other pattern attempted is more similar to what we do now; there, we weren't checking carefully enough that the small buffers were available for writing.
- PR now has 3 different strategies:
- iprobe, which will never work
- improbe, which works a little differently: it assigns a message id to each received message (see the sketch below)
- now checking that small buffers are actually available, repeated polling seems to work
- important to check this is all working
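A short sketch of the difference between the first two strategies. With `MPI_Iprobe`, the probe and the later `MPI_Recv` match independently, so the receive can grab a different message with the same envelope than the one probed; `MPI_Improbe` returns an `MPI_Message` handle that pins the exact matched message, which `MPI_Mrecv` then receives:

```cpp
#include <mpi.h>

// Poll for any incoming message and receive exactly the one that matched.
// Assumes buf is large enough to hold any message that can arrive.
void PollAndReceive(void *buf, MPI_Comm comm) {
  int flag = 0;
  MPI_Message msg;
  MPI_Status status;
  MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &msg, &status);
  if (flag) {
    int count;
    MPI_Get_count(&status, MPI_BYTE, &count);
    // Unlike MPI_Recv after MPI_Iprobe, this cannot be "covered up" by a
    // newer message with the same source/tag: msg identifies one message.
    MPI_Mrecv(buf, count, MPI_BYTE, &msg, &status);
  }
}
```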
- Other things required to get the PR in:
- How do we make this work with sparse? Many choices to make. Have to see what's performant and doesn't have too big of a memory footprint
- Dealing with MeshData objects that have only a subset of fields on them. Currently, combined communication assumes that a MeshData is communicating all of the FillGhost fields; things break if you only have some subset, so that needs to be fixed. Trying to work on that now.
- Seems to be worth doing. Got pretty big scaling enhancements from doing this. Especially for the riot-style (i.e., many individual vectors) codes.
- Also working on the multigrid solver. Moving to a staged rather than field-based solver, which should give more flexibility.
- You really need to be careful about defining your boundary conditions on the surface of your domain, not just your ghost cells, because the cell centers in your ghost cells move as you move up in the hierarchy.
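A tiny illustration of why the face is the right place to define the condition (my own example, not Parthenon code). For a Dirichlet value `g_face` on the domain face, the first interior cell center sits `dx/2` inside the face and the ghost center `dx/2` outside at every multigrid level, so linear interpolation through the face gives a formula that is valid on every level:

```cpp
// Ghost-cell value that makes the linearly interpolated face value equal
// g_face, independent of the level's dx. Storing a fixed ghost-center value
// instead would be wrong on coarser levels, where that center has moved.
inline double DirichletGhost(double g_face, double u_interior) {
  return 2.0 * g_face - u_interior;
}
```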
- Two things surrounding BCs
- Leveraged new BC API downstream
- Previous design was not really capable of dealing with different BCs for swarms vs. default BCs. The new API really cleaned things up.
- Question: We have a derived face field. When we populate it, we want to call `FillGhost` so that we can call flux correction operations on it. However, we don't really want to apply physical boundary conditions.
- Jonah: I don't think this is currently possible... but it doesn't seem unreasonable.
- Particle stuff is basically done.
- Hierarchical parallelism buffer-packing MR also basically done and ready for review.
Big questions/issues
- Test this on some other architectures/machines. Would be good to try on Chicoma or Venado. Or Frontier. Maybe look at these different comm patterns.
- Ben R and Patrick will test in Artemis
- Others will try to do with Phoebus + riot
- Do we want to keep the old comm infrastructure so we can switch back and forth? Probably yes
- Sparse issue
- Concern is it may take time to figure out the right thing to do
- Luke likely out.
- Subset communication
- Luke will do. Probably can be done.
- Could potentially get things into a place where coalesced communication is off by default so tests pass.
- Proposed path: add a runtime switch with three modes: coalesce everything, coalesce only dense, coalesce nothing (see the sketch below). Hope to get the MR into that place and merge before Luke goes on leave.
- Artemis won't work today. Needs the subset communication machinery.
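A sketch of the shape such a switch might take (option names here are placeholders, not the actual MR's parameters):

```cpp
#include <stdexcept>
#include <string>

// Three-way coalescing mode; defaulting to Nothing would let existing tests
// pass while the sparse questions get sorted out.
enum class CoalesceMode { Everything, DenseOnly, Nothing };

CoalesceMode ParseCoalesceMode(const std::string &s) {
  if (s == "everything") return CoalesceMode::Everything;
  if (s == "dense_only") return CoalesceMode::DenseOnly;
  if (s == "nothing") return CoalesceMode::Nothing;
  throw std::invalid_argument("unknown coalesce mode: " + s);
}
```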
- Ben to submit MR hotfix for docs CI
- Jonah + Luke to review Ben's MR
- Ben P + Philipp to coordinate on Kokkos version MR
- Comm inner loop specializations MR. Ben + Luke to coordinate
- Par dispatch: awaiting Philipp's review. Some conflicts to resolve.