Meeting notes 10.31.2024
- Follow up on hackathon
- Individual/group updates
- Review non-WIP PRs
- Hierarchical parallelism issues
- Ben Prather worked on it. Well understood. Fix basically ready to go.
- Particle profiling/performance, fixing atomics
- Alex Long has open PR, basically ready to go
- Fusing comms
- Luke's solution works, but needs to be cleaned up and made production ready
- Strange MPI error Luke encountered, now fixed.
- Some issues remain for downstream use, e.g., with mesh partitioning and sparse variables.
- Very useful to have blocked off time to just focus.
- Engaged with NVIDIA + Forrest, especially getting set up with profiling tools. Tooling very powerful.
- Forrest expressed a lot of interest in looking at Artemis + Jaybenne. May lead to enhancements in parthenon codebase.
- Two PRs, one worth merging, one not
- AthenaK uses fused MPI buffers
- Parthenon only allocates enough teams to saturate the device if there are enough meshblocks available
- Solution is to move the index of the variable being synched from the outer loop to the inner loop (see the sketch after this list). Introduced a 25% performance improvement for AthenaPK, 1% in KHARMA
- Other option was to completely flatten all indices. Drop hierarchical parallelism and do single "flat" loop. Translates to launching a lot of threads that immediately exit.
- Doesn't seem to make sense for this kernel, but there may be other places where it does, so worth exploring as an experiment; it shouldn't be merged, though. The index-movement PR is simple, so it should be merged.
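A minimal Kokkos sketch of the index movement described above, assuming a hypothetical buffer-packing kernel (names and data layout are invented for illustration, not Parthenon's actual code). With the variable index in the outer loop, the league size is `nblocks * nvars` and few meshblocks can starve the device; fusing it into the inner loop keeps one team per block busy over all variables:

```cpp
#include <Kokkos_Core.hpp>

// [Hypothetical sketch] Teams map to meshblocks only; the variable index is
// fused into the inner thread-range loop rather than the outer (league) loop.
void PackBuffers(Kokkos::View<double ***> buf,   // [block][var][cell]
                 Kokkos::View<double ***> data,  // [block][var][cell]
                 int nblocks, int nvars, int ncells) {
  using policy_t = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "pack_buffers", policy_t(nblocks, Kokkos::AUTO),
      KOKKOS_LAMBDA(const policy_t::member_type &team) {
        const int b = team.league_rank();
        // Variable and cell indices share the inner loop, so the team has
        // nvars * ncells units of work regardless of the block count.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nvars * ncells),
                             [&](const int idx) {
                               const int v = idx / ncells;
                               const int c = idx % ncells;
                               buf(b, v, c) = data(b, v, c);
                             });
      });
}
```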
- Not too many updates. Still working on meshdata version of AMR refinement
- Original version had runtime switch.
- During review, most people in favor of just removing meshblock version. No runtime switch.
- Will do that. This is only a breaking change for users with their own AMR criteria.
- `CheckRefinement` will still work, but users have to register it as a callback (see the sketch below).
- Some discussion with Philipp on ScatterViews. Probably doesn't matter too much which one is used.
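For context, a sketch of what registering the refinement callback looks like, following the pattern in Parthenon's package examples (signatures paraphrased from memory; check the current headers before relying on them):

```cpp
#include <memory>
#include <parthenon/package.hpp>

using namespace parthenon;

// User-defined per-block refinement criterion. The body here is a
// placeholder; a real criterion would reduce over the block's cells.
AmrTag CheckRefinement(MeshBlockData<Real> *rc) {
  return AmrTag::same;  // or AmrTag::refine / AmrTag::derefine
}

// During package initialization, register the criterion as a callback.
std::shared_ptr<StateDescriptor> Initialize(ParameterInput *pin) {
  auto pkg = std::make_shared<StateDescriptor>("my_package");
  pkg->CheckRefinementBlock = CheckRefinement;
  return pkg;
}
```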
- Hackathon orchestration
- Coalescing communication.
- Chased down the MPI issue. If one rank gets far behind another, there is a possibility that some messages might not get through: iprobe can cover up the messages we needed to capture with messages we couldn't use yet. The other pattern attempted is more similar to what we do now; there, we weren't checking carefully enough that the small buffers were available for writing.
- PR now has 3 different strategies:
- iprobe, which will never work
- improbe, which works a little differently: it assigns a message id to each received message (see the sketch below)
- now checking that small buffers are actually available, repeated polling seems to work
- important to check this is all working
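A short sketch of the difference between the first two strategies. With `MPI_Iprobe`, the probe and the later `MPI_Recv` match independently, so the receive can grab a different message with the same envelope than the one probed; `MPI_Improbe` returns an `MPI_Message` handle that pins the exact matched message, which `MPI_Mrecv` then receives:

```cpp
#include <mpi.h>

// Poll for any incoming message and receive exactly the one that matched.
// Assumes buf is large enough to hold any message that can arrive.
void PollAndReceive(void *buf, MPI_Comm comm) {
  int flag = 0;
  MPI_Message msg;
  MPI_Status status;
  MPI_Improbe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &msg, &status);
  if (flag) {
    int count;
    MPI_Get_count(&status, MPI_BYTE, &count);
    // Unlike MPI_Recv after MPI_Iprobe, this cannot be "covered up" by a
    // newer message with the same source/tag: msg identifies one message.
    MPI_Mrecv(buf, count, MPI_BYTE, &msg, &status);
  }
}
```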
- Other things required to get the PR in:
- How do we make this work with sparse? Many choices to make. Have to see what's performant and doesn't have too big of a memory footprint
- Dealing with MeshData objects that have only a subset of fields on them. Currently, combined communication assumes that a MeshData is communicating all of the FillGhost fields; things break if you only have some subset, so that needs to be fixed. Trying to work on that now.
- Seems to be worth doing. Got pretty big scaling enhancements from doing this. Especially for the riot-style (i.e., many individual vectors) codes.
- Also working on the multigrid solver. Moving to a staged rather than field-based solver, which should give more flexibility.
- You really need to be careful about defining your boundary conditions on the surface of your domain, not just your ghost cells, because the cell centers in your ghost cells move as you move up in the hierarchy.
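A tiny illustration of why the face is the right place to define the condition (my own example, not Parthenon code). For a Dirichlet value `g_face` on the domain face, the first interior cell center sits `dx/2` inside the face and the ghost center `dx/2` outside at every multigrid level, so linear interpolation through the face gives a formula that is valid on every level:

```cpp
// Ghost-cell value that makes the linearly interpolated face value equal
// g_face, independent of the level's dx. Storing a fixed ghost-center value
// instead would be wrong on coarser levels, where that center has moved.
inline double DirichletGhost(double g_face, double u_interior) {
  return 2.0 * g_face - u_interior;
}
```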
- Two things surrounding BCs
- Leveraged new BC API downstream
- Previous design was not really capable of dealing with different BCs for swarms vs. default BCs. The new API really cleaned things up.
- Question: We have a derived face field. When we populate it, we want to call `FillGhost` so that we can call flux correction operations on it. However, we don't really want to apply physical boundary conditions.
- Jonah: I don't think this is currently possible... but it doesn't seem unreasonable.
- Particle stuff is basically done.
- Hierarchical parallelism buffer-packing MR also basically done and ready for review.
Big questions/issues
- Test this on some other architectures/machines. Would be good to try on Chicoma or Venado. Or Frontier. Maybe look at these different comm patterns.
- Ben R and Patrick will test in Artemis
- Others will try to do with Phoebus + riot
- Do we want to keep the old comm infrastructure so we can switch back and forth? Probably yes
- Sparse issue
- Concern is it may take time to figure out the right thing to do
- Luke likely out.
- Subset communication
- Luke will do. Probably can be done.
- Could potentially get things into a place where coalesced communication is off by default so tests pass.
- Proposed path: add a runtime switch with three modes: coalesce everything, coalesce only dense, coalesce nothing (see the sketch below). Hope to get the MR into that place and merge before Luke goes on leave.
- Artemis won't work today. Needs the subset communication machinery.
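A sketch of the shape such a switch might take (option names here are placeholders, not the actual MR's parameters):

```cpp
#include <stdexcept>
#include <string>

// Three-way coalescing mode; defaulting to Nothing would let existing tests
// pass while the sparse questions get sorted out.
enum class CoalesceMode { Everything, DenseOnly, Nothing };

CoalesceMode ParseCoalesceMode(const std::string &s) {
  if (s == "everything") return CoalesceMode::Everything;
  if (s == "dense_only") return CoalesceMode::DenseOnly;
  if (s == "nothing") return CoalesceMode::Nothing;
  throw std::invalid_argument("unknown coalesce mode: " + s);
}
```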
- Ben to submit MR hotfix for docs CI
- Jonah + Luke to review Ben's MR
- Ben P + Philipp to coordinate on Kokkos version MR
- Comm inner loop specializations MR. Ben + Luke to coordinate
- Par dispatch: awaiting Philipp's review. Some conflicts to resolve.