2023.12.14 Meeting Notes
- Individual/group updates
- IO
- review non-WIP PRs
LR
- working on performance improvements for MG
- caching PackDescriptors for boundary packs significantly improved performance (this was a surprising bottleneck)
- for downstream codes: make sure they are `static` (especially for larger numbers of variables); see the sketch below
  - BP: might this be related to `FlagCollections` (which seem to be slow)
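A minimal sketch of the `static` descriptor pattern referenced above, assuming the sparse-pack API (`MakePackDescriptor`/`GetPack`); the task name and the variable types `dens`/`velo` are purely illustrative placeholders for a downstream code's own variable structs:

```cpp
#include <parthenon/package.hpp>
using namespace parthenon;

// Sketch only: cache the pack descriptor in a function-local static so the
// (comparatively expensive) descriptor construction runs once, while the cheap
// GetPack call still happens on every task invocation.
TaskStatus MyTask(MeshData<Real> *md) {
  auto pmesh = md->GetMeshPointer();
  // dens/velo stand in for the downstream code's variable type structs.
  static const auto desc =
      MakePackDescriptor<dens, velo>(pmesh->resolved_packages.get());
  auto pack = desc.GetPack(md);  // reuses the cached descriptor
  // ... launch kernels using pack ...
  return TaskStatus::complete;
}
```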
BP
- chasing gremlins in MPI, e.g.,
- fewer than 3 meshblocks in any direction results in issues (like magnetic field divergence)
- not 1:1 reproducible
- trying to bisect
- sounds like it is related to the issues PG is seeing
JM
- small improvements here and there like
- index splits (for easier/faster hierarchical parallelism via vectorization); a PR is open
- advantage over TeamMDRange is more flexible control over which indices are fused (see the illustrative sketch at the end of this update)
- added machinery for correctness checks in the parthenon-vibe benchmark
- non-cell-centered IO
- refactoring Phoebus for more modern Parthenon use
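To illustrate the idea behind the index splits mentioned above (this is plain Kokkos, not the actual IndexSplit API from the open PR): manually fusing the outer k/j indices into the team range while keeping the contiguous i index in the inner vector loop gives the explicit control over fusion that TeamMDRange does not.

```cpp
#include <Kokkos_Core.hpp>

using View3D = Kokkos::View<double ***>;

// Illustrative only: fuse (k, j) into the team (league) index by hand and keep
// the contiguous i index in the inner vector loop.
void add_fused_kj(const View3D &out, const View3D &a, const View3D &b) {
  const int nk = out.extent_int(0), nj = out.extent_int(1), ni = out.extent_int(2);
  Kokkos::parallel_for(
      "add_fused_kj", Kokkos::TeamPolicy<>(nk * nj, Kokkos::AUTO),
      KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type &team) {
        const int k = team.league_rank() / nj;  // fused outer indices
        const int j = team.league_rank() % nj;
        Kokkos::parallel_for(Kokkos::TeamVectorRange(team, ni),
                             [&](const int i) { out(k, j, i) = a(k, j, i) + b(k, j, i); });
      });
}
```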
PD
- working on a new Parthenon-based code with curvilinear coordinates
- more Athena++-esque
- user-defined prolongation/restriction/flux correction etc. worked well
- has a use case for "just a flux"
- face fields provide too much boilerplate (and implicitly enable additional machinery)
- might be fixed with a small new metadata flag (see the sketch at the end of this update)
- JM/LR: might be worth refactoring the flux correction machinery down the line (e.g., make fluxes face variables with dependencies)
- looks like our implicit use of ghost cells for flux fields is not necessarily intuitive
- needs to be documented
- there might be a gotcha when using face- and edge-centered data in the same loop as cell-centered fields in the current WIP IndexSplit machinery
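For context on the "just a flux" discussion, a hedged sketch of how such a field is currently registered as a plain face-centered field via `Metadata`; the package and field names are illustrative, and the proposed new metadata flag is not shown since it does not exist yet:

```cpp
#include <memory>
#include <parthenon/package.hpp>

// Sketch of the current approach: declare the flux as an ordinary face-centered,
// derived, single-copy field. A dedicated "just a flux" metadata flag (as
// discussed above) would let such a field opt out of the extra boundary and
// communication machinery that face fields currently imply.
std::shared_ptr<parthenon::StateDescriptor> InitMyPackage() {
  using parthenon::Metadata;
  auto pkg = std::make_shared<parthenon::StateDescriptor>("my_pkg");
  Metadata m_flux({Metadata::Face, Metadata::Derived, Metadata::OneCopy});
  pkg->AddField("mass_flux", m_flux);  // illustrative field name
  return pkg;
}
```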
PG
- encountered MPI issues (timeouts) with few blocks per rank (say one or two) with a recent version of Parthenon
- is somewhat reliably reproducible
- unclear where this stems from; no more detailed debugging yet
- still tracing IO issues with HDF5, which have now become a roadblock for upcoming simulations
- ADIOS2 seems to be very performant (on Orion): quick testing allowed writing 4.5 TB of data in 0.85 seconds (from 512 nodes)
- will look into a new output based on openPMD with the ADIOS2 backend (see the sketch at the end of this update)
- worked on large scale viz of INCITE sims
- needed some custom XDMF pre-processing to reduce the data so it could be handled by ParaView
- see above (openPMD/ADIOS2)
- we should ensure we can ship it as a submodule (for ease of use)
- also need to ensure that analysis pipelines (especially Python-based ones) easily interface with those outputs
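As a reference point for the openPMD/ADIOS2 item above, a minimal serial sketch using the openPMD-api C++ interface (the `.bp` extension selects the ADIOS2 backend); dataset names and sizes are placeholders, and this is not the planned Parthenon output writer (a real one would use the MPI-aware `Series` constructor):

```cpp
#include <vector>
#include <openPMD/openPMD.hpp>

int main() {
  // The ".bp" suffix makes openPMD use its ADIOS2 backend.
  openPMD::Series series("dump_%05T.bp", openPMD::Access::CREATE);

  constexpr size_t nx = 16, ny = 16, nz = 16;
  std::vector<double> rho(nx * ny * nz, 1.0);  // placeholder data

  auto mesh = series.iterations[0].meshes["density"];
  auto comp = mesh[openPMD::MeshRecordComponent::SCALAR];
  comp.resetDataset(
      openPMD::Dataset(openPMD::determineDatatype<double>(), {nx, ny, nz}));
  comp.storeChunk(rho, {0, 0, 0}, {nx, ny, nz});

  series.flush();  // data is written through ADIOS2 here
  return 0;
}
```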
-> people should think about ideas/approaches for a Gordon Bell submission
- unify/pick packing machinery <-> id based packing
- best practices on performance relevant parameters (and more generally block sizes/work per device)