2020.09.23 Meeting Notes

Agenda

  • Individual/group updates
  • Scaling Update
  • Re-introduce OpenMP dispatch?
  • Review non-WIP PRs

Individual/Group Updates

LANL CS

Andrew: Integration with the LANL integrated code is proceeding apace - the 0D "estimate PI" example now runs on top of Parthenon infrastructure. There is a bad O(N^2) performance scaling issue that is mitigated by increasing the number of ranks - a known issue with mesh variables that should be easy to fix.

Galen ran weak scaling tests up to 16000 GPUs: https://github.com/lanl/parthenon/issues/301

LANL Physics

Debugging a weird HtoD memcpy issue - once the lambda capture grows past a certain size, the kernel launch results in an async memcpy. Fixed by capturing a 1D view rather than a ParArrayND, which has much smaller metadata. Needs review: https://github.com/lanl/parthenon/pull/293
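As a minimal sketch of that workaround (assuming Kokkos and a contiguous allocation; the function name and view shapes here are illustrative, not the actual PR code), a slim unmanaged 1D view can be captured in place of the full multi-dimensional wrapper:

```cpp
#include <Kokkos_Core.hpp>

// A ParArrayND-style wrapper carries multi-dimensional extent/stride
// metadata, so capturing it by value can push the lambda past the size
// threshold that triggers the async HtoD memcpy. Capturing an unmanaged
// 1D view over the same memory keeps the functor small.
void fill_ones(Kokkos::View<double ****> arr_nd) {
  // Unmanaged flat view: just a pointer plus one extent in the capture.
  Kokkos::View<double *> arr_1d(arr_nd.data(), arr_nd.size());
  Kokkos::parallel_for(
      "fill_ones", arr_1d.extent(0),
      KOKKOS_LAMBDA(const int i) { arr_1d(i) = 1.0; });
}
```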

Ben Ryan has an open PR for particles - fairly mature and showing promising performance: roughly a factor-of-100 speedup on a V100 vs. a Skylake node. Needs review.

Jonah has been working through a variety of API changes. Improved FindMeshBlock performance by switching from std::list<MeshBlock> to std::vector<std::shared_ptr<MeshBlock>>. More cleanup to MeshBlock - ready to merge, just needs review (https://github.com/lanl/parthenon/pull/307). Added caching for MeshBlockPacks (https://github.com/lanl/parthenon/pull/308).
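A hedged sketch of why that container change helps (the struct and signature below are illustrative, not Parthenon's actual definitions): a vector of shared_ptrs is traversed contiguously and keeps block handles stable, whereas a std::list lookup chases heap-scattered nodes.

```cpp
#include <memory>
#include <vector>

// Illustrative stand-in for Parthenon's MeshBlock.
struct MeshBlock {
  explicit MeshBlock(int gid) : gid(gid) {}
  int gid; // global block ID
};

// Lookup over a contiguous vector: cache-friendly scan, and the returned
// shared_ptr stays valid even if the vector later reallocates.
std::shared_ptr<MeshBlock>
FindMeshBlock(const std::vector<std::shared_ptr<MeshBlock>> &blocks, int gid) {
  for (const auto &pmb : blocks) {
    if (pmb->gid == gid) return pmb;
  }
  return nullptr;
}
```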

PKAthena

Philipp has been looking into a number of issues, including performance testing and review. MeshBlock caching is now used in the advection test, with promising results at small mesh block sizes: 280 million cell-cycles/s, which beats KAthena and suggests our approach is working. Parthenon now compiles against CUDA 11. We should update to the newest Kokkos next week when the new release lands. Reintroduced boundary communication that packs all boundaries in one call - the performance difference was minuscule.

Forrest: Busy prototyping his simulations in KAthena. On Friday he is giving a talk at the German Astronomical Society conference and will mention Parthenon. Forrest will share the conference paper on Riot.

Jim: Prototyping a hydro code with Kokkos, with help from Philipp and Forrest. Made some different decisions around boundary buffers that may present an interesting alternative approach.

Scaling Update

Seeing really good weak scaling performance (https://github.com/lanl/parthenon/issues/301). The biggest issue is that the "avg of 10 cycles" performance stops scaling over a certain region and then recovers. We think this is related to AMR + load balancing, and that re-running with the FindMeshBlock performance improvement will smooth it out.

Phil thinks MPS might be sufficient to get additional host parallelism (similar to streams). It is a simpler approach than threads + streams - a lot less code while still getting good performance.

Galen has some concerns about the scalability of MPS. Livermore is not currently running multiple processes per GPU; they use host threads instead.

Current conclusion: 16^3 mesh blocks seem to perform well.

Requests: Galen asks that Philipp and Jonah send some other scalability updates.

Re-introduce OpenMP dispatch?

No - MPI seems to be good enough. But let's remove all the OpenMP pragmas from initialization; they have only been a source of issues.

One issue that may come up: we use omp simd for our host loop parallelism, which may be a problem for loops where we want to use team shared memory. We'll have to investigate.
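For reference, a minimal Kokkos sketch of the team-shared-memory pattern in question (run, nteams, and nvals are hypothetical names, not Parthenon API). The per-team scratch allocation has no analogue in a flat omp simd host loop, which is the composition issue to investigate.

```cpp
#include <Kokkos_Core.hpp>

using team_policy = Kokkos::TeamPolicy<>;
using member_type = team_policy::member_type;
// Unmanaged view into per-team scratch (shared) memory.
using scratch_view =
    Kokkos::View<double *, Kokkos::DefaultExecutionSpace::scratch_memory_space,
                 Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

void run(int nteams, int nvals) {
  const size_t scratch_bytes = scratch_view::shmem_size(nvals);
  Kokkos::parallel_for(
      "team_scratch_example",
      team_policy(nteams, Kokkos::AUTO)
          .set_scratch_size(0, Kokkos::PerTeam(scratch_bytes)),
      KOKKOS_LAMBDA(const member_type &team) {
        // Allocate this team's scratch buffer; a flat "omp simd" loop has
        // no equivalent of a per-team allocation like this.
        scratch_view tmp(team.team_scratch(0), nvals);
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nvals),
                             [&](const int i) { tmp(i) = i; });
        team.team_barrier();
      });
}
```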