IndexSplit for hierarchical parallelism #981

Merged Dec 19, 2023 (37 commits)

Commits
1b8681a
Add indexsplit
jonahm-LANL Dec 9, 2023
fc9c3b4
add docs
jonahm-LANL Dec 11, 2023
81859b4
index split
jonahm-LANL Dec 11, 2023
fd01483
changelog
jonahm-LANL Dec 11, 2023
ccea896
Merge branch 'develop' into jmm/indexsplit
Yurlungur Dec 13, 2023
faa65e7
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 13, 2023
36df614
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 13, 2023
ca31631
Update tst/unit/test_index_split.cpp
Yurlungur Dec 14, 2023
ddd8b53
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 14, 2023
b64f348
Update src/utils/index_split.cpp
Yurlungur Dec 14, 2023
6d8c2d0
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 14, 2023
ad4edcd
some of forrests comments
jonahm-LANL Dec 14, 2023
2da13f8
merge
jonahm-LANL Dec 14, 2023
f082bec
tweak docs
jonahm-LANL Dec 14, 2023
e9b8052
Merge branch 'develop' into jmm/indexsplit
Yurlungur Dec 14, 2023
3c5186a
add CPU note
jonahm-LANL Dec 14, 2023
dfd33dd
Add Kokkos-based concurrency test
jonahm-LANL Dec 14, 2023
6be82ef
format
jonahm-LANL Dec 14, 2023
e747613
add warning about face-centered data
jonahm-LANL Dec 14, 2023
e43901d
NSMS not real
jonahm-LANL Dec 14, 2023
83fafdf
Merge branch 'develop' into jmm/indexsplit
Yurlungur Dec 14, 2023
4c9db82
DummyFunctor
jonahm-LANL Dec 14, 2023
99bb599
Merge branch 'jmm/indexsplit' of github.com:parthenon-hpc-lab/parthen…
jonahm-LANL Dec 14, 2023
4c58afd
try to get this dummy functor to work
jonahm-LANL Dec 15, 2023
475b242
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 15, 2023
39d147a
add fglines requested test. Fix bug introduced in review process
jonahm-LANL Dec 15, 2023
85d27cf
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 16, 2023
15205ad
lroberts comments
jonahm-LANL Dec 16, 2023
0f68805
merge
jonahm-LANL Dec 16, 2023
61819e8
nstreams needs underscore
jonahm-LANL Dec 16, 2023
e9897a4
Remove shadowing warning from calculate pi
jonahm-LANL Dec 19, 2023
5442738
remove unused variable
jonahm-LANL Dec 19, 2023
7885037
Swap total_work tests to use standard par_reduce, as otherwise the pa…
jonahm-LANL Dec 19, 2023
6c92301
formatting
jonahm-LANL Dec 19, 2023
abbe863
Default loop pattern doesn't work for reductions... Of course.
jonahm-LANL Dec 19, 2023
e03e72c
merge
jonahm-LANL Dec 19, 2023
da1a27d
formatting... agian
jonahm-LANL Dec 19, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -3,6 +3,7 @@
## Current develop

### Added (new features/APIs/variables/...)
- [[PR 981]](https://github.com/parthenon-hpc-lab/parthenon/pull/981) Add IndexSplit
- [[PR 983]](https://github.com/parthenon-hpc-lab/parthenon/pull/983) Add Contains to SparsePack
- [[PR 968]](https://github.com/parthenon-hpc-lab/parthenon/pull/968) Add per package registration of boundary conditions
- [[PR 948]](https://github.com/parthenon-hpc-lab/parthenon/pull/948) Add solver interface and update Poisson geometric multi-grid example
153 changes: 152 additions & 1 deletion doc/sphinx/src/nested_par_for.rst
@@ -51,7 +51,7 @@ Data type for memory in scratch pad/cache memory. Use
documentation <https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=hierarchical>`__
for determining scratch pad memory needs before kernel launch.

Important usage hints
On Barriers
---------------------

In order to ensure that individual threads of a team are synchronized
@@ -70,6 +70,7 @@ write to common variables, see this
`code <https://github.com/parthenon-hpc-lab/parthenon/issues/659#issuecomment-1346871509>`__
for an example.


Cmake Options
-------------

@@ -86,3 +87,153 @@ GPUs.
``#pragma omp simd`` to vectorize the loop, which typically gives better
vectorization loops than ``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` on
CPUs and so is the default on CPUs.


Performance Considerations
---------------------------

Hierarchical parallelism can produce very performant code, but a
deeper awareness of how hardware is mapped to threads is required to
get optimal performance. Here we list a few strategies/considerations.

* On CPUs, with ``SIMDFOR_INNER_LOOP`` you may have trouble vectorizing
unless you help the compiler along. One way to do so is to work with
raw pointers to contiguous memory rather than with views and
strides. Even for stencil ops, pulling out pointers that represent
the different points on the stencil can help with vectorization (a
sketch follows this list).
* Similarly on CPUs, because of the cost of starting up a vector op,
vectorization is only a performance win if there is enough work in
the inner loop. A minimum of 16 points is required for the op to
vectorize at all, and experience shows that at least 64 points are
needed to see big wins. One strategy for providing enough vector
cells in the inner loop is to do a 1D ``SIMDFOR`` inner loop but
combine the ``j`` and ``i`` indices by simply looping over the
contiguous memory of a rasterized plane on a block.
* On GPUs, the outer loop typically maps to blocks, while the inner
loop maps to threads. To see good performance, you must provide
enough work in the inner loop to fill a streaming multiprocessor (SM
in CUDA terms; the equivalent of a Compute Unit, or CU, on AMD GPUs)
with multiple warps (wavefronts on AMD) to take advantage of
pipelining, and enough work in the outer loop to create enough
blocks to fill all SMs on the GPU divided by the number of
simultaneous streams. The number of warps in flight on the inner
loop per SM (which is related to "occupancy") increases with the
length of the inner loop and decreases with higher shared-memory
usage (scratch pad memory in Kokkos parlance; Local Data Share, or
LDS, on AMD GPUs) and higher register usage. Note that the number of
SMs and the available shared memory and registers per SM vary
between GPU architectures and especially between GPU vendors.
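
As a concrete illustration of the raw-pointer strategy in the first
bullet, consider the following minimal sketch. It is not Parthenon
API: ``in`` and ``out`` are assumed to be real-valued views whose
fastest (``i``) index is contiguous, ``ib`` is an ``IndexRange`` in
``X1``, and ``k`` and ``j`` are fixed outer indices.

.. code:: cpp

// Hypothetical sketch: help the compiler vectorize an X1 stencil by
// operating on raw pointers into contiguous memory rather than on
// strided view accesses.
Real *qm = &in(k, j, ib.s - 1); // left neighbor
Real *qc = &in(k, j, ib.s);     // center
Real *qp = &in(k, j, ib.s + 1); // right neighbor
Real *o = &out(k, j, ib.s);
#pragma omp simd
for (int i = 0; i <= ib.e - ib.s; ++i) {
  o[i] = qp[i] - 2.0 * qc[i] + qm[i]; // second-difference stencil
}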

IndexSplit
-------------

To balance the CPU vs GPU hardware considerations of hierarchical
parallelism, ``Parthenon`` provides a utility, the ``IndexSplit``
class, defined in the ``utils/index_split.hpp`` header file and
available in ``<parthenon/package.hpp>`` in the
``parthenon::package::prelude`` namespace.
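
A minimal sketch of pulling the class in through the prelude:

.. code:: cpp

// IndexSplit is exported by the package prelude.
#include <parthenon/package.hpp>
using parthenon::package::prelude::IndexSplit;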

In our experience ``IndexSplit`` is most beneficial when working with
small meshblocks on CPUs, especially in two dimensions. For small
blocks, we want vectorized operations over contiguous memory in the
innermost loop, but we also want that loop to contain enough work
for, e.g., vector ops to pay off. We have often found that the
optimal split is to fuse ``j`` and ``i`` into the inner loop and to
put ``k`` and blocks in the outer loop.

The ``IndexSplit`` class can be constructed as

.. code:: cpp

IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp, const int njp);

where ``md`` is a pointer to the ``MeshData`` object on which you want
to operate. ``domain`` specifies where in the ``MeshBlock`` you wish to
operate, for example ``IndexDomain::interior``. ``nkp`` and ``njp``
are the number of points in ``X3`` and ``X2``, respectively, that are
placed in the outer loop. All remaining points are in the inner loop;
each team will iterate over multiple ``k`` and/or ``j`` indices to
cover the specified ``k``/``j`` range. The ``MeshBlock`` index in the
pack is typically also assumed to be in the outer loop. ``nkp`` and
``njp`` also accept the special flags ``IndexSplit::all_outer`` and
``IndexSplit::no_outer``, which specify that all or none,
respectively, of the indices in that direction should be in the outer
loop.
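
For example, a minimal sketch (assuming ``md`` is a valid
``MeshData<Real>`` pointer) that puts every ``k`` index in the outer
loop and folds all of ``j`` into the contiguous inner loop:

.. code:: cpp

// All X3 indices in the outer loop; no X2 indices in the outer loop,
// so j is fused entirely into the inner loop.
IndexSplit idx_sp(md, IndexDomain::interior, IndexSplit::all_outer,
                  IndexSplit::no_outer);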

.. warning::

Note that, in contrast to ``njp``, ``nkp`` points in the
``k``-direction are not included in the innermost loop bounds. You
must loop over ``k`` by hand inside the outer loop body.

A second constructor alternatively sets the range for ``X3``, ``X2``,
and ``X1`` explicitly:

.. code:: cpp

IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
const IndexRange &ib, const int nkp, const int njp);

where ``kb``, ``jb``, and ``ib`` specify the starting and ending
indices for ``X3``, ``X2``, and ``X1``, respectively.

.. warning::

Note that, at this time, ``IndexSplit`` doesn't know about
face-centered or edge-centered data. To use ``IndexSplit`` with,
e.g., face-centered data, set the input ``IndexRange`` quantities to
match the shape for the face-centered data (e.g., with the
appropriate offsets).
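
For instance, a hypothetical sketch that widens the ``X1`` range by
one point to cover ``X1`` face-centered data (assuming ``md`` is a
valid ``MeshData<Real>`` pointer and ``nkp``/``njp`` are chosen as
discussed above):

.. code:: cpp

// Build explicit bounds that include the extra X1 face, then
// construct the split from them.
auto kb = md->GetBoundsK(IndexDomain::interior);
auto jb = md->GetBoundsJ(IndexDomain::interior);
auto ib = md->GetBoundsI(IndexDomain::interior);
IndexRange ib_face{ib.s, ib.e + 1}; // one extra point for X1 faces
IndexSplit idx_sp(md, kb, jb, ib_face, nkp, njp);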

An ``IndexSplit`` object is typically used as:

.. code:: cpp

using namespace parthenon::package::prelude;
using parthenon::ScratchPad1D;
using parthenon::IndexSplit;
using parthenon::par_for_outer;
using parthenon::par_for_inner;
using parthenon::team_mbr_t;
// Initialize index split object
IndexSplit idx_sp(md, IndexDomain::interior, nkp, njp);

// Request maximum size in i and j in the inner loop, for scratch
const int Ni = idx_sp.get_max_ni();
const int Nj = idx_sp.get_max_nj();
const int Nmax = Ni * Nj;

// single scratch array for i,j
auto scratch_size = ScratchPad1D<Real>::shmem_size(Nmax);
constexpr int scratch_level = 0;

// Par for
par_for_outer(
DEFAULT_OUTER_LOOP_PATTERN, "KernelOuter", DevExecSpace(), scratch_size,
scratch_level, 0, nblocks - 1, 0, idx_sp.outer_size() - 1,
KOKKOS_LAMBDA(team_mbr_t member, const int b, const int outer_idx) {
ScratchPad1D<Real> scratch(member.team_scratch(scratch_level), Nmax);
// Get index ranges. Note they depend on where we are in the outer index!
// These give us a sense for where we are in k,j space
const auto krange = idx_sp.GetBoundsK(outer_idx);
const auto jrange = idx_sp.GetBoundsJ(outer_idx);
// This is the range of contiguous inner memory. It may contain both i and j!
const auto flattened_inner_ijrange = idx_sp.GetInnerBounds(jrange);
const int inner_size = flattened_inner_ijrange.e - flattened_inner_ijrange.s + 1;

// Whatever part of k is not in the outer loop can be looped over
// with a normal for loop here
for (int k = krange.s; k <= krange.e; ++k) {

// pull out a pointer to some variable in some pack. Note
// we pick the 0th index of i at k and jrange.s
Real *var = &pack(b, ivar, k, jrange.s, flattened_inner_ijrange.s);

// Do something with the pointer in the inner loop.
par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, inner_size - 1,
[&](const int i) {
foo(var[i]);
});
}
});
4 changes: 3 additions & 1 deletion src/CMakeLists.txt
@@ -233,9 +233,11 @@ add_library(parthenon
utils/communication_buffer.hpp
utils/cleantypes.hpp
utils/concepts_lite.hpp
utils/error_checking.hpp
utils/error_checking.cpp
utils/error_checking.hpp
utils/hash.hpp
utils/index_split.cpp
utils/index_split.hpp
utils/indexer.hpp
utils/loop_utils.hpp
utils/morton_number.hpp
20 changes: 20 additions & 0 deletions src/interface/mesh_data.cpp
@@ -31,6 +31,26 @@ void MeshData<T>::Initialize(const MeshData<T> *src,
}
}

template <typename T>
void MeshData<T>::Set(BlockList_t blocks, Mesh *pmesh, int ndim) {
const int nblocks = blocks.size();
ndim_ = ndim;
block_data_.resize(nblocks);
SetMeshPointer(pmesh);
for (int i = 0; i < nblocks; i++) {
block_data_[i] = blocks[i]->meshblock_data.Get(stage_name_);
}
}

template <typename T>
void MeshData<T>::Set(BlockList_t blocks, Mesh *pmesh) {
int ndim;
if (pmesh != nullptr) {
ndim = pmesh->ndim;
}
Set(blocks, pmesh, ndim);
}

template class MeshData<Real>;

} // namespace parthenon
13 changes: 4 additions & 9 deletions src/interface/mesh_data.hpp
@@ -246,15 +246,8 @@ class MeshData {
}
}

void Set(BlockList_t blocks, Mesh *pmesh) {
const int nblocks = blocks.size();
block_data_.resize(nblocks);
SetMeshPointer(pmesh);
for (int i = 0; i < nblocks; i++) {
block_data_[i] = blocks[i]->meshblock_data.Get(stage_name_);
}
}

void Set(BlockList_t blocks, Mesh *pmesh, int ndim);
void Set(BlockList_t blocks, Mesh *pmesh);
void Initialize(const MeshData<T> *src, const std::vector<std::string> &names,
const bool shallow);

@@ -419,6 +412,7 @@ class MeshData {
bvars_cache_.clear();
}

int GetNDim() const { return ndim_; }
int NumBlocks() const { return block_data_.size(); }

bool operator==(MeshData<T> &cmp) const {
@@ -442,6 +436,7 @@
SparsePackCache &GetSparsePackCache() { return sparse_pack_cache_; }

private:
int ndim_;
Mesh *pmy_mesh_;
BlockDataList_t<T> block_data_;
std::string stage_name_;
2 changes: 2 additions & 0 deletions src/parthenon/package.hpp
@@ -31,6 +31,7 @@
#include <mesh/meshblock_pack.hpp>
#include <parameter_input.hpp>
#include <parthenon_manager.hpp>
#include <utils/index_split.hpp>
#include <utils/partition_stl_containers.hpp>

// Local Includes
@@ -46,6 +47,7 @@ using ::parthenon::ApplicationInput;
using ::parthenon::BlockList_t;
using ::parthenon::DevExecSpace;
using ::parthenon::HostExecSpace;
using ::parthenon::IndexSplit;
using ::parthenon::Mesh;
using ::parthenon::MeshBlock;
using ::parthenon::MeshBlockPack;
128 changes: 128 additions & 0 deletions src/utils/index_split.cpp
@@ -0,0 +1,128 @@
//========================================================================================
// (C) (or copyright) 2023. Triad National Security, LLC. All rights reserved.
//
// This program was produced under U.S. Government contract 89233218CNA000001 for Los
// Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC
// for the U.S. Department of Energy/National Nuclear Security Administration. All rights
// in the program are reserved by Triad National Security, LLC, and the U.S. Department
// of Energy/National Nuclear Security Administration. The Government is granted for
// itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide
// license in this material to reproduce, prepare derivative works, distribute copies to
// the public, perform publicly and display publicly, and to permit others to do so.
//========================================================================================

#include <algorithm>

#include <Kokkos_Core.hpp>

#include "utils/index_split.hpp"

#include "basic_types.hpp"
#include "defs.hpp"
#include "globals.hpp"
#include "interface/mesh_data.hpp"
#include "kokkos_abstraction.hpp"
#include "mesh/domain.hpp"
#include "mesh/mesh.hpp"

namespace parthenon {

struct DummyFunctor {
DummyFunctor() = default;
KOKKOS_INLINE_FUNCTION
void operator()(team_mbr_t team_member) const {}
};

IndexSplit::IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
const IndexRange &ib, const int nkp, const int njp)
: nghost_(Globals::nghost), nkp_(nkp), njp_(njp), kbs_(kb.s), jbs_(jb.s), ibs_(ib.s),
ibe_(ib.e) {
Init(md, kb.e, jb.e);
ndim_ = md->GetNDim();
}

IndexSplit::IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp,
const int njp)
: nghost_(Globals::nghost), nkp_(nkp), njp_(njp) {
auto ib = md->GetBoundsI(domain);
auto jb = md->GetBoundsJ(domain);
auto kb = md->GetBoundsK(domain);
kbs_ = kb.s;
jbs_ = jb.s;
ibs_ = ib.s;
ibe_ = ib.e;
Init(md, kb.e, jb.e);
ndim_ = md->GetNDim();
}

void IndexSplit::Init(MeshData<Real> *md, const int kbe, const int jbe) {
const int total_k = kbe - kbs_ + 1;
const int total_j = jbe - jbs_ + 1;
const int total_i = ibe_ - ibs_ + 1;

// Compute max parallelism (at outer loop level) from Kokkos
// equivalent to NSMS in Kokkos
// TODO(JMM): I'm not sure if this is really the best way to do
// this. Based on discussion on Kokkos slack.
#ifdef KOKKOS_ENABLE_CUDA
const auto space = DevExecSpace();
team_policy policy(space, (md->NumBlocks()) * total_k, Kokkos::AUTO);
// JMM: In principle, should pass a realistic functor here. Using a
// dummy because we don't know what's available.
// TODO(JMM): Should we expose the functor?
policy.set_scratch_size(1, Kokkos::PerTeam(sizeof(Real) * total_i * total_j));
const int nteams =
policy.team_size_recommended(DummyFunctor(), Kokkos::ParallelForTag());
concurrency_ = space.concurrency() / nteams;
#else
concurrency_ = 1;
#endif // KOKKOS_ENABLE_CUDA

if (nkp_ == all_outer)
nkp_ = total_k;
else if (nkp_ == no_outer)
nkp_ = 1;
if (njp_ == all_outer)
njp_ = total_j;
else if (njp_ == no_outer)
njp_ = 1;

if (nkp_ == 0) {
#ifdef KOKKOS_ENABLE_CUDA
nkp_ = total_k;
#else
nkp_ = 1;
#endif
} else if (nkp_ > total_k) {
nkp_ = total_k;
}
if (njp_ == 0) {
#ifdef KOKKOS_ENABLE_CUDA
// From Forrest Glines:
// nkp_ * njp_ >= number of SMs / number of streams
// => njp_ >= SMS / streams / NKP
njp_ = std::min(concurrency_ / (NSTREAMS_ * nkp_), total_j);
#else
njp_ = 1;
#endif
} else if (njp_ > total_j) {
njp_ = total_j;
}

// add a tiny bit to avoid round-off issues when we ultimately convert to int
// JMM: Do NOT cast these to integers here. The casting happens later.
// These being doubles is necessary for proper interleaving of work.
target_k_ = (1.0 * total_k) / nkp_ + 1.e-6;
target_j_ = (1.0 * total_j) / njp_ + 1.e-6;

// save the "entire" ranges
// don't bother saving ".s" since it's always zero
auto ib = md->GetBoundsI(IndexDomain::entire);
auto jb = md->GetBoundsJ(IndexDomain::entire);
auto kb = md->GetBoundsK(IndexDomain::entire);
kbe_entire_ = kb.e;
jbe_entire_ = jb.e;
ibe_entire_ = ib.e;
}

} // namespace parthenon