IndexSplit for hierarchical parallelism #981

Merged Dec 19, 2023 (37 commits)

Commits
1b8681a
Add indexsplit
jonahm-LANL Dec 9, 2023
fc9c3b4
add docs
jonahm-LANL Dec 11, 2023
81859b4
index split
jonahm-LANL Dec 11, 2023
fd01483
changelog
jonahm-LANL Dec 11, 2023
ccea896
Merge branch 'develop' into jmm/indexsplit
Yurlungur Dec 13, 2023
faa65e7
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 13, 2023
36df614
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 13, 2023
ca31631
Update tst/unit/test_index_split.cpp
Yurlungur Dec 14, 2023
ddd8b53
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 14, 2023
b64f348
Update src/utils/index_split.cpp
Yurlungur Dec 14, 2023
6d8c2d0
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 14, 2023
ad4edcd
some of forrests comments
jonahm-LANL Dec 14, 2023
2da13f8
merge
jonahm-LANL Dec 14, 2023
f082bec
tweak docs
jonahm-LANL Dec 14, 2023
e9b8052
Merge branch 'develop' into jmm/indexsplit
Yurlungur Dec 14, 2023
3c5186a
add CPU note
jonahm-LANL Dec 14, 2023
dfd33dd
Add Kokkos-based concurrency test
jonahm-LANL Dec 14, 2023
6be82ef
format
jonahm-LANL Dec 14, 2023
e747613
add warning about face-centered data
jonahm-LANL Dec 14, 2023
e43901d
NSMS not real
jonahm-LANL Dec 14, 2023
83fafdf
Merge branch 'develop' into jmm/indexsplit
Yurlungur Dec 14, 2023
4c9db82
DummyFunctor
jonahm-LANL Dec 14, 2023
99bb599
Merge branch 'jmm/indexsplit' of github.com:parthenon-hpc-lab/parthen…
jonahm-LANL Dec 14, 2023
4c58afd
try to get this dummy functor to work
jonahm-LANL Dec 15, 2023
475b242
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 15, 2023
39d147a
add fglines requested test. Fix bug introduced in review process
jonahm-LANL Dec 15, 2023
85d27cf
Update doc/sphinx/src/nested_par_for.rst
Yurlungur Dec 16, 2023
15205ad
lroberts comments
jonahm-LANL Dec 16, 2023
0f68805
merge
jonahm-LANL Dec 16, 2023
61819e8
nstreams needs underscore
jonahm-LANL Dec 16, 2023
e9897a4
Remove shadowing warning from calculate pi
jonahm-LANL Dec 19, 2023
5442738
remove unused variable
jonahm-LANL Dec 19, 2023
7885037
Swap total_work tests to use standard par_reduce, as otherwise the pa…
jonahm-LANL Dec 19, 2023
6c92301
formatting
jonahm-LANL Dec 19, 2023
abbe863
Default loop pattern doesn't work for reductions... Of course.
jonahm-LANL Dec 19, 2023
e03e72c
merge
jonahm-LANL Dec 19, 2023
da1a27d
formatting... agian
jonahm-LANL Dec 19, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -3,6 +3,7 @@
## Current develop

### Added (new features/APIs/variables/...)
- [[PR 981]](https://github.com/parthenon-hpc-lab/parthenon/pull/981) Add IndexSplit
- [[PR 983]](https://github.com/parthenon-hpc-lab/parthenon/pull/983) Add Contains to SparsePack
- [[PR 968]](https://github.com/parthenon-hpc-lab/parthenon/pull/968) Add per package registration of boundary conditions
- [[PR 948]](https://github.com/parthenon-hpc-lab/parthenon/pull/948) Add solver interface and update Poisson geometric multi-grid example
153 changes: 152 additions & 1 deletion doc/sphinx/src/nested_par_for.rst
@@ -51,7 +51,7 @@ Data type for memory in scratch pad/cache memory. Use
documentation <https://kokkos.github.io/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html?highlight=hierarchical>`__
for determining scratch pad memory needs before kernel launch.

Important usage hints
On Barriers
---------------------

In order to ensure that individual threads of a team are synchronized
@@ -70,6 +70,7 @@ write to common variables, see this
`code <https://github.com/parthenon-hpc-lab/parthenon/issues/659#issuecomment-1346871509>`__
for an example.


Cmake Options
-------------

@@ -86,3 +87,153 @@ GPUs.
``#pragma omp simd`` to vectorize the loop, which typically gives better
vectorization loops than ``PAR_LOOP_INNER_LAYOUT=TVR_INNER_LOOP`` on
CPUs and so is the default on CPUs.


Performance Considerations
---------------------------

Hierarchical parallelism can produce very performant code, but a
deeper awareness of how hardware is mapped to threads is required to
get optimal performance. Here we list a few strategies/considerations.

* On CPUs, with ``SIMDFOR_INNER_LOOP`` you may have trouble vectorizing
unless you help the compiler along. One way to do so is to work with
raw pointers to contiguous memory rather than with views and
strides. Even for stencil ops, pulling out pointers that represent
the different points on the stencil can help with vectorization (a
sketch follows this list).
* Similarly on CPUs, because of the cost of starting up a vector op,
vectorization is only a performance win if there is enough work in
the inner loop. A minimum of 16 points is required for the op to
vectorize at all, and experience shows that at least 64 points are
needed to see big wins. One strategy for providing enough vector
cells in the inner loop is to do a 1D ``SIMDFOR`` inner loop but
combine the ``j`` and ``i`` indices by simply looping over the
contiguous memory of a rasterized plane on a block.
* On GPUs, the outer loop typically maps to blocks, while the inner
loop maps to threads. To see good performance, you must provide
enough work in the inner loop to fill a streaming multiprocessor (SM
in CUDA terms; the equivalent of a Compute Unit, or CU, on AMD GPUs)
with multiple warps (wavefronts on AMD) to take advantage of
pipelining, and enough work in the outer loop to create enough
blocks to fill all SMs on the GPU divided by the number of
simultaneous streams. The number of warps in flight on the inner
loop per SM (which is related to "occupancy") increases with the
length of the inner loop and decreases with higher shared-memory
usage (scratch pad memory in Kokkos parlance; Local Data Share, or
LDS, on AMD GPUs) and higher register usage. Note that the number of
SMs and the available shared memory and registers per SM vary
between GPU architectures and especially between GPU vendors.
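
As a concrete illustration of the raw-pointer strategy in the first
bullet, consider the following minimal sketch. It is not Parthenon
API: ``in`` and ``out`` are assumed to be real-valued views whose
fastest (``i``) index is contiguous, ``ib`` is an ``IndexRange`` in
``X1``, and ``k`` and ``j`` are fixed outer indices.

.. code:: cpp

// Hypothetical sketch: help the compiler vectorize an X1 stencil by
// operating on raw pointers into contiguous memory rather than on
// strided view accesses.
Real *qm = &in(k, j, ib.s - 1); // left neighbor
Real *qc = &in(k, j, ib.s);     // center
Real *qp = &in(k, j, ib.s + 1); // right neighbor
Real *o = &out(k, j, ib.s);
#pragma omp simd
for (int i = 0; i <= ib.e - ib.s; ++i) {
  o[i] = qp[i] - 2.0 * qc[i] + qm[i]; // second-difference stencil
}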

IndexSplit
-------------

To balance the CPU vs GPU hardware considerations of hierarchical
parallelism, ``Parthenon`` provides a utility, the ``IndexSplit``
class, defined in the ``utils/index_split.hpp`` header file and
available in ``<parthenon/package.hpp>`` in the
``parthenon::package::prelude`` namespace.
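
A minimal sketch of pulling the class in through the prelude:

.. code:: cpp

// IndexSplit is exported by the package prelude.
#include <parthenon/package.hpp>
using parthenon::package::prelude::IndexSplit;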

In our experience ``IndexSplit`` is most beneficial when working with
small meshblocks on CPUs, especially in two dimensions. For small
blocks, we want vectorized operations over contiguous memory in the
innermost loop, but we also want that loop to contain enough work
for, e.g., vector ops to pay off. We have often found that the
optimal split is to fuse ``j`` and ``i`` into the inner loop and to
put ``k`` and blocks in the outer loop.

The ``IndexSplit`` class can be constructed as

.. code:: cpp

IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp, const int njp);

where ``md`` is a pointer to the ``MeshData`` object on which you want
to operate. ``domain`` specifies where in the ``MeshBlock`` you wish to
operate, for example ``IndexDomain::interior``. ``nkp`` and ``njp``
are the number of points in ``X3`` and ``X2``, respectively, that are
placed in the outer loop. All remaining points are in the inner loop;
each team will iterate over multiple ``k`` and/or ``j`` indices to
cover the specified ``k``/``j`` range. The ``MeshBlock`` index in the
pack is typically also assumed to be in the outer loop. ``nkp`` and
``njp`` also accept the special flags ``IndexSplit::all_outer`` and
``IndexSplit::no_outer``, which specify that all or none,
respectively, of the indices in that direction should be in the outer
loop.
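
For example, a minimal sketch (assuming ``md`` is a valid
``MeshData<Real>`` pointer) that puts every ``k`` index in the outer
loop and folds all of ``j`` into the contiguous inner loop:

.. code:: cpp

// All X3 indices in the outer loop; no X2 indices in the outer loop,
// so j is fused entirely into the inner loop.
IndexSplit idx_sp(md, IndexDomain::interior, IndexSplit::all_outer,
                  IndexSplit::no_outer);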

.. warning::

Note that, in contrast to ``njp``, ``nkp`` points in the
``k``-direction are not included in the innermost loop bounds. You
must loop over ``k`` by hand inside the outer loop body.

A second constructor alternatively sets the range for ``X3``, ``X2``,
and ``X1`` explicitly:

.. code:: cpp

IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
const IndexRange &ib, const int nkp, const int njp);

where ``kb``, ``jb``, and ``ib`` specify the starting and ending
indices for ``X3``, ``X2``, and ``X1``, respectively.

.. warning::

Note that, at this time, ``IndexSplit`` doesn't know about
face-centered or edge-centered data. To use ``IndexSplit`` with,
e.g., face-centered data, set the input ``IndexRange`` quantities to
match the shape for the face-centered data (e.g., with the
appropriate offsets).
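
For instance, a hypothetical sketch that widens the ``X1`` range by
one point to cover ``X1`` face-centered data (assuming ``md`` is a
valid ``MeshData<Real>`` pointer and ``nkp``/``njp`` are chosen as
discussed above):

.. code:: cpp

// Build explicit bounds that include the extra X1 face, then
// construct the split from them.
auto kb = md->GetBoundsK(IndexDomain::interior);
auto jb = md->GetBoundsJ(IndexDomain::interior);
auto ib = md->GetBoundsI(IndexDomain::interior);
IndexRange ib_face{ib.s, ib.e + 1}; // one extra point for X1 faces
IndexSplit idx_sp(md, kb, jb, ib_face, nkp, njp);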

An ``IndexSplit`` object is typically used as:

.. code:: cpp

using namespace parthenon::package::prelude;
using parthenon::ScratchPad1D;
using parthenon::IndexSplit;
using parthenon::par_for_outer;
using parthenon::par_for_inner;
using parthenon::team_mbr_t;
// Initialize index split object
IndexSplit idx_sp(md, IndexDomain::interior, nkp, njp);

// Request maximum size in i and j in the inner loop, for scratch
const int Ni = idx_sp.get_max_ni();
const int Nj = idx_sp.get_max_nj();
const int Nmax = Ni * Nj;

// single scratch array for i,j
auto scratch_size = ScratchPad1D<Real>::shmem_size(Nmax);
constexpr int scratch_level = 0;

// Par for
par_for_outer(
DEFAULT_OUTER_LOOP_PATTERN, "KernelOuter", DevExecSpace(), scratch_size,
scratch_level, 0, nblocks - 1, 0, idx_sp.outer_size() - 1,
KOKKOS_LAMBDA(team_mbr_t member, const int b, const int outer_idx) {
ScratchPad1D<Real> scratch(member.team_scratch(scratch_level), Nmax);
// Get index ranges. Note they depend on where we are in the outer index!
// These give us a sense for where we are in k,j space
const auto krange = idx_sp.GetBoundsK(outer_idx);
const auto jrange = idx_sp.GetBoundsJ(outer_idx);
// This is the range of contiguous inner memory. It may contain both i and j!
const auto flattened_inner_ijrange = idx_sp.GetInnerBounds(jrange);
const int inner_size = flattened_inner_ijrange.e - flattened_inner_ijrange.s + 1;

// Whatever part of k is not in the outer loop can be looped over
// with a normal for loop here
for (int k = krange.s; k <= krange.e; ++k) {

// pull out a pointer to some variable in some pack. Note
// we pick the 0th index of i at k and jrange.s
Real *var = &pack(b, ivar, k, jrange.s, flattened_inner_ijrange.s);

// Do something with the pointer in the inner loop.
par_for_inner(DEFAULT_INNER_LOOP_PATTERN, member, 0, inner_size - 1,
[&](const int i) {
foo(var[i]);
});
}
});
4 changes: 3 additions & 1 deletion src/CMakeLists.txt
@@ -233,9 +233,11 @@ add_library(parthenon
utils/communication_buffer.hpp
utils/cleantypes.hpp
utils/concepts_lite.hpp
utils/error_checking.hpp
utils/error_checking.cpp
utils/error_checking.hpp
utils/hash.hpp
utils/index_split.cpp
utils/index_split.hpp
utils/indexer.hpp
utils/loop_utils.hpp
utils/morton_number.hpp
20 changes: 20 additions & 0 deletions src/interface/mesh_data.cpp
@@ -31,6 +31,26 @@ void MeshData<T>::Initialize(const MeshData<T> *src,
}
}

template <typename T>
void MeshData<T>::Set(BlockList_t blocks, Mesh *pmesh, int ndim) {
const int nblocks = blocks.size();
ndim_ = ndim;
block_data_.resize(nblocks);
SetMeshPointer(pmesh);
for (int i = 0; i < nblocks; i++) {
block_data_[i] = blocks[i]->meshblock_data.Get(stage_name_);
}
}

template <typename T>
void MeshData<T>::Set(BlockList_t blocks, Mesh *pmesh) {
int ndim;
if (pmesh != nullptr) {
ndim = pmesh->ndim;
}
Set(blocks, pmesh, ndim);
}

template class MeshData<Real>;

} // namespace parthenon
13 changes: 4 additions & 9 deletions src/interface/mesh_data.hpp
@@ -246,15 +246,8 @@ class MeshData {
}
}

void Set(BlockList_t blocks, Mesh *pmesh) {
const int nblocks = blocks.size();
block_data_.resize(nblocks);
SetMeshPointer(pmesh);
for (int i = 0; i < nblocks; i++) {
block_data_[i] = blocks[i]->meshblock_data.Get(stage_name_);
}
}

void Set(BlockList_t blocks, Mesh *pmesh, int ndim);
void Set(BlockList_t blocks, Mesh *pmesh);
void Initialize(const MeshData<T> *src, const std::vector<std::string> &names,
const bool shallow);

@@ -419,6 +412,7 @@ class MeshData {
bvars_cache_.clear();
}

int GetNDim() const { return ndim_; }
int NumBlocks() const { return block_data_.size(); }

bool operator==(MeshData<T> &cmp) const {
@@ -442,6 +436,7 @@
SparsePackCache &GetSparsePackCache() { return sparse_pack_cache_; }

private:
int ndim_;
Mesh *pmy_mesh_;
BlockDataList_t<T> block_data_;
std::string stage_name_;
2 changes: 2 additions & 0 deletions src/parthenon/package.hpp
@@ -31,6 +31,7 @@
#include <mesh/meshblock_pack.hpp>
#include <parameter_input.hpp>
#include <parthenon_manager.hpp>
#include <utils/index_split.hpp>
#include <utils/partition_stl_containers.hpp>

// Local Includes
@@ -46,6 +47,7 @@ using ::parthenon::ApplicationInput;
using ::parthenon::BlockList_t;
using ::parthenon::DevExecSpace;
using ::parthenon::HostExecSpace;
using ::parthenon::IndexSplit;
using ::parthenon::Mesh;
using ::parthenon::MeshBlock;
using ::parthenon::MeshBlockPack;
128 changes: 128 additions & 0 deletions src/utils/index_split.cpp
@@ -0,0 +1,128 @@
//========================================================================================
// (C) (or copyright) 2023. Triad National Security, LLC. All rights reserved.
//
// This program was produced under U.S. Government contract 89233218CNA000001 for Los
// Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC
// for the U.S. Department of Energy/National Nuclear Security Administration. All rights
// in the program are reserved by Triad National Security, LLC, and the U.S. Department
// of Energy/National Nuclear Security Administration. The Government is granted for
// itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide
// license in this material to reproduce, prepare derivative works, distribute copies to
// the public, perform publicly and display publicly, and to permit others to do so.
//========================================================================================

#include <algorithm>

#include <Kokkos_Core.hpp>

#include "utils/index_split.hpp"

#include "basic_types.hpp"
#include "defs.hpp"
#include "globals.hpp"
#include "interface/mesh_data.hpp"
#include "kokkos_abstraction.hpp"
#include "mesh/domain.hpp"
#include "mesh/mesh.hpp"

namespace parthenon {

struct DummyFunctor {
DummyFunctor() = default;
KOKKOS_INLINE_FUNCTION
void operator()(team_mbr_t team_member) const {}
};

IndexSplit::IndexSplit(MeshData<Real> *md, const IndexRange &kb, const IndexRange &jb,
const IndexRange &ib, const int nkp, const int njp)
: nghost_(Globals::nghost), nkp_(nkp), njp_(njp), kbs_(kb.s), jbs_(jb.s), ibs_(ib.s),
ibe_(ib.e) {
Init(md, kb.e, jb.e);
ndim_ = md->GetNDim();
}

IndexSplit::IndexSplit(MeshData<Real> *md, IndexDomain domain, const int nkp,
const int njp)
: nghost_(Globals::nghost), nkp_(nkp), njp_(njp) {
auto ib = md->GetBoundsI(domain);
auto jb = md->GetBoundsJ(domain);
auto kb = md->GetBoundsK(domain);
kbs_ = kb.s;
jbs_ = jb.s;
ibs_ = ib.s;
ibe_ = ib.e;
Init(md, kb.e, jb.e);
ndim_ = md->GetNDim();
}

void IndexSplit::Init(MeshData<Real> *md, const int kbe, const int jbe) {
const int total_k = kbe - kbs_ + 1;
const int total_j = jbe - jbs_ + 1;
const int total_i = ibe_ - ibs_ + 1;

// Compute max parallelism (at outer loop level) from Kokkos
// equivalent to NSMS in Kokkos
// TODO(JMM): I'm not sure if this is really the best way to do
// this. Based on discussion on Kokkos slack.
#ifdef KOKKOS_ENABLE_CUDA
const auto space = DevExecSpace();
team_policy policy(space, (md->NumBlocks()) * total_k, Kokkos::AUTO);
// JMM: In principle, should pass a realistic functor here. Using a
// dummy because we don't know what's available.
// TODO(JMM): Should we expose the functor?
policy.set_scratch_size(1, Kokkos::PerTeam(sizeof(Real) * total_i * total_j));
const int nteams =
policy.team_size_recommended(DummyFunctor(), Kokkos::ParallelForTag());
concurrency_ = space.concurrency() / nteams;
#else
concurrency_ = 1;
#endif // KOKKOS_ENABLE_CUDA

if (nkp_ == all_outer)
nkp_ = total_k;
else if (nkp_ == no_outer)
nkp_ = 1;
if (njp_ == all_outer)
njp_ = total_j;
else if (njp_ == no_outer)
njp_ = 1;

if (nkp_ == 0) {
#ifdef KOKKOS_ENABLE_CUDA
nkp_ = total_k;
#else
nkp_ = 1;
#endif
} else if (nkp_ > total_k) {
nkp_ = total_k;
}
if (njp_ == 0) {
#ifdef KOKKOS_ENABLE_CUDA
// From Forrest Glines:
// nkp_ * njp_ >= number of SMs / number of streams
// => njp_ >= SMS / streams / NKP
njp_ = std::min(concurrency_ / (NSTREAMS_ * nkp_), total_j);
#else
njp_ = 1;
#endif
} else if (njp_ > total_j) {
njp_ = total_j;
}

// add a tiny bit to avoid round-off issues when we ultimately convert to int
// JMM: Do NOT cast these to integers here. The casting happens later.
// These being doubles is necessary for proper interleaving of work.
target_k_ = (1.0 * total_k) / nkp_ + 1.e-6;
target_j_ = (1.0 * total_j) / njp_ + 1.e-6;

// save the "entire" ranges
// don't bother saving ".s" since it's always zero
auto ib = md->GetBoundsI(IndexDomain::entire);
auto jb = md->GetBoundsJ(IndexDomain::entire);
auto kb = md->GetBoundsK(IndexDomain::entire);
kbe_entire_ = kb.e;
jbe_entire_ = jb.e;
ibe_entire_ = ib.e;
}

} // namespace parthenon