Regression test fails on Spock/Crusher/Frontier/HIP #659
All tests pass for non-MPI runs. |
@lroberts36 I think I identified that the root cause of the issue is related to sparse variables. When I run the regression test manually, it fails at random points (sometimes cycle 4, sometimes cycle 153, sometimes during init) with the following error. I still don't know what the root cause is, i.e., whether it's a bug in the sparse machinery or a bug/problem with HIP/AMD GPUs.
|
I wonder if this is related to a failure to identify whether or not a pack should be rebuilt. @lroberts36 and I discussed this (and a possible solution) at some point. I think there's a version in riot with the fix.
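For reference, here is a minimal, host-only sketch of the kind of check being discussed: a cached pack is reused only while the sparse allocation status it was built from is unchanged, and is rebuilt otherwise. All names (PackCache, AllocStatus, BlockPack, BuildPack) are illustrative, not Parthenon's actual API.

```c++
#include <vector>

// One allocation flag per (block, variable); illustrative only.
using AllocStatus = std::vector<bool>;

struct BlockPack {
  // In the real code this would hold device-side views over the allocated variables.
};

BlockPack BuildPack(const AllocStatus & /*status*/) {
  // Expensive step: (re)collect views for the currently allocated variables.
  return BlockPack{};
}

class PackCache {
 public:
  const BlockPack &Get(const AllocStatus &current) {
    // Rebuild only when the allocation status changed since the pack was built.
    if (!valid_ || current != built_from_) {
      pack_ = BuildPack(current);
      built_from_ = current;
      valid_ = true;
    }
    return pack_;
  }

 private:
  bool valid_ = false;
  AllocStatus built_from_;
  BlockPack pack_;
};

int main() {
  PackCache cache;
  AllocStatus status(8, false);
  const auto &p1 = cache.Get(status);  // builds the pack
  const auto &p2 = cache.Get(status);  // reuses the cached pack
  status[3] = true;                    // a variable was (de)allocated, so the status changed
  const auto &p3 = cache.Get(status);  // triggers a rebuild
  (void)p1;
  (void)p2;
  (void)p3;
  return 0;
}
```

If such a status comparison is skipped or stale, the pack keeps pointing at views for variables whose allocation state has changed, which is the kind of failure mode hypothesized above.
|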
Tried to reproduce on a Darwin node but was unable to reproduce the issue. |
I also tried to reproduce on Spock with MI100 and was not able to. Now the question is whether this is an issue of performance (MI250[x] vs slower devices), libs or hardware... |
Alright, "good" news. I got one more data point. On a different machine and a different environment the test similarly fails on MI250X (so we can rule out the OLCF environment). |
Also #699 does not fix the issue. |
Alright, here's what went wrong after a long, joint debugging session with @Yurlungur. The offending kernel was related to checking/setting the allocation status:

```c++
#include <Kokkos_Core.hpp>

#include <cmath>
#include <cstddef>
#include <iostream>

auto main(int argc, char *argv[]) -> int {
  Kokkos::initialize(argc, argv);
  {
    const size_t num_b = 26 * 4;
    Kokkos::View<bool *> nonzero_flags("nonzero_flags", num_b);
    const size_t num_idx = 16;
    Kokkos::View<double **> work("work", num_b, num_idx);
    Kokkos::parallel_for(
        "set flags", Kokkos::TeamPolicy<>(num_b, Kokkos::AUTO),
        KOKKOS_LAMBDA(const Kokkos::TeamPolicy<>::member_type &member) {
          const int b = member.league_rank();
          nonzero_flags(b) = false;
          Kokkos::parallel_for(Kokkos::TeamThreadRange<>(member, num_idx),
                               [&](const int idx) {
                                 work(b, idx) = 1.0;
                                 // Always true, so every flag should end up set.
                                 if (std::abs(work(b, idx)) >= 0.0) {
                                   nonzero_flags(b) = true;
                                 }
                               });
        });
    auto nonzero_flags_h =
        Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), nonzero_flags);
    for (size_t b = 0; b < num_b; b++) {
      if (!nonzero_flags_h(b)) {
        std::cerr << "HEEEEEEEEEEEEEELP!!!\n";
      }
    }
  }
  Kokkos::finalize();
}
```

There are actually two issues:
|
…tus) (#800)

* Fixes #659
* Replace lambda with if
* Typo: BOr to LOr
* Apply suggestions from code review

Co-authored-by: Luke Roberts <[email protected]>
Co-authored-by: Jonah Miller <[email protected]>

* format
* Fix dealloc

Co-authored-by: Luke Roberts <[email protected]>
Co-authored-by: Jonah Miller <[email protected]>
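For context on the "Typo: BOr to LOr" item in the commit above: Kokkos::BOr is the bitwise-or reducer, while Kokkos::LOr is the logical-or reducer, which is the appropriate choice for boolean "any element exceeds the threshold" style checks. Below is a standalone sketch of such a reduction; it is illustrative only and not the Parthenon kernel.

```c++
#include <Kokkos_Core.hpp>

#include <cmath>
#include <cstdio>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1024;
    Kokkos::View<double *> data("data", n);  // zero-initialized by Kokkos
    const double threshold = 1.0e-12;

    bool any_above = false;
    Kokkos::parallel_reduce(
        "any_above_threshold", Kokkos::RangePolicy<>(0, n),
        KOKKOS_LAMBDA(const int i, bool &lany) {
          // Logical-or reduction: the result becomes true if any value exceeds the threshold.
          if (std::abs(data(i)) > threshold) lany = true;
        },
        Kokkos::LOr<bool>(any_above));

    std::printf("any_above = %s\n", any_above ? "true" : "false");
  }
  Kokkos::finalize();
  return 0;
}
```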
Serial runs seem to work now; the test suite passed 10 times in a row. With MPI, less so.
|
```c++
Kokkos::parallel_for(
    "SparseDealloc",
    Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), pack.GetNBlocks(), Kokkos::AUTO),
    KOKKOS_LAMBDA(parthenon::team_mbr_t team_member) {
      const int b = team_member.league_rank();
      const int lo = pack.GetLowerBound(b);
      const int hi = pack.GetUpperBound(b);
      for (int v = lo; v <= hi; ++v) {
        const auto &var = pack(b, v);
        const Real threshold = var.deallocation_threshold;
        bool all_zero = true;
        // Team-wide logical-and: all_zero stays true only if no cell of var
        // exceeds the deallocation threshold.
        Kokkos::parallel_reduce(
            Kokkos::TeamThreadRange<>(team_member, NkNjNi),
            [&](const int idx, bool &lall_zero) {
              const int k = kb.s + idx / NjNi;
              const int j = jb.s + (idx % NjNi) / Ni;
              const int i = ib.s + idx % Ni;
              if (std::abs(var(k, j, i)) > threshold) {
                lall_zero = false;
                return;
              }
            },
            Kokkos::LAnd<bool, DevMemSpace>(all_zero));
        // One thread per team records the result for this (block, variable).
        Kokkos::single(Kokkos::PerTeam(team_member),
                       [&]() { is_zero(b, v) = all_zero; });
      }
    });
```

@lroberts36 what does this check mean for a vector or tensor sparse variable (the fourth index is extracted from the pack, but the threshold is defined for the variable as a whole)? |
Good news! Looks like this was actually a (fixed) Kokkos bug. Independently, the question above on the SparseDealloc kernel remains open. |
@pgrete: I believe this should work as expected (i.e., if any component of a vector- or tensor-valued field is above the threshold, it should not trigger deallocation). This loop checks every (scalar/vector/tensor) element of the control fields for being above zero and stores the result in is_zero.
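To make that criterion concrete, here is a minimal host-only illustration (plain C++, not the Parthenon kernel; names and layout are illustrative): a variable is flagged for deallocation only if every component at every cell is at or below the threshold, so a single component above it keeps the variable allocated.

```c++
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative only: `values` holds all components of one variable on one block,
// flattened over (component, k, j, i).
bool FlagForDeallocation(const std::vector<double> &values, double threshold) {
  for (double v : values) {
    if (std::abs(v) > threshold) {
      return false;  // any component above the threshold keeps the variable allocated
    }
  }
  return true;  // everything at or below the threshold -> candidate for deallocation
}

int main() {
  const double threshold = 1.0e-12;
  std::vector<double> vector_field(3 * 16, 0.0);  // e.g. 3 components x 16 cells
  std::printf("all zero     -> dealloc candidate? %d\n",
              FlagForDeallocation(vector_field, threshold));
  vector_field[5] = 1.0;  // a single component of a single cell becomes non-zero
  std::printf("one non-zero -> dealloc candidate? %d\n",
              FlagForDeallocation(vector_field, threshold));
  return 0;
}
```
|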
Fixed in #805 |