Parthenon hangs at the end of simulation #1193

Open
pgrete opened this issue Oct 17, 2024 · 7 comments
@pgrete

pgrete commented Oct 17, 2024

Observed by @BenWibking on Stampede3 and on a Mac and by myself on a Linux workstation.

Sims run fine and then hang after printing

Driver completed.
time=1.50e-01 cycle=35
tlim=1.50e-01 nlim=100000

walltime used = 4.23e+00
zone-cycles/wallsecond = 1.49e+06

The last output also seems to have been written completely.

@pgrete pgrete added the bug Something isn't working label Oct 17, 2024
@BenWibking

BenWibking commented Oct 17, 2024

It hangs inside Kokkos::Impl::deallocate inside parthenon::Mesh::~Mesh:

[lines deleted]
    frame #34: 0x000000010021affc athenaPK`void Kokkos::Impl::deallocate<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, parthenon::BndInfo, false>>(record_ptr=0x0000600001a5ccf0) at Kokkos_SharedAlloc.hpp:382:18 [opt]
[lines deleted]
    frame #48: 0x0000000100267294 athenaPK`parthenon::MeshData<double>::~MeshData() [inlined] parthenon::BvarsCache_t::~BvarsCache_t(this=<unavailable>) at bnd_info.hpp:190:8 [opt]
[lines deleted]
    frame #65: 0x000000010036abc8 athenaPK`parthenon::Mesh::~Mesh(this=0x000000012281d400) at mesh.cpp:388:1 [opt]
    frame #66: 0x00000001003d435c athenaPK`parthenon::ParthenonManager::ParthenonFinalize() [inlined] std::__1::default_delete<parthenon::Mesh>::operator()[abi:v160006](this=<unavailable>, __ptr=<unavailable>) const at unique_ptr.h:65:5 [opt]
    frame #67: 0x00000001003d4358 athenaPK`parthenon::ParthenonManager::ParthenonFinalize() [inlined] std::__1::unique_ptr<parthenon::Mesh, std::__1::default_delete<parthenon::Mesh>>::reset[abi:v160006](this=<unavailable>, __p=0x0000000000000000) at unique_ptr.h:297:7 [opt]
    frame #68: 0x00000001003d434c athenaPK`parthenon::ParthenonManager::ParthenonFinalize(this=<unavailable>) at parthenon_manager.cpp:232:9 [opt]
    frame #69: 0x0000000100002510 athenaPK`main(argc=<unavailable>, argv=<unavailable>) at main.cpp:127:8 [opt]
    frame #70: 0x00000001935bc274 dyld`start + 2840

Full backtrace from lldb:
backtrace.txt

@BenWibking

BenWibking commented Oct 17, 2024

Appears to be a Kokkos regression introduced in Kokkos 4.4.0 (also present in Kokkos 4.4.01). If I swap out the current Kokkos submodule for Kokkos 4.3.01, it finalizes successfully.

@BenWibking

@pgrete Maybe we can revert to Kokkos 4.3.01?

@pgrete

pgrete commented Oct 18, 2024

I suspect that this is not a Kokkos regression but something on our end. Any idea, @lroberts36 (as it seems to point to the buffer cache)?

So before changing/downgrading the Kokkos version, I'd like to spend a little time checking whether this can be fixed easily in Parthenon itself.

@pgrete

pgrete commented Oct 18, 2024

I just asked on the Kokkos Slack where we should look first.

@pgrete

pgrete commented Oct 18, 2024

Slide 47 in https://github.com/kokkos/kokkos-tutorials/blob/main/Other/ReleaseBriefings/release-44.pdf: "Otherwise, you program may hang when you upgrade to 4.4" <- Does that sound familiar?

So it's very likely on us.
We were also pointed towards https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#can-i-make-a-view-of-views and kokkos/kokkos#7229 and kokkos/kokkos-tools#267

I won't be able to look at this today. We might be able to coordinate fixing this as part of the hackathon next week (as we're touching stuff around the buffers anyway).
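
For readers unfamiliar with the view-of-views issue: the Kokkos programming guide linked above recommends allocating the outer View without initialization, constructing the inner elements explicitly on the host, and destroying them explicitly before the outer View is released. The sketch below illustrates only that lifetime pattern in plain C++, without using Kokkos. `Inner`, `OuterStorage`, `live_inner_count`, and `run_lifetime_demo` are hypothetical stand-ins invented for illustration, not Parthenon or Kokkos names, and this is not the fix that was actually pushed.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Counts currently constructed Inner objects (illustration only).
static int live_inner_count = 0;

// Hypothetical stand-in for an inner View: owns a buffer of its own.
struct Inner {
  std::vector<double> data;
  explicit Inner(std::size_t n) : data(n, 0.0) { ++live_inner_count; }
  ~Inner() { --live_inner_count; }
};

// Hypothetical stand-in for an outer View allocated "without initializing":
// it owns raw bytes only and never runs element constructors or destructors.
class OuterStorage {
 public:
  explicit OuterStorage(std::size_t n)
      : n_(n), raw_(static_cast<Inner *>(::operator new(n * sizeof(Inner)))) {}
  ~OuterStorage() { ::operator delete(raw_); }  // releases bytes, nothing else
  OuterStorage(const OuterStorage &) = delete;
  OuterStorage &operator=(const OuterStorage &) = delete;

  Inner *slot(std::size_t i) { return raw_ + i; }
  std::size_t size() const { return n_; }

 private:
  std::size_t n_;
  Inner *raw_;
};

// Returns {live count after construction, live count after explicit destruction}.
std::pair<int, int> run_lifetime_demo(std::size_t n) {
  OuterStorage outer(n);

  // Construct each inner element explicitly with placement new.
  for (std::size_t i = 0; i < outer.size(); ++i) {
    new (outer.slot(i)) Inner(8);
  }
  const int after_construct = live_inner_count;

  // Destroy each inner element explicitly, on the host, *before* the outer
  // storage goes away, so the outer deallocation never has to do it.
  for (std::size_t i = 0; i < outer.size(); ++i) {
    outer.slot(i)->~Inner();
  }
  const int after_destroy = live_inner_count;

  // All inner objects must be gone before the outer storage is released.
  assert(after_destroy == 0);
  return {after_construct, after_destroy};
}
```

Calling `run_lifetime_demo(4)` constructs four inner objects and tears them down again before `outer` goes out of scope. The point of the pattern is that the owner of the outer container, not the container's deallocation routine, is responsible for the inner objects' lifetimes.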

@pgrete

pgrete commented Oct 18, 2024

Actually, I just tried it (the view of view debug tool is quite handy -- thanks @dalg24) and pushed a fix to #1191 (b4ab05f).
Let's see what the pipelines say (at least downstream it seemed to work, but I might have missed a view of view).
