Increase MPI timeout counter by 10x (#1243)

Merged
pgrete merged 1 commit into develop from pgrete/incr-safety-counter on Apr 17, 2025

Conversation

@pgrete (Collaborator) commented Apr 16, 2025

PR Summary

We recently got bit by this again on a busy CPU cluster.
IIRC other groups have also bumped this value in the past, so I think increasing the default is a good idea (and it does no harm).
Any objection to getting this in before the next release?
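For readers outside the project: the counter in question is an iteration budget on a communication polling loop. A minimal sketch of how such a guard typically works is below; `kMaxIterations` and `WaitForBuffers` are hypothetical names for illustration, not Parthenon's actual API.

```cpp
#include <cstdio>

// Illustrative iteration-budget guard: poll the receive buffers and give up
// once a fixed number of polls is exceeded. The names here are hypothetical,
// not Parthenon's real interface.
constexpr long kMaxIterations = 2'000'000'000L; // a 10x bump, still within int range

// Returns true once all_received() reports completion, or false if the
// iteration budget runs out first (a likely-hung communication).
bool WaitForBuffers(bool (*all_received)()) {
  for (long iter = 0; iter < kMaxIterations; ++iter) {
    if (all_received()) return true;
  }
  std::fprintf(stderr, "Possible MPI hang: iteration budget exhausted\n");
  return false;
}
```

The weakness debated in this thread is visible here: the budget is counted in loop iterations, so the wall-clock time it represents depends entirely on how fast the host hardware spins the loop.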

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (plus optionally an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur (Collaborator) left a comment

No objections from me. Hopefully the unified comms machinery can help with this long term...

@lroberts36 (Collaborator) left a comment

LGTM. That said, I do think we should move to a timer-based solution, which may be fairly easy to add to the tasking infrastructure rather than directly in the comms code (i.e., set a maximum amount of time a TaskCollection is allowed to run).
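The timer-based idea can be sketched as follows, assuming nothing about Parthenon's actual tasking API (`RunWithTimeout` and its parameters are illustrative): bound the wall time a polling loop may run instead of counting iterations, so the limit is hardware independent.

```cpp
#include <chrono>
#include <functional>

// Illustrative wall-clock timeout around a completion-polling loop. Unlike an
// iteration counter, the limit means the same thing on any hardware.
bool RunWithTimeout(const std::function<bool()> &poll_complete,
                    std::chrono::milliseconds limit) {
  const auto start = std::chrono::steady_clock::now();
  while (!poll_complete()) {
    if (std::chrono::steady_clock::now() - start > limit) {
      return false; // timed out; caller can abort with a clear error message
    }
  }
  return true;
}
```

Using `steady_clock` (rather than `system_clock`) matters here, since it is monotonic and immune to NTP adjustments mid-run.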

@c-prather (Collaborator) commented Apr 16, 2025

I propose we just remove this counter, even without a replacement in place yet. I have yet to hear a report of an actual MPI hang being prevented by this thing, and there was nothing magic about the number I bumped it to the last time(s) we hit this. The same number of iterations might take seconds on a fast CPU cluster and hours on older GPU hardware, making this method functionally useless no matter how much we tweak it. If we remove it, the worst-case scenario is that Parthenon wastes a few hours of allocation time hung, but smart job management and supervision really limit that impact.

I have incidentally had Parthenon hang and waste entire jobs, but always at initialization, filesystem access, that sort of thing, never an MPI hang caught by this code. So, as @lroberts36 mentions, the real solution is to implement timer watchdogs for any task, and set default time limits of maybe an hour for a bunch of simple things. If we don't care about thread safety, it's just a matter of kicking off a timer and implementing a callback, right?
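The "kick off a timer and implement a callback" idea can be sketched with a small watchdog class; everything here (the `Watchdog` name, `Disarm`, the callback shape) is hypothetical, not an existing Parthenon facility. The watchdog thread fires the callback unless it is disarmed before the deadline.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Minimal watchdog sketch: a background thread waits on a condition variable
// with a timeout. If Disarm() is not called before the deadline, the timeout
// callback fires; the destructor disarms and joins, so no thread leaks.
class Watchdog {
 public:
  Watchdog(std::chrono::milliseconds limit, std::function<void()> on_timeout)
      : thread_([this, limit, cb = std::move(on_timeout)] {
          std::unique_lock<std::mutex> lock(mutex_);
          // wait_for returns false if the deadline passed without a disarm
          if (!cv_.wait_for(lock, limit, [this] { return disarmed_; })) cb();
        }) {}

  void Disarm() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      disarmed_ = true;
    }
    cv_.notify_one();
  }

  ~Watchdog() {
    Disarm();
    thread_.join();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool disarmed_ = false;
  std::thread thread_; // declared last so mutex_/cv_ exist before the thread starts
};
```

The thread-safety caveat from the comment above is real: if the callback aborts the run (e.g. prints and calls `MPI_Abort`), it executes on the watchdog thread, concurrent with whatever the main thread is stuck in.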

@pgrete (Collaborator, Author) commented Apr 16, 2025

I agree that the current solution is not ideal and that a timer-based solution is very much preferable.
IIRC we even have one in place already for some other routine.

However, the current solution did help us in the past on Frontier (where even short jobs were 10k+ GPU hours).
Thus, I'm in favor of keeping this suboptimal solution for now (instead of removing it) until a clean one is in place.

@lroberts36 (Collaborator) commented

I just started adding TaskRegion timeouts in #1244; let me know if you think this sort of strategy will work for you.

@pgrete (Collaborator, Author) commented Apr 16, 2025

I think that strategy overall looks great (and is much more powerful).

Given subtle (past) challenges with threading and the ThreadPool (isn't it disabled by default at the moment?), I'd feel better if that kind of change only makes it into develop after the next release (so that it can mature/be tested more broadly).

So what shall we do with the current PR? We can just keep the number updated downstream and not merge this (though from my point of view it wouldn't hurt).

@lroberts36 (Collaborator) commented

> I think that strategy overall looks great (and is much more powerful).
>
> Given subtle (past) challenges with threading and the ThreadPool (isn't it disabled by default at the moment?), I'd feel better if that kind of change only makes it into develop after the next release (so that it can mature/be tested more broadly).

Yeah, it seemed like a simple enough change, but the threading logic makes it a bit complicated to determine whether this has some unintended consequence. I am currently playing around with testing this. It would probably be good to have some unit tests for ThreadPool.

> So what shall we do with the current PR? We can just keep the number updated downstream and not merge this (though from my point of view it wouldn't hurt).

I agree that we should just merge this for now.

@lroberts36 (Collaborator) commented

FWIW, it is trivial to add timing to SerialPool in check_task_returns. After looking at this a bit, I am not sure why *::wait() and *::check_task_returns are both needed though...
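To illustrate why adding timing there would be trivial: a deadline can be recorded when work starts and tested on every pass through the return-checking hook. The sketch below assumes nothing about Parthenon's real SerialPool; the class name, members, and the spot where `CheckDeadline` would be called are all illustrative.

```cpp
#include <chrono>
#include <stdexcept>

// Sketch of deadline tracking for a serial task pool: store an absolute
// deadline at construction and test it on every polling pass. All names
// here are hypothetical, not Parthenon's actual SerialPool interface.
class SerialPoolSketch {
 public:
  explicit SerialPoolSketch(std::chrono::seconds limit)
      : deadline_(std::chrono::steady_clock::now() + limit) {}

  // Would be invoked from something like check_task_returns() on each pass.
  void CheckDeadline() const {
    if (std::chrono::steady_clock::now() > deadline_) {
      throw std::runtime_error("SerialPool exceeded its time limit");
    }
  }

 private:
  std::chrono::steady_clock::time_point deadline_;
};
```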

@c-prather (Collaborator) commented

I'm happy merging this for the release (though maybe we should consider 100x, for the breathing room?). I was under the impression it hadn't provided much value to anyone, but if it's useful that's great.

Also agree with merging timers after release. I don't think there's any great risk of bugs with such simple implementations, but I do think setting any default limits is going to screw with downstreams (e.g., say that mostly you want TaskRegions to complete in <1 second, but your file output region takes >10 minutes, that sort of thing). And I would suggest we set some default limits with good error messages, rather than rely on users to lay out all of their own timings.

@pgrete (Collaborator, Author) commented Apr 17, 2025

Alright, thanks for the feedback. I'm going to pull the trigger (keeping it at 10x for now, as this has been sufficient in all cases we encountered in the past -- and it also stays within standard int limits).

@pgrete pgrete enabled auto-merge (squash) April 17, 2025 07:09
@pgrete pgrete merged commit d6eb67d into develop Apr 17, 2025
35 checks passed
@pgrete pgrete deleted the pgrete/incr-safety-counter branch May 19, 2025 07:38
lroberts36 pushed a commit that referenced this pull request Jun 18, 2025
pgrete added a commit that referenced this pull request Sep 10, 2025
* start

* seemingly works

* format

* remove intermediate prints

* print some solutions

* small

* remove comments

* Increase MPI timeout counter by 10x (#1243)

* cleanup output and add option of outputting a specific stage

* docs

* changelog and formatting

* oops swarms are only in Base

* formatting

* lukes nicer cleanup

* add compute asymmetry script

* Add shebang to movie2d

* document how to change output cadence from restart

* changelog

* CC

* CC

* add discussion of SMR to AMR doc

* changelog

* typo thanks patrick

* Try to move all LogicalLocation related logic to class methods

* fix bugs

* make sure colorbar label generated

* add resport_asymmetry

* Update doc/sphinx/src/outputs.rst

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>

* Update doc/sphinx/src/outputs.rst

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>

* Add checks for `parthenon/mesh/multigrid = true` when using multigrid block lists (#1241)

* add checks and move things to being private

* changelog

* [Bugfix]: Propagate flux `Metadata` correctly in `SparsePool` (#1249)

* propagate flux metadata correctly in SparsePool

* changelog

* [Trivial] Fix leftover reference to old coords interface (#1251)

* Fix leftover reference to old coords interface

* Fix doc formatting

* Release 25.05 (#1252)

* Release 25.05

* Update CHANGELOG.md

* Add support for uint64 swarm vars and add default ids (#1253)

* Add support for uint64 swarm vars and enforce ids

* Provide known name for id field in xdmf. Always write ids

* Allow disabling unique ids for particles (old default behavior)

* Update Changelog

* Fix doc format

* Allow writing of swarm xdmf for rst outputs

* Fix unsafe/erroneous casting

* Add comment on use of nonpersistent ids in traces

* New CUDA CI+development Docker container (#1162)

* update CI Docker container

Update to CUDA 12.6, Ubuntu 24.04, and Clang 19.

* update OpenMPI

* revert to OpenMPI 4.1.4

* add ADIOS2+openPMD

* add c-blosc ubuntu package

* fix Dockerfile.nvcc

* fix dockerfile

* install python headers

* install cmake from apt (required for aarch64)

* remove duplicate cmake dep

* disable ascent build

* add newer ascent version

* disable ascent build; fix openpmd build

* downgrade to CUDA 12.0

* fix ascent build path

* fix bug in build_ascent.sh

* remove unneeded patches

* ascent complains if MFEM is not built

* control cuda support for ascent with env var

* add MAKEOPTS=--output-sync=target

* add comment to Dockerfile

* Downgrade numpy

* Fix ADIOS2 and OpenPMD versions

* Directly use Ascent script with small patch

* Use Cuda12.1 container and drop to local user

* add emacs and vi

* set build_jobs=`nproc` to avoid OOM kill

* add developer tools for Codespaces/VSCode

* add devcontainer.json

* update to CUDA 12.8 and VTK-m 2.3

* update image ref

* fetch BLT

* extract BLT into correct dir

* avoid uid 1000

* build and publish Docker image based on Dockerfiles in repo

* Update CI image to be used

* Fix python version used for linting

* Bump opmd to stable release

* Add changelog

* Use updated clang for compiler check

* remove docker-publish action

* Fix C++20 build

* Attempt to fix Parthenon Ascent dep

* Try Ben's BLT build fix

* Fix Ascent build

* Include opmd in rocm image

* Use actual image

---------

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>

* Fix -m  (#1257)

* Exit cleanly if -m is set after Mesh is constructed but before initialize

* Update CHANGELOG.md

* remove kokkos warning

* add core dump output format

* check that only one core dump file is requested

* changelog

* update docs and delete accidentally added files

* output postfix -> .chdf

* make sure to slurp in fluxes

* add a 5

* CC

* more CC

* example CC

* comment that ghost zones exist

* seemingly working timeout capability

* seemingly working

* add timing option

* format

* changelog

* Remove old receive try test

* remove comm buffer hang check

* use task collection for internal mesh communication

* Add timeout parameter to doc

* format

* Update src/mesh/mesh.cpp

* Update src/tasks/tasks.hpp

* Fix auto-docs and typo

* respond to Philipp's comments

* check the task collection return status

---------

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>
Co-authored-by: Jonah Miller <jonahm@lanl.gov>
Co-authored-by: Jonah Miller <jonah.maxwell.miller@gmail.com>
Co-authored-by: Ben Wibking <ben@wibking.com>
Co-authored-by: Adam M. Dempsey <adempsey@lanl.gov>


4 participants