Increase MPI timeout counter by 10x#1243
Conversation
Yurlungur
left a comment
There was a problem hiding this comment.
No objections from me. Hopefully the unified comms machinery can help with this long term...
There was a problem hiding this comment.
LGTM. I do think we should move to a timer based solution though, which may be pretty easy to add to the tasking infrastructure though, rather than directly in the comms stuff (i.e. set a maximum amount of time a TaskCollection is allowed to run for).
|
I propose we just remove this counter, even without a replacement in place yet. I have yet to hear a report of an actual MPI hang being prevented by this thing, and there was nothing magic to the number I bumped it to last time(s) we hit this. The same number of iterations might be seconds on a fast CPU cluster, and hours on older GPU hardware, making this method functionally useless no matter how much we tweak it. If we remove it, the worst case scenario is Parthenon wastes a few hours of allocation time hung, but smart job management and supervision really limit that impact. I have incidentally had Parthenon hang and waste entire jobs, but always at initialization, filesystem access, that sort of thing, not an MPI hang caught by this code. So, as @lroberts36 mentions, the real solution is to implement timer watchdogs for any task, and set default maybe ~hour time limits for a bunch of simple things. If we don't care about thread safety, it's just a matter of kicking off a timer and implementing a callback, right? |
|
I agree that the current solution is not ideal and that a timer based solution is very much preferable. However, the current solution did help us in the past on Frontier (where even short jobs were 10k+ GPU hours). |
|
I just started on adding |
|
I think that strategy overall looks great (as is much more powerful). Given subtle (past) challenges with threading and the ThreadPool (isn't it disabled by default at the moment), I'd feel better if that kind of change only makes it develop after the next release (so that it can mature/be tested more broadly). So what shall we do with the current PR? We can just keep the number updated downstream and not merge this (though from my point of view it wouldn't hurt). |
Yeah, it seemed like a simple enough change but the threading logic makes it a bit complicated to determine whether or not this has some unintended consequence. I am currently playing around with testing this. It would probably be good to have some unit tests for
I agree that we should just merge this for now. |
|
FWIW, it is trivial to add timing to |
|
I'm happy merging this for the release (though maybe we should consider 100x, for the breathing room?). I was under the impression it hadn't provided much value to anyone, but if it's useful that's great. Also agree with merging timers after release. I don't think there's any great risk of bugs with such simple implementations, but I do think setting any default limits is going to screw with downstreams (e.g., say that mostly you want |
|
Alright, thanks for the feedback. I'm going to pull the trigger (keeping it at the 10x for now as this has been sufficient on all cases we encountered in the past -- and it's also within standard int limits). |
* start * seemingly works * format * remove intermediate prints * print some solutions * small * remove comments * Increase MPI timeout counter by 10x (#1243) * cleanup output and add option of outputting a specific stage * docs * changelog and formatting * oops swarms are only in Base * formatting * lukes nicer cleanup * add compute asymmetry script * Add shebang to movie2d * document how to change output cadence from restart * changelog * CC * CC * add discussion of SMR to AMR doc * changelog * typo thanks patrick * Try to move all LogicalLocation related logic to class methods * fix bugs * make sure colorbar label generated * add resport_asymmetry * Update doc/sphinx/src/outputs.rst Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de> * Update doc/sphinx/src/outputs.rst Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de> * Add checks for `parthenon/mesh/multigrid = true` when using multigrid block lists (#1241) * add checks and move things to being private * changelog * [Bugfix]: Propagate flux `Metadata` correctly in `SparsePool` (#1249) * propagate flux metadata correctly in SparsePool * changelog * [Trivial] Fix leftover reference to old coords interface (#1251) * Fix leftover reference to old coords interface * Fix doc formatting * Release 25.05 (#1252) * Release 25.05 * Update CHANGELOG.md * Add support for uint64 swarm vars and add default ids (#1253) * Add support for uint64 swarm vars and enforce ids * Provide known name for id field in xdmf. Always write ids * Allow disabling unique ids for particles (old default behavior) * Update Changelog * Fix doc format * Allow writing of swarm xdmf for rst outputs * Fix unsafe/errornous casting * Add comment on use of nonpersistnet ids in traces * New CUDA CI+development Docker container (#1162) * update CI Docker container Update to CUDA 12.6, Ubuntu 24.04, and Clang 19. * update OpenMPI * revert to OpenMPI 4.1.4 * add ADIOS2+openPMD * add c-blosc ubuntu package * fix Dockerfile.nvcc * fix dockerfile * install python headers * install cmake from apt (required for aarch64) * remove duplicate cmake dep * disable ascent build * add newer ascent version * disable ascent build; fix openpmd build * downgrade to CUDA 12.0 * fix ascent build path * fix bug in build_ascent.sh * remove unneeded patches * ascent complains if MFEM is not built * control cuda support for ascent with env var * add MAKEOPTS=--output-sync=target * add comment to Dockerfile * Downgrade numpy * Fix ADIOS2 and OpenPMD versions * Directly use Ascent script with small patch * Use Cuda12.1 container and drop to local user * add emacs and vi * set build_jobs=`nproc` to avoid OOM kill * add developer tools for Codespaces/VSCode * add devcontainer.json * update to CUDA 12.8 and VTK-m 2.3 * update image ref * fetch BLT * extract BLT into correct dir * avoid uid 1000 * build and publish Docker image based on Dockerfiles in repo * Update CI image to be used * Fix python version used for linting * Bump opmd to stable release * Add changelog * Use updated clang for compiler check * remove docker-publish action * Fix C++20 build * Attempt to fix Parthenon Ascent dep * Try Ben's BLT build fix * Fix Ascent build * Include opmd in rocm image * Use actual image --------- Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de> * Fix -m (#1257) * Exit cleanly if -m is set after Mesh is constructed but before initialize * Update CHANGELOG.md * remove kokkos warning * add core dump output format * check that only one core dump file is requested * changelog * update docs and delete accidentally added files * output postfix -> .chdf * make sure to slurp in fluxes * add a 5 * CC * more CC * example CC * comment that ghost zones exist * seemingly working timeout capability * seemingly working * add timing option * format * changelog * Remove old receive try test * remove comm buffer hang check * use task collection for internal mesh communication * Add timeout parameter to doc * format * Update src/mesh/mesh.cpp * Update src/tasks/tasks.hpp * Fix auto-docs and typo * respond to Philipp's comments * check the task collection return status --------- Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de> Co-authored-by: Jonah Miller <jonahm@lanl.gov> Co-authored-by: Jonah Miller <jonah.maxwell.miller@gmail.com> Co-authored-by: Ben Wibking <ben@wibking.com> Co-authored-by: Adam M. Dempsey <adempsey@lanl.gov>
PR Summary
We recently got bit by this again on a buy CPU cluster.
IIRC other groups have also adapted this in the past so that I think increasing the default is a good idea (that also does no harm).
Any objection getting this in before the next release?
PR Checklist