Increase MPI timeout counter by 10x (#1243)

Merged
pgrete merged 1 commit into develop from pgrete/incr-safety-counter on Apr 17, 2025

Conversation

@pgrete (Collaborator) commented Apr 16, 2025

PR Summary

We recently got bit by this again on a busy CPU cluster.
IIRC other groups have also bumped this value in the past, so I think increasing the default is a good idea (and it does no harm).
Any objection to getting this in before the next release?
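For readers outside the project: the counter in question is an iteration budget on a communication polling loop. A minimal sketch of how such a guard typically works is below; `kMaxIterations` and `WaitForBuffers` are hypothetical names for illustration, not Parthenon's actual API.

```cpp
#include <cstdio>

// Illustrative iteration-budget guard: poll the receive buffers and give up
// once a fixed number of polls is exceeded. The names here are hypothetical,
// not Parthenon's real interface.
constexpr long kMaxIterations = 2'000'000'000L; // a 10x bump, still within int range

// Returns true once all_received() reports completion, or false if the
// iteration budget runs out first (a likely-hung communication).
bool WaitForBuffers(bool (*all_received)()) {
  for (long iter = 0; iter < kMaxIterations; ++iter) {
    if (all_received()) return true;
  }
  std::fprintf(stderr, "Possible MPI hang: iteration budget exhausted\n");
  return false;
}
```

The weakness debated in this thread is visible here: the budget is counted in loop iterations, so the wall-clock time it represents depends entirely on how fast the host hardware spins the loop.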

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary of API changes at the top of the PR (plus optionally an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur (Collaborator) left a comment

No objections from me. Hopefully the unified comms machinery can help with this long term...

@lroberts36 (Collaborator) left a comment

LGTM. That said, I do think we should move to a timer-based solution, which may be fairly easy to add to the tasking infrastructure rather than directly in the comms code (i.e., set a maximum amount of time a TaskCollection is allowed to run).
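The timer-based idea can be sketched as follows, assuming nothing about Parthenon's actual tasking API (`RunWithTimeout` and its parameters are illustrative): bound the wall time a polling loop may run instead of counting iterations, so the limit is hardware independent.

```cpp
#include <chrono>
#include <functional>

// Illustrative wall-clock timeout around a completion-polling loop. Unlike an
// iteration counter, the limit means the same thing on any hardware.
bool RunWithTimeout(const std::function<bool()> &poll_complete,
                    std::chrono::milliseconds limit) {
  const auto start = std::chrono::steady_clock::now();
  while (!poll_complete()) {
    if (std::chrono::steady_clock::now() - start > limit) {
      return false; // timed out; caller can abort with a clear error message
    }
  }
  return true;
}
```

Using `steady_clock` (rather than `system_clock`) matters here, since it is monotonic and immune to NTP adjustments mid-run.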

@c-prather (Collaborator) commented Apr 16, 2025

I propose we just remove this counter, even without a replacement in place yet. I have yet to hear a report of an actual MPI hang being prevented by this thing, and there was nothing magic about the number I bumped it to the last time(s) we hit this. The same number of iterations might take seconds on a fast CPU cluster and hours on older GPU hardware, making this method functionally useless no matter how much we tweak it. If we remove it, the worst-case scenario is that Parthenon wastes a few hours of allocation time hung, but smart job management and supervision really limit that impact.

I have incidentally had Parthenon hang and waste entire jobs, but always at initialization, filesystem access, that sort of thing, never an MPI hang caught by this code. So, as @lroberts36 mentions, the real solution is to implement timer watchdogs for any task, and set default time limits of maybe an hour for a bunch of simple things. If we don't care about thread safety, it's just a matter of kicking off a timer and implementing a callback, right?
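The "kick off a timer and implement a callback" idea can be sketched with a small watchdog class; everything here (the `Watchdog` name, `Disarm`, the callback shape) is hypothetical, not an existing Parthenon facility. The watchdog thread fires the callback unless it is disarmed before the deadline.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Minimal watchdog sketch: a background thread waits on a condition variable
// with a timeout. If Disarm() is not called before the deadline, the timeout
// callback fires; the destructor disarms and joins, so no thread leaks.
class Watchdog {
 public:
  Watchdog(std::chrono::milliseconds limit, std::function<void()> on_timeout)
      : thread_([this, limit, cb = std::move(on_timeout)] {
          std::unique_lock<std::mutex> lock(mutex_);
          // wait_for returns false if the deadline passed without a disarm
          if (!cv_.wait_for(lock, limit, [this] { return disarmed_; })) cb();
        }) {}

  void Disarm() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      disarmed_ = true;
    }
    cv_.notify_one();
  }

  ~Watchdog() {
    Disarm();
    thread_.join();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool disarmed_ = false;
  std::thread thread_; // declared last so mutex_/cv_ exist before the thread starts
};
```

The thread-safety caveat from the comment above is real: if the callback aborts the run (e.g. prints and calls `MPI_Abort`), it executes on the watchdog thread, concurrent with whatever the main thread is stuck in.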

@pgrete (Collaborator, Author) commented Apr 16, 2025

I agree that the current solution is not ideal and that a timer-based solution is very much preferable.
IIRC we even have one in place already for some other routine.

However, the current solution did help us in the past on Frontier (where even short jobs were 10k+ GPU hours).
Thus, I'm in favor of keeping this suboptimal solution for now (instead of removing it) until a clean one is in place.

@lroberts36 (Collaborator) commented

I just started adding TaskRegion timeouts in #1244; let me know if you think this sort of strategy will work for you.

@pgrete (Collaborator, Author) commented Apr 16, 2025

I think that strategy overall looks great (and is much more powerful).

Given subtle (past) challenges with threading and the ThreadPool (isn't it disabled by default at the moment?), I'd feel better if that kind of change only makes it into develop after the next release (so that it can mature/be tested more broadly).

So what shall we do with the current PR? We can just keep the number updated downstream and not merge this (though from my point of view it wouldn't hurt).

@lroberts36 (Collaborator) commented

> I think that strategy overall looks great (and is much more powerful).
>
> Given subtle (past) challenges with threading and the ThreadPool (isn't it disabled by default at the moment?), I'd feel better if that kind of change only makes it into develop after the next release (so that it can mature/be tested more broadly).

Yeah, it seemed like a simple enough change, but the threading logic makes it a bit complicated to determine whether this has some unintended consequence. I am currently playing around with testing this. It would probably be good to have some unit tests for ThreadPool.

> So what shall we do with the current PR? We can just keep the number updated downstream and not merge this (though from my point of view it wouldn't hurt).

I agree that we should just merge this for now.

@lroberts36 (Collaborator) commented

FWIW, it is trivial to add timing to SerialPool in check_task_returns. After looking at this a bit, I am not sure why *::wait() and *::check_task_returns are both needed though...
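To illustrate why adding timing there would be trivial: a deadline can be recorded when work starts and tested on every pass through the return-checking hook. The sketch below assumes nothing about Parthenon's real SerialPool; the class name, members, and the spot where `CheckDeadline` would be called are all illustrative.

```cpp
#include <chrono>
#include <stdexcept>

// Sketch of deadline tracking for a serial task pool: store an absolute
// deadline at construction and test it on every polling pass. All names
// here are hypothetical, not Parthenon's actual SerialPool interface.
class SerialPoolSketch {
 public:
  explicit SerialPoolSketch(std::chrono::seconds limit)
      : deadline_(std::chrono::steady_clock::now() + limit) {}

  // Would be invoked from something like check_task_returns() on each pass.
  void CheckDeadline() const {
    if (std::chrono::steady_clock::now() > deadline_) {
      throw std::runtime_error("SerialPool exceeded its time limit");
    }
  }

 private:
  std::chrono::steady_clock::time_point deadline_;
};
```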

@c-prather (Collaborator) commented

I'm happy merging this for the release (though maybe we should consider 100x, for the breathing room?). I was under the impression it hadn't provided much value to anyone, but if it's useful that's great.

Also agree with merging timers after release. I don't think there's any great risk of bugs with such simple implementations, but I do think setting any default limits is going to screw with downstreams (e.g., say that mostly you want TaskRegions to complete in <1 second, but your file output region takes >10 minutes, that sort of thing). And I would suggest we set some default limits with good error messages, rather than rely on users to lay out all of their own timings.

@pgrete (Collaborator, Author) commented Apr 17, 2025

Alright, thanks for the feedback. I'm going to pull the trigger (keeping it at 10x for now, as this has been sufficient in all cases we encountered in the past -- and it also stays within standard int limits).

@pgrete pgrete enabled auto-merge (squash) April 17, 2025 07:09
@pgrete pgrete merged commit d6eb67d into develop Apr 17, 2025
35 checks passed
@pgrete pgrete deleted the pgrete/incr-safety-counter branch May 19, 2025 07:38
lroberts36 pushed a commit that referenced this pull request Jun 18, 2025
pgrete added a commit that referenced this pull request Sep 10, 2025
* start

* seemingly works

* format

* remove intermediate prints

* print some solutions

* small

* remove comments

* Increase MPI timeout counter by 10x (#1243)

* cleanup output and add option of outputting a specific stage

* docs

* changelog and formatting

* oops swarms are only in Base

* formatting

* lukes nicer cleanup

* add compute asymmetry script

* Add shebang to movie2d

* document how to change output cadence from restart

* changelog

* CC

* CC

* add discussion of SMR to AMR doc

* changelog

* typo thanks patrick

* Try to move all LogicalLocation related logic to class methods

* fix bugs

* make sure colorbar label generated

* add resport_asymmetry

* Update doc/sphinx/src/outputs.rst

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>

* Update doc/sphinx/src/outputs.rst

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>

* Add checks for `parthenon/mesh/multigrid = true` when using multigrid block lists (#1241)

* add checks and move things to being private

* changelog

* [Bugfix]: Propagate flux `Metadata` correctly in `SparsePool` (#1249)

* propagate flux metadata correctly in SparsePool

* changelog

* [Trivial] Fix leftover reference to old coords interface (#1251)

* Fix leftover reference to old coords interface

* Fix doc formatting

* Release 25.05 (#1252)

* Release 25.05

* Update CHANGELOG.md

* Add support for uint64 swarm vars and add default ids (#1253)

* Add support for uint64 swarm vars and enforce ids

* Provide known name for id field in xdmf. Always write ids

* Allow disabling unique ids for particles (old default behavior)

* Update Changelog

* Fix doc format

* Allow writing of swarm xdmf for rst outputs

* Fix unsafe/erroneous casting

* Add comment on use of nonpersistent ids in traces

* New CUDA CI+development Docker container (#1162)

* update CI Docker container

Update to CUDA 12.6, Ubuntu 24.04, and Clang 19.

* update OpenMPI

* revert to OpenMPI 4.1.4

* add ADIOS2+openPMD

* add c-blosc ubuntu package

* fix Dockerfile.nvcc

* fix dockerfile

* install python headers

* install cmake from apt (required for aarch64)

* remove duplicate cmake dep

* disable ascent build

* add newer ascent version

* disable ascent build; fix openpmd build

* downgrade to CUDA 12.0

* fix ascent build path

* fix bug in build_ascent.sh

* remove unneeded patches

* ascent complains if MFEM is not built

* control cuda support for ascent with env var

* add MAKEOPTS=--output-sync=target

* add comment to Dockerfile

* Downgrade numpy

* Fix ADIOS2 and OpenPMD versions

* Directly use Ascent script with small patch

* Use Cuda12.1 container and drop to local user

* add emacs and vi

* set build_jobs=`nproc` to avoid OOM kill

* add developer tools for Codespaces/VSCode

* add devcontainer.json

* update to CUDA 12.8 and VTK-m 2.3

* update image ref

* fetch BLT

* extract BLT into correct dir

* avoid uid 1000

* build and publish Docker image based on Dockerfiles in repo

* Update CI image to be used

* Fix python version used for linting

* Bump opmd to stable release

* Add changelog

* Use updated clang for compiler check

* remove docker-publish action

* Fix C++20 build

* Attempt to fix Parthenon Ascent dep

* Try Ben's BLT build fix

* Fix Ascent build

* Include opmd in rocm image

* Use actual image

---------

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>

* Fix -m  (#1257)

* Exit cleanly if -m is set after Mesh is constructed but before initialize

* Update CHANGELOG.md

* remove kokkos warning

* add core dump output format

* check that only one core dump file is requested

* changelog

* update docs and delete accidentally added files

* output postfix -> .chdf

* make sure to slurp in fluxes

* add a 5

* CC

* more CC

* example CC

* comment that ghost zones exist

* seemingly working timeout capability

* seemingly working

* add timing option

* format

* changelog

* Remove old receive try test

* remove comm buffer hang check

* use task collection for internal mesh communication

* Add timeout parameter to doc

* format

* Update src/mesh/mesh.cpp

* Update src/tasks/tasks.hpp

* Fix auto-docs and typo

* respond to Philipp's comments

* check the task collection return status

---------

Co-authored-by: Philipp Grete <pgrete@hs.uni-hamburg.de>
Co-authored-by: Jonah Miller <jonahm@lanl.gov>
Co-authored-by: Jonah Miller <jonah.maxwell.miller@gmail.com>
Co-authored-by: Ben Wibking <ben@wibking.com>
Co-authored-by: Adam M. Dempsey <adempsey@lanl.gov>


4 participants