Automated monitoring of machine/device variability #1181

pgrete · 2024-09-27T07:55:23Z

Given recent/past experiences with running at scale, I was wondering if it'd be worth adding some automated monitoring of machine/device variability at runtime.

I could imagine that downstream codes select a kernel to be explicitly profiled (i.e., its runtime is measured). Ideally this is a kernel with constant runtime that is likely to be followed by a barrier/sync point, say, calculating the timestep.
These timings are then collected on rank 0 and reported (potentially in the separate performance log we discussed in the past).
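
To make the reporting step concrete, here is a minimal sketch of what rank 0 could compute once the per-rank timings have arrived. Everything in it is hypothetical: the names `summarize_timings` and `format_log_line` and the log format are assumptions for illustration, the gather itself (e.g., an MPI gather onto rank 0) is left out, and Python is only used to keep the sketch short.

```python
import statistics


def summarize_timings(timings):
    """Summarize per-rank runtimes (in seconds) of the profiled kernel.

    `timings[i]` is the runtime measured on rank i for one cycle. In a real
    run this list would be gathered onto rank 0; that step is omitted here.
    """
    if not timings:
        raise ValueError("need at least one timing")
    median = statistics.median(timings)
    # Rank with the longest runtime (first one wins on ties).
    slowest_rank = max(range(len(timings)), key=lambda r: timings[r])
    slowest = timings[slowest_rank]
    return {
        "min": min(timings),
        "median": median,
        "max": slowest,
        "slowest_rank": slowest_rank,
        # Ratio > 1 means the slowest rank lags behind the typical rank.
        "max_over_median": slowest / median if median > 0 else float("inf"),
    }


def format_log_line(cycle, summary):
    """Format one line for a (hypothetical) performance log."""
    return (
        f"cycle={cycle} min={summary['min']:.3e} "
        f"median={summary['median']:.3e} max={summary['max']:.3e} "
        f"slowest_rank={summary['slowest_rank']} "
        f"max/median={summary['max_over_median']:.2f}"
    )
```

Reporting the max/median ratio rather than absolute times keeps the log comparable across problem sizes, assuming the profiled kernel really has a near-constant cost per rank.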

This could also be used as a safety check, e.g., to avoid wasting resources if there is a single bad node/device. For example, if the slowest process falls more than a user-defined threshold below the median performance (or the deviation across all processes becomes too large), show a warning and keep track of the number of cycles in which this happens. Eventually, let the sim exit gracefully if the same node/device repeatedly fails to meet that threshold.
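
A rough sketch of that safety check, under a few assumptions that would need discussion: the threshold is expressed as a multiple of the median runtime, "repeatedly" is interpreted as a number of consecutive cycles, and the class name `VariabilityMonitor` and its parameters are made up for illustration.

```python
import statistics


class VariabilityMonitor:
    """Track ranks whose profiled kernel runs much slower than the median.

    slowdown_threshold: a rank is flagged if its runtime exceeds
        slowdown_threshold * median (assumed definition of the threshold).
    max_strikes: number of consecutive flagged cycles after which a
        graceful exit is requested (assumed policy).
    """

    def __init__(self, slowdown_threshold=1.5, max_strikes=3):
        self.slowdown_threshold = slowdown_threshold
        self.max_strikes = max_strikes
        self.strikes = {}  # rank -> consecutive cycles flagged

    def update(self, timings):
        """Process one cycle; return (flagged_ranks, should_exit)."""
        limit = self.slowdown_threshold * statistics.median(timings)
        flagged = [r for r, t in enumerate(timings) if t > limit]
        # Increment the count for flagged ranks; ranks that recovered are
        # dropped, which resets their count.
        self.strikes = {r: self.strikes.get(r, 0) + 1 for r in flagged}
        should_exit = any(n >= self.max_strikes for n in self.strikes.values())
        return flagged, should_exit


if __name__ == "__main__":
    monitor = VariabilityMonitor(slowdown_threshold=1.5, max_strikes=3)
    for cycle in range(3):
        flagged, should_exit = monitor.update([1.0, 1.0, 1.0, 2.0])
        if flagged:
            print(f"warning: cycle {cycle}: slow ranks {flagged}")
        if should_exit:
            print("same rank repeatedly too slow, requesting graceful exit")
```

Counting consecutive cycles avoids aborting on a one-off hiccup; a cumulative count or a check on the overall spread across ranks would be equally plausible variants.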

What do people think?

Yurlungur · 2024-09-27T22:59:25Z

I think this is a good idea. It might also be worth thinking about whether we can leverage the same machinery to measure node imbalance somehow.

pgrete added the enhancement-proposal label Sep 27, 2024
