Automated monitoring of machine/device variability #1181

Open
pgrete opened this issue Sep 27, 2024 · 1 comment

Comments

@pgrete
Collaborator

pgrete commented Sep 27, 2024

Given recent/past experiences with running at scale, I was wondering if it'd be worth adding some automated monitoring of machine/device variability at runtime.

I could imagine that downstream codes select a kernel that is explicitly profiled (i.e., its runtime is measured). Ideally, this kernel has a constant runtime and is likely to be followed by a barrier/sync point anyway, say, the timestep calculation.
These timings are then collected on rank 0 and reported (potentially in the separate performance log we discussed in the past).
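
As a rough illustration of the measurement step, the sketch below times a single kernel invocation and summarizes the per-rank timings that rank 0 would report. `TimeKernel`, `TimingSummary`, and `Summarize` are hypothetical names and not part of any existing API. The gather itself is left out to keep the example free of dependencies; in an MPI code it could be an `MPI_Gather` of one `double` per rank to rank 0.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: run `kernel` once and return its wall time in seconds.
template <typename Kernel>
double TimeKernel(Kernel &&kernel) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Kernel>(kernel)();
  // Assumption: the kernel has completed when the call returns. An
  // asynchronous device kernel would need a synchronization here, otherwise
  // only the launch overhead is measured.
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical summary of the timings gathered on rank 0 (one entry per rank).
struct TimingSummary {
  double min;
  double max;
  double mean;
};

TimingSummary Summarize(const std::vector<double> &seconds_per_rank) {
  assert(!seconds_per_rank.empty());
  const auto min_max =
      std::minmax_element(seconds_per_rank.begin(), seconds_per_rank.end());
  const double sum =
      std::accumulate(seconds_per_rank.begin(), seconds_per_rank.end(), 0.0);
  return {*min_max.first, *min_max.second,
          sum / static_cast<double>(seconds_per_rank.size())};
}
```

Each rank would call `TimeKernel` around the selected kernel once per cycle, and rank 0 would pass the gathered values to `Summarize` before writing them to the log.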

This could also be used as a safety check (e.g., to not waste resources if there is a single bad node/device). For example, if the performance of the slowest process falls below the median by more than a user-defined threshold (or the deviation across all processes becomes too large), we could show a warning, keep track of the number of cycles in which this happens, and eventually let the sim exit gracefully if the same node/device repeatedly fails to meet that threshold.
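
The check described above could look roughly like the following sketch, which would run on rank 0 once the per-rank timings are available. All names (`MonitorStatus`, `StragglerMonitor`, `max_slowdown`, `max_strikes`) are hypothetical. The sketch also assumes that the threshold is expressed as a ratio of the slowest time to the median time and that only consecutive offending cycles of the same rank count toward the exit condition; both are choices made here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical outcome of one monitoring step.
enum class MonitorStatus { kOk, kWarn, kExit };

// Median of a non-empty list of timings (mean of the two middle values for
// an even count).
double Median(std::vector<double> values) {
  assert(!values.empty());
  std::sort(values.begin(), values.end());
  const std::size_t n = values.size();
  return n % 2 == 1 ? values[n / 2]
                    : 0.5 * (values[n / 2 - 1] + values[n / 2]);
}

// Hypothetical monitor tracking how often the same rank is the straggler.
struct StragglerMonitor {
  double max_slowdown;       // assumed threshold: allowed slowest/median ratio
  int max_strikes;           // consecutive offending cycles before exiting
  std::vector<int> strikes;  // consecutive offending cycles per rank

  StragglerMonitor(std::size_t num_ranks, double max_slowdown_in,
                   int max_strikes_in)
      : max_slowdown(max_slowdown_in), max_strikes(max_strikes_in),
        strikes(num_ranks, 0) {}

  // `seconds` holds one timing per rank for the current cycle.
  MonitorStatus Update(const std::vector<double> &seconds) {
    assert(seconds.size() == strikes.size() && !seconds.empty());
    const auto slowest_it = std::max_element(seconds.begin(), seconds.end());
    const auto slowest =
        static_cast<std::size_t>(slowest_it - seconds.begin());
    const bool offending = *slowest_it > max_slowdown * Median(seconds);
    // Only the current straggler keeps its count; all others are reset.
    for (std::size_t r = 0; r < strikes.size(); ++r) {
      strikes[r] = (offending && r == slowest) ? strikes[r] + 1 : 0;
    }
    if (!offending) return MonitorStatus::kOk;
    return strikes[slowest] >= max_strikes ? MonitorStatus::kExit
                                           : MonitorStatus::kWarn;
  }
};
```

With `StragglerMonitor mon(4, 1.5, 2)`, a cycle in which one rank takes three times the median yields `kWarn`, and a second consecutive cycle with the same slow rank yields `kExit`. Using the median rather than the mean as the reference keeps a single outlier from shifting the baseline.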

What do people think?

@Yurlungur
Collaborator

I think this is a good idea. It might also be worth thinking about whether we can leverage the same machinery to measure node imbalance somehow.


2 participants