Given recent and past experiences with running at scale, I was wondering whether it'd be worth adding some automated monitoring of machine/device variability at runtime.
I could imagine that downstream codes select a kernel that is explicitly profiled, i.e., whose runtime is measured. Ideally that kernel has a constant runtime and is likely to be followed by a barrier/sync point (say, calculating the timestep).
These timings are then collected on rank 0 and reported (potentially in the separate performance log we discussed in the past).
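To make the reporting part concrete, here is a rough sketch in plain Python. It assumes the per-rank timings have already been gathered on rank 0 (e.g., via an MPI gather, not shown) into a list indexed by rank. The names `time_kernel` and `summarize_timings` and the log line format are made up for illustration and are not an existing API.

```python
import statistics
import time


def time_kernel(kernel, *args):
    """Return the wall-clock seconds taken by one call to `kernel` (hypothetical helper)."""
    start = time.perf_counter()
    kernel(*args)
    return time.perf_counter() - start


def summarize_timings(cycle, timings):
    """Format one performance-log line from per-rank timings (list index = rank).

    Assumes `timings` was already gathered on rank 0; the line format is illustrative.
    """
    # Rank with the largest measured runtime in this cycle.
    slowest_rank = max(range(len(timings)), key=timings.__getitem__)
    return (
        f"cycle={cycle} min={min(timings):.3e} "
        f"median={statistics.median(timings):.3e} "
        f"max={max(timings):.3e} slowest_rank={slowest_rank}"
    )
```

Each rank would call something like `time_kernel` around the selected kernel, and rank 0 would append the `summarize_timings` line to the performance log once per cycle (or every N cycles to keep the overhead low).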
This could also be used as a safety check (e.g., to not waste resources if there is a single bad node/device): check whether the slowest process falls below a user-defined threshold relative to the median performance (or whether the deviation across all processes becomes too large). If so, show a warning and keep track of the number of cycles in which this happens, and eventually let the sim exit gracefully if the same node/device repeatedly fails to meet that threshold.
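A minimal sketch of what such a check could look like, again assuming the per-rank timings are already available on rank 0. The class name `StragglerMonitor`, the parameters `max_slowdown` and `max_strikes`, and the choice to count strikes cumulatively (rather than only consecutive cycles) are all assumptions for the sake of discussion.

```python
import statistics
from collections import Counter


class StragglerMonitor:
    """Track ranks whose kernel runtime exceeds a multiple of the median (hypothetical)."""

    def __init__(self, max_slowdown=1.5, max_strikes=3):
        self.max_slowdown = max_slowdown  # user-defined threshold relative to the median
        self.max_strikes = max_strikes    # offending cycles tolerated per rank before abort
        self.strikes = Counter()          # cumulative offending cycles, keyed by rank

    def check(self, timings):
        """Return (status, rank) for one cycle; status is 'ok', 'warn', or 'abort'."""
        median = statistics.median(timings)
        slowest = max(range(len(timings)), key=timings.__getitem__)
        if timings[slowest] <= self.max_slowdown * median:
            return ("ok", None)
        # The slowest rank missed the threshold: record a strike against it.
        self.strikes[slowest] += 1
        if self.strikes[slowest] >= self.max_strikes:
            return ("abort", slowest)
        return ("warn", slowest)
```

On `'warn'` rank 0 would print a warning naming the rank (and ideally the node/device it maps to). On `'abort'` the driver would trigger the graceful exit, e.g., write a final output and stop. A check on the spread across all processes could be added in the same place.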
What do people think?