Need help Interpreting Perfetto for Profiling #24712
Unanswered
amishra791
asked this question in
Q&A
Hi, I profiled two code snippets on a TPU v4-8 VM that I am having trouble making sense of.
The first code snippet is the following:
Even though I sharded the array across 4 devices, the profiler is showing computation across 8. Why is that?
The second code snippet is the following:
As in the first example, the trace shows 8 TPU devices even though I only sharded across 4. My main question here, though, is why lax.ones takes up so much time, finishing even after the actual computation performed across all the TPU devices.
I've attached the trace file too: perfetto_trace.json.gz