Timeline Trace in Training #1888

npiroozan · 2022-09-02T00:30:45Z

npiroozan
Sep 2, 2022

Hello,

I hope you are doing well. Attached please find the timeline.json file for a training run on a single node with 16 workers and a local batch size of 32. TF_Intra_Op is set to 12 and TF_Inter_Op is set to 4, with OpenMP=3. I expected to see 4 compute threads, reflecting the TF_Inter_Op setting.

The strange element of this run is threads 4-9, where so much time is spent on HorovodAllReduce. Is this behavior that you have noticed as well?

In addition, within the JSON file there are functions for "enable_profiler" and "profiling", which give differing outputs with respect to the TensorBoard trace. May I also ask what the difference between these two functionalities is?

Thank you very much for your time!

Warm Regards

wanghan-iapcm · 2022-09-02T03:32:37Z

wanghan-iapcm
Sep 2, 2022
Maintainer

@shishaochen Could you please take a look? Thanks!

1 reply

npiroozan Sep 27, 2022
Author

Hello,

I am following up on this point from several weeks ago. Hope everyone is doing well!

Thank you.

njzjz · 2022-09-27T23:21:44Z

njzjz
Sep 27, 2022
Maintainer

enable_profiler works with TensorFlow Profiler, which is a new feature in TensorFlow v2.5.

As for the Horovod issue, the behavior should come from Horovod itself rather than from deepmd-kit.
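For readers comparing the two options, here is a minimal sketch of where they might sit in the training section of a DeePMD-kit input file, written as a Python dict. The key names `profiling_file` and `tensorboard_log_dir` and all values are assumptions for illustration and should be checked against the documentation for the installed version; only `profiling` and `enable_profiler` are named in this thread.

```python
import json

# Hypothetical fragment of the "training" section of an input file.
# Key names other than "profiling" and "enable_profiler" are assumptions.
training_section = {
    # Assumed: writes a Chrome-trace style timeline file.
    "profiling": True,
    "profiling_file": "timeline.json",
    # Assumed: uses TensorFlow Profiler, with output viewed in TensorBoard.
    "enable_profiler": True,
    "tensorboard_log_dir": "log",
}

# Serialize the fragment the way it would appear inside the input JSON.
fragment = json.dumps({"training": training_section}, indent=2)
print(fragment)
```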

12 replies

npiroozan Oct 13, 2022
Author

I'm sorry, I'm a bit confused by what you mean here. To answer your question, I don't believe so.

npiroozan Oct 13, 2022
Author

The CPU I'm using has 72 cores.

njzjz Oct 13, 2022
Maintainer

You have 16 workers (via MPI), 4 streams/worker (via TF_Inter_Op), and 3 threads/stream (via OpenMP), so the total number of threads is $16 \times 4 \times 3 = 192$.
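The thread count above can be checked against the 72 cores mentioned earlier with a small sketch. The function name and the oversubscription ratio are illustrative additions, not something from deepmd-kit or Horovod.

```python
def total_threads(workers: int, streams_per_worker: int, threads_per_stream: int) -> int:
    """Total threads under the workers x streams x threads model from this thread."""
    return workers * streams_per_worker * threads_per_stream


# Values quoted in the discussion: 16 MPI workers, TF_Inter_Op=4, OpenMP=3.
threads = total_threads(16, 4, 3)
cores = 72  # core count reported for the CPU in use

# Ratio of requested threads to physical cores (above 1 means oversubscribed).
ratio = threads / cores
print(f"{threads} threads on {cores} cores, ratio {ratio:.2f}")
```

Under these numbers the run asks for more threads than there are cores, which may be worth keeping in mind when reading the timeline.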

njzjz Oct 13, 2022
Maintainer

I can see 16 HorovodAllReduce ops in the timeline, which is expected. But I don't know why Horovod uses 9 streams.
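One way to count how the allreduce ops are spread over threads is to summarize the trace file directly. The sketch below assumes timeline.json follows the Chrome trace event layout (a top-level `traceEvents` list whose complete events have `ph == "X"` plus `name`, `pid`, `tid`, and `dur` fields); the helper name and the sample data are hypothetical.

```python
from collections import defaultdict


def allreduce_time_by_thread(trace: dict, needle: str = "allreduce") -> dict:
    """Sum the duration of matching complete events per (pid, tid)."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X":  # skip metadata and other event types
            continue
        if needle in event.get("name", "").lower():
            key = (event.get("pid"), event.get("tid"))
            totals[key] += event.get("dur", 0)
    return dict(totals)


# Tiny made-up trace for illustration; a real run would instead load the file:
#   import json
#   with open("timeline.json") as f:
#       trace = json.load(f)
sample = {
    "traceEvents": [
        {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "x"}},
        {"ph": "X", "name": "HorovodAllreduce", "pid": 1, "tid": 4, "ts": 0, "dur": 500},
        {"ph": "X", "name": "MatMul", "pid": 1, "tid": 0, "ts": 0, "dur": 100},
        {"ph": "X", "name": "HorovodAllreduce", "pid": 1, "tid": 4, "ts": 600, "dur": 250},
    ]
}

result = allreduce_time_by_thread(sample)
print(result)
```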

npiroozan Oct 15, 2022
Author

Understood. Yes, that's something I'm trying to determine as well. It does not seem to be affected by using a newer version of Horovod. The image above is from Horovod 0.24.0.
