[Update] Print training log in rank0 #1296

shijungg · 2024-11-21T09:25:00Z

Currently the training log is printed on the last rank, which is inconvenient for users who change the number of GPU nodes between runs.

For example, a multi-node training job has master-0, worker-0, worker-1, and so on.
If we use 32 nodes (8 GPUs each), the training log is on worker-30.
If we next use 64 nodes (8 GPUs each), the training log is on worker-62.

Printing the training log on rank 0 is friendlier: the log is always on master-0, no matter how many nodes we use for training.
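
The arithmetic behind the example can be sketched as follows. This is only an illustration, not code from this PR: the helper names `pod_for_rank` and `log_pods` are hypothetical, and it assumes that global ranks are assigned contiguously per node, that master-0 hosts the first node's ranks, and that worker-N hosts node N+1.

```python
def pod_for_rank(rank: int, gpus_per_node: int = 8) -> str:
    """Return the pod assumed to host a global rank (hypothetical layout)."""
    # Assumption: ranks are contiguous per node, node 0 is master-0,
    # and node i (i >= 1) is worker-(i - 1).
    node_index = rank // gpus_per_node
    return "master-0" if node_index == 0 else f"worker-{node_index - 1}"


def log_pods(num_nodes: int, gpus_per_node: int = 8) -> tuple:
    """Return (pod holding rank 0, pod holding the last rank)."""
    world_size = num_nodes * gpus_per_node
    return (
        pod_for_rank(0, gpus_per_node),
        pod_for_rank(world_size - 1, gpus_per_node),
    )


for num_nodes in (32, 64):
    first, last = log_pods(num_nodes)
    print(f"{num_nodes} nodes: rank 0 on {first}, last rank on {last}")
```

Under these assumptions the pod holding rank 0 stays fixed while the pod holding the last rank moves with the node count, which is the motivation for this change.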

feature/training-log-in-rank0

5551ef4
