-
We only used 80 GPUs for pretraining. Sharding is slightly different for the pretraining datasets (which we are working on releasing) than for ERA5 finetuning. I think the maximum number of GPUs we have tested for finetuning is only 16. You want the final number of shards to be divisible by the total number of GPUs.
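For concreteness, here is a minimal sketch of how one might sanity-check a sharding configuration against the two constraints discussed in this thread; the helper and variable names are illustrative assumptions, not code from the repository:

```python
# Illustrative check of the two sharding constraints discussed in this thread.
# HOURS_PER_YEAR = 8760 is the value reported in the question below; the
# helper itself is a sketch, not part of the actual data/training scripts.

HOURS_PER_YEAR = 8760


def check_sharding(num_shards: int, num_gpus: int) -> None:
    """Report whether a (num_shards, num_gpus) pair satisfies both constraints."""
    # Constraint enforced by the data processing script (per the question):
    even_split = HOURS_PER_YEAR % num_shards == 0
    # Constraint recommended in the reply above:
    divisible_by_gpus = num_shards % num_gpus == 0

    print(f"{num_shards} shards on {num_gpus} GPUs:")
    print(f"  HOURS_PER_YEAR % num_shards == 0 -> {even_split}")
    print(f"  num_shards % num_gpus == 0       -> {divisible_by_gpus}")


# The configuration from the question: 120 % 80 != 0, which is consistent
# with some shards going unused and training ending earlier than expected.
check_sharding(num_shards=120, num_gpus=80)
```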
-
I was trying to run training on 80 GPUs.
I ran the data processing script with 120 shards because the data script enforces the condition `HOURS_PER_YEAR % num_shards == 0`. Since `HOURS_PER_YEAR` is set to 8760, the smallest permissible shard count greater than 80 is 120. However, I found that training ended early, and it looks like some of the data shards were not used.
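For reference, here is a quick way to enumerate the shard counts that this condition permits, assuming `HOURS_PER_YEAR = 8760` as above; the snippet is only illustrative, not taken from the data script:

```python
HOURS_PER_YEAR = 8760  # value used in the data processing script, per the post

# Shard counts the `HOURS_PER_YEAR % num_shards == 0` check would accept,
# i.e. the divisors of 8760, keeping only those that give at least one
# shard per GPU when using 80 GPUs.
candidates = [n for n in range(80, HOURS_PER_YEAR + 1) if HOURS_PER_YEAR % n == 0]
print(candidates)
# [120, 146, 219, 292, 365, 438, 584, 730, 876, 1095, 1460, 1752, 2190, 2920, 4380, 8760]
```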
Does the number of data shards have to match the number of GPUs?
Does one need to reprocess the data with a new shard count each time the number of GPUs changes?
Also, since only certain shard counts are permitted, does that mean the code can only be run on certain numbers of GPUs?
Can you clarify these points? I look forward to your reply.