-
We only used 80 GPUs for pretraining. Sharding is slightly different for the pretraining datasets (which we are working on releasing) than for ERA5 finetuning. I think the maximum number of GPUs we have tested for finetuning is only 16. You want the final number of shards to be divisible by the total number of GPUs.
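For concreteness, here is a minimal sketch of how one might sanity-check a sharding configuration against the two constraints discussed in this thread; the helper and variable names are illustrative assumptions, not code from the repository:

```python
# Illustrative check of the two sharding constraints discussed in this thread.
# HOURS_PER_YEAR = 8760 is the value reported in the question below; the
# helper itself is a sketch, not part of the actual data/training scripts.

HOURS_PER_YEAR = 8760


def check_sharding(num_shards: int, num_gpus: int) -> None:
    """Report whether a (num_shards, num_gpus) pair satisfies both constraints."""
    # Constraint enforced by the data processing script (per the question):
    even_split = HOURS_PER_YEAR % num_shards == 0
    # Constraint recommended in the reply above:
    divisible_by_gpus = num_shards % num_gpus == 0

    print(f"{num_shards} shards on {num_gpus} GPUs:")
    print(f"  HOURS_PER_YEAR % num_shards == 0 -> {even_split}")
    print(f"  num_shards % num_gpus == 0       -> {divisible_by_gpus}")


# The configuration from the question: 120 % 80 != 0, which is consistent
# with some shards going unused and training ending earlier than expected.
check_sharding(num_shards=120, num_gpus=80)
```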
-
I was trying to run training on 80 GPUs.
I ran the data processing script with 120 shards because the data script enforces the condition `HOURS_PER_YEAR % num_shards == 0`. Since `HOURS_PER_YEAR` is set to 8760, the smallest permissible shard count greater than 80 is 120. However, I found that training ended early, and it looks like some of the data shards were not used.
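For reference, here is a quick way to enumerate the shard counts that this condition permits, assuming `HOURS_PER_YEAR = 8760` as above; the snippet is only illustrative, not taken from the data script:

```python
HOURS_PER_YEAR = 8760  # value used in the data processing script, per the post

# Shard counts the `HOURS_PER_YEAR % num_shards == 0` check would accept,
# i.e. the divisors of 8760, keeping only those that give at least one
# shard per GPU when using 80 GPUs.
candidates = [n for n in range(80, HOURS_PER_YEAR + 1) if HOURS_PER_YEAR % n == 0]
print(candidates)
# [120, 146, 219, 292, 365, 438, 584, 730, 876, 1095, 1460, 1752, 2190, 2920, 4380, 8760]
```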
Does the number of data shards have to match the number of GPUs?
Does one need to reprocess the data with a new shard count each time the number of GPUs changes?
Also, since only certain shard counts are permitted, does that mean the code can only be run on certain numbers of GPUs?
Can you clarify these points? I look forward to your reply.