Synchronize processes when saving model#2230
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@pbielak Can you share an example of command where this issue occurs please? |
|
Sure - I found the problem when trying to run an example from the cd examples/image-classification/
pip install -r requirements.txt
python ../gaudi_spawn.py \
--world_size 8 --use_mpi run_image_classification.py \
--model_name_or_path google/vit-base-patch16-224-in21k \
--dataset_name cifar10 \
--output_dir /tmp/outputs/ \
--remove_unused_columns False \
--image_column_name img \
--do_train \
--do_eval \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 64 \
--eval_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 1337 \
--use_habana \
--use_lazy_mode False \
--torch_compile_backend hpu_backend \
--torch_compile \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 8 \
--dataloader_num_workers 1 \
--sdp_on_bf16 \
--bf16On re-runs make sure to remove the |
|
I could reproduce.
Indeed, this happens here in the call to In this case, this error only happens at the end of training but I guess it could happen in other cases that we have not seen/tested so far. I believe we can simply add if self.args.parallel_mode == ParallelMode.DISTRIBUTED:
torch.distributed.barrier()after saving the model here: https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3200 |
d6c4534 to
70179e6
Compare
When running a script on multiple HPUs using MPI, the script hangs. An investigation releaved that the problem is related to a `barrier` in the `Trainer` code [1], i.e., not all processes reach this barrier. In fact, the `self.state.best_model_checkpoint` is only set in certain process, whereas in other ones the variable is set to `None`. The variable should be set by the `_save_checkpoint` method [2] (which is in turn called by the `_maybe_log_save_evaluate` method [3]). Debugging showed that when different processes call the `os.path.exists(best_checkpoint_dir)` function [4], they get random results (sometimes it returns true, always a different subset of processes). This shows that there is a race condition on this directory check - i.e., the main process (rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory. This commit introduces a synchronization barrier before checking if the directory exists. [1] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1144 [2] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3187 [3] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1296 [4] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3206
70179e6 to
a092633
Compare
|
@regisss Before merging, we want to run some additional tests to see whether performance will be impacted by this change. I'll mark this PR as ready to review, when the test results are back. |
|
Looks like this change is not introducing any new regressions. I think we can merge it. |
…ce#647) Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com>
What does this PR do?
When running a script on multiple HPUs using MPI, the script hangs. An investigation releaved that the problem is related to a
barrierin theTrainercode [1], i.e., not all processes reach this barrier. In fact, theself.state.best_model_checkpointis only set in certain process, whereas in other ones the variable is set toNone. The variable should be set by the_save_checkpointmethod [2] (which is in turn called by the_maybe_log_save_evaluatemethod [3]). Debugging showed that when different processes call theos.path.exists(best_checkpoint_dir)function [4], they get random results (sometimes it returns true, always a different subset of processes). This shows that there is a race condition on this directory check - i.e., a process (probably rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory.This commit introduces an additional synchronization barrier before the directory existence check. In order to make sure that performance is not degraded during training, the synchronization is only performed at the training end.
[1] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1144
[2] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3187
[3] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1296
[4] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3206