Skip to content

Synchronize processes when saving model#2230

Merged
regisss merged 1 commit into
huggingface:mainfrom
HabanaAI:dev/pbielak/synchronize-processes-for-checkpoint
Sep 2, 2025
Merged

Synchronize processes when saving model#2230
regisss merged 1 commit into
huggingface:mainfrom
HabanaAI:dev/pbielak/synchronize-processes-for-checkpoint

Conversation

@pbielak
Copy link
Copy Markdown
Collaborator

@pbielak pbielak commented Aug 27, 2025

What does this PR do?

When running a script on multiple HPUs using MPI, the script hangs. An investigation releaved that the problem is related to a barrier in the Trainer code [1], i.e., not all processes reach this barrier. In fact, the self.state.best_model_checkpoint is only set in certain process, whereas in other ones the variable is set to None. The variable should be set by the _save_checkpoint method [2] (which is in turn called by the _maybe_log_save_evaluate method [3]). Debugging showed that when different processes call the os.path.exists(best_checkpoint_dir) function [4], they get random results (sometimes it returns true, always a different subset of processes). This shows that there is a race condition on this directory check - i.e., a process (probably rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory.

This commit introduces an additional synchronization barrier before the directory existence check. In order to make sure that performance is not degraded during training, the synchronization is only performed at the training end.

[1] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1144
[2] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3187
[3] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1296
[4] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3206

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss
Copy link
Copy Markdown
Collaborator

regisss commented Aug 27, 2025

@pbielak Can you share an example of command where this issue occurs please?

@pbielak
Copy link
Copy Markdown
Collaborator Author

pbielak commented Aug 27, 2025

Sure - I found the problem when trying to run an example from the image-classification README:

cd examples/image-classification/

pip install -r requirements.txt

python ../gaudi_spawn.py \
--world_size 8 --use_mpi run_image_classification.py \
--model_name_or_path google/vit-base-patch16-224-in21k \
--dataset_name cifar10 \
--output_dir /tmp/outputs/ \
--remove_unused_columns False \
--image_column_name img \
--do_train \
--do_eval \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 64 \
--eval_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 1337 \
--use_habana \
--use_lazy_mode False \
--torch_compile_backend hpu_backend \
--torch_compile \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 8 \
--dataloader_num_workers 1 \
--sdp_on_bf16 \
--bf16

On re-runs make sure to remove the /tmp/outputs directory

@regisss
Copy link
Copy Markdown
Collaborator

regisss commented Aug 28, 2025

I could reproduce.

a process (probably rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory

Indeed, this happens here in the call to self.save_model: https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3200

In this case, this error only happens at the end of training but I guess it could happen in other cases that we have not seen/tested so far. I believe we can simply add

if self.args.parallel_mode == ParallelMode.DISTRIBUTED:
    torch.distributed.barrier()

after saving the model here: https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3200
This should not make training slower as the main process will save the model anyway. WDYT?

@pbielak pbielak force-pushed the dev/pbielak/synchronize-processes-for-checkpoint branch from d6c4534 to 70179e6 Compare August 28, 2025 12:28
Comment thread optimum/habana/transformers/trainer.py
When running a script on multiple HPUs using MPI, the script hangs.
An investigation releaved that the problem is related to a `barrier`
in the `Trainer` code [1], i.e., not all processes reach this barrier.
In fact, the `self.state.best_model_checkpoint` is only set in certain
process, whereas in other ones the variable is set to `None`. The
variable should be set by the `_save_checkpoint` method [2] (which is
in turn called by the `_maybe_log_save_evaluate` method [3]). Debugging
showed that when different processes call the `os.path.exists(best_checkpoint_dir)`
function [4], they get random results (sometimes it returns true, always
a different subset of processes). This shows that there is a race
condition on this directory check - i.e., the main process (rank zero)
writes the model weights, but before it finishes other ranks try to
check for the existence of this directory. This commit introduces
a synchronization barrier before checking if the directory exists.

[1] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1144
[2] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3187
[3] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1296
[4] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3206
@pbielak pbielak force-pushed the dev/pbielak/synchronize-processes-for-checkpoint branch from 70179e6 to a092633 Compare August 28, 2025 13:16
@pbielak
Copy link
Copy Markdown
Collaborator Author

pbielak commented Aug 28, 2025

@regisss Before merging, we want to run some additional tests to see whether performance will be impacted by this change. I'll mark this PR as ready to review, when the test results are back.

@astachowiczhabana astachowiczhabana marked this pull request as ready for review September 2, 2025 13:26
@astachowiczhabana
Copy link
Copy Markdown
Collaborator

Looks like this change is not introducing any new regressions. I think we can merge it.
@regisss can you please approve and merge?

Copy link
Copy Markdown
Collaborator

@regisss regisss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@regisss regisss merged commit 72f2d81 into huggingface:main Sep 2, 2025
2 of 5 checks passed
@pbielak pbielak deleted the dev/pbielak/synchronize-processes-for-checkpoint branch September 4, 2025 08:27
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025
…ce#647)

Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants