Synchronize processes when saving model by pbielak · Pull Request #2230 · huggingface/optimum-habana

pbielak · 2025-08-27T12:50:16Z

What does this PR do?

When running a script on multiple HPUs using MPI, the script hangs. An investigation releaved that the problem is related to a barrier in the Trainer code [1], i.e., not all processes reach this barrier. In fact, the self.state.best_model_checkpoint is only set in certain process, whereas in other ones the variable is set to None. The variable should be set by the _save_checkpoint method [2] (which is in turn called by the _maybe_log_save_evaluate method [3]). Debugging showed that when different processes call the os.path.exists(best_checkpoint_dir) function [4], they get random results (sometimes it returns true, always a different subset of processes). This shows that there is a race condition on this directory check - i.e., a process (probably rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory.

This commit introduces an additional synchronization barrier before the directory existence check. In order to make sure that performance is not degraded during training, the synchronization is only performed at the training end.

[1] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1144
[2] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3187
[3] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1296
[4] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3206

HuggingFaceDocBuilderDev · 2025-08-27T12:54:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss · 2025-08-27T13:59:47Z

@pbielak Can you share an example of command where this issue occurs please?

pbielak · 2025-08-27T14:16:38Z

Sure - I found the problem when trying to run an example from the image-classification README:

cd examples/image-classification/

pip install -r requirements.txt

python ../gaudi_spawn.py \
--world_size 8 --use_mpi run_image_classification.py \
--model_name_or_path google/vit-base-patch16-224-in21k \
--dataset_name cifar10 \
--output_dir /tmp/outputs/ \
--remove_unused_columns False \
--image_column_name img \
--do_train \
--do_eval \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 64 \
--eval_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 1337 \
--use_habana \
--use_lazy_mode False \
--torch_compile_backend hpu_backend \
--torch_compile \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 8 \
--dataloader_num_workers 1 \
--sdp_on_bf16 \
--bf16

On re-runs make sure to remove the /tmp/outputs directory

regisss · 2025-08-28T08:42:54Z

I could reproduce.

a process (probably rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory

Indeed, this happens here in the call to self.save_model: https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3200

In this case, this error only happens at the end of training but I guess it could happen in other cases that we have not seen/tested so far. I believe we can simply add

if self.args.parallel_mode == ParallelMode.DISTRIBUTED:
    torch.distributed.barrier()

after saving the model here: https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3200
This should not make training slower as the main process will save the model anyway. WDYT?

When running a script on multiple HPUs using MPI, the script hangs. An investigation releaved that the problem is related to a `barrier` in the `Trainer` code [1], i.e., not all processes reach this barrier. In fact, the `self.state.best_model_checkpoint` is only set in certain process, whereas in other ones the variable is set to `None`. The variable should be set by the `_save_checkpoint` method [2] (which is in turn called by the `_maybe_log_save_evaluate` method [3]). Debugging showed that when different processes call the `os.path.exists(best_checkpoint_dir)` function [4], they get random results (sometimes it returns true, always a different subset of processes). This shows that there is a race condition on this directory check - i.e., the main process (rank zero) writes the model weights, but before it finishes other ranks try to check for the existence of this directory. This commit introduces a synchronization barrier before checking if the directory exists. [1] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1144 [2] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3187 [3] https://github.com/huggingface/optimum-habana/blob/v1.19-release/optimum/habana/transformers/trainer.py#L1296 [4] https://github.com/huggingface/transformers/blob/v4.51-release/src/transformers/trainer.py#L3206

pbielak · 2025-08-28T13:18:05Z

@regisss Before merging, we want to run some additional tests to see whether performance will be impacted by this change. I'll mark this PR as ready to review, when the test results are back.

astachowiczhabana · 2025-09-02T13:27:26Z

Looks like this change is not introducing any new regressions. I think we can merge it.
@regisss can you please approve and merge?

regisss

LGTM

…ce#647) Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com>

pbielak force-pushed the dev/pbielak/synchronize-processes-for-checkpoint branch from d6c4534 to 70179e6 Compare August 28, 2025 12:28

regisss reviewed Aug 28, 2025

View reviewed changes

Comment thread optimum/habana/transformers/trainer.py

pbielak force-pushed the dev/pbielak/synchronize-processes-for-checkpoint branch from 70179e6 to a092633 Compare August 28, 2025 13:16

astachowiczhabana marked this pull request as ready for review September 2, 2025 13:26

astachowiczhabana approved these changes Sep 2, 2025

View reviewed changes

regisss approved these changes Sep 2, 2025

View reviewed changes

regisss merged commit 72f2d81 into huggingface:main Sep 2, 2025
2 of 5 checks passed

pbielak deleted the dev/pbielak/synchronize-processes-for-checkpoint branch September 4, 2025 08:27

astachowiczhabana pushed a commit that referenced this pull request Sep 4, 2025

Synchronize processes when saving model (#2230)

efb0870

astachowiczhabana pushed a commit that referenced this pull request Sep 17, 2025

Synchronize processes when saving model (#2230)

681438f

gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025

Synchronize processes when saving model (huggingface#2230) (huggingfa…

408d677

…ce#647) Co-authored-by: Piotr Bielak <pbielak@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Synchronize processes when saving model#2230

Synchronize processes when saving model#2230
regisss merged 1 commit into
huggingface:mainfrom
HabanaAI:dev/pbielak/synchronize-processes-for-checkpoint

pbielak commented Aug 27, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Aug 27, 2025

Uh oh!

regisss commented Aug 27, 2025

Uh oh!

pbielak commented Aug 27, 2025 •

edited

Loading

Uh oh!

regisss commented Aug 28, 2025

Uh oh!

Uh oh!

pbielak commented Aug 28, 2025

Uh oh!

astachowiczhabana commented Sep 2, 2025

Uh oh!

regisss left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

pbielak commented Aug 27, 2025

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Aug 27, 2025

Uh oh!

regisss commented Aug 27, 2025

Uh oh!

pbielak commented Aug 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

regisss commented Aug 28, 2025

Uh oh!

Uh oh!

pbielak commented Aug 28, 2025

Uh oh!

astachowiczhabana commented Sep 2, 2025

Uh oh!

regisss left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pbielak commented Aug 27, 2025 •

edited

Loading