Conversation

@Liangliang-Ma (Contributor)

In the current training_args, self._n_gpu is set to device_count on XPU, which causes a crash on multi-XPU machines.
In Trainer, if self.args.n_gpu is greater than one, the model is wrapped in torch.nn.DataParallel. But IPEX (intel_extension_for_pytorch) does not support DataParallel and suggests using DDP instead.
So to make the Hugging Face Trainer work on Intel devices, this fix should be applied.
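
For context, a minimal sketch of the device-setup behavior being described, assuming IPEX registers the torch.xpu namespace; the function name is hypothetical and this is not the actual transformers source:

```python
import torch

def n_gpu_for_xpu() -> int:
    # Hypothetical helper illustrating the fix, not the real
    # TrainingArguments code.
    try:
        import intel_extension_for_pytorch  # noqa: F401  (registers torch.xpu)
    except ImportError:
        return 0
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        # Before the fix: return torch.xpu.device_count(). With more than
        # one card, Trainer then wraps the model in torch.nn.DataParallel,
        # which IPEX does not support, and training crashes.
        # After the fix: always 1, so the DataParallel path is never taken
        # and multi-card training goes through DDP instead.
        return 1
    return 0
```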

@ArthurZucker (Collaborator) left a comment

Hey! This seems to have been introduced in #25714, so I am not convinced the fix is as simple as that!
Would you mind also sharing a reproducer of the issue? It might be related to specific hardware / software versions.

@Liangliang-Ma (Contributor, Author) commented Nov 28, 2023

> Hey! This seems to have been introduced in #25714, so I am not convinced the fix is as simple as that! Would you mind also sharing a reproducer of the issue? It might be related to specific hardware / software versions.

Hi, @ArthurZucker! I discussed this issue with @abhilash1910 before opening this PR, and we agreed it can be fixed like this. It can be reproduced in an environment with multiple Intel GPUs, where accelerate sets the DistributedType to MULTI_XPU. Most example scripts that use Trainer on more than one GPU should serve as a reproducer.
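
A hypothetical minimal reproducer along those lines, assuming a machine with more than one Intel GPU and intel_extension_for_pytorch installed (the output directory is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(output_dir="out")
# Before the fix this printed the XPU device count; with >1 card, any
# Trainer(model=..., args=args).train() call then takes the
# torch.nn.DataParallel wrapping path and crashes under IPEX.
# After the fix it prints 1 on XPU.
print(args.n_gpu)
```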

@abhilash1910 (Contributor) left a comment

Thanks @Liangliang-Ma for addressing this.
@ArthurZucker yes, the current limitation of pure DataParallel on IPEX only allows us to use n_gpu=1.
Could you help re-trigger the CI test? Thanks
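
For reference, the Trainer-side condition being worked around looks roughly like this (a simplified sketch with a hypothetical helper name, not the exact transformers source):

```python
from torch import nn

def maybe_wrap_data_parallel(model: nn.Module, n_gpu: int) -> nn.Module:
    # Simplified sketch of the wrapping step in Trainer: with n_gpu
    # forced to 1 on XPU, this branch is never taken, so IPEX never
    # sees torch.nn.DataParallel.
    if n_gpu > 1:
        model = nn.DataParallel(model)
    return model
```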

@ArthurZucker (Collaborator) left a comment

Alright, thanks all for checking; triggering the CI.

@Liangliang-Ma (Contributor, Author)

@ArthurZucker Seems the CI got blocked by a network issue. Could you please check that? Thanks

@ArthurZucker (Collaborator)

Seems like I can't; would you mind merging with main to trigger it?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker merged commit 9ddbb69 into huggingface:main on Dec 1, 2023
@ArthurZucker (Collaborator)

Thanks both 🤗
