Fix T5 beam search when using parallelize #11717
Conversation
@OyvindTafjord Hi, I am trying to figure out how to use model parallelization on T5 but am running into some problems. I tried to reproduce your result but got the following error:
Could you please help me figure out the problem and point me in the right direction? I don't have much experience with model parallelization; do I need to modify the …? Thanks in advance.
@bing0037 Hm, I tested with 4.7.0 now and the above code works for me. I noticed my initial set of commands was missing the critical … — you could double-check that.
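The exact commands aren't reproduced here, but a minimal sketch of the kind of setup under discussion (a T5 model split across GPUs with `model.parallelize()`, then generating with beam search) might look like the following; the checkpoint name, prompt, and generation settings are illustrative assumptions, and at least two GPUs are assumed to be available.

```python
# Sketch only: T5 parallelized across GPUs, then beam-search generation.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Spread the encoder/decoder blocks across the visible GPUs.
model.parallelize()

inputs = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).to("cuda:0")  # the first device holds the embedding/input side

# Beam search exercises _reorder_cache, which is where the multi-GPU crash occurred.
with torch.no_grad():
    output_ids = model.generate(**inputs, num_beams=4, max_length=40)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```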
@OyvindTafjord Thank you for your reply. The problem was an inconsistency in my command; the command above works well now.
Could you point me to some resources on training with model parallelization that I could refer to? Thank you!
@bing0037 I haven't tried the parallelize functionality in the context of training, so I'm not of much help on that.
* Fix LLaMa beam search when using parallelize, same issue as T5 #11717
* Fix code format in modeling_llama.py
* Fix format of _reorder_cache in modeling_llama.py
What does this PR do?
As requested by @patrickvonplaten in the conversation on issue #9200, this fixes a crash when trying to use beam search on T5 models split across multiple GPUs with `model.parallelize()`. It applies the fix from #9219 to the T5-specific code (also related is #9596, which refactored the `_reorder_cache` functions).

I tested the fix on a t5-small model. Before:
After:
As far as I know, this small fix shouldn't have any adverse effects. As to why the tests added in #9219 didn't catch this, it's possibly because they aren't generally run in multi-GPU setups.
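For reference, the essence of the change is to move `beam_idx` onto the device of each cached state before calling `index_select`, since after `model.parallelize()` the cached key/value tensors can live on different GPUs. The standalone `reorder_cache` helper below is an illustrative sketch of that idea, not the exact diff applied to `modeling_t5.py`.

```python
# Sketch of the core idea behind the fix (not the exact diff): when cached
# key/value states live on different GPUs, the beam indices must be moved
# to each state's device before index_select.
from typing import Tuple

import torch


def reorder_cache(
    past: Tuple[Tuple[torch.Tensor, ...], ...], beam_idx: torch.Tensor
) -> Tuple[Tuple[torch.Tensor, ...], ...]:
    reordered = ()
    for layer_past in past:
        # Each tensor in layer_past may sit on a different device, so send
        # beam_idx to that tensor's device rather than assuming they match.
        reordered += (
            tuple(
                state.index_select(0, beam_idx.to(state.device))
                for state in layer_past
            ),
        )
    return reordered
```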