Fix BLOOM DeepSpeed inference issue #18139
Conversation
* remove element wise multiplication after softmax
* fix tolerance for a bloom slow test
* enhance alibi padding
  - get rid of for loops
  - deals better with padded batched input
  - avoid useless cpu/gpu communication when creating alibi
  Co-authored-by: justheuristic <justheuristic@gmail.com>
* optimize attention mask
* fix scaled softmax limit values
* optimize building alibi tensor
  Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* fix attention_mask shape when it's None
* minor fixes
  - fix docstring + arg names
* remove colons in docstring
* Apply suggestions from code review
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* apply suggestion
* remove unused arg
* refactor a bit
  - use [:, None] for consistency
* refactor attention block
  Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
* quick fixes
* first attempt
* refactor attention block and fix all tests except "test_simple_generation"
  - added comments to better explain attention block
* remove debug lines and add TODO comment
* change `torch.bmm` to `torch.baddbmm`
  - fixes `test_simple_generation` but breaks `test_batch_generation_padd`
* styling
* all tests are passing now
  - use `bmm`
  - add explanation for `allow_fp16_reduced_precision_reduction`
  Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* styling
  Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* fix support for accelerate
  Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* remove attn softmax in fp32
* refactor comments
* refactor a bit
  - remove warning message
  - remove print on test
* refer to pytorch t5
* change the slow tests
  - do the tests in fp32
  - remove some comments
  - keep large comments
* update expected output for `test_simple_generation`
  - we now test using fp32
* make style + change comments a bit
* fix dtype padd test

Co-authored-by: justheuristic <justheuristic@gmail.com>
Co-authored-by: Nouamane Tazi <nouamane98@gmail.com>
Co-authored-by: Younes Belkada <younesbelkada@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
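A rough sketch of the `torch.bmm` → `torch.baddbmm` change mentioned in the commit notes above: the ALiBi bias can be fused into the query–key product in a single call instead of a separate `bmm` followed by an addition. The function name, shapes, and scaling factor below are illustrative assumptions, not the exact code of this PR.

```python
import torch

def attention_scores_with_alibi(query, key_t, alibi, inv_norm_factor):
    # query: (batch * num_heads, q_len, head_dim)
    # key_t: (batch * num_heads, head_dim, k_len)  -- keys already transposed
    # alibi: (batch * num_heads, 1, k_len)         -- broadcast over q_len
    #
    # baddbmm computes beta * alibi + alpha * (query @ key_t) in one fused call,
    # instead of a separate bmm followed by an addition of the ALiBi bias.
    return torch.baddbmm(alibi, query, key_t, beta=1.0, alpha=inv_norm_factor)

# toy usage with made-up shapes
bh, q_len, k_len, head_dim = 4, 5, 5, 8
q = torch.randn(bh, q_len, head_dim)
k_t = torch.randn(bh, head_dim, k_len)
alibi = torch.randn(bh, 1, k_len)
scores = attention_scores_with_alibi(q, k_t, alibi, inv_norm_factor=head_dim ** -0.5)
print(scores.shape)  # torch.Size([4, 5, 5])
```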
@RezaYazdaniAminabadi did you try running inference after removing the elementwise multiplication after the softmax, as proposed in the PR?
I did try this on 16 A100-40GB GPUs previously and it was not giving similar results. I will try with this one and let you know. Anyhow, I think that multiplication is not needed since the scores are already masked.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thank you very much @RezaYazdaniAminabadi!!
Finally, after doing some tests, it appears that we need the multiplication with the attention mask because of the following: after replacing all zeros by
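For context, here is a tiny, self-contained illustration of why the post-softmax multiplication matters for fully padded rows (the mask convention and values are made up for the example, not taken from the modeling code): when every position in a row is masked with the dtype minimum, the softmax returns a uniform distribution rather than zeros, and multiplying by the mask afterwards zeroes that row out.

```python
import torch

# One query row where every key position is padding.
scores = torch.tensor([[1.0, 2.0, 3.0]])
mask = torch.tensor([[0.0, 0.0, 0.0]])   # 0 = masked, 1 = keep (illustrative convention)

masked_scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
probs = torch.softmax(masked_scores, dim=-1)
print(probs)          # tensor([[0.3333, 0.3333, 0.3333]]) -- uniform, not zero
print(probs * mask)   # tensor([[0., 0., 0.]]) -- the elementwise multiplication restores zeros
```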
@younesbelkada, ok, so we have the first row of

Let's perhaps use a small concrete example and use it to document why things are done the way they are - otherwise everybody will keep on questioning why this is done this way.
Is this because of padding? We should not care about the padding row, i.e. when the padding token is the query. The wrong values don't matter when they are in the padding, no?
My guess was that this would impact the computation of the
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR tries to address a strange behaviour observed when running inference with the BLOOM-176B model using DeepSpeed.
My intuitions are:
- We use `-10000` as the attention mask filling value, whereas we should use `fp32.min` as is done in the original CUDA kernel of `FusedScaledSoftmax`. This might lead to inconsistent results between the old version and the new version, but the new version should be considered the correct one (see the sketch below).

cc @RezaYazdaniAminabadi @stas00 @thomasw21
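To make that intuition concrete, here is a small, contrived sketch (the score values are exaggerated on purpose and this is not the actual modeling or kernel code): an additive mask of `-10000` can fail to suppress a masked position if the raw score there is large, while filling with the dtype's minimum always drives its probability to zero.

```python
import torch

dtype = torch.float32
raw_scores = torch.tensor([[12000.0, 5.0]], dtype=dtype)  # exaggerated score at a masked position
mask = torch.tensor([[0.0, 1.0]], dtype=dtype)            # position 0 is masked

# Additive mask of -10000: not enough to suppress an extreme masked score.
bias_10k = (1.0 - mask) * -10000.0
print(torch.softmax(raw_scores + bias_10k, dim=-1))  # masked position still gets ~1.0

# Additive mask with the dtype minimum: the masked position is always ~0.
bias_min = (1.0 - mask) * torch.finfo(dtype).min
print(torch.softmax(raw_scores + bias_min, dim=-1))  # ~tensor([[0., 1.]])
```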