Prefix lm #52
Conversation
remove erroneous arg.
```python
if prefix_indices is not None and (reset_attention_mask is False):
    assert isinstance(prefix_indices[b], int), \
        f"prefix for a row has to be row specific, and consequently return an int, got {prefix_indices[b]}"
    attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1
```
just to make sure I understand, do you have one prefix index per batch or one prefix index per instance?
So there are two cases:
- if `reset_attention_mask` is True: I make use of EOD, so I might end up with multiple prefixes in a row.
- if `reset_attention_mask` is False: then I treat the whole row as a document, so you end up with as many prefixes as you have rows in your batch.

My code might be confusing because of `for batch_id in micro_batch_size`, which should probably read `for row in micro_batch_size`.
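To make the per-row case concrete, here is a minimal sketch of how a row-dependent prefix-LM mask could be built, assuming one prefix index per row (the `reset_attention_mask is False` case). The helper name and signature are hypothetical, not the repo's actual function; only the indexing pattern mirrors the snippet above.

```python
import torch

def build_prefix_lm_masks(micro_batch_size, seq_length, prefix_indices):
    # Start from a causal (lower-triangular) mask per row: 1 = may attend, 0 = masked.
    mask = torch.tril(torch.ones(micro_batch_size, 1, seq_length, seq_length))
    for b in range(micro_batch_size):
        p = prefix_indices[b]
        assert isinstance(p, int), "expected one prefix index per row"
        # Tokens inside the prefix attend to each other bidirectionally.
        mask[b, 0, :p, :p] = 1
    return mask

# Example: two rows with prefixes of length 3 and 5 over an 8-token sequence.
masks = build_prefix_lm_masks(2, 8, [3, 5])
```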
In the `reset_attention_mask is False` case, the way I am reading it is that there is one attention mask per batch (not per row); check here:
Megatron-DeepSpeed/megatron/utils.py, line 173 in 28a712d:
```python
att_mask_batch = 1
```
You see that the mask shape is [1, 1, seqlen, seqlen]. So here:
Megatron-DeepSpeed/megatron/utils.py, line 235 in 28a712d:
```python
attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1
```
this line shouldn't work, because `b` can go past 0 while the mask only has one batch slot.
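A toy illustration of the shape problem described above (purely illustrative, not repo code): with a single mask slot, any row index past 0 raises an `IndexError`.

```python
import torch

seq_length = 8
micro_batch_size = 2
# att_mask_batch == 1 -> a single shared mask of shape [1, 1, seqlen, seqlen]
attention_mask = torch.tril(torch.ones(1, 1, seq_length, seq_length))

prefix_indices = [3, 5]  # one prefix per row, but only one mask slot
try:
    for b in range(micro_batch_size):
        attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1
except IndexError as err:
    print("fails as soon as b == 1:", err)
```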
Ah, nice catch! So I'm pretty sure we want to remove the logic here:
Megatron-DeepSpeed/megatron/utils.py, lines 170 to 173 in 28a712d:
```python
if reset_attention_mask:
    att_mask_batch = micro_batch_size
else:
    att_mask_batch = 1
```
Essentially I'd say this is an optimisation: the masking is batch-size independent in GPT, but it would be batch-size dependent in prefix LM. We could also keep it batch-size independent, I guess, by sampling a single prefix index for the whole batch. WDYT? 0cdb0a941ddeefbdb1ccab3598f1e34bb38c35a3
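For illustration only, one way the selection could account for prefixes; this is a sketch with a made-up helper name, not the change in the commit referenced above.

```python
def get_att_mask_batch(micro_batch_size, reset_attention_mask, prefix_indices):
    """Hypothetical helper: decide how many mask slots to allocate.

    GPT's causal mask is identical for every row, so one shared mask is enough;
    per-row prefixes (or per-row EOD resets) make the mask row-specific.
    """
    if reset_attention_mask or prefix_indices is not None:
        return micro_batch_size
    return 1

assert get_att_mask_batch(4, False, None) == 1           # plain GPT: shared mask
assert get_att_mask_batch(4, False, [3, 5, 2, 7]) == 4   # prefix LM: per-row masks
```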
I am fine with a single prefix index per batch.
Hmm, this made the prefix row-dependent: thomasw21@0cdb0a9. I can revert to handling a single index for the whole batch, as you suggested.
Do you have insights as to why one would work better than the other?
Being row-dependent is better, assuming it doesn't increase time and memory that much.
Okay, let's keep this as is and monitor whether we're much slower compared to GPT. Bear in mind:
- prefix LM is not using a custom CUDA kernel anymore, so some loss in time/memory is expected.

Maybe we can test on a 350m version and then switch if we think a single index might be faster?
* WIP: test
* Still trying to figure out deepspeed
* WIP
* Test test
* Test how to setup deepspeed in unit tests
* Test something else
* Empty strings might be problematic
* Remove unecessary arguments
* Woops
* Remove global variables at the end of each test and init deepspeed
* Woops
* Maybe adding classmethod
* Woops
* Add debug print to check that tear down happends
* Reset global variables before
* Let's test this
* Try something else
* WIP
* More fix
* More fix
* More stuff to fix
* We really want to compare vectors and not coordinates
* Reformat
* check something out
* fix test
* Remove prefix-lm flag as it's integrated
* Woops
* Add test for without reset attention mask
* Fix test for non reset attention mask
* Fix test
Btw, I've merged the set of tests here: thomasw21@295e8d0. There are some unrelated tests (rotary and gpt). Feel free to disregard, as they are tests (which means if they pass, good; if they don't, we should look into it). Also, @ibeltagy, I've kept the [...]. I'm awaiting final review 😃
Ok. Can you confirm that it still matches the [...]?
The larger the model, the more realistic the time estimates are. Let's just start the 1.3B model and see how it goes. We can stop it early if it doesn't seem promising.
Well, it doesn't match, because there is some prefix-LM-specific code (generating the prefix, for example) and some GPT-specific code was removed. Running [...]
Ok.
The CI failed 4 tests: why was it merged with the errors?
I think GitHub Actions is showing the wrong link to the CI reports; it should be showing the origin, but instead it links to a user workflow that shouldn't even run. I had to manually find it via Actions. So the report is in the link above.
I'll disable these tests for now, so that the test suite can be used for testing other PRs. One note though: it fails with a "Please use [...]" error.
Yeah, I saw, I'll fix this. I have to use an alternative method to [...]. Thank you for noticing!
Ok, so I won't add the skipping then, if you're taking care of it. Thank you, Thomas!
And I know what you mean, though. It's much harder to access internal data using the launcher approach. Perhaps use a single GPU then? This is how we do it in transformers for all basic tests, and then no launcher/MPI is needed. That is, if a single GPU is sufficient for the testing you have in mind.
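For example, a minimal launcher-free single-GPU test could look like this; a hypothetical pytest sketch, not taken from this repo's suite.

```python
import pytest
import torch

@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a single GPU")
def test_forward_on_single_gpu():
    # Everything runs in-process, so internal tensors are easy to inspect;
    # no torch.distributed / MPI launcher is required.
    model = torch.nn.Linear(16, 16).cuda()
    x = torch.randn(2, 16, device="cuda")
    out = model(x)
    assert out.shape == (2, 16)
```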
Hmm, yes it is, but I'm trying to reproduce the error right now (I might have an outdated version of deepspeed), with little success.
I don't have [...]
Perhaps you're using a 1-GPU setup, and thus deepspeed is not trying to resolve the multi-GPU env automatically.
Yikes. Back-compat issue, as they dropped `assert_equal` in pt-1.9. I will move it into our testing.py and make sure it works with any pt version.
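A possible shape for such a shim, sketched under the assumption that newer PyTorch provides `torch.testing.assert_close`; the actual helper in testing.py may differ.

```python
import torch

def torch_assert_equal(actual, expected, msg=None):
    # Hypothetical version-tolerant shim: torch.testing.assert_close only
    # exists in newer PyTorch, so fall back to torch.equal on older versions.
    if hasattr(torch.testing, "assert_close"):
        # rtol=0 / atol=0 turns assert_close into an exact-equality check.
        torch.testing.assert_close(actual, expected, rtol=0, atol=0, msg=msg)
    else:
        assert torch.equal(actual, expected), msg or "tensors are not equal"
```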
Fixed in #106
* ICT zeroshot evaluation code
* made more generic, aligned with other tasks
* Fixed based on review recoemmendation
* fixed another issue
* implementing DPR
* implementation dpr
* adding dpr code
* removed commnets
* removed commnets
* removed commnets
* DPR evaluation debugging
* DPR ongoing
* DPR finetune and evaluation
* fixing model evaluation of retriver
* added pre ad post process
* added pre ad post process
* evaluation works!
* debugging DPR
* fix copy-n-paste error remove erroneous arg.
* Typo fix in readme
* t5 fixes
* before cleaning the comments
* vit pipeline fixes
* cleaning the code
* additional cleaning
* renaming the folders
* Add temporary assert to finetuning until it can be fixed.
* Fixed issues with ICT pretraining
* updated the evaluation script for retriver
* updated the evaluation script for retriver
* updated the evaluation script for retriver
* updated the evaluation script for retriver
* added exit interval for finetuning
* updating the scripts
* updating no load rng
* updating script
* Update T5 scripts
* resolved hang issue
* fixed the tensor size miss-mass issue
* fixed the evaluation hangs
* Adding readme
* Adding readme
* Adding readme
* Adding readme
* Adding readme
* Adding readme
* Adding readme
* Adding readme
* Clean up README.md a bit
* addressed comments
* updated readme
* updated readme
* updated readme
* updated readme
* Basic handling of prefix lm by updating the mask
* Add prefix option to gpt temporarily and prevent it to use custom kernel
* Add argument for prefix lm, in order to configure masking strategy
* Woops
* loss_on_targets_only flag, assert that current prefix implementation only works with reset_attention_mask set to True and attempt to fix empty slice issue
* Format
* Reverse renaming
* Allow prefix on partial document at the end
* WIP: add prefix per row feature
* Document the use of None
* Woops
* Handle empty document better
* We might not be able to concat empty tensors
* Handle empty tensor seperately
* Debug
* Test
* Add loss masking as script argument
* Turns out deepspeed integration of attention matrices prevented dynamic masks
* Add more asserts
* Prefix can only see the prefix, it cannot see target
* Remove prefix-lm argument as we split the pretrain script
* Iz PR review
* Make masking row dependent when using prefix
* Revert "Merge remote-tracking branch 'origin/master' into prefix_lm" — This reverts commit d49d6e5, reversing changes made to 28a712d.
* Tests (bigscience-workshop#1)
  * WIP: test
  * Still trying to figure out deepspeed
  * WIP
  * Test test
  * Test how to setup deepspeed in unit tests
  * Test something else
  * Empty strings might be problematic
  * Remove unecessary arguments
  * Woops
  * Remove global variables at the end of each test and init deepspeed
  * Woops
  * Maybe adding classmethod
  * Woops
  * Add debug print to check that tear down happends
  * Reset global variables before
  * Let's test this
  * Try something else
  * WIP
  * More fix
  * More fix
  * More stuff to fix
  * We really want to compare vectors and not coordinates
  * Reformat
  * check something out
  * fix test
  * Remove prefix-lm flag as it's integrated
  * Woops
  * Add test for without reset attention mask
  * Fix test for non reset attention mask
  * Fix test
* Update code for prefix lm

Co-authored-by: Mostofa Patwary <mostofa.patwary@gmail.com>
Co-authored-by: Mostofa Patwary <mpatwary@nvidia.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Devrim <46989091+devrimcavusoglu@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Vijay Korthikanti <vkorthikanti@nvidia.com>
Co-authored-by: Jared Casper <jcasper@nvidia.com>
Co-authored-by: Mohammad Shoeybi <mshoeybi@nvidia.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
This reverts commit 4059e57.

Support for prefix-lm
We provide basic support for prefix-lm:
Notable choices:
TODO:
- Allow prefix on other scripts than pretrain_gpt.py, i.e. evaluation scripts of such. Need to have a script to evaluate using prefix lm. (Probably on another PR)
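For reference, a rough sketch of the loss-masking idea behind the `loss_on_targets_only` option mentioned in the commit history; this is an assumption about the behaviour, not the repo's exact implementation.

```python
import torch

def prefix_lm_loss_mask(seq_length, prefix_index, loss_on_targets_only=True):
    # 1.0 where the position contributes to the loss, 0.0 where it does not.
    loss_mask = torch.ones(seq_length)
    if loss_on_targets_only:
        # The (bidirectionally visible) prefix positions carry no loss.
        loss_mask[:prefix_index] = 0.0
    return loss_mask

print(prefix_lm_loss_mask(8, 3))  # -> tensor([0., 0., 0., 1., 1., 1., 1., 1.])
```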