
Conversation

@thomasw21 thomasw21 (Member) commented Aug 5, 2021

Support for prefix-lm

We provide basic support for prefix-lm:

  • Randomly select a split point for each document in the unsupervised setting
  • Support per-document prefixes (see the mask sketch below)
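
For intuition, a minimal sketch of the per-row prefix-LM mask is shown below (editorial, not the PR's code; the names and the True-means-may-attend convention are illustrative):

import torch

def prefix_lm_mask(seq_length: int, prefix_index: int) -> torch.Tensor:
    # Start from a standard causal (lower-triangular) mask:
    # position i may attend to positions <= i.
    mask = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))
    # Inside the prefix, attention is bidirectional: every prefix
    # position may attend to every other prefix position.
    mask[:prefix_index, :prefix_index] = True
    return mask  # True = may attend

# Example: 6 tokens, the first 3 form the bidirectional prefix.
print(prefix_lm_mask(6, 3).int())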

Notable choices:

TODO:

  • Support Megatron-style prefixes, i.e. consider the entire sequence as the document instead of splitting sequences into documents. Pending discussion.
  • Allow prefixes in scripts other than pretrain_gpt.py, i.e. evaluation scripts and the like. (Probably in another PR.) We need a script to evaluate using prefix-lm.
  • Create a parsing mechanism to obtain the prefix split on labeled data. Typically, we should feed the prompt as the prefix and generate the target. (Probably in another PR.)
  • Probably support one last split after the last eod if possible.
  • Write tests. TBD: how do we want to set up tests in this repo, unit tests or end-to-end? You can find some tests in thomasw21/Megatron-DeepSpeed#1, which will be merged after this branch is merged.

# When the attention mask is not reset on eod, each row is a single
# document, so each row carries exactly one prefix index.
if prefix_indices is not None and (reset_attention_mask is False):
    assert isinstance(prefix_indices[b], int), \
        f"prefix for a row has to be row specific, and consequently return an int, got {prefix_indices[b]}"
    # Make the prefix block bidirectional for row b.
    attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1

Member

just to make sure I understand, do you have one prefix index per batch or one prefix index per instance?

Member Author (@thomasw21)

So there are two cases:

  • if reset_attention_mask is True: I make use of eod, so I might end up with multiple prefixes in a row
  • if reset_attention_mask is False: then I treat the row as a document, so you end up with as many prefixes as you have rows in your batch.

My code might be confusing because of the loop "for batch_id in micro_batch_size", which should probably read "for row in micro_batch_size" instead.
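
Editorial illustration of the two cases (the container shapes are an assumption based on this comment, not necessarily the PR's exact data structures):

# reset_attention_mask is False: one prefix index per row of the batch.
prefix_indices = [5, 2, 7, 3]                     # len == micro_batch_size

# reset_attention_mask is True: eod splits each row into documents, so a
# row can carry several prefix indices, one per document.
prefix_indices = [[5, 12], [2], [7, 9, 15], [3, 11]]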

@ibeltagy ibeltagy (Member) Sep 15, 2021

In the case where reset_attention_mask is False, the way I am reading it, there is one attention mask per batch (not per row); check here:

att_mask_batch = 1

you can see that the mask shape is [1, 1, seqlen, seqlen]. So this line:

attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1

shouldn't work, because b can exceed 0 while the mask's batch dimension has size 1.
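
A minimal editorial repro of the shape problem (values are illustrative):

import torch

seq_length = 8
# att_mask_batch == 1: a single mask shared across the whole batch.
attention_mask = torch.tril(torch.ones(1, 1, seq_length, seq_length))

b = 2  # any row index > 0...
attention_mask[b, 0, :5, :5] = 1  # ...raises IndexError: dim 0 has size 1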

@thomasw21 thomasw21 (Member Author) Sep 16, 2021

Ah, nice catch! So I'm pretty sure we want to remove the logic here:

if reset_attention_mask:
    att_mask_batch = micro_batch_size
else:
    att_mask_batch = 1

Essentially I'd say this is an optimisation: the masking is batch-size independent in GPT, but would be batch-size dependent in prefix-lm. We could also stay batch-size independent, I guess, by sampling a single prefix index for the whole batch. WDYT? 0cdb0a941ddeefbdb1ccab3598f1e34bb38c35a3
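
Sketched out, removing that optimisation amounts to something like this (editorial; the actual change is in the commit referenced above):

import torch

micro_batch_size, seq_length = 4, 8   # illustrative values
prefix_indices = [5, 2, 7, 3]         # one prefix index per row

# Allocate one mask per row instead of sharing a single
# [1, 1, seq, seq] mask, so each row can get its own prefix block.
att_mask_batch = micro_batch_size
attention_mask = torch.tril(torch.ones(
    att_mask_batch, 1, seq_length, seq_length))

for b in range(micro_batch_size):
    attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1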

Member

I am fine with a single prefix index per batch.

@thomasw21 thomasw21 (Member Author) Sep 16, 2021

Hmm, this made the prefix row-dependent: thomasw21@0cdb0a9. I can revert to handling a single index for the whole batch, as you suggested.

Do you have insights as to why one would work better than the other?

Member

Being row-dependent is better, assuming it doesn't increase time and memory that much.

@thomasw21 thomasw21 (Member Author) Sep 16, 2021

Okay, let's keep this as is and monitor whether we're much slower compared to GPT. Bear in mind:

  • prefix-lm is not using a custom CUDA kernel anymore, so some loss in time/memory is expected

Maybe we can test on a 350m version and then switch if we think a single index might be faster?

* WIP: test

* Still trying to figure out deepspeed

* WIP

* Test test

* Test how to setup deepspeed in unit tests

* Test something else

* Empty strings might be problematic

* Remove unnecessary arguments

* Woops

* Remove global variables at the end of each test and init deepspeed

* Woops

* Maybe adding classmethod

* Woops

* Add debug print to check that tear down happens

* Reset global variables before

* Let's test this

* Try something else

* WIP

* More fix

* More fix

* More stuff to fix

* We really want to compare vectors and not coordinates

* Reformat

* check something out

* fix test

* Remove prefix-lm flag as it's integrated

* Woops

* Add test for without reset attention mask

* Fix test for non reset attention mask

* Fix test

@thomasw21 thomasw21 (Member Author) commented Sep 16, 2021

Btw I've merged the set of tests here thomasw21@295e8d0

There are some unrelated tests (rotary and gpt). Feel free to disregard them; they are tests, which means if they pass, good, and if they don't, we should look into it.

Also, @ibeltagy, I've kept the pretrain_prefix script separate; are you fine with that, or would you still want to refactor everything? There are a number of things that differ between the scripts, which motivated me to split the files.

I'm awaiting final review 😃

@ibeltagy ibeltagy (Member)

kept the pretrain_prefix script separate

ok. Can you confirm that it still matches the pretrain_gpt script before merging?

Maybe we can test on a 350m version and then switch if we think a single index might be faster?

The larger the model the more realistic the time estimates are. Let's just start the 1.3B model and see how it goes. We can stop it early if it doesn't seem promising.

@ibeltagy ibeltagy self-requested a review September 16, 2021 15:22

@thomasw21 thomasw21 (Member Author) commented Sep 16, 2021

ok. Can you confirm that it still matches the pretrain_gpt script before merging?

Well, it doesn't match exactly, because there is some prefix-lm-specific code (generating the prefix, for example) and some GPT-specific code was removed.

Running diff pretrain_gpt.py pretrain_prefix_lm.py returns the following:

36d35
< 
53c52,53
<                 parallel_output=True
---
>                 parallel_output=True,
>                 prefix_lm=True
59,75d58
<             # Precompute the attention mask and store it in args. This avoids having to
<             # pipeline it as an activation during training. The mask is constant, and thus
<             # we can reuse it.
<             attention_mask = torch.tril(torch.ones(
<                 (1, args.seq_length, args.seq_length), device=torch.cuda.current_device())).view(
<                     1, 1, args.seq_length, args.seq_length)
< 
<             # Convert attention mask to binary:
<             attention_mask = (attention_mask < 0.5)
<             if args.fp16:
<                 attention_mask = attention_mask.half()
<             elif args.bf16:
<                 attention_mask = attention_mask.bfloat16()
< 
<             # must be bool or the training crashes expecting bool, but getting Half
<             args.attn_mask = attention_mask.to(torch.bool)
< 
81c64,65
<                 post_process=post_process
---
>                 post_process=post_process,
>                 prefix_lm=True
107a92,99
>     # Prefix
>     prefix_indices = get_prefix_indices(
>         tokens,
>         tokenizer.eod,
>         partial_prefix_indices=None,
>         reset_attention_mask=args.reset_attention_mask
>     )
> 
115c107
<         prefix_indices=None,
---
>         prefix_indices=prefix_indices,
121d112
< 
138a130,137
>     # Prefix
>     prefix_indices = get_prefix_indices(
>         tokens,
>         tokenizer.eod,
>         partial_prefix_indices=None,
>         reset_attention_mask=args.reset_attention_mask
>     )
> 
146c145
<         prefix_indices=None,
---
>         prefix_indices=prefix_indices,
150,151c149
<     return (tokens, position_ids, attention_mask), (labels, loss_mask)
< 
---
>     return (tokens, position_ids, attention_mask), (labels, loss_mask), prefix_indices
199d196
< 
204d200
< 

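For readers following the diff, here is a rough editorial sketch of the kind of sampling get_prefix_indices performs, pieced together from the discussion above (the repo's actual implementation may differ):

import torch

def get_prefix_indices(tokens, eod_token, partial_prefix_indices, reset_attention_mask):
    # Sketch only; partial_prefix_indices=None is read as "sample every
    # split point randomly". tokens is [micro_batch_size, seq_length].
    prefix_indices = []
    for b in range(tokens.size(0)):
        if reset_attention_mask:
            # eod splits the row into documents; sample one prefix index
            # inside each (non-empty) document.
            starts = [0] + (torch.nonzero(tokens[b] == eod_token).flatten() + 1).tolist()
            ends = starts[1:] + [tokens.size(1)]
            prefix_indices.append([int(torch.randint(s, e, (1,)))
                                   for s, e in zip(starts, ends) if s < e])
        else:
            # The whole row is one document: a single prefix index per row.
            prefix_indices.append(int(torch.randint(0, tokens.size(1), (1,))))
    return prefix_indices
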
The larger the model the more realistic the time estimates are. Let's just start the 1.3B model and see how it goes. We can stop it early if it doesn't seem promising.

Ok

@thomasw21 thomasw21 merged commit 68b46f2 into bigscience-workshop:main Sep 16, 2021

@stas00 stas00 (Contributor) commented Sep 16, 2021

The CI failed 4 tests:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/runs/3623547090?check_suite_focus=true

=========================== short test summary info ============================
ERROR tests/test_model.py::MyTestCase::test_gpt - ModuleNotFoundError: No mod...
ERROR tests/test_model.py::MyTestCase::test_gpt_rotary_embeddings - ModuleNot...
ERROR tests/test_model.py::MyTestCase::test_prefix_lm_reset_attention_mask - ...
ERROR tests/test_model.py::MyTestCase::test_prefix_lm_wo_reset_attention_mask

Why was it merged with the errors?

@stas00 stas00 (Contributor) commented Sep 16, 2021

I think GitHub Actions is showing the wrong link to the CI reports: it should show the origin repo, but instead it links to a user workflow that shouldn't even run. I had to find the right run manually via Actions; the report is in the link above.

@stas00 stas00 (Contributor) commented Sep 16, 2021

I'll disable these tests for now, so that the test suite can be used for testing other PRs.

One note though: it fails with ModuleNotFoundError: No module named 'mpi4py', but this means the test wasn't set up properly. We do not want mpi4py in the dependencies, as it'd sweep the problems under the carpet.

Please use test_training.py as a model to copy from for how to set up multi-GPU testing.

@thomasw21 thomasw21 (Member Author)

Yeah, I saw; I'll fix this. I have to use an alternative method to test_training, as I want to freely manipulate the model, i.e. in order to check some invariants.

Thank you for noticing!

@stas00 stas00 (Contributor) commented Sep 16, 2021

ok, so I won't add the skipping then, if you're taking care of it. Thank you, Thomas!

@stas00 stas00 (Contributor) commented Sep 16, 2021

And I know what you mean, though. It's much harder to access internal data using the launcher approach.

Perhaps use a single GPU then? This is how we do it in transformers for all basic tests, and then no launcher/mpi is needed.

That is, if a single GPU is sufficient for the testing you have in mind.

@thomasw21 thomasw21 (Member Author)

Hmm, yes it is, but I'm trying to reproduce the error right now (I might have an outdated version of DeepSpeed), with little success.

@stas00 stas00 (Contributor) commented Sep 16, 2021

pip uninstall mpi4py -y?

@thomasw21 thomasw21 (Member Author)

[screenshot]

I don't have mpi4py. I'm running the tests on JZ, and the only ones failing for me are the activation ones.

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED tests/test_activations.py::TestActivations::test_geglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_liglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_reglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_swiglu - AttributeErr...
============= 4 failed, 11 passed, 14 warnings in 70.07s (0:01:10) =============
AttributeError: module 'torch.testing' has no attribute 'assert_close'

@stas00 stas00 (Contributor) commented Sep 16, 2021

Perhaps you're using a single-GPU setup, and thus DeepSpeed isn't trying to resolve the multi-GPU env automatically.

AttributeError: module 'torch.testing' has no attribute 'assert_close'

Yikes. Back-compat issue, as assert_close was only added in recent PyTorch versions. I will move it into our testing.py and make sure it works with any pt version.
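
The kind of shim described above might look like this (editorial sketch; the actual fix landed in #106):

import torch

# torch.testing.assert_close only exists in newer PyTorch versions;
# fall back to torch.allclose on older ones.
try:
    from torch.testing import assert_close
except ImportError:
    def assert_close(actual, expected, rtol=1.3e-6, atol=1e-5):
        assert torch.allclose(actual, expected, rtol=rtol, atol=atol), \
            f"Tensors differ beyond rtol={rtol}, atol={atol}"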

@stas00 stas00 (Contributor) commented Sep 16, 2021

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED tests/test_activations.py::TestActivations::test_geglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_liglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_reglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_swiglu - AttributeErr...
============= 4 failed, 11 passed, 14 warnings in 70.07s (0:01:10) =============
AttributeError: module 'torch.testing' has no attribute 'assert_close'

Fixed in #106

ofirpress pushed a commit to ofirpress/Megatron-DeepSpeed that referenced this pull request Sep 23, 2021
* ICT zeroshot evaluation code

* made more generic, aligned with other tasks

* Fixed based on review recommendation

* fixed another issue

* implementing DPR

* implementation dpr

* adding dpr code

* removed comments

* removed comments

* removed comments

* DPR evaluation debugging

* DPR ongoing

* DPR finetune and evaluation

* fixing model evaluation of retriever

* added pre and post process

* added pre and post process

* evaluation works!

* debugging DPR

* fix copy-n-paste error 

remove erroneous arg.

* Typo fix in readme

* t5 fixes

* before cleaning the comments

* vit pipeline fixes

* cleaning the code

* additional cleaning

* renaming the folders

* Add temporary assert to finetuning until it can be fixed.

* Fixed issues with ICT pretraining

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* added exit interval for finetuning

* updating the scripts

* updating no load rng

* updating script

* Update T5 scripts

* resolved hang issue

* fixed the tensor size mismatch issue

* fixed the evaluation hangs

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Clean up README.md a bit

* addressed comments

* updated readme

* updated readme

* updated readme

* updated readme

* Basic handling of prefix lm by updating the mask

* Add prefix option to gpt temporarily and prevent it to use custom kernel

* Add argument for prefix lm, in order to configure masking strategy

* Woops

* loss_on_targets_only flag, assert that current prefix implementation only works with reset_attention_mask set to True and attempt to fix empty slice issue

* Format

* Reverse renaming

* Allow prefix on partial document at the end

* WIP: add prefix per row feature

* Document the use of None

* Woops

* Handle empty document better

* We might not be able to concat empty tensors

* Handle empty tensor separately

* Debug

* Test

* Add loss masking as script argument

* Turns out deepspeed integration of attention matrices prevented dynamic masks

* Add more asserts

* Prefix can only see the prefix, it cannot see target

* Remove prefix-lm argument as we split the pretrain script

* Iz PR review

* Make masking row dependent when using prefix

* Revert "Merge remote-tracking branch 'origin/master' into prefix_lm"

This reverts commit d49d6e5, reversing
changes made to 28a712d.

* Tests (bigscience-workshop#1)

* WIP: test

* Still trying to figure out deepspeed

* WIP

* Test test

* Test how to setup deepspeed in unit tests

* Test something else

* Empty strings might be problematic

* Remove unnecessary arguments

* Woops

* Remove global variables at the end of each test and init deepspeed

* Woops

* Maybe adding classmethod

* Woops

* Add debug print to check that tear down happens

* Reset global variables before

* Let's test this

* Try something else

* WIP

* More fix

* More fix

* More stuff to fix

* We really want to compare vectors and not coordinates

* Reformat

* check something out

* fix test

* Remove prefix-lm flag as it's integrated

* Woops

* Add test for without reset attention mask

* Fix test for non reset attention mask

* Fix test

* Update code for prefix lm

Co-authored-by: Mostofa Patwary <mostofa.patwary@gmail.com>
Co-authored-by: Mostofa Patwary <mpatwary@nvidia.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Devrim <46989091+devrimcavusoglu@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Vijay Korthikanti <vkorthikanti@nvidia.com>
Co-authored-by: Jared Casper <jcasper@nvidia.com>
Co-authored-by: Mohammad Shoeybi <mshoeybi@nvidia.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
SaulLu added a commit to SaulLu/Megatron-DeepSpeed that referenced this pull request Sep 24, 2021
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Oct 27, 2022

Labels

arch&scale (Architecture and Scaling Modeling Group), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Implement prefix-lm as in the T5 paper