
Conversation

@thomasw21 thomasw21 (Member) commented Aug 5, 2021

Support for prefix-lm

We provide basic support for prefix-lm:

  • Randomly select a split point for each document in the unsupervised setting
  • Support per-document prefixes (see the mask sketch below)
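
For intuition, a minimal sketch of the per-row prefix-LM mask is shown below (editorial, not the PR's code; the names and the True-means-may-attend convention are illustrative):

import torch

def prefix_lm_mask(seq_length: int, prefix_index: int) -> torch.Tensor:
    # Start from a standard causal (lower-triangular) mask:
    # position i may attend to positions <= i.
    mask = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))
    # Inside the prefix, attention is bidirectional: every prefix
    # position may attend to every other prefix position.
    mask[:prefix_index, :prefix_index] = True
    return mask  # True = may attend

# Example: 6 tokens, the first 3 form the bidirectional prefix.
print(prefix_lm_mask(6, 3).int())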

Notable choices:

TODO:

  • Support Megatron-style prefixes, i.e. consider the entire sequence as the document instead of splitting sequences into documents. Pending discussion.
  • Allow prefixes in scripts other than pretrain_gpt.py, i.e. evaluation scripts and the like. (Probably in another PR.) We need a script to evaluate using prefix-lm.
  • Create a parsing mechanism to obtain the prefix split on labeled data. Typically, we should feed the prompt as the prefix and generate the target. (Probably in another PR.)
  • Probably support one last split after the last eod if possible.
  • Write tests. TBD: how do we want to set up tests in this repo, unit tests or end-to-end? You can find some tests in thomasw21/Megatron-DeepSpeed#1, which will be merged after this branch is merged.

# When the attention mask is not reset on eod, each row is a single
# document, so each row carries exactly one prefix index.
if prefix_indices is not None and (reset_attention_mask is False):
    assert isinstance(prefix_indices[b], int), \
        f"prefix for a row has to be row specific, and consequently return an int, got {prefix_indices[b]}"
    # Make the prefix block bidirectional for row b.
    attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1

Member

just to make sure I understand, do you have one prefix index per batch or one prefix index per instance?

Member Author (@thomasw21)

So there are two cases:

  • if reset_attention_mask is True: I make use of eod, so I might end up with multiple prefixes in a row
  • if reset_attention_mask is False: then I treat the row as a document, so you end up with as many prefixes as you have rows in your batch.

My code might be confusing because of the loop "for batch_id in micro_batch_size", which should probably read "for row in micro_batch_size" instead.
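
Editorial illustration of the two cases (the container shapes are an assumption based on this comment, not necessarily the PR's exact data structures):

# reset_attention_mask is False: one prefix index per row of the batch.
prefix_indices = [5, 2, 7, 3]                     # len == micro_batch_size

# reset_attention_mask is True: eod splits each row into documents, so a
# row can carry several prefix indices, one per document.
prefix_indices = [[5, 12], [2], [7, 9, 15], [3, 11]]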

@ibeltagy ibeltagy (Member) Sep 15, 2021

In the case where reset_attention_mask is False, the way I am reading it, there is one attention mask per batch (not per row); check here:

att_mask_batch = 1

you can see that the mask shape is [1, 1, seqlen, seqlen]. So this line:

attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1

shouldn't work, because b can exceed 0 while the mask's batch dimension has size 1.
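
A minimal editorial repro of the shape problem (values are illustrative):

import torch

seq_length = 8
# att_mask_batch == 1: a single mask shared across the whole batch.
attention_mask = torch.tril(torch.ones(1, 1, seq_length, seq_length))

b = 2  # any row index > 0...
attention_mask[b, 0, :5, :5] = 1  # ...raises IndexError: dim 0 has size 1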

@thomasw21 thomasw21 (Member Author) Sep 16, 2021

Ah, nice catch! So I'm pretty sure we want to remove the logic here:

if reset_attention_mask:
    att_mask_batch = micro_batch_size
else:
    att_mask_batch = 1

Essentially I'd say this is an optimisation: the masking is batch-size independent in GPT, but would be batch-size dependent in prefix-lm. We could also stay batch-size independent, I guess, by sampling a single prefix index for the whole batch. WDYT? 0cdb0a941ddeefbdb1ccab3598f1e34bb38c35a3
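
Sketched out, removing that optimisation amounts to something like this (editorial; the actual change is in the commit referenced above):

import torch

micro_batch_size, seq_length = 4, 8   # illustrative values
prefix_indices = [5, 2, 7, 3]         # one prefix index per row

# Allocate one mask per row instead of sharing a single
# [1, 1, seq, seq] mask, so each row can get its own prefix block.
att_mask_batch = micro_batch_size
attention_mask = torch.tril(torch.ones(
    att_mask_batch, 1, seq_length, seq_length))

for b in range(micro_batch_size):
    attention_mask[b, 0, :prefix_indices[b], :prefix_indices[b]] = 1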

Member

I am fine with a single prefix index per batch.

@thomasw21 thomasw21 (Member Author) Sep 16, 2021

Hmm, this made the prefix row-dependent: thomasw21@0cdb0a9. I can revert to handling a single index for the whole batch, as you suggested.

Do you have insights as to why one would work better than the other?

Member

Being row-dependent is better, assuming it doesn't increase time and memory that much.

@thomasw21 thomasw21 (Member Author) Sep 16, 2021

Okay, let's keep this as is and monitor whether we're much slower compared to GPT. Bear in mind:

  • prefix-lm is not using a custom CUDA kernel anymore, so some loss in time/memory is expected

Maybe we can test on a 350m version and then switch if we think a single index might be faster?

* WIP: test

* Still trying to figure out deepspeed

* WIP

* Test test

* Test how to setup deepspeed in unit tests

* Test something else

* Empty strings might be problematic

* Remove unnecessary arguments

* Woops

* Remove global variables at the end of each test and init deepspeed

* Woops

* Maybe adding classmethod

* Woops

* Add debug print to check that tear down happens

* Reset global variables before

* Let's test this

* Try something else

* WIP

* More fix

* More fix

* More stuff to fix

* We really want to compare vectors and not coordinates

* Reformat

* check something out

* fix test

* Remove prefix-lm flag as it's integrated

* Woops

* Add test for without reset attention mask

* Fix test for non reset attention mask

* Fix test

@thomasw21 thomasw21 (Member Author) commented Sep 16, 2021

Btw I've merged the set of tests here thomasw21@295e8d0

There are some unrelated tests (rotary and gpt). Feel free to disregard them; they are tests, which means if they pass, good, and if they don't, we should look into it.

Also, @ibeltagy, I've kept the pretrain_prefix script separate; are you fine with that, or would you still want to refactor everything? There are a number of things that differ between the scripts, which motivated me to split the files.

I'm awaiting final review 😃

@ibeltagy ibeltagy (Member)

kept the pretrain_prefix script separate

ok. Can you confirm that it still matches the pretrain_gpt script before merging?

Maybe we can test on a 350m version and then switch if we think a single index might be faster?

The larger the model the more realistic the time estimates are. Let's just start the 1.3B model and see how it goes. We can stop it early if it doesn't seem promising.

@ibeltagy ibeltagy self-requested a review September 16, 2021 15:22

@thomasw21 thomasw21 (Member Author) commented Sep 16, 2021

ok. Can you confirm that it still matches the pretrain_gpt script before merging?

Well, it doesn't match exactly, because there is some prefix-lm-specific code (generating the prefix, for example) and some GPT-specific code was removed.

Running diff pretrain_gpt.py pretrain_prefix_lm.py returns the following:

36d35
< 
53c52,53
<                 parallel_output=True
---
>                 parallel_output=True,
>                 prefix_lm=True
59,75d58
<             # Precompute the attention mask and store it in args. This avoids having to
<             # pipeline it as an activation during training. The mask is constant, and thus
<             # we can reuse it.
<             attention_mask = torch.tril(torch.ones(
<                 (1, args.seq_length, args.seq_length), device=torch.cuda.current_device())).view(
<                     1, 1, args.seq_length, args.seq_length)
< 
<             # Convert attention mask to binary:
<             attention_mask = (attention_mask < 0.5)
<             if args.fp16:
<                 attention_mask = attention_mask.half()
<             elif args.bf16:
<                 attention_mask = attention_mask.bfloat16()
< 
<             # must be bool or the training crashes expecting bool, but getting Half
<             args.attn_mask = attention_mask.to(torch.bool)
< 
81c64,65
<                 post_process=post_process
---
>                 post_process=post_process,
>                 prefix_lm=True
107a92,99
>     # Prefix
>     prefix_indices = get_prefix_indices(
>         tokens,
>         tokenizer.eod,
>         partial_prefix_indices=None,
>         reset_attention_mask=args.reset_attention_mask
>     )
> 
115c107
<         prefix_indices=None,
---
>         prefix_indices=prefix_indices,
121d112
< 
138a130,137
>     # Prefix
>     prefix_indices = get_prefix_indices(
>         tokens,
>         tokenizer.eod,
>         partial_prefix_indices=None,
>         reset_attention_mask=args.reset_attention_mask
>     )
> 
146c145
<         prefix_indices=None,
---
>         prefix_indices=prefix_indices,
150,151c149
<     return (tokens, position_ids, attention_mask), (labels, loss_mask)
< 
---
>     return (tokens, position_ids, attention_mask), (labels, loss_mask), prefix_indices
199d196
< 
204d200
< 

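For readers following the diff, here is a rough editorial sketch of the kind of sampling get_prefix_indices performs, pieced together from the discussion above (the repo's actual implementation may differ):

import torch

def get_prefix_indices(tokens, eod_token, partial_prefix_indices, reset_attention_mask):
    # Sketch only; partial_prefix_indices=None is read as "sample every
    # split point randomly". tokens is [micro_batch_size, seq_length].
    prefix_indices = []
    for b in range(tokens.size(0)):
        if reset_attention_mask:
            # eod splits the row into documents; sample one prefix index
            # inside each (non-empty) document.
            starts = [0] + (torch.nonzero(tokens[b] == eod_token).flatten() + 1).tolist()
            ends = starts[1:] + [tokens.size(1)]
            prefix_indices.append([int(torch.randint(s, e, (1,)))
                                   for s, e in zip(starts, ends) if s < e])
        else:
            # The whole row is one document: a single prefix index per row.
            prefix_indices.append(int(torch.randint(0, tokens.size(1), (1,))))
    return prefix_indices
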
The larger the model the more realistic the time estimates are. Let's just start the 1.3B model and see how it goes. We can stop it early if it doesn't seem promising.

Ok

@thomasw21 thomasw21 merged commit 68b46f2 into bigscience-workshop:main Sep 16, 2021

@stas00 stas00 (Contributor) commented Sep 16, 2021

The CI failed 4 tests:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/runs/3623547090?check_suite_focus=true

=========================== short test summary info ============================
ERROR tests/test_model.py::MyTestCase::test_gpt - ModuleNotFoundError: No mod...
ERROR tests/test_model.py::MyTestCase::test_gpt_rotary_embeddings - ModuleNot...
ERROR tests/test_model.py::MyTestCase::test_prefix_lm_reset_attention_mask - ...
ERROR tests/test_model.py::MyTestCase::test_prefix_lm_wo_reset_attention_mask

Why was it merged with the errors?

@stas00 stas00 (Contributor) commented Sep 16, 2021

I think GitHub Actions is showing the wrong link to the CI reports: it should show the origin repo, but instead it links to a user workflow that shouldn't even run. I had to find the right run manually via Actions; the report is in the link above.

@stas00 stas00 (Contributor) commented Sep 16, 2021

I'll disable these tests for now, so that the test suite can be used for testing other PRs.

One note though: it fails with ModuleNotFoundError: No module named 'mpi4py', but this means the test wasn't set up properly. We do not want mpi4py in the dependencies, as it'd sweep the problems under the carpet.

Please use test_training.py as a model to copy from for how to set up multi-GPU testing.

@thomasw21 thomasw21 (Member Author)

Yeah, I saw; I'll fix this. I have to use an alternative method to test_training, as I want to freely manipulate the model, i.e. in order to check some invariants.

Thank you for noticing!

@stas00 stas00 (Contributor) commented Sep 16, 2021

ok, so I won't add the skipping then, if you're taking care of it. Thank you, Thomas!

@stas00 stas00 (Contributor) commented Sep 16, 2021

And I know what you mean, though. It's much harder to access internal data using the launcher approach.

Perhaps use a single GPU then? This is how we do it in transformers for all basic tests, and then no launcher/mpi is needed.

That is, if a single GPU is sufficient for the testing you have in mind.

@thomasw21 thomasw21 (Member Author)

Hmm, yes it is, but I'm trying to reproduce the error right now (I might have an outdated version of DeepSpeed), with little success.

@stas00 stas00 (Contributor) commented Sep 16, 2021

pip uninstall mpi4py -y?

@thomasw21 thomasw21 (Member Author)

[screenshot]

I don't have mpi4py. I'm running the tests on JZ, and the only ones failing for me are the activation ones.

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED tests/test_activations.py::TestActivations::test_geglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_liglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_reglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_swiglu - AttributeErr...
============= 4 failed, 11 passed, 14 warnings in 70.07s (0:01:10) =============
AttributeError: module 'torch.testing' has no attribute 'assert_close'

@stas00 stas00 (Contributor) commented Sep 16, 2021

Perhaps you're using a single-GPU setup, and thus DeepSpeed isn't trying to resolve the multi-GPU env automatically.

AttributeError: module 'torch.testing' has no attribute 'assert_close'

Yikes. Back-compat issue, as assert_close was only added in recent PyTorch versions. I will move it into our testing.py and make sure it works with any pt version.
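
The kind of shim described above might look like this (editorial sketch; the actual fix landed in #106):

import torch

# torch.testing.assert_close only exists in newer PyTorch versions;
# fall back to torch.allclose on older ones.
try:
    from torch.testing import assert_close
except ImportError:
    def assert_close(actual, expected, rtol=1.3e-6, atol=1e-5):
        assert torch.allclose(actual, expected, rtol=rtol, atol=atol), \
            f"Tensors differ beyond rtol={rtol}, atol={atol}"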

@stas00 stas00 (Contributor) commented Sep 16, 2021

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED tests/test_activations.py::TestActivations::test_geglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_liglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_reglu - AttributeErro...
FAILED tests/test_activations.py::TestActivations::test_swiglu - AttributeErr...
============= 4 failed, 11 passed, 14 warnings in 70.07s (0:01:10) =============
AttributeError: module 'torch.testing' has no attribute 'assert_close'

Fixed in #106

ofirpress pushed a commit to ofirpress/Megatron-DeepSpeed that referenced this pull request Sep 23, 2021
* ICT zeroshot evaluation code

* made more generic, aligned with other tasks

* Fixed based on review recommendation

* fixed another issue

* implementing DPR

* implementation dpr

* adding dpr code

* removed comments

* removed comments

* removed comments

* DPR evaluation debugging

* DPR ongoing

* DPR finetune and evaluation

* fixing model evaluation of retriever

* added pre and post process

* added pre and post process

* evaluation works!

* debugging DPR

* fix copy-n-paste error 

remove erroneous arg.

* Typo fix in readme

* t5 fixes

* before cleaning the comments

* vit pipeline fixes

* cleaning the code

* additional cleaning

* renaming the folders

* Add temporary assert to finetuning until it can be fixed.

* Fixed issues with ICT pretraining

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* updated the evaluation script for retriever

* added exit interval for finetuning

* updating the scripts

* updating no load rng

* updating script

* Update T5 scripts

* resolved hang issue

* fixed the tensor size mismatch issue

* fixed the evaluation hangs

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Adding readme

* Clean up README.md a bit

* addressed comments

* updated readme

* updated readme

* updated readme

* updated readme

* Basic handling of prefix lm by updating the mask

* Add prefix option to gpt temporarily and prevent it to use custom kernel

* Add argument for prefix lm, in order to configure masking strategy

* Woops

* loss_on_targets_only flag, assert that current prefix implementation only works with reset_attention_mask set to True and attempt to fix empty slice issue

* Format

* Reverse renaming

* Allow prefix on partial document at the end

* WIP: add prefix per row feature

* Document the use of None

* Woops

* Handle empty document better

* We might not be able to concat empty tensors

* Handle empty tensor separately

* Debug

* Test

* Add loss masking as script argument

* Turns out deepspeed integration of attention matrices prevented dynamic masks

* Add more asserts

* Prefix can only see the prefix, it cannot see target

* Remove prefix-lm argument as we split the pretrain script

* Iz PR review

* Make masking row dependent when using prefix

* Revert "Merge remote-tracking branch 'origin/master' into prefix_lm"

This reverts commit d49d6e5, reversing
changes made to 28a712d.

* Tests (bigscience-workshop#1)

* WIP: test

* Still trying to figure out deepspeed

* WIP

* Test test

* Test how to setup deepspeed in unit tests

* Test something else

* Empty strings might be problematic

* Remove unnecessary arguments

* Woops

* Remove global variables at the end of each test and init deepspeed

* Woops

* Maybe adding classmethod

* Woops

* Add debug print to check that tear down happens

* Reset global variables before

* Let's test this

* Try something else

* WIP

* More fix

* More fix

* More stuff to fix

* We really want to compare vectors and not coordinates

* Reformat

* check something out

* fix test

* Remove prefix-lm flag as it's integrated

* Woops

* Add test for without reset attention mask

* Fix test for non reset attention mask

* Fix test

* Update code for prefix lm

Co-authored-by: Mostofa Patwary <mostofa.patwary@gmail.com>
Co-authored-by: Mostofa Patwary <mpatwary@nvidia.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Devrim <46989091+devrimcavusoglu@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Vijay Korthikanti <vkorthikanti@nvidia.com>
Co-authored-by: Jared Casper <jcasper@nvidia.com>
Co-authored-by: Mohammad Shoeybi <mshoeybi@nvidia.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
SaulLu added a commit to SaulLu/Megatron-DeepSpeed that referenced this pull request Sep 24, 2021
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Oct 27, 2022

Labels

arch&scale (Architecture and Scaling Modeling Group), enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Implement prefix-lm as in the T5 paper