Introduce mBart #29
base: main
Conversation
Overall looks good to me. Just left a few nit comments. Just wondering though: do you have any asset cards that we can bundle with this PR? How did you verify parity with the original fairseq implementation?
num_encoder_attn_heads=16,
num_decoder_attn_heads=16,
ffn_inner_dim=4096,
pos_encoder_type="learned",
Looks like `pos_encoder_type` and `norm_order` are always `learned` and `POST` according to this. If that is the case, I would suggest removing these configuration parameters.
I'm having to do this to successfully load the mBart checkpoint with UnitY: https://github.com/fairinternal/seamless_communication/pull/28/files#diff-189811785a49637a011c2db015430cfd708d92f832f8ef30ed7e10dc7f922635R103
The argument about `norm_order` makes sense, I'll remove that.
def build_frontend(self, embed: Embedding) -> TransformerFrontend:
    """Build a Transformer encoder/decoder front-end."""
    if self.config.pos_encoder_type == "sinusoidal":
As mentioned above, I don't think that this is necessary. mBART always uses learned positional embeddings.
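For illustration, a rough sketch of the simplified front-end builder with the branch removed; the import paths, class names, and config fields (`LearnedPositionEncoder`, `TransformerEmbeddingFrontend`, `model_dim`, `max_seq_len`, `dropout_p`) are assumptions based on fairseq2's public API, not necessarily what this PR uses:

```python
# Sketch only: always build learned positional embeddings instead of
# branching on `pos_encoder_type`. Names and signatures are assumptions.
from fairseq2.models.transformer import (
    TransformerEmbeddingFrontend,
    TransformerFrontend,
)
from fairseq2.nn import Embedding, LearnedPositionEncoder


def build_frontend(self, embed: Embedding) -> TransformerFrontend:
    """Build a Transformer encoder/decoder front-end."""
    # mBART always uses learned positional embeddings, so no branch is needed.
    pos_encoder = LearnedPositionEncoder(
        self.config.model_dim, self.config.max_seq_len
    )

    return TransformerEmbeddingFrontend(
        embed,
        pos_encoder,
        dropout_p=self.config.dropout_p,
    )
```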
@cbalioglu I have yet to verify parity with the fairseq mBart model by running forward passes. The asset has an internal checkpoint; I'm wondering what the best way to open-source that would be.
You can use one of mBART's public checkpoints here (e.g. mbart.CC25) to verify parity and include it as an asset card in your PR.
Force-pushed from 14a5b9b to c52ce3a
What does this PR do? Please describe:
Implements the mBart model and its text tokenizer. We are able to successfully load the base model.
Testing the text tokenizer:
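A minimal sketch of the round-trip check (the loader name `load_mbart_tokenizer`, its module path, the asset name, the `lang` argument, and the sample sentence are assumptions, not the exact script that was run):

```python
from fairseq2.models.mbart import load_mbart_tokenizer  # module path assumed

tokenizer = load_mbart_tokenizer("mbart_base")  # asset name assumed

encoder = tokenizer.create_encoder(lang="en_XX")
decoder = tokenizer.create_decoder()

sample_str = "UN Chief Says There Is No Military Solution in Syria"

encoded_tokens = encoder(sample_str)                 # compare against sample_tokens
decoded_str = decoder(encoded_tokens)                # ids back to text
round_trip_str = decoder(encoder(str(decoded_str)))  # one more encode/decode pass

assert str(decoded_str) == str(round_trip_str)
```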
We see that `encoded_tokens` matches `sample_tokens` and `decoded_str` matches `round_trip_str`.
TODO: Check parity for a forward pass through the same checkpoint with fairseq1.
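A rough sketch of how that parity check could look; `fairseq2_forward` and `fairseq1_forward` are hypothetical helpers standing in for running the same token ids through the model from this PR and through the original fairseq checkpoint:

```python
import torch

# Hypothetical helpers (not real APIs): each returns logits for the same
# source tokens from its respective implementation.
src_tokens = encoded_tokens.unsqueeze(0)  # batch of one, from the snippet above

with torch.inference_mode():
    fs2_logits = fairseq2_forward(src_tokens)
    fs1_logits = fairseq1_forward(src_tokens)

# Expect agreement up to numerical noise; the tolerance is a judgment call.
torch.testing.assert_close(fs2_logits, fs1_logits, atol=1e-4, rtol=1e-4)
```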
Fixes #{issue number}
Does your PR introduce any breaking changes? If yes, please list them:
List of all backwards-incompatible changes.
Check list: