T5 Encoder #2069

calvinpelletier · 2024-11-25T20:21:09Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

T5 tokenizer
T5 encoder
convert weights from HF's T5 to ours
unit tests

Analysis

Comparison to HF's implementation (batch of text -> encoder output):

6.7e-5 MSE output difference
ours is ~5% faster

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

I did not change any public API
I have added an example to docs or docstrings

Minimal test code

import torch

from torchtune.models.t5 import t5_tokenizer, t5_v1p1_xxl_encoder
from torchtune.training.checkpointing._checkpointer import FullModelHFCheckpointer

MAX_SEQ_LEN = 512

# tune download google/t5-v1_1-xxl --output-dir /tmp/t5-hf
tokenizer = t5_tokenizer("/tmp/t5-hf/spiece.model", max_seq_len=MAX_SEQ_LEN)
checkpointer = FullModelHFCheckpointer(
    "/tmp/t5-hf",
    ["pytorch_model.bin"],
    "T5_ENCODER",
    "/tmp/t5-tt",
)

model = t5_v1p1_xxl_encoder(max_seq_len=MAX_SEQ_LEN)
model.load_state_dict(checkpointer.load_checkpoint()["model"])
model = model.to(device="cuda", dtype=torch.bfloat16).eval().requires_grad_(False)


def tokenize(texts):
    result = torch.full(
        (len(texts), tokenizer.max_seq_len),
        tokenizer.pad_id,
        dtype=torch.int,
    )
    for i, text in enumerate(texts):
        tokens = tokenizer.encode(text)
        result[i, : len(tokens)] = torch.tensor(tokens)
    return result


tokens = tokenize(
    [
        "a cow jumping over the moon",
        "a helpful AI assistant",
    ]
)
encoding = model(tokens.to("cuda"))

pytorch-bot · 2024-11-25T20:21:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2069

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 17caaf7 with merge base 32e265d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

calvinpelletier · 2024-11-25T20:30:59Z

torchtune/models/t5/_encoder.py

+        # attention with relative position bias
+        attn_score = torch.matmul(q, k.transpose(-2, -1))
+        attn_score += rel_pos_bias
+        attn_weight = F.softmax(attn_score.float(), dim=-1).to(attn_score.dtype)
+        attn_out = torch.matmul(attn_weight, v)


This part could be simplified by using F.scaled_dot_product_attention and repurposing its mask argument for rel_pos_bias (scaled_dot_product_attention simply adds the mask to the attention scores when the mask is a float tensor). However, when I tried this it was significantly slower for some reason.

ebsmothers: I wonder whether this is a case where we could benefit from using flex attention?

calvinpelletier: Flex attention requires boolean masks AFAICT: https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/flex_attention.py#L826-L827

calvinpelletier · 2024-11-25T20:32:04Z

torchtune/models/t5/_encoder.py

+        return x.permute([2, 0, 1]).unsqueeze(0)
+
+
+def _calc_birectional_rel_pos_to_bucket(


Should I add more comments in this function explaining each operation, or is it fine to leave it a bit opaque?
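For readers unfamiliar with what such a function computes, here is a scalar sketch of the bidirectional bucketing scheme as it is commonly described for T5. It is an illustration only, not the PR's _calc_birectional_rel_pos_to_bucket: the name rel_pos_to_bucket, the defaults (32 buckets, max distance 128), and the sign convention (key position minus query position) are assumptions.

```python
import math


def rel_pos_to_bucket(rel_pos: int, num_buckets: int = 32, max_dist: int = 128) -> int:
    """Map a signed relative position to a bucket index (hypothetical helper)."""
    # Half of the buckets are for keys before the query, half for keys after.
    half = num_buckets // 2
    bucket = half if rel_pos > 0 else 0
    dist = abs(rel_pos)

    # Small distances each get their own bucket.
    max_exact = half // 2
    if dist < max_exact:
        return bucket + dist

    # Larger distances share buckets whose width grows logarithmically,
    # and everything at or beyond max_dist lands in the last bucket.
    log_ratio = math.log(dist / max_exact) / math.log(max_dist / max_exact)
    log_bucket = max_exact + int(log_ratio * (half - max_exact))
    return bucket + min(log_bucket, half - 1)


for rel_pos in (-200, -3, 0, 3, 8, 200):
    print(rel_pos, "->", rel_pos_to_bucket(rel_pos))
```

Each bucket index then selects a learned per-head bias, which is what ends up being added to the attention scores.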

calvinpelletier · 2024-11-25T20:33:27Z

torchtune/models/t5/_model_builders.py

+from torchtune.models.t5._tokenizer import T5Tokenizer
+
+
+def t5_v1p1_xxl_encoder(max_seq_len: int = 512) -> T5Encoder:


Thoughts on writing decimal points as p instead of _ in snake case? IMO it's hard to read the _ decimals when _ is also being used as a word separator. Like in t5_v1_1_xxl_encoder, to my eyes it looks like it's version 1 not 1.1. Plus it's ambiguous: if one day we have a "Qwen3 1.5B" and "Qwen3.1 5B", they're both gonna be named qwen3_1_5b.

pbontrager: I think the p could be better. But then I'd want to switch everything to this notation.

calvinpelletier: I'll make a separate PR for this.
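The ambiguity raised in this thread is easy to reproduce. The helper below, builder_name, is hypothetical and exists only to illustrate the two naming conventions; it is not part of torchtune.

```python
def builder_name(model: str, size: str, decimal: str) -> str:
    """Build a snake_case name, writing decimal points as `decimal`."""
    return f"{model}_{size}".lower().replace(".", decimal)


# With "_" for decimals, two different models collapse to the same name.
underscore_a = builder_name("Qwen3", "1.5B", "_")
underscore_b = builder_name("Qwen3.1", "5B", "_")

# With "p" for decimals, the names stay distinct.
p_a = builder_name("Qwen3", "1.5B", "p")
p_b = builder_name("Qwen3.1", "5B", "p")

print(underscore_a, underscore_b)
print(p_a, p_b)
```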

pbontrager

I gave this a high-level pass so far. This looks really clean and good. My only concern is with us having to have a custom T5 layer and attention module. If flex attention would let us use our existing modules, I'd prefer to go down that route.

torchtune/models/t5/__init__.py

pbontrager · 2024-11-27T21:57:04Z

torchtune/models/t5/_encoder.py

+        self.sa_norm = sa_norm
+        self.mlp_norm = mlp_norm
+
+    def forward(self, x: Tensor, rel_pos_bias: Tensor) -> Tensor:


The rel_pos_bias is just a mask no? Couldn't we use all our standard modules here? Is the only reason we have these custom layers because this needs flex attention to be fast?

calvinpelletier: Well, kind of... it's a float tensor that gets added to the attention scores (which is how scaled_dot_product_attention deals with float masks, so yeah, we could think of it as a mask).

I didn't use our modules because:

our MultiHeadAttention/TransformerSelfAttentionLayer modules expect the masks to be boolean tensors (according to the docstring at least)

MultiHeadAttention uses the default attention scaling (1/sqrt(dim)), but T5 doesn't scale it at all

we can't use flex attention with float masks. MultiHeadAttention would use F.scaled_dot_product_attention, which is much slower for float masks than the manual implementation I went with

We could switch to our modules with a couple small changes (update the docstrings to clarify that the mask can also be a float tensor and add an argument for disabling attention scaling), but given how little code is required to just implement separate versions for T5, I thought it was cleaner to leave our attention/transformer modules alone (especially since this implementation is faster).

pbontrager · 2024-11-27T21:58:15Z

torchtune/modules/tokenizers/_sentencepiece.py

@@ -7,6 +7,7 @@
 from typing import List, Optional

 from sentencepiece import SentencePieceProcessor
+


remove? Or was this from the linter?

calvinpelletier: It's from the linter.

pbontrager · 2024-11-27T21:59:37Z

torchtune/models/t5/_tokenizer.py

+        self.max_seq_len = max_seq_len
+        self.truncate = truncate
+
+    def encode(self, text: str) -> List[int]:


This doesn't need decode like the CLIP tokenizer?

calvinpelletier: It has decode (I test it in the unit test). It's inherited from the base tokenizer class: https://github.com/pytorch/torchtune/blob/main/torchtune/modules/tokenizers/_sentencepiece.py#L102
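The pattern described here (a subclass overriding encode while inheriting decode) looks roughly like the toy sketch below. ToyBaseTokenizer and ToyT5Tokenizer are hypothetical stand-ins with a whitespace vocabulary, not torchtune classes, and the truncation behavior shown is an assumption based on the max_seq_len and truncate attributes visible in the diff.

```python
from typing import List


class ToyBaseTokenizer:
    """Hypothetical base class providing both encode and decode."""

    def __init__(self, vocab: List[str]):
        self.vocab = vocab
        self.index = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text: str) -> List[int]:
        return [self.index[word] for word in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.vocab[i] for i in ids)


class ToyT5Tokenizer(ToyBaseTokenizer):
    """Overrides encode only; decode comes from the base class."""

    def __init__(self, vocab: List[str], max_seq_len: int, truncate: bool = True):
        super().__init__(vocab)
        self.max_seq_len = max_seq_len
        self.truncate = truncate

    def encode(self, text: str) -> List[int]:
        tokens = super().encode(text)
        if len(tokens) > self.max_seq_len:
            # Assumed behavior: cut to max_seq_len, or fail if truncation is off.
            if not self.truncate:
                raise ValueError("input exceeds max_seq_len and truncate=False")
            tokens = tokens[: self.max_seq_len]
        return tokens


tok = ToyT5Tokenizer(["a", "cow", "jumping", "over", "the", "moon"], max_seq_len=4)
ids = tok.encode("a cow jumping over the moon")
text = tok.decode(ids)  # decode is inherited, not redefined
print(ids, text)
```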

calvinpelletier added 6 commits November 21, 2024 13:28

convert t5 weights

0f78eaa

t5 model builder

da5ce2d

t5 component builder

903620a

t5 tokenizer and weight converter

df806ae

unit tests

cc49698

Merge remote-tracking branch 'origin/main' into t5

7adaebb

calvinpelletier requested a review from pbontrager November 25, 2024 20:21

facebook-github-bot added the CLA Signed label Nov 25, 2024

calvinpelletier commented Nov 25, 2024

pbontrager reviewed Nov 27, 2024

calvinpelletier added 3 commits December 2, 2024 11:07

Merge remote-tracking branch 'origin/main' into t5

308c8cf

exposing component builder

c608ad8

remove unused line

17caaf7
