feat: use regular tokenizer in static embedding, add max length #3541

stephantul · 2025-10-06T18:52:16Z

Hello!

This PR brings the StaticEmbedding a bit more in line with other modules. I had some trouble integrating it into pylate, because pylate assumes you are using a regular tokenizer. It then occurred to me that there is actually no real reason to not use a transformers-style tokenizer, it actually makes the API a bit nicer even.

So, this PR:

Replaces the tokenizer by a transformers tokenizer. As before, both input arguments are still supported. So it is allowed to pass a tokenizers.Tokenizer and transformers.PreTrainedTokenizer. Both work. The difference is that we turn the former into the latter. This also makes the whole thing a bit more flexible, because now it should also work with slow tokenizers, which it didn't before.
I added the PAD id explicitly. This id is ignored during training. We don't add it anyway, but it's still nice to have for the future.
I added a max_seq_length argument and associated property, which was easy to do because of the new tokenizer. During training, I noticed that long sequences could still slow down the thing, presumably due to the tensors sometimes just being obnoxiously large. I haven't tested whether truncating the sequence length leads to performance differences, however.
I added a word_embedding_dimension property

Also fixed some typing stuff. This still doesn't make it completely compatible with pylate, but almost. The other part has to be done in pylate, unfortunately.

stephantul added 2 commits October 6, 2025 20:47

feat: use regular tokenizer in static embedding, add max length

e4fff0c

ruff

a4f85a1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat: use regular tokenizer in static embedding, add max length #3541

feat: use regular tokenizer in static embedding, add max length #3541

stephantul commented Oct 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

feat: use regular tokenizer in static embedding, add max length #3541

Are you sure you want to change the base?

feat: use regular tokenizer in static embedding, add max length #3541

Conversation

stephantul commented Oct 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant