
Conversation

@stephantul
Contributor

Hello!

This PR brings StaticEmbedding a bit more in line with the other modules. I had some trouble integrating it into pylate, because pylate assumes you are using a regular tokenizer. It then occurred to me that there is no real reason not to use a transformers-style tokenizer; it actually makes the API a bit nicer.

So, this PR:

  • Replaces the tokenizer with a transformers tokenizer. As before, both input types are still supported: you can pass either a tokenizers.Tokenizer or a transformers.PreTrainedTokenizer, and both work. The difference is that the former is now converted into the latter (see the sketch after this list). This also makes the module more flexible, because it should now also work with slow tokenizers, which it didn't before.
  • Adds an explicit PAD id, which is ignored during training. We don't actually add padding at the moment, but it's still nice to have for the future.
  • Adds a max_seq_length argument and an associated property, which was easy to do thanks to the new tokenizer. During training, I noticed that long sequences could still slow things down, presumably because the resulting tensors can get very large. I haven't tested whether truncating the sequence length affects downstream performance, however; a usage sketch appears below.
  • Adds a word_embedding_dimension property.
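
For reference, here is a minimal sketch of the tokenizer conversion described above, assuming the standard tokenizers/transformers APIs; the helper name and the "[PAD]" token are illustrative, not the exact StaticEmbedding internals:

```python
from tokenizers import Tokenizer
from transformers import PreTrainedTokenizerFast


def to_transformers_tokenizer(tokenizer):
    """Accept either a tokenizers.Tokenizer or a transformers tokenizer (illustrative helper)."""
    if isinstance(tokenizer, Tokenizer):
        # Wrap the raw tokenizer in a transformers-style one. The pad token is set
        # explicitly so a PAD id can be exposed and ignored during training;
        # "[PAD]" is an assumption, the real token depends on the tokenizer.
        return PreTrainedTokenizerFast(tokenizer_object=tokenizer, pad_token="[PAD]")
    # Already a transformers-style (fast or slow) tokenizer: use it as-is.
    return tokenizer


tok = to_transformers_tokenizer(Tokenizer.from_pretrained("bert-base-uncased"))
print(tok.pad_token_id)  # explicit PAD id
```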

I also fixed some typing issues. This still doesn't make StaticEmbedding fully compatible with pylate, but it gets close; the remaining work has to happen in pylate, unfortunately.
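
And a hedged usage sketch of the new argument and properties; the constructor shape here (tokenizer plus embedding_dim) is assumed from the existing module, while max_seq_length and word_embedding_dimension are the additions from this PR:

```python
from sentence_transformers.models import StaticEmbedding

# Assumed constructor shape; max_seq_length is the new keyword argument.
module = StaticEmbedding(tokenizer=tok, embedding_dim=256, max_seq_length=512)

features = module.tokenize(["a very long document ..."])  # ids truncated to 512 tokens
print(module.max_seq_length)            # 512, exposed via the associated property
print(module.word_embedding_dimension)  # 256, the new property
```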
