Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding Huggingface compatible tokenizers #48

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

justin-barton
Copy link
Contributor

Adding AptTokenizer() and EsmTokenizer() based on the Rust PreTrainedTokenizerFast class and original EsmTokenizer class, respectively.

Some benefits of this change are:

  • multiprocessing for distributed tokenization
  • batch_decode methods
  • attention_masks and special_tokens_masks
  • simplifies dependencies (doesn't require rust_trie)
  • enables easy distribution on Huggingface Hub later

As described in #47

@jamaliki
Copy link
Collaborator

Hi @justin-barton ,

Thank you so much for your contribution! We discussed this change and we are going to put this on the back burner for now, since we get some flexibility benefits from having the tokenizer completely defined within the repo in this early stage of training. However, for later steps in the project, such as when we release models to the HF hub, we will probably end up using this PR to have a fully HF-compatible tokenizer.

Best,
Kiarash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants