[from_pretrained] Allow tokenizer_type ≠ model_type #6995
Conversation
sgugger left a comment
Not sure I fully understand the use case, but nothing against the principle of it.
```python
    tokenizer_class_candidate = f"{config.tokenizer_class}Fast"
else:
    tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
```
Might be cleaner to use some of our internal list/dict containing all tokenizers, just in case there are weird things in the namespace of some users.
Yes, I was wondering about that, and whether by using globals() someone could even use a tokenizer that's not in the library. But I don't think so, as globals() is actually local to this scope/file, if I understand correctly.
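A minimal sketch of the suggested alternative, assuming a hypothetical TOKENIZER_REGISTRY dict (not an actual transformers attribute), so that only library tokenizers can ever be resolved:

```python
# Hypothetical sketch: resolve the candidate class name against a curated
# registry instead of globals(), so stray names in a user's namespace can
# never be picked up by mistake. TOKENIZER_REGISTRY is an illustrative name.
from transformers import BertTokenizer, RobertaTokenizer

TOKENIZER_REGISTRY = {
    "BertTokenizer": BertTokenizer,
    "RobertaTokenizer": RobertaTokenizer,
    # ... one entry per tokenizer class the library ships
}

def resolve_tokenizer_class(candidate: str):
    # Look the name up in the curated dict rather than the module namespace.
    tokenizer_class = TOKENIZER_REGISTRY.get(candidate)
    if tokenizer_class is None:
        raise ValueError(
            f"Tokenizer class {candidate} does not exist or is not currently imported."
        )
    return tokenizer_class
```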
LysandreJik left a comment
Cool, this looks good to me! Thanks for working on it.
thomwolf left a comment
Very nice!
```python
tokenizer_class_candidate = config.tokenizer_class
tokenizer_class = globals().get(tokenizer_class_candidate)
if tokenizer_class is None:
    raise ValueError("Tokenizer class {} does not exist or is not currently imported.")
```
The content of the {} is missing? Or is there some magic somewhere in ValueError that fills this in?
Oops, no, it's missing.
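A minimal sketch of the fix, feeding the candidate name into the placeholder so the error actually names the class that was not found:

```python
# Sketch of the fix: pass tokenizer_class_candidate to str.format so the
# error message identifies the missing class.
raise ValueError(
    "Tokenizer class {} does not exist or is not currently imported.".format(
        tokenizer_class_candidate
    )
)
```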
The idea is to prevent a combinatorial explosion of "model types" when only the tokenizer is different (e.g. Flaubert and CamemBERT, if we wanted to support them today). In the future we might even want to have a few model-agnostic tokenizer classes like ByteLevelBPETokenizer (basically RobertaTokenizer), as they can be initialized pretty exhaustively from the init args stored in …
patrickvonplaten left a comment
Great!
n1t0 left a comment
Nice! LGTM
For an example usage of this PR, see the `tokenizer_class` attribute in this config.json: https://s3.amazonaws.com/models.huggingface.co/bert/julien-c/dummy-diff-tokenizer/config.json

Instead of a class, we could have used a `tokenizer_type` belonging to the set of all `model_type`s, like "bert", etc., but that feels more restrictive, especially in case we start having tokenizer classes that are not obviously linked to a "model", like a potential "TweetTokenizer".

Context: #6129
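A hedged usage sketch, assuming the dummy checkpoint linked above is still available on the Hub: loading it should instantiate the tokenizer class named in its config rather than the one implied by `model_type`.

```python
# Usage sketch: AutoTokenizer reads tokenizer_class from config.json and
# instantiates that class, even when it differs from the model_type default.
# The repo id is the dummy checkpoint linked above; its availability is assumed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("julien-c/dummy-diff-tokenizer")
print(type(tokenizer).__name__)  # the class named by tokenizer_class
```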
Update: documented by @sgugger in #8152