[vllm + v5 fix] handle TokenizersBackend fallback properly for v5 #44255
Merged · +404 −205
Commits (48):
- `7db5290` update deepseek v2 for tokenizers v5
- `5ef0061` adding remote code fix
- `7be0c57` fix deepseek name
- `1d22701` handle spm conversion from proto only when overriding bad_models
- `43a07e1` add script to compare xlni and code_search_net output of 2 tokenizers (itazap)
- `ffb5f09` tiktoken models support (itazap)
- `c1a3a0d` fix tests (itazap)
- `4307831` testssss (itazap)
- `1bbe257` fix gemma (itazap)
- `b5d0bad` apply some feedback (itazap)
- `b7547e4` paligemma processor tests fix (itazap)
- `f5fd840` add relevant changes from #44298 (itazap)
- `2b9efdf` json serializable fix (itazap)
- `e141166` add more xlni cases (itazap)
- `ae25381` t5 fix (itazap)
- `77120a2` ruff check code quality (itazap)
- `3b053b0` missed file for t5 test fix (itazap)
- `a5542cc` modular failures (itazap)
- `95bba6c` other modular fixes (itazap)
- `7d46f77` tiktoken.model test (itazap)
- `be29c60` more feedback updates! (itazap)
- `53753c3` fixing models so AutoTokenizer == TokenizersBackend - aligning with c… (itazap)
- `4745745` seamless m4t (itazap)
- `cbda0ca` missed the most important files (itazap)
- `e5c8a2f` Revert "missed the most important files" (itazap)
- `8bf6df0` undo changes to big bird, bert, seamless (itazap)
- `df12cc4` setup and qual (itazap)
- `08b91c6` lasr (itazap)
- `a7c2435` t5 (itazap)
- `d5e9aba` dpr bert (itazap)
- `ceeb319` xlmroberta (itazap)
- `b512fc7` reformer (itazap)
- `0c95842` nllb (itazap)
- `34d83ed` style and shit (itazap)
- `5c8af86` update (ArthurZucker)
- `4d06871` fix (ArthurZucker)
- `2159e92` extract the charsmap (ArthurZucker)
- `db0c5b5` fix mbart? (ArthurZucker)
- `31ff32d` style (ArthurZucker)
- `083ec50` nllb and test tok common read spm precompiled charsmap (ArthurZucker)
- `a4fc098` fix whisper? (itazap)
- `2710fad` Merge branch 'bad_models_update' of github.com:huggingface/transforme… (ArthurZucker)
- `969c0fc` nllb (ArthurZucker)
- `47a772a` checked on v4! (itazap)
- `8039c2b` Merge branch 'bad_models_update' of github.com:huggingface/transforme… (ArthurZucker)
- `e3d3025` fix repo (ArthurZucker)
- `6edc1d3` fix lasr (ArthurZucker)
- `dade5e6` style (ArthurZucker)
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
```diff
@@ -192,6 +192,7 @@ def extract(self, model_type, **kwargs) -> tuple[dict[str, int], list[tuple]]:
             AddedToken(token, normalized=False, special=special)
             for id, token, special in sorted(spm_added_tokens, key=lambda x: x[0])
         ]
+        kwargs["_spm_precompiled_charsmap"] = getattr(self.proto.normalizer_spec, "precompiled_charsmap", None)
         return kwargs
```
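The point of the added line is that not every sentencepiece proto is guaranteed to carry a `precompiled_charsmap`, so the diff reads it defensively with `getattr(..., None)` rather than accessing the attribute directly. A minimal sketch of that pattern, using `SimpleNamespace` objects as stand-ins for the real proto (the names `extract_charsmap`, `with_map`, and `without_map` are illustrative, not from the PR):

```python
from types import SimpleNamespace

def extract_charsmap(proto):
    """Read `precompiled_charsmap` defensively, mirroring the diff above.

    Some protos (or hand-built stand-ins) may not carry the field at all,
    so we default to None instead of raising AttributeError.
    """
    return getattr(proto.normalizer_spec, "precompiled_charsmap", None)

# A stand-in proto whose normalizer_spec carries a charsmap blob.
with_map = SimpleNamespace(
    normalizer_spec=SimpleNamespace(precompiled_charsmap=b"\x00\x01")
)
# A stand-in proto whose normalizer_spec lacks the field entirely.
without_map = SimpleNamespace(normalizer_spec=SimpleNamespace())

print(extract_charsmap(with_map))     # b'\x00\x01'
print(extract_charsmap(without_map))  # None
```

The same call against the real attribute-access `proto.normalizer_spec.precompiled_charsmap` would raise on the second case, which is exactly what the fallback avoids.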
```diff
@@ -635,6 +636,54 @@ class SpmConverter(Converter):
     SpmExtractor = SentencePieceExtractor
     special_tokens = {}

+    @staticmethod
+    def build_tokenizer_from_spm_proto(proto, vocab, merges=None):
+        """
+        Similar to convert_from_spm method, but used only when there is no `model_type` class, i.e. there
+        is no matching class in `TOKENIZERS_MAPPING` and we just create a tokenizer instead of extracting
+        stuff from the sentencepiece file
+        """
+        byte_fallback = proto.trainer_spec.byte_fallback
+        unk_piece = proto.trainer_spec.unk_piece
+        precompiled_charsmap = proto.normalizer_spec.precompiled_charsmap
+
+        # model
+        if isinstance(vocab, dict):
+            tokenizer = Tokenizer(
+                BPE(
+                    vocab=vocab,
+                    merges=merges or [],
+                    unk_token=unk_piece,
+                    fuse_unk=True,
+                    byte_fallback=byte_fallback,
+                    dropout=None,
+                )
+            )
+        elif isinstance(vocab, list) and vocab and isinstance(vocab[0], (tuple, list)):
+            tokenizer = Tokenizer(
+                Unigram(
+                    vocab=vocab,
+                    unk_id=proto.trainer_spec.unk_id,
+                    byte_fallback=byte_fallback,
+                )
+            )
+        else:
+            return None
+
+        # normalizer
+        _normalizers = [normalizers.Replace(" ", "▁")]
+        if precompiled_charsmap:
+            _normalizers.insert(0, normalizers.Precompiled(precompiled_charsmap))
+        tokenizer.normalizer = normalizers.Sequence(_normalizers)
+
+        # decoder
+        if byte_fallback:
+            tokenizer.decoder = decoders.Sequence(
+                [decoders.Replace("▁", " "), decoders.ByteFallback(), decoders.Fuse()]
+            )
+        else:
+            tokenizer.decoder = decoders.Sequence([decoders.Replace("▁", " ")])
+
+        return tokenizer
+
     @classmethod
     def convert_from_spm(cls, vocab=None, **kwargs):
         """
```

> **Collaborator comment** (on the decoder block): Split digits, split on whitespace, etc. could be extracted as well!
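The interesting part of `build_tokenizer_from_spm_proto` is the dispatch on the shape of `vocab`: a `dict` of token → id implies a BPE model, a list of `(piece, score)` pairs implies a Unigram model, and anything else is unsupported. That branching can be sketched without the `tokenizers` library itself; `pick_model_type` below is a hypothetical helper written for illustration, not part of the PR:

```python
def pick_model_type(vocab):
    """Mirror the vocab-shape dispatch in build_tokenizer_from_spm_proto:
    dict of token -> id means BPE, a non-empty list of (piece, score)
    pairs means Unigram, and anything else is unsupported (None)."""
    if isinstance(vocab, dict):
        return "bpe"
    if isinstance(vocab, list) and vocab and isinstance(vocab[0], (tuple, list)):
        return "unigram"
    return None

print(pick_model_type({"hello": 0, "world": 1}))          # bpe
print(pick_model_type([("▁hello", -1.5), ("▁w", -2.0)]))  # unigram
print(pick_model_type([]))                                # None (empty list is ambiguous)
```

Note that an empty list falls through to `None`, matching the `vocab and ...` guard in the diff, so callers get a sentinel instead of an `IndexError` on `vocab[0]`.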
> **Reviewer comment:** that's only for some models, not all of them (e.g. GPT-2 uses `Ġ`)

> **Reply:** ah, my bad, sentencepiece never used `Ġ`! So ignore this comment probably