[vllm + v5 fix] handle TokenizersBackend fallback properly for v5 #44255
ArthurZucker merged 48 commits into main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
ArthurZucker
left a comment
from testing:
- olmo2
- qwen2
and probably many others are affected 😢
I'll make a list of arch and test them but at least let's add these 2!
Also, before we can merge, IMO we need to make sure the auto conversion is foolproof; otherwise we are ignoring code but the conversion is still not correct.
ArthurZucker
left a comment
nice, let's add tests to validate which model ids we fixed on the hub
Suggested change:
- Similar to convert_from_spm method, but used when converting directly from proto and vocab/merges.
- (convert_from_spm requires some class attrs like byte_fallback, unk_piece, precompiled_charsmap, etc.)
+ Similar to convert_from_spm method, but used only when there is no `model_type` class, i.e. there is no matching class in `TOKENIZERS_MAPPING` and we just create a tokenizer instead of extracting stuff from the sentencepiece file

func name can probably be updated as well
ArthurZucker
left a comment
LGTM, let's just have better names for each function and a bit more tests
575d566 to 2e03e15
ArthurZucker
left a comment
Ty for addressing the comments!
    local_kwargs.setdefault("bos_token", proto_spec.bos_piece or "<s>")
    if proto_spec.eos_id >= 0:
        local_kwargs.setdefault("eos_token", proto_spec.eos_piece or "</s>")

should we look for unk id as well?

let's move these to be done inside `extractor.extract(cls.model, **local_kwargs)` if possible (whatever we can do with the proto there?)
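To make the discussion concrete, here is a minimal pure-Python sketch of the `setdefault` pattern above, extended with the unk id the reviewer asks about. The `ProtoSpec` dataclass is a hypothetical stand-in for the fields read off the real sentencepiece `TrainerSpec` proto; the actual converter works on the proto object itself.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the sentencepiece TrainerSpec fields used here;
# field names mirror the proto, but this is only a sketch.
@dataclass
class ProtoSpec:
    bos_id: int = 1
    bos_piece: str = ""
    eos_id: int = 2
    eos_piece: str = "</s>"
    unk_id: int = 0
    unk_piece: str = "<unk>"

def special_token_kwargs(spec, **local_kwargs):
    """Fill bos/eos/unk defaults from the proto, keeping caller overrides."""
    if spec.bos_id >= 0:
        local_kwargs.setdefault("bos_token", spec.bos_piece or "<s>")
    if spec.eos_id >= 0:
        local_kwargs.setdefault("eos_token", spec.eos_piece or "</s>")
    # Per the review comment: also honour the unk id when the proto defines one.
    if spec.unk_id >= 0:
        local_kwargs.setdefault("unk_token", spec.unk_piece or "<unk>")
    return local_kwargs

# Caller-supplied kwargs win; missing tokens fall back to the proto pieces.
print(special_token_kwargs(ProtoSpec(), bos_token="<bos>"))
```

An id of `-1` in sentencepiece means the token is disabled, which is why every branch guards on `>= 0`.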
run-slow: auto, gemma, lasr, llama, siglip2, t5

This comment contains models: ["models/auto", "models/gemma", "models/lasr", "models/llama", "models/siglip2", "models/t5"]
ArthurZucker
left a comment
Ty!
I think GPT2 potentially needs to be removed from the list, see: https://huggingface.co/Open4bits/granite-4.0-h-tiny-mlx-fp16/tree/main
-pre_tokenizer: ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=True)
+pre_tokenizer: Sequence(pretokenizers=[Split(pattern=Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]..."), behavior=Removed, invert=True), ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=False)])

When I check our old converter, indeed it does not have a split pattern. It's a bit weird.
Similarly:
-pre_tokenizer: ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=True)
+pre_tokenizer: Sequence(pretokenizers=[Digits(individual_digits=True), ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=True)])

appears in some of the GPT2Tokenizer (mapped) models.
Pegasus has a similar issue:
-normalizer: Sequence(normalizers=[Replace(pattern=Regex("\n"), content=" "), Replace(pattern=Regex(" {2,}"), content=" ")])
-pre_tokenizer: Metaspace(replacement="▁", prepend_scheme=always, split=True)
+normalizer: Sequence(normalizers=[Precompiled(precompiled_charsmap="ALQCAACEAAAAAACAAQAAgMz8AgC4BQAAhyIAgMzkAgC4PQAAeyIAgMzsAgC4BQAAiyIAgMw8AADNvAAAmwkAgJ4JAIChCQCAgx0A..."), Replace(pattern=Regex(" {2,}"), content=" ")])
+pre_tokenizer: Sequence(pretokenizers=[WhitespaceSplit(), Metaspace(replacement="▁", prepend_scheme=always, split=True)])

Pattern 8 for LlamaConverter is fine, we did a bug fix!
Pattern 9: CLIP has a small issue:
self._tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [
        pre_tokenizers.Split(
            Regex(
                r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+"""
            ),
            behavior="removed",
            invert=True,
        ),
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ]
)

The split pattern should not have r"""<\|startoftext\|>|<\|endoftext\|>
BlenderbotConverter needs its post processor updated to RobertaProcessing, I think.
-normalizer: Sequence(normalizers=[Strip(strip_left=False, strip_right=True), Replace(pattern=Regex(" {2,}"), content="▁"), Precompiled(precompiled_charsmap="...")])
-pre_tokenizer: Metaspace(replacement="▁", prepend_scheme=always, split=True)
+normalizer: Precompiled(precompiled_charsmap="...")
+pre_tokenizer: Sequence(pretokenizers=[WhitespaceSplit(), Metaspace(replacement="▁", prepend_scheme=always, split=True)])

is what most T5 models have as an issue. I think if we add the regex, we need the normalizer to be strip + replace + precompiled!
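For intuition on what the non-Precompiled part of that old T5-style normalizer does, here is a sketch in plain `re` (not `tokenizers`): strip trailing whitespace, then collapse runs of two-or-more spaces into the metaspace marker, mirroring the serialized `Strip` + `Replace(" {2,}", "▁")` steps. The `Precompiled(precompiled_charsmap=...)` step is omitted because it cannot be reproduced without the charsmap blob from the sentencepiece model.

```python
import re

def t5_style_normalize(text: str) -> str:
    # Strip(strip_right=True) followed by Replace(Regex(" {2,}"), "▁"),
    # approximated with stdlib re; single spaces are left untouched.
    return re.sub(r" {2,}", "▁", text.rstrip())

print(t5_style_normalize("hello   world  "))  # -> "hello▁world"
```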
        return None

    # normalizer
    _normalizers = [normalizers.Replace(" ", "▁")]
that's only for some models, not all of them (e.g. gpt2 uses Ġ)
ah my bad, sentencepiece never used Ġ! So ignore this comment probably.
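The distinction the two comments circle around: GPT-2's byte-level alphabet and sentencepiece's metaspace use different space markers. A quick sketch (the `0x100` shift holds for all bytes below `0x21`, since GPT-2's `bytes_to_unicode` table shifts those excluded bytes in order):

```python
# GPT-2's byte-level table maps non-printable bytes into a shifted Unicode
# range; a space (0x20) lands at chr(0x20 + 0x100) == "Ġ" (U+0120).
# SentencePiece instead marks word boundaries with "▁" (U+2581).
byte_level_space = chr(ord(" ") + 0x100)
metaspace = "\u2581"
print(byte_level_space, metaspace)  # Ġ ▁
```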
    # decoder
    if byte_fallback:
        tokenizer.decoder = decoders.Sequence(
            [decoders.Replace("▁", " "), decoders.ByteFallback(), decoders.Fuse()]
Split digits, split on whitespace, etc. could be extracted as well!
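For readers unfamiliar with what that decoder sequence does, here is a pure-Python sketch of `decoders.Sequence([Replace("▁", " "), ByteFallback(), Fuse()])`: metaspace markers become spaces, runs of `<0xNN>` byte-fallback tokens are decoded as UTF-8, and everything is fused into one string. This is an illustration, not the `tokenizers` implementation.

```python
import re

def decode_with_byte_fallback(tokens):
    """Sketch of Replace('▁', ' ') + ByteFallback() + Fuse() decoding."""
    out, byte_buf = [], bytearray()

    def flush():
        # Decode any pending byte-fallback run as UTF-8.
        if byte_buf:
            out.append(byte_buf.decode("utf-8", errors="replace"))
            byte_buf.clear()

    for tok in tokens:
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", tok)
        if m:
            byte_buf.append(int(m.group(1), 16))
        else:
            flush()
            out.append(tok.replace("▁", " "))
    flush()
    return "".join(out)

# "é" is UTF-8 0xC3 0xA9, which byte-fallback models emit as two byte tokens.
print(decode_with_byte_fallback(["▁caf", "<0xC3>", "<0xA9>"]))  # " café"
```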
    local_kwargs["tokenizer_padding"] = tok_from_file.padding
    local_kwargs["tokenizer_truncation"] = tok_from_file.truncation
    # Preserve truncation and padding baked into tokenizer.json so that classes
    # with a custom __init__ that rebuild the backend tokenizer from scratch
    # can still access these settings.
    if tok_from_file.truncation is not None:
        local_kwargs["_json_truncation"] = tok_from_file.truncation
    if tok_from_file.padding is not None:
        local_kwargs["_json_padding"] = tok_from_file.padding

not sure we need both tokenizer_padding and _json_padding, which are the same
-    _truncation = self._tokenizer.truncation
+    _truncation = kwargs.pop("tokenizer_truncation", None) or self._tokenizer.truncation or _json_truncation
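The `_truncation` line above encodes a three-level precedence. A small sketch of that resolution order (function name is illustrative, not from the PR); note that `or`-chaining also skips falsy-but-meaningful values such as `{}`:

```python
def resolve_truncation(kwargs, backend_truncation, json_truncation):
    # Explicit tokenizer_truncation kwarg wins, then the live backend
    # tokenizer's setting, then whatever was baked into tokenizer.json.
    return kwargs.pop("tokenizer_truncation", None) or backend_truncation or json_truncation

print(resolve_truncation({}, {"max_length": 16}, {"max_length": 32}))
```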
-            normalizers.StripAccents(),
-        ]
-    )
+    self._tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
02766a3 to df12cc4
…rs into bad_models_update
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, bert, blenderbot, gemma, gpt_neox, lasr, llama, mbart50, nllb, openai, pegasus, reformer, siglip2, t5, xlm_roberta
What this PR does
Given the different issues that were noticed by @hmellor on vLLM, we wanted to make sure we did not end up with crazy breaks. We ran a full test suite (code can be found in #44298) and the results showed 22 model converters with potential issues.
We test what the AutoTokenizer.from_pretrained(model_id)._tokenizer would be in v4 vs in v5, reporting the diff in the json serialization. The full report is available here:
report.html
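The comparison behind the report boils down to diffing the top-level components of each serialized tokenizer.json (normalizer, pre_tokenizer, model, decoder, post_processor, ...). A minimal sketch of that diffing step, assuming the two JSON strings come from `tok._tokenizer.to_str()` in each version (the inline `v4`/`v5` samples below are illustrative, not from the actual report):

```python
import json

def tokenizer_json_diff(old_json: str, new_json: str):
    """Return the top-level tokenizer.json keys whose values differ."""
    old, new = json.loads(old_json), json.loads(new_json)
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

v4 = '{"pre_tokenizer": {"type": "ByteLevel", "use_regex": true}, "decoder": {"type": "ByteLevel"}}'
v5 = '{"pre_tokenizer": {"type": "Sequence"}, "decoder": {"type": "ByteLevel"}}'
print(tokenizer_json_diff(v4, v5))  # only "pre_tokenizer" differs
```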
Here is an explanation for each converter:
Important fixes
- ReformerConverter / PegasusConverter / T5Converter / MBart50Converter / XLMRobertaConverter / NllbConverter: this is the biggest "change"; we did miss the precompiled_charsmap and it is required for very specific whitespace cases. For all of these, we only changed the ones that had XNLI mismatches. The motivation is to have an explicit normalizer vs something arbitrary. This follows the default in: https://github.com/huggingface/transformers/blob/v4.57-release/src/transformers/convert_slow_tokenizer.py#L629-L638 (Split)
- In v5, named tokens (mask_token) are always special. This is a known change for v5 and would require relaxing the constraint that named tokens are special. Will consider this in the future.
- do_lower_case=True by default, which we missed going to v5.
- All of these are unigram models that need the precompiled charsmap for very specific edge cases.
Not that important, could affect small stuff
- BlenderbotConverter: it had a RobertaProcessing but it should not
- BigBirdConverter: should not have the strip normalizer?

Mild concern fixes / absolutely safe to ignore
- Qwen2Converter: for both the pre_tokenizer and the decoder, the trim_offsets and use_regex flags are useless. tokenizers#1960 ("Serialization of ByteLevel for decoder and pre_tokenizer is misleading") will address this, but TLDR: behavior is the same.
- TikTokenConverter: same as Qwen2Converter, the decoder flags are useless
- LlamaConverter: this IS expected, as most of the old llama models had the legacy flag with the metaspace issue (link it); the prepend normalizer is bad
- CLIPConverter: the split pattern has "<\|startoftext\|>|<\|endoftext\|>" which we added. It does not make a difference as it's part of the special tokens, but we are removing this.
- OpenAIGPTConverter: behavior was identical, but for fairness it needs BertNormalizer
- MarkupLMConverter: seems safe to ignore; in v4 it did not have RobertaProcessing either
- SeamlessM4TConverter: weird
- RobertaConverter: default template is different + add_prefix_space=False
- GPT2Converter: equivalent diff / safe to ignore. pre_tokenizer=ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=True) is equivalent to the default regex: Sequence(pretokenizers=[Split(pattern=Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]|\s[\r\n]..."), behavior=Removed, invert=True), ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=False)]) (link to tokenizers byte_level and default)

UPDATE TO: https://github.com/huggingface/transformers/pull/44179/changes (deepseek v2, internlm2)
Models with an incorrect tokenizer_class in tokenizer_config.json that should use TokenizersBackend
Sentencepiece / Tiktoken models: these only ship a tokenizer.model file, so we need to extract that info and create a generic fallback that will create a _tokenizer object from the tokenizer.model file.

Script to compare roundtrip tokenizer outputs:
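The actual script was attached to the PR; here is a hypothetical sketch of the shape such a roundtrip comparison takes. `tok_a`/`tok_b` are assumed to expose `encode(str) -> list[int]` and `decode(list[int]) -> str`, as transformers tokenizers do (e.g. the v4 and v5 AutoTokenizer for the same repo); the toy tokenizer classes below exist only to make the example runnable.

```python
def compare_roundtrip(tok_a, tok_b, texts):
    """Encode each text with both tokenizers and report disagreements,
    either in the token ids or in the decoded roundtrip text."""
    mismatches = []
    for text in texts:
        ids_a, ids_b = tok_a.encode(text), tok_b.encode(text)
        if ids_a != ids_b or tok_a.decode(ids_a) != tok_b.decode(ids_b):
            mismatches.append(text)
    return mismatches

# Toy stand-ins for real tokenizers, just to show the output shape.
class CharTok:
    def encode(self, s): return [ord(c) for c in s]
    def decode(self, ids): return "".join(chr(i) for i in ids)

class LowerTok(CharTok):
    def encode(self, s): return [ord(c) for c in s.lower()]

print(compare_roundtrip(CharTok(), LowerTok(), ["abc", "ABC"]))  # ['ABC']
```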