Fix Incorrect Token Normalization Method for LlamaCppTokenizer
#992
Fixes #952
Problem

During FSM index creation, the normalized vocabulary is used to determine token validity for each state in the automaton. Because the set of token strings in `tokenizer.vocabulary` isn't equivalent to the token strings produced by `tokenizer.decode`, the vocabulary tokens must be normalized to ensure the FSM selects tokens from the vocabulary which, when decoded, conform to the pattern.

In `models.llamacpp` there wasn't any normalization, and the `\n` token is represented in the vocabulary as `Ċ`. During FSM compilation, Outlines understood `"Ċ"` to be a valid JSON string, but in reality the invalid JSON string `"\n"` was decoded.

Solution

Implement a working normalization function, `convert_token_to_string()`, to ensure behavior is correct when using a Hugging Face transformers tokenizer with `models.llamacpp`.
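To illustrate why `Ċ` in the vocabulary corresponds to a decoded `\n`, here is a minimal, self-contained sketch of byte-level BPE normalization in the style of GPT-2's `bytes_to_unicode` table (the function names below are illustrative, not necessarily the ones used in the actual fix):

```python
def bytes_to_unicode() -> dict:
    """GPT-2 style byte-level BPE table: map every byte 0-255 to a
    printable unicode character so tokens never contain raw control bytes."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Unprintable bytes are shifted into the U+0100+ range;
            # byte 0x0A ("\n") lands on chr(266) == "Ċ".
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def convert_token_to_string(token: str) -> str:
    """Invert the byte-level table so a vocabulary token string matches
    what the tokenizer actually decodes."""
    unicode_to_bytes = {v: k for k, v in bytes_to_unicode().items()}
    return bytes(unicode_to_bytes[c] for c in token).decode(
        "utf-8", errors="replace"
    )


print(repr(convert_token_to_string("Ċ")))       # the newline token
print(repr(convert_token_to_string("Ġhello")))  # leading-space token
```

With this normalization applied to the vocabulary before FSM compilation, the index sees `"\n"` rather than `"Ċ"`, so a pattern that forbids raw newlines inside a JSON string correctly rejects the token.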