Fix Incorrect Token Normalization Method for LlamaCppTokenizer
#992
Fixes #952
Problem

During FSM index creation, the normalized vocabulary is used to determine token validity for each state in the automaton. Because the set of token strings in `tokenizer.vocabulary` isn't equivalent to the token strings produced by `tokenizer.decode`, the vocabulary tokens must be normalized to ensure the FSM selects tokens from the vocabulary which, when decoded, conform to the pattern.

In `models.llamacpp` there wasn't any normalization, and the `\n` token is represented in the vocabulary as `Ċ`. During FSM compilation, Outlines understood `"Ċ"` to be a valid JSON string, but in reality the invalid JSON string `"\n"` was decoded.

Solution

Implement a working normalization function, `convert_token_to_string()`, to ensure behavior is correct when using a Hugging Face transformers tokenizer with `models.llamacpp`.
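To illustrate why `Ċ` in the vocabulary corresponds to a decoded `\n`, here is a minimal, self-contained sketch of byte-level BPE normalization in the style of GPT-2's `bytes_to_unicode` table (the function names below are illustrative, not necessarily the ones used in the actual fix):

```python
def bytes_to_unicode() -> dict:
    """GPT-2 style byte-level BPE table: map every byte 0-255 to a
    printable unicode character so tokens never contain raw control bytes."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Unprintable bytes are shifted into the U+0100+ range;
            # byte 0x0A ("\n") lands on chr(266) == "Ċ".
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


def convert_token_to_string(token: str) -> str:
    """Invert the byte-level table so a vocabulary token string matches
    what the tokenizer actually decodes."""
    unicode_to_bytes = {v: k for k, v in bytes_to_unicode().items()}
    return bytes(unicode_to_bytes[c] for c in token).decode(
        "utf-8", errors="replace"
    )


print(repr(convert_token_to_string("Ċ")))       # the newline token
print(repr(convert_token_to_string("Ġhello")))  # leading-space token
```

With this normalization applied to the vocabulary before FSM compilation, the index sees `"\n"` rather than `"Ċ"`, so a pattern that forbids raw newlines inside a JSON string correctly rejects the token.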