
Fix Incorrect Token Normalization Method for LlamaCppTokenizer #992

Merged: 1 commit merged into dottxt-ai:main on Jun 22, 2024

Conversation

@lapp0 (Contributor) commented on Jun 20, 2024

Fixes #952

Problem

During FSM index creation, the normalized vocabulary is used to determine which tokens are valid for each state of the automaton.

Because the set of token strings in tokenizer.vocabulary isn't equivalent to the set of strings produced by tokenizer.decode, the vocabulary tokens must be normalized so that the FSM only selects tokens which, when decoded, conform to the pattern.

In models.llamacpp there was no normalization, and the \n token is represented in the vocabulary as Ċ:

>>> tokenizer.vocabulary['Ċ']
198
>>> tokenizer.decode([198])
['\n']
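
For context (not part of this PR): the stand-in character comes from the byte-level BPE mapping that GPT-2-style vocabularies use to represent raw bytes as printable characters. Assuming the transformers package is installed, the mapping can be inspected directly:

>>> from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
>>> byte_to_char = bytes_to_unicode()  # raw byte value -> printable stand-in character
>>> byte_to_char[0x0A]                 # the newline byte, as stored in the vocabulary
'Ċ'
>>> byte_to_char[0x20]                 # the space byte, another common example
'Ġ'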

During FSM compilation, Outlines therefore treated "Ċ" as a valid JSON string fragment; in reality, the token decodes to "\n", which is not valid inside a JSON string.

Solution

Implement a working normalization function, convert_token_to_string(), so that behavior is correct when a Hugging Face transformers tokenizer is used with models.llamacpp.
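
A minimal sketch of the idea, assuming a Hugging Face tokenizer with a byte-level BPE vocabulary is available (the model name and wrapper below are illustrative, not the PR's actual code):

from transformers import AutoTokenizer

# Illustrative model; any byte-level BPE tokenizer exhibits the same 'Ċ' -> '\n' behavior.
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def convert_token_to_string(token: str) -> str:
    # Detokenize the single raw vocabulary token with the HF tokenizer, so byte-level
    # stand-ins such as 'Ċ' are mapped back to the characters they actually decode to.
    return hf_tokenizer.convert_tokens_to_string([token])

convert_token_to_string("Ċ")  # '\n'

With this normalization applied during index construction, the FSM sees the string each token actually produces, so a token like 198 is only allowed where a literal newline is valid.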

@lapp0 added the structured generation, tokenization, correctness, and llama.cpp labels on Jun 20, 2024
@rlouf merged commit 8dcd24e into dottxt-ai:main on Jun 22, 2024 (7 checks passed)
Development

Successfully merging this pull request may close these issues.

Validation Error during pydantic validation for Llama3 GGUF