Special tokens are not rendered correctly (as empty) -- llama3 specific? #6770

Closed
DreamGenX opened this issue Apr 19, 2024 · 7 comments · Fixed by #6807

Comments

@DreamGenX

DreamGenX commented Apr 19, 2024

Hello!

Using this GGUF: https://huggingface.co/LoneStriker/opus-v1.2-llama-3-8b-GGUF

When the output contains any of the special tokens, like <|im_start|> or <|im_end|>, they are rendered as an empty string. This breaks custom stop-string functionality (e.g. adding "<|im_end|>" to the stop strings does not work, since it relies on string comparison).

The tokens are tokenized correctly, just not rendered:

main: prompt: '<|im_end|>'
main: number of tokens in prompt = 1
128009 -> ''
main: prompt: '<|im_start|>'
main: number of tokens in prompt = 1
128006 -> ''
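For reference, a minimal repro sketch of the same check (not code from the issue; it assumes the llama.cpp common helpers llama_tokenize / llama_token_to_piece as they looked around this commit):

```cpp
// Repro sketch: tokenize a special token with parse_special enabled, then try to
// render each token back. Before the fix, control tokens come back as empty pieces.
#include "common.h"
#include "llama.h"
#include <cstdio>
#include <vector>

int main(int argc, char ** argv) {
    llama_backend_init();

    llama_model   * model = llama_load_model_from_file(argv[1], llama_model_default_params());
    llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // parse_special = true so "<|im_end|>" becomes a single control token (128009 in the log above)
    std::vector<llama_token> toks = llama_tokenize(ctx, "<|im_end|>", /*add_special*/ false, /*parse_special*/ true);

    for (llama_token t : toks) {
        printf("%d -> '%s'\n", t, llama_token_to_piece(ctx, t).c_str());
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```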

I first tested this with an old commit:

version: 2243 (201294ae)
201294ae177b308fb3a99dc504dd6d27e8afa907

And replicated it with a fresh main:

version: 2698 (637e9a86)
637e9a86c220718d008b54842dfd294aa96d3b7a
@bavellone

bavellone commented Apr 19, 2024

I believe I'm also running into this issue using Meta-Llama-3-70B-Instruct.IQ3_XS.gguf - I'm seeing tokens being output from the model, but decoding them all returns empty strings (I let it run for a few hundred tokens). I'm not seeing this behaviour with a Meta-Llama-3-8B-Instruct.Q6_K.gguf model.

Offloading to ROCm, only loading ~25 layers for 70B.

@DreamGenX
Author

DreamGenX commented Apr 20, 2024

KoboldCpp has somewhat of a fix: https://github.com/LostRuins/koboldcpp/releases/tag/v1.63

Added support for special tokens in stop_sequences. Thus, if you set <|eot_id|> as a stop sequence and it can be tokenized into a single token, it will just work and function like the EOS token, allowing multiple EOS-like tokens.

Commit: LostRuins@3170284

As far as I can tell, it will still not render the tokens, but at least stopping should work.
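Conceptually, that workaround matches stop sequences on token IDs instead of on detokenized text. A rough sketch of the idea (not KoboldCpp's actual code, and assuming the common llama_tokenize helper):

```cpp
// Sketch: if a stop string tokenizes to a single (special) token, compare token IDs
// instead of the rendered text, which may be empty for control tokens.
#include "common.h"
#include "llama.h"
#include <string>
#include <vector>

static bool is_stop_token(llama_context * ctx, const std::string & stop, llama_token sampled) {
    // parse_special = true so "<|eot_id|>" maps to its single control-token ID
    std::vector<llama_token> toks = llama_tokenize(ctx, stop, /*add_special*/ false, /*parse_special*/ true);
    return toks.size() == 1 && toks[0] == sampled;
}
```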

@Lyrcaxis

There's also something wrong with the existing tokenizer -- \n\n (Ċ Ċ) should be properly merged into a single token based on the tokenizer's merge instructions (ĊĊ), but unless it's at the end of the prompt, it tokenizes as [\n, \n].

(Note: Only tried via the LLAMA API using LLamaSharp)

I think the tokenizer integration in GGUFs could use some attention overall (merges + added_tokens).
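A quick way to check the merge behaviour described above (a sketch under the same assumptions about the common llama_tokenize helper):

```cpp
// Sketch: "\n\n" should come back as one merged token (the ĊĊ merge), both on its own
// and in the middle of a prompt, rather than as two separate "\n" tokens.
#include "common.h"
#include "llama.h"
#include <vector>

static bool newline_merge_ok(llama_context * ctx) {
    std::vector<llama_token> alone  = llama_tokenize(ctx, "\n\n",           /*add_special*/ false);
    std::vector<llama_token> inside = llama_tokenize(ctx, "Hello\n\nWorld", /*add_special*/ false);
    if (alone.size() != 1) return false;
    for (llama_token t : inside) {
        if (t == alone[0]) return true; // the merged token also appears mid-prompt
    }
    return false;
}
```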

@phymbert
Collaborator

@DreamGenX
Author

Hey @phymbert -- did you check the description of the issue? I don't think anything in the issues you linked is really relevant to, or solves, this problem -- the problem being that special tokens are not rendered.

@ggerganov
Owner

We can start rendering special tokens here:

llama.cpp/llama.cpp, lines 17017 to 17019 in 0e4802b:

} else if (llama_is_control_token(model->vocab, token)) {
;
}
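For context, rendering at that spot would roughly mean copying the stored token text in the control-token branch, the same way normal tokens are handled -- an illustrative sketch only (field names assumed from llama.cpp internals; the actual change in #6807 may differ, e.g. by making this opt-in):

```cpp
} else if (llama_is_control_token(model->vocab, token)) {
    // sketch: render the control token's stored text (e.g. "<|im_end|>") instead of nothing
    const std::string & result = model->vocab.id_to_token[token].text;
    if (length < (int) result.length()) {
        return -(int) result.length();
    }
    memcpy(buf, result.c_str(), result.length());
    return result.length();
}
```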

But my personal opinion is that parsing the text of special/control tokens is a poor practice. AFAICT it seems to have worked so far since we have incorrectly exported tokens such as "<|im_end|>" as normal text tokens.

In #6745 we will introduce llama_token_is_eog(), which can be used to properly check for end-of-generation tokens. I think this is more robust, and it's better to adopt that interface.
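In that model, a generation loop checks the sampled token ID directly rather than its rendered text. A small sketch assuming the llama_token_is_eog() signature from #6745:

```cpp
#include "llama.h"

// Sketch: decide whether to stop generation on the sampled token via the
// end-of-generation check from #6745, instead of string-matching stop text.
static bool should_stop(const llama_model * model, llama_token sampled) {
    // true for EOS-like tokens such as <|eot_id|> on Llama 3 chat models
    return llama_token_is_eog(model, sampled);
}
```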

@DreamGenX
Author

DreamGenX commented Apr 22, 2024

Thanks for the PR @ggerganov, awesome.

Whether it's bad practice depends very much on the use case. For the use case most people deal with, which is generating an assistant response based on conversation history, I agree it's not needed -- just end the prompt with the message header and stop on the EOM / EOT (end-of-message) token.

There are other use cases, though, where you fine-tune the model to generate multiple turns. The simplest example would be multiple messages from multiple role-play characters at once, where the message header contains the character name and possibly other metadata. Or generating multiple function-call instructions. In those cases special tokens allow you to properly parse the response, rather than relying on ad-hoc formatting.
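To make that concrete, a multi-turn response could be split on the control-token IDs themselves rather than on their (possibly empty) text. An illustrative sketch, using the ChatML-style marker names from this thread and the same assumed common helpers:

```cpp
// Sketch: split a generated token stream into turns by matching the control-token IDs
// for <|im_start|> / <|im_end|>, instead of scanning the detokenized text.
#include "common.h"
#include "llama.h"
#include <string>
#include <vector>

static std::vector<std::string> split_turns(llama_context * ctx, const std::vector<llama_token> & out) {
    const llama_token tok_start = llama_tokenize(ctx, "<|im_start|>", false, /*parse_special*/ true).at(0);
    const llama_token tok_end   = llama_tokenize(ctx, "<|im_end|>",   false, /*parse_special*/ true).at(0);

    std::vector<std::string> turns;
    std::string cur;
    bool in_turn = false;
    for (llama_token t : out) {
        if (t == tok_start) { in_turn = true; cur.clear(); continue; }
        if (t == tok_end)   { if (in_turn) turns.push_back(cur); in_turn = false; continue; }
        if (in_turn) cur += llama_token_to_piece(ctx, t); // header + content of the turn
    }
    return turns;
}
```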
