-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix spm whitespaces #2806
Fix spm whitespaces #2806
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my knowledge level for tokenizing stuff falls short of being able to review something like this. I really only have basic familiarity with the broad strokes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The HellaSwag score is the same as pre-GGUF up to 400 tasks. It eventually diverges slightly. I'm running it right now with this PR, and for LLaMA-v2-7B I get 75.1 at task 1000 vs 75.2 pre-GGUF. I guess, this is to be expected given all other tokenizer changes?
|
||
if (llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM) { | ||
// Add a space in front of the first character to match OG llama tokenizer behavior | ||
params.prompt.insert(0, 1, ' '); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we insist to automatically add white space to the user's prompt? Isn't it better to let the user decide? Me as a user can decide to prompt with -p "Blah blah"
or -p " Blah blah"
. I know the LLM response will be different, and I can decide to do whatever I believe is better. The way it is here, I have no way to prompt the model without a space at the beginning of my prompt, even if I knew exactly that this would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, but users in general have no clue that the sentencepiece tokenizer should have a space prepended to every word including the first word in the prompt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no way to prompt the model without a space at the beginning of my prompt, even if I knew exactly that this would be better.
Is there any model that llama.cpp
supports where it would actually be better? (Worst comes to worse, could probably add a commandline argument to disable that behavior but there would probably have to be a practical use case.)
We are still doing something wrong somewhere. When I test the OG sentencepiece tokenizer with the following python code: import os
import sys
import argparse
from sentencepiece import SentencePieceProcessor
parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()
dir_tokenizer = args.dir_tokenizer
tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')
text = 'Hello, world!'
print(text)
print(tokenizer.encode(text, add_bos=True))
print(tokenizer.decode(tokenizer.encode(text, add_bos=True))) I get the output:
Our tokenizer tokenizes correctly, however the de-tokenization incorrectly prefixes the text with a whitespace:
Also, I don't think that manually having to prefix with a whitespace in user code is correct. |
I just restored the pre-gguf behavior. I can not find any sources that says that the correct way is to prepend a whitespace to the prompt. |
I don't know what was the intent when it was first added. The comment says "to match the OG tokenizer", but not sure in what way it matches it. Regarding the extra leading space during de-tokenization, I can see that in I guess I'll add this to our de-tokenization code as well. This does not affect any of the generation results, but fixes a display problem and makes it consistent with the OG behaviour. |
* master: (773 commits) server : add `/detokenize` endpoint (ggerganov#2802) convert.py : advanced option (ggerganov#2753) llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814) flake.nix : add rocm support and cleanup (ggerganov#2808) llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817) main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528) llama : use std::abs in llama_sample_tail_free (ggerganov#2800) k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811) Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807) Fix HellaSwag (ggerganov#2805) flake : build llama.cpp on Intel with nix (ggerganov#2795) Handle null rope scaling value (ggerganov#2793) Fix spm whitespaces (ggerganov#2806) examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804) llama : fix struct decl (ggerganov#2790) Faster perplexity computation (ggerganov#2786) llama : add llama_beam_search() (ggerganov#2267) convert.py : Get rope scale from HuggingFace models (ggerganov#2772) llama-bench : add model sizes (ggerganov#2771) convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773) ...
* llama.cpp : fix spm whitespace escaping + clean up * main.cpp : spm - add whitespace in front of prompt * test-tokenizer-0.cpp : spm - add whitespace in front of prompt
Fix so prepending whitespace to prompt when using spm tokenizer is done in user code instead of in the tokenizer.
This will restore pre-gguf accuracy in HellaSwag with models using the spm tokenizer (#2321 (comment)).