Fix spm whitespaces #2806
Conversation
Sorry, my knowledge level for tokenizing stuff falls short of being able to review something like this. I really only have basic familiarity with the broad strokes.
The HellaSwag score is the same as pre-GGUF up to 400 tasks; it eventually diverges slightly. I'm running it right now with this PR, and for LLaMA-v2-7B I get 75.1 at task 1000 vs 75.2 pre-GGUF. I guess this is to be expected given all the other tokenizer changes?
if (llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM) {
    // Add a space in front of the first character to match OG llama tokenizer behavior
    params.prompt.insert(0, 1, ' ');
}
Why do we insist on automatically adding whitespace to the user's prompt? Isn't it better to let the user decide? As a user, I can decide to prompt with -p "Blah blah" or -p " Blah blah". I know the LLM response will be different, and I can decide to do whatever I believe is better. The way it is here, I have no way to prompt the model without a space at the beginning of my prompt, even if I knew exactly that this would be better.
You are right, but users in general have no clue that the SentencePiece tokenizer expects a space prepended to every word, including the first word in the prompt.
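To illustrate, here is a minimal sketch (not from this PR; the tokenizer.model path is a placeholder and the exact pieces depend on the model):

```python
from sentencepiece import SentencePieceProcessor

# Placeholder path; point this at a real LLaMA tokenizer.model.
sp = SentencePieceProcessor("models/7B/tokenizer.model")

# SentencePiece marks word boundaries with the "▁" meta symbol, so every word,
# including the first one (via add_dummy_prefix), is stored with a leading space.
print(sp.encode("Hello world", out_type=str))
# Typically something like: ['▁Hello', '▁world']
```

Because the llama.cpp tokenizer does not add that prefix automatically, the prompt is given a leading space here so the first word tokenizes the same way as with the OG tokenizer.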
> I have no way to prompt the model without a space at the beginning of my prompt, even if I knew exactly that this would be better.
Is there any model that llama.cpp supports where it would actually be better? (Worst comes to worst, we could probably add a command-line argument to disable that behavior, but there would have to be a practical use case.)
We are still doing something wrong somewhere. When I test the OG sentencepiece tokenizer with the following Python code:

import argparse
from sentencepiece import SentencePieceProcessor

parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()

tokenizer = SentencePieceProcessor(args.dir_tokenizer + '/tokenizer.model')

text = 'Hello, world!'
print(text)
print(tokenizer.encode(text, add_bos=True))
print(tokenizer.decode(tokenizer.encode(text, add_bos=True)))

I get the output:
Our tokenizer tokenizes correctly; however, the de-tokenization incorrectly prefixes the text with a whitespace:
Also, I don't think that manually having to prefix the prompt with a whitespace in user code is correct.
I just restored the pre-gguf behavior. I cannot find any sources saying that the correct way is to prepend a whitespace to the prompt.
I don't know what the intent was when it was first added. The comment says "to match the OG tokenizer", but I'm not sure in what way it matches it.

Regarding the extra leading space during de-tokenization, I can see that; I guess I'll add this to our de-tokenization code as well. This does not affect any of the generation results, but it fixes a display problem and makes it consistent with the OG behaviour.
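For reference, a minimal sketch (placeholder model path again) of the OG behaviour being matched here: SentencePiece's decode drops the space that add_dummy_prefix inserted on encode, whereas a naive piece-join keeps it as a leading space:

```python
from sentencepiece import SentencePieceProcessor

# Placeholder path; point this at a real LLaMA tokenizer.model.
sp = SentencePieceProcessor("models/7B/tokenizer.model")

text = "Hello, world!"

# The OG round trip drops the space that add_dummy_prefix inserted on encode,
# so decode returns the original text with no leading space.
ids = sp.encode(text, add_bos=True)
assert sp.decode(ids) == text

# Piece-level view (without BOS): the first piece carries the "▁" boundary marker.
pieces = sp.encode(text, out_type=str)
print(pieces)  # e.g. ['▁Hello', ',', '▁world', '!'] (exact split depends on the model)

# A detokenizer that simply maps "▁" -> " " emits a leading space; stripping it
# from the start of the result matches the OG decode output.
naive = "".join(pieces).replace("▁", " ")
print(repr(naive))              # e.g. ' Hello, world!'
print(repr(naive.lstrip(" ")))  # 'Hello, world!'
```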
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
* llama.cpp : fix spm whitespace escaping + clean up
* main.cpp : spm - add whitespace in front of prompt
* test-tokenizer-0.cpp : spm - add whitespace in front of prompt
Fix so that prepending a whitespace to the prompt when using the spm tokenizer is done in user code instead of in the tokenizer.
This will restore pre-gguf accuracy in HellaSwag with models using the spm tokenizer (#2321 (comment)).