Some slightly odd results running inference (may just be prompt or tokenizing issue) #5
-
Continuing from #1 (comment): after converting the model to GGUF, I tried a few generations, but the results were a bit strange. I don't know that this implies any problem with the model; however, I can generally get pretty reasonable results from LLaMA models with similar prompting. Output from a few generations: https://gist.github.com/KerfuffleV2/a7dcbc9adab5506c0cb77de37653e51b

1. It started out pretty good, but then it got quite weird. The main thing I'd call strange in this one is "The fox jumped up and ran away with the two dead wolves in his mouth. [...] He was too weak to run very fast" - I'm not surprised he couldn't run fast, considering the size difference between wolves and foxes.
2. "小白总是会出现在小白的面前" ("Xiaobai would always appear right in front of Xiaobai") — he always popped up in front of himself? "这个时候小黑才意识到自己已经死了" ("only at this point did Xiaohei realize he was already dead") — it didn't seem to affect him too much. I can say something positive as well, though. I'm just learning Mandarin, so I don't know how much my impression is worth, but compared to the output from every other local LM I've tried, Yi's Mandarin writing feels really natural. The flow, use of particles, etc. I haven't seen other models write like that. When the rabbit said "我没事我只是被一只狼给咬了一口而已没什么大不了的" ("I'm fine, I just got bitten once by a wolf, it's no big deal"), I had to laugh. I also like how it came up with a moral (道理): "害人之心不可有防人之心不可无!" ("you shouldn't set out to harm others, but you can't let your guard down against them!") - not sure if that's a real saying or if it made that part up.
3. The first weird thing here is how it wrote some random numbers before the chapters, like "1.4 第二章" ("1.4 Chapter Two"). A lot of this one is actually really good, really imaginative. "于是,它们决定要好好照顾这只小鸟。" ("So, they decided to take good care of this little bird.") — but you guys just watched him die. How are you going to take good care of (好好照顾) him now?

The main thing I'd call strange in those examples is when it writes something that contradicts what it just wrote previously. I wonder if it's possible there's something weird going on with the tokenizing of the initial prompt, or the inclusion of stuff like the initial BOS token. Here's an example of how
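For reference, a quick way to check exactly what the initial prompt turns into, and whether a BOS token is being prepended, is a small script along these lines. This is only a sketch using llama-cpp-python (not something from this thread); the model path and prompt are placeholders, and the exact API may differ between versions.

```python
# Sketch: dump the token IDs a prompt produces, with and without a BOS token,
# to rule out tokenizer/BOS weirdness. Model path and prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="yi-6b.Q8_0.gguf", vocab_only=True, verbose=False)  # vocab is enough for tokenizing
prompt = "第一章\n\n从前,森林里住着一只小狐狸。"  # placeholder story prompt

for add_bos in (False, True):
    toks = llm.tokenize(prompt.encode("utf-8"), add_bos=add_bos)
    print(f"add_bos={add_bos}: {len(toks)} tokens, first few: {toks[:12]}")
    # Round-trip to confirm the text survives tokenize/detokenize unchanged.
    print("  round-trip:", llm.detokenize(toks).decode("utf-8", errors="replace"))
```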
-
In the base model, we just use the EOS token to separate documents (no BOS at all).
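If I'm reading that right, the pretraining token stream would look roughly like the sketch below: documents joined with an EOS token between them, and no BOS anywhere. This is only my reading of the comment, not actual Yi training code, and the model path and documents are placeholders.

```python
# Sketch of EOS-separated document packing, as described above.
# Not actual Yi training code; model path and documents are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="yi-6b.Q8_0.gguf", vocab_only=True, verbose=False)
docs = ["第一篇文档的内容。", "第二篇文档的内容。"]  # placeholder documents

stream = []
for doc in docs:
    stream += llm.tokenize(doc.encode("utf-8"), add_bos=False)  # no BOS at all
    stream.append(llm.token_eos())                              # EOS separates documents
print(stream)
```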
-
I tested some benchmarks with a BOS token added at the front of the prompt, and found that it does affect model performance (by about 10%–20%). I'm not very familiar with llama.cpp, so I just hard-coded it (https://github.com/ggerganov/llama.cpp/blob/master/examples/main/main.cpp#L232). In my limited tests, the results feel a bit better, so I think you can give it a try~ By the way, the chat model will be released later this month, and we can see if it does better~
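For anyone who wants to try the same comparison without patching main.cpp, here is a rough equivalent using llama-cpp-python. It is only a sketch: llama-cpp-python isn't mentioned in this thread, the model path and prompt are placeholders, and the sampling just uses the library defaults.

```python
# Sketch: compare generations with and without a BOS token prepended to the prompt.
# Model path and prompt are placeholders; sampling uses llama-cpp-python defaults.
from llama_cpp import Llama

llm = Llama(model_path="yi-6b.Q8_0.gguf", n_ctx=2048, verbose=False)
prompt = "从前,森林里住着一只小狐狸。"  # placeholder story prompt

def run(tokens, max_new=128):
    out = []
    for tok in llm.generate(tokens):  # default sampling settings
        if tok == llm.token_eos() or len(out) >= max_new:
            break
        out.append(tok)
    return llm.detokenize(out).decode("utf-8", errors="replace")

base = llm.tokenize(prompt.encode("utf-8"), add_bos=False)
with_bos = [llm.token_bos()] + base  # same idea as hard-coding it in main.cpp

print("no BOS:  ", run(base))
llm.reset()  # clear cached state between runs
print("with BOS:", run(with_bos))
```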
-
The necessary support is now in the latest