Fix HellaSwag #2805

ikawrakow · 2023-08-26T08:09:21Z

The HellaSwag scores are 4-5 percentage points lower compared to what is posted in #2321. The score change occurred after #2398 was merged. The difference is due to changes in the tokenizer related to how space is being handled. In a HellaSwag task one evaluates 4 possible endings after a given context. These endings begin with space, and this leads to a different tokenization on current master compared to what we had before the GGUF changes in #2398. I tried adding the space to the context rather than the endings, but this did not improve the score. So, what this PR does to avoid space handling issues is to tokenize context+ending together, and then evaluate the tokens after the context tokens. This improves the HellaSwag score significantly. It is not exactly the same as what we had before #2398, but comes close: For LLaMA-v2-7B it is 76.75 (PR) vs 77.25 (before GGUF) after 400 tasks, 74.9 vs 75.2 after 1000 tasks, and 75.35 vs 75.4 after 2000 tasks.

Update: Final HellaSwag scores after 10042 tasks for LLaMA-v2-7B:

Version	Score
Master	71.4
This PR	75.65
Before GGUF	75.82

Update 2: The culprit is this line:

llama.cpp/llama.cpp

Line 3030 in bae5c5f

std::string result = "\xe2\x96\x81";

If I comment it out, I recover pre-GGUF HellaSwag scores. If I replace it with

    std::string result;
    if (!text.empty() && text.front() != ' ') result = "\xe2\x96\x81";

I get the score of this PR without the change in the PR.

So, the question is: why was unconditionally adding an escaped white space to the string to be tokenized required?
@ggerganov @goerch

KerfuffleV2 · 2023-08-26T10:00:25Z

why was unconditionally adding an escaped white space to the string to be tokenized required?

Unconditionally doesn't sound right, but there's a reason to add it if it's not there.

 if (!text.empty() && text.front() != ' ') result = "\xe2\x96\x81";

If just leaving a ' ' is good enough then can't you just do result = " "? I'm pretty sure the tokenizing stuff just does a search and replace of "\xe2\x96\x81" to " " so there shouldn't be a practical difference.

klosax · 2023-08-26T10:21:12Z

PR #2806 restores the pre-gguf accuracy.

ikawrakow · 2023-08-26T11:40:46Z

PR #2806 restores the pre-gguf accuracy.

I can confirm this to be true, so strictly speaking one could just close this PR. Nevertheless, I still think it is worth having this change in the HellaSwag calculation as it makes it more robust to potential future tokenization changes.

klosax · 2023-08-26T12:00:50Z

Nevertheless, I still think it is worth having this change in the HellaSwag calculation as it makes it more robust to potential future tokenization changes.

The reason I did separate the context and endings was to be sure to know exactly where the tokenized context ends.

The decision to have the separation space in the endings was that is more compatible with sentencepiece, which works much better with spaces prepended to all words.

klosax · 2023-08-26T12:19:18Z

Have you tested that this works with the bpe tokenizer ie Falcon-7b? F16 model should give 76.75 on 400 tasks.

IgnacioFDM · 2023-08-26T12:31:54Z

We should still expect the scores to be a couple of points lower than what you'd get with lm-evaluation-harness HellaSwag (with an AutoGPTQ model), right? Any insights into why the difference?

ggerganov

Could the different eps value explain the remaining small difference that you observe after the fixes in the tokenizer?

Pre-GGUF used eps 5e-6 and GGUF uses 1e-5

ikawrakow · 2023-08-26T13:48:37Z

Could the different eps value explain the remaining small difference that you observe after the fixes in the tokenizer?

Pre-GGUF used eps 5e-6 and GGUF uses 1e-5

It could be. Pre-GGUF I ran with eps = 5e-6 and eps = 1e-5, but for the long run with 10042 tasks I have kept only the hellaswag output, so not really sure with what eps it was done.

ikawrakow · 2023-08-26T13:59:53Z

Have you tested that this works with the bpe tokenizer ie Falcon-7b? F16 model should give 76.75 on 400 tasks.

I'm not able to test. Downloaded the pytorch files from HF and ran convert-falcon-hf-to-gguf.py. It told me that it converted it successfully. When I try HellaSwag with the resulting .gguf file, I get scores consistent with random chance. Perplexity is in the range of 400+. This happens on master and on this branch. Not sure what I'm doing wrong.

klosax · 2023-08-26T14:06:10Z

You are using the official model? https://huggingface.co/tiiuae/falcon-7b/tree/main

klosax · 2023-08-26T14:07:10Z

Try parameter -nommq if you are using cublas. The logprobs are probably generating NaNs. And there are known problems with offloading.

slaren · 2023-08-26T14:10:54Z

CUDA currently doesn't work with falcon when offloading the KV. IIRC, the maximum layers that can be offloaded for 7B is 33 and 60 for 40B.

ikawrakow · 2023-08-26T14:11:43Z

You are using the official model? https://huggingface.co/tiiuae/falcon-7b/tree/main

Yes.

Try parameter -nommq if you are using cublas. The logprobs are probably generating NaNs.

Running the fp16 model, so -nommq is not involved. There are no NaNs, but the perplexity is very high (400-600). Something is going wrong during conversion and I don't know what. Is there a ready .gguf model somewhere I can download?

klosax · 2023-08-26T14:15:15Z

Cant find any F16 but here is Q8_0 and Q4_0, I have not tested if they works:
https://huggingface.co/NikolayKozloff/falcon-7b-GGUF/tree/main

ikawrakow · 2023-08-26T14:15:30Z

Thank you, @slaren. This fixes it. I had missed this part (or better, it had slipped my mind).

ikawrakow · 2023-08-26T14:32:17Z

Yes, I now get HellaSwag = 76.75 for Falcon-7B after 400 tasks.

klosax · 2023-08-26T14:33:52Z

Yes, I now get HellaSwag = 76.75 for Falcon-7B after 400 tasks.

Great!

* master: (773 commits) server : add `/detokenize` endpoint (ggerganov#2802) convert.py : advanced option (ggerganov#2753) llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814) flake.nix : add rocm support and cleanup (ggerganov#2808) llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817) main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528) llama : use std::abs in llama_sample_tail_free (ggerganov#2800) k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811) Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807) Fix HellaSwag (ggerganov#2805) flake : build llama.cpp on Intel with nix (ggerganov#2795) Handle null rope scaling value (ggerganov#2793) Fix spm whitespaces (ggerganov#2806) examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804) llama : fix struct decl (ggerganov#2790) Faster perplexity computation (ggerganov#2786) llama : add llama_beam_search() (ggerganov#2267) convert.py : Get rope scale from HuggingFace models (ggerganov#2772) llama-bench : add model sizes (ggerganov#2771) convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773) ...

Co-authored-by: Iwan Kawrakow <[email protected]>

Fix HellaSwag

d34472c

ikawrakow requested a review from ggerganov August 26, 2023 12:11

ggerganov approved these changes Aug 26, 2023

View reviewed changes

ikawrakow merged commit 771551a into master Aug 26, 2023

ikawrakow deleted the ik/fix_hellaswag branch August 26, 2023 13:48

ggerganov mentioned this pull request Aug 26, 2023

llama : more tokenizer fixes #2810

Merged

akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023

Fix HellaSwag (ggerganov#2805)

79d7024

Co-authored-by: Iwan Kawrakow <[email protected]>

ikawrakow restored the ik/fix_hellaswag branch January 16, 2024 16:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix HellaSwag #2805

Fix HellaSwag #2805

ikawrakow commented Aug 26, 2023 •

edited

Loading

KerfuffleV2 commented Aug 26, 2023

klosax commented Aug 26, 2023 •

edited

Loading

ikawrakow commented Aug 26, 2023

klosax commented Aug 26, 2023

klosax commented Aug 26, 2023 •

edited

Loading

IgnacioFDM commented Aug 26, 2023

ggerganov left a comment

ikawrakow commented Aug 26, 2023

ikawrakow commented Aug 26, 2023 •

edited

Loading

klosax commented Aug 26, 2023

klosax commented Aug 26, 2023 •

edited

Loading

slaren commented Aug 26, 2023

ikawrakow commented Aug 26, 2023

klosax commented Aug 26, 2023

ikawrakow commented Aug 26, 2023

ikawrakow commented Aug 26, 2023 •

edited

Loading

klosax commented Aug 26, 2023

Fix HellaSwag #2805

Fix HellaSwag #2805

Conversation

ikawrakow commented Aug 26, 2023 • edited Loading

KerfuffleV2 commented Aug 26, 2023

klosax commented Aug 26, 2023 • edited Loading

ikawrakow commented Aug 26, 2023

klosax commented Aug 26, 2023

klosax commented Aug 26, 2023 • edited Loading

IgnacioFDM commented Aug 26, 2023

ggerganov left a comment

Choose a reason for hiding this comment

ikawrakow commented Aug 26, 2023

ikawrakow commented Aug 26, 2023 • edited Loading

klosax commented Aug 26, 2023

klosax commented Aug 26, 2023 • edited Loading

slaren commented Aug 26, 2023

ikawrakow commented Aug 26, 2023

klosax commented Aug 26, 2023

ikawrakow commented Aug 26, 2023

ikawrakow commented Aug 26, 2023 • edited Loading

klosax commented Aug 26, 2023

ikawrakow commented Aug 26, 2023 •

edited

Loading

klosax commented Aug 26, 2023 •

edited

Loading

klosax commented Aug 26, 2023 •

edited

Loading

ikawrakow commented Aug 26, 2023 •

edited

Loading

klosax commented Aug 26, 2023 •

edited

Loading

ikawrakow commented Aug 26, 2023 •

edited

Loading