
Fix HellaSwag #2805

Merged: ikawrakow merged 1 commit into master from ik/fix_hellaswag on Aug 26, 2023

Conversation

ikawrakow (Contributor) commented Aug 26, 2023

The HellaSwag scores are 4-5 percentage points lower than what is posted in #2321. The change occurred after #2398 was merged, and the difference is due to changes in how the tokenizer handles spaces. In a HellaSwag task one evaluates 4 possible endings after a given context. These endings begin with a space, which leads to a different tokenization on current master than what we had before the GGUF changes in #2398. I tried adding the space to the context rather than to the endings, but this did not improve the score.

To avoid space-handling issues, this PR tokenizes context+ending together and then evaluates only the tokens that come after the context tokens. This improves the HellaSwag score significantly. It is not exactly the same as before #2398, but it comes close: for LLaMA-v2-7B it is 76.75 (PR) vs 77.25 (before GGUF) after 400 tasks, 74.9 vs 75.2 after 1000 tasks, and 75.35 vs 75.4 after 2000 tasks.
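As a rough illustration of that scoring scheme, here is a minimal C++ sketch; the tokenize and token_logprob callbacks are hypothetical stand-ins for the tokenizer and the model, not the actual llama.cpp API:

    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the tokenizer and the model (not the llama.cpp API).
    using tokenize_fn = std::function<std::vector<int>(const std::string &)>;
    using logprob_fn  = std::function<double(const std::vector<int> &, size_t)>; // log p(tok[i] | tok[0..i-1])

    // Score one HellaSwag ending: tokenize context+ending as a single string and sum
    // the log-probabilities of only the tokens that come after the context tokens.
    // Because the ending is never tokenized on its own, the tokenizer's handling of
    // its leading space no longer affects the result.
    double score_ending(const std::string & context, const std::string & ending,
                        const tokenize_fn & tokenize, const logprob_fn & token_logprob) {
        const size_t n_ctx = tokenize(context).size();
        const std::vector<int> full = tokenize(context + ending);

        double sum = 0.0;
        for (size_t i = n_ctx; i < full.size(); ++i) {
            sum += token_logprob(full, i);
        }
        return sum; // the ending with the highest sum is the model's pick
    }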

Update: Final HellaSwag scores after 10042 tasks for LLaMA-v2-7B:

Version        Score
Master         71.4
This PR        75.65
Before GGUF    75.82

Update 2: The culprit is this line:

std::string result = "\xe2\x96\x81";

If I comment it out, I recover pre-GGUF HellaSwag scores. If I replace it with

    std::string result;
    if (!text.empty() && text.front() != ' ') result = "\xe2\x96\x81";

I get the same score as this PR, without the change in this PR.

So, the question is: why was it necessary to unconditionally add an escaped whitespace to the string being tokenized?
@ggerganov @goerch

KerfuffleV2 (Collaborator):

why was it necessary to unconditionally add an escaped whitespace to the string being tokenized?

Unconditionally doesn't sound right, but there's a reason to add it if it's not there.

 if (!text.empty() && text.front() != ' ') result = "\xe2\x96\x81";

If just leaving a ' ' is good enough, then can't you just do result = " "? I'm pretty sure the tokenizing code just does a search-and-replace of "\xe2\x96\x81" with " ", so there shouldn't be a practical difference.
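For reference, a small sketch of that kind of replacement (an illustration only, not the actual llama.cpp detokenizer), assuming the SentencePiece whitespace symbol U+2581 in its UTF-8 form "\xe2\x96\x81":

    #include <iostream>
    #include <string>

    // Replace every occurrence of the SentencePiece whitespace symbol with a plain space.
    std::string unescape_whitespace(std::string text) {
        static const std::string symbol = "\xe2\x96\x81"; // U+2581 in UTF-8
        for (size_t pos = text.find(symbol); pos != std::string::npos; pos = text.find(symbol, pos + 1)) {
            text.replace(pos, symbol.size(), " ");
        }
        return text;
    }

    int main() {
        // Prints " Hello world"
        std::cout << unescape_whitespace("\xe2\x96\x81Hello\xe2\x96\x81world") << "\n";
    }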

klosax (Contributor) commented Aug 26, 2023

PR #2806 restores the pre-GGUF accuracy.

ikawrakow (Author):

PR #2806 restores the pre-GGUF accuracy.

I can confirm this to be true, so strictly speaking one could just close this PR. Nevertheless, I still think it is worth having this change in the HellaSwag calculation as it makes it more robust to potential future tokenization changes.

klosax (Contributor) commented Aug 26, 2023

Nevertheless, I still think it is worth having this change in the HellaSwag calculation as it makes it more robust to potential future tokenization changes.

The reason I separated the context and the endings was to know exactly where the tokenized context ends.

The decision to put the separating space in the endings was made because it is more compatible with SentencePiece, which works much better with a space prepended to each word.

ikawrakow requested a review from ggerganov on August 26, 2023 at 12:11
klosax (Contributor) commented Aug 26, 2023

Have you tested that this works with the BPE tokenizer, i.e. Falcon-7B? The F16 model should give 76.75 on 400 tasks.

IgnacioFDM (Contributor):

We should still expect the scores to be a couple of points lower than what you'd get with lm-evaluation-harness HellaSwag (with an AutoGPTQ model), right? Any insight into why they differ?

ggerganov (Owner) left a comment:

Could the different eps value explain the remaining small difference that you observe after the fixes in the tokenizer?

Pre-GGUF used eps 5e-6 and GGUF uses 1e-5
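
For context, and assuming the eps in question is the RMS-norm epsilon, here is a generic sketch of where it enters the computation (not the actual ggml kernel):

    #include <cmath>
    #include <vector>

    // Generic RMS normalization: eps is added to the mean of squares before the
    // square root, so eps = 5e-6 vs eps = 1e-5 scales the activations very slightly
    // differently, which can nudge perplexity and HellaSwag scores a little.
    void rms_norm(std::vector<float> & x, float eps) {
        if (x.empty()) return;
        double sum_sq = 0.0;
        for (float v : x) sum_sq += (double) v * v;
        const float scale = 1.0f / std::sqrt((float) (sum_sq / x.size()) + eps);
        for (float & v : x) v *= scale;
    }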

ikawrakow (Author):

Could the different eps value explain the remaining small difference that you observe after the fixes in the tokenizer?

Pre-GGUF used eps 5e-6 and GGUF uses 1e-5

It could be. Pre-GGUF I ran with both eps = 5e-6 and eps = 1e-5, but for the long run with 10042 tasks I kept only the HellaSwag output, so I'm not sure which eps it was done with.

ikawrakow merged commit 771551a into master on Aug 26, 2023
ikawrakow deleted the ik/fix_hellaswag branch on August 26, 2023 at 13:48
ikawrakow (Author) commented Aug 26, 2023

Have you tested that this works with the BPE tokenizer, i.e. Falcon-7B? The F16 model should give 76.75 on 400 tasks.

I'm not able to test it. I downloaded the PyTorch files from HF and ran convert-falcon-hf-to-gguf.py, which reported a successful conversion. But when I try HellaSwag with the resulting .gguf file, I get scores consistent with random chance, and perplexity is in the range of 400+. This happens on master and on this branch. Not sure what I'm doing wrong.

klosax (Contributor) commented Aug 26, 2023

Are you using the official model? https://huggingface.co/tiiuae/falcon-7b/tree/main

klosax (Contributor) commented Aug 26, 2023

Try the -nommq parameter if you are using cuBLAS. The logprobs are probably generating NaNs, and there are known problems with offloading.

slaren (Collaborator) commented Aug 26, 2023

CUDA currently doesn't work with Falcon when offloading the KV cache. IIRC, the maximum number of layers that can be offloaded is 33 for 7B and 60 for 40B.

ikawrakow (Author):

Are you using the official model? https://huggingface.co/tiiuae/falcon-7b/tree/main

Yes.

Try the -nommq parameter if you are using cuBLAS. The logprobs are probably generating NaNs.

I'm running the fp16 model, so -nommq is not involved. There are no NaNs, but the perplexity is very high (400-600). Something is going wrong during conversion and I don't know what. Is there a ready-made .gguf model somewhere that I can download?

klosax (Contributor) commented Aug 26, 2023

I can't find an F16, but here are Q8_0 and Q4_0; I have not tested whether they work:
https://huggingface.co/NikolayKozloff/falcon-7b-GGUF/tree/main

ikawrakow (Author):

Thank you, @slaren. This fixes it. I had missed this part (or rather, it had slipped my mind).

ikawrakow (Author) commented Aug 26, 2023

Yes, I now get HellaSwag = 76.75 for Falcon-7B after 400 tasks.

klosax (Contributor) commented Aug 26, 2023

Yes, I now get HellaSwag = 76.75 for Falcon-7B after 400 tasks.

Great!

mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
ikawrakow restored the ik/fix_hellaswag branch on January 16, 2024 at 16:50