From 9d9e152b6d4302ea55f52952f66307833f8605ca Mon Sep 17 00:00:00 2001
From: Gary Linscott
Date: Wed, 22 Mar 2023 08:19:17 -0700
Subject: [PATCH 1/2] Add details on perplexity to README.md

---
 README.md | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7c9a4bf49dfe1..d20fbb05bb4da 100644
--- a/README.md
+++ b/README.md
@@ -240,6 +240,37 @@ or
 
 `shasum -a 256 --ignore-missing -c SHA256SUMS` on macOS
 
+### Perplexity (Measuring model quality)
+
+You can pass `--perplexity` as a command line option to measure perplexity over the given prompt. For more background,
+see https://huggingface.co/docs/transformers/perplexity. However, in general, lower perplexity is better for LLMs.
+
+#### Measurements
+
+https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now. llama.cpp is measuring very well
+compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
+13B at q4_0 beats the 7B f16 model by a significant amount.
+```
+Perplexity - model options
+5.5985 - 13B, q4_0
+5.9565 - 7B, f16
+6.3001 - 7B, q4_1
+6.5949 - 7B, q4_0
+6.5995 - 7B, q4_0, --memory_f16
+```
+
+#### How to run
+
+1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
+2. Run `./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
+3. Output:
+```
+Calculating perplexity over 655 chunks
+24.43 seconds per pass - ETA 4.45 hours
+[1]4.5970,[2]5.1807,[3]6.0382,...
+```
+And after 4.45 hours, you will have the final perplexity.
+
 ### Android
 
 You can easily run `llama.cpp` on Android device with [termux](https://play.google.com/store/apps/details?id=com.termux).
@@ -290,7 +321,6 @@ docker run -v /llama/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models
 
 ## Limitations
 
-- We don't know yet how much the quantization affects the quality of the generated text
 - Probably the token sampling can be improved
 - The Accelerate framework is actually currently unused since I found that for tensor shapes typical for the Decoder,
   there is no benefit compared to the ARM_NEON intrinsics implementation. Of course, it's possible that I simply don't

From c65eff0d14add4e865cd8f3100277a6ee3ff95c4 Mon Sep 17 00:00:00 2001
From: Gary Linscott
Date: Wed, 22 Mar 2023 08:48:36 -0700
Subject: [PATCH 2/2] Add details on dataset/context length

---
 README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/README.md b/README.md
index d20fbb05bb4da..b5a113c91025c 100644
--- a/README.md
+++ b/README.md
@@ -250,6 +250,9 @@ see https://huggingface.co/docs/transformers/perplexity. However, in general, l
 https://github.com/ggerganov/llama.cpp/pull/270 is the unofficial tracking page for now. llama.cpp is measuring very well
 compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
 13B at q4_0 beats the 7B f16 model by a significant amount.
+
+All measurements are done against the wikitext-2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (context length of 512).
+Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).
 ```
 Perplexity - model options
 5.5985 - 13B, q4_0
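
For background on the numbers above: perplexity is the exponential of the average negative log-likelihood per token over the evaluation text, which llama.cpp scores in fixed-size chunks (context length 512 by default, hence the 655 chunks for wikitext-2). A minimal C++ sketch of the final calculation, assuming the per-token probabilities have already been obtained from the model (illustrative only, not the actual llama.cpp code):
```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Perplexity = exp(mean negative log-likelihood per token). Lower is better.
// `probs` holds the model's predicted probability of each actual next token.
double perplexity(const std::vector<double> & probs) {
    double nll = 0.0;                  // accumulated negative log-likelihood
    for (const double p : probs) {
        nll -= std::log(p);            // -log P(token | preceding context)
    }
    return std::exp(nll / (double) probs.size());
}

int main() {
    // Toy example: four tokens predicted with these probabilities.
    const std::vector<double> probs = {0.25, 0.50, 0.10, 0.40};
    std::printf("perplexity = %.4f\n", perplexity(probs));  // ~3.76 for this toy input
    return 0;
}
```
The bracketed values in the example output (`[1]4.5970,[2]5.1807,...`) are the running estimate of this quantity over the chunks processed so far, which is why they settle toward the final perplexity as the run progresses.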