Allow convert.py to convert directly to q8_0 #2753

KerfuffleV2 · 2023-08-23T22:00:55Z

The main thing this pull does is add q8_0 as an output type for convert.py.

bounded_parallel_map also had weird behavior where it would fully consume the iterator passed to it (so it could build items_rev) but this really isn't necessary. I added an argument to allow specifying the factor used to create the executor as well.

The way it uses both ThreadPoolExecutor and ProcessPoolExecutor may look weird, but it's necessary to get acceptable performance when quantizing. My Python q8_0 implementation isn't too fast and it doesn't seem like ThreadPoolExecutor parallelization works there, probably because of the GIL. ThreadPoolExecutor is still needed because ProcessPoolExecutor requires objects to be pickleable and there's some stuff involved in loading that isn't.

On my machine (5700G) using 8 threads takes 2 minutes to convert/quantize a 3B model. Extrapolating from that, it would take about 47min to convert and quantize a 70B. The main use for this would be creating a Q8_0 without having to create an intermediate f16 or f32 and then convert to Q8_0 (also you don't get exactly the same results going f16 -> q8_0).

I think my mini q8_0 implementation works (or is very close). Perplexity results for orca_mini_3b:

f16 to q8_0

[612]10.8966,[613]10.9062,[614]10.9136,[615]10.9014,[616]10.9013

f32 to q8_0

[612]10.8995,[613]10.9092,[614]10.9166,[615]10.9043,[616]10.9043

direct to q8_0

[612]10.8995,[613]10.9091,[614]10.9165,[615]10.9043,[616]10.9042

With the improved version:

[612]10.8987,[613]10.9083,[614]10.9156,[615]10.9034,[616]10.9033

edit: Added a --concurrency commandline option. This can help a lot with quantizing, using --concurrency 16 got the 3B model converted/quantized in about 75sec.

edit: With @cebtenzzre's changes we're down to 27sec converting/quantizing the 3B model.

cebtenzzre · 2023-08-24T03:35:15Z

Why not use os.cpu_count() as the default concurrency?

cebtenzzre · 2023-08-24T03:37:50Z

convert.py

+    return np.fromiter(map(quantize_block_q8_0, blocks), count = n_blocks, dtype = BLOCK_Q8_0)
+
+def quantize_block_q8_0(blk, zero = np.float32(0), one = np.float32(1), onetwentyseven = np.float32(127), zero_chunk = (np.int8(0),) * QK8_0):
+    d = abs(blk).max() / onetwentyseven
+    if d == zero:
+        return (np.float16(d), zero_chunk)
+    return (np.float16(d), (blk * (one / d)).round())


numpy is much more efficient if you give it bigger chunks. This is about 4x as fast.

Suggested change

return np.fromiter(map(quantize_block_q8_0, blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_block_q8_0(blk, zero = np.float32(0), one = np.float32(1), onetwentyseven = np.float32(127), zero_chunk = (np.int8(0),) * QK8_0):

d = abs(blk).max() / onetwentyseven

if d == zero:

return (np.float16(d), zero_chunk)

return (np.float16(d), (blk * (one / d)).round())

return np.fromiter(quantize_blocks_q8_0(blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_blocks_q8_0(blocks):

d = abs(blocks).max(axis=1) / np.float32(127)

with np.errstate(divide='ignore'):

qs = (blocks / d[:, None]).round()

qs[d == 0] = 0

yield from zip(np.float16(d), qs)

Thanks for the feedback, that's definitely faster. Any idea of why the results are different though, and which would be correct?

Quantizing:

[-0.0062561 -0.00460815 -0.00753784 -0.00469971 -0.00100708 0.00210571 0.01055908 -0.00692749 0.00285339 0.00491333 -0.00164795 -0.00028992 0.00174713 -0.00141144 0.00842285 0.00111389 0.00994873 -0.00041771 -0.00872803 0.00263977 0.01422119 0.01312256 -0.00836182 -0.0078125 -0.01806641 -0.00939941 0.00256348 -0.00195312 -0.00119019 0.00415039 0.0090332 -0.00485229]

Original:

(0.0001422, [ -44, -32, -53, -33, -7, 15, 74, -49, 20, 35, -12, -2, 12, -10, 59, 8, 70, -3, -61, 19, 100, 92, -59, -55, -127, -66, 18, -14, -8, 29, 64, -34])

Yours:

(0.0001422, [ -44, -32, -53, -33, -7, 15, 74, -49, 20, 35, -12, -2, 12, -10, 59, 8, 70, -3, -61, 19, 100, 92, -59, -55, -127, -66, 18, -14, -8, 29, 63, -34])

(The penultimate item in the original version is 64 while in yours it's 63.)

List version of the quantized values for easier pasting:

[-0.0062561035, -0.0046081543, -0.007537842, -0.004699707, -0.0010070801, 0.002105713, 0.010559082, -0.0069274902, 0.0028533936, 0.00491333, -0.0016479492, -0.000289917, 0.0017471313, -0.001411438, 0.008422852, 0.0011138916, 0.0099487305, -0.00041770935, -0.008728027, 0.0026397705, 0.014221191, 0.013122559, -0.008361816, -0.0078125, -0.018066406, -0.009399414, 0.0025634766, -0.001953125, -0.0011901855, 0.0041503906, 0.009033203, -0.004852295]

edit: From the examples I've looked at, it seems like the difference always involves -63 and -64 or 63 and 64 (but it doesn't seem consistent which version chooses which). Probably something involving rounding? The difference seems very small. I can show you more examples if it helps.

I can get your version to output 63 if I change (blk * (one / d)) to blk / d. If you do the math in np.float64 you get 63.49996485, so I think my version is slightly more accurate.

I can get behind that. Thanks! Performance is actually competitive with the C++ version now.

KerfuffleV2 · 2023-08-24T06:59:46Z

Why not use os.cpu_count() as the default concurrency?

Setting high concurrency doesn't really do much when not quantizing. Also whatever the concurrency level is set to, the bounded_parallel_map function will generally try to buffer the data from that many tensors. So if using quantization which uses two levels of that function and you had concurrency 16 (which is what os.cpu_count() returns for my system with 8 real cores) you'd have 16 tensors in memory most of the time.

The value was previously hardcoded to 8, so I just went with that. Partially to be conservative also, because I don't think converting to q8_0 will be the typical use case, just something that could be nice to have available.

ggerganov · 2023-08-24T07:49:43Z

I will take a detailed look a bit later - want to keep up to date with the python code since pre-GGUF I've lost track of it and it got too messy and complicated for my taste. It's still not great, but at least I'm more familiar with it now.

One thing I want to see at some point is to extract the lazy and paralllel loading of the tensors into gguf.py so we can easily reuse it and also to simplify the convert.py script. Just making a note here for future PRs - not needed for this one

Fix issue with bounded_parallel_map and greedy consuming iterator Display elapsed time during conversion

Minor improvements to help text Clean up bounded_parallel_map function a bit

ggerganov · 2023-08-25T11:22:28Z

I have mixed feelings about this change - I'll likely never used it, but at the same time it will be present in convert.py and add to the bloat. Can we try to refactor the Python code so that such change can go into a separate script?

KerfuffleV2 · 2023-08-25T14:00:15Z

Can we try to refactor the Python code so that such change can go into a separate script?

The actual quantization code is very, very small:

QK8_0 = 32
BLOCK_Q8_0 = np.dtype([('d', '<f2'), ('qs', 'i1', (QK8_0,))])
def quantize_array_q8_0(arr):
    assert arr.size % QK8_0 == 0 and arr.size != 0, f'Bad array size {arr.size}'
    assert arr.dtype == np.float32, f'Bad array type {arr.dtype}'
    n_blocks = arr.size // QK8_0
    blocks = arr.reshape((n_blocks, QK8_0))
    return np.fromiter(quantize_blocks_q8_0(blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_blocks_q8_0(blocks):
    d = abs(blocks).max(axis = 1) / np.float32(127)
    with np.errstate(divide = 'ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    yield from zip(np.float16(d), qs)

Most of the other changes specific to adding the ability to quantize are just handling the type in the relevant classes. I don't think there's a way to add a type without handling that case in those classes, so there isn't really a way to separate out that part.

The rest of the pull is a fix for the bounded_parallel_map function to actually stream the input iterator instead of consuming the whole thing greedily at the start, showing the elapsed time and allowing the user to specify concurrency on the commandline.

So the "bloat" added by quantizing is very small and you could say there's about the same amount of bloat involved in supporting both f16 and f32 as output types. Why support f32 and f16? If the argument is that f16 is half the size and is more convenient to store, might be preferable for people with low disk space and has very minimal quality loss then one could say the same about supporting q8_0.

If convert.py only supported output to f32 it could be simplified a lot.

By the way, the DeferredPermutedTensor class isn't used anywhere in the llama.cpp repo. That could be removed to lower bloat.

edit: I put some more thought into this and the only thing I can really come up to help with the changes relating to supporting an additional type would be to refactor all that stuff so you could register types and handlers for them and that kind of stuff. That way an external script could import/reuse most of convert.py and just register a new type like q8_0 quantization. However, I think it's very likely this would make convert.py more complicated rather than less and it's the kind of thing that would only really pay off if there were 2-3 more types than currently exist. (Which I assume probably isn't all that likely?)

ggerganov

Ok, let's merge it. We should remove the DeferredPermutedTensor - not sure what was its purpose, but it is no longer used

KerfuffleV2 · 2023-08-25T20:15:22Z

We should remove the DeferredPermutedTensor

I can add that change. I'll hold off on merging this until tomorrow when I'll have some time to go through my changes and see if there's anything else I can do to clean up/simplify them.

KerfuffleV2 · 2023-08-26T15:26:27Z

@ggerganov What do you think of these changes? It's, uhh... A lot more than I was planning to do originally but I think it's basically better and more understandable than the original version that special cased the quantization stuff.

I still have to do more testing to make sure I didn't break the BF16 or SafeTensors stuff (I don't have an easy way to test that right now) but I want to make sure you're okay with this approach.

If we did want to add more quantizations or some other data type that requires special handling these changes also make that easier.

I got type linting working for Python in my IDE and I found a bunch of other stuff that I'll try to fix but probably a bad idea to shove everything into this pull. One example in particular is some of the permute stuff is getting called with the wrong number of arguments. I assume if that code path is ever taken it'll just crash.

ggerganov · 2023-08-26T17:04:14Z

Yes, this looks nice.

I got type linting working for Python in my IDE

I need to learn how to do that :D

KerfuffleV2 · 2023-08-26T19:50:48Z

@ggerganov

I need to learn how to do that :D

If you're using VS Codium (or VS Code) I can help you, it's pretty simple. It's also possible to just run mypy manually and look at the lints.

Yes, this looks nice.

Great! I'm relatively happy with this, although my choices for naming things probably aren't very ideal. I tested with Safetensors and a Safetensors file with all the tensors converted to bfloat16. Seems to work fine. I don't really have the ability to mess with gigantic 70B 32bit models and I've been testing with a 3B. The only thing that would be different is the permute stuff and I don't think this pull should affect that (I specially have my quantization run after everything else) but I didn't test converting a 70B LLaMA2 model.

As far as my changes go, I'd say this is now ready to merge. Are you good with the current state, or any further changes/testing needed? (Of course testing/feedback from anyone else reading this is welcome also.)

ggerganov

We should not "promote" this quantization option for general use as it produces different results from the reference C++ quantization and it can become messy if a lot of people start using it and we end up with 2 different Q8_0 models

KerfuffleV2 · 2023-08-26T20:18:29Z

We should not "promote" this quantization option for general use as it produces different results from the reference C++ quantization

Seems reasonable. Do you want any extra warnings/notes added when using that mode or to the --help text, etc? (Maybe the people who know more about quantization stuff could suggest changes to get the same results as the C++ version? It's actually very close.)

ggerganov · 2023-08-26T20:25:26Z

No need - should be fine. I think it will be difficult to make the results 100% exact. The quality should be the same.
The only potential issue that could arise from this is if we want to compute checksums.

slaren · 2023-08-26T20:27:47Z

Considering that this produces better results than converting to f16 and then quantizing, maybe it should be the recommended way to do it.

KerfuffleV2 · 2023-08-26T20:34:15Z

The only potential issue that could arise from this is if we want to compute checksums.

Perplexity/hellaswag/other quality tests might vary a bit also.

Considering that this produces better results than converting to f16 and then quantizing

I'm not an expert on this stuff, but I think the differences here may be too small to be meaningful really. From my tests though, using convert.py to convert to f16 and then using the C++ quantize to q8_0 gave a slightly better result than converting to f32 then using C++ quantize to q8_0. It wouldn't make sense to conclude that f16 is higher quality that f32 though.

I think just any sort of tiny change, whether it's objectively higher/lower quality can just randomly vary the results a bit in either direction.

ggerganov · 2023-08-26T20:45:47Z

Btw, I see people currently discussing potential quality issues with converting BF16 -> F16 (not related to llama.cpp, just in general). I don't think it is reasonable to expect issues with this conversion given the distribution of the weights.

My quick tests show that with llama-7b-v2 there is no noticeable difference between converted to F16 and F32. Observations based on a few text generations - identical output up to few tens of generated tokens and then start to diverge slightly. Which is to be expected as the probs are of course very slightly different between the two

KerfuffleV2 · 2023-08-26T20:54:13Z

One metric I like for comparing perplexity is the difference between a 13B LLaMA1 model and a 7B: that's fairly noticeable. If I remember correctly, that's around 0.68. 10% of that would be about 0.068 — I'd be surprised if users could notice a difference smaller than that. (The biggest difference between the results I saw testing the different conversions was 0.003.)

* master: (773 commits) server : add `/detokenize` endpoint (ggerganov#2802) convert.py : advanced option (ggerganov#2753) llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814) flake.nix : add rocm support and cleanup (ggerganov#2808) llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817) main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528) llama : use std::abs in llama_sample_tail_free (ggerganov#2800) k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811) Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807) Fix HellaSwag (ggerganov#2805) flake : build llama.cpp on Intel with nix (ggerganov#2795) Handle null rope scaling value (ggerganov#2793) Fix spm whitespaces (ggerganov#2806) examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804) llama : fix struct decl (ggerganov#2790) Faster perplexity computation (ggerganov#2786) llama : add llama_beam_search() (ggerganov#2267) convert.py : Get rope scale from HuggingFace models (ggerganov#2772) llama-bench : add model sizes (ggerganov#2771) convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773) ...

* Allow convert.py to convert to q8_0 Fix issue with bounded_parallel_map and greedy consuming iterator Display elapsed time during conversion * Add --concurrency option Minor improvements to help text Clean up bounded_parallel_map function a bit * Massive speed improvement thanks to Cebtenzzre * Refactor types

KerfuffleV2 added the script Script related label Aug 23, 2023

cebtenzzre reviewed Aug 24, 2023

View reviewed changes

ggerganov self-requested a review August 24, 2023 07:49

KerfuffleV2 added 3 commits August 24, 2023 11:53

Allow convert.py to convert to q8_0

f54daa0

Fix issue with bounded_parallel_map and greedy consuming iterator Display elapsed time during conversion

Add --concurrency option

3efcbb8

Minor improvements to help text Clean up bounded_parallel_map function a bit

Massive speed improvement thanks to Cebtenzzre

8ee186c

KerfuffleV2 force-pushed the feat-convert-to-q8_0 branch from 655990c to 8ee186c Compare August 24, 2023 18:13

ggerganov approved these changes Aug 25, 2023

View reviewed changes

Refactor types

5f23d41

KerfuffleV2 mentioned this pull request Aug 26, 2023

convert.py : handle special tokens #2820

Closed

ggerganov approved these changes Aug 26, 2023

View reviewed changes

ggerganov merged commit 730d9c6 into ggerganov:master Aug 26, 2023

KerfuffleV2 deleted the feat-convert-to-q8_0 branch November 17, 2023 03:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow convert.py to convert directly to q8_0 #2753

Allow convert.py to convert directly to q8_0 #2753

KerfuffleV2 commented Aug 23, 2023 •

edited

Loading

cebtenzzre commented Aug 24, 2023

cebtenzzre Aug 24, 2023 •

edited

Loading

KerfuffleV2 Aug 24, 2023 •

edited

Loading

cebtenzzre Aug 24, 2023

KerfuffleV2 Aug 24, 2023

KerfuffleV2 commented Aug 24, 2023

ggerganov commented Aug 24, 2023

ggerganov commented Aug 25, 2023

KerfuffleV2 commented Aug 25, 2023 •

edited

Loading

ggerganov left a comment

KerfuffleV2 commented Aug 25, 2023

KerfuffleV2 commented Aug 26, 2023

ggerganov commented Aug 26, 2023

KerfuffleV2 commented Aug 26, 2023

ggerganov left a comment

KerfuffleV2 commented Aug 26, 2023

ggerganov commented Aug 26, 2023

slaren commented Aug 26, 2023

KerfuffleV2 commented Aug 26, 2023

ggerganov commented Aug 26, 2023

KerfuffleV2 commented Aug 26, 2023

Allow convert.py to convert directly to q8_0 #2753

Allow convert.py to convert directly to q8_0 #2753

Conversation

KerfuffleV2 commented Aug 23, 2023 • edited Loading

f16 to q8_0

f32 to q8_0

direct to q8_0

cebtenzzre commented Aug 24, 2023

cebtenzzre Aug 24, 2023 • edited Loading

Choose a reason for hiding this comment

KerfuffleV2 Aug 24, 2023 • edited Loading

Choose a reason for hiding this comment

cebtenzzre Aug 24, 2023

Choose a reason for hiding this comment

KerfuffleV2 Aug 24, 2023

Choose a reason for hiding this comment

KerfuffleV2 commented Aug 24, 2023

ggerganov commented Aug 24, 2023

ggerganov commented Aug 25, 2023

KerfuffleV2 commented Aug 25, 2023 • edited Loading

ggerganov left a comment

Choose a reason for hiding this comment

KerfuffleV2 commented Aug 25, 2023

KerfuffleV2 commented Aug 26, 2023

ggerganov commented Aug 26, 2023

KerfuffleV2 commented Aug 26, 2023

ggerganov left a comment

Choose a reason for hiding this comment

KerfuffleV2 commented Aug 26, 2023

ggerganov commented Aug 26, 2023

slaren commented Aug 26, 2023

KerfuffleV2 commented Aug 26, 2023

ggerganov commented Aug 26, 2023

KerfuffleV2 commented Aug 26, 2023

KerfuffleV2 commented Aug 23, 2023 •

edited

Loading

cebtenzzre Aug 24, 2023 •

edited

Loading

KerfuffleV2 Aug 24, 2023 •

edited

Loading

KerfuffleV2 commented Aug 25, 2023 •

edited

Loading