
Allow convert.py to convert directly to q8_0 #2753

Merged 4 commits into ggerganov:master on Aug 26, 2023

Conversation

@KerfuffleV2 (Collaborator) commented Aug 23, 2023

The main thing this pull does is add q8_0 as an output type for convert.py.

bounded_parallel_map also had weird behavior where it would fully consume the iterator passed to it (so it could build items_rev), but this really isn't necessary. I also added an argument to allow specifying the factor used to create the executor.

The way it uses both ThreadPoolExecutor and ProcessPoolExecutor may look weird, but it's necessary to get acceptable performance when quantizing. My Python q8_0 implementation isn't too fast, and ThreadPoolExecutor parallelization doesn't seem to help there, probably because of the GIL. ThreadPoolExecutor is still needed because ProcessPoolExecutor requires objects to be picklable, and some of what's involved in loading isn't.
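
Very roughly, the pattern looks like this (a hypothetical sketch with made-up helper names, not the actual convert.py code): the thread pool handles the non-picklable loader objects, and the process pool does the CPU-bound quantization so the GIL doesn't serialize it.

import numpy as np
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fake_load(tensor):
    # Stand-in for lazy tensor loading; in convert.py this state isn't
    # picklable, which is why loading stays on threads.
    return tensor.astype(np.float32)

def fake_quantize(arr):
    # Stand-in for the CPU-bound q8_0 math; plain numpy arrays pickle fine,
    # so this part can run in worker processes.
    return np.round(arr * (np.float32(127) / np.abs(arr).max())).astype(np.int8)

def convert_sketch(tensors, concurrency = 8):
    with ThreadPoolExecutor(max_workers = concurrency) as loader, \
            ProcessPoolExecutor(max_workers = concurrency) as quantizer:
        yield from quantizer.map(fake_quantize, loader.map(fake_load, tensors))

if __name__ == '__main__':
    data = [np.random.rand(4096) for _ in range(16)]
    print(sum(q.size for q in convert_sketch(data)))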

On my machine (5700G), using 8 threads takes 2 minutes to convert/quantize a 3B model. Extrapolating from that, it would take about 47 minutes to convert and quantize a 70B. The main use for this is creating a Q8_0 without having to create an intermediate f16 or f32 file and then converting that to Q8_0 (also, you don't get exactly the same results going f16 -> q8_0).

I think my mini q8_0 implementation works (or is very close). Perplexity results for orca_mini_3b:

f16 to q8_0

[612]10.8966,[613]10.9062,[614]10.9136,[615]10.9014,[616]10.9013

f32 to q8_0

[612]10.8995,[613]10.9092,[614]10.9166,[615]10.9043,[616]10.9043

direct to q8_0

[612]10.8995,[613]10.9091,[614]10.9165,[615]10.9043,[616]10.9042

With the improved version:

[612]10.8987,[613]10.9083,[614]10.9156,[615]10.9034,[616]10.9033


edit: Added a --concurrency commandline option. This can help a lot with quantizing; using --concurrency 16 got the 3B model converted/quantized in about 75 seconds.

edit: With @cebtenzzre's changes we're down to 27 seconds converting/quantizing the 3B model.

@KerfuffleV2 added the script label Aug 23, 2023
@cebtenzzre (Collaborator)

Why not use os.cpu_count() as the default concurrency?

convert.py (outdated) - comment on lines 762 to 768:
return np.fromiter(map(quantize_block_q8_0, blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_block_q8_0(blk, zero = np.float32(0), one = np.float32(1), onetwentyseven = np.float32(127), zero_chunk = (np.int8(0),) * QK8_0):
    d = abs(blk).max() / onetwentyseven
    if d == zero:
        return (np.float16(d), zero_chunk)
    return (np.float16(d), (blk * (one / d)).round())
@cebtenzzre (Collaborator) commented Aug 24, 2023

numpy is much more efficient if you give it bigger chunks. This is about 4x as fast.

Suggested change:

Replace:

return np.fromiter(map(quantize_block_q8_0, blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_block_q8_0(blk, zero = np.float32(0), one = np.float32(1), onetwentyseven = np.float32(127), zero_chunk = (np.int8(0),) * QK8_0):
    d = abs(blk).max() / onetwentyseven
    if d == zero:
        return (np.float16(d), zero_chunk)
    return (np.float16(d), (blk * (one / d)).round())

with:

return np.fromiter(quantize_blocks_q8_0(blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_blocks_q8_0(blocks):
    d = abs(blocks).max(axis=1) / np.float32(127)
    with np.errstate(divide='ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    yield from zip(np.float16(d), qs)

@KerfuffleV2 (Collaborator, Author) commented Aug 24, 2023

Thanks for the feedback, that's definitely faster. Any idea why the results are different, though, and which would be correct?

Quantizing:

[-0.0062561  -0.00460815 -0.00753784 -0.00469971 -0.00100708  0.00210571
  0.01055908 -0.00692749  0.00285339  0.00491333 -0.00164795 -0.00028992
  0.00174713 -0.00141144  0.00842285  0.00111389  0.00994873 -0.00041771
 -0.00872803  0.00263977  0.01422119  0.01312256 -0.00836182 -0.0078125
 -0.01806641 -0.00939941  0.00256348 -0.00195312 -0.00119019  0.00415039
  0.0090332  -0.00485229]

Original:

(0.0001422, [ -44,  -32,  -53,  -33,   -7,   15,   74,  -49,   20,   35,  -12,   -2,   12,  -10,   59,    8,   70,   -3,  -61,   19,  100,   92,  -59,  -55, -127,  -66,   18,  -14,   -8,   29,   64,  -34])

Yours:

(0.0001422, [ -44,  -32,  -53,  -33,   -7,   15,   74,  -49,   20,   35,  -12,   -2,   12,  -10,   59,    8,   70,   -3,  -61,   19,  100,   92,  -59,  -55, -127,  -66,   18,  -14,   -8,   29,   63,  -34])

(The penultimate item in the original version is 64 while in yours it's 63.)

List version of the quantized values for easier pasting:

[-0.0062561035, -0.0046081543, -0.007537842, -0.004699707, -0.0010070801, 0.002105713, 0.010559082, -0.0069274902, 0.0028533936, 0.00491333, -0.0016479492, -0.000289917, 0.0017471313, -0.001411438, 0.008422852, 0.0011138916, 0.0099487305, -0.00041770935, -0.008728027, 0.0026397705, 0.014221191, 0.013122559, -0.008361816, -0.0078125, -0.018066406, -0.009399414, 0.0025634766, -0.001953125, -0.0011901855, 0.0041503906, 0.009033203, -0.004852295]

edit: From the examples I've looked at, it seems like the difference always involves -63 and -64 or 63 and 64 (but it doesn't seem consistent which version chooses which). Probably something involving rounding? The difference seems very small. I can show you more examples if it helps.

@cebtenzzre (Collaborator)

I can get your version to output 63 if I change (blk * (one / d)) to blk / d. If you do the math in np.float64 you get 63.49996485, so I think my version is slightly more accurate.
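
For illustration, here is one way to compare the two expressions against a float64 reference over random data (a hypothetical sketch, not code from this PR):

import numpy as np

rng = np.random.default_rng(1234)
x = rng.standard_normal(1_000_000).astype(np.float32)
d = np.abs(x).max() / np.float32(127)

via_reciprocal = (x * (np.float32(1) / d)).round()
via_division = (x / d).round()
reference = (x.astype(np.float64) / np.float64(d)).round()

# Count values where each float32 path rounds differently from the
# float64 reference.
print('x * (1/d) mismatches:', int((via_reciprocal != reference).sum()))
print('x / d     mismatches:', int((via_division != reference).sum()))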

@KerfuffleV2 (Collaborator, Author)

I can get behind that. Thanks! Performance is actually competitive with the C++ version now.

@KerfuffleV2 (Collaborator, Author)

Why not use os.cpu_count() as the default concurrency?

Setting high concurrency doesn't really do much when not quantizing. Also, whatever the concurrency level is set to, the bounded_parallel_map function will generally try to buffer the data from that many tensors. So with quantization, which uses two levels of that function, and a concurrency of 16 (which is what os.cpu_count() returns on my system with 8 physical cores), you'd have 16 tensors in memory most of the time.

The value was previously hardcoded to 8, so I just went with that. Partly it was also to be conservative, because I don't think converting to q8_0 will be the typical use case, just something that could be nice to have available.
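
For reference, the general shape of a bounded parallel map that streams its input and keeps only about `concurrency` items in flight looks something like this (an illustrative sketch only, not the actual convert.py function or its signature):

import itertools
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def bounded_parallel_map_sketch(func, iterable, concurrency):
    # Keep at most `concurrency` items in flight; pull from the iterator
    # lazily instead of consuming it all up front.
    it = iter(iterable)
    with ThreadPoolExecutor(max_workers = concurrency) as executor:
        futures = {executor.submit(func, item)
                   for item in itertools.islice(it, concurrency)}
        while futures:
            done, futures = wait(futures, return_when = FIRST_COMPLETED)
            for fut in done:
                yield fut.result()
                for item in itertools.islice(it, 1):
                    futures.add(executor.submit(func, item))
            # Note: results come back in completion order here, not input order.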

@ggerganov (Owner)

I will take a detailed look a bit later - I want to keep up to date with the Python code, since pre-GGUF I had lost track of it and it got too messy and complicated for my taste. It's still not great, but at least I'm more familiar with it now.

One thing I want to see at some point is to extract the lazy and parallel loading of the tensors into gguf.py so we can easily reuse it and also simplify the convert.py script. Just making a note here for future PRs - not needed for this one.

@ggerganov self-requested a review August 24, 2023 07:49
Fix issue with bounded_parallel_map and greedy consuming iterator

Display elapsed time during conversion
Minor improvements to help text

Clean up bounded_parallel_map function a bit
@KerfuffleV2 force-pushed the feat-convert-to-q8_0 branch from 655990c to 8ee186c on August 24, 2023 18:13
@ggerganov (Owner)

I have mixed feelings about this change - I'll likely never use it, but at the same time it will be present in convert.py and add to the bloat. Can we try to refactor the Python code so that such a change can go into a separate script?

@KerfuffleV2 (Collaborator, Author) commented Aug 25, 2023

Can we try to refactor the Python code so that such change can go into a separate script?

The actual quantization code is very, very small:

QK8_0 = 32
BLOCK_Q8_0 = np.dtype([('d', '<f2'), ('qs', 'i1', (QK8_0,))])
def quantize_array_q8_0(arr):
    assert arr.size % QK8_0 == 0 and arr.size != 0, f'Bad array size {arr.size}'
    assert arr.dtype == np.float32, f'Bad array type {arr.dtype}'
    n_blocks = arr.size // QK8_0
    blocks = arr.reshape((n_blocks, QK8_0))
    return np.fromiter(quantize_blocks_q8_0(blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_blocks_q8_0(blocks):
    d = abs(blocks).max(axis = 1) / np.float32(127)
    with np.errstate(divide = 'ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    yield from zip(np.float16(d), qs)
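
As a quick sanity check of the snippet above (assuming numpy is imported as np, the definitions above are in scope, and with hypothetical test values):

arr = np.random.rand(4 * QK8_0).astype(np.float32)
blocks = quantize_array_q8_0(arr)
print(blocks.shape, blocks.dtype)  # (4,) and the BLOCK_Q8_0 structured dtype
assert blocks['qs'].min() >= -127 and blocks['qs'].max() <= 127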

Most of the other changes specific to adding the ability to quantize are just handling the type in the relevant classes. I don't think there's a way to add a type without handling that case in those classes, so there isn't really a way to separate out that part.

The rest of the pull is a fix for the bounded_parallel_map function to actually stream the input iterator instead of greedily consuming the whole thing at the start, plus displaying the elapsed time and allowing the user to specify concurrency on the command line.

So the "bloat" added by quantizing is very small and you could say there's about the same amount of bloat involved in supporting both f16 and f32 as output types. Why support f32 and f16? If the argument is that f16 is half the size and is more convenient to store, might be preferable for people with low disk space and has very minimal quality loss then one could say the same about supporting q8_0.

If convert.py only supported output to f32 it could be simplified a lot.

By the way, the DeferredPermutedTensor class isn't used anywhere in the llama.cpp repo. That could be removed to lower bloat.


edit: I put some more thought into this, and the only thing I can really come up with to help with the changes related to supporting an additional type would be to refactor things so you could register types and handlers for them. That way an external script could import/reuse most of convert.py and just register a new type like q8_0 quantization. However, I think it's very likely this would make convert.py more complicated rather than less, and it's the kind of thing that would only really pay off if there were 2-3 more types than currently exist. (Which I assume probably isn't all that likely?)
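
To make that idea a bit more concrete, the kind of registry being described could look roughly like this (a purely hypothetical sketch; none of these names exist in convert.py):

from typing import Callable, Dict

import numpy as np

# Maps an output type name to a function that converts a float32 tensor.
OUTPUT_TYPE_HANDLERS: Dict[str, Callable[[np.ndarray], np.ndarray]] = {}

def register_output_type(name: str):
    def wrap(func: Callable[[np.ndarray], np.ndarray]):
        OUTPUT_TYPE_HANDLERS[name] = func
        return func
    return wrap

@register_output_type('f16')
def convert_f16(arr: np.ndarray) -> np.ndarray:
    return arr.astype(np.float16)

# An external script could then register its own handler, e.g.
#   @register_output_type('q8_0')
#   def convert_q8_0(arr): ...
# without touching the core converter.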

@ggerganov (Owner) left a comment

Ok, let's merge it. We should remove the DeferredPermutedTensor - not sure what its purpose was, but it is no longer used.

@KerfuffleV2 (Collaborator, Author)

We should remove the DeferredPermutedTensor

I can add that change. I'll hold off on merging this until tomorrow when I'll have some time to go through my changes and see if there's anything else I can do to clean up/simplify them.

@KerfuffleV2 (Collaborator, Author)

@ggerganov What do you think of these changes? It's, uhh... a lot more than I was planning to do originally, but I think it's basically better and more understandable than the original version that special-cased the quantization stuff.

I still have to do more testing to make sure I didn't break the BF16 or SafeTensors stuff (I don't have an easy way to test that right now) but I want to make sure you're okay with this approach.

If we did want to add more quantizations or some other data type that requires special handling these changes also make that easier.

I got type linting working for Python in my IDE and found a bunch of other stuff that I'll try to fix, but it's probably a bad idea to shove everything into this pull. One example in particular: some of the permute stuff is getting called with the wrong number of arguments. I assume if that code path is ever taken it'll just crash.

@ggerganov (Owner)

Yes, this looks nice.

I got type linting working for Python in my IDE

I need to learn how to do that :D

@KerfuffleV2 (Collaborator, Author)

@ggerganov

I need to learn how to do that :D

If you're using VS Codium (or VS Code) I can help you; it's pretty simple. It's also possible to just run mypy manually and look at the lints.

Yes, this looks nice.

Great! I'm relatively happy with this, although my choices for naming things probably aren't ideal. I tested with SafeTensors and with a SafeTensors file where all the tensors were converted to bfloat16; both seem to work fine. I don't really have the ability to mess with gigantic 70B 32-bit models and I've been testing with a 3B. The only thing that would be different is the permute stuff, and I don't think this pull should affect that (I specifically have my quantization run after everything else), but I didn't test converting a 70B LLaMA2 model.

As far as my changes go, I'd say this is now ready to merge. Are you good with the current state, or any further changes/testing needed? (Of course testing/feedback from anyone else reading this is welcome also.)

@ggerganov (Owner) left a comment

We should not "promote" this quantization option for general use as it produces different results from the reference C++ quantization and it can become messy if a lot of people start using it and we end up with 2 different Q8_0 models

@ggerganov merged commit 730d9c6 into ggerganov:master Aug 26, 2023
@KerfuffleV2 (Collaborator, Author)

We should not "promote" this quantization option for general use as it produces different results from the reference C++ quantization

Seems reasonable. Do you want any extra warnings/notes added when using that mode or to the --help text, etc? (Maybe the people who know more about quantization stuff could suggest changes to get the same results as the C++ version? It's actually very close.)

@ggerganov (Owner)

No need - should be fine. I think it will be difficult to make the results 100% exact. The quality should be the same.
The only potential issue that could arise from this is if we want to compute checksums.

@slaren (Collaborator) commented Aug 26, 2023

Considering that this produces better results than converting to f16 and then quantizing, maybe it should be the recommended way to do it.

@KerfuffleV2 (Collaborator, Author)

The only potential issue that could arise from this is if we want to compute checksums.

Perplexity/hellaswag/other quality tests might vary a bit also.

Considering that this produces better results than converting to f16 and then quantizing

I'm not an expert on this stuff, but I think the differences here may be too small to really be meaningful. From my tests, though, using convert.py to convert to f16 and then using the C++ quantize to q8_0 gave a slightly better result than converting to f32 and then using the C++ quantize to q8_0. It wouldn't make sense to conclude that f16 is higher quality than f32, though.

I think any sort of tiny change, whether it's objectively higher or lower quality, can just randomly vary the results a bit in either direction.

@ggerganov (Owner)

Btw, I see people currently discussing potential quality issues with converting BF16 -> F16 (not related to llama.cpp, just in general). I don't think it is reasonable to expect issues with this conversion given the distribution of the weights.

My quick tests show that with llama-7b-v2 there is no noticeable difference between converting to F16 and F32. These observations are based on a few text generations: the output is identical for up to a few tens of generated tokens and then starts to diverge slightly, which is to be expected since the probs are of course very slightly different between the two.

@KerfuffleV2 (Collaborator, Author)

One metric I like for comparing perplexity is the difference between a 13B LLaMA1 model and a 7B: that's fairly noticeable. If I remember correctly, that's around 0.68. 10% of that would be about 0.068 — I'd be surprised if users could notice a difference smaller than that. (The biggest difference between the results I saw testing the different conversions was 0.003.)

mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* Allow convert.py to convert to q8_0

Fix issue with bounded_parallel_map and greedy consuming iterator

Display elapsed time during conversion

* Add --concurrency option

Minor improvements to help text

Clean up bounded_parallel_map function a bit

* Massive speed improvement thanks to Cebtenzzre

* Refactor types
@KerfuffleV2 deleted the feat-convert-to-q8_0 branch November 17, 2023 03:13