Allow convert.py to convert directly to q8_0 #2753
Conversation
convert.py (Outdated)

```python
return np.fromiter(map(quantize_block_q8_0, blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_block_q8_0(blk, zero = np.float32(0), one = np.float32(1), onetwentyseven = np.float32(127), zero_chunk = (np.int8(0),) * QK8_0):
    d = abs(blk).max() / onetwentyseven
    if d == zero:
        return (np.float16(d), zero_chunk)
    return (np.float16(d), (blk * (one / d)).round())
```
Why not use the following? numpy is much more efficient if you give it bigger chunks. This is about 4x as fast.

Suggested change:

```python
return np.fromiter(quantize_blocks_q8_0(blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_blocks_q8_0(blocks):
    d = abs(blocks).max(axis=1) / np.float32(127)
    with np.errstate(divide='ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    yield from zip(np.float16(d), qs)
```
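For what it's worth, a rough, self-contained timing sketch of the two approaches on synthetic data looks like the following (the block count is arbitrary, the exact speedup will vary by machine, and it assumes a NumPy recent enough for `np.fromiter` to accept a structured dtype, which the PR's code already relies on):

```python
import timeit

import numpy as np

QK8_0 = 32
BLOCK_Q8_0 = np.dtype([('d', '<f2'), ('qs', 'i1', (QK8_0,))])

# Synthetic float32 data: 16384 blocks of 32 values each.
rng = np.random.default_rng(0)
blocks = rng.standard_normal((1 << 14, QK8_0)).astype(np.float32)
n_blocks = len(blocks)

def quantize_block_q8_0(blk, zero=np.float32(0), one=np.float32(1),
                        onetwentyseven=np.float32(127),
                        zero_chunk=(np.int8(0),) * QK8_0):
    # Per-block version: one Python call (and several tiny numpy ops) per block.
    d = abs(blk).max() / onetwentyseven
    if d == zero:
        return (np.float16(d), zero_chunk)
    return (np.float16(d), (blk * (one / d)).round())

def quantize_blocks_q8_0(blocks):
    # Vectorized version: scales and quotients for all blocks are computed
    # in a handful of whole-array operations.
    d = abs(blocks).max(axis=1) / np.float32(127)
    with np.errstate(divide='ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    yield from zip(np.float16(d), qs)

def run_per_block():
    return np.fromiter(map(quantize_block_q8_0, blocks), count=n_blocks, dtype=BLOCK_Q8_0)

def run_vectorized():
    return np.fromiter(quantize_blocks_q8_0(blocks), count=n_blocks, dtype=BLOCK_Q8_0)

print('per-block :', timeit.timeit(run_per_block, number=3))
print('vectorized:', timeit.timeit(run_vectorized, number=3))
```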
Thanks for the feedback, that's definitely faster. Any idea why the results are different, though, and which would be correct?
Quantizing:
[-0.0062561 -0.00460815 -0.00753784 -0.00469971 -0.00100708 0.00210571
0.01055908 -0.00692749 0.00285339 0.00491333 -0.00164795 -0.00028992
0.00174713 -0.00141144 0.00842285 0.00111389 0.00994873 -0.00041771
-0.00872803 0.00263977 0.01422119 0.01312256 -0.00836182 -0.0078125
-0.01806641 -0.00939941 0.00256348 -0.00195312 -0.00119019 0.00415039
0.0090332 -0.00485229]
Original:
(0.0001422, [ -44, -32, -53, -33, -7, 15, 74, -49, 20, 35, -12, -2, 12, -10, 59, 8, 70, -3, -61, 19, 100, 92, -59, -55, -127, -66, 18, -14, -8, 29, 64, -34])
Yours:
(0.0001422, [ -44, -32, -53, -33, -7, 15, 74, -49, 20, 35, -12, -2, 12, -10, 59, 8, 70, -3, -61, 19, 100, 92, -59, -55, -127, -66, 18, -14, -8, 29, 63, -34])
(The penultimate item in the original version is 64 while in yours it's 63.)
List version of the quantized values for easier pasting:
[-0.0062561035, -0.0046081543, -0.007537842, -0.004699707, -0.0010070801, 0.002105713, 0.010559082, -0.0069274902, 0.0028533936, 0.00491333, -0.0016479492, -0.000289917, 0.0017471313, -0.001411438, 0.008422852, 0.0011138916, 0.0099487305, -0.00041770935, -0.008728027, 0.0026397705, 0.014221191, 0.013122559, -0.008361816, -0.0078125, -0.018066406, -0.009399414, 0.0025634766, -0.001953125, -0.0011901855, 0.0041503906, 0.009033203, -0.004852295]
edit: From the examples I've looked at, it seems like the difference always involves -63 and -64, or 63 and 64 (but it doesn't seem consistent which version chooses which). Probably something involving rounding? The difference seems very small. I can show you more examples if it helps.
I can get your version to output 63 if I change `(blk * (one / d))` to `blk / d`. If you do the math in np.float64 you get 63.49996485, so I think my version is slightly more accurate.
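The discrepancy can be reproduced in isolation with the one element that differs in the block quoted above (a small illustrative sketch; the constants are copied from that example output, and it assumes plain IEEE-754 float32 arithmetic):

```python
import numpy as np

x = np.float32(0.009033203)   # the element whose quantized value differs (64 vs. 63)
m = np.float32(0.018066406)   # abs().max() of the block
d = m / np.float32(127)       # per-block scale, computed in float32

print((x * (np.float32(1) / d)).round())  # original formulation: rounds to 64 here
print((x / d).round())                    # suggested formulation: rounds to 63 here
print(np.float64(x) / np.float64(d))      # the same division in float64: just under 63.5
```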
I can get behind that. Thanks! Performance is actually competitive with the C++ version now.
Setting high concurrency doesn't really do much when not quantizing. Also, whatever the concurrency level is set to, the value was previously hardcoded to 8, so I just went with that, partially to be conservative as well, because I don't think converting to
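For context, the general idea behind a bounded parallel map of this sort can be sketched as follows (a simplified, hypothetical version, not the actual `bounded_parallel_map` in `convert.py`): at most `concurrency` futures are in flight at once, and the input iterator is only advanced as results are yielded rather than being consumed up front.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

In = TypeVar('In')
Out = TypeVar('Out')

def bounded_parallel_map_sketch(func: Callable[[In], Out], iterable: Iterable[In],
                                concurrency: int) -> Iterator[Out]:
    if concurrency < 2:
        yield from map(func, iterable)
        return
    it = iter(iterable)
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        # Prime the pipeline with up to `concurrency` items.
        futures = [executor.submit(func, item) for item in itertools.islice(it, concurrency)]
        while futures:
            result = futures.pop(0).result()
            # Pull at most one new item so the bound is maintained.
            for item in itertools.islice(it, 1):
                futures.append(executor.submit(func, item))
            yield result

# Example: squares computed with at most 4 submissions in flight, in order.
print(list(bounded_parallel_map_sketch(lambda x: x * x, range(10), concurrency=4)))
```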
I will take a detailed look a bit later - I want to keep up to date with the Python code, since pre-GGUF I lost track of it and it got too messy and complicated for my taste. It's still not great, but at least I'm more familiar with it now. One thing I want to see at some point is to extract the lazy and parallel loading of the tensors into
* Fix issue with bounded_parallel_map and greedy consuming iterator
* Display elapsed time during conversion
* Minor improvements to help text
* Clean up bounded_parallel_map function a bit
I have mixed feelings about this change - I'll likely never use it, but at the same time it will be present in `convert.py`.
The actual quantization code is very, very small:

```python
QK8_0 = 32
BLOCK_Q8_0 = np.dtype([('d', '<f2'), ('qs', 'i1', (QK8_0,))])

def quantize_array_q8_0(arr):
    assert arr.size % QK8_0 == 0 and arr.size != 0, f'Bad array size {arr.size}'
    assert arr.dtype == np.float32, f'Bad array type {arr.dtype}'
    n_blocks = arr.size // QK8_0
    blocks = arr.reshape((n_blocks, QK8_0))
    return np.fromiter(quantize_blocks_q8_0(blocks), count = n_blocks, dtype = BLOCK_Q8_0)

def quantize_blocks_q8_0(blocks):
    d = abs(blocks).max(axis = 1) / np.float32(127)
    with np.errstate(divide = 'ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    yield from zip(np.float16(d), qs)
```

Most of the other changes specific to adding the ability to quantize are just handling the type in the relevant classes. I don't think there's a way to add a type without handling that case in those classes, so there isn't really a way to separate out that part. The rest of the pull is a fix for the `bounded_parallel_map` issue.

So the "bloat" added by quantizing is very small, and you could say there's about the same amount of bloat involved in supporting both `f16` and `f32`.

edit: I put some more thought into this, and the only thing I can really come up with to help with the changes related to supporting an additional type would be to refactor all that stuff so you could register types and handlers for them. That way an external script could import/reuse most of `convert.py`.
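As a quick, hypothetical sanity check of the helper shown above (it assumes `QK8_0`, `BLOCK_Q8_0` and `quantize_array_q8_0` from that snippet are already defined):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal(4 * QK8_0).astype(np.float32)  # 4 blocks of 32 values

packed = quantize_array_q8_0(arr)
print(packed.dtype)     # [('d', '<f2'), ('qs', 'i1', (32,))]
print(packed.shape)     # (4,) -- one element per block
print(packed[0]['d'])   # per-block float16 scale
print(packed[0]['qs'])  # the 32 quantized int8 values for the first block
```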
Ok, let's merge it. We should remove the `DeferredPermutedTensor` - not sure what its purpose was, but it is no longer used.
I can add that change. I'll hold off on merging this until tomorrow when I'll have some time to go through my changes and see if there's anything else I can do to clean up/simplify them.
@ggerganov What do you think of these changes? It's, uhh... a lot more than I was planning to do originally, but I think it's basically better and more understandable than the original version that special-cased the quantization stuff. I still have to do more testing to make sure I didn't break the BF16 or SafeTensors stuff (I don't have an easy way to test that right now), but I want to make sure you're okay with this approach. If we did want to add more quantizations or some other data type that requires special handling, these changes also make that easier.

I got type linting working for Python in my IDE and I found a bunch of other stuff that I'll try to fix, but it's probably a bad idea to shove everything into this pull. One example in particular is that some of the permute stuff is getting called with the wrong number of arguments. I assume if that code path is ever taken it'll just crash.
Yes, this looks nice.
I need to learn how to do that :D
If you're using VS Codium (or VS Code) I can help you; it's pretty simple. It's also possible to just run
Great! I'm relatively happy with this, although my choices for naming things probably aren't very ideal. I tested with SafeTensors and a SafeTensors file with all the tensors converted to bfloat16; it seems to work fine. I don't really have the ability to mess with gigantic 70B 32-bit models, and I've been testing with a 3B. The only thing that would be different is the permute stuff, and I don't think this pull should affect that (I specifically have my quantization run after everything else), but I didn't test converting a 70B LLaMA2 model.

As far as my changes go, I'd say this is now ready to merge. Are you good with the current state, or are any further changes/testing needed? (Of course, testing/feedback from anyone else reading this is welcome also.)
We should not "promote" this quantization option for general use, as it produces different results from the reference C++ quantization, and it can become messy if a lot of people start using it and we end up with 2 different Q8_0 models.
Seems reasonable. Do you want any extra warnings/notes added when using that mode, or to the
No need - should be fine. I think it will be difficult to make the results 100% exact. The quality should be the same.
Considering that this produces better results than converting to f16 and then quantizing, maybe it should be the recommended way to do it.
Perplexity/hellaswag/other quality tests might vary a bit also.
I'm not an expert on this stuff, but I think the differences here may be too small to be meaningful, really. From my tests, though, using convert.py to convert to f16 and then using the C++ quantize to q8_0 gave a slightly better result than converting to f32 and then using the C++ quantize to q8_0. It wouldn't make sense to conclude that f16 is higher quality than f32, though. I think any sort of tiny change, whether it's objectively higher or lower quality, can just randomly vary the results a bit in either direction.
Btw, I see people currently discussing potential quality issues with converting BF16 -> F16 (not related to this PR).

My quick tests show that with llama-7b-v2 there is no noticeable difference between converting to F16 and F32. Observations are based on a few text generations: identical output up to a few tens of generated tokens, after which they start to diverge slightly. This is to be expected, as the probs are of course very slightly different between the two.
One metric I like for comparing perplexity is the difference between a 13B LLaMA1 model and a 7B: that's fairly noticeable. If I remember correctly, that's around 0.68. 10% of that would be about 0.068, and I'd be surprised if users could notice a difference smaller than that. (The biggest difference between the results I saw testing the different conversions was 0.003.)
* master: (773 commits)
  * server : add `/detokenize` endpoint (ggerganov#2802)
  * convert.py : advanced option (ggerganov#2753)
  * llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  * flake.nix : add rocm support and cleanup (ggerganov#2808)
  * llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  * main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  * llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  * k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  * Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  * Fix HellaSwag (ggerganov#2805)
  * flake : build llama.cpp on Intel with nix (ggerganov#2795)
  * Handle null rope scaling value (ggerganov#2793)
  * Fix spm whitespaces (ggerganov#2806)
  * examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  * llama : fix struct decl (ggerganov#2790)
  * Faster perplexity computation (ggerganov#2786)
  * llama : add llama_beam_search() (ggerganov#2267)
  * convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  * llama-bench : add model sizes (ggerganov#2771)
  * convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  * ...
* Allow convert.py to convert to q8_0
  * Fix issue with bounded_parallel_map and greedy consuming iterator
  * Display elapsed time during conversion
* Add --concurrency option
  * Minor improvements to help text
  * Clean up bounded_parallel_map function a bit
* Massive speed improvement thanks to Cebtenzzre
* Refactor types
The main thing this pull does is add `q8_0` as an output type for `convert.py`.

`bounded_parallel_map` also had weird behavior where it would fully consume the iterator passed to it (so it could build `items_rev`), but this really isn't necessary. I added an argument to allow specifying the factor used to create the executor as well.

The way it uses both `ThreadPoolExecutor` and `ProcessPoolExecutor` may look weird, but it's necessary to get acceptable performance when quantizing. My Python `q8_0` implementation isn't too fast, and it doesn't seem like `ThreadPoolExecutor` parallelization works there, probably because of the GIL. `ThreadPoolExecutor` is still needed because `ProcessPoolExecutor` requires objects to be pickleable, and there's some stuff involved in loading that isn't.

On my machine (5700G), using 8 threads takes 2 minutes to convert/quantize a 3B model. Extrapolating from that, it would take about 47min to convert and quantize a 70B. The main use for this would be creating a `Q8_0` without having to create an intermediate `f16` or `f32` and then convert to `Q8_0` (also, you don't get exactly the same results going `f16` -> `q8_0`).

I think my mini `q8_0` implementation works (or is very close). Perplexity results for `orca_mini_3b`:

f16 to q8_0
[612]10.8966,[613]10.9062,[614]10.9136,[615]10.9014,[616]10.9013

f32 to q8_0
[612]10.8995,[613]10.9092,[614]10.9166,[615]10.9043,[616]10.9043

direct to q8_0
[612]10.8995,[613]10.9091,[614]10.9165,[615]10.9043,[616]10.9042

With the improved version:
[612]10.8987,[613]10.9083,[614]10.9156,[615]10.9034,[616]10.9033

edit: Added a `--concurrency` commandline option. This can help a lot with quantizing; using `--concurrency 16` got the 3B model converted/quantized in about 75sec.

edit: With @cebtenzzre's changes we're down to 27sec converting/quantizing the 3B model.
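A minimal, hypothetical sketch of the two-executor split described above (not the actual convert.py code; `load_tensor` and `quantize_q8_0` here are stand-ins): the thread pool handles the lazy loading, whose objects can't be pickled, while the process pool handles the CPU-bound quantization that the GIL would otherwise serialize.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import numpy as np

QK8_0 = 32

def load_tensor(seed: int) -> np.ndarray:
    # Stand-in for lazily loading one tensor; the real lazy tensors hold file
    # handles and closures that cannot cross a process boundary.
    rng = np.random.default_rng(seed)
    return rng.standard_normal(1024 * QK8_0).astype(np.float32)

def quantize_q8_0(arr: np.ndarray) -> tuple:
    # Stand-in for the CPU-bound q8_0 quantization; plain float32 arrays
    # pickle fine, so this part can run in worker processes.
    blocks = arr.reshape(-1, QK8_0)
    d = np.abs(blocks).max(axis=1) / np.float32(127)
    with np.errstate(divide='ignore', invalid='ignore'):
        qs = (blocks / d[:, None]).round()
    qs[d == 0] = 0
    return d.astype(np.float16), qs.astype(np.int8)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as loader, ProcessPoolExecutor(max_workers=8) as workers:
        tensors = loader.map(load_tensor, range(16))          # unpicklable loading stays in threads
        results = list(workers.map(quantize_q8_0, tensors))   # CPU-bound work goes to processes
    print(len(results), results[0][0][:4])
```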