add QuiP quant support #217

Open · wants to merge 6 commits into master
Conversation

@waters222 commented Dec 7, 2023

This is a draft PR for adding QuiP quant support to ExLlamaV2.
Original QuiP Repo

What works:

  1. Right now it can load a pre-quantized model and generate tokens.
  2. Since the original quantized model does not ship with the tokenizer.model file etc., the tokenizer files from the HF model need to be copied into the quantized model's directory (see the sketch after this list).
  3. Added a new CUDA extension.
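
For item 2, here is a minimal sketch of the kind of copy step meant, assuming the usual HF tokenizer filenames; the helper and the HF source path are illustrative, not code from this PR:

```python
# Hypothetical helper (not part of this PR): copy tokenizer files from the
# original HF checkout into the QuiP-quantized model directory, since the
# quantized repo ships without them.
import shutil
from pathlib import Path

def copy_tokenizer_files(hf_model_dir: str, quip_model_dir: str) -> None:
    src, dst = Path(hf_model_dir), Path(quip_model_dir)
    # Usual HF tokenizer filenames; the exact set varies per model.
    for name in ("tokenizer.model", "tokenizer.json",
                 "tokenizer_config.json", "special_tokens_map.json"):
        if (src / name).exists():
            shutil.copy2(src / name, dst / name)

# The HF source path is illustrative; the target matches this PR's example.
copy_tokenizer_files("/media/storage/models/Llama-2-7b-hf",
                     "/media/storage/models/Llama-2-7b-E8P-2Bit")
```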

PPL performance benchmark

Using dataset: [wikitext-2-v1_validation_0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-v1/validation)

Sample command:

python test_inference.py -m /media/storage/models/Llama-2-7b-E8P-2Bit -ed /media/storage/wikitext/wikitext-2-v1_validation_0000.parquet
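
For reference, a rough sketch of the perplexity computation behind the -ed evaluation: tokenize the dataset, sum per-token negative log-likelihoods, and exponentiate the mean. The model/tokenizer interfaces here are assumed stand-ins, not the actual test_inference.py code:

```python
# Illustrative PPL sketch; assumes tokenizer.encode returns a (1, seq_len)
# LongTensor and model.forward returns (1, seq_len, vocab_size) logits.
import math
import pandas as pd
import torch
import torch.nn.functional as F

def perplexity(model, tokenizer, parquet_path: str, max_rows: int = 128) -> float:
    # Read evaluation text from the parquet file (one passage per row)
    texts = pd.read_parquet(parquet_path)["text"].tolist()[:max_rows]
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer.encode(text)
        if ids.shape[-1] < 2:
            continue
        with torch.no_grad():
            logits = model.forward(ids)
        # NLL of each token given the preceding context
        nll = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="sum")
        total_nll += nll.item()
        total_tokens += ids.shape[-1] - 1
    return math.exp(total_nll / total_tokens)  # perplexity = exp(mean NLL)
```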
Model performance

2-bit

| Model | PPL |
| --- | --- |
| Llama-2-7b-E8P-2Bit | 8.7339 |
| Llama2-7b-exl2-2.5bpw | 8.0745 |
| Llama-2-13b-E8P-2Bit | 7.1207 |
| Llama2-13b-exl2-2.5bpw | 7.2741 |
| Llama-2-70b-E8P-2Bit | 6.2192 |
| Llama2-70b-exl2-2.5bpw | 5.8270 |

4-bit

| Model | PPL |
| --- | --- |
| Llama-2-7b-HI-4Bit-Packed | 6.0748 |
| Llama2-7b-exl2-4.0bpw | 6.0300 |
| Llama-2-13b-HI-4Bit-Packed | 7.4169 |
| Llama2-13b-exl2-4.0bpw | 5.4905 |

Inference examples

7B 2bit E8P

Once upon a time, the 2018 version of Windows 10 will be available in preview and then it will get into beta mode. everybody will install windows update preview and they will see the next release of windows 10 and the next feature will come to life for the first time, but this is not yet.
The 365 Days of Windows 10 is a big part of Microsoft’s plan. Microsoft has been working on the future of Windows for over three years, including Windows Server, Office, Xbox Live etc The new version of Windows 10 is called the cloud, so you can access all

 -- Response generated in 6.75 seconds, 128 tokens, 18.98 tokens/second (includes prompt eval.)

13B 2bit E8P

Once upon a time, when I was in the fifth grade and at an age where you can do what you want.
I don’t know why my teacher had to write it so many times that day but there is something about this sentence that makes me feel like we all need someone else beside us. I am not saying I am a bad person who doesn’t have any friends or anything, I just find myself really alone sometimes, despite of having plenty of people around me. So, even though I might be one of those lucky ones with no friends at all (as long as they are not the worst people ever), I can never den

 -- Response generated in 9.53 seconds, 128 tokens, 13.43 tokens/second (includes prompt eval.)

70B 2bit E8P

Once upon a time, the only way to get a book signed was in person. It could be awk anymore!
I think it'holm very special that you can do both of those things simultaneously!
A few people are still using paper and pen for their signatures, but most of them are using digital signatures as well nowadays because they don't need to write anything down anymore!
Most authors use electronic pens these days so that we don't have any problems signing our books or getting them back from customers when needed (and trust me- there will always be someone who wants us). But if an author does want his

-- Response generated in 18.82 seconds, 128 tokens, 6.80 tokens/second (includes prompt eval.)

@waters222 changed the title from "[Draft] Trying to add QuiP quant into inference" to "add QuiP quant into inference" on Dec 7, 2023
@waters222 changed the title from "add QuiP quant into inference" to "add QuiP quant support" on Dec 7, 2023
@CyberTimon
Contributor

Very cool. Thank you for making this, will be awesome when it's done 👍

@KnutJaegersberg

Would like to have this feature :)

@tau0-deltav commented Feb 10, 2024

It might be worth noting that most of what QuIP# appears to be achieving seems to be achieved better by llama.cpp's new quantization scheme. The dedicated 2-bit quants (IQ2_XS and XXS, 2.03 and 2.3 BPW - as opposed to being more like 3 bits) are very strong.*

There's now an optimisation scheme involved, but it's clearly not a 'generate FP64 Hessians for a week (no really, that's what they suggested - for 6k ctx) on your grandma's Threadripper X cluster' approach - it's much more like the 'discover the most important weights by throwing words at the decoder' scheme we're familiar with here.

ikawrakow's commentary here is well worth a read. I'll link this technical discussion I found instead of the unsightly spat with one of the Cornell team.

I'm long past any claim to being a computer scientist, but I'd like to hope EXL2's inheritance from GPTQ (quantize-then-calibrate as GPTQ's design goal, with flexibility in weight assignment added by EXL2 on top of that) could make EXL2 itself a better home for the methods used here than QuIP#. Those imatrix files are bigger than exl2's measurement files, but 25 MB isn't exactly out of reach here, compared to 2 MB?

(...Could these just convert directly? Probably not without the E8 enumeration support, but I do wonder what exactly is in a GGUF that isn't in an EXL2, or vice versa.)

*IQ3_XXS is Absolutely Robust - it's scary. It's very new. It's making me plug in a 3070 Ti that I ought to sell.
