Add support for GPTQ-triton using the --gptq-triton flag. #1263
fpgaminer wants to merge 1 commit into
|
My modified Dockerfile for testing: |
|
This should already work. |
How so? I don't see any support for GPTQ-triton on |
The newest triton branch changed something; can you please check? It's breaking the model loading. Sorry for not having time to make a PR for that. |
I already made one #1229 |
|
Closing this in favor of #1229 |
That pull request doesn't have anything to do with GPTQ-triton? Are you perhaps confusing GPTQ-triton with GPTQ-for-LLaMa? I'm not affiliated with the latter. |
What's the difference (advantage) of GPTQ-triton over GPTQ-for-LLaMa's triton branch? |
Most of GPTQ-for-LLaMa's triton branch is copied from my code in GPTQ-triton, so it's always going to lag my work on the Triton kernels. GPTQ-triton contains more extensive testing and verification (e.g. Benchmark, Verify, ppl.py). It also has packaging, so it can be installed with pip (I'll start uploading to PyPI once things are more polished). |
|
OK, sorry, I confused GPTQ-triton with GPTQ-for-LLaMa. Now we have 4 different "GPTQ" implementations:
each with different interfaces. And we cannot simply drop support for CUDA, not only because of old weights but, more importantly, because we need to support Windows. |
|
Way more forks than that. 0cc4m's is another big one on the Kobold side. I really do want to see what Triton does, and I hope they fix it for my card soon. I think the big points are:
It's not one size fits all. |
|
There's now also PanQiWei's AutoGPTQ which provides a really nice transformers-style interface to GPTQ creation and inference. Supports Llama, GPT-J, GptNeoX, and others. It just added Triton support (on top of good CUDA support) and is working on other features as well. IMHO this is the future of GPTQ. It's so much easier to use and I hope it becomes the standard. Example of loading a quantised model and doing inference: Right now the GPTQs produced seem to score fractionally lower than GPTQ-for-LLaMa on perplexity but they're looking into that. |
|
It's definitely not the future for old cards or Windows. Plus the newer CUDA branch of GPTQ is painfully slow. It's definitely convenient for quantization. |
Why? AutoGPTQ supports CUDA as well, and he's working on creating a PyPI package with pre-built binaries. Ideal for Windows!
To be clear, I'm talking about AutoGPTQ, not GPTQ-for-LLaMa. Have you tried AutoGPTQ and found it slow for inference? I've not really tested its inference performance yet. But its quantisation performance (tested on CUDA) seems quite a bit faster than GPTQ-for-LLaMa to me. |
|
I haven't yet. I've only used the old-commit GPTQ, like the ones 0cc4m and oobabooga made, or the 4bit lora repo's autograd for inference. It seems that after GPTQ-for-LLaMa added group size + act order to the CUDA kernel, it cut inference speed by 2/3, or at least 1/2. I thought AutoGPTQ was more for quantizing the models than for inference, and that inference performance would be identical to the new CUDA branch. I'm for sure willing to try it; any speedup will help. Windows doesn't support Triton at all though, and neither does my Pascal card. Plus it's made by OpenAI, ewww. |
|
I think that AutoGPTQ's huge benefit will be in inference. I don't know about current performance because I've not yet tested that, but in design terms it feels way ahead in how easy it is to use, and therefore how easy it will be to integrate into clients like text-generation-webui. Like I showed in that example above, you can take existing Transformers code, make a couple of tweaks, and load a GPTQ model instead. Earlier today someone on Discord asked me how they could access a GPTQ-for-LLaMa model from Python code, and all I could tell them was "check out llama_inference.py and base your code around that". AutoGPTQ, on the other hand, can be added to any code in minutes, for both inference and quantisation.
GPTQs are great but they're also quite a pain right now. I provide a bunch of them at https://huggingface.co/TheBloke/ and I get multiple comments a day from people who can't get them working, or for whom they run slowly or produce gibberish. There are all these different forks of GPTQ-for-LLaMa, with different performance levels and different features supported. That's why I'm really hoping we can all get behind one system that becomes the new standard for GPTQ and supports everything for everyone. And from what I've seen so far, @PanQiWei's AutoGPTQ could be that. Then if it does have performance or compatibility issues, I'm sure he'll work on them and improve them. He's been doing 10+ commits a day since he started the project and appears to be making great progress.
Tonight I'll do some inference and performance tests and report the results. |
|
GPTQ in Python just uses make_quant for all the models. Check out gptq_loader in this repo. After that, standard HF code works. I do agree that a unified GPTQ will be best if he can get it there, and then I will use it. If it is all Triton, or slower CUDA, well, I physically can't. I am cagey because so far there have been constant breaking changes or barriers. That might be OK for something that didn't need an outlay of investment in hardware, or re-conversions and re-downloads of huge files. I especially hope for faster working int8 inference so that the 13B models can be smarter. The perplexity difference between BnB and int4 is substantial, but BnB is slower than GPTQ. Maybe that's not so much the case on newer hardware, but I won't be able to check that until next week at the earliest. |
|
I think that having one more GPTQ loader in the web UI would make things confusing. I want to deprecate the current loaders and focus on https://github.com/PanQiWei/AutoGPTQ |
GPTQ-triton is my WIP implementation of the GPTQ kernels in Triton, which improve inference speed on GPTQ quantized models. On short prompts it provides on average a 10% boost in performance relative to the CUDA kernels. On large prompts it is over 10x faster. It's the source of the Triton branch in GPTQ-for-LLaMa.
While GPTQ-triton is still a work in progress, I did integrate support for it in text-generation-webui and thought it might be useful to share the code, since it's a relatively simple addition. So far I've tested this integration against the latest transformers, with LLaMA 7B at 4-bit quantization and groupsize -1. Support for multi-GPU, other wbit settings, and Flash Attention is not yet implemented.
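To give a rough idea of how an opt-in flag like this can select a loader, here is a minimal, self-contained sketch. It is not the code from this pull request: the `--wbits` and `--groupsize` options, the `select_loader` helper, and the returned loader names are all assumptions made for illustration. The only details taken from the description above are the `--gptq-triton` flag and the fact that wbit settings other than 4 are not yet supported.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny CLI with an opt-in flag for the Triton kernels."""
    parser = argparse.ArgumentParser(description="Loader selection sketch")
    # --wbits and --groupsize are assumed option names for this sketch.
    parser.add_argument("--wbits", type=int, default=0,
                        help="Quantization bit width; 0 means not quantized")
    parser.add_argument("--groupsize", type=int, default=-1,
                        help="Quantization group size; -1 means no grouping")
    parser.add_argument("--gptq-triton", action="store_true",
                        help="Load the model through the GPTQ-triton kernels")
    return parser


def select_loader(args: argparse.Namespace) -> str:
    """Return a label for the loader to use (labels are hypothetical)."""
    if args.gptq_triton:
        # Only 4-bit is described as supported so far, so fail early otherwise.
        if args.wbits != 4:
            raise ValueError("--gptq-triton currently requires --wbits 4")
        return "gptq-triton"
    if args.wbits > 0:
        return "gptq-cuda"
    return "transformers"


if __name__ == "__main__":
    cli_args = build_parser().parse_args()
    print(select_loader(cli_args))
```

Keeping the choice in one small function means the existing loaders are left untouched when the flag is absent, and unsupported settings are rejected before any weights are loaded.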
If you don't feel comfortable integrating support yet, that's totally fine, just let me know what features are blockers.
Thank you.