Add support for GPTQ-triton using the --gptq-triton flag. by fpgaminer · Pull Request #1263 · oobabooga/textgen

fpgaminer · 2023-04-16T06:08:23Z

GPTQ-triton is my WIP implementation of the GPTQ kernels in Triton, which improve inference speed on GPTQ quantized models. On short prompts it provides on average a 10% boost in performance relative to the CUDA kernels. On large prompts it is over 10x faster. It's the source of the Triton branch in GPTQ-for-LLaMa.

While GPTQ-triton is still a work in progress, I did integrate support for it in text-generation-webui and thought it might be useful to share the code, since it's a relatively simple addition. So far I've tested this integration against the latest transformers with llama 7B at 4-bit quantization and groupsize -1. Support for multi-GPU, other wbits settings, and Flash attention is not yet implemented.
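As a rough illustration of what "a relatively simple addition" could look like, here is a minimal sketch of a boolean flag that routes model loading to a different backend. Only the `--gptq-triton` flag name comes from this pull request; the `--wbits` and `--groupsize` options, the `select_loader` helper, and the returned backend labels are assumptions made for the example, not the actual code in this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with hypothetical quantization options."""
    parser = argparse.ArgumentParser(description="Hypothetical loader options")
    # Assumed options: 0 means "not quantized", -1 means "no grouping".
    parser.add_argument("--wbits", type=int, default=0)
    parser.add_argument("--groupsize", type=int, default=-1)
    # Flag name taken from the PR title; everything behind it is a sketch.
    parser.add_argument("--gptq-triton", action="store_true")
    return parser


def select_loader(args: argparse.Namespace) -> str:
    """Return a label for the backend that should load the model."""
    if args.wbits <= 0:
        return "transformers"
    if args.gptq_triton:
        # The PR description says only 4-bit has been implemented so far.
        if args.wbits != 4:
            raise ValueError("this sketch only allows --wbits 4 with --gptq-triton")
        return "gptq-triton"
    return "gptq-cuda"


if __name__ == "__main__":
    print(select_loader(build_parser().parse_args()))
```

Keeping the choice in one small function means the rest of the loading code does not need to know which kernel implementation sits behind the model.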

If you don't feel comfortable integrating support yet, that's totally fine; just let me know which features are blockers.

Thank you.

fpgaminer · 2023-04-16T06:09:22Z

My modified Dockerfile for testing:

FROM nvidia/cuda:11.7.0-devel-ubuntu22.04 as builder

RUN apt-get update && \
    apt-get install --no-install-recommends -y git vim build-essential python3-dev python3-venv && \
    rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa /build/GPTQ-for-LLaMa
RUN git clone https://github.com/fpgaminer/GPTQ-triton /build/GPTQ-triton

WORKDIR /build

RUN python3 -m venv /build/venv
RUN . /build/venv/bin/activate && \
    pip3 install --upgrade pip setuptools build && \
    pip3 install torch torchvision torchaudio

# https://developer.nvidia.com/cuda-gpus
# for a rtx 2060: ARG TORCH_CUDA_ARCH_LIST="7.5"
ARG TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
WORKDIR /build/GPTQ-for-LLaMa
RUN . /build/venv/bin/activate && \
    python3 setup_cuda.py bdist_wheel -d .

WORKDIR /build/GPTQ-triton
RUN . /build/venv/bin/activate && \
    python3 -m build

FROM nvidia/cuda:11.7.0-runtime-ubuntu22.04

LABEL maintainer="Your Name <your.email@example.com>"
LABEL description="Docker image for GPTQ-for-LLaMa and Text Generation WebUI"

RUN apt-get update && \
    apt-get install --no-install-recommends -y git python3 python3-pip make g++ python3-dev && \
    rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/pip pip3 install virtualenv
RUN mkdir /app

WORKDIR /app

ARG WEBUI_VERSION
RUN test -n "${WEBUI_VERSION}" && git reset --hard ${WEBUI_VERSION} || echo "Using provided webui source"

RUN virtualenv /app/venv
RUN . /app/venv/bin/activate && \
    pip3 install --upgrade pip setuptools && \
    pip3 install torch torchvision torchaudio

COPY --from=builder /build/GPTQ-for-LLaMa /app/repositories/GPTQ-for-LLaMa
RUN . /app/venv/bin/activate && \
    pip3 install /app/repositories/GPTQ-for-LLaMa/*.whl

COPY --from=builder /build/GPTQ-triton /app/repositories/GPTQ-triton
RUN . /app/venv/bin/activate && \
    pip3 install /app/repositories/GPTQ-triton/dist/*.whl

COPY extensions/api/requirements.txt /app/extensions/api/requirements.txt
COPY extensions/elevenlabs_tts/requirements.txt /app/extensions/elevenlabs_tts/requirements.txt
COPY extensions/google_translate/requirements.txt /app/extensions/google_translate/requirements.txt
COPY extensions/silero_tts/requirements.txt /app/extensions/silero_tts/requirements.txt
COPY extensions/whisper_stt/requirements.txt /app/extensions/whisper_stt/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/api && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/elevenlabs_tts && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/google_translate && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/silero_tts && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/whisper_stt && pip3 install -r requirements.txt

COPY requirements.txt /app/requirements.txt
RUN . /app/venv/bin/activate && \
    pip3 install -r requirements.txt

RUN cp /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so

COPY . /app/
ENV CLI_ARGS=""
CMD . /app/venv/bin/activate && python3 server.py ${CLI_ARGS}

oobabooga · 2023-04-16T06:21:54Z

This should already work.

fpgaminer · 2023-04-16T06:23:58Z

> This should already work.

How so? I don't see any support for GPTQ-triton on main?

catalpaaa · 2023-04-16T09:45:57Z

> This should already work.

The newest triton branch changed something. Can you please check? It's breaking the model loading.

Sorry for not having time to make a PR for that.

sgsdxzy · 2023-04-16T10:56:26Z

> > This should already work.

> The newest triton branch changed something. Can you please check? It's breaking the model loading.

> Sorry for not having time to make a PR for that.

I already made one #1229

oobabooga · 2023-04-17T04:19:07Z

Closing this in favor of #1229

fpgaminer · 2023-04-17T05:41:39Z

> Closing this in favor of #1229

That pull request doesn't have anything to do with GPTQ-triton? Are you perhaps confusing GPTQ-triton with GPTQ-for-LLaMa? I'm not affiliated with the latter.

sgsdxzy · 2023-04-17T05:50:35Z

> > Closing this in favor of #1229

> That pull request doesn't have anything to do with GPTQ-triton? Are you perhaps confusing GPTQ-triton with GPTQ-for-LLaMa? I'm not affiliated with the latter.

What's the difference (advantage) of GPTQ-triton over GPTQ-for-LLaMa's triton branch?

fpgaminer · 2023-04-17T05:59:56Z

> What's the difference (advantage) of GPTQ-triton over GPTQ-for-LLaMa's triton branch?

Most of GPTQ-for-LLaMa's triton branch is copied from my code in GPTQ-triton, so it's always going to lag my work on the Triton kernels. GPTQ-triton contains more extensive testing and verification (e.g. Benchmark, Verify, ppl.py). It is also packaged, so it can be installed with pip (I'll start uploading to PyPI once things are more polished).

sgsdxzy · 2023-04-17T06:22:55Z

OK, sorry I confused GPTQ-triton for GPTQ-for-LLaMa.

Now we have 4 different "GPTQ" implementations:

ooba's GPTQ-for-LLaMa fork of "old" CUDA
qwopqwop200's "new" CUDA
qwopqwop200's triton
fpgaminer's GPTQ-triton

each with a different interface. And we cannot simply drop support for CUDA, not only because of old weights but, more importantly, because it is needed to support Windows.
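The constraint described here (Triton kernels where they run, CUDA kernels everywhere else) can be sketched as a small selection function. This is only an illustration: the `pick_backend` name is hypothetical, and the rule it encodes (Triton requires Linux and an NVIDIA compute capability of at least 7.0) is an assumption about Triton's support at the time, not something confirmed in this thread.

```python
import platform
from typing import Optional, Tuple


def pick_backend(system: str, compute_capability: Optional[Tuple[int, int]]) -> str:
    """Choose between 'triton' and 'cuda' kernels.

    system: value such as platform.system() returns ('Linux', 'Windows', ...).
    compute_capability: (major, minor) of the GPU, or None if unknown.
    """
    # Assumption: Triton is only usable on Linux.
    if system != "Linux":
        return "cuda"
    # Assumption: Triton needs compute capability 7.0 or newer.
    if compute_capability is None or compute_capability < (7, 0):
        return "cuda"
    return "triton"


if __name__ == "__main__":
    # The capability is hard-coded here; a real loader would query the GPU.
    print(pick_backend(platform.system(), (8, 6)))
```

A fallback like this keeps one entry point for users while leaving the CUDA path in place for Windows and older cards.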

Ph0rk0z · 2023-04-17T14:11:05Z

Way more forks than that. 0cc4m's is another big one on the Kobold side.

I really do want to see what Triton does; I hope they fix it for my card soon.

I think the big points are:

Compatibility (windows, old cards, etc)
Formats - not everyone can requantize, llama isn't the only model in existence for 4bit
Speed - what good is new and shiny when it does .2 it/s
Perplexity - are the models dumb?

It's not one size fits all.

TheBloke · 2023-04-26T14:00:31Z

There's now also PanQiWei's AutoGPTQ, which provides a really nice transformers-style interface to GPTQ creation and inference. It supports Llama, GPT-J, GPT-NeoX, and others.

It just added Triton support (on top of good CUDA support) and is working on other features as well.

IMHO this is the future of GPTQ. It's so much easier to use and I hope it becomes the standard.

Example of loading a quantised model and doing inference:

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("/workspace/koalaGPTQ", use_fast=False)

model = AutoGPTQForCausalLM.from_quantized("/workspace/koalaGPTQ", device="cuda:0")

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
generated_text = pipeline(
    "AI is the future because",
    return_full_text=False,
    num_beams=1,
    max_length=128
)[0]['generated_text']

print(generated_text)

Right now the GPTQs produced seem to score fractionally worse than GPTQ-for-LLaMa's on perplexity, but they're looking into that.
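For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the model fits the evaluation text better. The sketch below computes it from a list of per-token natural-log probabilities; the `perplexity` function name and its input format are assumptions for illustration and are not taken from any of the repositories discussed here.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Compute perplexity from natural-log probabilities of each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every target token.
    print(perplexity([math.log(0.25)] * 10))
```

Comparing two quantized checkpoints is only meaningful when both are scored on the same text with the same tokenizer and context length.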

Ph0rk0z · 2023-04-26T14:06:48Z

It's definitely not the future for old cards or windows. Plus the newer cuda branch of GPTQ is painfully slow. It's definitely convenient for quantization.

TheBloke · 2023-04-26T14:09:20Z

> It's definitely not the future for old cards or windows.

Why? AutoGPTQ supports CUDA as well, and he's working on creating a PyPI package with pre-built binaries. Ideal for Windows!

> Plus the newer cuda branch of GPTQ is painfully slow. It's definitely convenient for quantization.

To be clear, I'm talking about AutoGPTQ, not GPTQ-for-LLaMa. Have you tried AutoGPTQ and found it slow for inference? I've not really tested its inference performance yet. But its quantisation performance, tested on CUDA, seems quite a bit faster than GPTQ-for-LLaMa to me.

Ph0rk0z · 2023-04-26T14:56:09Z

I haven't yet. I've only used the old-commit GPTQ forks, like the ones 0cc4m and oobabooga made, or the 4bit lora repo's autograd for inference.

It seems that after GPTQ-for-LLaMa added group size + act order to the CUDA kernel, it cut inference speed by 2/3 or at least 1/2. I thought AutoGPTQ was more for quantizing the models than for inference, and that inference performance would be identical to the new CUDA branch.

I'm for sure willing to try it; any speedup will help. Windows doesn't support Triton at all though, and neither does my Pascal card. Plus it's made by OpenAI, ewww.
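To make the "group size" term from this comment concrete, here is a simplified sketch of 4-bit storage and group-wise dequantization in plain Python. It assumes eight 4-bit values are packed into one 32-bit word, lowest nibble first, and that each group of `groupsize` consecutive weights shares one scale and one zero point, with `groupsize=-1` meaning a single group. The function names and the exact layout are assumptions for illustration; the real kernels operate on tensors, and act-order (which changes how weights map to groups) is not modeled.

```python
from typing import List


def pack_4bit(values: List[int]) -> int:
    """Pack eight integers in [0, 15] into one 32-bit word."""
    if len(values) != 8 or any(not 0 <= v < 16 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)  # value i occupies bits 4*i .. 4*i+3
    return word


def unpack_4bit(word: int) -> List[int]:
    """Inverse of pack_4bit: extract eight 4-bit values from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(q: List[int], scales: List[float], zeros: List[int],
               groupsize: int) -> List[float]:
    """Apply w[i] = scales[g] * (q[i] - zeros[g]) with g = i // groupsize."""
    if groupsize == -1:
        groupsize = len(q)  # one group covering every weight
    return [scales[i // groupsize] * (q[i] - zeros[i // groupsize])
            for i in range(len(q))]


if __name__ == "__main__":
    word = pack_4bit([0, 1, 2, 3, 4, 5, 6, 7])
    print(dequantize(unpack_4bit(word), [1.0, 2.0], [0, 4], groupsize=4))
```

Smaller groups mean more scales and zero points to look up per weight, which is one plausible reason a kernel with group support does more work than one without it.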

TheBloke · 2023-04-26T15:11:54Z

I think that AutoGPTQ's huge benefit will be in inference. I don't know about current performance because I've not yet tested that, but in design terms it feels to me to be way ahead in terms of how easy it is to use and therefore how easy it will be to implement into clients like text-generation-webui.

Like I showed in that example above, you can take existing Transformers code and make a couple of tweaks and load a GPTQ model instead. Earlier today someone on Discord asked me how they access a GPTQ-for-LLaMa model from Python code, and all I could tell them was "check out llama_inference.py and base your code around that"

AutoGPTQ on the other hand can be added to any code in minutes, for both inference and quantisation.

GPTQs are great but they're also quite a pain right now. I provide a bunch of them at https://huggingface.co/TheBloke/ and I get multiple comments a day from people who can't get them working, or for whom they run slowly or produce gibberish. There are all these different forks of GPTQ-for-LLaMa, with different performance levels and different features supported.

That's why I'm really hoping we can all get behind one system that becomes the new standard for GPTQ and supports everything for everyone. And from what I've seen so far, @PanQiWei's AutoGPTQ could be that.

Then if it does have performance or compatibility issues, I'm sure he'll work on them and improve them. He's been doing 10+ commits a day since he started the project and appears to be making great progress.

Tonight I'll do some inference and performance tests and report the results.

Ph0rk0z · 2023-04-27T12:10:22Z

GPTQ in Python just uses make_quant for all the models. Check out gptq_loader in this repo. After that, standard HF code works.

I do agree that a unified GPTQ will be best; if he can get it there, I will use it. If it is all Triton, or slower CUDA, well, I physically can't. I am cagey because so far there have been constant breaking changes or barriers, which maybe would be OK for something that didn't need an outlay of investment in HW or re-conversions/re-downloads of huge files.

I especially hope for faster working int8 inference so that the 13b models can be smarter. The perplexity difference between BnB and int4 is substantial, but BnB is slower than GPTQ. Maybe that's not so much the case on newer hardware, but I won't be able to check that until next week at the earliest.
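The make_quant approach mentioned at the start of this comment amounts to walking the model and swapping each linear layer for a quantized one before the packed weights are loaded. The sketch below shows that idea with tiny stand-in classes so that it runs without PyTorch. `Module`, `Linear`, `QuantLinear`, and `replace_linear` are all hypothetical names for this illustration; this is not the real make_quant signature, which works on torch modules and is given an explicit list of layer names.

```python
from typing import List


class Module:
    """Stand-in for a container module: holds named child modules."""

    def __init__(self, **children: "Module") -> None:
        self.children = dict(children)


class Linear(Module):
    """Stand-in for a full-precision linear layer."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features


class QuantLinear(Module):
    """Stand-in for a quantized linear layer with the same shape."""

    def __init__(self, bits: int, groupsize: int,
                 in_features: int, out_features: int) -> None:
        super().__init__()
        self.bits = bits
        self.groupsize = groupsize
        self.in_features = in_features
        self.out_features = out_features


def replace_linear(module: Module, bits: int, groupsize: int,
                   prefix: str = "") -> List[str]:
    """Recursively replace every Linear child; return the dotted names replaced."""
    replaced: List[str] = []
    for name, child in list(module.children.items()):
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, Linear):
            module.children[name] = QuantLinear(
                bits, groupsize, child.in_features, child.out_features)
            replaced.append(full_name)
        else:
            replaced.extend(replace_linear(child, bits, groupsize, full_name))
    return replaced


if __name__ == "__main__":
    model = Module(attn=Module(q_proj=Linear(8, 8)), mlp=Module(up=Linear(8, 32)))
    print(replace_linear(model, bits=4, groupsize=-1))
```

Once the layers have been swapped, the surrounding model code is unchanged, which is consistent with the comment's point that standard HF code works afterwards.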

oobabooga · 2023-05-15T23:42:52Z

I think that having one more GPTQ loader in the web UI would make things confusing. I want to deprecate the current loaders and focus on https://github.com/PanQiWei/AutoGPTQ

Add support for GPTQ-triton using the --gptq-triton flag.

d832d96

oobabooga closed this Apr 17, 2023

oobabooga reopened this Apr 17, 2023

LaaZa mentioned this pull request Apr 30, 2023

Implement support for AutoGPTQ for loading GPTQ quantized models. #1668

Merged

oobabooga closed this May 15, 2023
