
Add support for GPTQ-triton using the --gptq-triton flag.#1263

Closed
fpgaminer wants to merge 1 commit into
oobabooga:mainfrom
fpgaminer:add-gptq-triton-support-20230415b

@fpgaminer


GPTQ-triton is my WIP implementation of the GPTQ kernels in Triton, which improve inference speed on GPTQ quantized models. On short prompts it provides on average a 10% boost in performance relative to the CUDA kernels. On large prompts it is over 10x faster. It's the source of the Triton branch in GPTQ-for-LLaMa.

While GPTQ-triton is still a work in progress, I did integrate support for it in text-generation-webui and thought it might be useful to share the code, since it's a relatively simple addition. So far I've tested this integration against the latest transformers, with LLaMA 7B at 4-bit quantization and groupsize -1. Support for multi-GPU, other wbit settings, and Flash Attention is not yet implemented.
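
For readers unfamiliar with what "4-bit quantization and groupsize -1" implies, here is a minimal pure-Python sketch of the idea: eight 4-bit codes are packed into one 32-bit word, and each code is mapped back to a float with a scale and a zero point. It assumes groupsize -1 means a single scale and zero point shared across the whole input dimension. The function names and the nibble order are illustrative only; the real kernels operate on GPU tensors, and GPTQ-triton's actual memory layout may differ.

```python
def pack_int4(values):
    """Pack 4-bit integers (0..15) into 32-bit words, eight values per word."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for i, v in enumerate(values[start:start + 8]):
            if not 0 <= v <= 15:
                raise ValueError(f"{v} does not fit in 4 bits")
            word |= v << (4 * i)  # value i occupies bits 4*i .. 4*i+3
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the eight 4-bit values from each word."""
    return [(word >> (4 * i)) & 0xF for word in words for i in range(8)]


def dequantize(words, scale, zero):
    """Map packed 4-bit codes back to floats with one shared scale and zero point."""
    return [scale * (q - zero) for q in unpack_int4(words)]


codes = [0, 1, 2, 3, 12, 13, 14, 15]
packed = pack_int4(codes)
weights = dequantize(packed, scale=0.5, zero=8)
```

A quantized matrix multiply does this unpacking on the fly inside the kernel, which is why the kernel implementation (CUDA or Triton) has such a large effect on inference speed.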

If you don't feel comfortable integrating support yet, that's totally fine, just let me know what features are blockers.

Thank you.

@fpgaminer

Author

My modified Dockerfile for testing:

FROM nvidia/cuda:11.7.0-devel-ubuntu22.04 AS builder

RUN apt-get update && \
    apt-get install --no-install-recommends -y git vim build-essential python3-dev python3-venv && \
    rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/oobabooga/GPTQ-for-LLaMa /build/GPTQ-for-LLaMa
RUN git clone https://github.com/fpgaminer/GPTQ-triton /build/GPTQ-triton

WORKDIR /build

RUN python3 -m venv /build/venv
RUN . /build/venv/bin/activate && \
    pip3 install --upgrade pip setuptools build && \
    pip3 install torch torchvision torchaudio

# https://developer.nvidia.com/cuda-gpus
# for a rtx 2060: ARG TORCH_CUDA_ARCH_LIST="7.5"
ARG TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
WORKDIR /build/GPTQ-for-LLaMa
RUN . /build/venv/bin/activate && \
    python3 setup_cuda.py bdist_wheel -d .

WORKDIR /build/GPTQ-triton
RUN . /build/venv/bin/activate && \
    python3 -m build

FROM nvidia/cuda:11.7.0-runtime-ubuntu22.04

LABEL maintainer="Your Name <your.email@example.com>"
LABEL description="Docker image for GPTQ-for-LLaMa and Text Generation WebUI"

RUN apt-get update && \
    apt-get install --no-install-recommends -y git python3 python3-pip make g++ python3-dev && \
    rm -rf /var/lib/apt/lists/*

RUN --mount=type=cache,target=/root/.cache/pip pip3 install virtualenv
RUN mkdir /app

WORKDIR /app

ARG WEBUI_VERSION
RUN test -n "${WEBUI_VERSION}" && git reset --hard ${WEBUI_VERSION} || echo "Using provided webui source"

RUN virtualenv /app/venv
RUN . /app/venv/bin/activate && \
    pip3 install --upgrade pip setuptools && \
    pip3 install torch torchvision torchaudio

COPY --from=builder /build/GPTQ-for-LLaMa /app/repositories/GPTQ-for-LLaMa
RUN . /app/venv/bin/activate && \
    pip3 install /app/repositories/GPTQ-for-LLaMa/*.whl

COPY --from=builder /build/GPTQ-triton /app/repositories/GPTQ-triton
RUN . /app/venv/bin/activate && \
    pip3 install /app/repositories/GPTQ-triton/dist/*.whl

COPY extensions/api/requirements.txt /app/extensions/api/requirements.txt
COPY extensions/elevenlabs_tts/requirements.txt /app/extensions/elevenlabs_tts/requirements.txt
COPY extensions/google_translate/requirements.txt /app/extensions/google_translate/requirements.txt
COPY extensions/silero_tts/requirements.txt /app/extensions/silero_tts/requirements.txt
COPY extensions/whisper_stt/requirements.txt /app/extensions/whisper_stt/requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/api && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/elevenlabs_tts && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/google_translate && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/silero_tts && pip3 install -r requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip . /app/venv/bin/activate && cd extensions/whisper_stt && pip3 install -r requirements.txt

COPY requirements.txt /app/requirements.txt
RUN . /app/venv/bin/activate && \
    pip3 install -r requirements.txt

RUN cp /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so /app/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so

COPY . /app/
ENV CLI_ARGS=""
CMD . /app/venv/bin/activate && python3 server.py ${CLI_ARGS}

@oobabooga

Owner

This should already work.

@fpgaminer

Author

> This should already work.

How so? I don't see any support for GPTQ-triton on main?

@catalpaaa

catalpaaa commented Apr 16, 2023

Contributor

> This should already work.

the newest triton branch changed up something, can you pls check. its breaking the model loading

sry for not having time to make a pr for that

@sgsdxzy

sgsdxzy commented Apr 16, 2023

Contributor

> This should already work.
>
> the newest triton branch changed up something, can you pls check. its breaking the model loading
>
> sry for not having time to make a pr for that

I already made one #1229

@oobabooga

Owner

Closing this in favor of #1229

@oobabooga oobabooga closed this Apr 17, 2023
@fpgaminer

fpgaminer commented Apr 17, 2023

Author

> Closing this in favor of #1229

That pull request doesn't have anything to do with GPTQ-triton? Are you perhaps confusing GPTQ-triton with GPTQ-for-LLaMa? I'm not affiliated with the latter.

@sgsdxzy

sgsdxzy commented Apr 17, 2023

Contributor

> Closing this in favor of #1229
>
> That pull request doesn't have anything to do with GPTQ-triton? Are you perhaps confusing GPTQ-triton with GPTQ-for-LLaMa? I'm not affiliated with the latter.

What's the difference (advantage) of GPTQ-triton over GPTQ-for-LLaMa's triton branch?

@fpgaminer

Author

> What's the difference (advantage) of GPTQ-triton over GPTQ-for-LLaMa's triton branch?

Most of GPTQ-for-LLaMa's triton branch is copied from my code in GPTQ-triton, so it's always going to lag my work on the Triton kernels. GPTQ-triton contains more extensive testing and verification (e.g. Benchmark, Verify, ppl.py). It is also packaged, so it can be installed with pip (I'll start uploading to PyPI once things are more polished).

@sgsdxzy

sgsdxzy commented Apr 17, 2023

Contributor

OK, sorry I confused GPTQ-triton for GPTQ-for-LLaMa.

Now we have 4 different "GPTQ" implementations:

  1. ooba's GPTQ-for-LLaMa fork of "old" CUDA
  2. qwopqwop200's "new" CUDA
  3. qwopqwop200's triton
  4. fpgaminer's GPTQ-triton

each with different interfaces. And we cannot simply drop support for CUDA not only because of old weights, but more importantly for supporting Windows.
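
To make the trade-off above concrete, here is a minimal sketch of how a loader might pick between a Triton backend and a CUDA fallback. Only the `--gptq-triton` flag name comes from this pull request's title; the backend labels, `build_parser`, and `choose_backend` are hypothetical names used for illustration, not the web UI's actual code.

```python
import argparse
import platform

# Hypothetical backend labels; only --gptq-triton comes from this pull request's title.
CUDA_BACKEND = "gptq-for-llama-cuda"
TRITON_BACKEND = "gptq-triton"


def build_parser():
    """Build a parser exposing the opt-in Triton flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gptq-triton", action="store_true",
                        help="Use the Triton kernels instead of the CUDA ones.")
    return parser


def choose_backend(use_triton, system=None):
    """Return the backend label to load, keeping CUDA as the fallback on Windows."""
    system = system or platform.system()
    if use_triton and system != "Windows":
        return TRITON_BACKEND
    return CUDA_BACKEND


args = build_parser().parse_args(["--gptq-triton"])
backend = choose_backend(args.gptq_triton, system="Linux")
```

Keeping the choice in one function means each backend's differing interface can be wrapped behind a single call site, which is one way to live with several implementations at once.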

@oobabooga oobabooga reopened this Apr 17, 2023
@Ph0rk0z

Ph0rk0z commented Apr 17, 2023

Contributor

Way more forks than that. 0cc4m's one is another big one on kobold side.

I really do want to see what triton does; hope they fix it for my card soon.

I think the big points are:

  1. Compatibility (windows, old cards, etc)
  2. Formats - not everyone can requantize, llama isn't the only model in existence for 4bit
  3. Speed - what good is new and shiny when it does .2 it/s
  4. Perplexity - are the models dumb?

It's not one size fits all.
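
As a rough way to put a number on the "Speed" point, the following sketch measures iterations per second for any callable. The helper name `iterations_per_second` and the injectable `clock` parameter are assumptions made for this example; a real benchmark would call the model's generation step and also synchronize the GPU before reading the clock.

```python
import time


def iterations_per_second(step, iterations, clock=time.perf_counter):
    """Run `step` `iterations` times and return the measured it/s."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = clock()
    for _ in range(iterations):
        step()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive; use more iterations")
    return iterations / elapsed


# Demonstration with a fake clock: 10 iterations over 50 "seconds".
ticks = iter([0.0, 50.0])
rate = iterations_per_second(lambda: None, 10, clock=lambda: next(ticks))
```

With the fake clock above, `rate` comes out at 0.2 it/s, the figure mentioned in the list.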

@TheBloke

TheBloke commented Apr 26, 2023

Contributor

There's now also PanQiWei's AutoGPTQ which provides a really nice transformers-style interface to GPTQ creation and inference. Supports Llama, GPT-J, GPT-NeoX, and others.

It just added Triton support (on top of good CUDA support) and is working on other features as well.

IMHO this is the future of GPTQ. It's so much easier to use and I hope it becomes the standard.

Example of loading a quantised model and doing inference:

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextGenerationPipeline

tokenizer = AutoTokenizer.from_pretrained("/workspace/koalaGPTQ", use_fast=False)

model = AutoGPTQForCausalLM.from_quantized("/workspace/koalaGPTQ", device="cuda:0")

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device="cuda:0")
generated_text = pipeline(
    "AI is the future because",
    return_full_text=False,
    num_beams=1,
    max_length=128
)[0]['generated_text']

print(generated_text)

Right now the GPTQs produced seem to score fractionally lower than GPTQ-for-LLaMa on perplexity but they're looking into that.
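
For context on the perplexity comparison, perplexity is the exponential of the mean negative log-likelihood per token, and lower values are better. The sketch below computes it from a list of per-token natural-log probabilities; the function name and the toy inputs are illustrative, and a real evaluation would take those log-probabilities from the model's output on a held-out dataset.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has a perplexity of about 4.
uniform = [math.log(1 / 4)] * 10
ppl = perplexity(uniform)
```

Because the metric is an exponential of an average, small differences in per-token log-likelihood between two quantizers show up as fractional differences in perplexity.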

@Ph0rk0z

Ph0rk0z commented Apr 26, 2023

Contributor

It's definitely not the future for old cards or windows. Plus the newer cuda branch of GPTQ is painfully slow. It's definitely convenient for quantization.

@TheBloke

TheBloke commented Apr 26, 2023

Contributor

> It's definitely not the future for old cards or windows.

Why? AutoGPTQ supports CUDA as well, and he's working on creating a PyPi package with pre-built binaries. Ideal for Windows!

> Plus the newer cuda branch of GPTQ is painfully slow. It's definitely convenient for quantization.

To be clear, I'm talking about AutoGPTQ, not GPTQ-for-LLaMa. Have you tried AutoGPTQ and found it slow for inference? I've not really tested its inference performance yet. But its quantisation performance - tested on CUDA - seems quite a bit faster than GPTQ-for-LLaMa to me.

@Ph0rk0z

Ph0rk0z commented Apr 26, 2023

Contributor

I haven't yet. I've only used the old-commit GPTQ forks, like the ones 0cc4m and oobabooga made, or the 4bit lora repo's autograd for inference.

It seems after gptqForLlama added group size + act order to the cuda kernel it cut inference speed by 2/3 or at least 1/2. I thought autoGPTQ was more for quantizing the models than inference and inference performance would be identical to new cuda.

I'm for sure willing to try it, any speedup will help. Windows doesn't support triton at all though and neither does my pascal card. Plus it's made by OpenAI, ewww.

@TheBloke

Contributor

I think that AutoGPTQ's huge benefit will be in inference. I don't know about current performance because I've not yet tested that, but in design terms it feels to me to be way ahead in terms of how easy it is to use and therefore how easy it will be to implement into clients like text-generation-webui.

Like I showed in that example above, you can take existing Transformers code and make a couple of tweaks and load a GPTQ model instead. Earlier today someone on Discord asked me how they access a GPTQ-for-LLaMa model from Python code, and all I could tell them was "check out llama_inference.py and base your code around that"

AutoGPTQ on the other hand can be added to any code in minutes, for both inference and quantisation.

GPTQs are great but they're also quite a pain right now. I provide a bunch of them at https://huggingface.co/TheBloke/ and I have multiple comments a day from people who can't get them working or they work slow or produce gibberish. There's all these different forks of GPTQ-for-LLaMa, with different performance levels and different features supported.

That's why I'm really hoping we can all get behind one system that becomes the new standard for GPTQ and supports everything for everyone. And from what I've seen so far, @PanQiWei's AutoGPTQ could be that.

Then if it does have performance or compatibility issues, I'm sure he'll work on them and improve them. He's been doing 10+ commits a day since he started the project and appears to be making great progress.

Tonight I'll do some inference and performance tests and report the results.

@Ph0rk0z

Ph0rk0z commented Apr 27, 2023

Contributor

GPTQ in python just uses make_quant for all the models. Check out gptq_loader in this repo. After that, standard HF code works.
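
As a toy illustration of the layer-swapping idea behind a helper like make_quant, the sketch below walks a module tree and replaces every linear layer with a quantized stand-in. All classes and the `swap_linear_layers` function are hypothetical stand-ins written for this example; they are not the real make_quant signature or the repo's gptq_loader, which operate on PyTorch modules and then load the quantized checkpoint into the replaced layers.

```python
class Linear:
    """Stand-in for a full-precision linear layer."""
    def __init__(self, name):
        self.name = name


class QuantLinear:
    """Stand-in for a quantized replacement layer."""
    def __init__(self, name, bits, groupsize):
        self.name = name
        self.bits = bits
        self.groupsize = groupsize


class Block:
    """Stand-in for a container module holding child layers as attributes."""
    def __init__(self, **children):
        for attr, child in children.items():
            setattr(self, attr, child)


def swap_linear_layers(module, bits, groupsize, prefix=""):
    """Recursively replace every Linear attribute with a QuantLinear; return replaced names."""
    replaced = []
    for attr, child in list(vars(module).items()):
        full_name = f"{prefix}.{attr}" if prefix else attr
        if isinstance(child, Linear):
            setattr(module, attr, QuantLinear(full_name, bits, groupsize))
            replaced.append(full_name)
        elif isinstance(child, Block):
            replaced.extend(swap_linear_layers(child, bits, groupsize, full_name))
    return replaced


model = Block(attn=Block(q_proj=Linear("q"), k_proj=Linear("k")), head=Linear("head"))
names = swap_linear_layers(model, bits=4, groupsize=-1)
```

Once the swap is done, the rest of the model is unchanged, which is consistent with the remark that standard HF code works afterwards.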

I do agree that a unified GPTQ will be best if he can get it there, then I will use it. If it is all triton, or slower cuda.. well I physically can't. I am cagey because so far, there have been constant breaking changes or barriers. Which maybe would be OK for something that didn't need an outlay of investment in HW or re-conversions/re-downloads of huge files.

Especially hope for faster working int8 inference so that the 13b models can be smarter. The perplexity difference between BnB and int4 is substantial. But BnB is slower than gptq. Maybe that's not so much the case on newer hardware but won't be able to check that till next week at earliest.

@oobabooga

Owner

I think that having one more GPTQ loader in the web UI would make things confusing. I want to deprecate the current loaders and focus on https://github.com/PanQiWei/AutoGPTQ

@oobabooga oobabooga closed this May 15, 2023
6 participants