[bitsandbytes]: support read bnb pre-quantized model #5753
mgoin merged 2 commits into vllm-project:main
Conversation
Force-pushed 69decab to 95dabaa
The changes from @thesues improve my previous work on QLoRA & BnB (#4776). It now supports models whose weights are published bnb-quantized. Beyond that, it also cleans up my previous code and fixes a bug (the previous version would run into an error in scenarios such as GQA). The changes look good to me. Could you also take a look?
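For context, a minimal sketch of what loading a bnb pre-quantized checkpoint looks like from the user side, assuming the `quantization`/`load_format` flags this PR wires up; the model name is illustrative, not taken from the thread:

```python
# Sketch only, not a verbatim test from this PR: load a checkpoint whose
# weights were published already bnb-quantized (e.g. an Unsloth 4-bit repo).
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/llama-3-8b-bnb-4bit",  # assumed pre-quantized NF4 checkpoint
    quantization="bitsandbytes",           # route linear layers through bnb kernels
    load_format="bitsandbytes",            # read the bnb weights + quant_state directly
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```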
I'm not fond of the term "prequant" here; could it be something along the lines of "quantized_checkpoint"?
What isn't supported about it? It seems like no exception is being thrown
There's a typo here; it should be "only quant_state.bitsandbytes__nf4 is supported". Other libraries such as HF Transformers support quant_state.bitsandbytes__fp4.
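To make the distinction concrete, a hedged sketch of how one might check which bnb format a checkpoint was saved with; the file path is a placeholder. HF-serialized bnb models store the quantization metadata as companion tensors whose names end in `quant_state.bitsandbytes__nf4` or `quant_state.bitsandbytes__fp4`:

```python
# Assumes a local bnb-serialized safetensors shard; scan for the quant_state
# companion tensors to see whether the checkpoint is nf4 or fp4.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    formats = {
        name.rsplit("bitsandbytes__", 1)[-1]
        for name in f.keys()
        if "quant_state.bitsandbytes__" in name
    }

# This PR only accepts nf4; fp4 checkpoints (which HF Transformers can read)
# would be rejected.
print(formats)  # e.g. {"nf4"}
```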
Ditto on pre_quant; it should refer to the checkpoint being quantized.
mgoin left a comment:
Appreciate the improvements and the ability to natively load! I think it would be great to follow up with a documentation page in the quantization section showing how to deploy bnb models directly in vLLM, perhaps straight from a quick finetune in Unsloth.
Force-pushed fb0a6d2 to b7c3aae
Sure, I added a very simple bnb.rst in the docs.
@mgoin, can you review this version?
Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)!
Force-pushed 8e891bb to bb014ef
SGTM
Is there anything I could do to improve this patch?
Any ETA for this feature to be merged? Really keen to use it!
mgoin left a comment:
Thanks for pinging, LGTM with a small docs fix
As I mentioned earlier, Looks Good To Me!
@thesues do you think you could resolve the new conflicts?
I've just tried installing from this PR's branch and testing it out, and this solution worked for me! Model loading used to produce errors, but now it seems to work great.
It would be great if someone could validate that the new Llama 3.1 8B BNB checkpoint loads: https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4 |
Meta-Llama-3.1-8B-Instruct-BNB-NF4 works as expected; logs attached.
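For anyone reproducing this, a sketch of what such a validation run might look like (the actual logs from the thread are omitted; flags as in the example above):

```python
# Hedged sketch of validating that the Llama 3.1 8B BNB NF4 checkpoint
# loads and generates; checkpoint name taken from the comment above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```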
…el/omost-llama-3-8b-4bits
Co-authored-by: Michael Goin <michael@neuralmagic.com>
done
mgoin left a comment:
Thanks a lot for testing further and sticking with this, LGTM!
On HuggingFace there are bitsandbytes pre-quantized models, such as:
...
This PR adds support for loading these pre-quantized models in vLLM.
@chenqianfzh @Yard1 @jeejeelee