Moving BF16 quantized to int4 to cuda takes ages... #367
@deepbeepmeep thank you for your feedback. Can you detail the exact workflow you are using? If …
Sorry, I didn't see you had replied. I am simply doing a `parameter.cuda()`. However, my original device is 'cpu', since the weights were offloaded to RAM. Unpacking / packing on the CPU is very slow. Is there a way to skip that part if the inference kernel matches the original packed format? Or to adapt the packed format to match a specific target device? Or to transfer first and do the unpacking on the GPU? If I am not mistaken, for some types (AWQ?) you skip the pack…
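For reference, a minimal sketch of the round trip being described (assumptions: optimum-quanto's `quantize`/`freeze`/`qint4` API and a toy two-layer model standing in for the real one; the timing is only illustrative):

```python
import time

import torch
from torch import nn
from optimum.quanto import freeze, qint4, quantize

# Toy stand-in for an offloaded model: quantize to int4 and freeze on the CPU.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))
quantize(model, weights=qint4)
freeze(model)  # materializes the packed int4 weights on the CPU

# The reported slow step: moving the packed weights to CUDA, which currently
# goes through a CPU-side unpack/repack because the packed format is device-dependent.
start = time.perf_counter()
model.to("cuda")
torch.cuda.synchronize()
print(f"CPU -> CUDA move of int4 weights: {time.perf_counter() - start:.2f} s")
```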
Indeed. I would need to check the exact sequence of actions, but I think that when loading from serialized weights using …
The workflow is as follows: …
I had a quick look at QuantizedTransformersModels and I didn't see how the unpacking / packing is avoided, as it seems that the same quantization calls are made under the hood. Besides, QuantizedTransformersModels requires models that are part of the transformers library, which may not be the case with newly released models.
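For context, a hedged sketch of how that helper is typically used, based on the `QuantizedModelForCausalLM` example in the optimum-quanto README (the model id is a placeholder and exact class names/signatures may differ across versions). As the comment points out, this reuses already-quantized weights at reload time but only covers architectures that transformers knows about, and it does not by itself remove the device-dependent repack when moving to CUDA:

```python
from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4

# Quantize a transformers model once, then serialize the already-packed int4 weights.
model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model id
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4)
qmodel.save_pretrained("./gpt2-int4")

# Reload later without redoing the bf16 -> int4 conversion.
qmodel = QuantizedModelForCausalLM.from_pretrained("./gpt2-int4")
```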
I think the optimal workflow would be: …
If you look at the code inside …
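As a complement, a hedged sketch of the serialize-then-reload path for arbitrary `nn.Module`s, assuming the `quantization_map`/`requantize` helpers documented in the optimum-quanto README (`model` is an already quantized and frozen module, and `build_model()` is a hypothetical constructor for the same architecture). Reloading directly on the target device avoids requantizing, but it does not cover the repeated CPU<->CUDA trips raised in the next reply:

```python
import json

import torch
from safetensors.torch import load_file, save_file
from optimum.quanto import quantization_map, requantize

# After quantize()/freeze(): serialize the packed weights plus the quantization map.
save_file(model.state_dict(), "model.safetensors")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)

# Later: rebuild the architecture on the meta device and reload the serialized
# weights directly on the target device.
state_dict = load_file("model.safetensors")
with open("quantization_map.json") as f:
    qmap = json.load(f)
with torch.device("meta"):
    new_model = build_model()  # hypothetical constructor
requantize(new_model, state_dict, qmap, device=torch.device("cuda"))
```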
Thanks for your answer, but the issue I see with this workflow is that it won't allow efficient offloading. While offloading, tensors keep moving back and forth between CPU and CUDA. Using the disk as the source device would be far too slow (especially with the frequent trips incurred by sequential offloading), and optimisations such as pinned memory would become irrelevant. Since the target audience for quantization is likely the same as for offloading (both address VRAM limitations), quantization needs to offer fast CPU-CUDA transfers. I understand it is hard work, as the packed format may change depending on the device, but if you could address at least CUDA you would cover a big part of the market. In any case, I think this issue will become prevalent once the new Blackwell GPUs are widespread, since they natively support FP4. By the way, any plan to support FP4 in quanto?
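To illustrate the offloading pattern described above (plain PyTorch, no quanto involved): pinned host memory and non-blocking copies only pay off if the tensor can be sent to the GPU as-is, which is exactly what a CPU-side repack prevents:

```python
import torch

# Keep the offloaded copy in pinned (page-locked) host memory so host-to-device
# copies can run asynchronously on a side stream while compute continues.
cpu_weight = torch.empty(4096, 4096, dtype=torch.bfloat16).pin_memory()

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    gpu_weight = cpu_weight.to("cuda", non_blocking=True)  # async copy, no repacking

# Make the default stream wait for the prefetch before using the weight.
torch.cuda.current_stream().wait_stream(copy_stream)
```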
Hi
It seems the intermediate 'unpacking' step is responsible for a huge slowdown.
The main consequence is that int4 quantization with quanto is currently not usable when offloading weights.
Hopefully you can optimize that, as I use quanto as my default quantization engine for my tools.
By the way, I have an RTX 4090.
Many thanks in advance
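One hedged way to confirm that the unpacking step dominates is to profile the move itself and look at the top CPU-side ops (`model` stands for any int4-quantized module, e.g. the one from the first sketch above):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile the CPU -> CUDA move of the quantized module and inspect where CPU time goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.cuda()
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```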