Moving BF16 quantized to int4 to cuda takes ages... #367

Open
deepbeepmeep opened this issue Jan 8, 2025 · 6 comments

@deepbeepmeep

Hi

It seems the intermediate 'unpacking' step is responsible for a huge slowdown.

The main consequence is that int4 quantization with quanto is currently not usable when offloading weights.

Hopefully you can optimize that as I use quanto as my default quantization engine for my tools.

By the way, I have an RTX 4090.

Many thanks in advance

@dacorvo
Collaborator

dacorvo commented Jan 9, 2025

@deepbeepmeep thank you for your feedback. Can you detail the exact workflow you are using? If optimum-quanto detects that the weights are loaded on a GPU, it will automatically pack them in a format that is suitable for the corresponding optimized inference kernels. It means however that the weights will be repacked to a device-agnostic format if they are serialized. This is usually fast if the weights were on the GPU, but all my tests were done on an A10.
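
For reference, a minimal sketch of the GPU-side path described above, using the public quantize/freeze API ("gpt2" is just a placeholder checkpoint, not the model from the report):

```python
# Sketch only: quantize after the weights are already on the GPU, so quanto can
# pack them directly into the layout used by the optimized int4 kernels.
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")  # placeholder model
quantize(model, weights=qint4)
freeze(model)

# Serializing triggers the repack to a device-agnostic format mentioned above;
# per the comment, this repack is fast when the weights live on the GPU.
state_dict = model.state_dict()
```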

@deepbeepmeep
Author

Sorry, I didn't see you had replied.

I am simply doing a parameter.cuda(). However, my original device is 'cpu' since the weights were offloaded in RAM. Unpacking / packing on the CPU is very slow. Is there a way to skip that part if the inference kernel matches the original packed format? Or to adapt the packed format to match a specific target device? Or to transfer first and do the unpacking on the GPU?
I think some solution will be needed to make Q4 more relevant, as it is most likely to be used on systems with low VRAM, which need to offload models frequently.

If I am not mistaken, for some types (AWQ?) you skip the packing step.
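
Roughly, the offloading pattern looks like this (sketch only; "gpt2" and the transformer.h block names are placeholders, and the timer is just there to show where the unpack/repack cost lands):

```python
# Sketch of sequential offloading with int4-quantized weights: the model is
# quantized on the CPU, then blocks are moved to CUDA on demand and back.
import time
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")  # weights start in RAM
quantize(model, weights=qint4)
freeze(model)

for block in model.transformer.h:            # placeholder module layout (GPT-2)
    start = time.perf_counter()
    block.cuda()                             # slow: unpack/repack runs on the CPU
    # ... run the forward pass for this block ...
    block.cpu()                              # offload the weights back to RAM
    print(f"round trip: {time.perf_counter() - start:.3f}s")
```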

@dacorvo
Collaborator

dacorvo commented Jan 21, 2025

Unpacking / packing on the CPU is very slow.

Indeed. I would need to check the exact sequence of actions, but I think that when loading from serialized weights using QuantizedTransformersModels it should be optimized.
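
Something along these lines (sketch; this assumes the QuantizedModelForCausalLM concrete wrapper and its save_pretrained / from_pretrained round trip, with "gpt2" as a placeholder checkpoint):

```python
# Sketch of the serialized-weights path via the quantized model wrapper.
from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")               # placeholder
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4)
qmodel.save_pretrained("./gpt2-int4")

# Later, reload the already-quantized weights instead of re-quantizing:
qmodel = QuantizedModelForCausalLM.from_pretrained("./gpt2-int4")
```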

@deepbeepmeep
Author

The workflow is as follows (spelled out in the sketch at the end of this comment):

  1. Load model to "cpu" device
  2. Quantize to int4
  3. Move model to "cuda" device
  4. Wait

I had a quick look at QuantizedTransformersModels and I didn't see how the unpacking / packing is avoided, as it seems the same quantization calls are made under the hood. Besides, QuantizedTransformersModels requires models that are part of the transformers library, which may not be the case for newly released models.
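
For concreteness, the four steps above spelled out (sketch; "gpt2" is a stand-in for the real model, and the timer simply shows where the wait happens):

```python
# Sketch of the slow path: quantize on the CPU, then move everything to CUDA.
import time
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")   # 1. load on the CPU
quantize(model, weights=qint4)                         # 2. quantize to int4
freeze(model)

start = time.perf_counter()
model.to("cuda")                                       # 3. move to CUDA
torch.cuda.synchronize()
print(f"cpu -> cuda took {time.perf_counter() - start:.1f}s")  # 4. wait...
```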

@dacorvo
Collaborator

dacorvo commented Jan 21, 2025

I think the optimal workflow would be:

  1. Load model to "cpu" device
  2. Quantize to int4
  3. Serialize as state_dict / to disk (important!)
  4. Create a blank model on the meta device
  5. Call requantize with device='cuda', passing the state_dict

If you look at the code inside requantize, you will see that the key is to move the model to cuda BEFORE loading the weights.
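
A sketch of that workflow (the "gpt2" checkpoint and file names are placeholders; plain torch.save is used for the state_dict, and accelerate's init_empty_weights stands in for the meta-device step):

```python
# Sketch: serialize the quantized weights, then requantize a blank model on CUDA.
import json
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4, quantization_map, requantize

# 1-2. Load on the CPU and quantize to int4.
model = AutoModelForCausalLM.from_pretrained("gpt2")
quantize(model, weights=qint4)
freeze(model)

# 3. Serialize the state_dict and the quantization map to disk (important!).
torch.save(model.state_dict(), "quantized_state_dict.pt")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)

# 4. Recreate a blank model on the meta device.
with init_empty_weights():
    new_model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("gpt2"))

# 5. Requantize with device='cuda': the model is moved to the GPU *before* the
#    weights are loaded, so the int4 packing happens there.
state_dict = torch.load("quantized_state_dict.pt", map_location="cpu")
with open("quantization_map.json") as f:
    qmap = json.load(f)
requantize(new_model, state_dict, qmap, device=torch.device("cuda"))
```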

@deepbeepmeep
Author

Thanks for your answer, but the issue I see with this workflow is that I won't be able to do efficient offloading.

Indeed, while offloading one keeps moving tensors back and forth between the CPU and CUDA.

Using the disk as a source device will be way too slow (especially with the frequent trips incurred by sequential offloading), and optimisations such as pinned memory would be irrelevant.

As the target audience for quantization is likely the same as for offloading (both address VRAM limitations), quantization needs to offer fast CPU-CUDA transfers.

I understand that this is hard work, as the format may change depending on the device, but if you could address at least CUDA you would cover a big part of the market.

In any case, I think this issue will become prevalent when the new Blackwell GPUs become widespread, as they natively support FP4. By the way, any plan to support FP4 in quanto?
