Moving BF16 quantized to int4 to cuda takes ages... #367

Open
deepbeepmeep opened this issue Jan 8, 2025 · 6 comments

@deepbeepmeep

Hi

It seems the intermediate 'unpacking' step is responsible for a huge slowdown.

The main consequence is that int4 quantization with quanto is currently not usable when offloading weights.

Hopefully you can optimize that as I use quanto as my default quantization engine for my tools.

By the way, I have an RTX 4090.

Many thanks in advance

@dacorvo
Collaborator

dacorvo commented Jan 9, 2025

@deepbeepmeep thank you for your feedback. Can you detail the exact workflow you are using? If optimum-quanto detects that the weights are loaded on a GPU, it will automatically pack them in a format that is suitable for the corresponding optimized inference kernels. It means however that the weights will be repacked to a device-agnostic format if they are serialized. This is usually fast if the weights were on the GPU, but all my tests were done on an A10.
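
For reference, a minimal sketch of the GPU-side path described above, using the public quantize/freeze API ("gpt2" is just a placeholder checkpoint, not the model from the report):

```python
# Sketch only: quantize after the weights are already on the GPU, so quanto can
# pack them directly into the layout used by the optimized int4 kernels.
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")  # placeholder model
quantize(model, weights=qint4)
freeze(model)

# Serializing triggers the repack to a device-agnostic format mentioned above;
# per the comment, this repack is fast when the weights live on the GPU.
state_dict = model.state_dict()
```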

@deepbeepmeep
Author

Sorry, I didn't see you had replied.

I am simply doing a parameter.cuda(). However, my original device is 'cpu' since the weights were offloaded in RAM. Unpacking / packing on the CPU is very slow. Is there a way to skip that part if the inference kernel matches the original packed format? Or to adapt the packed format to match a specific target device? Or to transfer first and do the unpacking on the GPU?
I think some solution will be needed to make Q4 more relevant, as it is most likely to be used on systems with low VRAM, which need to offload models frequently.

If I am not mistaken, for some types (AWQ?) you skip the packing step.
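
Roughly, the offloading pattern looks like this (sketch only; "gpt2" and the transformer.h block names are placeholders, and the timer is just there to show where the unpack/repack cost lands):

```python
# Sketch of sequential offloading with int4-quantized weights: the model is
# quantized on the CPU, then blocks are moved to CUDA on demand and back.
import time
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")  # weights start in RAM
quantize(model, weights=qint4)
freeze(model)

for block in model.transformer.h:            # placeholder module layout (GPT-2)
    start = time.perf_counter()
    block.cuda()                             # slow: unpack/repack runs on the CPU
    # ... run the forward pass for this block ...
    block.cpu()                              # offload the weights back to RAM
    print(f"round trip: {time.perf_counter() - start:.3f}s")
```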

@dacorvo
Collaborator

dacorvo commented Jan 21, 2025

Unpacking / packing on the CPU is very slow.

Indeed. I would need to check the exact sequence of actions, but I think that when loading from serialized weights using QuantizedTransformersModels it should be optimized.
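
Something along these lines (sketch; this assumes the QuantizedModelForCausalLM concrete wrapper and its save_pretrained / from_pretrained round trip, with "gpt2" as a placeholder checkpoint):

```python
# Sketch of the serialized-weights path via the quantized model wrapper.
from transformers import AutoModelForCausalLM
from optimum.quanto import QuantizedModelForCausalLM, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")               # placeholder
qmodel = QuantizedModelForCausalLM.quantize(model, weights=qint4)
qmodel.save_pretrained("./gpt2-int4")

# Later, reload the already-quantized weights instead of re-quantizing:
qmodel = QuantizedModelForCausalLM.from_pretrained("./gpt2-int4")
```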

@deepbeepmeep
Author

The workflow is as follows (spelled out in the sketch at the end of this comment):

  1. Load model to "cpu" device
  2. Quantize to int4
  3. Move model to "cuda" device
  4. Wait

I had a quick look at QuantizedTransformersModels and I didn't see how the unpacking / packing is avoided, as it seems the same quantization calls are made under the hood. Besides, QuantizedTransformersModels requires models that are part of the transformers library, which may not be the case for newly released models.
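
For concreteness, the four steps above spelled out (sketch; "gpt2" is a stand-in for the real model, and the timer simply shows where the wait happens):

```python
# Sketch of the slow path: quantize on the CPU, then move everything to CUDA.
import time
import torch
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4

model = AutoModelForCausalLM.from_pretrained("gpt2")   # 1. load on the CPU
quantize(model, weights=qint4)                         # 2. quantize to int4
freeze(model)

start = time.perf_counter()
model.to("cuda")                                       # 3. move to CUDA
torch.cuda.synchronize()
print(f"cpu -> cuda took {time.perf_counter() - start:.1f}s")  # 4. wait...
```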

@dacorvo
Collaborator

dacorvo commented Jan 21, 2025

I think the optimal workflow would be:

  1. Load model to "cpu" device
  2. Quantize to int4
  3. Serialize as state_dict / to disk (important!)
  4. Create a blank model on the meta device
  5. Call requantize with device='cuda', passing the state_dict

If you look at the code inside requantize, you will see that the key is to move the model to cuda BEFORE loading the weights.
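
A sketch of that workflow (the "gpt2" checkpoint and file names are placeholders; plain torch.save is used for the state_dict, and accelerate's init_empty_weights stands in for the meta-device step):

```python
# Sketch: serialize the quantized weights, then requantize a blank model on CUDA.
import json
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint4, quantization_map, requantize

# 1-2. Load on the CPU and quantize to int4.
model = AutoModelForCausalLM.from_pretrained("gpt2")
quantize(model, weights=qint4)
freeze(model)

# 3. Serialize the state_dict and the quantization map to disk (important!).
torch.save(model.state_dict(), "quantized_state_dict.pt")
with open("quantization_map.json", "w") as f:
    json.dump(quantization_map(model), f)

# 4. Recreate a blank model on the meta device.
with init_empty_weights():
    new_model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained("gpt2"))

# 5. Requantize with device='cuda': the model is moved to the GPU *before* the
#    weights are loaded, so the int4 packing happens there.
state_dict = torch.load("quantized_state_dict.pt", map_location="cpu")
with open("quantization_map.json") as f:
    qmap = json.load(f)
requantize(new_model, state_dict, qmap, device=torch.device("cuda"))
```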

@deepbeepmeep
Author

Thanks for your answer, but the issue I see with this workflow is that I won't be able to do efficient offloading.

Indeed, while offloading one keeps moving tensors back and forth between the CPU and CUDA.

Using the disk as a source device will be way too slow (especially with the frequent trips incurred by sequential offloading), and optimisations such as pinned memory would be irrelevant.

As the target audience for quantization is likely the same as for offloading (both address VRAM limitations), quantization needs to offer fast CPU-CUDA transfers.

I understand that this is hard work, as the format may change depending on the device, but if you could address at least CUDA you would cover a big part of the market.

In any case, I think this issue will become prevalent when the new Blackwell GPUs become widespread, as they natively support FP4. By the way, any plan to support FP4 in quanto?
