tracking torch.compile compatibility with cpu offloading #10612
Comments
cc @dsikka
Hey @youkaichao - are there any specific quantization methods that are failing? We ran into this problem when originally refactoring the quantization parameters. Inside …
The issue is quite complicated.
Although we reset the loaded weights into a raw nn.Parameter later, it turns out tensor subclass initialization is quite involved. The moment we create …
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
v1 cpu offloading will be compatible with …
Your current environment
N/A
Model Input Dumps
No response
🐛 Describe the bug
When we use cpu offloading together with `torch.compile`, it will error. The error is caused by this line:

vllm/vllm/model_executor/models/utils.py, line 482 in 49628fe

Creating a state dict during forward will error.
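For context, here is a minimal sketch of the offloading pattern the referenced line implements (simplified and hedged; the wrapper below is illustrative, not the exact vLLM code):

```python
# Minimal sketch of the offloading pattern described above (simplified;
# not the exact vLLM code). The wrapped forward rebuilds a device-side
# state dict on every call, which is what torch.compile trips over.
import torch
from torch.func import functional_call


def wrap_with_cpu_offloading(module: torch.nn.Module, device: torch.device):
    original_forward = module.forward

    def forward(*args, **kwargs):
        # Temporarily restore the original forward so functional_call
        # does not recurse back into this wrapper.
        module.forward = original_forward
        # Building a state dict and moving tensors inside forward is the
        # problematic part when the model is compiled.
        device_state = {
            k: v.to(device, non_blocking=True)
            for k, v in module.state_dict().items()
        }
        output = functional_call(module, device_state, args=args, kwargs=kwargs)
        module.forward = forward
        return output

    module.forward = forward
    return module
```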
I tried another approach of using tensor subclasses in #10609. It works well for unquantized models, but does not work for quantized models.
The problem with quantized models is that we have some classes that inherit from `torch.nn.Parameter`, e.g.:

vllm/vllm/model_executor/parameter.py, line 19 in 49628fe
Using both tensor subclasses and parameter subclasses together is a known problem in PyTorch. See https://github.com/albanD/subclass_zoo/blob/main/custom_parameter.py for an example.
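As a minimal illustration, a simplified stand-in for the Parameter-subclass pattern in question (not the actual class in vllm/model_executor/parameter.py); wrapping instances of such a class in an offloading tensor subclass, as tried in #10609, is where the interaction problem shows up:

```python
# A simplified stand-in for the Parameter subclasses in
# vllm/model_executor/parameter.py (illustrative, not the actual definition).
import torch
from torch.nn import Parameter


class WeightLoaderParameter(Parameter):
    """Parameter subclass that carries weight-loading metadata."""

    def __new__(cls, data: torch.Tensor, weight_loader=None):
        # Parameter.__new__ wraps the underlying tensor.
        self = super().__new__(cls, data, requires_grad=False)
        # Extra metadata rides along on the parameter instance.
        self.weight_loader = weight_loader
        return self
```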
To make `torch.compile` compatible with cpu offloading and quantization, we need to refactor the weight loading logic and how we create/store weights. Take the GPTQ linear layer for example:
We should avoid using `nn.Parameter`, and directly register the tensor as a buffer. The key ideas are (see the sketch after this list):

- No `nn.Parameter` and no class inheritance: we directly assign a tensor to the module, and the tensor should be registered as a buffer.
- A plain `weight_loader` attribute: we can bind arguments needed for weight loading, e.g. `self.qweight.weight_loader = partial(generic_weight_loader, args)`.
- Attributes needed later are stored on the module, e.g. `self.qweight_packed_factor = self.quant_config.pack_factor`.
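A hedged sketch of what a GPTQ-style linear layer could look like under this scheme (`generic_weight_loader` and the exact shapes/dtypes are illustrative assumptions, not existing vLLM code):

```python
# Hedged sketch of the buffer-based proposal. generic_weight_loader and the
# exact shapes/dtypes are illustrative assumptions, not existing vLLM code.
from functools import partial

import torch
import torch.nn as nn


def generic_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor,
                          pack_factor: int = 1) -> None:
    # Plain function that copies checkpoint data into the buffer; extra
    # arguments (here pack_factor) are bound with functools.partial.
    param.copy_(loaded_weight)


class GPTQLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, pack_factor: int):
        super().__init__()
        # Plain tensors registered as buffers: no nn.Parameter, no subclassing.
        self.register_buffer(
            "qweight",
            torch.empty(in_features // pack_factor, out_features,
                        dtype=torch.int32),
        )
        self.register_buffer(
            "scales", torch.empty(out_features, dtype=torch.float16)
        )
        # Metadata lives on the module, not on a Parameter subclass.
        self.qweight_packed_factor = pack_factor
        # Weight-loading logic is a plain attribute bound onto the tensor.
        self.qweight.weight_loader = partial(generic_weight_loader,
                                             pack_factor=pack_factor)
        self.scales.weight_loader = generic_weight_loader
```

Since the weights are plain buffers, no Parameter-subclass construction is involved, which is what should let cpu offloading and `torch.compile` compose with quantized layers.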
With all these changes, we should be able to use cpu offloading together with quantization and `torch.compile`.

Before submitting a new issue...