tracking torch.compile compatibility with cpu offloading #10612
Comments
cc @dsikka
Hey @youkaichao - are there any specific quantization methods that are failing? We ran into this problem when originally refactoring the quantization parameters. Inside …
The issue is quite complicated.
Although we reset the loaded weights into a raw nn.Parameter later, it turns out tensor subclass initialization is quite involved. The moment we create …
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
v1 cpu offloading will be compatible with …
Your current environment
N/A
Model Input Dumps
No response
🐛 Describe the bug
When we use cpu offloading together with `torch.compile`, it will error. The error is caused by this line:

vllm/vllm/model_executor/models/utils.py, line 482 in 49628fe

Creating a state dict during forward will error.
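For context, here is a minimal sketch of the offloading pattern the referenced line implements (simplified and hedged; the wrapper below is illustrative, not the exact vLLM code):

```python
# Minimal sketch of the offloading pattern described above (simplified;
# not the exact vLLM code). The wrapped forward rebuilds a device-side
# state dict on every call, which is what torch.compile trips over.
import torch
from torch.func import functional_call


def wrap_with_cpu_offloading(module: torch.nn.Module, device: torch.device):
    original_forward = module.forward

    def forward(*args, **kwargs):
        # Temporarily restore the original forward so functional_call
        # does not recurse back into this wrapper.
        module.forward = original_forward
        # Building a state dict and moving tensors inside forward is the
        # problematic part when the model is compiled.
        device_state = {
            k: v.to(device, non_blocking=True)
            for k, v in module.state_dict().items()
        }
        output = functional_call(module, device_state, args=args, kwargs=kwargs)
        module.forward = forward
        return output

    module.forward = forward
    return module
```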
I tried another approach of using tensor subclasses in #10609. It works well for unquantized models, but does not work for quantized models.
The problem with quantized models is that we have some classes that inherit from `torch.nn.Parameter`, e.g.:

vllm/vllm/model_executor/parameter.py, line 19 in 49628fe
Using both tensor subclasses and parameter subclasses together is a known problem in PyTorch. See https://github.com/albanD/subclass_zoo/blob/main/custom_parameter.py for an example.
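As a minimal illustration, a simplified stand-in for the Parameter-subclass pattern in question (not the actual class in vllm/model_executor/parameter.py); wrapping instances of such a class in an offloading tensor subclass, as tried in #10609, is where the interaction problem shows up:

```python
# A simplified stand-in for the Parameter subclasses in
# vllm/model_executor/parameter.py (illustrative, not the actual definition).
import torch
from torch.nn import Parameter


class WeightLoaderParameter(Parameter):
    """Parameter subclass that carries weight-loading metadata."""

    def __new__(cls, data: torch.Tensor, weight_loader=None):
        # Parameter.__new__ wraps the underlying tensor.
        self = super().__new__(cls, data, requires_grad=False)
        # Extra metadata rides along on the parameter instance.
        self.weight_loader = weight_loader
        return self
```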
To make `torch.compile` compatible with cpu offloading and quantization, we need to refactor the weight loading logic and how we create/store weights. Take the GPTQ linear layer for example:
We should avoid using `nn.Parameter`, and directly register the tensor as a buffer. The key ideas are (see the sketch after this list):

- No `nn.Parameter` and no class inheritance: we directly assign a tensor to the module, and the tensor should be registered as a buffer.
- A plain `weight_loader` attribute: we can bind arguments needed for weight loading, e.g. `self.qweight.weight_loader = partial(generic_weight_loader, args)`.
- Attributes needed later are stored on the module, e.g. `self.qweight_packed_factor = self.quant_config.pack_factor`.
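A hedged sketch of what a GPTQ-style linear layer could look like under this scheme (`generic_weight_loader` and the exact shapes/dtypes are illustrative assumptions, not existing vLLM code):

```python
# Hedged sketch of the buffer-based proposal. generic_weight_loader and the
# exact shapes/dtypes are illustrative assumptions, not existing vLLM code.
from functools import partial

import torch
import torch.nn as nn


def generic_weight_loader(param: torch.Tensor, loaded_weight: torch.Tensor,
                          pack_factor: int = 1) -> None:
    # Plain function that copies checkpoint data into the buffer; extra
    # arguments (here pack_factor) are bound with functools.partial.
    param.copy_(loaded_weight)


class GPTQLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, pack_factor: int):
        super().__init__()
        # Plain tensors registered as buffers: no nn.Parameter, no subclassing.
        self.register_buffer(
            "qweight",
            torch.empty(in_features // pack_factor, out_features,
                        dtype=torch.int32),
        )
        self.register_buffer(
            "scales", torch.empty(out_features, dtype=torch.float16)
        )
        # Metadata lives on the module, not on a Parameter subclass.
        self.qweight_packed_factor = pack_factor
        # Weight-loading logic is a plain attribute bound onto the tensor.
        self.qweight.weight_loader = partial(generic_weight_loader,
                                             pack_factor=pack_factor)
        self.scales.weight_loader = generic_weight_loader
```

Since the weights are plain buffers, no Parameter-subclass construction is involved, which is what should let cpu offloading and `torch.compile` compose with quantized layers.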
With all these changes, we should be able to use cpu offloading together with quantization and `torch.compile`.

Before submitting a new issue...