
[Offload] Add offloading logic #529

Merged
dsikka merged 62 commits into main from kylesayrs/torch_offloader on Jan 21, 2026

Conversation


@kylesayrs commented on Dec 17, 2025

Purpose

  • Enable quantization of models which do not follow accelerate's offloading requirements
    • Models which access module.weight attributes directly are not compatible with accelerate's offloading, as used by the sequential pipeline
  • More robust support for adding transforms to modules which are not offloaded directly
    • Previously, transform modules could only be added to modules that were offloaded directly. This is almost always the case for linear layers, but in rare cases linear layers are offloaded by their parents. The new offloading system guarantees that transforms (and submodules in general) can be added to any layer, even non-linear layers
  • Move to an implementation which allows for more expressive offloads, namely distributed offloads and disk offloads
    • Accelerate's disk offloading leaves many features and model configurations unsupported
  • Reduce excess memory movement by decoupling device movement of parameters attached to the same module
    • Accelerate onloads all of a module's parameters at once; this implementation onloads each parameter only as needed

Prerequisites

Implementation Design

OffloadCache

The core of the offloading implementation is the OffloadCache.

class OffloadCache(MutableMapping, ABC):
  # names -> offloaded tensors (populated from _parameters or _buffers)
  offloaded_values: dict[str, torch.Tensor]
  
  # offloaded tensors -> onloaded tensors (only when offloading is disabled)
  keep_onloaded_values: ClassVar[dict[torch.Tensor, torch.Tensor]] = dict()

When offloading a module, the module's _parameters and _buffers private attributes are replaced with instances of the OffloadCache. The OffloadCache instance acts like a dictionary in which assigned tensors are offloaded and retrieved tensors are onloaded. The original (now offloaded) tensors are stored in the offloaded_values mapping.

module._parameters = cache_cls.from_mapping(module._parameters, onload_device)
module._buffers = cache_cls.from_mapping(module._buffers, onload_device)
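
As a minimal sketch of the resulting behavior (hedged; module here is any torch.nn.Module offloaded as above), ordinary attribute access and assignment now route through the cache:

# Sketch only: reads go through module._parameters, which is now an OffloadCache
w = module.weight                      # retrieval onloads the tensor to onload_device
module.weight = torch.nn.Parameter(w)  # assignment stores through the cache, offloading the tensor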

While inside the disable_offloading context, keep_onloaded_values holds strong references to the onloaded tensors, so retrieving the same tensor more than once within the context is a cache hit. Upon exiting the context, these strong references are dropped and the tensors can be offloaded again.
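
For illustration, a hedged usage sketch of the context (the exact spelling and import path of disable_offloading are assumptions based on the description above):

with disable_offloading():
    out_a = module(x)  # first access onloads parameters and stores strong references
    out_b = module(x)  # repeated access hits keep_onloaded_values; no extra device transfer
# exiting the context drops the strong references, so tensors can be offloaded again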

CPUCache

A new offload type only needs to subclass OffloadCache and define how to onload and how to offload a tensor; the two methods are inverses of one another.

class CPUCache(OffloadCache):
  offload_device = torch.device("cpu")

  def onload(self, offloaded: torch.Tensor | None) -> torch.Tensor:
    return send_tensors(offloaded, device=self.onload_device, copy=False)

  def offload(self, tensor: torch.Tensor | None) -> torch.Tensor:
    return send_tensors(tensor, device=self.offload_device, copy=False)
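
As a rough usage sketch (the constructor arguments shown are assumptions), the two methods form a round trip between the offload and onload devices:

cache = CPUCache(onload_device=torch.device("cuda:0"))  # constructor signature assumed
weight = torch.randn(4, 4, device="cuda:0")
offloaded = cache.offload(weight)   # moved to cache.offload_device ("cpu")
onloaded = cache.onload(offloaded)  # moved back to cache.onload_device
assert torch.equal(weight, onloaded)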

Dispatch Functions

There are two dispatch functions: offload_model offloads a model in a style most useful for sequential onloading, and dispatch_model offloads a model in a style most useful for autoregressive generation.
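
A hedged sketch of how the two entry points might be called (only the model argument is taken from the description above; anything beyond it would be an assumption):

offload_model(model)   # offload style suited to sequential onloading (e.g. the sequential pipeline)
dispatch_model(model)  # offload style suited to autoregressive generation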

Limitations

  • This implementation assumes that only one model is using the disable_offloading context at a time, which is true for single-threaded applications.
  • In-place updates of onloaded parameters will not be reflected in offloaded parameters. This was also a limitation of accelerate offloading. However, in addition to supporting update_offload_parameter, this implementation also supports direct parameter assignment, so no existing code needs to be changed (see the sketch below).
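
A hedged sketch of the two supported update styles referenced in the last bullet (call signatures are assumptions):

# Explicit helper, as before (signature assumed)
update_offload_parameter(module, "weight", new_weight)

# Direct assignment also works: storing through the OffloadCache offloads the tensor
module.weight = torch.nn.Parameter(new_weight)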

Testing

  • All public functions are tested by tests/test_offload
  • Tested that offloaded models are traceable
  • Tested that offloaded models work with the sequential pipeline


@brian-dellabetta left a comment

Impressive work! I would like to better understand some of the logic in src/compressed_tensors/offload/dispatch.py, but I'm leaving some additional comments for now.


@brian-dellabetta left a comment

Sorry, I forgot to submit this comment previously.

@HDCharles previously approved these changes on Jan 12, 2026

@HDCharles left a comment

See comments, but overall this looks good.

@kylesayrs requested a review from dsikka on January 18, 2026 04:23
@kylesayrs enabled auto-merge (squash) on January 21, 2026 16:59
@dsikka disabled auto-merge on January 21, 2026 20:43
@dsikka merged commit 116ce4c into main on Jan 21, 2026
3 of 4 checks passed
@dsikka deleted the kylesayrs/torch_offloader branch on January 21, 2026 20:43