
[Offload] Add offloading logic #529

Merged
dsikka merged 62 commits into main from kylesayrs/torch_offloader on Jan 21, 2026

Conversation


@kylesayrs commented on Dec 17, 2025

Purpose

  • Enable quantization of models which do not follow accelerate's offloading requirements
    • Models which access module.weight attributes directly are not compatible with accelerate's offloading, as used by the sequential pipeline
  • More robust support for adding transforms to modules which are not offloaded directly
    • Previously, transform modules could only be added to modules that were offloaded directly. This is almost always the case for linear layers, but in rare cases linear layers are offloaded by their parents. The new offloading system guarantees that transforms (and submodules in general) can be added to any layer, even non-linear layers
  • Move to an implementation which allows for more expressive offloads, namely distributed offloads and disk offloads
    • Accelerate's disk offloading leaves many features and model configurations unsupported
  • Reduce excess memory movement by decoupling device movement of parameters attached to the same module
    • Accelerate onloads all of a module's parameters at once; this implementation onloads each parameter only as needed

Prerequisites

Implementation Design

OffloadCache

The core of the offloading implementation is the OffloadCache.

class OffloadCache(MutableMapping, ABC):
  # names -> offloaded tensors (populated from _parameters or _buffers)
  offloaded_values: dict[str, torch.Tensor]
  
  # offloaded tensors -> onloaded tensors (only when offloading is disabled)
  keep_onloaded_values: ClassVar[dict[torch.Tensor, torch.Tensor]] = dict()

When offloading a module, the module's _parameters and _buffers private attributes are replaced with instances of the OffloadCache. The OffloadCache instance acts like a dictionary in which assigned tensors are offloaded and retrieved tensors are onloaded. The original (now offloaded) tensors are stored in the offloaded_values mapping.

module._parameters = cache_cls.from_mapping(module._parameters, onload_device)
module._buffers = cache_cls.from_mapping(module._buffers, onload_device)
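
As a minimal sketch of the resulting behavior (hedged; module here is any torch.nn.Module offloaded as above), ordinary attribute access and assignment now route through the cache:

# Sketch only: reads go through module._parameters, which is now an OffloadCache
w = module.weight                      # retrieval onloads the tensor to onload_device
module.weight = torch.nn.Parameter(w)  # assignment stores through the cache, offloading the tensor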

While inside the disable_offloading context, keep_onloaded_values holds strong references to the onloaded tensors, so retrieving the same tensor more than once within the context is a cache hit. Upon exiting the context, these strong references are dropped and the tensors can be offloaded again.
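
For illustration, a hedged usage sketch of the context (the exact spelling and import path of disable_offloading are assumptions based on the description above):

with disable_offloading():
    out_a = module(x)  # first access onloads parameters and stores strong references
    out_b = module(x)  # repeated access hits keep_onloaded_values; no extra device transfer
# exiting the context drops the strong references, so tensors can be offloaded again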

CPUCache

A new offload type only needs to subclass OffloadCache and define how to onload and how to offload a tensor; the two methods are inverses of one another.

class CPUCache(OffloadCache):
  offload_device = torch.device("cpu")

  def onload(self, offloaded: torch.Tensor | None) -> torch.Tensor:
    return send_tensors(offloaded, device=self.onload_device, copy=False)

  def offload(self, tensor: torch.Tensor | None) -> torch.Tensor:
    return send_tensors(tensor, device=self.offload_device, copy=False)
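
As a rough usage sketch (the constructor arguments shown are assumptions), the two methods form a round trip between the offload and onload devices:

cache = CPUCache(onload_device=torch.device("cuda:0"))  # constructor signature assumed
weight = torch.randn(4, 4, device="cuda:0")
offloaded = cache.offload(weight)   # moved to cache.offload_device ("cpu")
onloaded = cache.onload(offloaded)  # moved back to cache.onload_device
assert torch.equal(weight, onloaded)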

Dispatch Functions

There are two dispatch functions: offload_model offloads a model in a style most useful for sequential onloading, and dispatch_model offloads a model in a style most useful for autoregressive generation.
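
A hedged sketch of how the two entry points might be called (only the model argument is taken from the description above; anything beyond it would be an assumption):

offload_model(model)   # offload style suited to sequential onloading (e.g. the sequential pipeline)
dispatch_model(model)  # offload style suited to autoregressive generation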

Limitations

  • This implementation assumes that only one model is using the disable_offloading context at a time, which is true for single-threaded applications.
  • In-place updates of onloaded parameters will not be reflected in offloaded parameters. This was also a limitation of accelerate offloading. However, in addition to supporting update_offload_parameter, this implementation also supports direct parameter assignment, so no existing code needs to be changed (see the sketch below).
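
A hedged sketch of the two supported update styles referenced in the last bullet (call signatures are assumptions):

# Explicit helper, as before (signature assumed)
update_offload_parameter(module, "weight", new_weight)

# Direct assignment also works: storing through the OffloadCache offloads the tensor
module.weight = torch.nn.Parameter(new_weight)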

Testing

  • All public functions are tested by tests/test_offload
  • Tested that offloaded models are traceable
  • Tested that offloaded models work with the sequential pipeline


@brian-dellabetta left a comment

Impressive work! I would like to better understand some of the logic in src/compressed_tensors/offload/dispatch.py, but I'm leaving some additional comments for now.


@brian-dellabetta left a comment

Sorry, I forgot to submit this comment previously.

@HDCharles previously approved these changes on Jan 12, 2026

@HDCharles left a comment

See comments, but overall this looks good.

@kylesayrs requested a review from dsikka on January 18, 2026 04:23
@kylesayrs enabled auto-merge (squash) on January 21, 2026 16:59
@dsikka disabled auto-merge on January 21, 2026 20:43
@dsikka merged commit 116ce4c into main on Jan 21, 2026
3 of 4 checks passed
@dsikka deleted the kylesayrs/torch_offloader branch on January 21, 2026 20:43