Skip to content

fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations#5730

Merged
qgallouedec merged 4 commits into
huggingface:mainfrom
butterwecksolutions:bnb-dequant-buffer-cleanup
May 9, 2026
Merged

fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations#5730
qgallouedec merged 4 commits into
huggingface:mainfrom
butterwecksolutions:bnb-dequant-buffer-cleanup

Conversation

@butterwecksolutions

@butterwecksolutions butterwecksolutions commented May 8, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

The OffloadActivations context manager leaks VRAM through two independent paths.
Both are fixed by cleaning up stale state in __enter__, where the previous
backward is guaranteed to have completed.

Path 1 — Stale state leaks across steps (MoE + compile)

On MoE architectures with sample packing + torch.compile, dynamic expert routing
leaves saved tensors on subgraphs that never contribute to loss. Their backward
nodes never execute, so tracker, storage_to_tensor_id, and the stashes retain
entries from step to step. ~60 tensors leak per micro-step → OOM by step 2
(e.g. Gemma-4 26B-A4B + ScatterMoE, axolotl-ai-cloud/axolotl#3638).

Path 2 — BNB dequantization buffers (QLoRA)

After #5700 fixed the CUDA stream leak, QLoRA (BNB 4-bit) still leaks because
tracker retains references to tensors sharing allocator blocks with BNB
dequant buffers, and empty_cache() is never called between steps.
~0.6 GiB/step → OOM after 30-40 steps on 24 GB GPUs.

Why __enter__ is the right place

The class example shows backward() running after the with block exits:

with OffloadActivations():
outputs = model(inputs, labels=labels)
loss = outputs.loss
loss.backward()

This means exit may fire BEFORE backward — the tracker still holds
active tensors. Cleaning up there would destroy data the backward needs.

enter, however, is called before the NEXT forward pass. By that point
the previous backward has already completed, so any remaining
state is leaked and safe to drop.

What the fix does

enter now clears:

  • tracker, storage_to_tensor_id — release stale tensor references
  • tensor_id, is_first_forward_call, is_first_backward_call — state reset
  • Stream stashes (when use_streams=True)
  • Calls accelerator-aware empty_cache() when bitsandbytes is loaded
    (via sys.modules, to avoid penalizing non-BNB workloads)

exit is unchanged from #5700 — stream sync + super().exit().

Pattern Trigger Symptom Covered by
MoE + packed + compile axolotl#3638 OOM step 2 __enter__
QLoRA BNB buffer QLoRA training OOM step 30-40 __enter__
CUDA stream garbage #5700 OOM step 19 __exit__ / #5700 (merged)

A complete audit confirms no further leak sources.

Verification

Validated via monkey-patch on QLoRA 9B VL training. Full reproducer available,
will be updated after review feedback.

Capability #5700 (merged) #5738 #5730
CUDA stream sync + stash clear
tracker + storage_to_tensor_id clear
stash clear (use_streams guard)
is_first_forward/backward reset
tensor_id reset
BNB dequant buffer empty_cache
CUDA stream garbage leak
MoE + compile leak path
QLoRA buffer leak path

  • This PR fixes a typo or improves the docs

  • Read the contributor guideline

  • Discussed via GitHub issue (N/A — found during QLoRA debugging)

  • Documentation update needed? No — internal behavior change only

  • New tests? Not easily testable — requires CUDA memory snapshot assertions

  • AI-assisted: PR text and structure refined with AI; root cause found
    through human debugging


Note

Medium Risk
Touches OffloadActivations lifecycle/memory-management logic; incorrect clearing or cache flushing could impact training correctness or performance, though changes are scoped to context entry.

Overview
Fixes activation-offloading VRAM leaks by resetting OffloadActivations state at context entry: clears tracker/dedup maps, resets forward/backward bookkeeping (including tensor_id), and drops stream stashes when enabled.

When bitsandbytes is loaded, __enter__ now calls the appropriate accelerator empty_cache() (cuda/xpu/npu) to release BNB dequantization buffers between steps. Adds a regression test ensuring tracker size does not grow across repeated forward/backward steps with an unused graph branch.

Reviewed by Cursor Bugbot for commit 22b5210. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread trl/models/activation_offloading.py Outdated
@butterwecksolutions butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch from 23e7196 to 660a88d Compare May 8, 2026 01:09
Comment thread trl/models/activation_offloading.py Outdated
@butterwecksolutions butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch 2 times, most recently from a23df14 to 8c2ef2d Compare May 8, 2026 05:56
Comment thread trl/models/activation_offloading.py Outdated
Comment thread trl/models/activation_offloading.py Outdated
@butterwecksolutions butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch 2 times, most recently from 33d8d2b to 398210e Compare May 9, 2026 06:31
@butterwecksolutions butterwecksolutions changed the title fix: CUDA memory leak / release BNB dequantization buffers after activation offloading __exit__ fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations May 9, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 398210e. Configure here.

Comment thread trl/models/activation_offloading.py Outdated
Comment thread trl/models/activation_offloading.py
@butterwecksolutions butterwecksolutions mentioned this pull request May 9, 2026
8 tasks
@butterwecksolutions butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch 2 times, most recently from 619e393 to ed25785 Compare May 9, 2026 06:56
Two independent VRAM leak paths in OffloadActivations are fixed
by cleaning up stale state and releasing allocator cache blocks
in __enter__, where the previous backward has already completed:

1. MoE + sample_packing + torch.compile — saved tensors on subgraphs
   whose backward nodes never execute leak ~60 tensors/micro-step
   because the unpack-then-delete logic never fires for them.

2. QLoRA BNB 4-bit dequantization buffers — tracker references keep
   allocator blocks alive across steps, and empty_cache() is never
   called (~0.6 GiB/step, OOM after 30-40 steps on 24 GB GPUs).

__enter__ clears tracker, storage_to_tensor_id, tensor_id, stashes,
and calls accelerator-aware empty_cache() (conditional on bitsandbytes
in sys.modules to avoid penalizing non-BNB workloads).

__exit__ handles stream sync and stash cleanup as before (huggingface#5700).
All cleanup uses explicit if/elif dispatch matching the file's
established accelerator pattern.
@butterwecksolutions butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch from ed25785 to cdcfa60 Compare May 9, 2026 07:29
@butterwecksolutions

Copy link
Copy Markdown
Contributor Author

@kashif Can you check this again?

@kashif kashif left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a test that highlights the issue

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec qgallouedec merged commit 5da6078 into huggingface:main May 9, 2026
11 of 12 checks passed
@butterwecksolutions

butterwecksolutions commented May 10, 2026

Copy link
Copy Markdown
Contributor Author

@kashif @qgallouedec Thanks again for merging. My new experimental VRAM saving tool fits to this topic: https://github.com/butterwecksolutions/vsqz . Feel free to have a look.

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jun 1, 2026
….5.1 ships upstream fix

TRL 1.5.1 implements huggingface/trl#5730 natively — ``OffloadActivations``
now has its own ``__enter__`` that clears tracker / stashes between steps,
**plus** two things the axolotl backport never had:

- ``self.tensor_id = 0`` reset (without this, the tensor_id counter accumulates
  across steps; harmless on its own but skews the ``fwd_stash`` eviction window).
- ``torch.cuda.empty_cache()`` when bitsandbytes is loaded — flushes the BNB
  allocator between steps so its compute / optimizer-state buffers don't
  accumulate as live storage.

TRL 1.5.1 also adds a ``__exit__`` that syncs the offload streams (``s0``,
``s1``) before the parent cleanup runs. The axolotl backport only overrode
``__enter__``, so ``__exit__`` was inherited correctly either way.

Once we bumped TRL 1.1.0 → 1.5.1 (transformers 5.9 bundle), the monkey-patch
became strictly worse than upstream — it shadowed the better ``__enter__``,
dropping the ``tensor_id`` reset and the BNB ``empty_cache``. Combined with
cu130's stricter cross-stream lifetime checks, this surfaced as XID 43
(driver-killed CUDA channel) during ``test_activation_offloading[lora]``,
followed by every subsequent test failing at ``torch.manual_seed(42)``
because the CUDA context was permanently poisoned.

Drop the patch and the wrapper — upstream is now the source of truth, per
the existing TODO in this file.
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jun 2, 2026
* bump transformers to 5.9.0 and trl to 1.5.1

* test(gemma4-kernelize): accept ValueError from transformers 5.9 attach_hidden_kernels

transformers ≤5.8 surfaced the non-Module ``_hidden_kernels`` entry as
TypeError/AttributeError via ``module.register_module(name, fn)``. 5.9
reworked ``attach_hidden_kernels`` to raise ``ValueError`` directly with
a clearer error message. The patch under test (strip dead entries
before ``kernelize()`` runs) does the right thing either way; broaden
the expected-crash assertion so the test reflects current upstream
behavior.

* 30 min timeout

* fix(activation-offload): drop monkey-patched __enter__ now that TRL 1.5.1 ships upstream fix

TRL 1.5.1 implements huggingface/trl#5730 natively — ``OffloadActivations``
now has its own ``__enter__`` that clears tracker / stashes between steps,
**plus** two things the axolotl backport never had:

- ``self.tensor_id = 0`` reset (without this, the tensor_id counter accumulates
  across steps; harmless on its own but skews the ``fwd_stash`` eviction window).
- ``torch.cuda.empty_cache()`` when bitsandbytes is loaded — flushes the BNB
  allocator between steps so its compute / optimizer-state buffers don't
  accumulate as live storage.

TRL 1.5.1 also adds a ``__exit__`` that syncs the offload streams (``s0``,
``s1``) before the parent cleanup runs. The axolotl backport only overrode
``__enter__``, so ``__exit__`` was inherited correctly either way.

Once we bumped TRL 1.1.0 → 1.5.1 (transformers 5.9 bundle), the monkey-patch
became strictly worse than upstream — it shadowed the better ``__enter__``,
dropping the ``tensor_id`` reset and the BNB ``empty_cache``. Combined with
cu130's stricter cross-stream lifetime checks, this surfaced as XID 43
(driver-killed CUDA channel) during ``test_activation_offloading[lora]``,
followed by every subsequent test failing at ``torch.manual_seed(42)``
because the CUDA context was permanently poisoned.

Drop the patch and the wrapper — upstream is now the source of truth, per
the existing TODO in this file.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants