fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations by butterwecksolutions · Pull Request #5730 · huggingface/trl

butterwecksolutions · 2026-05-08T00:57:59Z

What does this PR do?

The OffloadActivations context manager leaks VRAM through two independent paths.
Both are fixed by cleaning up stale state in __enter__, where the previous
backward is guaranteed to have completed.

Path 1 — Stale state leaks across steps (MoE + compile)

On MoE architectures with sample packing + torch.compile, dynamic expert routing
leaves saved tensors on subgraphs that never contribute to loss. Their backward
nodes never execute, so tracker, storage_to_tensor_id, and the stashes retain
entries from step to step. ~60 tensors leak per micro-step → OOM by step 2
(e.g. Gemma-4 26B-A4B + ScatterMoE, axolotl-ai-cloud/axolotl#3638).
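
The mechanism can be illustrated with a toy model in plain Python (no PyTorch). `ToyOffloader`, `pack`, `unpack`, and `train_step` are hypothetical stand-ins for the real hook bookkeeping. The assumption that an entry is deleted only when its unpack hook runs comes from the description above. This is a sketch of the failure mode, not TRL's implementation.

```python
class ToyOffloader:
    """Toy stand-in for saved-tensor bookkeeping (hypothetical, not TRL code)."""

    def __init__(self):
        self.tracker = {}   # tensor_id -> saved activation
        self.tensor_id = 0

    def pack(self, activation):
        """Forward: remember the activation and hand back an id."""
        tid = self.tensor_id
        self.tracker[tid] = activation
        self.tensor_id += 1
        return tid

    def unpack(self, tid):
        """Backward: return the activation and drop its tracker entry."""
        return self.tracker.pop(tid)


def train_step(offloader):
    # Two branches save activations during the forward pass.
    used = offloader.pack("activation on the loss path")
    offloader.pack("activation on an expert branch that never reaches the loss")
    # Backward only visits nodes that contribute to the loss,
    # so only the first entry is ever unpacked and deleted.
    offloader.unpack(used)


offloader = ToyOffloader()
sizes = []
for _ in range(3):
    train_step(offloader)
    sizes.append(len(offloader.tracker))
print(sizes)  # [1, 2, 3]: one stale entry survives each step
```

In the toy model the tracker grows by one entry per step. The real leak described above is the same pattern at roughly 60 tensors per micro-step.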

Path 2 — BNB dequantization buffers (QLoRA)

After #5700 fixed the CUDA stream leak, QLoRA (BNB 4-bit) still leaks because
tracker retains references to tensors sharing allocator blocks with BNB
dequant buffers, and empty_cache() is never called between steps.
~0.6 GiB/step → OOM after 30-40 steps on 24 GB GPUs.

Why `__enter__` is the right place

The class example shows backward() running after the with block exits:

    with OffloadActivations():
        outputs = model(inputs, labels=labels)
    loss = outputs.loss
    loss.backward()

This means `__exit__` may fire BEFORE backward, while the tracker still holds
active tensors. Cleaning up there would destroy data the backward pass needs.

`__enter__`, however, is called before the NEXT forward pass. By that point
the previous backward has already completed, so any remaining
state is leaked and safe to drop.
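
The ordering argument can be checked with a plain context manager that only logs events. `LoggingContext`, `forward`, and `backward` are hypothetical names used for illustration and do not touch PyTorch or TRL.

```python
events = []


class LoggingContext:
    """Records when the context is entered and exited (illustration only)."""

    def __enter__(self):
        events.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb):
        events.append("exit")
        return False  # do not swallow exceptions


def forward():
    events.append("forward")


def backward():
    events.append("backward")


ctx = LoggingContext()
for _ in range(2):
    with ctx:
        forward()
    backward()  # runs after the with block has already exited

print(events)
# ['enter', 'forward', 'exit', 'backward', 'enter', 'forward', 'exit', 'backward']
```

Each "exit" precedes the "backward" of the same step, while the second "enter" follows the first "backward". That is why stale state can be dropped safely on entry but not on exit.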

What the fix does

`__enter__` now:

- clears tracker and storage_to_tensor_id, releasing stale tensor references
- resets tensor_id, is_first_forward_call, and is_first_backward_call
- clears the stream stashes (when use_streams=True)
- calls the accelerator-aware empty_cache() when bitsandbytes is loaded
  (checked via sys.modules, to avoid penalizing non-BNB workloads)
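
A minimal sketch of that cleanup, assuming only the field names listed above. `OffloadStateSketch` is a hypothetical class, `bwd_stash` is an assumed name for the second stash, and the accelerator call is replaced by an injected `empty_cache` callable so the example runs without PyTorch. It is not the code merged in this PR.

```python
import sys


class OffloadStateSketch:
    """Holds only the bookkeeping fields named above (hypothetical class)."""

    def __init__(self, use_streams=False, empty_cache=None):
        self.use_streams = use_streams
        self.empty_cache = empty_cache  # stand-in for the accelerator empty_cache()
        self.tracker = {}
        self.storage_to_tensor_id = {}
        self.tensor_id = 0
        self.is_first_forward_call = True
        self.is_first_backward_call = True
        if use_streams:
            self.fwd_stash = {}
            self.bwd_stash = {}  # assumed name, for illustration

    def __enter__(self):
        # The previous backward has finished, so anything left over is stale.
        self.tracker.clear()
        self.storage_to_tensor_id.clear()
        self.tensor_id = 0
        self.is_first_forward_call = True
        self.is_first_backward_call = True
        if self.use_streams:
            self.fwd_stash.clear()
            self.bwd_stash.clear()
        # Flush the allocator cache only if bitsandbytes was already imported,
        # so workloads that never load it pay nothing.
        if "bitsandbytes" in sys.modules and self.empty_cache is not None:
            self.empty_cache()
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


flushes = []
state = OffloadStateSketch(use_streams=True, empty_cache=lambda: flushes.append("flush"))
state.tracker[0] = "stale entry from the previous step"
state.tensor_id = 7
with state:
    pass
print(len(state.tracker), state.tensor_id)  # 0 0
```

Checking `sys.modules` rather than importing bitsandbytes keeps the check cheap and avoids pulling the library into processes that never use it.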

`__exit__` is unchanged from #5700: stream sync + super().__exit__().

| Pattern | Trigger | Symptom | Covered by |
| --- | --- | --- | --- |
| MoE + packed + compile | axolotl#3638 | OOM step 2 | `__enter__` |
| QLoRA BNB buffer | QLoRA training | OOM step 30-40 | `__enter__` |
| CUDA stream garbage | #5700 | OOM step 19 | `__exit__` / #5700 (merged) |

A complete audit confirms no further leak sources.

Verification

Validated via monkey-patch on QLoRA 9B VL training. Full reproducer available,
will be updated after review feedback.

| Capability | #5700 (merged) | #5738 | #5730 |
| --- | --- | --- | --- |
| CUDA stream sync + stash clear | ✅ | — | — |
| tracker + storage_to_tensor_id clear | — | ✅ | ✅ |
| stash clear (use_streams guard) | — | ✅ | ✅ |
| is_first_forward/backward reset | — | ✅ | ✅ |
| tensor_id reset | — | — | ✅ |
| BNB dequant buffer empty_cache | — | — | ✅ |
| CUDA stream garbage leak | ✅ | — | — |
| MoE + compile leak path | — | ✅ | ✅ |
| QLoRA buffer leak path | — | — | ✅ |

This PR fixes a typo or improves the docs
Read the contributor guideline
Discussed via GitHub issue (N/A — found during QLoRA debugging)
Documentation update needed? No — internal behavior change only
New tests? Not easily testable — requires CUDA memory snapshot assertions
AI-assisted: PR text and structure refined with AI; root cause found
through human debugging

Note

Medium Risk
Touches OffloadActivations lifecycle/memory-management logic; incorrect clearing or cache flushing could impact training correctness or performance, though changes are scoped to context entry.

Overview
Fixes activation-offloading VRAM leaks by resetting OffloadActivations state at context entry: clears tracker/dedup maps, resets forward/backward bookkeeping (including tensor_id), and drops stream stashes when enabled.

When bitsandbytes is loaded, __enter__ now calls the appropriate accelerator empty_cache() (cuda/xpu/npu) to release BNB dequantization buffers between steps. Adds a regression test ensuring tracker size does not grow across repeated forward/backward steps with an unused graph branch.
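
The cuda/xpu/npu dispatch mentioned above could look roughly like the following. This is a sketch under assumptions: `empty_accelerator_cache` is a hypothetical helper, it takes a torch-like object instead of importing PyTorch, and it assumes each backend namespace exposes `is_available()` and `empty_cache()`. The demonstration uses a fake module so it runs anywhere.

```python
from types import SimpleNamespace


def empty_accelerator_cache(torch_like):
    """Flush the cache of the first available backend; return its name or None."""
    cuda = getattr(torch_like, "cuda", None)
    xpu = getattr(torch_like, "xpu", None)
    npu = getattr(torch_like, "npu", None)
    if cuda is not None and cuda.is_available():
        cuda.empty_cache()
        return "cuda"
    elif xpu is not None and xpu.is_available():
        xpu.empty_cache()
        return "xpu"
    elif npu is not None and npu.is_available():
        npu.empty_cache()
        return "npu"
    return None


# Fake module for demonstration: CUDA reports unavailable, XPU is present.
calls = []
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False, empty_cache=lambda: calls.append("cuda")),
    xpu=SimpleNamespace(is_available=lambda: True, empty_cache=lambda: calls.append("xpu")),
)
print(empty_accelerator_cache(fake_torch), calls)  # xpu ['xpu']
```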

Reviewed by Cursor Bugbot for commit 22b5210.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 398210e.

Two independent VRAM leak paths in OffloadActivations are fixed by cleaning up stale state and releasing allocator cache blocks in __enter__, where the previous backward has already completed:

1. MoE + sample_packing + torch.compile: saved tensors on subgraphs whose backward nodes never execute leak ~60 tensors/micro-step because the unpack-then-delete logic never fires for them.
2. QLoRA BNB 4-bit dequantization buffers: tracker references keep allocator blocks alive across steps, and empty_cache() is never called (~0.6 GiB/step, OOM after 30-40 steps on 24 GB GPUs).

__enter__ clears tracker, storage_to_tensor_id, tensor_id, stashes, and calls accelerator-aware empty_cache() (conditional on bitsandbytes in sys.modules to avoid penalizing non-BNB workloads). __exit__ handles stream sync and stash cleanup as before (huggingface#5700). All cleanup uses explicit if/elif dispatch matching the file's established accelerator pattern.

butterwecksolutions · 2026-05-09T09:39:27Z

@kashif Can you check this again?

kashif

added a test that highlights the issue

HuggingFaceDocBuilderDev · 2026-05-09T10:20:52Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

butterwecksolutions · 2026-05-10T02:01:37Z

@kashif @qgallouedec Thanks again for merging. My new experimental VRAM-saving tool fits this topic: https://github.com/butterwecksolutions/vsqz . Feel free to have a look.

* bump transformers to 5.9.0 and trl to 1.5.1

* test(gemma4-kernelize): accept ValueError from transformers 5.9 attach_hidden_kernels

  transformers ≤5.8 surfaced the non-Module ``_hidden_kernels`` entry as TypeError/AttributeError via ``module.register_module(name, fn)``. 5.9 reworked ``attach_hidden_kernels`` to raise ``ValueError`` directly with a clearer error message. The patch under test (strip dead entries before ``kernelize()`` runs) does the right thing either way; broaden the expected-crash assertion so the test reflects current upstream behavior.

* 30 min timeout

* fix(activation-offload): drop monkey-patched __enter__ now that TRL 1.5.1 ships upstream fix

  TRL 1.5.1 implements huggingface/trl#5730 natively: ``OffloadActivations`` now has its own ``__enter__`` that clears tracker / stashes between steps, **plus** two things the axolotl backport never had:

  - ``self.tensor_id = 0`` reset (without this, the tensor_id counter accumulates across steps; harmless on its own but skews the ``fwd_stash`` eviction window).
  - ``torch.cuda.empty_cache()`` when bitsandbytes is loaded, which flushes the BNB allocator between steps so its compute / optimizer-state buffers don't accumulate as live storage.

  TRL 1.5.1 also adds a ``__exit__`` that syncs the offload streams (``s0``, ``s1``) before the parent cleanup runs. The axolotl backport only overrode ``__enter__``, so ``__exit__`` was inherited correctly either way.

  Once we bumped TRL 1.1.0 → 1.5.1 (transformers 5.9 bundle), the monkey-patch became strictly worse than upstream: it shadowed the better ``__enter__``, dropping the ``tensor_id`` reset and the BNB ``empty_cache``. Combined with cu130's stricter cross-stream lifetime checks, this surfaced as XID 43 (driver-killed CUDA channel) during ``test_activation_offloading[lora]``, followed by every subsequent test failing at ``torch.manual_seed(42)`` because the CUDA context was permanently poisoned.

  Drop the patch and the wrapper; upstream is now the source of truth, per the existing TODO in this file.
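
The shadowing effect described in that commit message can be reproduced with toy classes. `BaseHooks`, `UpstreamNew`, `backport_enter`, and `state_after_enter` are hypothetical names. This sketch only shows how assigning a function to a class attribute replaces a newer method; it is not the axolotl or TRL code.

```python
class BaseHooks:
    """Stand-in for the parent context manager, with no cleanup of its own."""

    def __init__(self):
        self.tracker = {0: "stale"}
        self.tensor_id = 7

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


class UpstreamNew(BaseHooks):
    """Stand-in for the newer release: clears tracker and resets tensor_id."""

    def __enter__(self):
        self.tracker.clear()
        self.tensor_id = 0
        return self


def backport_enter(self):
    """Stand-in for the older backport: clears tracker, never resets tensor_id."""
    self.tracker.clear()
    return self


def state_after_enter(cls):
    obj = cls()
    with obj:
        pass
    return len(obj.tracker), obj.tensor_id


before = state_after_enter(UpstreamNew)
UpstreamNew.__enter__ = backport_enter  # the monkey-patch replaces the upstream method
after = state_after_enter(UpstreamNew)
print(before, after)  # (0, 0) (0, 7)
```

After patching, the tracker is still cleared but `tensor_id` keeps its old value, which mirrors the dropped reset described above.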

cursor Bot reviewed May 8, 2026

Comment thread trl/models/activation_offloading.py Outdated

butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch from 23e7196 to 660a88d Compare May 8, 2026 01:09

cursor Bot reviewed May 8, 2026

Comment thread trl/models/activation_offloading.py Outdated

butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch 2 times, most recently from a23df14 to 8c2ef2d Compare May 8, 2026 05:56

cursor Bot reviewed May 8, 2026

Comment thread trl/models/activation_offloading.py Outdated

Comment thread trl/models/activation_offloading.py Outdated

butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch 2 times, most recently from 33d8d2b to 398210e Compare May 9, 2026 06:31

butterwecksolutions changed the title ~~fix: CUDA memory leak / release BNB dequantization buffers after activation offloading __exit__~~ fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations May 9, 2026

cursor Bot reviewed May 9, 2026

Comment thread trl/models/activation_offloading.py Outdated

Comment thread trl/models/activation_offloading.py

butterwecksolutions mentioned this pull request May 9, 2026

cleanup vram #5738

Open

8 tasks

butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch 2 times, most recently from 619e393 to ed25785 Compare May 9, 2026 06:56

butterwecksolutions force-pushed the bnb-dequant-buffer-cleanup branch from ed25785 to cdcfa60 Compare May 9, 2026 07:29

butterwecksolutions and others added 3 commits May 9, 2026 11:53

Merge branch 'main' into bnb-dequant-buffer-cleanup

c1d45fd

test activation offloading stale state

909416a

style activation offloading docstrings

22b5210

kashif approved these changes May 9, 2026

ved1beta mentioned this pull request May 9, 2026

test for moe activation vram leak axolotl-ai-cloud/axolotl#3649

Merged

qgallouedec merged commit 5da6078 into huggingface:main May 9, 2026
11 of 12 checks passed

qgallouedec merged 4 commits into huggingface:main from butterwecksolutions:bnb-dequant-buffer-cleanup

4 participants
