fix: CUDA memory leak / release BNB dequantization buffers & stale state in OffloadActivations#5730
Conversation
23e7196 to
660a88d
Compare
a23df14 to
8c2ef2d
Compare
33d8d2b to
398210e
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 398210e. Configure here.
619e393 to
ed25785
Compare
Two independent VRAM leak paths in OffloadActivations are fixed by cleaning up stale state and releasing allocator cache blocks in __enter__, where the previous backward has already completed: 1. MoE + sample_packing + torch.compile — saved tensors on subgraphs whose backward nodes never execute leak ~60 tensors/micro-step because the unpack-then-delete logic never fires for them. 2. QLoRA BNB 4-bit dequantization buffers — tracker references keep allocator blocks alive across steps, and empty_cache() is never called (~0.6 GiB/step, OOM after 30-40 steps on 24 GB GPUs). __enter__ clears tracker, storage_to_tensor_id, tensor_id, stashes, and calls accelerator-aware empty_cache() (conditional on bitsandbytes in sys.modules to avoid penalizing non-BNB workloads). __exit__ handles stream sync and stash cleanup as before (huggingface#5700). All cleanup uses explicit if/elif dispatch matching the file's established accelerator pattern.
ed25785 to
cdcfa60
Compare
|
@kashif Can you check this again? |
kashif
left a comment
There was a problem hiding this comment.
added a test that highlights the issue
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@kashif @qgallouedec Thanks again for merging. My new experimental VRAM saving tool fits to this topic: https://github.com/butterwecksolutions/vsqz . Feel free to have a look. |
….5.1 ships upstream fix TRL 1.5.1 implements huggingface/trl#5730 natively — ``OffloadActivations`` now has its own ``__enter__`` that clears tracker / stashes between steps, **plus** two things the axolotl backport never had: - ``self.tensor_id = 0`` reset (without this, the tensor_id counter accumulates across steps; harmless on its own but skews the ``fwd_stash`` eviction window). - ``torch.cuda.empty_cache()`` when bitsandbytes is loaded — flushes the BNB allocator between steps so its compute / optimizer-state buffers don't accumulate as live storage. TRL 1.5.1 also adds a ``__exit__`` that syncs the offload streams (``s0``, ``s1``) before the parent cleanup runs. The axolotl backport only overrode ``__enter__``, so ``__exit__`` was inherited correctly either way. Once we bumped TRL 1.1.0 → 1.5.1 (transformers 5.9 bundle), the monkey-patch became strictly worse than upstream — it shadowed the better ``__enter__``, dropping the ``tensor_id`` reset and the BNB ``empty_cache``. Combined with cu130's stricter cross-stream lifetime checks, this surfaced as XID 43 (driver-killed CUDA channel) during ``test_activation_offloading[lora]``, followed by every subsequent test failing at ``torch.manual_seed(42)`` because the CUDA context was permanently poisoned. Drop the patch and the wrapper — upstream is now the source of truth, per the existing TODO in this file.
* bump transformers to 5.9.0 and trl to 1.5.1 * test(gemma4-kernelize): accept ValueError from transformers 5.9 attach_hidden_kernels transformers ≤5.8 surfaced the non-Module ``_hidden_kernels`` entry as TypeError/AttributeError via ``module.register_module(name, fn)``. 5.9 reworked ``attach_hidden_kernels`` to raise ``ValueError`` directly with a clearer error message. The patch under test (strip dead entries before ``kernelize()`` runs) does the right thing either way; broaden the expected-crash assertion so the test reflects current upstream behavior. * 30 min timeout * fix(activation-offload): drop monkey-patched __enter__ now that TRL 1.5.1 ships upstream fix TRL 1.5.1 implements huggingface/trl#5730 natively — ``OffloadActivations`` now has its own ``__enter__`` that clears tracker / stashes between steps, **plus** two things the axolotl backport never had: - ``self.tensor_id = 0`` reset (without this, the tensor_id counter accumulates across steps; harmless on its own but skews the ``fwd_stash`` eviction window). - ``torch.cuda.empty_cache()`` when bitsandbytes is loaded — flushes the BNB allocator between steps so its compute / optimizer-state buffers don't accumulate as live storage. TRL 1.5.1 also adds a ``__exit__`` that syncs the offload streams (``s0``, ``s1``) before the parent cleanup runs. The axolotl backport only overrode ``__enter__``, so ``__exit__`` was inherited correctly either way. Once we bumped TRL 1.1.0 → 1.5.1 (transformers 5.9 bundle), the monkey-patch became strictly worse than upstream — it shadowed the better ``__enter__``, dropping the ``tensor_id`` reset and the BNB ``empty_cache``. Combined with cu130's stricter cross-stream lifetime checks, this surfaced as XID 43 (driver-killed CUDA channel) during ``test_activation_offloading[lora]``, followed by every subsequent test failing at ``torch.manual_seed(42)`` because the CUDA context was permanently poisoned. Drop the patch and the wrapper — upstream is now the source of truth, per the existing TODO in this file.

What does this PR do?
The
OffloadActivationscontext manager leaks VRAM through two independent paths.Both are fixed by cleaning up stale state in
__enter__, where the previousbackward is guaranteed to have completed.
Path 1 — Stale state leaks across steps (MoE + compile)
On MoE architectures with sample packing + torch.compile, dynamic expert routing
leaves saved tensors on subgraphs that never contribute to loss. Their backward
nodes never execute, so
tracker,storage_to_tensor_id, and the stashes retainentries from step to step. ~60 tensors leak per micro-step → OOM by step 2
(e.g. Gemma-4 26B-A4B + ScatterMoE, axolotl-ai-cloud/axolotl#3638).
Path 2 — BNB dequantization buffers (QLoRA)
After #5700 fixed the CUDA stream leak, QLoRA (BNB 4-bit) still leaks because
trackerretains references to tensors sharing allocator blocks with BNBdequant buffers, and
empty_cache()is never called between steps.~0.6 GiB/step → OOM after 30-40 steps on 24 GB GPUs.
Why
__enter__is the right placeThe class example shows
backward()running after thewithblock exits:with OffloadActivations():
outputs = model(inputs, labels=labels)
loss = outputs.loss
loss.backward()
This means exit may fire BEFORE backward — the tracker still holds
active tensors. Cleaning up there would destroy data the backward needs.
enter, however, is called before the NEXT forward pass. By that point
the previous backward has already completed, so any remaining
state is leaked and safe to drop.
What the fix does
enter now clears:
(via sys.modules, to avoid penalizing non-BNB workloads)
exit is unchanged from #5700 — stream sync + super().exit().
__enter____enter____exit__/ #5700 (merged)A complete audit confirms no further leak sources.
Verification
Validated via monkey-patch on QLoRA 9B VL training. Full reproducer available,
will be updated after review feedback.
This PR fixes a typo or improves the docs
Read the contributor guideline
Discussed via GitHub issue (N/A — found during QLoRA debugging)
Documentation update needed? No — internal behavior change only
New tests? Not easily testable — requires CUDA memory snapshot assertions
AI-assisted: PR text and structure refined with AI; root cause found
through human debugging
Note
Medium Risk
Touches
OffloadActivationslifecycle/memory-management logic; incorrect clearing or cache flushing could impact training correctness or performance, though changes are scoped to context entry.Overview
Fixes activation-offloading VRAM leaks by resetting
OffloadActivationsstate at context entry: clearstracker/dedup maps, resets forward/backward bookkeeping (includingtensor_id), and drops stream stashes when enabled.When
bitsandbytesis loaded,__enter__now calls the appropriate acceleratorempty_cache()(cuda/xpu/npu) to release BNB dequantization buffers between steps. Adds a regression test ensuringtrackersize does not grow across repeated forward/backward steps with an unused graph branch.Reviewed by Cursor Bugbot for commit 22b5210. Bugbot is set up for automated code reviews on this repo. Configure here.