Guard GPT-OSS allocator warmup on low-memory 4-bit loads by danielhanchen · Pull Request #521 · unslothai/unsloth-zoo

danielhanchen · 2026-02-26T00:50:33Z

Summary

Guard transformers.modeling_utils.caching_allocator_warmup in Unsloth patch flow.
Auto-skip warmup on low effective accelerator memory (<= 24 GiB) across model loads.
Keep explicit override:
- UNSLOTH_ALLOCATOR_WARMUP=on|off|auto
Rename warmup guard identifiers to generic names.
Normalize GPT-OSS model-name guards to handle hyphenated names (gpt-oss -> gpt_oss) consistently.
Remove legacy UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP alias support.
Use the active accelerator index (not hardcoded device 0) for memory and per-process-fraction checks.
Strip @use_kernel_forward_from_hub(...) and @auto_docstring decorators in class rewrite path to remove unknown-decorator warning noise.
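
As a rough illustration of the override semantics, the sketch below resolves UNSLOTH_ALLOCATOR_WARMUP into one of three outcomes. The accepted spellings mirror the mode checks in the reviewed diff; the helper name `resolve_warmup_mode` is hypothetical and is not part of unsloth-zoo.

```python
import os

# Spellings taken from the mode checks in the reviewed diff.
_OFF_MODES = {"off", "disable", "0", "false"}
_ON_MODES = {"on", "enable", "1", "true"}


def resolve_warmup_mode(environ=os.environ) -> str:
    """Return "off", "on", or "auto" (hypothetical helper, illustration only)."""
    mode = environ.get("UNSLOTH_ALLOCATOR_WARMUP", "").strip().lower()
    if mode in _OFF_MODES:
        return "off"   # always skip the warmup
    if mode in _ON_MODES:
        return "on"    # always run the warmup
    return "auto"      # unset or unrecognized: decide from accelerator memory
```

Since the variable is read from the environment, it presumably needs to be set before the model is loaded, for example with `os.environ["UNSLOTH_ALLOCATOR_WARMUP"] = "off"` at the top of the script.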

Why

caching_allocator_warmup can allocate a large single chunk before weights load. On low-memory setups this can OOM before model load completes.
GPT-OSS patch activation checks were using raw UNSLOTH_MODEL_NAME substring checks in multiple places. Hyphenated names can skip patch paths that should apply for 4-bit GPT-OSS loads.
Multi-GPU/non-default-device runs can mis-detect memory when checks are hardcoded to device 0.
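
To make the hyphen problem concrete, here is a minimal sketch of a raw substring check versus a normalized one. The function body is an assumption based on the `.replace("-", "_")` call in the reviewed diff, and the model name is a hypothetical example value, not one taken from the logs.

```python
import os


def normalized_model_name(environ=os.environ) -> str:
    """Hypothetical stand-in for the PR's _normalized_unsloth_model_name helper."""
    return environ.get("UNSLOTH_MODEL_NAME", "").replace("-", "_")


# Hypothetical example value with a hyphenated model name.
env = {"UNSLOTH_MODEL_NAME": "unsloth/gpt-oss-20b"}

# The raw check misses the model because "gpt_oss" never appears literally.
raw_match = "gpt_oss" in env["UNSLOTH_MODEL_NAME"]

# After normalization, the same check matches.
normalized_match = "gpt_oss" in normalized_model_name(env)
```

Routing every guard through one normalizing helper keeps the checks consistent, which is the point of the change described above.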

Validation

Run dir:

logs/memory_investigation_20260226_000809/warmup_24gb_global_20260226_014122

On Unsloth 2026.2.1 / transformers 4.56.2:

cap16_auto_unsloth2026_generic_names.log: PASS
- GPU_RESERVED_GB_AFTER_LOAD 11.85
- GPU_PEAK_RESERVED_GB_AFTER_LOAD 12.932
cap16_force_on_unsloth2026_generic_names.log: FAIL (expected control)
- OutOfMemoryError with Tried to allocate 18.93 GiB
test_llama_compile_smoke.log: PASS

Hyphen-guard/load checks:

logs/hyphen_guard_verify_20260226_020925/repro_uninitialized_after_local_patch.log: PASS (LOAD_OK)

Version matrix (transformers==4.56.2):

Run dir: logs/version_matrix_gpt_oss_20260226_020432
old_2025_12 (unsloth==2025.12.10, unsloth-zoo==2025.12.8):
- load_ok=true, train_ok=true, exception_type=null
- losses=[1.165579, 4.171052, 3.063763]
- grad_norms=[2.650939, inf, 1.8630026331731638e+17]
new_2026_2_1 (unsloth==2026.2.1, unsloth-zoo==2026.2.1):
- load_ok=true, train_ok=true, exception_type=null
- losses=[1.156882, 4.137518, 3.050902]
- grad_norms=[2.590222, 5.900850786987662e+17, 9147114459174.867]

Sanity check for active-device memory helpers:

Imported the patched module from the installed wheel path and verified:
- _get_active_accelerator_index present
- _get_accelerator_total_memory_bytes / _get_effective_accelerator_memory_bytes return values

Notes

High-memory accelerators keep warmup by default (auto).
In this environment, the uninitialized-weights load exception did not reproduce on either 2025.12 or 2026.2.1; both loaded successfully.

gemini-code-assist · 2026-02-26T00:50:48Z

Summary of Changes

This pull request introduces a critical enhancement to prevent Out-Of-Memory (OOM) errors during GPT-OSS 4-bit model loading on systems with limited GPU memory. By intelligently detecting available accelerator memory and conditionally skipping a large allocator warmup, it ensures models can load successfully in constrained environments. This change improves robustness without impacting performance on high-memory setups and provides user control over the new behavior.

Highlights

Guarded Allocator Warmup: A guarded wrapper was added around transformers.modeling_utils.caching_allocator_warmup specifically for GPT-OSS 4-bit loads.
Automatic Warmup Skipping: The warmup process will now automatically be skipped on accelerators with low effective memory (<= 20 GiB), including environments constrained by torch.cuda.set_per_process_memory_fraction(...).
Configurable Override: An environment variable UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP was introduced, allowing users to explicitly set warmup behavior to 'on', 'off', or 'auto'.
Unchanged Behavior for High-Memory/Non-GPT-OSS: Warmup behavior remains unchanged for devices with larger memory capacities and for model paths not related to GPT-OSS.

Changelog

unsloth_zoo/temporary_patches/gpt_oss.py
- Added helper functions _get_accelerator_total_memory_bytes and _get_effective_accelerator_memory_bytes to accurately determine the total and effectively available GPU memory.
- Implemented _should_skip_transformers_allocator_warmup_for_gpt_oss to evaluate conditions for skipping the allocator warmup, considering model name, 4-bit loading, effective memory, and an environment variable override.
- Introduced patch_transformers_caching_allocator_warmup_for_gpt_oss to dynamically wrap the transformers.modeling_utils.caching_allocator_warmup function, applying the new skip logic.
- Registered the new patching function with TEMPORARY_PATCHES to ensure it is applied during runtime.

gemini-code-assist

Code Review

This pull request introduces a mechanism to guard the transformers.modeling_utils.caching_allocator_warmup for GPT-OSS 4-bit loads on low-memory GPUs. The changes are well-structured and address the OOM issue described. The addition of an environment variable for override is a good feature. My main feedback is to refactor one of the new functions to improve readability and maintainability by extracting magic values into constants. Overall, this is a solid improvement.

gemini-code-assist · 2026-02-26T00:52:04Z

+    model_name = os.environ.get("UNSLOTH_MODEL_NAME", "").replace("-", "_")
+    if "gpt_oss" not in model_name:
+        return False
+    if "_load_in_4bit_" not in model_name:
+        return False
+
+    mode = os.environ.get("UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP", "auto").strip().lower()
+    if mode in ("off", "disable", "0", "false"):
+        return True
+    if mode in ("on", "enable", "1", "true"):
+        return False
+
+    total_memory = _get_effective_accelerator_memory_bytes()
+    if total_memory is None:
+        return False
+    return total_memory <= int(20 * 1024**3)


This function can be refactored for better readability and maintainability:

The checks for model_name can be combined into a single if statement for conciseness.

The magic strings for mode checking and the magic number for the memory threshold can be extracted into constants within the function's scope. Using sets for mode checking is also slightly more efficient for lookups.

Adding comments to explain the logic for "auto" mode and the memory threshold would improve clarity for future maintainers.

Here is a suggested refactoring:

    model_name = os.environ.get("UNSLOTH_MODEL_NAME", "").replace("-", "_")
    if "gpt_oss" not in model_name or "_load_in_4bit_" not in model_name:
        return False

    mode = os.environ.get("UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP", "auto").strip().lower()
    _WARMUP_OFF_MODES = {"off", "disable", "0", "false"}
    if mode in _WARMUP_OFF_MODES:
        return True
    _WARMUP_ON_MODES = {"on", "enable", "1", "true"}
    if mode in _WARMUP_ON_MODES:
        return False

    # Auto mode: skip on low memory devices
    total_memory = _get_effective_accelerator_memory_bytes()
    if total_memory is None:
        return False
    # 20 GiB threshold for low-memory devices
    _LOW_MEMORY_THRESHOLD_BYTES = 20 * 1024**3
    return total_memory <= _LOW_MEMORY_THRESHOLD_BYTES

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d66df48dba

chatgpt-codex-connector · 2026-02-26T00:54:09Z

+    try:
+        if DEVICE_TYPE == "xpu":
+            return int(torch.xpu.memory.mem_get_info(0)[-1])
+        return int(torch.cuda.memory.mem_get_info(0)[-1])


Detect memory on the active accelerator, not device 0

The auto-skip decision always reads memory from accelerator index 0, so in multi-GPU or non-default-device runs it can evaluate the wrong device and keep warmup enabled even when the actual target device is low-memory/capped. In particular, if the model is loaded on cuda:1 (or another mapped device), this guard may miss the OOM condition it was added to prevent. The check should use the current/target device (or derive the relevant indices from expanded_device_map) instead of hard-coding 0.
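
One way to probe the active device instead of index 0 is sketched below. It covers CUDA only (the XPU branch from the diff is omitted), the function names are illustrative rather than the PR's exact helpers, and the broad exception handling is a deliberate fallback choice for this sketch.

```python
from typing import Optional


def active_accelerator_index() -> int:
    """Index of the current CUDA device, or 0 if it cannot be determined."""
    try:
        import torch

        if torch.cuda.is_available():
            return int(torch.cuda.current_device())
    except Exception:
        # Broad on purpose: fall back rather than fail on unusual runtimes.
        pass
    return 0


def total_memory_bytes(device_index: Optional[int] = None) -> Optional[int]:
    """Total memory of the given (default: active) CUDA device, or None."""
    try:
        import torch

        if not torch.cuda.is_available():
            return None
        if device_index is None:
            device_index = active_accelerator_index()
        # mem_get_info returns a (free_bytes, total_bytes) tuple.
        return int(torch.cuda.mem_get_info(device_index)[-1])
    except Exception:
        return None
```

Returning None when the probe fails lets the caller keep the default behavior (run the warmup) rather than guess at a memory size.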

danielhanchen · 2026-02-26T02:55:11Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a valuable safeguard against out-of-memory errors on devices with limited VRAM by conditionally skipping the allocator warmup. The logic for detecting low-memory environments is sound, and providing an environment variable for override is a good practice. Additionally, the refactoring to normalize the model name from the environment variable improves code consistency and fixes a potential bug with hyphenated model names. The changes are well-structured and enhance the robustness of the patching mechanism. I have a few minor suggestions to improve maintainability.

gemini-code-assist · 2026-02-26T02:56:26Z

+        if DEVICE_TYPE == "xpu":
+            return int(torch.xpu.memory.mem_get_info(0)[-1])
+        return int(torch.cuda.memory.mem_get_info(0)[-1])
+    except Exception:


Catching a broad Exception can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions that you expect to handle, such as RuntimeError, ImportError, or AttributeError in this context. This makes the code's intent clearer and more robust against unrelated issues.

Suggested change

-    except Exception:
+    except (RuntimeError, ImportError, AttributeError):

gemini-code-assist · 2026-02-26T02:56:26Z

+            fraction = float(torch.cuda.get_per_process_memory_fraction(0))
+            if 0.0 < fraction < 1.0:
+                return int(total_memory * fraction)
+        except Exception:


Similar to the previous comment, catching a broad Exception is discouraged. For torch.cuda.get_per_process_memory_fraction, it's better to specifically handle expected exceptions like RuntimeError or NotImplementedError to avoid masking other potential bugs.

Suggested change

-        except Exception:
+        except (RuntimeError, NotImplementedError):

gemini-code-assist · 2026-02-26T02:56:26Z

+
+def _should_skip_transformers_allocator_warmup() -> bool:
+    """
+    Skip transformers allocator warmup on low-memory accelerators.
+
+    `caching_allocator_warmup` can allocate large single chunks before weights
+    are loaded, which can OOM constrained GPUs.
+    """
+    mode = os.environ.get("UNSLOTH_ALLOCATOR_WARMUP", "").strip().lower()
+    if mode in ("off", "disable", "0", "false"):
+        return True
+    if mode in ("on", "enable", "1", "true"):
+        return False
+
+    total_memory = _get_effective_accelerator_memory_bytes()
+    if total_memory is None:
+        return False
+    return total_memory <= int(24 * 1024**3)


To improve readability and maintainability, it's a good practice to define magic numbers like the memory threshold as a named constant. This makes the code easier to understand and modify in the future.

Suggested change

+_LOW_MEMORY_THRESHOLD_BYTES = 24 * 1024**3
+
 def _should_skip_transformers_allocator_warmup() -> bool:
     """
     Skip transformers allocator warmup on low-memory accelerators.

     `caching_allocator_warmup` can allocate large single chunks before weights
     are loaded, which can OOM constrained GPUs.
     """
     mode = os.environ.get("UNSLOTH_ALLOCATOR_WARMUP", "").strip().lower()
     if mode in ("off", "disable", "0", "false"):
         return True
     if mode in ("on", "enable", "1", "true"):
         return False

     total_memory = _get_effective_accelerator_memory_bytes()
     if total_memory is None:
         return False
-    return total_memory <= int(24 * 1024**3)
+    return total_memory <= _LOW_MEMORY_THRESHOLD_BYTES

danielhanchen · 2026-02-26T03:01:03Z

Triage update:

Addressed functional review point about hardcoded accelerator index 0 in memory probes. Warmup and combo-kernel memory checks now use the active accelerator index.
Kept allocator override env var as UNSLOTH_ALLOCATOR_WARMUP only.
Kept broad exception handling in the memory probe helpers intentionally for robustness/fallback behavior on diverse runtimes.
Did not apply repeated explicit/implicit return style suggestions since behavior is intentional and unchanged.

Implemented in commit 135d6ad.

+    return total_memory <= _LOW_MEMORY_ACCELERATOR_BYTES
+
+
+def patch_transformers_caching_allocator_warmup():


+    if hasattr(warmup_fn, "__unsloth_gpt_oss_guarded__"):
+        return
+
+    def guarded_caching_allocator_warmup(model, expanded_device_map, hf_quantizer):


+            return 0
+        if hasattr(torch, "cuda") and hasattr(torch.cuda, "current_device"):
+            return int(torch.cuda.current_device())
+    except Exception:


+            fraction = float(torch.cuda.get_per_process_memory_fraction(device_index))
+            if 0.0 < fraction < 1.0:
+                return int(total_memory * fraction)
+        except Exception:


danielhanchen · 2026-02-26T03:19:51Z

Additional follow-up on #521 before final matrix runs:

Added compiler decorator stripping for @use_kernel_forward_from_hub(...) and @auto_docstring in class rewrite path.
This removes the previously observed unknown-decorator warning noise for GPT-OSS/Llama compile paths.
Kept unknown-decorator warnings for truly unrecognized decorators.
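
For readers unfamiliar with the rewrite path, the idea can be sketched as a line filter over class source text. This is an assumption-laden illustration, not the unsloth-zoo implementation: `strip_decorators` is a hypothetical name, and the sketch only handles decorators that fit on a single line.

```python
import re

# Decorator names mentioned in this PR.
STRIP_NAMES = ("use_kernel_forward_from_hub", "auto_docstring")


def strip_decorators(source: str, names=STRIP_NAMES) -> str:
    """Drop single-line decorator lines whose decorator name is in `names`."""
    alternatives = "|".join(re.escape(name) for name in names)
    # Matches "@name", "@name(...)" and "@pkg.name(...)", but not "@name_suffix".
    pattern = re.compile(r"^\s*@(?:[\w.]+\.)?(?:%s)\b" % alternatives)
    kept = [line for line in source.splitlines() if not pattern.match(line)]
    return "\n".join(kept)


example = (
    '@use_kernel_forward_from_hub("RMSNorm")\n'
    "@auto_docstring\n"
    "@torch.no_grad()\n"
    "class Foo:\n"
    "    pass"
)
stripped = strip_decorators(example)
```

Decorators outside the list, such as `@torch.no_grad()` here, pass through untouched, which matches the intent of still warning on truly unrecognized decorators.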

Commit included on this branch: eaf1f4a (with prior 684f684).

Datta0 · 2026-02-26T05:46:53Z

What were you running that produced grad norms like 5.900850786987662e+17? Unless the data is purely random, we should investigate this further.
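
A quick way to flag such values when comparing runs is sketched below, using the grad_norms reported in the version matrix. The helper name and the `limit` of 1e3 are arbitrary assumptions chosen for illustration, not project conventions.

```python
import math


def suspicious_grad_norms(grad_norms, limit=1e3):
    """Return (step, value) pairs that are non-finite or exceed `limit`."""
    return [
        (step, value)
        for step, value in enumerate(grad_norms)
        if not math.isfinite(value) or value > limit
    ]


# grad_norms as reported in the version matrix above.
old_2025_12 = [2.650939, float("inf"), 1.8630026331731638e+17]
new_2026_2_1 = [2.590222, 5.900850786987662e+17, 9147114459174.867]

flagged_old = suspicious_grad_norms(old_2025_12)
flagged_new = suspicious_grad_norms(new_2026_2_1)
```

Both versions get steps 1 and 2 flagged, so the large norms are not specific to the new release in this matrix.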

Datta0 · 2026-02-26T05:58:15Z

+    return 0
+
+
+def _get_accelerator_total_memory_bytes():


NIT: We should put these in utils.py (away from temporary_patches), as they have the potential to be reused.
Also, there was a PR about DeviceContext, IIRC, which might handle this better: unslothai/unsloth#3875

Datta0 · 2026-02-26T06:00:12Z

+    if DEVICE_TYPE != "xpu" and hasattr(torch.cuda, "get_per_process_memory_fraction"):
+        try:
+            device_index = _get_active_accelerator_index()
+            fraction = float(torch.cuda.get_per_process_memory_fraction(device_index))


Interesting. Are we restricting this somewhere else, or is this meant to respect a hypothetical user who sets such a limit?
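
For context, the effect of a per-process cap on the guard can be sketched with plain arithmetic. The 24 GiB threshold and the `0.0 < fraction < 1.0` condition come from the reviewed diff; the function names and the 80 GiB device are hypothetical.

```python
GIB = 1024**3
LOW_MEMORY_BYTES = 24 * GIB  # threshold used in the reviewed diff


def effective_memory_bytes(total_bytes: int, fraction: float) -> int:
    """Scale total memory by the per-process fraction when it is a real cap."""
    if 0.0 < fraction < 1.0:
        return int(total_bytes * fraction)
    return total_bytes


def should_skip_warmup(total_bytes: int, fraction: float = 1.0) -> bool:
    """Auto-mode decision: skip warmup when effective memory is at or below the threshold."""
    return effective_memory_bytes(total_bytes, fraction) <= LOW_MEMORY_BYTES


# Hypothetical 80 GiB device: uncapped keeps warmup, a 20% cap (about 16 GiB) skips it.
uncapped = should_skip_warmup(80 * GIB)
capped = should_skip_warmup(80 * GIB, fraction=0.2)
```

Without the fraction, a capped process on a large card would look like a high-memory device and keep the warmup, even though the large warmup allocation may not fit under the cap.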

Datta0 · 2026-02-26T06:02:43Z

    Set UNSLOTH_GPT_OSS_BNB4BIT_DISABLE=1 to force BF16 path.
    """
-    if "gpt_oss" not in os.environ.get("UNSLOTH_MODEL_NAME", ""):
+    if "gpt_oss" not in _normalized_unsloth_model_name():


I see this being a repeat of #519 (some of it). We should try to consolidate to avoid merge conflicts

Datta0 · 2026-02-26T06:05:32Z

-                        logger.info(f'Unsloth: stripped use_experts_implementation decorator from {module}')
+                    if (
+                        "use_experts_implementation" in stripped
+                        or "use_kernel_forward_from_hub" in stripped


Actually nice. Never noticed this. Seems like a very recent change.
But this function is to strip the decorator. Why not have the decorator and use the kernel?

Skip GPT-OSS allocator warmup on low-memory 4-bit loads

d66df48
