Guard GPT-OSS allocator warmup on low-memory 4-bit loads #521

Open
danielhanchen wants to merge 9 commits into main from fix/gpt-oss-lowmem-warmup

danielhanchen commented Feb 26, 2026
Member

Summary

  • Guard transformers.modeling_utils.caching_allocator_warmup in Unsloth patch flow.
  • Auto-skip warmup on low effective accelerator memory (<= 24 GiB) across model loads.
  • Keep explicit override:
    • UNSLOTH_ALLOCATOR_WARMUP=on|off|auto
  • Rename warmup guard identifiers to generic names.
  • Normalize GPT-OSS model-name guards to handle hyphenated names (gpt-oss -> gpt_oss) consistently.
  • Remove legacy UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP alias support.
  • Use the active accelerator index (not hardcoded device 0) for memory and per-process-fraction checks.
  • Strip @use_kernel_forward_from_hub(...) and @auto_docstring decorators in class rewrite path to remove unknown-decorator warning noise.
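
The override and auto-skip behavior listed above can be sketched as a pure function. This is a minimal sketch, not the PR's code: `should_skip_warmup` and `LOW_MEMORY_BYTES` are hypothetical names, the mode strings and the `<=` comparison against 24 GiB follow the reviewed snippet further down this thread, and the memory value is passed in rather than probed.

```python
GiB = 1024**3
LOW_MEMORY_BYTES = 24 * GiB  # threshold taken from the reviewed code


def should_skip_warmup(mode, effective_bytes):
    """Decide whether allocator warmup should be skipped.

    mode: value of UNSLOTH_ALLOCATOR_WARMUP (may be empty or None).
    effective_bytes: usable accelerator memory in bytes, or None if unknown.
    """
    mode = (mode or "").strip().lower()
    if mode in ("off", "disable", "0", "false"):
        return True   # explicit override: never warm up
    if mode in ("on", "enable", "1", "true"):
        return False  # explicit override: always warm up
    # "auto" or unset: skip only when memory is known and low.
    if effective_bytes is None:
        return False
    return effective_bytes <= LOW_MEMORY_BYTES


skip_on_16gib_auto = should_skip_warmup("auto", 16 * GiB)
skip_on_16gib_forced_on = should_skip_warmup("on", 16 * GiB)
skip_on_80gib_auto = should_skip_warmup("auto", 80 * GiB)
```

Under these assumptions a 16 GiB device skips warmup in `auto`, keeps it when forced `on`, and an 80 GiB device keeps it in `auto`.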

Why

  • caching_allocator_warmup can allocate a large single chunk before weights load. On low-memory setups this can OOM before model load completes.
  • GPT-OSS patch activation checks were using raw UNSLOTH_MODEL_NAME substring checks in multiple places. Hyphenated names can skip patch paths that should apply for 4-bit GPT-OSS loads.
  • Multi-GPU/non-default-device runs can mis-detect memory when checks are hardcoded to device 0.
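
The hyphen issue in the second bullet can be reproduced in isolation. The sketch below assumes the normalization is a plain `-` to `_` replacement on `UNSLOTH_MODEL_NAME`, as in the reviewed snippet; `normalized_model_name` and the example model name are hypothetical stand-ins for illustration.

```python
import os


def normalized_model_name(env=None):
    """Return UNSLOTH_MODEL_NAME with hyphens mapped to underscores.

    Hypothetical stand-in for the PR's normalization helper.
    """
    env = os.environ if env is None else env
    return env.get("UNSLOTH_MODEL_NAME", "").replace("-", "_")


# Illustrative value only; the real variable is set during model load.
env = {"UNSLOTH_MODEL_NAME": "gpt-oss-20b"}

raw_match = "gpt_oss" in env["UNSLOTH_MODEL_NAME"]          # raw substring check
normalized_match = "gpt_oss" in normalized_model_name(env)  # normalized check
```

With a hyphenated name the raw check misses and the normalized check matches, which is the difference the guard relies on.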

Validation

Run dir:

  • logs/memory_investigation_20260226_000809/warmup_24gb_global_20260226_014122

On Unsloth 2026.2.1 / transformers 4.56.2:

  • cap16_auto_unsloth2026_generic_names.log: PASS
    • GPU_RESERVED_GB_AFTER_LOAD 11.85
    • GPU_PEAK_RESERVED_GB_AFTER_LOAD 12.932
  • cap16_force_on_unsloth2026_generic_names.log: FAIL (expected control)
    • OutOfMemoryError with Tried to allocate 18.93 GiB
  • test_llama_compile_smoke.log: PASS

Hyphen-guard/load checks:

  • logs/hyphen_guard_verify_20260226_020925/repro_uninitialized_after_local_patch.log: PASS (LOAD_OK)

Version matrix (transformers==4.56.2):

  • Run dir: logs/version_matrix_gpt_oss_20260226_020432
  • old_2025_12 (unsloth==2025.12.10, unsloth-zoo==2025.12.8):
    • load_ok=true, train_ok=true, exception_type=null
    • losses=[1.165579, 4.171052, 3.063763]
    • grad_norms=[2.650939, inf, 1.8630026331731638e+17]
  • new_2026_2_1 (unsloth==2026.2.1, unsloth-zoo==2026.2.1):
    • load_ok=true, train_ok=true, exception_type=null
    • losses=[1.156882, 4.137518, 3.050902]
    • grad_norms=[2.590222, 5.900850786987662e+17, 9147114459174.867]
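
The grad norms in the matrix include non-finite and very large values. A small check such as the following makes that easy to compare across runs. The lists are copied from the matrix above; `flag_abnormal_grad_norms` and the `1e3` cutoff are arbitrary illustrative choices, not part of the PR.

```python
import math


def flag_abnormal_grad_norms(grad_norms, limit=1e3):
    """Return step indices whose grad norm is non-finite or above `limit`."""
    return [
        step
        for step, norm in enumerate(grad_norms)
        if not math.isfinite(norm) or norm > limit
    ]


runs = {
    "old_2025_12": [2.650939, float("inf"), 1.8630026331731638e+17],
    "new_2026_2_1": [2.590222, 5.900850786987662e+17, 9147114459174.867],
}
abnormal = {name: flag_abnormal_grad_norms(norms) for name, norms in runs.items()}
```

With these inputs, steps 1 and 2 are flagged for both versions, so the large norms appear on the old release as well as on this branch.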

Sanity check for active-device memory helpers:

  • imported patched module from installed wheel path and verified:
    • _get_active_accelerator_index present
    • _get_accelerator_total_memory_bytes / _get_effective_accelerator_memory_bytes return values

Notes

  • High-memory accelerators keep warmup by default (auto).
  • In this environment, the uninitialized-weights load exception did not reproduce on either 2025.12 or 2026.2.1; both loaded successfully.

@gemini-code-assist

Contributor

Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical enhancement to prevent Out-Of-Memory (OOM) errors during GPT-OSS 4-bit model loading on systems with limited GPU memory. By intelligently detecting available accelerator memory and conditionally skipping a large allocator warmup, it ensures models can load successfully in constrained environments. This change improves robustness without impacting performance on high-memory setups and provides user control over the new behavior.

Highlights

  • Guarded Allocator Warmup: A guarded wrapper was added around transformers.modeling_utils.caching_allocator_warmup specifically for GPT-OSS 4-bit loads.
  • Automatic Warmup Skipping: The warmup process will now automatically be skipped on accelerators with low effective memory (<= 20 GiB), including environments constrained by torch.cuda.set_per_process_memory_fraction(...).
  • Configurable Override: An environment variable UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP was introduced, allowing users to explicitly set warmup behavior to 'on', 'off', or 'auto'.
  • Unchanged Behavior for High-Memory/Non-GPT-OSS: Warmup behavior remains unchanged for devices with larger memory capacities and for model paths not related to GPT-OSS.

Changelog
  • unsloth_zoo/temporary_patches/gpt_oss.py
    • Added helper functions _get_accelerator_total_memory_bytes and _get_effective_accelerator_memory_bytes to accurately determine the total and effectively available GPU memory.
    • Implemented _should_skip_transformers_allocator_warmup_for_gpt_oss to evaluate conditions for skipping the allocator warmup, considering model name, 4-bit loading, effective memory, and an environment variable override.
    • Introduced patch_transformers_caching_allocator_warmup_for_gpt_oss to dynamically wrap the transformers.modeling_utils.caching_allocator_warmup function, applying the new skip logic.
    • Registered the new patching function with TEMPORARY_PATCHES to ensure it is applied during runtime.
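
The wrapping approach described in the changelog can be sketched without importing transformers. This is a sketch under stated assumptions: `install_guard` and the `__guarded__` marker are hypothetical names, and a `SimpleNamespace` stands in for `transformers.modeling_utils`. The marker attribute keeps the patch idempotent when the patch function runs more than once.

```python
import functools
import types


def install_guard(module, should_skip):
    """Wrap module.caching_allocator_warmup once; skip the call when should_skip() is true."""
    original = module.caching_allocator_warmup
    if getattr(original, "__guarded__", False):
        return  # already wrapped, nothing to do

    @functools.wraps(original)
    def guarded(*args, **kwargs):
        if should_skip():
            return None
        return original(*args, **kwargs)

    guarded.__guarded__ = True
    module.caching_allocator_warmup = guarded


# Stand-in module so the sketch runs without transformers installed.
calls = []
fake_module = types.SimpleNamespace(
    caching_allocator_warmup=lambda *args, **kwargs: calls.append(args)
)
skip = {"value": True}

install_guard(fake_module, lambda: skip["value"])
install_guard(fake_module, lambda: skip["value"])  # second install is a no-op

fake_module.caching_allocator_warmup("model", {}, None)  # skipped
skip["value"] = False
fake_module.caching_allocator_warmup("model", {}, None)  # forwarded to the original
```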

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a mechanism to guard the transformers.modeling_utils.caching_allocator_warmup for GPT-OSS 4-bit loads on low-memory GPUs. The changes are well-structured and address the OOM issue described. The addition of an environment variable for override is a good feature. My main feedback is to refactor one of the new functions to improve readability and maintainability by extracting magic values into constants. Overall, this is a solid improvement.

Comment on lines +1126 to +1141
    model_name = os.environ.get("UNSLOTH_MODEL_NAME", "").replace("-", "_")
    if "gpt_oss" not in model_name:
        return False
    if "_load_in_4bit_" not in model_name:
        return False

    mode = os.environ.get("UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP", "auto").strip().lower()
    if mode in ("off", "disable", "0", "false"):
        return True
    if mode in ("on", "enable", "1", "true"):
        return False

    total_memory = _get_effective_accelerator_memory_bytes()
    if total_memory is None:
        return False
    return total_memory <= int(20 * 1024**3)

Contributor

medium

This function can be refactored for better readability and maintainability:

  1. The checks for model_name can be combined into a single if statement for conciseness.
  2. The magic strings for mode checking and the magic number for the memory threshold can be extracted into constants within the function's scope. Using sets for mode checking is also slightly more efficient for lookups.
  3. Adding comments to explain the logic for "auto" mode and the memory threshold would improve clarity for future maintainers.

Here is a suggested refactoring:

    model_name = os.environ.get("UNSLOTH_MODEL_NAME", "").replace("-", "_")
    if "gpt_oss" not in model_name or "_load_in_4bit_" not in model_name:
        return False

    mode = os.environ.get("UNSLOTH_GPT_OSS_ALLOCATOR_WARMUP", "auto").strip().lower()

    _WARMUP_OFF_MODES = {"off", "disable", "0", "false"}
    if mode in _WARMUP_OFF_MODES:
        return True

    _WARMUP_ON_MODES = {"on", "enable", "1", "true"}
    if mode in _WARMUP_ON_MODES:
        return False

    # Auto mode: skip on low memory devices
    total_memory = _get_effective_accelerator_memory_bytes()
    if total_memory is None:
        return False

    # 20 GiB threshold for low-memory devices
    _LOW_MEMORY_THRESHOLD_BYTES = 20 * 1024**3
    return total_memory <= _LOW_MEMORY_THRESHOLD_BYTES

Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d66df48dba

    try:
        if DEVICE_TYPE == "xpu":
            return int(torch.xpu.memory.mem_get_info(0)[-1])
        return int(torch.cuda.memory.mem_get_info(0)[-1])

P2: Detect memory on the active accelerator, not device 0

The auto-skip decision always reads memory from accelerator index 0, so in multi-GPU or non-default-device runs it can evaluate the wrong device and keep warmup enabled even when the actual target device is low-memory/capped. In particular, if the model is loaded on cuda:1 (or another mapped device), this guard may miss the OOM condition it was added to prevent. The check should use the current/target device (or derive the relevant indices from expanded_device_map) instead of hard-coding 0.
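
One way to derive the relevant indices from a device map, as suggested above, is sketched here. `device_indices_from_map` is a hypothetical helper, and the accepted value formats (integers, strings such as `"cuda:1"`, or objects whose string form looks like that) are assumptions about what `expanded_device_map` may contain.

```python
def device_indices_from_map(device_map, default_index=0):
    """Collect accelerator indices referenced by a device map.

    Accepts values such as 1, "cuda:1", "xpu:0" or "cuda"; entries such as
    "cpu" or "disk" are ignored. Hypothetical helper, not part of the PR.
    """
    indices = set()
    for target in device_map.values():
        if isinstance(target, bool):
            continue  # guard against bools being treated as ints
        if isinstance(target, int):
            indices.add(target)
            continue
        kind, _, index = str(target).partition(":")
        if kind in ("cuda", "xpu"):
            # A bare "cuda" has no index, so fall back to the caller's default.
            indices.add(int(index) if index.isdigit() else default_index)
    return sorted(indices)


indices = device_indices_from_map(
    {"model.embed_tokens": "cuda:1", "lm_head": 1, "offload": "cpu"}
)
```

The memory probe could then be evaluated per returned index (for example, taking the minimum) instead of always reading index 0.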

Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
Comment thread unsloth_zoo/temporary_patches/gpt_oss.py Fixed
@danielhanchen

Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a valuable safeguard against out-of-memory errors on devices with limited VRAM by conditionally skipping the allocator warmup. The logic for detecting low-memory environments is sound, and providing an environment variable for override is a good practice. Additionally, the refactoring to normalize the model name from the environment variable improves code consistency and fixes a potential bug with hyphenated model names. The changes are well-structured and enhance the robustness of the patching mechanism. I have a few minor suggestions to improve maintainability.

        if DEVICE_TYPE == "xpu":
            return int(torch.xpu.memory.mem_get_info(0)[-1])
        return int(torch.cuda.memory.mem_get_info(0)[-1])
    except Exception:

Contributor

medium

Catching a broad Exception can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions that you expect to handle, such as RuntimeError, ImportError, or AttributeError in this context. This makes the code's intent clearer and more robust against unrelated issues.

Suggested change

    - except Exception:
    + except (RuntimeError, ImportError, AttributeError):

        fraction = float(torch.cuda.get_per_process_memory_fraction(0))
        if 0.0 < fraction < 1.0:
            return int(total_memory * fraction)
    except Exception:

Contributor

medium

Similar to the previous comment, catching a broad Exception is discouraged. For torch.cuda.get_per_process_memory_fraction, it's better to specifically handle expected exceptions like RuntimeError or NotImplementedError to avoid masking other potential bugs.

Suggested change

    - except Exception:
    + except (RuntimeError, NotImplementedError):

Comment on lines +1119 to +1136

    def _should_skip_transformers_allocator_warmup() -> bool:
        """
        Skip transformers allocator warmup on low-memory accelerators.

        `caching_allocator_warmup` can allocate large single chunks before weights
        are loaded, which can OOM constrained GPUs.
        """
        mode = os.environ.get("UNSLOTH_ALLOCATOR_WARMUP", "").strip().lower()
        if mode in ("off", "disable", "0", "false"):
            return True
        if mode in ("on", "enable", "1", "true"):
            return False

        total_memory = _get_effective_accelerator_memory_bytes()
        if total_memory is None:
            return False
        return total_memory <= int(24 * 1024**3)

Contributor

medium

To improve readability and maintainability, it's a good practice to define magic numbers like the memory threshold as a named constant. This makes the code easier to understand and modify in the future.

Suggested change

    +_LOW_MEMORY_THRESHOLD_BYTES = 24 * 1024**3
     def _should_skip_transformers_allocator_warmup() -> bool:
         """
         Skip transformers allocator warmup on low-memory accelerators.
         `caching_allocator_warmup` can allocate large single chunks before weights
         are loaded, which can OOM constrained GPUs.
         """
         mode = os.environ.get("UNSLOTH_ALLOCATOR_WARMUP", "").strip().lower()
         if mode in ("off", "disable", "0", "false"):
             return True
         if mode in ("on", "enable", "1", "true"):
             return False
         total_memory = _get_effective_accelerator_memory_bytes()
         if total_memory is None:
             return False
    -    return total_memory <= int(24 * 1024**3)
    +    return total_memory <= _LOW_MEMORY_THRESHOLD_BYTES

@danielhanchen

Member Author

Triage update:

  • Addressed functional review point about hardcoded accelerator index 0 in memory probes. Warmup and combo-kernel memory checks now use the active accelerator index.
  • Kept allocator override env var as UNSLOTH_ALLOCATOR_WARMUP only.
  • Kept broad exception handling in the memory probe helpers intentionally for robustness/fallback behavior on diverse runtimes.
  • Did not apply repeated explicit/implicit return style suggestions since behavior is intentional and unchanged.

Implemented in commit 135d6ad.
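
For reference, the effective-memory calculation that feeds the threshold check can be sketched as a pure function. `effective_memory_bytes` is a hypothetical name. The `0.0 < fraction < 1.0` condition mirrors the diff, and the total and fraction are passed in here rather than read from `torch.cuda` for the active device index.

```python
GiB = 1024**3


def effective_memory_bytes(total_bytes, fraction):
    """Scale total memory by a per-process fraction when one is set.

    Any fraction outside (0.0, 1.0), including None for "no cap",
    leaves the total unchanged. None totals (probe failed) pass through.
    """
    if total_bytes is None:
        return None
    if fraction is not None and 0.0 < fraction < 1.0:
        return int(total_bytes * fraction)
    return total_bytes


capped = effective_memory_bytes(32 * GiB, 0.5)    # 32 GiB device capped to half
uncapped = effective_memory_bytes(32 * GiB, 1.0)  # no effective cap
```

In this example the capped device reports 16 GiB of effective memory, which falls under a 24 GiB threshold even though the physical device does not.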

    return total_memory <= _LOW_MEMORY_ACCELERATOR_BYTES


    def patch_transformers_caching_allocator_warmup():

    if hasattr(warmup_fn, "__unsloth_gpt_oss_guarded__"):
        return

    def guarded_caching_allocator_warmup(model, expanded_device_map, hf_quantizer):

    return 0

    if hasattr(torch, "cuda") and hasattr(torch.cuda, "current_device"):
        return int(torch.cuda.current_device())
    except Exception:

    fraction = float(torch.cuda.get_per_process_memory_fraction(device_index))
    if 0.0 < fraction < 1.0:
        return int(total_memory * fraction)
    except Exception:

@danielhanchen

Member Author

Additional follow-up on #521 before final matrix runs:

  • Added compiler decorator stripping for @use_kernel_forward_from_hub(...) and @auto_docstring in class rewrite path.
  • This removes the previously observed unknown-decorator warning noise for GPT-OSS/Llama compile paths.
  • Kept unknown-decorator warnings for truly unrecognized decorators.

Commit included on this branch: eaf1f4a (with prior 684f684).
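
A rough illustration of decorator stripping on class source text follows. It is a regex-based sketch, not the compiler's actual rewrite logic: `strip_known_decorators` is a hypothetical name, the `"RMSNorm"` argument is made up for the example, and only single-line decorator uses are handled.

```python
import re

# Decorator names taken from the comment above.
_STRIP = ("use_kernel_forward_from_hub", "auto_docstring")
_PATTERN = re.compile(
    r"^[ \t]*@(?:%s)\b(?:\([^\n]*\))?[ \t]*\n" % "|".join(_STRIP),
    flags=re.MULTILINE,
)


def strip_known_decorators(source):
    """Remove single-line uses of the listed decorators from source text."""
    return _PATTERN.sub("", source)


source = (
    '@use_kernel_forward_from_hub("RMSNorm")\n'
    "@auto_docstring\n"
    "@some_other_decorator\n"
    "class Block:\n"
    "    pass\n"
)
stripped = strip_known_decorators(source)
```

Decorators outside the list, such as `@some_other_decorator` here, are left in place, so a warning for truly unrecognized decorators can still fire.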

@Datta0

Datta0 commented Feb 26, 2026

Collaborator

What were you running that gave grad norms like 5.900850786987662e+17? Unless the data is purely random, we should probably investigate this a little more.

        return 0


    def _get_accelerator_total_memory_bytes():

Collaborator

NIT: We should put these in utils.py (away from temporary_patches), as they could be reused.
Also, IIRC there was a PR about DeviceContext that might handle this better:
unslothai/unsloth#3875

    if DEVICE_TYPE != "xpu" and hasattr(torch.cuda, "get_per_process_memory_fraction"):
        try:
            device_index = _get_active_accelerator_index()
            fraction = float(torch.cuda.get_per_process_memory_fraction(device_index))

Collaborator

Interesting. Are we restricting this somewhere else, or is this to respect a hypothetical user-set limit?

         Set UNSLOTH_GPT_OSS_BNB4BIT_DISABLE=1 to force BF16 path.
         """
    -    if "gpt_oss" not in os.environ.get("UNSLOTH_MODEL_NAME", ""):
    +    if "gpt_oss" not in _normalized_unsloth_model_name():

Collaborator

Some of this repeats #519. We should try to consolidate to avoid merge conflicts.

Comment thread unsloth_zoo/compiler.py
        logger.info(f'Unsloth: stripped use_experts_implementation decorator from {module}')
        if (
            "use_experts_implementation" in stripped
            or "use_kernel_forward_from_hub" in stripped

Collaborator

Actually nice. Never noticed this. Seems like a very recent change.
But this function strips the decorator. Why not keep the decorator and use the kernel?
