
Fix tokenizer guard, ModernBERT attention, gpt_oss MoE unwrap #472

Merged
danielhanchen merged 18 commits into main from fix/transformers-4.57-notebook-compat
Feb 9, 2026

Conversation

@danielhanchen
Member

Summary

Fixes notebook failures for transformers 4.57.6 + TRL 0.22-0.27, companion to unslothai/unsloth#3998.

Changes

  • Tokenizer None guard (tokenizer_utils.py): Return early from patch_tokenizer if tokenizer is None, and guard the inner tokenizer unwrap when processor.tokenizer is None. This prevents crashes when VLM processors (like ERNIE VL) have a None tokenizer during loading.
  • ModernBERT attention mask (temporary_patches/misc.py): Add patch_modernbert_attention_mask() to fix stride alignment issues in SDPA backward pass with torch.compile. The _update_attention_mask uses .expand() which creates non-contiguous strides not aligned to multiples of 4, causing reinterpret_tensor errors in the inductor backward graph. Fix: make masks contiguous before they enter compiled regions.
  • gpt_oss ParamWrapper unwrap (temporary_patches/gpt_oss.py): Unwrap PEFT ParamWrapper from MoE experts before accessing hidden_size in both GptOssMLP.forward() and the model inference forward path. ParamWrapper (from peft.tuners.lora.layer) exposes what it wraps via the base_layer attribute; the unwrap checks base_layer, module, _module in order.
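
The ModernBERT fix in the list above amounts to wrapping `_update_attention_mask` so that both returned masks are made contiguous. The following is a minimal sketch of that wrapping pattern only. `FakeMask` and `FakeModel` are hypothetical stand-ins so the example runs without torch or transformers; the actual patch in misc.py may differ in its details.

```python
import functools


class FakeMask:
    """Hypothetical stand-in for a tensor; only tracks contiguity."""

    def __init__(self, contiguous):
        self._contiguous = contiguous

    def is_contiguous(self):
        return self._contiguous

    def contiguous(self):
        # Mirrors torch.Tensor.contiguous(): return self when already
        # contiguous, otherwise return a contiguous copy.
        return self if self._contiguous else FakeMask(True)


class FakeModel:
    """Hypothetical stand-in for ModernBertModel."""

    def _update_attention_mask(self, attention_mask, output_attentions=False):
        # Pretend both masks came out of .expand() and are non-contiguous.
        return FakeMask(False), FakeMask(False)


def patch_update_attention_mask(model_cls):
    """Wrap the class method so both returned masks are contiguous."""
    original = model_cls._update_attention_mask

    @functools.wraps(original)
    def _update_attention_mask_contiguous(self, *args, **kwargs):
        global_attention_mask, sliding_window_mask = original(self, *args, **kwargs)
        return global_attention_mask.contiguous(), sliding_window_mask.contiguous()

    model_cls._update_attention_mask = _update_attention_mask_contiguous


patch_update_attention_mask(FakeModel)
global_mask, sliding_mask = FakeModel()._update_attention_mask(attention_mask=None)
print(global_mask.is_contiguous(), sliding_mask.is_contiguous())
```

Because the wrapper only post-processes the return value, the original mask construction logic stays untouched.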

Test Results

Tested with all 125 notebooks on 8x B200 GPUs:

  • 92/125 pass with TRL 0.22.2 (no regressions)
  • 94/125 pass with TRL 0.27.1 (no regressions)
  • gpt_oss 20B GRPO, RL 2048, A100 GRPO, RL DGX Spark: all pass after ParamWrapper fix
  • ModernBERT sentence transformer training: pass after attention mask fix

Companion PR

unslothai/unsloth#3998

Datta0 and others added 7 commits February 6, 2026 15:10
Fixes notebook failures for transformers 4.57.6 + TRL 0.22-0.27.

Tokenizer None guard (tokenizer_utils.py):
- Return early from patch_tokenizer if tokenizer is None (some VLM
  processors like ERNIE VL may have None tokenizer during loading)
- Guard inner tokenizer unwrap when processor.tokenizer is None

ModernBERT attention mask fix (temporary_patches/misc.py):
- Add patch_modernbert_attention_mask() to fix stride alignment issues
  in SDPA backward pass with torch.compile
- The _update_attention_mask uses .expand() which creates non-contiguous
  strides not aligned to multiples of 4, causing reinterpret_tensor
  errors in the inductor backward graph
- Fix: make masks contiguous before they enter compiled regions

gpt_oss ParamWrapper unwrap (temporary_patches/gpt_oss.py):
- Unwrap PEFT ParamWrapper from MoE experts before accessing
  hidden_size attribute in both GptOssMLP.forward() and the model
  inference forward path
- ParamWrapper (from peft.tuners.lora.layer) wraps nn.Parameter via
  base_layer attribute; check base_layer, module, _module in order

Tested with all 125 notebooks: no regressions on TRL 0.22.2 or 0.27.1.
@gemini-code-assist
Contributor

Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the stability and compatibility of the system by addressing critical issues related to tokenizer handling, attention mask generation in ModernBERT models, and interaction with PEFT-wrapped MoE layers in GPT-OSS. The changes aim to prevent crashes and ensure correct behavior across various configurations, particularly when integrating with different transformers and TRL versions, and specialized model architectures.

Highlights

  • Tokenizer Guarding: Implemented robust checks in patch_tokenizer to handle None tokenizer instances and scenarios where processor.tokenizer is None, preventing crashes, especially with VLM processors.
  • ModernBERT Attention Mask Fix: Introduced a new patch for ModernBERT models to ensure attention masks are contiguous. This resolves stride alignment issues encountered in the SDPA backward pass when using torch.compile.
  • GPT-OSS MoE ParamWrapper Unwrap: Added logic to recursively unwrap PEFT ParamWrapper from Mixture-of-Experts (MoE) layers in GPT-OSS models. This ensures correct access to the hidden_size attribute during both GptOssMLP.forward() and model inference.

Changelog
  • unsloth_zoo/temporary_patches/gpt_oss.py
    • Added logic within GptOssMLP.forward to unwrap ParamWrapper from self.experts to correctly access the hidden_size attribute.
    • Applied similar ParamWrapper unwrapping logic to decoder_layer.mlp.experts in the main model forward method to ensure proper hidden_size access during inference.
  • unsloth_zoo/temporary_patches/misc.py
    • Introduced patch_modernbert_attention_mask to override ModernBertModel._update_attention_mask. This ensures that global_attention_mask and sliding_window_mask are made contiguous, resolving stride alignment issues for SDPA backward passes with torch.compile.
  • unsloth_zoo/tokenizer_utils.py
    • Added an early return in patch_tokenizer if the input tokenizer is None, preventing errors when VLM processors lack a tokenizer.
    • Modified the logic that unwraps an inner tokenizer (tokenizer.tokenizer) to also check if the inner tokenizer is None, returning the original tokenizer in such cases to prevent further issues.
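
The two tokenizer guards in the changelog above can be sketched roughly as follows. This is an illustration under assumptions: `guard_tokenizer` is a hypothetical single-argument helper, not the real `patch_tokenizer` signature, and the `patched` flag is a placeholder for whatever patching the real function performs.

```python
from types import SimpleNamespace


def guard_tokenizer(tokenizer):
    """Hypothetical simplification of the None guards described above."""
    # Guard 1: nothing to patch when no tokenizer was loaded at all.
    if tokenizer is None:
        return None

    # Processors expose the real tokenizer as `.tokenizer`; plain
    # tokenizers have no such attribute, so fall back to the object itself.
    inner = getattr(tokenizer, "tokenizer", tokenizer)

    # Guard 2: a processor whose inner tokenizer is None is returned as-is.
    if inner is None:
        return tokenizer

    inner.patched = True  # placeholder for the real patching logic
    return tokenizer


processor_without_tokenizer = SimpleNamespace(tokenizer=None)
processor_with_tokenizer = SimpleNamespace(tokenizer=SimpleNamespace())
plain_tokenizer = SimpleNamespace()

for candidate in (None, processor_without_tokenizer, processor_with_tokenizer, plain_tokenizer):
    guard_tokenizer(candidate)
```

Only the processor with a real inner tokenizer and the plain tokenizer end up patched; the other two inputs pass through unchanged.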
Activity
  • No specific activity (comments, reviews, or progress updates) has been recorded for this pull request since its creation.

Comment thread unsloth_zoo/temporary_patches/misc.py Fixed
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several important fixes for notebook compatibility. The changes include adding a None guard for the tokenizer in tokenizer_utils.py to handle VLM processors, patching ModernBERT's attention mask in misc.py to ensure contiguity for SDPA compatibility, and unwrapping PEFT's ParamWrapper for MoE experts in gpt_oss.py. The changes are well-implemented and address the described issues. I have one suggestion regarding code duplication in gpt_oss.py to improve maintainability.

Comment on lines +1245 to +1251
# Unwrap ParamWrapper from experts if needed (PEFT LoRA wraps modules)
_experts = decoder_layer.mlp.experts
for _attr in ("base_layer", "module", "_module"):
    while not hasattr(_experts, "hidden_size") and hasattr(_experts, _attr):
        _experts = getattr(_experts, _attr)
if _experts is not decoder_layer.mlp.experts:
    decoder_layer.mlp.experts = _experts

medium

This logic to unwrap the PEFT ParamWrapper is duplicated from GptOssMLP.forward (lines 644-651). To improve maintainability and reduce code duplication, consider extracting this logic into a helper function.

For example, you could define a helper function like this:

def _unwrap_peft_module(module):
    """Recursively unwraps a module from PEFT wrappers until 'hidden_size' is found."""
    if hasattr(module, "hidden_size"):
        return module

    _m = module
    for attr in ("base_layer", "module", "_module"):
        while not hasattr(_m, "hidden_size") and hasattr(_m, attr):
            _m = getattr(_m, attr)
    return _m

Then you could simplify the code in both places to:

decoder_layer.mlp.experts = _unwrap_peft_module(decoder_layer.mlp.experts)

And in GptOssMLP.forward:

self.experts = _unwrap_peft_module(self.experts)


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d27d339225

Comment on lines +649 to +651
        _e = getattr(_e, _attr)
if _e is not self.experts:
    self.experts = _e

P1: Avoid replacing LoRA ParamWrapper with base layer

When the PEFT ParamWrapper lacks hidden_size, this code permanently replaces self.experts with the unwrapped base layer, which bypasses the wrapper’s LoRA logic in subsequent forwards. That silently disables LoRA adapters for gpt_oss models using PEFT, changing model outputs/training despite adapters being loaded. Consider reading hidden_size from a temporary unwrapped view (or caching just that attribute) without overwriting self.experts.
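
One way to read the suggestion above is to look up hidden_size through a temporary unwrapped view and leave the wrapper in place. The sketch below assumes exactly that; `find_hidden_size`, `Experts`, `Wrapper`, and `MLP` are hypothetical names for illustration, and the value 64 is arbitrary.

```python
def find_hidden_size(experts):
    """Return hidden_size from the innermost wrapped object without mutating anything."""
    view = experts  # temporary view; the caller's attribute is never reassigned
    for attr in ("base_layer", "module", "_module"):
        while not hasattr(view, "hidden_size") and hasattr(view, attr):
            view = getattr(view, attr)
    return getattr(view, "hidden_size", None)


class Experts:
    """Hypothetical stand-in for the MoE experts module."""
    hidden_size = 64


class Wrapper:
    """Hypothetical stand-in for a PEFT-style wrapper exposing base_layer."""

    def __init__(self, base_layer):
        self.base_layer = base_layer


class MLP:
    def __init__(self, experts):
        self.experts = experts


mlp = MLP(Wrapper(Wrapper(Experts())))
wrapped = mlp.experts
hidden_size = find_hidden_size(mlp.experts)
print(hidden_size, mlp.experts is wrapped)
```

Since `mlp.experts` still points at the outer wrapper afterwards, any adapter logic living in the wrapper would continue to run on later forward passes.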


Datta0 and others added 4 commits February 8, 2026 10:27
Probe causal_conv1d CUDA kernels at startup and force the PyTorch slow
path when they fail (e.g. sm_100 on B200). Uses identity checks against
the original function objects to avoid clobbering vllm's independent
Triton-based implementations. Dynamically scans sys.modules instead of
hardcoding model module lists, so new models like qwen3_next and
mamba_ssm are automatically covered.
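
A rough sketch of the identity-check idea in the commit message above: scan the loaded modules and clear a reference only when it is the very same function object as the original CUDA kernel. `disable_matching_refs` and the fake modules are hypothetical; the real patch also probes the kernels on the GPU first, which is omitted here.

```python
import sys
import types


def disable_matching_refs(original_fn, attr_name, modules=None):
    """Set `attr_name` to None in every module that still points at `original_fn`."""
    modules = sys.modules if modules is None else modules
    patched = []
    for name, mod in list(modules.items()):
        if mod is None:
            continue
        try:
            current = getattr(mod, attr_name, None)
        except Exception:
            continue  # some lazily loaded modules raise on attribute access
        # Identity check: independent implementations with the same
        # attribute name are left alone.
        if current is original_fn:
            setattr(mod, attr_name, None)
            patched.append(name)
    return patched


def cuda_kernel(x):
    return x


def triton_kernel(x):
    return x


model_a = types.ModuleType("fake_model_a")
model_a.causal_conv1d_fn = cuda_kernel
vllm_like = types.ModuleType("fake_vllm_ops")
vllm_like.causal_conv1d_fn = triton_kernel

fake_modules = {"fake_model_a": model_a, "fake_vllm_ops": vllm_like}
patched = disable_matching_refs(cuda_kernel, "causal_conv1d_fn", fake_modules)
print(patched)
```

Scanning a module table instead of a hardcoded list is what lets newly added model modules be covered without further changes.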
Comment thread unsloth_zoo/temporary_patches/misc.py Fixed
…ers-4.57-notebook-compat

# Conflicts:
#	unsloth_zoo/temporary_patches/gpt_oss.py
…FT dispatch fix, FP8Linear device, VLM dataset tokenization, push_to_hub_token and DPO vision mapping patches

- tokenizer_utils.py: Use getattr for pad_token_id to handle missing attr
- bitsandbytes.py: Guard fix_4bit_weight on packed weight shape
- misc.py: Add patch_peft_dispatch_bnb_4bit for compress_statistics AttributeError
- misc.py: Add patch_trl_push_to_hub_token to ensure to_dict() includes it
- misc.py: Add patch_trl_vision_model_mapping for DPO on TRL 0.22.x
- vllm_utils.py: Version-gate FP8Linear device kwarg
- dataset_utils.py: Add _maybe_tokenize_dataset for VLM skip_prepare_dataset
…ormers auto module

Inject MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES as an alias of MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
into transformers.models.auto.modeling_auto before TRL imports it. This allows TRL 0.22.x's bare import
to succeed on transformers 5.0+ without needing to modify installed TRL files.
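
The alias injection described above can be illustrated with a stand-in module. This is a sketch, not the actual patch: `inject_alias` and `fake_modeling_auto` are hypothetical, and the mapping contents are made up. In the real fix the target would be transformers.models.auto.modeling_auto.

```python
import types


def inject_alias(module, alias_name, source_name):
    """Expose `source_name` under `alias_name` unless the alias already exists."""
    if hasattr(module, alias_name):
        return False  # never overwrite a name the library still provides
    if not hasattr(module, source_name):
        return False  # nothing to alias
    setattr(module, alias_name, getattr(module, source_name))
    return True


# Hypothetical stand-in for the auto-model module after the old name was removed.
fake_modeling_auto = types.ModuleType("fake_modeling_auto")
fake_modeling_auto.MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = {
    "some_model": "SomeModelForConditionalGeneration",
}

injected = inject_alias(
    fake_modeling_auto,
    "MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES",
    "MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES",
)
print(injected)
```

The alias has to be in place before the dependent package runs its import, because a bare `from ... import NAME` fails at import time if the name is missing.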
    if isinstance(dataset, IterableDataset):
        _map_kwargs = {"batched": True}
    return dataset.map(_tokenize_fn, **_map_kwargs)
pass
…ict change

transformers 5.0.0 changed apply_chat_template(tokenize=True) to default
return_dict=True, returning BatchEncoding instead of list[int]. vLLM's
safe_apply_chat_template doesn't pass return_dict=False, causing TypeError
in _validate_model_input when max(BatchEncoding) yields a string key.

Patch wraps the original function to inject return_dict=False when tokenize=True.
Version-gated to transformers >= 5.0.0, no-op if vLLM is not installed.
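
The wrapping approach described above might look roughly like this. It is a sketch under assumptions: `force_list_output` and `fake_apply_chat_template` are hypothetical, the fake function only mimics the return_dict=True default mentioned in the commit message, and the version gate and vLLM lookup are left out. The sketch also only handles `tokenize` passed as a keyword argument.

```python
import functools


def force_list_output(apply_chat_template):
    """Wrap a chat-template function so tokenized output stays a plain list."""

    @functools.wraps(apply_chat_template)
    def wrapper(*args, **kwargs):
        # Inject return_dict=False only when tokenizing and only when the
        # caller did not choose a value explicitly.
        if kwargs.get("tokenize", True) and "return_dict" not in kwargs:
            kwargs["return_dict"] = False
        return apply_chat_template(*args, **kwargs)

    return wrapper


def fake_apply_chat_template(conversation, tokenize=True, return_dict=True):
    """Hypothetical stand-in with a dict-returning default."""
    token_ids = [1, 2, 3]
    if not tokenize:
        return "rendered text"
    return {"input_ids": token_ids} if return_dict else token_ids


patched = force_list_output(fake_apply_chat_template)
print(patched([]))
print(max(fake_apply_chat_template([])))  # iterating a dict yields its string keys
```

The second print shows the failure mode in miniature: taking `max` over a dict-like result returns a key such as "input_ids" rather than a token id.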
…without quant_state

compiler.py: When UNSLOTH_COMPILE_OVERWRITE=0 is set, check if the cached
file's transformers version differs from the current one. If so, force a
recompile instead of silently using stale compiled cache.

bitsandbytes.py: Guard Linear4bit.forward against layers with no quant_state
(not quantized) by falling back to regular F.linear. Use local quant_state
variable in the matmul_4bit call.
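
The compiler.py change described above can be pictured as a small decision function. This is a hedged sketch: `version_check_forces_recompile` is a hypothetical name, how the cached file records its transformers version is not shown in this PR text, and treating an unset variable as "1" is an assumption.

```python
import os


def version_check_forces_recompile(cached_version, current_version, environ=None):
    """Return True when the stale-cache check should force a recompile."""
    environ = os.environ if environ is None else environ
    # Assumption: an unset variable behaves like "1" (overwrite enabled).
    overwrite_disabled = environ.get("UNSLOTH_COMPILE_OVERWRITE", "1") == "0"
    if not overwrite_disabled:
        return False  # the check only applies when overwriting is turned off
    # Cache reuse was requested, but the cache came from another version.
    return cached_version != current_version


stale = version_check_forces_recompile(
    "4.56.2", "4.57.6", {"UNSLOTH_COMPILE_OVERWRITE": "0"}
)
fresh = version_check_forces_recompile(
    "4.57.6", "4.57.6", {"UNSLOTH_COMPILE_OVERWRITE": "0"}
)
print(stale, fresh)
```

Passing the environment as a parameter keeps the check easy to exercise without mutating `os.environ`.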
…ersion check

compiler.py: Switch logger.warning_once to print for the OVERWRITE=0
version mismatch message.

vllm_utils.py: Use Version("transformers") instead of importing the
module and reading __version__ manually.
gpt_oss.py: Reset to main since PR #471 (fix_gpt_oss2) handles all MoE fixes.
compiler.py: Remove torch_compile/KWARGS_TYPE import hunk (added by #471),
keep the OVERWRITE version-mismatch recompile logic which is unique to this PR.
Also keep main's data-dependent compile check (.nonzero/.tolist/.item).
        return global_attention_mask, sliding_window_mask

    ModernBertModel._update_attention_mask = _update_attention_mask_contiguous
pass
    from causal_conv1d import causal_conv1d_update
except ImportError:
    return # Package not installed, transformers already handles this
pass

if causal_conv1d_fn is None:
    return # Already nullified
pass

if not torch.cuda.is_available():
    return
pass
    return # CUDA kernels work fine
except Exception:
    pass # Fall through to disable
pass
    mod.causal_conv1d_fn = None
    if hasattr(mod, "causal_conv1d_update"):
        mod.causal_conv1d_update = None
    pass
    if hasattr(mod, "causal_conv1d_update"):
        mod.causal_conv1d_update = None
    pass
pass
    transformers.utils.import_utils.is_causal_conv1d_available = lambda: False
except Exception:
    pass
pass
@danielhanchen danielhanchen merged commit 6c3d65e into main Feb 9, 2026
3 checks passed

3 participants