Fix tokenizer guard, ModernBERT attention, gpt_oss MoE unwrap by danielhanchen · Pull Request #472 · unslothai/unsloth-zoo

danielhanchen · 2026-02-07T21:09:05Z

Summary

Fixes notebook failures for transformers 4.57.6 + TRL 0.22-0.27, companion to unslothai/unsloth#3998.

Changes

Tokenizer None guard (tokenizer_utils.py): Return early from patch_tokenizer if tokenizer is None. Guard inner tokenizer unwrap when processor.tokenizer is None. Prevents crashes when VLM processors (like ERNIE VL) have None tokenizer during loading.
ModernBERT attention mask (temporary_patches/misc.py): Add patch_modernbert_attention_mask() to fix stride alignment issues in SDPA backward pass with torch.compile. The _update_attention_mask uses .expand() which creates non-contiguous strides not aligned to multiples of 4, causing reinterpret_tensor errors in the inductor backward graph. Fix: make masks contiguous before they enter compiled regions.
gpt_oss ParamWrapper unwrap (temporary_patches/gpt_oss.py): Unwrap PEFT ParamWrapper from MoE experts before accessing hidden_size in both GptOssMLP.forward() and model inference forward. ParamWrapper (from peft.tuners.lora.layer.LoraLayer) wraps modules via base_layer attribute; check base_layer, module, _module in order.

Test Results

Tested with all 125 notebooks on 8x B200 GPUs:

92/125 pass with TRL 0.22.2 (no regressions)
94/125 pass with TRL 0.27.1 (no regressions)
gpt_oss 20B GRPO, RL 2048, A100 GRPO, RL DGX Spark: all pass after ParamWrapper fix
ModernBERT sentence transformer training: pass after attention mask fix

Companion PR

Fix notebook compatibility for transformers 4.57.6 and TRL 0.22-0.27 unsloth#3998: warning suppression, VLM tokenizer loading, trainer compat, _is_vlm detection

Fixes notebook failures for transformers 4.57.6 + TRL 0.22-0.27. Tokenizer None guard (tokenizer_utils.py): - Return early from patch_tokenizer if tokenizer is None (some VLM processors like ERNIE VL may have None tokenizer during loading) - Guard inner tokenizer unwrap when processor.tokenizer is None ModernBERT attention mask fix (temporary_patches/misc.py): - Add patch_modernbert_attention_mask() to fix stride alignment issues in SDPA backward pass with torch.compile - The _update_attention_mask uses .expand() which creates non-contiguous strides not aligned to multiples of 4, causing reinterpret_tensor errors in the inductor backward graph - Fix: make masks contiguous before they enter compiled regions gpt_oss ParamWrapper unwrap (temporary_patches/gpt_oss.py): - Unwrap PEFT ParamWrapper from MoE experts before accessing hidden_size attribute in both GptOssMLP.forward() and the model inference forward path - ParamWrapper (from peft.tuners.lora.layer) wraps nn.Parameter via base_layer attribute; check base_layer, module, _module in order Tested with all 125 notebooks: no regressions on TRL 0.22.2 or 0.27.1.

gemini-code-assist · 2026-02-07T21:09:22Z

Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the stability and compatibility of the system by addressing critical issues related to tokenizer handling, attention mask generation in ModernBERT models, and interaction with PEFT-wrapped MoE layers in GPT-OSS. The changes aim to prevent crashes and ensure correct behavior across various configurations, particularly when integrating with different transformers and TRL versions, and specialized model architectures.

Highlights

Tokenizer Guarding: Implemented robust checks in patch_tokenizer to handle None tokenizer instances and scenarios where processor.tokenizer is None, preventing crashes, especially with VLM processors.
ModernBERT Attention Mask Fix: Introduced a new patch for ModernBERT models to ensure attention masks are contiguous. This resolves stride alignment issues encountered in the SDPA backward pass when using torch.compile.
GPT-OSS MoE ParamWrapper Unwrap: Added logic to recursively unwrap PEFT ParamWrapper from Mixture-of-Experts (MoE) layers in GPT-OSS models. This ensures correct access to the hidden_size attribute during both GptOssMLP.forward() and model inference.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

unsloth_zoo/temporary_patches/gpt_oss.py
- Added logic within GptOssMLP.forward to unwrap ParamWrapper from self.experts to correctly access the hidden_size attribute.
- Applied similar ParamWrapper unwrapping logic to decoder_layer.mlp.experts in the main model forward method to ensure proper hidden_size access during inference.
unsloth_zoo/temporary_patches/misc.py
- Introduced patch_modernbert_attention_mask to override ModernBertModel._update_attention_mask. This ensures that global_attention_mask and sliding_window_mask are made contiguous, resolving stride alignment issues for SDPA backward passes with torch.compile.
unsloth_zoo/tokenizer_utils.py
- Added an early return in patch_tokenizer if the input tokenizer is None, preventing errors when VLM processors lack a tokenizer.
- Modified the logic that unwraps an inner tokenizer (tokenizer.tokenizer) to also check if the inner tokenizer is None, returning the original tokenizer in such cases to prevent further issues.

Activity

No specific activity (comments, reviews, or progress updates) has been recorded for this pull request since its creation.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces several important fixes for notebook compatibility. The changes include adding a None guard for the tokenizer in tokenizer_utils.py to handle VLM processors, patching ModernBERT's attention mask in misc.py to ensure contiguity for SDPA compatibility, and unwrapping PEFT's ParamWrapper for MoE experts in gpt_oss.py. The changes are well-implemented and address the described issues. I have one suggestion regarding code duplication in gpt_oss.py to improve maintainability.

gemini-code-assist · 2026-02-07T21:11:02Z

+                # Unwrap ParamWrapper from experts if needed (PEFT LoRA wraps modules)
+                _experts = decoder_layer.mlp.experts
+                for _attr in ("base_layer", "module", "_module"):
+                    while not hasattr(_experts, "hidden_size") and hasattr(_experts, _attr):
+                        _experts = getattr(_experts, _attr)
+                if _experts is not decoder_layer.mlp.experts:
+                    decoder_layer.mlp.experts = _experts


This logic to unwrap the PEFT ParamWrapper is duplicated from GptOssMLP.forward (lines 644-651). To improve maintainability and reduce code duplication, consider extracting this logic into a helper function.

For example, you could define a helper function like this:

def _unwrap_peft_module(module): """Recursively unwraps a module from PEFT wrappers until 'hidden_size' is found.""" if hasattr(module, "hidden_size"): return module _m = module for attr in ("base_layer", "module", "_module"): while not hasattr(_m, "hidden_size") and hasattr(_m, attr): _m = getattr(_m, attr) return _m

Then you could simplify the code in both places to:

decoder_layer.mlp.experts = _unwrap_peft_module(decoder_layer.mlp.experts)

And in GptOssMLP.forward:

self.experts = _unwrap_peft_module(self.experts)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d27d339225

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-07T21:11:37Z

+                    _e = getattr(_e, _attr)
+            if _e is not self.experts:
+                self.experts = _e


Avoid replacing LoRA ParamWrapper with base layer

When the PEFT ParamWrapper lacks hidden_size, this code permanently replaces self.experts with the unwrapped base layer, which bypasses the wrapper’s LoRA logic in subsequent forwards. That silently disables LoRA adapters for gpt_oss models using PEFT, changing model outputs/training despite adapters being loaded. Consider reading hidden_size from a temporary unwrapped view (or caching just that attribute) without overwriting self.experts.

Useful? React with 👍 / 👎.

Probe causal_conv1d CUDA kernels at startup and force the PyTorch slow path when they fail (e.g. sm_100 on B200). Uses identity checks against the original function objects to avoid clobbering vllm's independent Triton-based implementations. Dynamically scans sys.modules instead of hardcoding model module lists, so new models like qwen3_next and mamba_ssm are automatically covered.

…ers-4.57-notebook-compat # Conflicts: # unsloth_zoo/temporary_patches/gpt_oss.py

…FT dispatch fix, FP8Linear device, VLM dataset tokenization, push_to_hub_token and DPO vision mapping patches - tokenizer_utils.py: Use getattr for pad_token_id to handle missing attr - bitsandbytes.py: Guard fix_4bit_weight on packed weight shape - misc.py: Add patch_peft_dispatch_bnb_4bit for compress_statistics AttributeError - misc.py: Add patch_trl_push_to_hub_token to ensure to_dict() includes it - misc.py: Add patch_trl_vision_model_mapping for DPO on TRL 0.22.x - vllm_utils.py: Version-gate FP8Linear device kwarg - dataset_utils.py: Add _maybe_tokenize_dataset for VLM skip_prepare_dataset

…ormers auto module Inject MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES as an alias of MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES into transformers.models.auto.modeling_auto before TRL imports it. This allows TRL 0.22.x's bare import to succeed on transformers 5.0+ without needing to modify installed TRL files.

+        if isinstance(dataset, IterableDataset):
+            _map_kwargs = {"batched": True}
+        return dataset.map(_tokenize_fn, **_map_kwargs)
+    pass


…ict change transformers 5.0.0 changed apply_chat_template(tokenize=True) to default return_dict=True, returning BatchEncoding instead of list[int]. vLLM's safe_apply_chat_template doesn't pass return_dict=False, causing TypeError in _validate_model_input when max(BatchEncoding) yields a string key. Patch wraps the original function to inject return_dict=False when tokenize=True. Version-gated to transformers >= 5.0.0, no-op if vLLM is not installed.

…without quant_state compiler.py: When UNSLOTH_COMPILE_OVERWRITE=0 is set, check if the cached file's transformers version differs from the current one. If so, force a recompile instead of silently using stale compiled cache. bitsandbytes.py: Guard Linear4bit.forward against layers with no quant_state (not quantized) by falling back to regular F.linear. Use local quant_state variable in the matmul_4bit call.

…ersion check compiler.py: Switch logger.warning_once to print for the OVERWRITE=0 version mismatch message. vllm_utils.py: Use Version("transformers") instead of importing the module and reading __version__ manually.

gpt_oss.py: Reset to main since PR #471 (fix_gpt_oss2) handles all MoE fixes. compiler.py: Remove torch_compile/KWARGS_TYPE import hunk (added by #471), keep the OVERWRITE version-mismatch recompile logic which is unique to this PR. Also keep main's data-dependent compile check (.nonzero/.tolist/.item).

+        return global_attention_mask, sliding_window_mask
+
+    ModernBertModel._update_attention_mask = _update_attention_mask_contiguous
+pass


+        from causal_conv1d import causal_conv1d_update
+    except ImportError:
+        return  # Package not installed, transformers already handles this
+    pass


+
+    if causal_conv1d_fn is None:
+        return  # Already nullified
+    pass


+
+    if not torch.cuda.is_available():
+        return
+    pass


+        return  # CUDA kernels work fine
+    except Exception:
+        pass  # Fall through to disable
+    pass


+                mod.causal_conv1d_fn = None
+            if hasattr(mod, "causal_conv1d_update"):
+                mod.causal_conv1d_update = None
+        pass


+            if hasattr(mod, "causal_conv1d_update"):
+                mod.causal_conv1d_update = None
+        pass
+    pass


+        transformers.utils.import_utils.is_causal_conv1d_available = lambda: False
+    except Exception:
+        pass
+    pass


Datta0 and others added 7 commits February 6, 2026 15:10

Working GPT OSS

48aeb06

Working GPT OSS

2eb8ce3

Update gpt_oss.py, make it transformers 4 compatible

27af91c

Update gpt_oss.py, needed if statement

b032d22

undo spacing

b1bd22a

remove logger

46376a0

github-code-quality Bot found potential problems Feb 7, 2026

View reviewed changes

Comment thread unsloth_zoo/temporary_patches/misc.py Fixed

gemini-code-assist Bot reviewed Feb 7, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Feb 7, 2026

View reviewed changes

Datta0 and others added 4 commits February 8, 2026 10:27

dtype cast

4e2db2d

dtype cast

2004c58

fix gpt oss grpo

e22008b

github-code-quality Bot found potential problems Feb 8, 2026

View reviewed changes

danielhanchen added 3 commits February 8, 2026 13:44

Merge remote-tracking branch 'datta0/fix_gpt_oss2' into fix/transform…

03340e5

…ers-4.57-notebook-compat # Conflicts: # unsloth_zoo/temporary_patches/gpt_oss.py

github-code-quality Bot found potential problems Feb 9, 2026

View reviewed changes

Comment thread unsloth_zoo/dataset_utils.py

if isinstance(dataset, IterableDataset):

_map_kwargs = {"batched": True}

return dataset.map(_tokenize_fn, **_map_kwargs)

pass

danielhanchen added 4 commits February 9, 2026 10:51

github-code-quality Bot found potential problems Feb 9, 2026

View reviewed changes

danielhanchen merged commit 6c3d65e into main Feb 9, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tokenizer guard, ModernBERT attention, gpt_oss MoE unwrap#472

Fix tokenizer guard, ModernBERT attention, gpt_oss MoE unwrap#472
danielhanchen merged 18 commits into
mainfrom
fix/transformers-4.57-notebook-compat

danielhanchen commented Feb 7, 2026

Uh oh!

gemini-code-assist Bot commented Feb 7, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Feb 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Feb 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

danielhanchen commented Feb 7, 2026

Summary

Changes

Test Results

Companion PR

Uh oh!

gemini-code-assist Bot commented Feb 7, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants