Fix tokenizer guard, ModernBERT attention, gpt_oss MoE unwrap#472
Conversation
Fixes notebook failures for transformers 4.57.6 + TRL 0.22-0.27. Tokenizer None guard (tokenizer_utils.py): - Return early from patch_tokenizer if tokenizer is None (some VLM processors like ERNIE VL may have None tokenizer during loading) - Guard inner tokenizer unwrap when processor.tokenizer is None ModernBERT attention mask fix (temporary_patches/misc.py): - Add patch_modernbert_attention_mask() to fix stride alignment issues in SDPA backward pass with torch.compile - The _update_attention_mask uses .expand() which creates non-contiguous strides not aligned to multiples of 4, causing reinterpret_tensor errors in the inductor backward graph - Fix: make masks contiguous before they enter compiled regions gpt_oss ParamWrapper unwrap (temporary_patches/gpt_oss.py): - Unwrap PEFT ParamWrapper from MoE experts before accessing hidden_size attribute in both GptOssMLP.forward() and the model inference forward path - ParamWrapper (from peft.tuners.lora.layer) wraps nn.Parameter via base_layer attribute; check base_layer, module, _module in order Tested with all 125 notebooks: no regressions on TRL 0.22.2 or 0.27.1.
Summary of ChangesHello @danielhanchen, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request focuses on enhancing the stability and compatibility of the system by addressing critical issues related to tokenizer handling, attention mask generation in ModernBERT models, and interaction with PEFT-wrapped MoE layers in GPT-OSS. The changes aim to prevent crashes and ensure correct behavior across various configurations, particularly when integrating with different Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces several important fixes for notebook compatibility. The changes include adding a None guard for the tokenizer in tokenizer_utils.py to handle VLM processors, patching ModernBERT's attention mask in misc.py to ensure contiguity for SDPA compatibility, and unwrapping PEFT's ParamWrapper for MoE experts in gpt_oss.py. The changes are well-implemented and address the described issues. I have one suggestion regarding code duplication in gpt_oss.py to improve maintainability.
| # Unwrap ParamWrapper from experts if needed (PEFT LoRA wraps modules) | ||
| _experts = decoder_layer.mlp.experts | ||
| for _attr in ("base_layer", "module", "_module"): | ||
| while not hasattr(_experts, "hidden_size") and hasattr(_experts, _attr): | ||
| _experts = getattr(_experts, _attr) | ||
| if _experts is not decoder_layer.mlp.experts: | ||
| decoder_layer.mlp.experts = _experts |
There was a problem hiding this comment.
This logic to unwrap the PEFT ParamWrapper is duplicated from GptOssMLP.forward (lines 644-651). To improve maintainability and reduce code duplication, consider extracting this logic into a helper function.
For example, you could define a helper function like this:
def _unwrap_peft_module(module):
"""Recursively unwraps a module from PEFT wrappers until 'hidden_size' is found."""
if hasattr(module, "hidden_size"):
return module
_m = module
for attr in ("base_layer", "module", "_module"):
while not hasattr(_m, "hidden_size") and hasattr(_m, attr):
_m = getattr(_m, attr)
return _mThen you could simplify the code in both places to:
decoder_layer.mlp.experts = _unwrap_peft_module(decoder_layer.mlp.experts)And in GptOssMLP.forward:
self.experts = _unwrap_peft_module(self.experts)There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d27d339225
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _e = getattr(_e, _attr) | ||
| if _e is not self.experts: | ||
| self.experts = _e |
There was a problem hiding this comment.
Avoid replacing LoRA ParamWrapper with base layer
When the PEFT ParamWrapper lacks hidden_size, this code permanently replaces self.experts with the unwrapped base layer, which bypasses the wrapper’s LoRA logic in subsequent forwards. That silently disables LoRA adapters for gpt_oss models using PEFT, changing model outputs/training despite adapters being loaded. Consider reading hidden_size from a temporary unwrapped view (or caching just that attribute) without overwriting self.experts.
Useful? React with 👍 / 👎.
Probe causal_conv1d CUDA kernels at startup and force the PyTorch slow path when they fail (e.g. sm_100 on B200). Uses identity checks against the original function objects to avoid clobbering vllm's independent Triton-based implementations. Dynamically scans sys.modules instead of hardcoding model module lists, so new models like qwen3_next and mamba_ssm are automatically covered.
…ers-4.57-notebook-compat # Conflicts: # unsloth_zoo/temporary_patches/gpt_oss.py
…FT dispatch fix, FP8Linear device, VLM dataset tokenization, push_to_hub_token and DPO vision mapping patches - tokenizer_utils.py: Use getattr for pad_token_id to handle missing attr - bitsandbytes.py: Guard fix_4bit_weight on packed weight shape - misc.py: Add patch_peft_dispatch_bnb_4bit for compress_statistics AttributeError - misc.py: Add patch_trl_push_to_hub_token to ensure to_dict() includes it - misc.py: Add patch_trl_vision_model_mapping for DPO on TRL 0.22.x - vllm_utils.py: Version-gate FP8Linear device kwarg - dataset_utils.py: Add _maybe_tokenize_dataset for VLM skip_prepare_dataset
…ormers auto module Inject MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES as an alias of MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES into transformers.models.auto.modeling_auto before TRL imports it. This allows TRL 0.22.x's bare import to succeed on transformers 5.0+ without needing to modify installed TRL files.
| if isinstance(dataset, IterableDataset): | ||
| _map_kwargs = {"batched": True} | ||
| return dataset.map(_tokenize_fn, **_map_kwargs) | ||
| pass |
…ict change transformers 5.0.0 changed apply_chat_template(tokenize=True) to default return_dict=True, returning BatchEncoding instead of list[int]. vLLM's safe_apply_chat_template doesn't pass return_dict=False, causing TypeError in _validate_model_input when max(BatchEncoding) yields a string key. Patch wraps the original function to inject return_dict=False when tokenize=True. Version-gated to transformers >= 5.0.0, no-op if vLLM is not installed.
…without quant_state compiler.py: When UNSLOTH_COMPILE_OVERWRITE=0 is set, check if the cached file's transformers version differs from the current one. If so, force a recompile instead of silently using stale compiled cache. bitsandbytes.py: Guard Linear4bit.forward against layers with no quant_state (not quantized) by falling back to regular F.linear. Use local quant_state variable in the matmul_4bit call.
…ersion check
compiler.py: Switch logger.warning_once to print for the OVERWRITE=0
version mismatch message.
vllm_utils.py: Use Version("transformers") instead of importing the
module and reading __version__ manually.
gpt_oss.py: Reset to main since PR #471 (fix_gpt_oss2) handles all MoE fixes. compiler.py: Remove torch_compile/KWARGS_TYPE import hunk (added by #471), keep the OVERWRITE version-mismatch recompile logic which is unique to this PR. Also keep main's data-dependent compile check (.nonzero/.tolist/.item).
| return global_attention_mask, sliding_window_mask | ||
|
|
||
| ModernBertModel._update_attention_mask = _update_attention_mask_contiguous | ||
| pass |
| from causal_conv1d import causal_conv1d_update | ||
| except ImportError: | ||
| return # Package not installed, transformers already handles this | ||
| pass |
|
|
||
| if causal_conv1d_fn is None: | ||
| return # Already nullified | ||
| pass |
|
|
||
| if not torch.cuda.is_available(): | ||
| return | ||
| pass |
| return # CUDA kernels work fine | ||
| except Exception: | ||
| pass # Fall through to disable | ||
| pass |
| mod.causal_conv1d_fn = None | ||
| if hasattr(mod, "causal_conv1d_update"): | ||
| mod.causal_conv1d_update = None | ||
| pass |
| if hasattr(mod, "causal_conv1d_update"): | ||
| mod.causal_conv1d_update = None | ||
| pass | ||
| pass |
| transformers.utils.import_utils.is_causal_conv1d_available = lambda: False | ||
| except Exception: | ||
| pass | ||
| pass |
Summary
Fixes notebook failures for transformers 4.57.6 + TRL 0.22-0.27, companion to unslothai/unsloth#3998.
Changes
tokenizer_utils.py): Return early frompatch_tokenizerif tokenizer is None. Guard inner tokenizer unwrap whenprocessor.tokenizeris None. Prevents crashes when VLM processors (like ERNIE VL) have None tokenizer during loading.temporary_patches/misc.py): Addpatch_modernbert_attention_mask()to fix stride alignment issues in SDPA backward pass with torch.compile. The_update_attention_maskuses.expand()which creates non-contiguous strides not aligned to multiples of 4, causingreinterpret_tensorerrors in the inductor backward graph. Fix: make masks contiguous before they enter compiled regions.temporary_patches/gpt_oss.py): Unwrap PEFTParamWrapperfrom MoE experts before accessinghidden_sizein bothGptOssMLP.forward()and model inference forward.ParamWrapper(frompeft.tuners.lora.layer.LoraLayer) wraps modules viabase_layerattribute; checkbase_layer,module,_modulein order.Test Results
Tested with all 125 notebooks on 8x B200 GPUs:
Companion PR