Fix GPT-OSS BlockMask error during inference by danielhanchen · Pull Request #3982 · unslothai/unsloth

danielhanchen · 2026-02-05T12:27:34Z

Summary

Exclude GPT-OSS models from using flex_attention in prefer_flex_attn_if_supported()
Prevents TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask' during inference/generation

Problem

GPT-OSS models use eager attention during inference because flex attention returns incorrect results (likely due to left padding issues). See the comment in gpt_oss.py:

# Weirdly for inference, flex attention returns gibberish
# Most likely due to left padding

However, when _attn_implementation is set to "flex_attention", transformers creates BlockMask objects. When these are passed to the eager attention path during inference, the mask addition fails with:

TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'
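The mechanics of this error can be reproduced without torch. The sketch below uses stand-in classes named Tensor and BlockMask (hypothetical, not the real torch or flex attention types) and a hypothetical eager_scores helper that applies the mask with an in-place add, which is what the `+=` in the message suggests the eager path does. It is an illustration of the type mismatch under those assumptions, not the actual GPT-OSS attention code.

```python
class Tensor:
    """Stand-in for torch.Tensor (assumption): only supports adding another Tensor."""

    def __init__(self, values):
        self.values = list(values)

    def __iadd__(self, other):
        if not isinstance(other, Tensor):
            # Signal "unsupported operand" so Python raises the generic TypeError.
            return NotImplemented
        self.values = [a + b for a, b in zip(self.values, other.values)]
        return self


class BlockMask:
    """Stand-in for the flex attention BlockMask (assumption): opaque, not additive."""


def eager_scores(scores, mask):
    # Hypothetical eager-style masking: add an additive mask onto the scores in place.
    scores += mask
    return scores


def try_eager(scores, mask):
    """Return the masked values, or the error text if the mask type is unsupported."""
    try:
        return eager_scores(scores, mask).values
    except TypeError as exc:
        return str(exc)


print(try_eager(Tensor([1.0, 2.0]), Tensor([0.0, -1.0])))  # additive tensor mask works
print(try_eager(Tensor([1.0, 2.0]), BlockMask()))          # BlockMask triggers the TypeError
```

An additive tensor mask goes through, while a BlockMask has no addition defined against a Tensor, so Python raises the same TypeError text quoted above.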

Fix

Check config.model_type and skip setting flex_attention for gpt_oss models, keeping them on the eager path.
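A minimal sketch of the kind of guard described here, assuming a config object that exposes model_type and _attn_implementation. The function name prefer_flex_sketch, the flex_available parameter, and the "llama" model type in the usage lines are hypothetical and chosen for illustration; the real prefer_flex_attn_if_supported() in unsloth/models/_utils.py may have a different signature and additional checks.

```python
from types import SimpleNamespace

# Model types that should stay on the eager path (only gpt_oss is named in this PR).
FLEX_EXCLUDED_MODEL_TYPES = {"gpt_oss"}


def prefer_flex_sketch(config, flex_available=True):
    """Hypothetical stand-in: opt a config into flex_attention unless excluded.

    Returns the implementation that was set, or None if the config was left alone.
    """
    if getattr(config, "model_type", None) in FLEX_EXCLUDED_MODEL_TYPES:
        # Skip: leave _attn_implementation untouched so no BlockMask is created.
        return None
    if not flex_available:
        return None
    config._attn_implementation = "flex_attention"
    return "flex_attention"


# SimpleNamespace objects stand in for real model configs (assumption).
gpt_oss_cfg = SimpleNamespace(model_type="gpt_oss", _attn_implementation="eager")
other_cfg = SimpleNamespace(model_type="llama", _attn_implementation="eager")

prefer_flex_sketch(gpt_oss_cfg)
prefer_flex_sketch(other_cfg)
print(gpt_oss_cfg._attn_implementation)  # stays on the eager path
print(other_cfg._attn_implementation)    # switched to flex_attention
```

Keeping the exclusion in a set makes it easy to add other model types later if they turn out to have the same inference problem with flex attention.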

Testing

Verified GPT-OSS now shows _attn_implementation: eager
Inference generation works correctly
GRPO training completes without errors

GPT-OSS models use eager attention during inference because flex attention returns incorrect results (likely due to left padding). However, when _attn_implementation is set to "flex_attention", transformers creates BlockMask objects which cause a TypeError when passed to the eager attention path:

TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'

This fix excludes GPT-OSS from using flex_attention, keeping it on the eager path to avoid the BlockMask/Tensor type mismatch.

gemini-code-assist · 2026-02-05T12:27:47Z

Summary of Changes

This pull request addresses a critical runtime error in GPT-OSS models during inference. By preventing these models from being configured with flex_attention, it ensures they correctly utilize eager attention, thereby resolving a TypeError related to BlockMask objects and enabling successful generation. The change is localized to the attention configuration utility, ensuring minimal impact while providing a robust solution for GPT-OSS model stability.

Highlights

GPT-OSS Flex Attention Exclusion: GPT-OSS models are now explicitly excluded from using flex_attention in the prefer_flex_attn_if_supported() function.
TypeError Resolution: This change resolves a TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask' that occurred during inference/generation for GPT-OSS models.
Eager Attention Enforcement: The fix ensures that GPT-OSS models consistently use eager attention during inference, as flex_attention previously led to incorrect results, likely due to left padding issues.

Changelog

unsloth/models/_utils.py
- Added a conditional check to prefer_flex_attn_if_supported to prevent gpt_oss models from having _attn_implementation set to flex_attention.

Activity

The author, danielhanchen, has provided a clear problem description, detailing why GPT-OSS models require eager attention and how flex_attention was causing BlockMask type errors.
The proposed fix involves checking the config.model_type to specifically exclude gpt_oss models from flex_attention configuration.
Testing has been performed, verifying that GPT-OSS models now correctly show _attn_implementation: eager, inference generation works, and GRPO training completes without errors.

gemini-code-assist

Code Review

This pull request effectively resolves a TypeError that occurs with GPT-OSS models during inference. The fix correctly prevents the use of flex_attention for these models by adding a specific check for the gpt_oss model type within the prefer_flex_attn_if_supported function. The change is clear, targeted, and directly addresses the problem described. The implementation is solid.

danielhanchen merged commit e309bca into main Feb 5, 2026
4 checks passed

danielhanchen deleted the fix-gpt-oss-flex-attention-blockmask branch February 5, 2026 12:28

gemini-code-assist Bot reviewed Feb 5, 2026
