Fix GPT-OSS BlockMask error during inference #3982

Merged
danielhanchen merged 1 commit into main from fix-gpt-oss-flex-attention-blockmask on Feb 5, 2026

@danielhanchen

Member

Summary

  • Exclude GPT-OSS models from using flex_attention in prefer_flex_attn_if_supported()
  • Prevents TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask' during inference/generation

Problem

GPT-OSS models use eager attention during inference because flex attention returns incorrect results (likely due to left padding issues). See the comment in gpt_oss.py:

# Weirdly for inference, flex attention returns gibberish
# Most likely due to left padding

However, when _attn_implementation is set to "flex_attention", transformers creates BlockMask objects. During inference these reach the eager attention path, which applies += to a Tensor with the BlockMask as the right-hand operand and raises:

TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'
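
The failure mode can be illustrated without PyTorch. The sketch below is a minimal stand-in, not the real code path: the `Tensor` and `BlockMask` classes here are hypothetical toy classes (not `torch.Tensor` or the flex attention mask type), and `eager_add_mask` is an assumed name for an eager-style step that adds a mask in place. It only shows why an in-place addition accepts a dense mask but rejects an opaque mask object.

```python
class Tensor:
    """Toy stand-in for a tensor of attention scores (NOT torch.Tensor)."""

    def __init__(self, values):
        self.values = list(values)

    def __iadd__(self, other):
        # Only another toy Tensor of the same length is a valid additive mask.
        if isinstance(other, Tensor) and len(other.values) == len(self.values):
            self.values = [a + b for a, b in zip(self.values, other.values)]
            return self
        # Any other operand is rejected, so Python raises TypeError for `+=`.
        return NotImplemented


class BlockMask:
    """Toy stand-in for the mask object built for flex attention."""


def eager_add_mask(scores, mask):
    # Eager-style masking (assumed shape): add the mask onto the scores in place.
    scores += mask
    return scores


# A dense additive mask works: masked positions become -inf.
masked = eager_add_mask(Tensor([1.5, 2.5]), Tensor([0.0, float("-inf")]))
print(masked.values)

# An opaque mask object does not support in-place addition.
try:
    eager_add_mask(Tensor([1.5, 2.5]), BlockMask())
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

Because the mask type is decided when the attention implementation is chosen, the mismatch is easier to prevent at configuration time than to handle inside the attention function.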

Fix

Check config.model_type and skip setting flex_attention for gpt_oss models, keeping them on the eager path.
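
A rough sketch of this kind of check follows. It is not the body of prefer_flex_attn_if_supported(): the function name `choose_attn_implementation`, the `flex_supported` parameter, the excluded-types set, and the `SimpleNamespace` configs are all assumptions made for illustration. Only the `model_type` test against `"gpt_oss"` and the `_attn_implementation` attribute come from the description above.

```python
from types import SimpleNamespace

# Hypothetical set of model types that must stay off flex attention.
FLEX_ATTENTION_EXCLUDED_MODEL_TYPES = {"gpt_oss"}


def choose_attn_implementation(config, flex_supported=True):
    """Pick an attention implementation for a config (illustrative sketch only)."""
    model_type = getattr(config, "model_type", None)
    if model_type in FLEX_ATTENTION_EXCLUDED_MODEL_TYPES:
        # Leave the config untouched so the model stays on its existing path.
        return getattr(config, "_attn_implementation", "eager")
    if flex_supported:
        config._attn_implementation = "flex_attention"
    return getattr(config, "_attn_implementation", "eager")


# Stand-in configs; "llama" is an arbitrary example of a non-excluded model type.
gpt_oss_config = SimpleNamespace(model_type="gpt_oss", _attn_implementation="eager")
other_config = SimpleNamespace(model_type="llama", _attn_implementation="eager")

gpt_oss_choice = choose_attn_implementation(gpt_oss_config)
other_choice = choose_attn_implementation(other_config)
print(gpt_oss_choice)  # eager
print(other_choice)    # flex_attention
```

Returning early for excluded model types means the setting is never switched to flex_attention for them, which is the condition the description identifies as the trigger for BlockMask creation.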

Testing

  • Verified GPT-OSS now shows _attn_implementation: eager
  • Inference generation works correctly
  • GRPO training completes without errors

GPT-OSS models use eager attention during inference because flex
attention returns incorrect results (likely due to left padding).
However, when _attn_implementation is set to "flex_attention",
transformers creates BlockMask objects which cause a TypeError
when passed to the eager attention path:

  TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'

This fix excludes GPT-OSS from using flex_attention, keeping it on
the eager path to avoid the BlockMask/Tensor type mismatch.
@gemini-code-assist

Contributor

Summary of Changes

Hello @danielhanchen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical runtime error in GPT-OSS models during inference. By preventing these models from being configured with flex_attention, it ensures they correctly utilize eager attention, thereby resolving a TypeError related to BlockMask objects and enabling successful generation. The change is localized to the attention configuration utility, ensuring minimal impact while providing a robust solution for GPT-OSS model stability.

Highlights

  • GPT-OSS Flex Attention Exclusion: GPT-OSS models are now explicitly excluded from using flex_attention in the prefer_flex_attn_if_supported() function.
  • TypeError Resolution: This change resolves a TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask' that occurred during inference/generation for GPT-OSS models.
  • Eager Attention Enforcement: The fix ensures that GPT-OSS models consistently use eager attention during inference, as flex_attention previously led to incorrect results, likely due to left padding issues.

Changelog

  • unsloth/models/_utils.py
    • Added a conditional check to prefer_flex_attn_if_supported to prevent gpt_oss models from having _attn_implementation set to flex_attention.

Activity

  • The author, danielhanchen, has provided a clear problem description, detailing why GPT-OSS models require eager attention and how flex_attention was causing BlockMask type errors.
  • The proposed fix involves checking the config.model_type to specifically exclude gpt_oss models from flex_attention configuration.
  • Testing has been performed, verifying that GPT-OSS models now correctly show _attn_implementation: eager, inference generation works, and GRPO training completes without errors.

@danielhanchen danielhanchen merged commit e309bca into main Feb 5, 2026
4 checks passed
@danielhanchen danielhanchen deleted the fix-gpt-oss-flex-attention-blockmask branch February 5, 2026 12:28

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request effectively resolves a TypeError that occurs with GPT-OSS models during inference. The fix correctly prevents the use of flex_attention for these models by adding a specific check for the gpt_oss model type within the prefer_flex_attn_if_supported function. The change is clear, targeted, and directly addresses the problem described. The implementation is solid.

abiswas-realadvice pushed a commit to abiswas-realadvice/unsloth that referenced this pull request May 14, 2026