[Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window#30887
Conversation
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Code Review
This pull request effectively addresses a critical bug in the 3D Triton attention kernel by correctly masking out V blocks that fall outside the sliding window. This prevents potential NaN corruption in the output, as described in the PR purpose. The implementation uses tl.where for conditional masking, which is an appropriate and efficient approach within Triton kernels. The logic for determining if a V block is within the sliding window appears correct and consistent with how attention scores (S) are handled. This fix significantly improves the correctness and stability of the 3D kernel.
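To illustrate why the V blocks themselves must be masked (and not just the attention scores), here is a minimal NumPy sketch. This is illustrative only, not the kernel code: the real kernel selects by block position with `tl.where`, while this toy version selects wherever the probability is zero. The key point is that a zero attention weight does not neutralize a garbage V value, because 0 × NaN is still NaN.

```python
import numpy as np

# Attention probability for an out-of-window position is zero,
# but if the V block was read from uninitialized memory and holds NaN,
# the weighted sum is still poisoned: 0 * NaN == NaN.
p = np.array([0.5, 0.5, 0.0])      # last position already masked in S
v = np.array([1.0, 2.0, np.nan])   # garbage V value outside the window
print(p @ v)                       # nan -- output corrupted despite the S mask

# The fix selects the V values themselves, analogous to tl.where in Triton:
v_safe = np.where(p > 0.0, v, 0.0)
print(p @ v_safe)                  # 1.5 -- correct
```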
|
While this does fix that curl request for me, I'm still getting a lot of repeated "!!!!" generations (token id 0) using the gsm8k lm_eval dataset on my A5500. So I believe this helps, but does not fix all cases where this kind of thing has been reported with Triton attention. If you want to reproduce what I'm seeing (which matches what was reported in #29539), install lm_eval, then spin up gpt-oss-20b and run the gsm8k dataset test:
1. Serve gpt-oss-20b in vLLM with TRITON_ATTN
2. Run lm_eval on gsm8k
3. Grep the outputs for repeated "!!!!"
I had 56 different samples with this infinite token-id-0 repeated generation when testing this change. For reference, I get zero when testing the change from #30650.
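For anyone else wanting to try this, a rough sketch of the reproduction steps above. The exact flag names and model path are my best guess, so verify them against the vLLM and lm_eval docs before running:

```shell
# Sketch only -- env var and flag names are assumptions, check the docs.
pip install lm_eval

# 1. Serve gpt-oss-20b with the Triton attention backend
VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve openai/gpt-oss-20b &

# 2. Run the gsm8k task against the local OpenAI-compatible server
lm_eval --model local-completions \
        --model_args base_url=http://localhost:8000/v1/completions,model=openai/gpt-oss-20b \
        --tasks gsm8k --output_path results/

# 3. Grep the saved generations for the repeated "!!!!" pattern
grep -rc '!!!!' results/
```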
My output from multiple runs of this yesterday and today:
Thanks @bbrowning - looking into it |
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Oh right, the issue is that the fix also needs to be applied to the 2D kernel. I think @bbrowning mentioned on Slack that he had also observed this for the 2D kernel. While the 2D kernel prunes tiles that fall fully outside the sliding window, a tile can contain multiple KV blocks, some of which may fall outside the window while others are inside. So pruning tiles is not sufficient; we also need to mask out the blocks within those tiles that fall outside the window. I ran your reproducer above and debugged this as follows: I hope this issue is now properly fixed.
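The tile-versus-block distinction can be shown outside Triton. In this NumPy sketch (block and tile sizes are made up for illustration), a tile of four KV blocks straddles the sliding-window boundary: the tile-level check keeps the whole tile, but the first block inside it is fully outside the window and still needs per-block masking.

```python
import numpy as np

BLOCK = 16               # assumed KV block size (illustrative)
BLOCKS_PER_TILE = 4      # assumed tile width in blocks (illustrative)
query_pos = 100
window = 30              # keys in (query_pos - window, query_pos] are visible

# Start/end positions of the KV blocks in one tile covering keys 48..111
block_starts = np.arange(48, 48 + BLOCK * BLOCKS_PER_TILE, BLOCK)
block_ends = block_starts + BLOCK - 1

# Tile-level pruning keeps the tile, because part of it is in the window
tile_in_window = block_ends.max() > query_pos - window
print(tile_in_window)    # True

# Block-level mask: only blocks whose last key is inside the window survive;
# the first block (keys 48..63) is fully outside and must be masked out
block_in_window = block_ends > query_pos - window
print(block_in_window)
```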
I can confirm this latest fix to both the 2D and 3D Triton kernels removes all the infinite generations in both the manual curl test case and the gsm8k eval. Thank you for figuring this out!
@Isotr0py Could you help review this since you've been working on these kernels recently? |
…ll outside sliding window (vllm-project#30887) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Thanks for the fix. Could you confirm whether this patch has been included in the latest ARM image? I pulled the latest image, but I still hit the same issue with gpt-oss-120b: a response that should be around ~300 tokens keeps generating until it reaches the max_tokens limit. This looks similar to what's described in the link. When I enable --enforce-eager, the issue disappears.
It’s hard for me to reliably reproduce this issue because it only occurs when the system is under heavy load with many concurrent requests. On a server that hasn’t handled any prior requests, the output is normal. |
Which attention backend are you using? This fix relates specifically to the Triton backend.
Sorry for the very late reply. I’m not very familiar with vLLM, but my understanding is that it was started with the default settings. I’m not sure whether Triton is being used. Code is here: |
Purpose
There is currently a bug in the Triton attention kernels where we don't correctly mask out V blocks that fall outside the sliding window. On main, we may read garbage blocks (which can even contain NaN values) that corrupt the output. This PR resolves it.
Potentially fixes:
Test Plan
Server:
Client:
On main the above hangs after the first request.
Test Result
With this PR, the test no longer hangs.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.