
[Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window#30887

Merged
Isotr0py merged 4 commits into vllm-project:main from tdoublep:fix-3d-kernel-swa
Dec 19, 2025

Conversation

@tdoublep
Member

@tdoublep tdoublep commented Dec 17, 2025

Purpose

There is currently a bug in the Triton attention kernels where we don't correctly mask out V blocks that fall outside the sliding window. On main, we can read garbage blocks (which may even contain NaN values) that corrupt the output. This PR resolves it.
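As a sketch of the masking condition (a simplified host-side model with illustrative numbers, not the Triton kernel code itself), a fixed-size KV block must be kept only if it holds at least one key position inside the causal sliding window of the current query:

```python
# Hedged sketch: which fixed-size KV blocks overlap the causal sliding
# window of a query position. Illustrative only; the real kernel computes
# this per-block inside Triton.
def blocks_in_window(q_pos: int, window: int, block_size: int, num_blocks: int):
    """Return the set of block indices holding at least one key position k
    with q_pos - window < k <= q_pos (causal sliding-window attention)."""
    lo = max(0, q_pos - window + 1)   # oldest key still inside the window
    hi = q_pos                        # newest key (causal bound)
    inside = set()
    for b in range(num_blocks):
        start, end = b * block_size, (b + 1) * block_size - 1
        if end >= lo and start <= hi:  # block overlaps [lo, hi]
            inside.add(b)
    return inside

# Example: query at position 100, window 16, block size 32.
# Only blocks 2 and 3 contain keys in [85, 100]; blocks 0 and 1 hold
# stale positions whose V values must be masked out, not merely down-weighted.
print(sorted(blocks_in_window(100, 16, 32, 4)))  # → [2, 3]
```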

Potentially fix:

Test Plan

Server:

VLLM_ATTENTION_BACKEND=TRITON_ATTN vllm serve openai/gpt-oss-20b   --tool-call-parser openai   --enable-auto-tool-choice   --tensor-parallel-size 2 --max-num-seqs 1

Client:

while true; do curl -X POST http://localhost:8000/v1/chat/completions  -H "Content-Type: application/json"   -d '{
    "model": "openai/gpt-oss-20b",
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "Be a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hi"
      },
      {
        "role": "assistant",
        "content": "How can I help you?"
      },
      {
        "role": "user",
        "content": "Do you like Monty Python?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "CHANGE-NAME-BEFORE-SENDING",
          "description": "Use this tool if you need to extract information from a website.",
          "parameters": {
            "type": "object",
            "properties": {
              "url": {
                "type": "string",
                "description": "The URL to search or extract information from."
              }
            },
            "required": ["url"]
          }
        }
      }
    ]
  }'; done

On main the above hangs after the first request.

Test Result

After the PR, the test no longer hangs.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a critical bug in the 3D Triton attention kernel by correctly masking out V blocks that fall outside the sliding window. This prevents potential NaN corruption in the output, as described in the PR purpose. The implementation uses tl.where for conditional masking, which is an appropriate and efficient approach within Triton kernels. The logic for determining if a V block is within the sliding window appears correct and consistent with how attention scores (S) are handled. This fix significantly improves the correctness and stability of the 3D kernel.
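The review above notes why masking V itself matters, not just the scores S. A hedged NumPy illustration (an assumed simplification, not the kernel code): masking out-of-window scores to -inf gives those positions softmax weight 0, but if the corresponding V rows contain NaN, 0 * NaN is still NaN, so the garbage leaks into the output unless V is also masked:

```python
import numpy as np

# Hedged illustration of why masking attention scores alone is insufficient
# when out-of-window V blocks hold garbage: softmax gives those positions
# weight 0, but 0 * NaN == NaN.
def attend(s, v, in_window, mask_v):
    s = np.where(in_window, s, -np.inf)           # mask scores outside the window
    p = np.exp(s - s.max())
    p = p / p.sum()
    if mask_v:
        v = np.where(in_window[:, None], v, 0.0)  # also zero the garbage V rows
    return p @ v

s = np.array([0.1, 0.2, 0.3, 0.4])
v = np.ones((4, 2))
v[0] = np.nan                      # garbage block outside the sliding window
in_window = np.array([False, True, True, True])

print(attend(s, v, in_window, mask_v=False))  # NaNs leak into the output
print(attend(s, v, in_window, mask_v=True))   # finite output
```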

@bbrowning
Contributor

While this does fix that curl request for me, I'm still getting a lot of repeated "!!!!" generations (token id 0) using the gsm8k lm_eval dataset on my A5500. So I believe this helps, but does not fix all cases where this kind of thing has been reported on Triton attention.

If you want to reproduce what I'm seeing (which copies what was reported in #29539), install lm_eval, then spin up gpt-oss-20b and run the gsm8k dataset test:

Serve gpt-oss-20b in vLLM with TRITON_ATTN

VLLM_ATTENTION_BACKEND=TRITON_ATTN \
vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 2 \
  --max-num-seqs 16

Run lm_eval
From your locally cloned lm-evaluation-harness directory:

python -m lm_eval \
  --model local-completions \
  --model_args model=openai/gpt-oss-20b,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k \
  --log_samples --output_path /tmp/lm_eval/

Grep for repeated !!! in your generated samples, i.e. in my case:

grep \!\!\! /tmp/lm_eval/samples_gsm8k_2025-12-16T13-48-03.250538.jsonl

I had 56 different samples with this infinite token-id-0 repeated generation when testing this change. For reference, I get zero when testing the change from #30650.

@bbrowning
Contributor

My output from multiple runs of this yesterday and today:

$ grep -c '!!!' samples_*.jsonl
samples_gsm8k_2025-12-16T13-48-03.250538.jsonl:96  # baseline from main, 96 problems
samples_gsm8k_2025-12-16T13-59-31.309081.jsonl:0   # PR 30650 applied , 0 problems
samples_gsm8k_2025-12-17T21-29-14.661304.jsonl:56  # PR 30887 applied, 56 problems
samples_gsm8k_2025-12-17T21-45-33.466879.jsonl:0   # PR 30887 and 30650 applied, 0 problems

@tdoublep
Member Author

Thanks @bbrowning - looking into it

@tdoublep
Member Author

tdoublep commented Dec 18, 2025

Oh right, the issue is that the fix also needs to be applied to the 2D kernel. I think @bbrowning mentioned on Slack that he had also observed this for the 2D kernel.

While the 2D kernel prunes tiles that fall fully outside the sliding window, a tile can contain multiple KV blocks, some of which fall outside the window and some of which are inside. So pruning tiles is not sufficient; we also need to mask out the blocks within those tiles that fall outside the window.
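The tile-pruning gap described above can be sketched numerically (a hedged toy model with illustrative block and tile sizes, not the kernel's actual constants): keeping every tile that overlaps the window still pulls in blocks outside of it.

```python
# Hedged sketch: tile-level pruning vs per-block masking for a causal
# sliding window. Illustrative only; not the kernel's real tiling scheme.
def kept_blocks_tile_pruning(q_pos, window, block_size, blocks_per_tile, num_blocks):
    lo = max(0, q_pos - window + 1)
    tile_size = block_size * blocks_per_tile
    kept = []
    for b in range(num_blocks):
        tile = b // blocks_per_tile
        t_start, t_end = tile * tile_size, (tile + 1) * tile_size - 1
        if t_end >= lo and t_start <= q_pos:   # tile overlaps the window
            kept.append(b)                     # every block in the tile survives
    return kept

def blocks_truly_in_window(q_pos, window, block_size, num_blocks):
    lo = max(0, q_pos - window + 1)
    return [b for b in range(num_blocks)
            if (b + 1) * block_size - 1 >= lo and b * block_size <= q_pos]

# Query at position 100, window 16, block size 16, 4 blocks per tile:
# tile pruning keeps blocks 4..7, but only blocks 5 and 6 hold in-window keys,
# so blocks 4 and 7 still need per-block masking.
print(kept_blocks_tile_pruning(100, 16, 16, 4, 8))  # → [4, 5, 6, 7]
print(blocks_truly_in_window(100, 16, 16, 8))       # → [5, 6]
```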

I ran your reproducer above and debugged this as follows:

/tmp/lm_eval/openai__gpt-oss-20b/samples_gsm8k_2025-12-18T05-36-11.918323.jsonl:68 # 3D fix only 
/tmp/lm_eval/openai__gpt-oss-20b/samples_gsm8k_2025-12-18T05-41-49.428051.jsonl:68 # force 2D kernel (we see that issue persists)
/tmp/lm_eval/openai__gpt-oss-20b/samples_gsm8k_2025-12-18T05-46-40.558049.jsonl:0 # force 2D kernel with additional fix applied
/tmp/lm_eval/openai__gpt-oss-20b/samples_gsm8k_2025-12-18T05-52-47.077647.jsonl:0 # do not force 2D (both 2D and 3D are fixed, current state of PR) 

I hope this issue is now properly fixed.

@tdoublep tdoublep changed the title [Bugfix] [Kernel] 3D Triton kernel: mask out V blocks that fall outside sliding window [Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window Dec 18, 2025
@bbrowning
Contributor

I can confirm this latest fix to both the 2D and 3D triton kernels removes all the infinite generations in both the manual curl test case and the gsm8k eval. Thank you for figuring this out!

@tdoublep tdoublep added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 18, 2025
@tdoublep tdoublep requested a review from Isotr0py December 19, 2025 08:14
@tdoublep
Member Author

@Isotr0py Could you help review this since you've been working on these kernels recently?

Member

@Isotr0py Isotr0py left a comment


LGTM

@Isotr0py Isotr0py merged commit b5545d9 into vllm-project:main Dec 19, 2025
46 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
…ll outside sliding window (vllm-project#30887)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
…ll outside sliding window (vllm-project#30887)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
@dcmaddix dcmaddix mentioned this pull request Jan 6, 2026
5 tasks
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…ll outside sliding window (vllm-project#30887)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…ll outside sliding window (vllm-project#30887)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@BearBiscuit05

Thanks for the fix. Could you confirm whether this patch has been included in the latest ARM image? I pulled the latest image, but I still hit the same issue with gpt-oss-120b: a response that should be around ~300 tokens keeps generating until it reaches the max_tokens limit. This looks similar to what's described in the link; when I enable --enforce-eager, the issue disappears.

@BearBiscuit05

It’s hard for me to reliably reproduce this issue because it only occurs when the system is under heavy load with many concurrent requests. On a server that hasn’t handled any prior requests, the output is normal.

@tdoublep
Member Author

tdoublep commented Mar 4, 2026

Which attention backend are you using? This fix relates specifically to the Triton backend.

@BearBiscuit05

Sorry for the very late reply. I’m not very familiar with vLLM, but my understanding is that it was started with the default settings. I’m not sure whether Triton is being used. Code is here:
vllm serve /path/gpt-oss-120b \
  --served-model-name gpt-oss-120b \
  --host "::" \
  --port ${PORT0} \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --enforce-eager
