Fix bug in Squeeze for getting the value of total_seq_len by Honry · Pull Request #1886 · microsoft/onnxruntime-genai

Honry · 2025-11-21T11:46:39Z

The Squeeze op removes single-dimensional entries from the shape of a tensor. In this node the axes input is set to [0], which eliminates only the first axis and leaves an output shape of [1] when batch_size is 1. This causes a ShapeInference error at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L111-L132 when not running in strict mode.

This PR fixes the issue by:

- Changing the ReduceSum output shape to [batch_size] by adding the keepdims=0 attribute
- Using ReduceMax instead of Squeeze to obtain total_seq_len as a scalar, which also covers scenarios where batch_size > 1
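To make the shape problem concrete, here is a minimal sketch of the Squeeze shape rule in plain Python. The helper `squeeze_shape` is hypothetical and exists only for illustration; it models the rule described above (drop the listed axes, or every size-1 axis when no axes are given) and is not code from this PR.

```python
def squeeze_shape(shape, axes=None):
    """Model the output shape of a Squeeze node for a given input shape."""
    rank = len(shape)
    if axes is None:
        # No axes given: every dimension of size 1 is dropped.
        return [d for d in shape if d != 1]
    # Wrap negative axes into the range [0, rank).
    normalized = {a + rank if a < 0 else a for a in axes}
    for a in normalized:
        if shape[a] != 1:
            raise ValueError(f"cannot squeeze axis {a} of size {shape[a]}")
    return [d for i, d in enumerate(shape) if i not in normalized]


# ReduceSum with keepdims=1 yields [batch_size, 1]; take batch_size = 1.
print(squeeze_shape([1, 1], axes=[0]))  # [1]  (still 1-D, not a scalar)
print(squeeze_shape([1, 1]))            # []   (scalar)
```

With batch_size = 2 the input shape is [2, 1], and squeezing axis 0 is not possible at all, which is why the fix moves away from Squeeze rather than adjusting its axes.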

Squeeze removes single-dimensional entries from the shape of a tensor. In this node axes is set to [0], which eliminates only the first axis and leaves an output shape of [1] when batch_size is 1. This causes a ShapeInference error at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph.cc#L111-L132 when not running in strict mode. This PR just removes the axes input so that all single dimensions are removed from the shape.

Honry · 2025-11-21T11:47:11Z

@qjia7, PTAL, thanks!

tianleiwu · 2025-11-21T18:48:40Z

Since the ReduceSum output shape is [batch_size, 1], Squeeze might not be the right way to get total_seq_len unless we assume that batch_size == 1.

A better way is to make ReduceSum output shape [batch_size] by adding the keepdims=0 attribute. Then use ReduceMax to get the maximum (assuming the longest sequence has no padding), or use Gather to get the first item if we assume that batch_size == 1 (that is, if we only enable graph capture for the single-batch scenario).

If we do not use keepdims=0, seqlens_k will have shape [batch_size, 1], while the expected shape is [batch_size]. That is also incorrect.
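The difference between the two options can be sketched in plain Python. This is an illustration under stated assumptions, not the builder code: `reduce_sum_last_axis` and `shape_of` are hypothetical helpers, and the mask values are made up so that the longest row has no padding.

```python
def reduce_sum_last_axis(mask, keepdims):
    """Sum each row of a 2-D 0/1 mask, mimicking ReduceSum over the last axis."""
    sums = [sum(row) for row in mask]
    return [[s] for s in sums] if keepdims else sums


def shape_of(value):
    """Return the shape of a rectangular nested list (a scalar gives [])."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0]
    return shape


# Hypothetical batch of two sequences; the second one is unpadded.
attention_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]

kept = reduce_sum_last_axis(attention_mask, keepdims=True)      # [[3], [5]]
dropped = reduce_sum_last_axis(attention_mask, keepdims=False)  # [3, 5]

total_via_max = max(dropped)   # ReduceMax-style: longest sequence in the batch
total_via_gather = dropped[0]  # Gather-style: only valid when batch_size == 1

print(shape_of(kept), shape_of(dropped), total_via_max, total_via_gather)
# [2, 1] [2] 5 3
```

Taking the first item under-reports the length as soon as a later row is longer, which is the case the ReduceMax variant is meant to cover.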

qjia7

A better way is to make ReduceSum output shape [batch_size] by adding the keepdims=0 attribute. Then use ReduceMax to get the maximum (assuming the longest sequence has no padding), or use Gather to get the first item if we assume that batch_size == 1 (that is, if we only enable graph capture for the single-batch scenario).

Great idea! I did assume that batch_size was one when adding that code. Your suggestion looks great. @tianleiwu one more question: why expose a scalar total_seq_len when the batch can be larger than one?

        #          attention_mask
        #               |
        #         Cast to int32
        #               |
        #    ReduceSum (keepdims=0)
        #              /    \
        #             /      \
        #           Sub    ReduceMax
        #            |        |
        #       seqlens_k  total_seq_len
        #         (1D)       (scalar)
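The diagram above can be read as the following plain-Python sketch. It is not the graph-builder code; `seqlens_and_total` is a hypothetical name, and the Sub step is assumed to subtract 1 from each per-row length.

```python
def seqlens_and_total(attention_mask):
    """Follow the diagram: Cast -> ReduceSum (keepdims=0) -> Sub / ReduceMax."""
    # Cast to int32: the mask may arrive as bools or another integer type.
    int_mask = [[int(v) for v in row] for row in attention_mask]
    # ReduceSum over the last axis with keepdims=0 gives one length per row.
    lengths = [sum(row) for row in int_mask]
    # Sub: assumed here to subtract 1, giving the 1-D seqlens_k.
    seqlens_k = [n - 1 for n in lengths]
    # ReduceMax over all axes with keepdims=0 gives a scalar.
    total_seq_len = max(lengths)
    return seqlens_k, total_seq_len


print(seqlens_and_total([[1, 1, 1, 1]]))                        # ([3], 4)
print(seqlens_and_total([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]))    # ([2, 4], 5)
```

In both calls seqlens_k stays one-dimensional with one entry per batch row, and total_seq_len is a single integer regardless of batch size.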

Honry · 2025-11-24T02:44:44Z

Thanks @tianleiwu @qjia7, I had the same concern about how to handle batch_size > 1.

@tianleiwu's suggestion is really a great idea! I've addressed it in the new commit, PTAL again, thanks!

qjia7 · 2025-11-24T02:51:18Z

@Honry You may also need to update the ReduceSum in make_attention_mask_standard_reformatting_for_gqa in a similar way to get the correct 1D seqlens_k.

Honry · 2025-11-24T03:01:18Z

@Honry You may also need to update the ReduceSum in make_attention_mask_standard_reformatting_for_gqa in a similar way to get the correct 1D seqlens_k.

@qjia7 thanks! Done, pls. take another look.

kunal-vaishnavi · 2025-11-24T03:24:58Z

When the original logic to obtain seqlens_k and total_seq_len from the attention mask for the GQA op was added, it was assumed that batch_size = 1 since most inference workloads with ORT GenAI are for batch size = 1.

qjia7 previously approved these changes Nov 21, 2025

qjia7 requested a review from kunal-vaishnavi November 21, 2025 12:52

qjia7 reviewed Nov 22, 2025

qjia7 self-requested a review November 22, 2025 03:31

Address comments

60caa6e

Honry dismissed qjia7’s stale review via 60caa6e November 24, 2025 02:36

update the ReduceSum in make_attention_mask_standard_reformatting_for_gqa

2d38382

qjia7 requested a review from tianleiwu November 24, 2025 03:06

kunal-vaishnavi reviewed Nov 24, 2025

Comment thread src/python/py/models/builders/base.py

Honry added 2 commits November 24, 2025 13:03

update ReduceSum for the attention mask subgraph as well

435172c

Simplify make_reduce_sum by using keepdims=False by default

245a0e2

tianleiwu approved these changes Nov 24, 2025

kunal-vaishnavi approved these changes Nov 25, 2025

qjia7 approved these changes Nov 25, 2025

kunal-vaishnavi enabled auto-merge (squash) November 26, 2025 02:07

kunal-vaishnavi merged commit 5492721 into microsoft:main Nov 26, 2025
15 checks passed

dependabot Bot mentioned this pull request Dec 15, 2025

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.2 to 0.11.4 yuniko-software/qwen3-onnx#10

Merged

dependabot Bot mentioned this pull request Feb 16, 2026

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.4 to 0.12.0 yuniko-software/qwen3-onnx#23

Closed

dependabot Bot mentioned this pull request Mar 2, 2026

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.4 to 0.12.1 yuniko-software/qwen3-onnx#27

Open

kunal-vaishnavi merged 5 commits into microsoft:main from Honry:fix-squeeze


4 participants
