
1.24.0 release cherry-pick round 2#27122

Merged
tianleiwu merged 2 commits into rel-1.24.0 from tlwu/rel_1.24_cherry_pick_round2
Jan 23, 2026

Conversation

@tianleiwu

No description provided.

Fix [#27079](#27079): Qwen3 model quality regression on the CUDA backend.

The parity issue was caused by **buffer pointer misconfiguration** in the GQA (Group Query Attention) QKV preprocessing pipeline. The original implementation used multiple separate kernels to:

1. Unpack the packed QKV tensor
2. Apply RoPE (Rotary Position Embedding) to Q and K
3. Append K/V to the cache

This multi-kernel approach created opportunities for misconfiguration:

- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or initialized
- Buffer-sharing logic was fragmented across different code paths

Consolidate QKV preprocessing into a **single fused kernel** (`UnpackRoPEAppend`) that performs all operations in one pass:

1. **Unified kernel design**: a single kernel handles unpacking, RoPE application, and cache-append operations
2. **Simplified buffer management**: the new `PrepareQKV` function manages buffer allocation in one place and ensures proper initialization
3. **Explicit past-to-present cache copy**: when `past_present_share_buffer` is false, the past KV cache is explicitly copied to the present buffer before new tokens are appended
4. **Zero-initialization for non-shared buffers**: present KV buffers are cleared when not shared with past, ensuring deterministic output
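Taken together, the steps above amount to a single pass over the packed QKV tensor. Below is a hedged NumPy sketch of that flow — a reference model only, not the CUDA kernel; the BSNH cache layout, argument names, and the non-interleaved RoPE convention are illustrative assumptions:

```python
import numpy as np

def rope_half(x, cos, sin):
    # Non-interleaved ("half-rotated") RoPE over the last dimension.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def unpack_rope_append(packed, num_heads, kv_heads, head_size, cos, sin,
                       past_k, past_v, past_len, share_buffer):
    """One-pass reference: unpack packed QKV, RoPE Q/K, append K/V to cache."""
    b, s, _ = packed.shape
    qkv = packed.reshape(b, s, num_heads + 2 * kv_heads, head_size)
    q, k, v = np.split(qkv, [num_heads, num_heads + kv_heads], axis=2)
    q, k = rope_half(q, cos, sin), rope_half(k, cos, sin)
    if share_buffer:
        # past_present_share_buffer: append directly into the shared cache.
        present_k, present_v = past_k, past_v
    else:
        # Zero-init the non-shared present buffers, then copy past explicitly.
        present_k, present_v = np.zeros_like(past_k), np.zeros_like(past_v)
        present_k[:, :past_len] = past_k[:, :past_len]
        present_v[:, :past_len] = past_v[:, :past_len]
    present_k[:, past_len:past_len + s] = k
    present_v[:, past_len:past_len + s] = v
    return q, present_k, present_v
```

With `share_buffer=False`, the past-to-present copy and the zero-initialization are both explicit, which is the determinism fix described in points 3 and 4 above.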
| File | Changes |
|------|---------|
| `group_query_attention_qkv.cuh` | New fused `UnpackRoPEAppend` kernel with shared-memory optimization for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating QKV preparation to the unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations |
Benefits:

- **Reduced kernel launches**: from 4-5 separate kernel calls to a single fused kernel
- **Better memory safety**: all buffer pointers are validated in a single location
- **Improved RoPE handling**: uses shared memory for efficient non-interleaved RoPE computation
- **Deterministic output**: explicit buffer initialization ensures consistent results across runs
- **Compatible with quantized KV cache**: the new preprocessing kernel design supports future quantization work

Testing:

- All existing GQA unit tests pass
- Verified that the Qwen3 model no longer produces gibberish output
- Tested fp16 and bf16 with various head configurations
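The "shared memory for non-interleaved RoPE" note follows from how the two common RoPE layouts pair elements: the interleaved form rotates adjacent elements (2i, 2i+1), while the non-interleaved (half-rotated) form pairs element i with element i + d/2, forcing reads half a head apart — which a CUDA kernel amortizes by staging the head in shared memory. A NumPy illustration of the two conventions (standard RoPE math, not the kernel's code):

```python
import numpy as np

def rope_interleaved(x, cos, sin):
    # Pairs adjacent elements (2i, 2i+1): contiguous access per pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_half_rotated(x, cos, sin):
    # Pairs element i with i + d/2: each rotation reads half a head apart,
    # the access pattern the fused kernel serves from shared memory.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The two variants compute the same per-pair rotations, just at different element positions: permuting the input as `[0::2, 1::2]` turns one into the other, and both preserve vector norms.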
titaiwangms previously approved these changes Jan 23, 2026

@titaiwangms left a comment:

Thanks!

### Description

This change adds the PCIe bus_id to the properties detected during Linux device discovery.

This property is used to enable device discovery on Linux for the
TRT-RTX execution provider.
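
On Linux, a PCI device's bus_id is readable from sysfs: each entry under `/sys/bus/pci/devices` is named by its domain:bus:device.function address, with per-device `class` and `vendor` attribute files. A minimal sketch of that discovery idea — a hypothetical helper for illustration, not the actual ONNX Runtime device-discovery code:

```python
from pathlib import Path

def discover_pci_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Enumerate PCI display-class devices, recording the sysfs entry
    name (domain:bus:device.function) as the bus_id."""
    devices = []
    for dev in sorted(Path(sysfs_root).glob("*")):
        try:
            pci_class = int((dev / "class").read_text(), 16)
            vendor = int((dev / "vendor").read_text(), 16)
        except (OSError, ValueError):
            continue  # not a readable PCI device entry
        if (pci_class >> 16) == 0x03:  # base class 0x03 = display controller
            devices.append({"bus_id": dev.name, "vendor_id": vendor})
    return devices
```

The bus_id recovered this way can then be matched against what the execution provider reports for each GPU.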

### Motivation and Context

I want to use device discovery for the TRT-EP on Linux as well.


These changes have already been tested with the newly added inference samples in microsoft/onnxruntime-inference-examples#529.

@gedoensmax for visibility
@tianleiwu tianleiwu merged commit d7dffa0 into rel-1.24.0 Jan 23, 2026
72 of 76 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel_1.24_cherry_pick_round2 branch January 23, 2026 19:15