
1.24.0 release cherry-pick round 2#27122

Merged
tianleiwu merged 2 commits into rel-1.24.0 from tlwu/rel_1.24_cherry_pick_round2
Jan 23, 2026

Conversation

@tianleiwu

No description provided.

Fix [#27079](#27079): Qwen3 model quality regression on the CUDA backend.

The parity issue was caused by **buffer pointer misconfiguration** in the GQA (Group Query Attention) QKV preprocessing pipeline. The original implementation used multiple separate kernels to:

1. Unpack the packed QKV tensor
2. Apply RoPE (Rotary Position Embedding) to Q and K
3. Append K/V to the cache

This multi-kernel approach created opportunities for misconfiguration:

- Buffers were allocated but not properly used
- Pointers could reference memory that was not yet allocated or initialized
- Buffer-sharing logic was fragmented across different code paths

Consolidate QKV preprocessing into a **single fused kernel** (`UnpackRoPEAppend`) that performs all operations in one pass:

1. **Unified kernel design**: a single kernel handles unpacking, RoPE application, and cache-append operations
2. **Simplified buffer management**: the new `PrepareQKV` function manages buffer allocation in one place and ensures proper initialization
3. **Explicit past-to-present cache copy**: when `past_present_share_buffer` is false, the past KV cache is explicitly copied to the present buffer before new tokens are appended
4. **Zero-initialization for non-shared buffers**: present KV buffers are cleared when not shared with past, ensuring deterministic output
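Taken together, the steps above amount to a single pass over the packed QKV tensor. Below is a hedged NumPy sketch of that flow — a reference model only, not the CUDA kernel; the BSNH cache layout, argument names, and the non-interleaved RoPE convention are illustrative assumptions:

```python
import numpy as np

def rope_half(x, cos, sin):
    # Non-interleaved ("half-rotated") RoPE over the last dimension.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def unpack_rope_append(packed, num_heads, kv_heads, head_size, cos, sin,
                       past_k, past_v, past_len, share_buffer):
    """One-pass reference: unpack packed QKV, RoPE Q/K, append K/V to cache."""
    b, s, _ = packed.shape
    qkv = packed.reshape(b, s, num_heads + 2 * kv_heads, head_size)
    q, k, v = np.split(qkv, [num_heads, num_heads + kv_heads], axis=2)
    q, k = rope_half(q, cos, sin), rope_half(k, cos, sin)
    if share_buffer:
        # past_present_share_buffer: append directly into the shared cache.
        present_k, present_v = past_k, past_v
    else:
        # Zero-init the non-shared present buffers, then copy past explicitly.
        present_k, present_v = np.zeros_like(past_k), np.zeros_like(past_v)
        present_k[:, :past_len] = past_k[:, :past_len]
        present_v[:, :past_len] = past_v[:, :past_len]
    present_k[:, past_len:past_len + s] = k
    present_v[:, past_len:past_len + s] = v
    return q, present_k, present_v
```

With `share_buffer=False`, the past-to-present copy and the zero-initialization are both explicit, which is the determinism fix described in points 3 and 4 above.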
| File | Changes |
|------|---------|
| `group_query_attention_qkv.cuh` | New fused `UnpackRoPEAppend` kernel with shared-memory optimization for non-interleaved RoPE |
| `group_query_attention_impl.cu` | New `PrepareQKV` helper function that orchestrates buffer setup and kernel launch |
| `group_query_attention.cc` | Simplified operator logic by delegating QKV preparation to the unified helper |
| `test_gqa.py` | Enhanced test coverage for various QKV configurations |
Benefits:

- **Reduced kernel launches**: from 4-5 separate kernel calls to a single fused kernel
- **Better memory safety**: all buffer pointers are validated in a single location
- **Improved RoPE handling**: uses shared memory for efficient non-interleaved RoPE computation
- **Deterministic output**: explicit buffer initialization ensures consistent results across runs
- **Compatible with quantized KV cache**: the new preprocessing kernel design supports future quantization work

Testing:

- All existing GQA unit tests pass
- Verified that the Qwen3 model no longer produces gibberish output
- Tested fp16 and bf16 with various head configurations
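The "shared memory for non-interleaved RoPE" note follows from how the two common RoPE layouts pair elements: the interleaved form rotates adjacent elements (2i, 2i+1), while the non-interleaved (half-rotated) form pairs element i with element i + d/2, forcing reads half a head apart — which a CUDA kernel amortizes by staging the head in shared memory. A NumPy illustration of the two conventions (standard RoPE math, not the kernel's code):

```python
import numpy as np

def rope_interleaved(x, cos, sin):
    # Pairs adjacent elements (2i, 2i+1): contiguous access per pair.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_half_rotated(x, cos, sin):
    # Pairs element i with i + d/2: each rotation reads half a head apart,
    # the access pattern the fused kernel serves from shared memory.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

The two variants compute the same per-pair rotations, just at different element positions: permuting the input as `[0::2, 1::2]` turns one into the other, and both preserve vector norms.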
titaiwangms previously approved these changes Jan 23, 2026

@titaiwangms left a comment:

Thanks!

### Description

This change adds the PCIe bus_id to the properties detected during Linux device discovery.

This property is used to enable device discovery on Linux for the
TRT-RTX execution provider.
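
On Linux, a PCI device's bus_id is readable from sysfs: each entry under `/sys/bus/pci/devices` is named by its domain:bus:device.function address, with per-device `class` and `vendor` attribute files. A minimal sketch of that discovery idea — a hypothetical helper for illustration, not the actual ONNX Runtime device-discovery code:

```python
from pathlib import Path

def discover_pci_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """Enumerate PCI display-class devices, recording the sysfs entry
    name (domain:bus:device.function) as the bus_id."""
    devices = []
    for dev in sorted(Path(sysfs_root).glob("*")):
        try:
            pci_class = int((dev / "class").read_text(), 16)
            vendor = int((dev / "vendor").read_text(), 16)
        except (OSError, ValueError):
            continue  # not a readable PCI device entry
        if (pci_class >> 16) == 0x03:  # base class 0x03 = display controller
            devices.append({"bus_id": dev.name, "vendor_id": vendor})
    return devices
```

The bus_id recovered this way can then be matched against what the execution provider reports for each GPU.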

### Motivation and Context

I want to use device discovery for the TRT-EP on Linux as well.


These changes have already been tested with the newly added inference samples in microsoft/onnxruntime-inference-examples#529.

@gedoensmax for visibility
@tianleiwu tianleiwu merged commit d7dffa0 into rel-1.24.0 Jan 23, 2026
72 of 76 checks passed
@tianleiwu tianleiwu deleted the tlwu/rel_1.24_cherry_pick_round2 branch January 23, 2026 19:15