Add Paged Attention Op for CUDA SM80 support #24595
Merged
aciddelgado merged 16 commits into main on Jun 12, 2025
Conversation
tianleiwu reviewed May 1, 2025
tianleiwu reviewed Jun 11, 2025
tianleiwu (Contributor) left a comment:
What design change would be needed if we want to support FP8 or FP4 paged attention in the future?
Author (Contributor) replied:
A new kernel would be necessary.
tianleiwu approved these changes Jun 12, 2025
ankus-qti pushed a commit to CodeLinaro/onnxruntime that referenced this pull request on Nov 25, 2025
tianleiwu pushed a commit that referenced this pull request on Apr 28, 2026
…ttention (#28200)

### Description

Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA PagedAttention op, enabling the operator on **sm<80 (Turing / Volta / Pascal) with fp16** for the first time. On sm>=80 the default FlashAttention path is unchanged; MEA is reachable via `ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider option for debugging and perf comparison.

| Environment | Before | After |
|---|:---:|:---:|
| sm<80 + fp16 | ❌ error | ✅ MEA |
| sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) |
| sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) |
| sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` / `sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA |

### Motivation and Context

The original PagedAttention PR (#24595) landed with the title "CUDA SM80 support" — the op errors out immediately whenever FlashAttention isn't available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review, @tianleiwu flagged that the interface was too FlashAttention-specific (*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado agreed the FA-specific dependencies could be lifted at the kernel level.

This PR closes that gap for sm<80 fp16 by mirroring the pattern established in #20012 ("Packed QKV and Rotary Embedding Support for sm<80 GQA"). The same CUTLASS memory-efficient attention backend that covers GQA's sm<80 path now covers PagedAttention.

Related work:

- #20012 — direct pattern template (sm<80 GQA MEA fallback)
- #24595 — original PagedAttention PR
- #27516 — MS canonical FA → MEA → Unfused cascade ordering
- #27880 — ONNX Attention CUDA fallback coverage gaps
- #27992 — MEA decode + unfused softcap work (same flavor)

### Implementation

**Dispatch cascade** in `paged_attention.cc`: FlashAttention is preferred; fall back to MemoryEfficientAttention via `has_memory_efficient_attention(sm, is_half, is_bf16, head_size, head_size)`. No custom head-size or dtype bounds are hardcoded — MEA's own helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 == 0`. This keeps us forward-compatible with any future expansion of MEA's supported range.

**MEA path** (`UnfusedAttention<T>`):

1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV` (hoisted to `paged_attention.cc` so both FA and MEA paths consume a pre-populated buffer — single-producer refactor), rotary, packed-QKV unpack, `ReshapeAndCache`.
2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads, head_size]` buffer, folding in GQA head expansion (so downstream MEA sees `num_heads` uniformly).
3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via `seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr = cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No padding is required; the layout matches the kernel's expected `[total_tokens, num_heads, head_size]` with BSNH strides.

**Scratch allocation**: the MEA path D->H syncs `cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain `total_kv_tokens` on the host for tight `gathered_key` / `gathered_value` / `fmha_buffer` allocation. This adds one `cudaStreamSynchronize` per forward call — acceptable for a compatibility fallback (FA remains the hot path on supported hardware). Over-allocation (the no-sync alternative) would consume `B × max_num_blocks_per_seq × block_size × num_heads × head_size × 2 × sizeof(T)`, which reaches GB-scale for realistic GQA models and was rejected.
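For a sense of scale, here is a back-of-envelope version of that rejected over-allocation in Python; all shape values below are illustrative assumptions for a plausible GQA configuration, not figures from the change.

```python
# Back-of-envelope for the rejected no-sync over-allocation.
# All shape values are illustrative assumptions, not measurements.
batch_size = 32
max_num_blocks_per_seq = 512   # e.g. 8K max context with block_size = 16
block_size = 16
num_heads = 32                 # KV heads after GQA expansion
head_size = 128
dtype_bytes = 2                # sizeof(T) for fp16 / bf16

# B x max_num_blocks_per_seq x block_size x num_heads x head_size x 2 x sizeof(T)
gathered_kv_bytes = (batch_size * max_num_blocks_per_seq * block_size
                     * num_heads * head_size * 2 * dtype_bytes)
print(f"{gathered_kv_bytes / 2**30:.1f} GiB")  # ~4.0 GiB for these numbers
```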
`fmha_buffer` is sized with `sizeof(float)` (matching the GQA EfficientAttention pattern at `group_query_attention.cc:482`) because MEA's output accumulator is fp32 regardless of input dtype.

### Testing

A new `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs the existing parity matrix (rotary on/off, rotary_interleaved on/off, packed-QKV on/off, local window on/off, softcap 0/50, varied head sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider option set to `EFFICIENT_ATTENTION` (=2, from the `AttentionBackend` enum). Using a per-session provider option instead of an env var means both FA and MEA test classes coexist in the same pytest process — each InferenceSession creates its own CUDA EP with its own `attention_kernel_options_`. The existing `TestPagedAttention` class is skipped wholesale on sm<80 by its `has_flash_attention()` gate, so without the new MEA class the fallback path would have no CI coverage.

**Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3):

```
TestPagedAttention:    24/24 passed (~60s)  # FA baseline — no regression
TestPagedAttentionMEA: 24/24 passed (~59s)  # new MEA path
```

Tolerance: `rtol = atol = 5e-3` against the same torch reference used by the FA parity test. All combinations match.

**sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal hardware, so real-SM coverage relies on MS CI. The code path exercised on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on sm<80; only the underlying CUTLASS kernel (`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and those are upstream and unmodified by this change.

**Build note**: built with `--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20 define was needed because the initial configure resolved `CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval` usage fails to compile. Unrelated to this change.
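As a concrete illustration of the per-session override described under Testing, a minimal sketch of forcing the MEA path from the ONNX Runtime Python API might look like the following; the model path is a placeholder and the value 2 comes from the description above.

```python
# Minimal sketch: force the MEA path for one session via the CUDA EP's
# sdpa_kernel provider option (2 = EFFICIENT_ATTENTION, per the description
# above). The model path is a placeholder, not a file from this PR.
import onnxruntime as ort

EFFICIENT_ATTENTION = 2

sess = ort.InferenceSession(
    "paged_attention_model.onnx",  # placeholder
    providers=[("CUDAExecutionProvider", {"sdpa_kernel": str(EFFICIENT_ATTENTION)})],
)

# Other sessions in the same process keep the default FlashAttention path,
# since the option is scoped to this session's CUDA EP instance rather than
# a process-wide environment variable.
```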
Description
Adds the Paged Attention Op, which enables use of a Paged KV Cache. Inputs to this op are unpadded (packed / varlen), so Cumulative Sequence Lengths are a required input.
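As a rough illustration of that packed (varlen) layout, a minimal sketch follows; the names and shapes are assumptions for the example, not the op's exact schema.

```python
# Illustrative only: how cumulative sequence lengths describe packed (varlen)
# input. Names and shapes are assumptions, not the op's schema.
import numpy as np

seq_lens = np.array([3, 5, 2], dtype=np.int32)             # tokens per sequence
cum_seqlens = np.concatenate(([0], np.cumsum(seq_lens)))    # [0, 3, 8, 10]

# A packed query tensor then has shape [total_tokens, num_heads, head_size];
# sequence i occupies rows cum_seqlens[i] : cum_seqlens[i + 1].
total_tokens = int(cum_seqlens[-1])                          # 10
```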
Motivation and Context
Adding this op to ONNX Runtime is necessary to allow the GenAI team to enable a continuous batching server API.