support q offset w.r.t k/v in flash_attention function #24830

yamingx · 2024-11-11T07:18:44Z

Thanks for open-sourcing the flash_attention kernel! A feature is badly needed to support prefix caching.

When q_seq_len < kv_seq_len, the current implementation left-aligns q with k/v, e.g.:

# Left alignment:
[1, 2, 3, 4]  # q
[1, 2, 3, 4, 5, 6, 7, 8]  # k/v

However, for prefix-cache-aware prefill, we need q to be right-aligned with k/v, like:

# Right alignment:
            [1, 2, 3, 4]  # q
[1, 2, 3, 4, 5, 6, 7, 8]  # k/v

It would be great to add an offset parameter to flash_attention, defaulting to 0 (the current left-aligned behavior). If offset > 0, q is shifted right by offset tokens relative to k/v; if offset < 0, it is shifted left.
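To make the requested semantics concrete, here is a minimal reference sketch in plain JAX (not the Pallas kernel). It assumes the alignment matters through the causal mask, i.e. query row i is treated as sitting at absolute position i + offset. The names causal_mask_with_offset and reference_attention are hypothetical and only for illustration; they are not part of the existing flash_attention API, and only offset >= 0 is covered.

```python
import jax
import jax.numpy as jnp


def causal_mask_with_offset(q_len: int, kv_len: int, offset: int = 0) -> jax.Array:
    """Boolean mask of shape (q_len, kv_len); True where attention is allowed.

    Hypothetical helper: query row i is placed at absolute position
    i + offset, so it may attend to k/v positions j <= i + offset.
    """
    q_pos = jnp.arange(q_len)[:, None] + offset  # (q_len, 1)
    kv_pos = jnp.arange(kv_len)[None, :]         # (1, kv_len)
    return kv_pos <= q_pos


def reference_attention(q, k, v, offset: int = 0):
    """Unfused single-head causal attention with a q offset (assumes offset >= 0).

    q: (q_len, head_dim), k and v: (kv_len, head_dim).
    """
    q_len, head_dim = q.shape
    kv_len = k.shape[0]
    logits = (q @ k.T) / (head_dim ** 0.5)
    mask = causal_mask_with_offset(q_len, kv_len, offset)
    # Use a large finite negative value rather than -inf to avoid NaNs.
    logits = jnp.where(mask, logits, jnp.finfo(logits.dtype).min)
    return jax.nn.softmax(logits, axis=-1) @ v
```

With q_len = 4 and kv_len = 8, offset = 0 reproduces the left alignment above (the first query row sees 1 k/v position, the last sees 4), while offset = kv_len - q_len = 4 reproduces the right alignment (the first query row sees 5 k/v positions, the last sees all 8).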

sharadmv · 2024-11-12T18:18:32Z

Which attention kernel are you referring to? (We have like 3 floating around now).

yamingx · 2024-11-12T18:24:56Z

this one:

jax/jax/experimental/pallas/ops/tpu/flash_attention.py

Line 140 in 3a5ac48

def flash_attention(

yamingx added the enhancement (New feature or request) label · Nov 11, 2024
