
optimize paged attention on triton3 #2553

Merged (12 commits) on Oct 18, 2024

Conversation

@grimoire (Collaborator) commented Oct 8, 2024

Triton 3 has moved the location of the CUDA fast-math functions.
This PR supports fast expf in paged attention with Triton 3.0.

Note

Non-CUDA backends might not work.

The fill-KV kernel and the attention kernel have also been updated so that the KV layout can be changed in the future.
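For context, here is a minimal sketch of the kind of version dispatch this change implies. The module paths are assumptions (Triton 2.x exposing the wrapper as tl.math.fast_expf, Triton 3.x moving the CUDA libdevice wrappers under tl.extra.cuda.libdevice), not code taken from this PR:

# Hedged sketch: pick the fast expf wrapper by Triton version.
# Both module paths below are assumptions, not taken from this PR.
from packaging import version

import triton
import triton.language as tl

if version.parse(triton.__version__) >= version.parse('3.0.0'):
    # Triton 3.x: the CUDA fast-math wrappers are assumed to live in the
    # libdevice extras, which is CUDA-only and would explain why non-CUDA
    # backends might not work.
    fast_expf = tl.extra.cuda.libdevice.fast_expf
else:
    # Triton 2.x: the wrapper was reachable through tl.math.
    fast_expf = tl.math.fast_expf

@triton.jit
def _exp_kernel(x_ptr, out_ptr, N: tl.constexpr):
    # fast_expf lowers to the approximate CUDA __expf-style intrinsic,
    # trading a little accuracy for speed in the softmax exponent.
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    tl.store(out_ptr + offs, fast_expf(x))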

@grimoire (Collaborator, Author) commented Oct 11, 2024

python lmdeploy/benchmark/profile_generation.py \
    internlm2_5-7b-chat-1m/ \
    --tp 4 \
    -c 1 \
    -ct 1 \
    -pt 1000000 \
    --session-len 1048576 \
    --backend pytorch -w 0 -tr 1
--------------------------------------------------
total time: 683.11s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 679.787s, 679.787s, 679.787s
total_token latency(min, max, ave): 679.787s, 679.787s, 679.787s
token_latency percentiles(50%,75%,95%,99%)(s): [679.787, 679.787, 679.787, 679.787]
throughput(output): 0.0 token/s
throughput(total): 1463.9 token/s
--------------------------------------------------

@AllentDan (Collaborator) left a comment
The UT may need an update for the new layout type.

@RunningLeon (Collaborator) left a comment
flash_attn-2.6.3+cu118torch2.3cxx11abiFALSE-cp38-cp38-linux_x86_64.whl

@AllentDan (Collaborator) left a comment
LGTM

-    key=['BLOCK_H', 'BLOCK_N', 'BLOCK_DMODEL', 'BLOCK_DV'])
+    key=['BLOCK_H', 'BLOCK_N', 'BLOCK_DMODEL', 'BLOCK_DV'],
+    warmup=10,
+    rep=25)
A collaborator commented:

What does rep mean?

@grimoire (Collaborator, Author) replied:

It repeats the benchmark during autotuning.

A collaborator asked:

How long does this setting (warmup=10, rep=25) take?

@grimoire (Collaborator, Author) replied:

Each tuning config takes 34~35 ms.
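For reference, a minimal sketch of how the warmup/rep knobs from the diff above plug into triton.autotune; the kernel body and configs here are hypothetical. In Triton's autotuner these values are per-config benchmark budgets in milliseconds, so warmup=10 plus rep=25 lines up with the 34~35 ms per config measured above:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({}, num_warps=4),
        triton.Config({}, num_warps=8),
    ],
    # Re-tune whenever one of these constexpr arguments changes.
    key=['BLOCK_H', 'BLOCK_N', 'BLOCK_DMODEL', 'BLOCK_DV'],
    warmup=10,  # ms of warmup per candidate config
    rep=25)     # ms of timed runs per candidate config
@triton.jit
def _demo_kernel(out_ptr, BLOCK_H: tl.constexpr, BLOCK_N: tl.constexpr,
                 BLOCK_DMODEL: tl.constexpr, BLOCK_DV: tl.constexpr):
    # Hypothetical body; only the autotune wiring matters here.
    offs = tl.arange(0, BLOCK_N)
    tl.store(out_ptr + offs, offs.to(tl.float32))

With two candidate configs, a tuning pass would then cost roughly 2 x (10 + 25) ms before the best config is cached.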

@RunningLeon (Collaborator) left a comment

LGTM

@lvhan028 merged commit 7dc0a5c into InternLM:main on Oct 18, 2024
5 checks passed