
optimize paged attention on triton3 #2553

Merged (12 commits) on Oct 18, 2024

Conversation

@grimoire (Collaborator) commented Oct 8, 2024

Triton 3 has moved the location of the CUDA fast-math functions.
This PR supports fast expf in paged attention with Triton 3.0.

Note

Non-CUDA backends might not work.

The fill-KV kernel and the attention kernel have also been updated so that the KV layout can be changed in the future.
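For context, here is a minimal sketch of the kind of version dispatch this change implies. The module paths are assumptions (Triton 2.x exposing the wrapper as tl.math.fast_expf, Triton 3.x moving the CUDA libdevice wrappers under tl.extra.cuda.libdevice), not code taken from this PR:

# Hedged sketch: pick the fast expf wrapper by Triton version.
# Both module paths below are assumptions, not taken from this PR.
from packaging import version

import triton
import triton.language as tl

if version.parse(triton.__version__) >= version.parse('3.0.0'):
    # Triton 3.x: the CUDA fast-math wrappers are assumed to live in the
    # libdevice extras, which is CUDA-only and would explain why non-CUDA
    # backends might not work.
    fast_expf = tl.extra.cuda.libdevice.fast_expf
else:
    # Triton 2.x: the wrapper was reachable through tl.math.
    fast_expf = tl.math.fast_expf

@triton.jit
def _exp_kernel(x_ptr, out_ptr, N: tl.constexpr):
    # fast_expf lowers to the approximate CUDA __expf-style intrinsic,
    # trading a little accuracy for speed in the softmax exponent.
    offs = tl.arange(0, N)
    x = tl.load(x_ptr + offs)
    tl.store(out_ptr + offs, fast_expf(x))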

@grimoire (Collaborator, Author) commented Oct 11, 2024

python lmdeploy/benchmark/profile_generation.py \
    internlm2_5-7b-chat-1m/ \
    --tp 4 \
    -c 1 \
    -ct 1 \
    -pt 1000000 \
    --session-len 1048576 \
    --backend pytorch -w 0 -tr 1
--------------------------------------------------
total time: 683.11s
concurrency: 1, test_round: 1
input_tokens: 1000000, output_tokens: 1
first_token latency(min, max, ave): 679.787s, 679.787s, 679.787s
total_token latency(min, max, ave): 679.787s, 679.787s, 679.787s
token_latency percentiles(50%,75%,95%,99%)(s): [679.787, 679.787, 679.787, 679.787]
throughput(output): 0.0 token/s
throughput(total): 1463.9 token/s
--------------------------------------------------

@AllentDan (Collaborator) left a comment
The UT may need an update for the new layout type.

@RunningLeon (Collaborator) left a comment
flash_attn-2.6.3+cu118torch2.3cxx11abiFALSE-cp38-cp38-linux_x86_64.whl

@AllentDan (Collaborator) left a comment
LGTM

-    key=['BLOCK_H', 'BLOCK_N', 'BLOCK_DMODEL', 'BLOCK_DV'])
+    key=['BLOCK_H', 'BLOCK_N', 'BLOCK_DMODEL', 'BLOCK_DV'],
+    warmup=10,
+    rep=25)
A collaborator commented:

What does rep mean?

@grimoire (Collaborator, Author) replied:

It repeats the benchmark during autotuning.

A collaborator asked:

How long does this setting (warmup=10, rep=25) take?

@grimoire (Collaborator, Author) replied:

Each tuning config takes 34~35 ms.
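For reference, a minimal sketch of how the warmup/rep knobs from the diff above plug into triton.autotune; the kernel body and configs here are hypothetical. In Triton's autotuner these values are per-config benchmark budgets in milliseconds, so warmup=10 plus rep=25 lines up with the 34~35 ms per config measured above:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({}, num_warps=4),
        triton.Config({}, num_warps=8),
    ],
    # Re-tune whenever one of these constexpr arguments changes.
    key=['BLOCK_H', 'BLOCK_N', 'BLOCK_DMODEL', 'BLOCK_DV'],
    warmup=10,  # ms of warmup per candidate config
    rep=25)     # ms of timed runs per candidate config
@triton.jit
def _demo_kernel(out_ptr, BLOCK_H: tl.constexpr, BLOCK_N: tl.constexpr,
                 BLOCK_DMODEL: tl.constexpr, BLOCK_DV: tl.constexpr):
    # Hypothetical body; only the autotune wiring matters here.
    offs = tl.arange(0, BLOCK_N)
    tl.store(out_ptr + offs, offs.to(tl.float32))

With two candidate configs, a tuning pass would then cost roughly 2 x (10 + 25) ms before the best config is cached.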

@RunningLeon (Collaborator) left a comment

LGTM

@lvhan028 merged commit 7dc0a5c into InternLM:main on Oct 18, 2024
5 checks passed