
[FRONTEND] Improve grid calculation for persistent kernels to hoist pe…#2283

Merged
qliu93 merged 1 commit into triton-lang:main from jsh-20:main
Sep 12, 2023


@jsh-20
Contributor

@jsh-20 jsh-20 commented Sep 12, 2023

…rf on problems that need few blocks.

Constrain the number of launched blocks to exactly what a persistent warp-specialized kernel needs. This is useful when a problem needs very few blocks.
E.g., for MxNxK=800x800x60000, f16_f16_f32, block size=128x128x64, non-split-k, experiments show a speedup of ~16%.
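
To make the arithmetic behind the example concrete, here is a minimal sketch of how such a grid cap could be computed. It is not the code from this PR: the helper name `persistent_grid` and the SM count of 132 are assumptions chosen only for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: number of blocks of size b needed to cover a."""
    return (a + b - 1) // b


def persistent_grid(M: int, N: int, block_m: int, block_n: int, num_sms: int) -> tuple:
    """Hypothetical helper: 1D launch grid for a persistent kernel.

    A persistent kernel usually launches one block per SM and lets each
    block loop over output tiles. When the problem has fewer tiles than
    SMs, launching more blocks than tiles only creates idle blocks, so
    the grid is capped at the tile count.
    """
    num_tiles = cdiv(M, block_m) * cdiv(N, block_n)
    return (min(num_sms, num_tiles),)


# Shape from the description: 800x800 output with 128x128 tiles.
# cdiv(800, 128) = 7, so there are 7 * 7 = 49 output tiles.
# num_sms=132 is an assumed device SM count, not a value from the PR.
grid = persistent_grid(800, 800, 128, 128, num_sms=132)
print(grid)  # (49,)
```

With these assumed numbers, 49 blocks are launched instead of 132; for a large problem with more tiles than SMs, the grid stays at the SM count.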

@jsh-20 jsh-20 requested a review from ptillet as a code owner September 12, 2023 07:07
@jsh-20 jsh-20 changed the title [FRONTEND] Modify grid calculation for persistent kernels to hoist pe… [FRONTEND] Improve grid calculation for persistent kernels to hoist pe… Sep 12, 2023
@bealwang
Contributor

LGTM

Collaborator

@qliu93 qliu93 left a comment


Approving additionally, as this PR has already been reviewed and approved by @bealwang.

@qliu93 qliu93 enabled auto-merge (squash) September 12, 2023 09:02
@qliu93 qliu93 merged commit fc5d7e6 into triton-lang:main Sep 12, 2023
alexander-zinoviev pushed a commit to alexander-zinoviev/triton that referenced this pull request Sep 21, 2023
…e… (triton-lang#2283)

pingzhuu pushed a commit to siliconflow/triton that referenced this pull request Apr 2, 2024
…e… (triton-lang#2283)

