Conversation

@MasterJH5574 (Contributor) commented on Aug 2, 2024

This PR bumps FlashInfer and updates PagedKVCache accordingly to improve performance.

Some notes on this bump:

  • When FlashInfer is enabled and the Grouped-Query Attention group size is at least 4, we use the prefill attention kernel for better performance (see the sketch after this list).
  • We enlarge the temporary workspace for FlashInfer accordingly, as the current FlashInfer version may consume a much larger workspace. The workspace is not allocated when FlashInfer is disabled.
  • We reduce the maximum block depth to 2, since cascade inference offers limited benefit when the batch size is not large and prompt reuse is low.
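
Below is a minimal, hypothetical Python sketch of the first two points; the function names, kernel labels, and workspace size are illustrative assumptions, not the actual PagedKVCache or FlashInfer API.

```python
# Hypothetical sketch only; names and sizes are illustrative assumptions.

def choose_attention_kernel(num_qo_heads: int, num_kv_heads: int,
                            flashinfer_enabled: bool) -> str:
    """Pick the attention kernel for decode steps.

    When FlashInfer is enabled and the GQA group size (query heads per
    KV head) is at least 4, the prefill kernel is used, since it handles
    large group sizes more efficiently.
    """
    group_size = num_qo_heads // num_kv_heads
    if flashinfer_enabled and group_size >= 4:
        return "flashinfer_prefill"
    return "flashinfer_decode" if flashinfer_enabled else "tir_decode"


def workspace_bytes(flashinfer_enabled: bool) -> int:
    """Size the temporary FlashInfer workspace.

    The newer FlashInfer version may need a much larger scratch buffer,
    so the workspace is enlarged when FlashInfer is enabled and skipped
    entirely otherwise. The size below is a placeholder, not the value
    used in the PR.
    """
    return 128 * 1024 * 1024 if flashinfer_enabled else 0
```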

@MasterJH5574 force-pushed the tvm-dev/2024-08-02-bump-flashinfer branch from d695af4 to e6987df on August 2, 2024, 21:45