LaurentMazare (Collaborator) commented Mar 16, 2024

Add a dedicated kernel to perform the copy efficiently in this case.

Benchmarks

Concatenating a (1, 32, 2000, 128) tensor with a (1, 32, 1, 128) one, which is typical of a kv-cache update.

  • On cpu (ryzen 2600x), before: 9.05ms, after: 7.3ms.
  • On cpu (macbook pro, M2 pro 16GB), before: 3.8ms, after: 1.95ms.
  • On gpu (RTX 2080 - 8GB), before: 480us, after: 290us (this is in kernel-blocking mode).

Using a q4k-quantized llama 7b model, generating a 1k-token sequence.

  • On gpu (RTX 2080 - 8GB), before: 19.9 token/s, after: 22.6 token/s.
  • On gpu (macbook pro, M2 pro 16GB with metal), before: 6.3 token/s, after: 9.3 token/s.

Ideally this would use cudaMemcpy2D on the CUDA side rather than the current kernel, which does a division per element. Another possibility would be to have specialized kernels for common shapes (e.g. powers of 2 for d2).
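For reference, here is an illustrative sketch in plain Rust (not the actual candle kernel) of the indexing scheme such a concatenation kernel uses: two contiguous tensors of shapes (d1, a, d2) and (d1, b, d2) are joined along the middle dimension, and each output element's source is recovered from its flat index with a division, mirroring the per-element division mentioned above. The function name and signature are hypothetical.

```rust
// Concatenate x (d1, a, d2) and y (d1, b, d2) along the middle dimension.
// Each output element is located via integer division/modulo on the flat
// index, which is the cost cudaMemcpy2D or shape-specialized kernels
// (e.g. d2 a power of 2, so the division becomes a shift) could avoid.
fn cat_middle(x: &[f32], y: &[f32], d1: usize, a: usize, b: usize, d2: usize) -> Vec<f32> {
    let out_rows = a + b;
    let mut out = vec![0f32; d1 * out_rows * d2];
    for (i, o) in out.iter_mut().enumerate() {
        // Recover (outer, row, inner) coordinates from the flat index.
        let inner = i % d2;
        let row = (i / d2) % out_rows; // the division in question
        let outer = i / (d2 * out_rows);
        *o = if row < a {
            x[(outer * a + row) * d2 + inner]
        } else {
            y[(outer * b + (row - a)) * d2 + inner]
        };
    }
    out
}

fn main() {
    // Tiny kv-cache-like example: a (1, 2, 2) cache extended by a (1, 1, 2) step.
    let cache = [1., 2., 3., 4.];
    let step = [5., 6.];
    let out = cat_middle(&cache, &step, 1, 2, 1, 2);
    assert_eq!(out, vec![1., 2., 3., 4., 5., 6.]);
    println!("{:?}", out);
}
```

In the kv-cache benchmark above, d1 would be 1 * 32, a = 2000, b = 1, and d2 = 128.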
