LaurentMazare (Collaborator) commented Mar 16, 2024

Add a dedicated kernel to perform the copy efficiently in this case.

Benchmarks

Concatenating a (1, 32, 2000, 128) tensor with a (1, 32, 1, 128) one, which is typical of a kv-cache update.

  • On cpu (ryzen 2600x), before: 9.05ms, after: 7.3ms.
  • On cpu (macbook pro, M2 pro 16GB), before: 3.8ms, after: 1.95ms.
  • On gpu (RTX 2080 - 8GB), before: 480us, after: 290us (this is in kernel-blocking mode).

Using a q4k-quantized llama 7b model, generating a 1k-token sequence.

  • On gpu (RTX 2080 - 8GB), before: 19.9 token/s, after: 22.6 token/s.
  • On gpu (macbook pro, M2 pro 16GB with metal), before: 6.3 token/s, after: 9.3 token/s.

Ideally this would use cudaMemcpy2D on the CUDA side rather than the current kernel, which does a division per element. Another possibility would be to have specialized kernels for common shapes (e.g. powers of 2 for d2).
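For reference, here is an illustrative sketch in plain Rust (not the actual candle kernel) of the indexing scheme such a concatenation kernel uses: two contiguous tensors of shapes (d1, a, d2) and (d1, b, d2) are joined along the middle dimension, and each output element's source is recovered from its flat index with a division, mirroring the per-element division mentioned above. The function name and signature are hypothetical.

```rust
// Concatenate x (d1, a, d2) and y (d1, b, d2) along the middle dimension.
// Each output element is located via integer division/modulo on the flat
// index, which is the cost cudaMemcpy2D or shape-specialized kernels
// (e.g. d2 a power of 2, so the division becomes a shift) could avoid.
fn cat_middle(x: &[f32], y: &[f32], d1: usize, a: usize, b: usize, d2: usize) -> Vec<f32> {
    let out_rows = a + b;
    let mut out = vec![0f32; d1 * out_rows * d2];
    for (i, o) in out.iter_mut().enumerate() {
        // Recover (outer, row, inner) coordinates from the flat index.
        let inner = i % d2;
        let row = (i / d2) % out_rows; // the division in question
        let outer = i / (d2 * out_rows);
        *o = if row < a {
            x[(outer * a + row) * d2 + inner]
        } else {
            y[(outer * b + (row - a)) * d2 + inner]
        };
    }
    out
}

fn main() {
    // Tiny kv-cache-like example: a (1, 2, 2) cache extended by a (1, 1, 2) step.
    let cache = [1., 2., 3., 4.];
    let step = [5., 6.];
    let out = cat_middle(&cache, &step, 1, 2, 1, 2);
    assert_eq!(out, vec![1., 2., 3., 4., 5., 6.]);
    println!("{:?}", out);
}
```

In the kv-cache benchmark above, d1 would be 1 * 32, a = 2000, b = 1, and d2 = 128.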
