@AidanBeltonS

This PR provides improvements to the dequantize_block_q4_K kernel. It focuses on improving the global memory accesses.

Three main changes are implemented:

  • Use a single 32-bit load for half2 rather than two 16-bit loads
  • Load all scales into local memory, then do random access on the local copy
  • Vectorize the q load so that each load fetches 32 bits rather than 8 bits
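
The three changes listed above can be illustrated with a small host-side C++ sketch. This is not the SYCL kernel from this PR: `DemoBlock`, `load_dm`, `load_scales`, and `unpack4` are hypothetical names, and the block layout is a simplified stand-in for the real `block_q4_K` definition. It only shows the access pattern: one wide load in place of several narrow ones, and a one-time copy of the scales before indexing them.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified block layout used only for illustration.
// The real block_q4_K struct in llama.cpp is defined elsewhere.
struct DemoBlock {
    uint16_t dm[2];       // two 16-bit half values stored back to back
    uint8_t  scales[12];  // packed scale bytes
    uint8_t  qs[16];      // packed 4-bit quantized values (shortened here)
};

// Change 1: fetch both 16-bit values with a single 32-bit load,
// then split the loaded word into its two halves.
inline void load_dm(const DemoBlock& b, uint16_t& d, uint16_t& dmin) {
    uint32_t packed;
    std::memcpy(&packed, b.dm, sizeof(packed));   // one 32-bit load
    uint16_t halves[2];
    std::memcpy(halves, &packed, sizeof(packed)); // view the word as 2 x 16 bit
    d    = halves[0];
    dmin = halves[1];
}

// Change 2: copy all scale bytes into a local array once, so that later
// random accesses hit the local copy instead of global memory.
inline std::array<uint8_t, 12> load_scales(const DemoBlock& b) {
    std::array<uint8_t, 12> local{};
    std::memcpy(local.data(), b.scales, local.size());
    return local;
}

// Change 3: load four quantized bytes with one 32-bit load, then unpack
// the low and high 4-bit values of each byte.
inline void unpack4(const uint8_t* qs, uint8_t lo[4], uint8_t hi[4]) {
    uint32_t word;
    std::memcpy(&word, qs, sizeof(word));         // one 32-bit load
    uint8_t bytes[4];
    std::memcpy(bytes, &word, sizeof(word));      // view the word as 4 bytes
    for (int i = 0; i < 4; ++i) {
        lo[i] = bytes[i] & 0x0F;
        hi[i] = bytes[i] >> 4;
    }
}
```

On a GPU the benefit would come from issuing fewer, wider global memory transactions; on the host this sketch only demonstrates that the wide loads return the same values as the narrow ones.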

All results below were collected on an A100 GPU.

| Metric | Without changes | With changes | % change | Note |
| --- | --- | --- | --- | --- |
| llama-bench 70B PP throughput (t/s) | 503.36 | 564.04 | -11.85 | Negative change is better |
| NSYS avg. kernel time (us) | 587.54 | 409.52 | 30.30 | Positive change is better |

No meaningful change in Intel GPU results has been observed.

@JohannesGaessler added the SYCL label (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Jul 2, 2024

@OuadiElfarouki OuadiElfarouki left a comment


Improvement observed on Nvidia A4000 & RTX 4070 as well (7B & 13B - Q4_K_*).
Thanks!

@joeatodd

joeatodd commented Jul 2, 2024

Ping @airMeng to check for regressions on Intel side


@airMeng airMeng left a comment


Oddly, I don't see any performance improvement on the Arc A770, but there is no regression either.

Pinging our performance expert @luoyu-intel.

@luoyu-intel

The code looks good. It can prevent cache misses, so you may not see a performance improvement if there are no cache misses in your case to begin with.

@airMeng airMeng merged commit fadde67 into ggml-org:master Jul 3, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 3, 2024
* Single load for half2

* Store scales in local mem

* Vec load quantized values
