Warmup and NUMA changes added, MLA changes updated #1

Draft
saood06 wants to merge 19 commits into main

Conversation

saood06 (Owner) commented Feb 3, 2025

This is actually usable at much higher context sizes. main worked fine under llama-batched-bench, but in the server it paged out, had worse performance, and was limited to ~8K context. This branch is tested in the server at ~8K with TG at 2.18 t/s, and it currently launches with a 64K context; 128K errors out. Tested on a dual-socket Xeon E5-2690 v3 with 384 GB of RAM.

CPU buffer size = 362010.72 MiB
n_ctx      = 64000
CPU NUMA KV buffer size = 313101.56 MiB
KV self size  = 305000.00 MiB, K (f16): 183000.00 MiB, V (f16): 122000.00 MiB
KV self size  = 4289.06 MiB, K^R (f16):  476.56 MiB, c^KV (f16): 3812.50 MiB
NUMA compute buffer size = 32343.01 MiB
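
As a rough sanity check of where these numbers come from, the sketch below reproduces them assuming DeepSeek-R1-style dimensions (61 layers, 128 KV heads, K/V head sizes 192/128, kv_lora_rank 512, qk_rope_head_dim 64, f16 cache); these model constants are my assumption and are not printed in the log above.

```cpp
// Rough sanity check of the KV sizes logged above, assuming DeepSeek-R1-style
// dimensions (not taken from the log): 61 layers, 128 KV heads,
// K head size 192, V head size 128, kv_lora_rank 512, qk_rope_head_dim 64, f16.
#include <cstdio>

int main() {
    const double MiB     = 1024.0 * 1024.0;
    const int    n_ctx   = 64000;
    const int    n_layer = 61;
    const int    bpe     = 2;   // bytes per f16 element

    // Old-style full KV cache: every head keeps its K (192) and V (128) vectors.
    const double k_full = 128.0 * 192 * bpe * n_layer * n_ctx / MiB;   // ~183000 MiB
    const double v_full = 128.0 * 128 * bpe * n_layer * n_ctx / MiB;   // ~122000 MiB

    // MLA cache: only the compressed c^KV (rank 512) and the RoPE part k^R (64).
    const double c_kv = 512.0 * bpe * n_layer * n_ctx / MiB;           // ~3812.5 MiB
    const double k_r  =  64.0 * bpe * n_layer * n_ctx / MiB;           // ~476.6 MiB

    printf("K (f16): %.2f MiB, V (f16): %.2f MiB\n", k_full, v_full);
    printf("c^KV (f16): %.2f MiB, K^R (f16): %.2f MiB\n", c_kv, k_r);
}
```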
  • To-do: Fix NUMA so that it actually toggles depending on whether NUMA is enabled.
  • To-do: Sync RPC to make it functional and performant, and add override model tensor buffers [#11397] (this will have limited practical benefit while the old KV cache is still being allocated but not used). Edit: based on initial tests on llama.cpp, this causes a performance loss and does not function past ~28 layers.
  • To-do: Grab any FA implementation if one appears.

ikawrakow and others added 7 commits January 29, 2025 14:05
* Adding gp option to llama-bench

Similar to pg, but it only looks at TG speed with a given
prompt length.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

They still need to be divisible by 32.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

... on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on Zen4.

Also fix q8_0 K-cache for head sizes that are not a multiple of 128.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f872.
With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with ikawrakow#181

* Faster AVX2 implementation for q5_k_r4

We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

After the changes I made to AVX2, it ends up being slightly faster
compared to what I had for Zen4.

* Minor tweak

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Quantization mixes tweaks

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make q6_0_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q6_0_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q5_0_r4 work with row sizes that are not a multiple of 128

... on Zen4 and AVX2

* Make q5_0_r4, q6_0_r4, iq4_nl_r4 work with row sizes that are not a multiple of 128

also on NEON.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
saood06 changed the title from "Mla update warmup numa" to "Warmup and NUMA changes added, MLA changes updated" on Feb 3, 2025
@fairydreaming

Cool! Note that this mmap-based KV buffer allocator should work even without NUMA; I will probably name it ggml_backend_mmap_buffer_type.

saood06 (Owner, Author) commented Feb 3, 2025

> Cool! Note that this mmap-based KV buffer allocator should work even without NUMA; I will probably name it ggml_backend_mmap_buffer_type.

Makes sense; even though it synergizes with NUMA, it still has benefits without NUMA.
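
For reference, here is a minimal Linux sketch of what such an mmap-backed buffer allocation could look like. The function names are placeholders, not the actual ggml_backend_mmap_buffer_type implementation; the point is that lazily-touched anonymous pages are what make first-touch NUMA placement work.

```cpp
// Minimal sketch (Linux) of an mmap-backed buffer allocation like the one
// discussed above; placeholder names, not the real ggml backend API.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

static void * kv_buffer_alloc(size_t size) {
    // Anonymous private mapping; pages are allocated lazily on first touch,
    // which lets the OS place each page on the NUMA node of the touching thread.
    void * ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (ptr == MAP_FAILED) {
        return nullptr;
    }
    // Hint that transparent huge pages are welcome; ignore failure if THP is off.
    (void) madvise(ptr, size, MADV_HUGEPAGE);
    return ptr;
}

static void kv_buffer_free(void * ptr, size_t size) {
    if (ptr) munmap(ptr, size);
}

int main() {
    const size_t size = size_t(1) << 30; // 1 GiB test allocation
    void * buf = kv_buffer_alloc(size);
    printf("mmap KV buffer at %p (%zu bytes)\n", buf, size);
    kv_buffer_free(buf, size);
}
```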

If you end up using this branch, I'd appreciate performance numbers.

I'm going to try to crudely make it so that the old KV cache does not get allocated, because I do think that, once that's done, you could offload everything besides the non-shared experts with just 24 GB of VRAM at ~23K context, assuming it works. I was only able to RPC 29 layers even when I had ample VRAM; 30+ would just silently crash the RPC server (that was without the MLA branch; I have yet to test the diff you gave me to make the MLA branch work).

saood06 (Owner, Author) commented Feb 4, 2025

Some runtime numbers showing ~30K context:

kv cache rm [p0, end) | timestamp=1738584936 p0=3
kv cache rm [p0, end) | timestamp=1738585206 p0=2051
kv cache rm [p0, end) | timestamp=1738585594 p0=4099
kv cache rm [p0, end) | timestamp=1738586098 p0=6147
kv cache rm [p0, end) | timestamp=1738586716 p0=8195
kv cache rm [p0, end) | timestamp=1738587443 p0=10243
kv cache rm [p0, end) | timestamp=1738588289 p0=12291
kv cache rm [p0, end) | timestamp=1738589245 p0=14339
kv cache rm [p0, end) | timestamp=1738590323 p0=16387
kv cache rm [p0, end) | timestamp=1738591540 p0=18435
kv cache rm [p0, end) | timestamp=1738592866 p0=20483
kv cache rm [p0, end) | timestamp=1738594456 p0=22531
kv cache rm [p0, end) | timestamp=1738596175 p0=24579
prompt eval time     = 12074054.06 ms / 25522 tokens (  473.08 ms per token,     2.11 tokens per second) | timestamp=1738599260 t_prompt_processing=12074054.06 n_prompt_tokens_processed=25522 t_token=473.08416503408824 n_tokens_second=2.113788779905463
generation eval time = 2250383.89 ms /  2088 runs   ( 1077.77 ms per token,     0.93 tokens per second) | timestamp=1738599260 t_token_generation=2250383.888 n_decoded=2088 t_token=1077.7700613026818 n_tokens_second=0.9278416945366968
total time = 14324437.95 ms | timestamp=1738599260  t_prompt_processing=12074054.06 t_token_generation=2250383.888 t_total=14324437.948

At higher context you can see how PP slows down as it gets deeper into the prompt.
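
The per-chunk speed can be read straight off the timestamps above; the p0 values show each chunk is 2048 tokens.

```cpp
// Per-chunk prompt-processing speed from the "kv cache rm" timestamps above:
// each chunk is 2048 tokens, so t/s = 2048 / (timestamp[i] - timestamp[i-1]).
#include <cstdio>

int main() {
    const long ts[] = {1738584936, 1738585206, 1738585594, 1738586098, 1738586716,
                       1738587443, 1738588289, 1738589245, 1738590323, 1738591540,
                       1738592866, 1738594456, 1738596175};
    const int p0[]  = {3, 2051, 4099, 6147, 8195, 10243, 12291, 14339, 16387,
                       18435, 20483, 22531, 24579};
    const int n = sizeof(ts) / sizeof(ts[0]);
    for (int i = 1; i < n; ++i) {
        const double dt = double(ts[i] - ts[i - 1]);
        printf("tokens %5d-%5d: %6.0f s -> %.2f t/s\n",
               p0[i - 1], p0[i], dt, (p0[i] - p0[i - 1]) / dt);
    }
    // PP drops from ~7.6 t/s for the first 2048-token chunk to ~1.2 t/s for the
    // chunk ending at 24579, consistent with the 2.11 t/s overall average above.
}
```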

Another high-context generation; there is less PP in this one since most of the previous prompt was cached.

kv cache rm [p0, end) | timestamp=1738649710 p0=26931
prompt eval time     =  126917.48 ms /   143 tokens (  887.53 ms per token,     1.13 tokens per second) | timestamp=1738652616 t_prompt_processing=126917.477 n_prompt_tokens_processed=143 t_token=887.5348041958042 n_tokens_second=1.1267163780761258
generation eval time = 2778653.10 ms /  2726 runs   ( 1019.32 ms per token,     0.98 tokens per second) |  timestamp=1738652616 t_token_generation=2778653.096 n_decoded=2726 t_token=1019.3151489361702 n_tokens_second=0.9810508565909877
 total time = 2905570.57 ms | timestamp=1738652616 id_slot=0 id_task=11466 t_prompt_processing=126917.477 t_token_generation=2778653.096 t_total=2905570.573

ikawrakow and others added 3 commits February 5, 2025 13:49
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
ikawrakow and others added 9 commits February 6, 2025 14:08
* iq1_m_r4: basics (quantize/dequantize)

* iq1_m_r4: Zen4 gemm

* iq1_m_r4: neon gemm

* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4

With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.

* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving

* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving

* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
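
The r4 → r8 renames above reflect that eight rows are interleaved per group. As a generic illustration of the idea only (plain floats, not the actual packed q8_0_r8 quant layout): for every block of 32 columns, the corresponding blocks of 8 consecutive rows are stored back to back, so a GEMM kernel can stream through 8 rows with contiguous loads.

```cpp
// Generic 8-row interleaving illustration (not the real quantized layout).
// Assumes nrows % 8 == 0 and ncols % 32 == 0.
#include <vector>
#include <cstddef>

std::vector<float> interleave_rows_8(const std::vector<float> & src,
                                     size_t nrows, size_t ncols) {
    const size_t block = 32;                        // elements per block
    std::vector<float> dst(src.size());
    size_t pos = 0;
    for (size_t r0 = 0; r0 < nrows; r0 += 8) {      // groups of 8 rows
        for (size_t b = 0; b < ncols; b += block) { // each block of 32 columns
            for (size_t r = r0; r < r0 + 8; ++r) {  // 8 rows' blocks, back to back
                for (size_t c = b; c < b + block; ++c) {
                    dst[pos++] = src[r*ncols + c];
                }
            }
        }
    }
    return dst;
}

int main() {
    const size_t nrows = 8, ncols = 64;
    std::vector<float> src(nrows*ncols);
    for (size_t i = 0; i < src.size(); ++i) src[i] = float(i);
    // packed order: row0[0..31], row1[0..31], ..., row7[0..31], row0[32..63], ...
    std::vector<float> packed = interleave_rows_8(src, nrows, ncols);
    (void) packed;
    return 0;
}
```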