Warmup and NUMA changes added, MLA changes updated #1

Draft
saood06 wants to merge 19 commits into main

Conversation

saood06 (Owner) commented Feb 3, 2025

This is actually usable at much higher context sizes. main worked fine under llama-batched-bench, but in the server it paged out, had worse performance, and was limited to ~8K context. This branch is tested in the server at ~8K with TG at 2.18 t/s, and it currently launches with a 64K context; 128K errors out. Tested on a dual-socket Xeon E5-2690 v3 with 384 GB of RAM.

CPU buffer size = 362010.72 MiB
n_ctx      = 64000
CPU NUMA KV buffer size = 313101.56 MiB
KV self size  = 305000.00 MiB, K (f16): 183000.00 MiB, V (f16): 122000.00 MiB
KV self size  = 4289.06 MiB, K^R (f16):  476.56 MiB, c^KV (f16): 3812.50 MiB
NUMA compute buffer size = 32343.01 MiB
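
As a rough sanity check of where these numbers come from, the sketch below reproduces them assuming DeepSeek-R1-style dimensions (61 layers, 128 KV heads, K/V head sizes 192/128, kv_lora_rank 512, qk_rope_head_dim 64, f16 cache); these model constants are my assumption and are not printed in the log above.

```cpp
// Rough sanity check of the KV sizes logged above, assuming DeepSeek-R1-style
// dimensions (not taken from the log): 61 layers, 128 KV heads,
// K head size 192, V head size 128, kv_lora_rank 512, qk_rope_head_dim 64, f16.
#include <cstdio>

int main() {
    const double MiB     = 1024.0 * 1024.0;
    const int    n_ctx   = 64000;
    const int    n_layer = 61;
    const int    bpe     = 2;   // bytes per f16 element

    // Old-style full KV cache: every head keeps its K (192) and V (128) vectors.
    const double k_full = 128.0 * 192 * bpe * n_layer * n_ctx / MiB;   // ~183000 MiB
    const double v_full = 128.0 * 128 * bpe * n_layer * n_ctx / MiB;   // ~122000 MiB

    // MLA cache: only the compressed c^KV (rank 512) and the RoPE part k^R (64).
    const double c_kv = 512.0 * bpe * n_layer * n_ctx / MiB;           // ~3812.5 MiB
    const double k_r  =  64.0 * bpe * n_layer * n_ctx / MiB;           // ~476.6 MiB

    printf("K (f16): %.2f MiB, V (f16): %.2f MiB\n", k_full, v_full);
    printf("c^KV (f16): %.2f MiB, K^R (f16): %.2f MiB\n", c_kv, k_r);
}
```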
  • To-do: Fix NUMA so that it actually toggles depending on whether NUMA is enabled.
  • To-do: Sync RPC to make it functional and performant, and add override model tensor buffers [#11397] (this will have limited practical benefit while the old KV cache is still being allocated but not used). Edit: based on initial tests on llama.cpp, this causes a performance loss and does not function past ~28 layers.
  • To-do: Grab any FA implementation if one appears.

ikawrakow and others added 7 commits January 29, 2025 14:05
* Adding gp option to llama-bench

Similar to pg, but it only looks at TG speed with a given
prompt length.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

They still need to be divisible by 32.

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q8_0_r4 work with tensor row sizes that are not a multiple of 128

... on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on AVX2

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on NEON

* Make q4_0_r4 work with tensor row sizes that are not a multiple of 128

... on Zen4.

Also fix q8_0 K-cache for head sizes that are not a multiple of 128.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Slightly faster AVX2 implementation for q4_k_r4

* Even better AVX2 implementation for q4_k_r4

We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a
Ryzen-5975WX CPU, up from 291 t/s when I last measured
on 3c5f872.
With FA and Q8_0 K-cache we get to 339.5 t/s.

* Fix llama-bench labels that I broke with ikawrakow#181

* Faster AVX2 implementation for q5_k_r4

We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU,
up from 273 t/s.

* Use AVX2 implementation of q4_k_r4 and q5_k_r4 also on Zen4

After the changes I made to AVX2, it ends up being slightly faster
compared to what I had for Zen4.

* Minor tweak

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Quantization mixes tweaks

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make iq4_nl_r4 work with row sizes that are not a multiple of 128

... on AVX2

* Make q6_0_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q6_0_r4 work with row sizes that are not a multiple of 128

... on Zen4

* Make q5_0_r4 work with row sizes that are not a multiple of 128

... on Zen4 and AVX2

* Make q5_0_r4, q6_0_r4, iq4_nl_r4 work with row sizes that are not a multiple of 128

also on NEON.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
Co-authored-by: Stanisław Szymczyk <[email protected]>
saood06 changed the title from "Mla update warmup numa" to "Warmup and NUMA changes added, MLA changes updated" on Feb 3, 2025
@fairydreaming

Cool! Note that this mmap-based KV buffer allocator should work even without NUMA; I will probably name it ggml_backend_mmap_buffer_type.

saood06 (Owner, Author) commented Feb 3, 2025

> Cool! Note that this mmap-based KV buffer allocator should work even without NUMA; I will probably name it ggml_backend_mmap_buffer_type.

Makes sense; even though it synergizes with NUMA, it still has benefits without NUMA.
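
For reference, here is a minimal Linux sketch of what such an mmap-backed buffer allocation could look like. The function names are placeholders, not the actual ggml_backend_mmap_buffer_type implementation; the point is that lazily-touched anonymous pages are what make first-touch NUMA placement work.

```cpp
// Minimal sketch (Linux) of an mmap-backed buffer allocation like the one
// discussed above; placeholder names, not the real ggml backend API.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

static void * kv_buffer_alloc(size_t size) {
    // Anonymous private mapping; pages are allocated lazily on first touch,
    // which lets the OS place each page on the NUMA node of the touching thread.
    void * ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (ptr == MAP_FAILED) {
        return nullptr;
    }
    // Hint that transparent huge pages are welcome; ignore failure if THP is off.
    (void) madvise(ptr, size, MADV_HUGEPAGE);
    return ptr;
}

static void kv_buffer_free(void * ptr, size_t size) {
    if (ptr) munmap(ptr, size);
}

int main() {
    const size_t size = size_t(1) << 30; // 1 GiB test allocation
    void * buf = kv_buffer_alloc(size);
    printf("mmap KV buffer at %p (%zu bytes)\n", buf, size);
    kv_buffer_free(buf, size);
}
```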

If you end up using this branch, I'd appreciate performance numbers.

I'm going to try to crudely make it so that the old KV cache does not get allocated, because I do think that, once that's done, you could offload everything besides the non-shared experts with just 24 GB of VRAM at ~23K context, assuming it works. I was only able to RPC 29 layers even when I had ample VRAM; 30+ would just silently crash the RPC server (that was without the MLA branch; I have yet to test the diff you gave me to make the MLA branch work).

saood06 (Owner, Author) commented Feb 4, 2025

Some runtime numbers showing ~30K context:

kv cache rm [p0, end) | timestamp=1738584936 p0=3
kv cache rm [p0, end) | timestamp=1738585206 p0=2051
kv cache rm [p0, end) | timestamp=1738585594 p0=4099
kv cache rm [p0, end) | timestamp=1738586098 p0=6147
kv cache rm [p0, end) | timestamp=1738586716 p0=8195
kv cache rm [p0, end) | timestamp=1738587443 p0=10243
kv cache rm [p0, end) | timestamp=1738588289 p0=12291
kv cache rm [p0, end) | timestamp=1738589245 p0=14339
kv cache rm [p0, end) | timestamp=1738590323 p0=16387
kv cache rm [p0, end) | timestamp=1738591540 p0=18435
kv cache rm [p0, end) | timestamp=1738592866 p0=20483
kv cache rm [p0, end) | timestamp=1738594456 p0=22531
kv cache rm [p0, end) | timestamp=1738596175 p0=24579
prompt eval time     = 12074054.06 ms / 25522 tokens (  473.08 ms per token,     2.11 tokens per second) | timestamp=1738599260 t_prompt_processing=12074054.06 n_prompt_tokens_processed=25522 t_token=473.08416503408824 n_tokens_second=2.113788779905463
generation eval time = 2250383.89 ms /  2088 runs   ( 1077.77 ms per token,     0.93 tokens per second) | timestamp=1738599260 t_token_generation=2250383.888 n_decoded=2088 t_token=1077.7700613026818 n_tokens_second=0.9278416945366968
total time = 14324437.95 ms | timestamp=1738599260  t_prompt_processing=12074054.06 t_token_generation=2250383.888 t_total=14324437.948

At higher context you can see how PP slows down as it gets deeper into the prompt.
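
The per-chunk speed can be read straight off the timestamps above; the p0 values show each chunk is 2048 tokens.

```cpp
// Per-chunk prompt-processing speed from the "kv cache rm" timestamps above:
// each chunk is 2048 tokens, so t/s = 2048 / (timestamp[i] - timestamp[i-1]).
#include <cstdio>

int main() {
    const long ts[] = {1738584936, 1738585206, 1738585594, 1738586098, 1738586716,
                       1738587443, 1738588289, 1738589245, 1738590323, 1738591540,
                       1738592866, 1738594456, 1738596175};
    const int p0[]  = {3, 2051, 4099, 6147, 8195, 10243, 12291, 14339, 16387,
                       18435, 20483, 22531, 24579};
    const int n = sizeof(ts) / sizeof(ts[0]);
    for (int i = 1; i < n; ++i) {
        const double dt = double(ts[i] - ts[i - 1]);
        printf("tokens %5d-%5d: %6.0f s -> %.2f t/s\n",
               p0[i - 1], p0[i], dt, (p0[i] - p0[i - 1]) / dt);
    }
    // PP drops from ~7.6 t/s for the first 2048-token chunk to ~1.2 t/s for the
    // chunk ending at 24579, consistent with the 2.11 t/s overall average above.
}
```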

Another high-context generation; there is less PP in this one since most of the previous prompt was cached.

kv cache rm [p0, end) | timestamp=1738649710 p0=26931
prompt eval time     =  126917.48 ms /   143 tokens (  887.53 ms per token,     1.13 tokens per second) | timestamp=1738652616 t_prompt_processing=126917.477 n_prompt_tokens_processed=143 t_token=887.5348041958042 n_tokens_second=1.1267163780761258
generation eval time = 2778653.10 ms /  2726 runs   ( 1019.32 ms per token,     0.98 tokens per second) |  timestamp=1738652616 t_token_generation=2778653.096 n_decoded=2726 t_token=1019.3151489361702 n_tokens_second=0.9810508565909877
 total time = 2905570.57 ms | timestamp=1738652616 id_slot=0 id_task=11466 t_prompt_processing=126917.477 t_token_generation=2778653.096 t_total=2905570.573

ikawrakow and others added 3 commits February 5, 2025 13:49
* iq1_s_r4: basics - quantize/dequantize

* iq1_s_r4: gemm/gemv works on AVX2/Zen4

* Don't forget to make sure we have a multiple of 4 rows per thread

* iq1_s_r4: this is better

* iq1_s_r4: fix Zen4 after AVX2 changes

* iq1_s_r4: NEON gemm/gemv

* iq1_s_r4: more bits for shared experts

With this mix we arrive at PPL(512) = 9.4140
for Deepseek-Lite using 1.766 bpw for the repeating layers.

On the Ryzen-7950X we get PP-512 = 494 t/s and
TG-128 = 52 t/s @ 16 threads.

* Forgotten counter increment

* iq1_s_r4: slightly faster AVX2/Zen4 gemm/gemv

* Compiler warnings

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
ikawrakow and others added 9 commits February 6, 2025 14:08
* iq1_m_r4: basics (quantize/dequantize)

* iq1_m_r4: Zen4 gemm

* iq1_m_r4: neon gemm

* iq1_m_r4: switch to q8_0_x4 also on AVX2/Zen4

With the deltas being per group of 8, we cannot make use
of the q8 sums stored in q8_1, so we get a tiny gain by
using q8_0_x4.

* iq1_m_r4: rename mul_mat_iq1_m_r4_q8_1 to mul_mat_iq1_m_r4_q8_0

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
* Rename q4_0_r4 to q4_0_r8 to reflect actual row interleaving

* Rename q8_0_r4 to q8_0_r8 to reflect actual row interleaving

* Rename iq4_xs_r4 to iq4_xs_r8 to reflect actual row interleaving

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
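
The r4 → r8 renames above reflect that eight rows are interleaved per group. As a generic illustration of the idea only (plain floats, not the actual packed q8_0_r8 quant layout): for every block of 32 columns, the corresponding blocks of 8 consecutive rows are stored back to back, so a GEMM kernel can stream through 8 rows with contiguous loads.

```cpp
// Generic 8-row interleaving illustration (not the real quantized layout).
// Assumes nrows % 8 == 0 and ncols % 32 == 0.
#include <vector>
#include <cstddef>

std::vector<float> interleave_rows_8(const std::vector<float> & src,
                                     size_t nrows, size_t ncols) {
    const size_t block = 32;                        // elements per block
    std::vector<float> dst(src.size());
    size_t pos = 0;
    for (size_t r0 = 0; r0 < nrows; r0 += 8) {      // groups of 8 rows
        for (size_t b = 0; b < ncols; b += block) { // each block of 32 columns
            for (size_t r = r0; r < r0 + 8; ++r) {  // 8 rows' blocks, back to back
                for (size_t c = b; c < b + block; ++c) {
                    dst[pos++] = src[r*ncols + c];
                }
            }
        }
    }
    return dst;
}

int main() {
    const size_t nrows = 8, ncols = 64;
    std::vector<float> src(nrows*ncols);
    for (size_t i = 0; i < src.size(); ++i) src[i] = float(i);
    // packed order: row0[0..31], row1[0..31], ..., row7[0..31], row0[32..63], ...
    std::vector<float> packed = interleave_rows_8(src, nrows, ncols);
    (void) packed;
    return 0;
}
```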