Merged
312 commits
8a132fa
vulkan: unify type macros to use Vx instead of _VECx (#21605)
0cc4m Apr 9, 2026
8a65a7a
ci: drop v5 `all:` composition from labeler.yml (#21627)
Marxist-Leninist Apr 9, 2026
b54cb2e
sycl : add flash-attn support for head size 512 (#21654)
qnixsynapse Apr 9, 2026
75511a8
webui: Add option to pre-encode conversation for faster next turns (#…
allozaur Apr 9, 2026
3ee9da0
server : fix grammar commandline args (#21543)
AUTOMATIC1111 Apr 9, 2026
9949ad0
fix: Model Selector choice sync (#21628)
allozaur Apr 9, 2026
5e9c635
metal : add missing mm-id specializations for q1_0 (#21662)
ggerganov Apr 9, 2026
243532e
jinja : support ensure_ascii=true, string repetition and int/float se…
kwajiehao Apr 9, 2026
0ec191e
vocab: add gemma4 tokenizer tests, fix edge case (#21534)
pwilkin Apr 9, 2026
501aeed
mtmd: support dots.ocr (#17575)
ngxson Apr 9, 2026
057dba3
model: fix multimodal padding token for gemma3n/gemma4 (#21625)
ngxson Apr 9, 2026
2622975
common : simplify autoparser tagged parser rules (#21216)
aldehir Apr 9, 2026
ddf03c6
common : fix ambiguous grammar rule in gemma4 (#21661)
aldehir Apr 9, 2026
4ef9301
webui: add "Send message on Enter" setting (#21577)
mourix Apr 9, 2026
c8ac02f
requirements : update transformers to 5.5.1 (#21617)
danbev Apr 9, 2026
009a113
ggml : check return value of CUB calls used in argsort and top-k (the…
fairydreaming Apr 9, 2026
d6f3030
ggml: backend-agnostic tensor parallelism (experimental) (#19378)
JohannesGaessler Apr 9, 2026
d132f22
HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (#21570)
andyluo7 Apr 9, 2026
e34f042
CUDA: fuse muls (#21665)
am17an Apr 10, 2026
e095a48
common : add fluidity to the progress bar (#21671)
angt Apr 10, 2026
7b69125
vulkan: Support Q1_0 (#21539)
jeffbolznv Apr 10, 2026
3f8752b
docs : fix broken link to ggml-openvino in OPENVINO.md (#21709)
ibelem Apr 10, 2026
d7ff074
common : enable reasoning budget sampler for gemma4 (#21697)
berkidem Apr 10, 2026
f989a6e
webui: Static build output improvements (#21667)
allozaur Apr 10, 2026
0893f50
common: mark --split-mode tensor as experimental (#21684)
JohannesGaessler Apr 10, 2026
fb38d6f
common : fix when loading a cached HF models with unavailable API (#2…
angt Apr 10, 2026
5dd1025
server : ignore --alias when using --models-preset (#21380)
angt Apr 10, 2026
e4fed9d
ggml-webgpu: address quantization precision and backend lifecycle man…
Constannnnnt Apr 10, 2026
bfd1f45
ggml-webgpu: support non-square subgroup matrix configs for Intel GPU…
SharmaRithik Apr 10, 2026
e62fa13
model : make Gemma 4 shared-KV tail attn_k tensors optional on load (…
MoonRide303 Apr 10, 2026
05b3caa
common : add callback interface for download progress (#21735)
angt Apr 10, 2026
3fc6506
common : better align to the updated official gemma4 template (#21704)
aldehir Apr 10, 2026
9aa2807
hexagon: improved Op queuing, buffer and cache management (#21705)
max-krasnyansky Apr 10, 2026
81069a8
hexagon: add support for linux on snapdragon (#21707)
tboinovski1 Apr 10, 2026
b136b62
fix: Fix broken structured output when using $refs in json_schema (#2…
Galunid Apr 10, 2026
a29e4c0
CUDA: also store node->src ne/nb for graph equality (#21736)
am17an Apr 11, 2026
660386f
py : Bump typer to latest to fix huggingface_hub issue (#21701)
bartowski1182 Apr 11, 2026
2b2cd57
ggml : fix a few instances of missing GGML_TYPE_Q1_0 cases (#21716)
CISC Apr 11, 2026
865ff06
TP: fix Qwen 3 Next data split (#21732)
JohannesGaessler Apr 11, 2026
af1127d
opencl: add basic support for q5_k (#21593)
shaofeiqi Apr 11, 2026
073bb2c
mtmd : add MERaLiON-2 multimodal audio support (#21756)
SiruiHe Apr 11, 2026
ff5ef82
CUDA: skip compilation of superfluous FA kernels (#21768)
JohannesGaessler Apr 11, 2026
6313acb
docs: add guide on how to add multimodal support (#21778)
ngxson Apr 12, 2026
9e209c5
fix: Proper messages rendering for "Show raw output" (#21672)
allozaur Apr 12, 2026
547765a
mtmd: add Gemma 4 audio conformer encoder support (#21421)
stephencox-ict Apr 12, 2026
aa4695c
mtmd: add gemma 4 test (vision + audio) [no ci] (#21806)
ngxson Apr 12, 2026
1e9d771
convert : force f16 or f32 on step3-vl conv weights (#21646)
CISC Apr 12, 2026
21a4933
mtmd: qwen3 audio support (qwen3-omni and qwen3-asr) (#19441)
ngxson Apr 12, 2026
82764d8
mtmd: fix crash when sending image under 2x2 pixels (#21711)
mzsergiu Apr 12, 2026
873c825
sycl: disable Q1_0 in backend and cleanup unused variables (#21807)
qnixsynapse Apr 13, 2026
bafae27
Remove extra conditional check on debug mode. (#21798)
yomaytk Apr 13, 2026
227ed28
webui: MCP Diagnostics improvements (#21803)
allozaur Apr 13, 2026
974c8c9
webui: add setting for first-line chat titles (#21797)
crodjer Apr 13, 2026
920b3e7
mtmd: use causal attn for gemma 4 audio (#21824)
ngxson Apr 13, 2026
9f5e1ed
CUDA: Limit DeviceSegmentedSort to immediate mode (#21718)
ORippler Apr 13, 2026
ce8fd4b
server: Expose build_info in router mode (#21835)
gaspardpetit Apr 13, 2026
aa00911
common : add download cancellation and temp file cleanup (#21813)
angt Apr 13, 2026
75f3bc9
vulkan: Flash Attention DP4A shader for quantized KV cache (#20797)
0cc4m Apr 13, 2026
a8bad38
ci: Also exempt 'security' tag from auto-close (#21844)
ckastner Apr 13, 2026
1c0d908
chat: dedicated DeepSeek v3.2 parser + "official" template (#21785)
pwilkin Apr 13, 2026
e974923
docs: listing qwen3-asr and qwen3-omni as supported (#21857)
ngxson Apr 13, 2026
e21cdc1
common/gemma4 : handle parsing edge cases (#21760)
aldehir Apr 13, 2026
e489a5c
server: support OAI /v1/audio/transcriptions API (#21863)
ngxson Apr 14, 2026
6a6780a
vulkan: Support GGML_TYPE_NVFP4 (#21455)
jeffbolznv Apr 14, 2026
56666fa
common: skip reasoning budget sampler when no budget is requested (#2…
berkidem Apr 14, 2026
5a23695
ggml-webgpu: Update register tiling matmul to use f32 accumulation (#…
reeselevine Apr 14, 2026
acc37a4
cmake: fix CMP0194 warning on Windows with MSVC (#21630)
texasich Apr 14, 2026
2e05f06
ggml : fix ARM NEON nvfp4 dot product on non-dotprod targets (#21559)
richarddd Apr 14, 2026
be76dd0
vendor : update BoringSSL to 0.20260413.0 (#21881)
angt Apr 14, 2026
aa0f189
metal : add XIELU unary op (#20802)
seyoungjeong Apr 14, 2026
f4b5bf2
ci : re-enable mac workflows (#21894)
ggerganov Apr 14, 2026
1f30ac0
vulkan: Programmatically add RoundingModeRTE to all shaders when the …
jeffbolznv Apr 14, 2026
707c0b7
mtmd: add mtmd_image_tokens_get_decoder_pos() API (#21851)
ngxson Apr 14, 2026
c0de6ed
metal : fix FA support logic (#21898)
ggerganov Apr 14, 2026
fae3a28
ggml : remove ggml-ext.h (#21869)
ngxson Apr 14, 2026
5d14e5d
hexagon: optimization for HMX mat_mul (#21554)
njsyw1997 Apr 14, 2026
e39eba2
read n_ctx back after making llama_context (#21939)
smashedpumpkin Apr 15, 2026
e1a9a6d
autoparser: support case of JSON_NATIVE with per-call markers (test c…
pwilkin Apr 15, 2026
8dc530b
ci: disable test-backend-ops on Vulkan llvmpipe run and resture defau…
0cc4m Apr 15, 2026
80d8770
docs: more extensive RoPE documentation [no ci] (#21953)
ngxson Apr 15, 2026
adb541a
rpc : add native RDMA transport for RPC backend (RoCEv2) (#20590)
dvv101111 Apr 15, 2026
014dca4
CUDA: manage NCCL communicators in context (#21891)
JohannesGaessler Apr 15, 2026
a620695
CUDA: require explicit opt-in for P2P access (#21910)
JohannesGaessler Apr 15, 2026
20d3bc2
ggml-webgpu: Fix dequantization helpers to not pass in pointers (#21872)
reeselevine Apr 15, 2026
7e72b38
cuda: Q1_0 initial backend (#21629)
khosravipasha Apr 15, 2026
b3d7587
vulkan: optimize im2col (#21713)
0cc4m Apr 15, 2026
0b37448
WIP: add TurboQuant KV cache types (turbo3, turbo4)
TheTom Mar 25, 2026
b37e5f0
feat: Metal kernels for TurboQuant KV cache (turbo3, turbo4) #21
TheTom Mar 25, 2026
63e25a5
feat: full TurboQuant with rotation matrices in Metal kernels #21
TheTom Mar 25, 2026
f40ccae
feat: inline rotation matrices in Metal shader + C round-trip test #21
TheTom Mar 25, 2026
d10ca51
fix: remove thread static from Metal dequantize, fix stale code #23
TheTom Mar 25, 2026
3777275
feat: replace dense 128x128 matvec with Fast Walsh-Hadamard rotation #26
TheTom Mar 25, 2026
fa52236
docs: detailed speed investigation plan for TurboQuant Metal shader #23
TheTom Mar 25, 2026
8cf626a
docs: log simd_broadcast attempt — no speed improvement #23
TheTom Mar 25, 2026
2448468
docs: log threadgroup attempt — no speed improvement, rethinking #23
TheTom Mar 25, 2026
d675e12
docs: CRITICAL — dequant is NOT the bottleneck, no-op still 2.4 tok/s…
TheTom Mar 25, 2026
adfbc59
fix: inline turbo-wht.h — was causing CPU fallback, not Metal! #23
TheTom Mar 25, 2026
e8f23eb
docs: real Metal benchmarks after #include fix — 8× gap not 35× #23
TheTom Mar 25, 2026
447ca79
docs: final investigation summary + upstream tracking #23 #27
TheTom Mar 25, 2026
5787361
docs: upstream competitive intel — pre-rotate-queries is the key #28
TheTom Mar 25, 2026
786f0c9
docs: speed ceiling test — 49 tok/s without dequant rotation (4.6× ga…
TheTom Mar 25, 2026
79c08dd
docs: pre-rotate-queries implementation plan + speed ceiling 49 tok/s
TheTom Mar 25, 2026
808e171
feat: pre-rotate-queries optimization — 51.4 tok/s (5× speedup) #23
TheTom Mar 25, 2026
ac157ee
docs: final investigation summary — 2.4 → 51.4 tok/s journey complete…
TheTom Mar 25, 2026
d044965
feat: MSE-only mode — drop QJL, all 3 bits to PolarQuant #23
TheTom Mar 25, 2026
068715c
docs: Change 2 not needed — Q rotation overhead is negligible
TheTom Mar 25, 2026
c551be4
docs: block size is the bottleneck — q4_0 at block 32 = 100% of q8_0
TheTom Mar 25, 2026
fff2092
feat: block size 32 — 77.7 tok/s MoE (91% of q8_0), 17.0 Qwopus (97%) 🎉
TheTom Mar 25, 2026
6a52a43
fix: TURBO_D=128 independent of QK_TURBO3, file turbo4 bugs #29
TheTom Mar 25, 2026
2462ab3
docs: final investigation log — 77.7 tok/s, 91% of q8_0
TheTom Mar 25, 2026
b4270ec
CRITICAL: turbo3 perplexity is 165.6 vs q8_0 6.1 — quality broken #30
TheTom Mar 25, 2026
8755e26
CRITICAL: found TWO root causes for PPL=165 #30
TheTom Mar 25, 2026
b0234bd
docs: bisect confirms block size innocent, rotation access is the bug…
TheTom Mar 25, 2026
ebe0a23
fix: restore inverse rotation in dequant — PPL 6.19 (1.2% of q8_0) #3…
TheTom Mar 25, 2026
091e12a
docs: perplexity 6.194 confirmed — 1.4% of q8_0 #30
TheTom Mar 25, 2026
538e5af
docs: complete quality benchmark summary + lessons learned #30
TheTom Mar 25, 2026
7300255
perf: fp16 WHT dequant + SIMD cooperative dequant — 45% speedup
TheTom Mar 25, 2026
095aa0b
chore: move turboquant docs to turboquant_plus repo
TheTom Mar 25, 2026
557b033
perf: vectorized half4 WHT butterfly — 31% speedup (1074 → 1411 tok/s)
TheTom Mar 25, 2026
cc31405
perf: pre-packed half4 sign arrays — minor speedup (1411 → 1424 tok/s)
TheTom Mar 25, 2026
4de7c1c
perf: graph-side WHT rotation — 2095 tok/s (0.78x q8_0, was 0.53x)
TheTom Mar 25, 2026
e5b7470
perf: block-32 + graph WHT — 2747 tok/s (1.02x q8_0!!!)
TheTom Mar 25, 2026
73f0008
feat: layer-adaptive KV cache — q8_0 quality with 80% turbo3 compression
TheTom Mar 25, 2026
701e085
fix: address Codex review on layer-adaptive — thread safety + underfl…
TheTom Mar 25, 2026
6267548
wip: context scaling fix — skip unnecessary ggml_cont + 32x32 rotatio…
TheTom Mar 26, 2026
af0cb0a
experiment: group-32 rotation FAILED — PPL 7.06 (target 6.19)
TheTom Mar 26, 2026
012faec
feat: add GGML_OP_TURBO_WHT — custom O(d log d) Walsh-Hadamard Transform
TheTom Mar 26, 2026
7173be9
perf: optimized turbo3 dequant — eliminates context scaling regression
TheTom Mar 26, 2026
8fa5bff
ci: quality+speed gate script — PPL + context scaling check before push
TheTom Mar 26, 2026
b696c5d
perf: fp16 centroid LUT — decode +6-14% at long context (#33)
TheTom Mar 26, 2026
e398a4e
perf: float norm broadcast in vec dequant — decode +2-3% over fp16 LUT
TheTom Mar 26, 2026
5e6277b
fix: add turbo3/turbo4 cache types to llama-bench arg parser
TheTom Mar 26, 2026
6c9cfb1
experiment: split 2x4-entry constant LUT for M1 decode fix
TheTom Mar 26, 2026
d96703a
fix: Metal shader comment accuracy per Codex review
TheTom Mar 26, 2026
33b3d8a
cleanup: remove stray diagnostic output files
TheTom Mar 26, 2026
440d324
feat: turbo3 norm correction — PPL 6.211 → 6.176 (free quality win)
TheTom Mar 26, 2026
4bf7ddc
fix: auto-enable flash attention for turbo cache types + fix ggml con…
TheTom Mar 26, 2026
884da3f
experiment: register centroid LUT tested — register spill on Metal
TheTom Mar 26, 2026
3eca09a
fix: turbo4 SET_ROWS corruption, tail-block truncation, constant coup…
seanrasch Mar 27, 2026
a5a0f7b
fix: stack overflow in turbo4 CPU init — 64KB array on worker thread …
seanrasch Mar 27, 2026
06a6b62
experiment: batched byte extraction + explicit bit field pre-extract
TheTom Mar 27, 2026
86a5bbf
experiment: profiling modes for turbo3 decode bottleneck isolation
TheTom Mar 27, 2026
baa6116
experiment: 4-entry magnitude LUT + branchless sign (XOR trick)
TheTom Mar 27, 2026
1c2e558
experiment: force non-vec FA path for turbo3 (nl=2 vs nl=8)
TheTom Mar 27, 2026
c527333
experiment: zero-LUT select chain — 2-level ternary, no constant memory
TheTom Mar 27, 2026
f19c98c
feat: auto-detect hardware, use 4-mag LUT on pre-M5 (+38-45% decode)
TheTom Mar 27, 2026
fbd5ec9
experiment: 2-pair half2 LUT — only 2 constant addresses per lookup
TheTom Mar 27, 2026
7c2d880
experiment: deferred norm multiply (batch float4 * norm at end)
TheTom Mar 27, 2026
35034f1
revert to proven 4-mag + per-element norm (deferred norm was slower)
TheTom Mar 27, 2026
616c7b9
experiment: named-register centroid×norm — 4 constant reads upfront, …
TheTom Mar 27, 2026
73d512c
revert to 4-mag LUT (proven best), document all findings
TheTom Mar 27, 2026
bdcd8ec
experiment: inline block processing — bypass template dequant in FA i…
TheTom Mar 27, 2026
b596a5c
experiment: inline block WORSE on M2 (-10-15%), reverted to 4-mag
TheTom Mar 27, 2026
34f7c39
experiment: FULLY BRANCHLESS FMA decode — zero ternary, zero memory, …
TheTom Mar 27, 2026
9637c1c
final: 12 approaches tested, 4-mag LUT is the hardware limit
TheTom Mar 27, 2026
5755740
experiment: SIMD SHUFFLE magnitude select — cross-lane LUT replacement
TheTom Mar 27, 2026
927e68a
experiment: simd_shuffle 14.7 at 8K — close to 4-mag (15.1) but not b…
TheTom Mar 27, 2026
4b44b2b
experiment: fused block dot — per-centroid Q accumulation, 4 constant…
TheTom Mar 27, 2026
d9ba9bf
experiment: fused block dot 8.1 at 8K — worst result, 64 comparisons …
TheTom Mar 27, 2026
b3cd6a7
experiment: 4-mag helps M5 at 16K (+2.4%) but hurts at 32K (-7.3%)
TheTom Mar 27, 2026
927cfc1
experiment: M5 LUT cost grows to 34% at 32K context
TheTom Mar 27, 2026
26b0bcc
feat: sparse V dequant — +12% decode at 32K, zero quality loss
TheTom Mar 27, 2026
de44bfe
feat: sparse V dequant — +22% decode at 32K on M5, auto-enabled
TheTom Mar 27, 2026
687b184
Revert "Merge pull request #4 from seanrasch/feature/turboquant-kv-ca…
TheTom Mar 27, 2026
3d40ac4
experiment: dedicated turbo4 SET_ROWS kernel + prefill FA kernels
TheTom Mar 28, 2026
cb4d495
experiment: turbo4 2+1 bit packing — +33% decode, drop QJL
TheTom Mar 28, 2026
280c242
experiment: direct-extract turbo4 dequant — matches turbo3 speed
TheTom Mar 28, 2026
a56ed21
experiment: 4-bit half-precision centroid LUT for turbo4 vec path
TheTom Mar 28, 2026
b4b6a30
experiment: fix turbo4 struct for 4-bit — Codex-caught OOB bug
TheTom Mar 28, 2026
c4e98a5
experiment: 8-mag LUT tested, reverted — direct 16-LUT faster on M5
TheTom Mar 28, 2026
7053f4d
experiment: add turbo4_dequant_f16 compute shader (prefill prep)
TheTom Mar 28, 2026
e7ac14d
feat: TURBO4_USE_4BIT ifdef for ABI compatibility
TheTom Mar 28, 2026
f4d9d3e
feat: complete 4-bit C reference for turbo4 — quantize + dequant
TheTom Mar 28, 2026
5e954d4
fix: add turbo WHT rotation to ISWA build_attn — fixes Gemma 2
TheTom Mar 28, 2026
4489686
feat: CUDA port of TurboQuant3 KV cache compression (RTX 5090 / SM 12.0)
signalnine Mar 26, 2026
1127f5b
perf: enable MMA/TILE flash attention for turbo3 — 0.97x q8_0 prefill
signalnine Mar 26, 2026
9a4afd8
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant — +31% decod…
signalnine Mar 27, 2026
a17a63a
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
signalnine Mar 27, 2026
e9ab045
fix: graceful fallback for turbo3 with non-128-aligned head dims (iss…
signalnine Mar 28, 2026
fb2d86d
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue…
signalnine Mar 28, 2026
d86034d
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
signalnine Mar 28, 2026
567fadf
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vic…
signalnine Mar 28, 2026
f0601a7
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
signalnine Mar 28, 2026
25a19f2
feat: GGML_TYPE_TURBO2_0 — 2-bit TurboQuant KV cache (6.4x compression)
signalnine Mar 28, 2026
52178bb
fix: MLA inverse WHT group_size derived from K (not V) — fixes GLM-4.7
signalnine Mar 28, 2026
f3f7c3c
feat: InnerQ per-channel equalization + turbo2 64-group fallback
signalnine Mar 28, 2026
65a2c69
perf: sparse V dequant — skip negligible attention weights in VEC kernel
signalnine Mar 28, 2026
df33248
fix: require head_dim % 128 for turbo KV — fall back to q8_0 otherwise
signalnine Mar 29, 2026
e3b87f8
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
TheTom Mar 29, 2026
fe6749f
feat: asymmetric K/V quant support for Metal flash attention
TheTom Mar 29, 2026
59e2139
feat: asymmetric K/V support + q8_0 × turbo FA kernel instantiations
TheTom Mar 29, 2026
d158db5
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fall…
signalnine Mar 29, 2026
0836309
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37→192 t/s)
signalnine Mar 29, 2026
1d5f5c7
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
signalnine Mar 29, 2026
821f843
feat: CUDA port of turbo4 (4-bit, 3.8x compression) — fixes issue #25…
signalnine Mar 29, 2026
de03f39
fix: turbo4 on GLM-4.7 — context init check accounts for zero-padding…
signalnine Mar 29, 2026
a3e97d8
feat: Boundary V (experimental) — layer-aware V compression
TheTom Mar 29, 2026
ee82c55
fix: KV state serialization uses padded tensor width (issue #28 follo…
signalnine Mar 29, 2026
a30a59e
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative ke…
Tuklus Mar 29, 2026
b9df4c3
Increase turbo3/turbo2 block size from 32 to 128
TheTom Mar 30, 2026
f028c66
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
Mar 30, 2026
a6ff538
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
TheTom Mar 31, 2026
d9552b9
fix: add missing TurboQuant FA template instances for HIP/ROCm build
terrysimons Mar 31, 2026
d11b426
Remove unused CENTROIDS_1BIT constant
TheTom Mar 31, 2026
3a2fad1
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
TheTom Apr 1, 2026
7b5094b
fix: add post-unrotate memory barrier for in-layer mixing safety
TheTom Apr 2, 2026
c452be6
fix: disable upstream attn rotation by default (conflicts with TurboQ…
TheTom Apr 2, 2026
2ebd519
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
TheTom Apr 2, 2026
bf9bf31
fix: TQ4_1S CUDA — mmvq exclusion for fused path + quantize tool regi…
signalnine Apr 2, 2026
8c2e0d8
perf: fused TQ4_1S/TQ3_1S mul_mat_vec — 3.4x decode speedup
signalnine Apr 2, 2026
ec6b8d4
fix: TQ4_1S on MoE models — disable CUDA graphs for TQ MUL_MAT_ID
signalnine Apr 2, 2026
76ebd26
perf: V12 single-phase fused TQ mmvq — shmem activation, no global sc…
TheTom Apr 2, 2026
adda3bc
fix: Windows MSVC build compatibility for TQ weight types
TheTom Apr 2, 2026
694ed03
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
TheTom Apr 2, 2026
2962592
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 sup…
TheTom Apr 2, 2026
b1a6f79
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
TheTom Apr 2, 2026
fe2ead9
Fix turbo4 C reference WHT dequant mismatch (#43)
TheTom Apr 2, 2026
e3ce079
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
TheTom Apr 2, 2026
71c7a4c
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
TheTom Apr 3, 2026
c29fab6
Enhance Metal operations for TQ weights and concurrency handling for …
iamwavecut Apr 4, 2026
5bad823
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S…
iamwavecut Apr 4, 2026
e5ac94d
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
TheTom Apr 5, 2026
753f199
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
TheTom Apr 6, 2026
6571604
fix: cap map0 kernel shmem for 256-expert MoE models
TheTom Apr 7, 2026
51481c3
perf: TQ4_1S native kernel 3.5× faster — 240 t/s (was 68), smaller VR…
signalnine Apr 6, 2026
cc1bae2
perf: warp-cooperative TQ4_1S dequant (16× less compute per block)
signalnine Apr 6, 2026
941d456
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
signalnine Apr 6, 2026
579db29
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
signalnine Apr 6, 2026
0bf1eef
fix: AMD/RDNA4 arch dispatch — scalar half path for TQ4_1S on AMD GPUs
TheTom Apr 7, 2026
a494833
feat: Vulkan compute shader support for turbo3 KV cache
Tuklus Mar 30, 2026
e596f47
metal: add TurboFlash attention kernel for turbo3 KV cache decode
TheTom Apr 8, 2026
3b5e148
docs: add AMD Instinct MI300X (gfx942) ROCm test results
andyluo7 Apr 7, 2026
1df4783
feat: add CDNA4 (gfx950/MI355X) support + test results
andyluo7 Apr 7, 2026
ff8bb73
vulkan: fix and complete turbo3 KV cache support
Titaniumtown Apr 9, 2026
88fcb67
vulkan: add turbo3 backend tests
Titaniumtown Apr 9, 2026
a4736ff
Add GitHub Sponsors funding link
TheTom Apr 14, 2026
e53f802
Fix memory explosion on Apple Silicon
huwprosser Apr 13, 2026
6775542
Fix GGML_OP_COUNT assertion for RPC
cpburnz Apr 10, 2026
0009301
ci: fix turbo build and test failures
Tuklus Apr 10, 2026
1073622
fix: add TURBO2_0 to flash_attn auto-enable check
TheTom Apr 16, 2026
59798f1
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QU…
TheTom Apr 17, 2026
0198d58
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Tuklus Apr 17, 2026
8993d4f
fix: force VEC FA path for quantized KV on HIP/ROCm
TheTom Apr 18, 2026
0757ff4
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
TheTom Apr 18, 2026
7ca13d2
Merge pull request #90 from TheTom/fix/hip-force-vec-quantized-kv
TheTom Apr 18, 2026
627ebbc
Merge pull request #87 from apollosenvy/pr/vulkan-turbo3-april-fix
TheTom Apr 18, 2026
6112eb4
fix: gate turbo V unpad on V type, not K type
TheTom Apr 20, 2026
a1bcb34
fix(metal): disable TurboFlash by default — corrupt output on Apple10
TheTom Apr 20, 2026
d3271ac
fix: gate turbo V unpad on V type + disable TurboFlash on Apple10 (#91)
TheTom Apr 20, 2026
0b05974
hip: bypass memory pool for flash attention f16 temp buffers
TheTom Apr 18, 2026
2 changes: 1 addition & 1 deletion .devops/vulkan.Dockerfile
@@ -7,7 +7,7 @@ RUN apt update && apt install -y git build-essential cmake wget xz-utils

# Install SSL and Vulkan SDK dependencies
RUN apt install -y libssl-dev curl \
- libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libvulkan-dev glslc
+ libxcb-xinput0 libxcb-xinerama0 libxcb-cursor-dev libvulkan-dev glslc spirv-headers

# Build it
WORKDIR /app
1 change: 1 addition & 0 deletions .github/FUNDING.yml
@@ -0,0 +1 @@
github: [TheTom]
8 changes: 8 additions & 0 deletions .github/labeler.yml
@@ -73,10 +73,18 @@ android:
- changed-files:
- any-glob-to-any-file:
- examples/llama.android/**
server/webui:
- changed-files:
- any-glob-to-any-file:
- tools/server/webui/**
- tools/server/public/**
server:
- changed-files:
- any-glob-to-any-file:
- tools/server/**



ggml:
- changed-files:
- any-glob-to-any-file:
38 changes: 14 additions & 24 deletions .github/workflows/build-riscv.yml
@@ -35,7 +35,7 @@ env:

jobs:
ubuntu-riscv64-native-sanitizer:
- runs-on: RISCV64
+ runs-on: ubuntu-24.04-riscv

continue-on-error: true

@@ -50,17 +50,18 @@ jobs:
sudo apt-get update

# Install necessary packages
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 rustup cmake build-essential wget ccache git-lfs
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 cmake build-essential wget git-lfs

# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
sudo ln -sf /usr/bin/gcc-14 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-14 /usr/bin/g++

# Install Rust stable version
rustup install stable
rustup default stable
if ! which rustc; then
# Install Rust stable version
sudo apt-get install -y rustup
rustup install stable
rustup default stable
fi

git lfs install

@@ -73,23 +74,12 @@ jobs:
id: checkout
uses: actions/checkout@v6

- name: Setup ccache
run: |
# Unique cache directory per matrix combination
export CCACHE_DIR="$HOME/.ccache/sanitizer-${{ matrix.sanitizer }}-${{ matrix.build_type }}"
mkdir -p "$CCACHE_DIR"

# Configure ccache
ccache --set-config=max_size=5G
ccache --set-config=compression=true
ccache --set-config=compression_level=6
ccache --set-config=cache_dir="$CCACHE_DIR"
ccache --set-config=sloppiness=file_macro,time_macros,include_file_mtime,include_file_ctime
ccache --set-config=hash_dir=false

# Export for subsequent steps
echo "CCACHE_DIR=$CCACHE_DIR" >> $GITHUB_ENV
echo "PATH=/usr/lib/ccache:$PATH" >> $GITHUB_ENV
# FIXME: Enable when ggml-org/ccache-action works on riscv64
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-riscv64-native-sanitizer-${{ matrix.sanytizer }}-${{ matrix.build_type }}
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

- name: Build
id: cmake_build
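
The guarded Rust install in the hunks above (`if ! which rustc; then ... fi`) follows an "install only if missing" pattern. A minimal sketch of that pattern, assuming a POSIX shell. `ensure_tool` is a hypothetical helper that is not part of this PR, and `command -v` is swapped in for `which` as the portable lookup:

```shell
#!/bin/sh
# Sketch only: ensure_tool is a hypothetical helper, not code from this PR.
# It runs the given install command only when the tool is not on PATH.
ensure_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool found, skipping install"
    else
        echo "$tool missing, running: $*"
        "$@"
    fi
}

# 'sh' is present on any POSIX system, so the install command is never run here.
ensure_tool sh false
```

In the workflow, the equivalent call would wrap the `apt-get install -y rustup` and `rustup` commands, so runners that already ship a Rust toolchain skip them.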
108 changes: 53 additions & 55 deletions .github/workflows/build-self-hosted.yml
@@ -141,61 +141,59 @@ jobs:
# amd-smi static
# GG_BUILD_ROCM=1 GG_BUILD_AMDGPU_TARGETS="gfx1101" bash ./ci/run.sh ~/results/llama.cpp /mnt/llama.cpp

- # TODO: sandbox Mac runners
- # ggml-ci-mac-metal:
- # runs-on: [self-hosted, macOS, ARM64]
- #
- # steps:
- # - name: Clone
- # id: checkout
- # uses: actions/checkout@v6
- #
- # - name: Test
- # id: ggml-ci
- # run: |
- # GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
- #
- # ggml-ci-mac-webgpu:
- # runs-on: [self-hosted, macOS, ARM64]
- #
- # steps:
- # - name: Clone
- # id: checkout
- # uses: actions/checkout@v6
- #
- # - name: Dawn Dependency
- # id: dawn-depends
- # run: |
- # DAWN_VERSION="v2.0.0"
- # DAWN_OWNER="reeselevine"
- # DAWN_REPO="dawn"
- # DAWN_ASSET_NAME="Dawn-5e9a4865b1635796ccc77dd30057f2b4002a1355-macos-latest-Release"
- # echo "Fetching release asset from https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
- # curl -L -o artifact.zip \
- # "https://github.com/${DAWN_OWNER}/${DAWN_REPO}/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.zip"
- # mkdir dawn
- # unzip artifact.zip
- # tar -xvf ${DAWN_ASSET_NAME}.tar.gz -C dawn --strip-components=1
- #
- # - name: Test
- # id: ggml-ci
- # run: |
- # GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
- # bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
- #
- # ggml-ci-mac-vulkan:
- # runs-on: [self-hosted, macOS, ARM64]
- #
- # steps:
- # - name: Clone
- # id: checkout
- # uses: actions/checkout@v6
- #
- # - name: Test
- # id: ggml-ci
- # run: |
- # vulkaninfo --summary
- # GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+ ggml-ci-mac-metal:
+ runs-on: [self-hosted, macOS, ARM64]
+
+ steps:
+ - name: Clone
+ id: checkout
+ uses: actions/checkout@v6
+
+ - name: Test
+ id: ggml-ci
+ run: |
+ GG_BUILD_METAL=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+
+ ggml-ci-mac-webgpu:
+ runs-on: [self-hosted, macOS, ARM64]
+
+ steps:
+ - name: Clone
+ id: checkout
+ uses: actions/checkout@v6
+
+ - name: Dawn Dependency
+ id: dawn-depends
+ run: |
+ DAWN_VERSION="v20260317.182325"
+ DAWN_OWNER="google"
+ DAWN_REPO="dawn"
+ DAWN_ASSET_NAME="Dawn-18eb229ef5f707c1464cc581252e7603c73a3ef0-macos-latest-Release"
+ echo "Fetching release asset from https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
+ curl -L -o artifact.tar.gz \
+ "https://github.com/google/dawn/releases/download/${DAWN_VERSION}/${DAWN_ASSET_NAME}.tar.gz"
+ mkdir dawn
+ tar -xvf artifact.tar.gz -C dawn --strip-components=1
+
+ - name: Test
+ id: ggml-ci
+ run: |
+ GG_BUILD_WEBGPU=1 GG_BUILD_WEBGPU_DAWN_PREFIX="$GITHUB_WORKSPACE/dawn" \
+ bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp
+
+ ggml-ci-mac-vulkan:
+ runs-on: [self-hosted, macOS, ARM64]
+
+ steps:
+ - name: Clone
+ id: checkout
+ uses: actions/checkout@v6
+
+ - name: Test
+ id: ggml-ci
+ run: |
+ vulkaninfo --summary
+ GG_BUILD_VULKAN=1 bash ./ci/run.sh ~/results/llama.cpp ~/mnt/llama.cpp

ggml-ci-linux-intel-vulkan:
runs-on: [self-hosted, Linux, Intel]
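
The re-enabled Dawn step above now fetches a `.tar.gz` release asset and unpacks it with `tar ... -C dawn --strip-components=1`, which drops the archive's top-level directory so the contents land directly under `dawn/`. A minimal offline sketch of that extraction behaviour, assuming GNU tar or bsdtar. The archive here is a local stand-in, and `extract_flat` and the `Dawn-example-Release` directory name are hypothetical:

```shell
#!/bin/sh
set -eu

# Sketch only: extract_flat is a hypothetical helper mirroring the
# "mkdir dawn; tar -xvf artifact.tar.gz -C dawn --strip-components=1" step.
extract_flat() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest"
    # --strip-components=1 removes the leading path element of every entry
    tar -xf "$archive" -C "$dest" --strip-components=1
}

# Build a stand-in archive with a single top-level directory (no network needed).
work=$(mktemp -d)
mkdir -p "$work/src/Dawn-example-Release/lib"
echo "stub" > "$work/src/Dawn-example-Release/lib/libdawn.txt"
tar -czf "$work/artifact.tar.gz" -C "$work/src" Dawn-example-Release

extract_flat "$work/artifact.tar.gz" "$work/dawn"
ls "$work/dawn"
```

After extraction, `GG_BUILD_WEBGPU_DAWN_PREFIX` can point at the destination directory itself rather than at a versioned subdirectory, which is how the workflow uses `$GITHUB_WORKSPACE/dawn`.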
5 changes: 3 additions & 2 deletions .github/workflows/build-vulkan.yml
@@ -72,7 +72,7 @@ jobs:

- name: Setup Vulkan SDK
if: steps.cache-sdk.outputs.cache-hit != 'true'
- uses: ./.github/actions/linux-setup-vulkan-llvmpipe
+ uses: ./.github/actions/linux-setup-vulkan
with:
path: ./vulkan_sdk
version: ${{ env.VULKAN_SDK_VERSION }}
@@ -93,4 +93,5 @@ jobs:
export GGML_VK_DISABLE_F16=1
export GGML_VK_DISABLE_COOPMAT=1
# This is using llvmpipe and runs slower than other backends
- ctest -L main --verbose --timeout 4800
+ # test-backend-ops is too slow on llvmpipe, skip it
+ ctest -L main -E test-backend-ops --verbose --timeout 900
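
The new `ctest` invocation combines `-L main` (select by label) with `-E test-backend-ops` (exclude by name). The `-E` argument is a regular expression matched against test names, so any test whose name contains the pattern would also be dropped. ctest is not invoked in the sketch below; it approximates the name filtering with `grep -Ev` so it runs without a build tree. `filter_tests` is a hypothetical helper and the test names other than `test-backend-ops` are illustrative only:

```shell
#!/bin/sh
# Sketch only: filter_tests approximates ctest's -E name exclusion with grep.
# It is not how ctest is implemented and is not part of this PR.
filter_tests() {
    exclude_regex="$1"
    # -E: extended regex, -v: keep only the lines that do NOT match
    grep -Ev -- "$exclude_regex"
}

# Illustrative test names; only test-backend-ops comes from the diff above.
printf '%s\n' test-example-a test-backend-ops test-example-b \
    | filter_tests 'test-backend-ops'
```

On a configured build directory, adding `-N` to the real `ctest` command lists the selected tests without running them, which is a quick way to confirm what the exclusion removes.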
49 changes: 19 additions & 30 deletions .github/workflows/build.yml
@@ -318,7 +318,7 @@ jobs:
id: depends
run: |
sudo apt-get update
- sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev libssl-dev ninja-build
+ sudo apt-get install -y gcc-14 g++-14 build-essential glslc libvulkan-dev spirv-headers libssl-dev ninja-build
echo "CC=gcc-14" >> "$GITHUB_ENV"
echo "CXX=g++-14" >> "$GITHUB_ENV"

@@ -996,32 +996,29 @@ jobs:
cmake --build build -j ${env:NUMBER_OF_PROCESSORS}

ubuntu-cpu-riscv64-native:
runs-on: RISCV64
runs-on: ubuntu-24.04-riscv

steps:
- name: Install dependencies
run: |
sudo apt-get update

# Install necessary packages
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 rustup cmake build-essential libssl-dev wget ccache git-lfs
sudo apt-get install -y libatomic1 libtsan2 gcc-14 g++-14 cmake build-essential libssl-dev wget git-lfs

# Set gcc-14 and g++-14 as the default compilers
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
sudo ln -sf /usr/bin/gcc-14 /usr/bin/gcc
sudo ln -sf /usr/bin/g++-14 /usr/bin/g++

# Install Rust stable version
rustup install stable
rustup default stable
if ! which rustc; then
# Install Rust stable version
sudo apt-get install -y rustup
rustup install stable
rustup default stable
fi

git lfs install

- name: Clone
id: checkout
uses: actions/checkout@v6

- name: Check environment
run: |
uname -a
@@ -1031,25 +1028,17 @@ jobs:
cmake --version
rustc --version

- name: Setup ccache
run: |
# Set unique cache directory for this job
export CCACHE_DIR="$HOME/.ccache/cpu-cmake-rv64-native"
mkdir -p "$CCACHE_DIR"

# Configure ccache for optimal performance
ccache --set-config=max_size=5G
ccache --set-config=compression=true
ccache --set-config=compression_level=6
ccache --set-config=cache_dir="$CCACHE_DIR"

# Enable more aggressive caching
ccache --set-config=sloppiness=file_macro,time_macros,include_file_mtime,include_file_ctime
ccache --set-config=hash_dir=false
- name: Clone
id: checkout
uses: actions/checkout@v6

# Export for subsequent steps
echo "CCACHE_DIR=$CCACHE_DIR" >> $GITHUB_ENV
echo "PATH=/usr/lib/ccache:$PATH" >> $GITHUB_ENV
# FIXME: Enable when ggml-org/ccache-action works on riscv64
# - name: ccache
# uses: ggml-org/ccache-action@v1.2.21
# with:
# key: ubuntu-cpu-riscv64-native
# evict-old-files: 1d
# save: ${{ github.event_name == 'push' && github.ref == 'refs/heads/master' }}

- name: Build
id: cmake_build
2 changes: 1 addition & 1 deletion .github/workflows/close-issue.yml
@@ -17,7 +17,7 @@ jobs:
steps:
- uses: actions/stale@v10
with:
- exempt-issue-labels: "refactoring,help wanted,good first issue,research 🔬,bug,roadmap"
+ exempt-issue-labels: "refactoring,help wanted,good first issue,research 🔬,bug,roadmap,security"
days-before-issue-stale: 30
days-before-issue-close: 14
stale-issue-label: "stale"
4 changes: 2 additions & 2 deletions .github/workflows/docker.yml
@@ -73,8 +73,8 @@ jobs:
{ "tag": "cpu", "dockerfile": ".devops/cpu.Dockerfile", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": false, "runs_on": "ubuntu-24.04" },
{ "tag": "cpu", "dockerfile": ".devops/cpu.Dockerfile", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": false, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "cpu", "dockerfile": ".devops/s390x.Dockerfile", "platforms": "linux/s390x", "full": true, "light": true, "server": true, "free_disk_space": false, "runs_on": "ubuntu-24.04-s390x" },
- { "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.9.1", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
- { "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.9.1", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04-arm" },
+ { "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.8.1", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
+ { "tag": "cuda cuda12", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "12.8.1", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "cuda13", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "13.1.1", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },
{ "tag": "cuda13", "dockerfile": ".devops/cuda.Dockerfile", "cuda_version": "13.1.1", "platforms": "linux/arm64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04-arm" },
{ "tag": "musa", "dockerfile": ".devops/musa.Dockerfile", "platforms": "linux/amd64", "full": true, "light": true, "server": true, "free_disk_space": true, "runs_on": "ubuntu-24.04" },