Catch-Up to Current Master#3

Merged
mfielding92 merged 50 commits into mfielding92:master from ggml-org:master on May 30, 2026
@mfielding92
Owner

max-krasnyansky and others added 30 commits May 28, 2026 04:49
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner-case fp NaN issues that were causing bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen; HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
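
For readers unfamiliar with the "fastdiv" mentioned in the hmx-mm commits above, here is a minimal sketch of the general multiply-and-shift technique for dividing by a fixed divisor. It is not the Hexagon code from this commit: `fastdiv_vals`, `init_fastdiv` and `fastdiv` are hypothetical names, and the sketch only shows why precomputing the constants once (as the "precomputing fastdiv" bullet suggests) can pay off inside a dequant loop.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing by a fixed 32-bit divisor d (d >= 1).
// Hypothetical names; this is not the Hexagon backend's code.
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

// Compute the constants once, outside the hot loop.
inline fastdiv_vals init_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && (uint64_t(1) << l) < d) {
        ++l;
    }
    // mp = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits.
    const uint64_t mp = ((uint64_t(1) << 32) * ((uint64_t(1) << l) - d)) / d + 1;
    return { uint32_t(mp), l };
}

// n / d using one multiply, one add and one shift.
// The add is done in 64 bits so it cannot overflow for any 32-bit n.
inline uint32_t fastdiv(uint32_t n, fastdiv_vals v) {
    const uint64_t hi = (uint64_t(n) * v.mp) >> 32;
    return uint32_t((hi + n) >> v.l);
}
```

In a kernel, `init_fastdiv` would typically run on the host or once per op, and only `fastdiv` would appear in the inner loop.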
* misc(server): add default port to impl RAII

* misc(server): register_gcp_compat() can be const

* misc(server): use proper cpp const/auto methods

* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
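
A minimal sketch of the `make_unique` pattern named in the last bullet, assuming a hypothetical `server_impl` whose constructor can throw. The types and the port check are illustrative only and are not taken from the server code. With `make_unique`, no raw owning pointer is ever visible, and the old object is replaced only after the new one has been fully constructed.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a server implementation object.
struct server_impl {
    std::string host;
    int         port;

    server_impl(std::string h, int p) : host(std::move(h)), port(p) {
        if (p <= 0 || p > 65535) {
            throw std::invalid_argument("port out of range");
        }
    }
};

struct server {
    std::unique_ptr<server_impl> impl;

    // Previously (illustrative): impl.reset(new server_impl(host, port));
    // Now: construct via make_unique, then move-assign into impl.
    void start(const std::string & host, int port) {
        impl = std::make_unique<server_impl>(host, port);
    }
};

// Usage: a failed restart leaves the previous impl untouched.
inline void example_make_unique() {
    server s;
    s.start("127.0.0.1", 8080);
    try {
        s.start("127.0.0.1", 0); // constructor throws before any assignment
    } catch (const std::invalid_argument &) {
    }
    assert(s.impl && s.impl->port == 8080);
}
```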
…3227)

* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8)
to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled
GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs
substantially by quant family because the per-row GEMV cost is dominated
by dequantisation, not the dot-product itself: K-quants pay a heavier
super-block decode and so MMQ wins sooner; legacy and IQ quants have
lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool,
mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant
thresholds on amd_mfma_available(cc):

  Q3_K, Q4_K, Q5_K  : MMVQ <= 3   (MMQ wins from batch=4: +5% .. +76%)
  Q2_K, Q6_K        : MMVQ <= 5   (MMQ wins from batch=6: +8% .. +35%)
  others            : MMVQ <= 8   (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical
to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold
for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct,
llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps.
Full table in PR description.

  Selected pp512 throughput (tok/s, ub=8):
    Q4_K_S:  559 -> 940  (+68%)
    Q5_K_S:  503 -> 884  (+76%)
    Q3_K_S:  629 -> 879  (+40%)
    Q2_K  :  615 -> 809  (+32%)
    Q6_K  :  582 -> 776  (+33%)

  Selected pp512 throughput (tok/s, ub=4):
    Q4_K_S:  444 -> 480  (+ 8%)
    Q4_0  :  682 -> 685  (+ 0%)   (no regression - retains MMVQ)
    IQ4_XS:  706 -> 698  (- 1%)   (no regression - retains MMVQ)

* CUDA: address review — inline MMVQ batch table, drop env hatch & doc block

* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
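
The per-quant threshold table from the first commit message above can be sketched as plain C++. This is only an illustration of that table: `quant_type`, `mmvq_max_batch` and `should_use_mmvq` are hypothetical stand-ins, the `amd_mfma` flag replaces the real `amd_mfma_available(cc)` check, and the follow-up commits (inlined table, CDNA1 tuning) may have changed the final logic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the ggml quant type enum (subset only).
enum class quant_type { Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q4_0, IQ4_XS };

// Global threshold named in the commit message.
constexpr int MMVQ_MAX_BATCH_SIZE = 8;

// Largest batch size for which the per-row GEMV path (MMVQ) is kept.
inline int mmvq_max_batch(quant_type type, bool amd_mfma) {
    if (!amd_mfma) {
        return MMVQ_MAX_BATCH_SIZE; // non-MFMA paths keep the global threshold
    }
    switch (type) {
        case quant_type::Q3_K:
        case quant_type::Q4_K:
        case quant_type::Q5_K:
            return 3; // MMQ wins from batch=4
        case quant_type::Q2_K:
        case quant_type::Q6_K:
            return 5; // MMQ wins from batch=6
        default:
            return MMVQ_MAX_BATCH_SIZE; // legacy and IQ quants unchanged
    }
}

// ne11 is the batch dimension of the matmul.
inline bool should_use_mmvq(quant_type type, bool amd_mfma, int64_t ne11) {
    return ne11 <= mmvq_max_batch(type, amd_mfma);
}

// Usage: Q4_K switches to MMQ at batch 4 on MFMA hardware, Q4_0 does not.
inline void example_mmvq() {
    assert(!should_use_mmvq(quant_type::Q4_K, true, 4));
    assert( should_use_mmvq(quant_type::Q4_0, true, 4));
}
```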
…#23729)

* mmvq Optim:  add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING

* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ci : disable all CPU variant builds for Vulkan workflow

* cont : change cache key

* cont : change build type
* mtmd: fix gemma 4 audio rms norm eps

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
removed AI-generated comment
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ci : releases use Github-hosted builds for the UI

* cont : fix name
When model props are fetched asynchronously from the server,
modelPropsVersion is incremented to trigger reactivity, but
only the vision effect was listening to it.
* run ui publish on self-hosted fast

* run on ubuntu-slim
* opencl: move backend info print into its own function

* opencl: move new log line

* opencl: fix for non adreno path
* mtmd-debug: add color and rainbow mode

* fix M_PI

* max_dist
)

Updating infra to enable op fusion and using RMS_NORM+MUL as the use-case.
…23480)

Without this, at least the Vulkan backend will skip the `* 0` for
!COMPUTE tensors, causing corrupt output.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* llama: add llm_graph_input_mtp

* rename input_mtp -> input_token_embd

* add TODO about mtmd embedding

* cont : clean-up

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
[no release]

Signed-off-by: Omid Azizi <oazizi@gimletlabs.ai>
* llama: use f16 mask for FA

* review: add llama_cast + formatting

* simplify
…se Attention (DSA) implementation (#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)

* convert : handle DeepseekV32ForCausalLM architecture

* ggml : support for f16 GGML_OP_FILL

* memory : separate hparams argument in llama_kv_cache constructor

* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)

* llama : support for LLM_ARCH_DEEPSEEK32

* model : llama_model_deepseek32 implementation

* model : merge two scale operations into one in DSA lightning indexer implementation

* chore : remove unused code

* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* server: bump timeout to 3600s

* nits: change wording
* CUDA: Check PTX version on host side to guard PDL dispatch

Checking `__CUDA_ARCH_LIST__` alone is insufficient for JIT, as this
variable doesn't differentiate between compiling for, say, sm_90, sm_90a
or sm_90f (i.e. forward-jittable PTX vs. arch/family-specific PTX).

Thus, one can hit a bug when compiling with
`-DCMAKE_CUDA_ARCHITECTURES="89;90a"`, where the current code would wrongly
dispatch to PDL on sm_90/sm_120 in forward-JIT mode.

This PR fixes this issue by checking `cudaFuncAttributes::ptxVersion` of
the incoming kernel at runtime. A check on ptxVersion alone is
sufficient, as device-codes will always be >= ptxVersion (and any
violation of this would be a severe bug in CUDA/nvcc), see:
 https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-code-code-code

* Implement MurmurHash3 mixer for better hash distribution

Magic constants were taken from boost:
https://github.com/boostorg/container_hash/blob/2698b43803c012601e6bb1a6116e83767b97986c/include/boost/container_hash/detail/hash_mix.hpp#L19-L65

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments, make seed non-zero

* Apply code-formatting

* Replace std::size_t -> size_t for consistency

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
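
For reference, a minimal sketch of the standard MurmurHash3 64-bit finalizer ("fmix64"). The constants actually used in this commit were taken from the linked boost header and may differ from the ones below. `hash_combine` is a hypothetical helper added only for illustration. Note that the mixer maps 0 to 0, which is one plausible reason for the "make seed non-zero" change above.

```cpp
#include <cassert>
#include <cstdint>

// MurmurHash3's 64-bit finalizer (fmix64): xor-shifts interleaved with
// multiplications by odd constants, so every step is invertible.
inline uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Hypothetical combine step: fold a value into a running seed, then mix.
// The added constant is the 64-bit golden-ratio value; it keeps an all-zero
// input from collapsing to the mixer's fixed point at 0.
inline uint64_t hash_combine(uint64_t seed, uint64_t value) {
    return fmix64(seed + 0x9e3779b97f4a7c15ULL + value);
}

// Usage: the bare mixer has a fixed point at zero, the combine step does not.
inline void example_fmix64() {
    assert(fmix64(0) == 0);
    assert(hash_combine(0, 0) != 0);
}
```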
* mtmd: DeepSeek-OCR 2 support, with multi-tile dynamic resolution

* introduced clip_image_f32::add_viewsep

* address PR review

- drop redundant ggml_cpy ops in both deepseekocr versions build
- drop no-op ggml_cont in build_sam
- assert num_image_tokens deepseekocr2
- view_seperator as (1, n_embd) at conversion (for both versions)
- drop redundant ggml_reshape_2d

* Update tools/mtmd/models/deepseekocr2.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Xuan-Son Nguyen and others added 20 commits May 29, 2026 16:30
* download: add option to skip_download

* fix

* fix 2

* if file doesn't exist, respect skip_download flag
* vocab: Support tokenizer for LFM2.5-8B-A1B

* Keep liquid6 tokenizer in models
Firefox on Linux uses this MIME type
* wip: llama update POC

* cleaning: llama update

* llama-gen-docs

* app: delegate llama update to the install script

* app: spawn the installer detached so llama update can replace a running binary

* cleaning: inline llama update into llama.cpp, drop app-update.{cpp,h}

* app: make llama_update static

Address review from @angt
…23869)

* spec: add speed-bench support for benchmarking

* speed-bench : add trailing newline to requirements.txt

* speed-bench : bump datasets to 4.8.0 to fix ty check

* server-bench : remove now-unused type: ignore after datasets bump
* Add q8_0 and q4_0 set_rows

* Add fast(er) quantization set_rows path

* formatting/naming

* a little more naming

* Remove unused constant

* Don't override other override

* Avoid bitcast

* Narrow relaxation
* server: in SSE mode, send HTTP headers when slot starts

* ref to pr

* stream should be false by default
After #23007 reclassified integrated CUDA/HIP devices as IGPU, the device
selection logic dropped the local iGPU whenever any RPC server was added,
because RPC devices made `model->devices` non-empty. On systems where the
"iGPU" is the main compute device (e.g. Strix Halo with 128 GiB of unified
memory), this caused all tensors to be allocated on the RPC peer alone and
model loading to fail.

Gate the iGPU inclusion on `gpus.empty()` instead, so RPC peers no longer
suppress the local iGPU.

closes: #23858
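
The gating change described above can be sketched as follows. This is a simplified model, not the actual llama.cpp code: `dev_type`, `device` and `select_devices` are hypothetical, and the ordering of RPC peers before local GPUs is an assumption made for the illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified device model.
enum class dev_type { GPU, IGPU, RPC };

struct device {
    std::string name;
    dev_type    type;
};

inline std::vector<device> select_devices(const std::vector<device> & available) {
    std::vector<device> rpc, gpus, igpus;
    for (const auto & d : available) {
        switch (d.type) {
            case dev_type::RPC:  rpc.push_back(d);   break;
            case dev_type::GPU:  gpus.push_back(d);  break;
            case dev_type::IGPU: igpus.push_back(d); break;
        }
    }

    std::vector<device> selected;
    selected.insert(selected.end(), rpc.begin(), rpc.end());
    selected.insert(selected.end(), gpus.begin(), gpus.end());

    // Old gate (illustrative): if (selected.empty()), which any RPC peer
    // made false. New gate: only discrete GPUs suppress the local iGPU.
    if (gpus.empty()) {
        selected.insert(selected.end(), igpus.begin(), igpus.end());
    }
    return selected;
}

// Usage: an RPC peer no longer hides the local iGPU.
inline void example_select_devices() {
    const std::vector<device> avail = {{"rpc0", dev_type::RPC}, {"igpu0", dev_type::IGPU}};
    assert(select_devices(avail).size() == 2);
}
```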
* ci : ios use macos-15 again

* ci : add and test ccache-clear

* cont : fix

* cont : set permission

* cont : another permission

* cont : token

* cont : print key

* cont : bring back perms

* cont : test windows

* cont : add token

* cont : cleanup

* ci : make release jobs clean-up their ccache
* ci : fix s390x release job

* ci : multi-thread build for `ios-xcode`

* ocd : names
* vulkan: add flash attention bf16 kv support

* vulkan: bf16 FA coopmat1 support

* vulkan: bf16 FA coopmat2 support

* fix FA bf16 f32 fallback

* fix FA bf16 coopmat1 shader

* fix FA bf16 coopmat2 shader

* code cleanup

* cleanup comment change

* address feedback

* add O_TYPE for cm2 FA

* use O_TYPE for gqaStore function

* reduce BFLOAT16 ifdefs
* loongarch : optimize LSX fp16 load/store with native intrinsics

Use __lsx_vfcvtl_s_h and __lsx_vfcvt_h_s instead of scalar loops in
__lsx_f16x4_load and __lsx_f16x4_store.

* loongarch : add LSX implementation for q8_0 dot product

* loongarch : add LSX implementation for q6_K dot product

* loongarch : add LSX implementation for iq4_xs dot product

* Improve reduce ops when summing int16 pairs to int32
* ci : disable libcommon build from xcframework

* ocd : fix name

* ci : ios-xcode change to macos-26

* cont : pin xcode

* cont : pin xcode to minor version
* TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs

* fix afmoe TP
@mfielding92 mfielding92 merged commit d6f7a66 into mfielding92:master May 30, 2026
4 of 59 checks passed