Add MiniCPM5 tokenizer support by zhangtao2-1 · Pull Request #23384 · ggml-org/llama.cpp

zhangtao2-1 · 2026-05-20T06:39:50Z

Adds MiniCPM5 support for HF → GGUF conversion and inference.

- Detect MiniCPM5 in LlamaModel and use the correct Llama3-style BPE + ByteLevel vocab path
- Register the minicpm5 BPE pre-tokenizer fingerprint
- Extract HF pre-tokenizer regexes into GGUF (tokenizer.ggml.pre_tokenizer_regex) and apply them at runtime via LLAMA_VOCAB_PRE_TYPE_CUSTOM_REGEX
- Scoped to MiniCPM5 only; other Llama-family models are unchanged.

ggml-gh-bot · 2026-05-20T06:44:22Z

Hi @zhangtao2-1, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

CISC · 2026-05-20T08:51:17Z

Since you're new here and MiniCPM 5 does not exist yet, cc/ @tc-mb

We do not want to store the original regex in the GGUF; it should merely be accounted for when creating the chkhsh (which I believe it still is), and we use that hash to identify and recreate the correct tokenization internally.

Do not add the chkhsh manually; it is generated automatically when you add the repo to convert_hf_to_gguf_update.py and run it.
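
For readers unfamiliar with the chkhsh mechanism, here is a minimal sketch of the idea. It assumes the hash is a SHA-256 digest of the stringified token ID list produced for a fixed check text. The two toy pre-tokenizers, the toy `encode` function, `CHKTXT`, and the `KNOWN` table are all hypothetical stand-ins, not code from convert_hf_to_gguf_update.py.

```python
from hashlib import sha256
from typing import Callable, List, Optional

# Hypothetical stand-ins for two pre-tokenizers that split text differently.
def pretok_whitespace(text: str) -> List[str]:
    return text.split()

def pretok_chars(text: str) -> List[str]:
    return [c for c in text if not c.isspace()]

def encode(pretok: Callable[[str], List[str]], text: str) -> List[int]:
    # Toy "vocabulary": derive a stable integer id from each piece's bytes.
    return [sum(piece.encode("utf-8")) for piece in pretok(text)]

def chkhsh(pretok: Callable[[str], List[str]], chktxt: str) -> str:
    # Hash the token id list produced for a fixed check text.
    return sha256(str(encode(pretok, chktxt)).encode()).hexdigest()

# Illustrative check text (assumption); any fixed string works for the sketch.
CHKTXT = "Hello world 123"

# Table of known fingerprints, built by a script rather than typed by hand.
KNOWN = {chkhsh(pretok_whitespace, CHKTXT): "toy-whitespace"}

def identify(pretok: Callable[[str], List[str]]) -> Optional[str]:
    # A tokenizer is recognized only if its fingerprint is registered.
    return KNOWN.get(chkhsh(pretok, CHKTXT))
```

Because the fingerprint depends on how the text is split, a tokenizer with a different pre-tokenizer regex yields a different hash, so the regex itself does not need to be stored alongside the model.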

zhangtao2-1 · 2026-05-25T13:22:15Z

@CISC @tc-mb
Updated: added MiniCPM5 to convert_hf_to_gguf_update.py and regenerated the chkhsh via the script, instead of adding it manually.

CISC · 2026-05-25T17:05:48Z

> Updated: added MiniCPM5 to convert_hf_to_gguf_update.py and regenerated the chkhsh via the script, instead of adding it manually.

Good, but we still do not want to store the regex; see how other pre-tokenizers are handled in llama-vocab.cpp.

zhangtao2-1 · 2026-05-26T03:51:42Z

@CISC
Updated per review:

- Removed storing pre-tokenizer regex in GGUF
- Added minicpm5 to convert_hf_to_gguf_update.py and regenerated the chkhsh via the script
- Implemented the MiniCPM5 pre-tokenizer in llama-vocab.cpp (hardcoded regex, same pattern as other pre-tokenizers)
- Rebased onto latest master and resolved conflicts

Please take another look when you have a chance.
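
The "hardcoded regex" approach can be sketched as a lookup from pre-tokenizer type to an ordered list of patterns that are applied at runtime. This is only an illustration in Python: the type names and patterns below are hypothetical and are not the MiniCPM5 regex or anything copied from llama-vocab.cpp.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only. Each pre-tokenizer type maps
# to an ordered list of regexes that live in the source code, not in the
# model file.
PRE_TOKENIZER_REGEXES: Dict[str, List[str]] = {
    "toy-default": [r"\d{1,3}|[A-Za-z]+|\s+|[^\sA-Za-z\d]+"],
    "toy-single-digit": [r"\d|[A-Za-z]+|\s+|[^\sA-Za-z\d]+"],
}

def pre_tokenize(text: str, pre_type: str) -> List[str]:
    patterns = PRE_TOKENIZER_REGEXES.get(pre_type)
    if patterns is None:
        raise ValueError(f"unknown pre-tokenizer type: {pre_type}")
    pieces = [text]
    # Apply each pattern in turn, splitting every piece from the previous pass.
    for pattern in patterns:
        compiled = re.compile(pattern)
        pieces = [m for piece in pieces for m in compiled.findall(piece)]
    return pieces
```

With this layout, supporting a new model means adding one table entry keyed by the type that the fingerprint resolves to, which is why nothing extra has to be written into the GGUF.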

Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers.

zhangtao2-1 · 2026-05-26T07:45:09Z

@CISC Removed the MiniCPM5-specific routing in llama.py per review — the default vocab fallback already reaches _set_vocab_gpt2().
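
A rough sketch of what a vocab fallback chain of this kind looks like, to show why a model-specific branch becomes redundant. Only the name `_set_vocab_gpt2()` comes from this thread; the `ToyConverter` class, the other method, the file names, and the exact order of attempts are assumptions made for illustration and may differ from conversion/llama.py.

```python
from typing import Iterable, Optional

class ToyConverter:
    """Hypothetical converter that picks a vocab loader by trial and fallback."""

    def __init__(self, files: Iterable[str]):
        self.files = set(files)
        self.vocab_kind: Optional[str] = None

    def _set_vocab_sentencepiece(self) -> None:
        # Assumed first attempt: needs a SentencePiece model file.
        if "tokenizer.model" not in self.files:
            raise FileNotFoundError("tokenizer.model")
        self.vocab_kind = "sentencepiece"

    def _set_vocab_gpt2(self) -> None:
        # BPE path: needs the HF tokenizer definition.
        if "tokenizer.json" not in self.files:
            raise FileNotFoundError("tokenizer.json")
        self.vocab_kind = "gpt2"

    def set_vocab(self) -> None:
        try:
            self._set_vocab_sentencepiece()
        except FileNotFoundError:
            # No per-model routing needed: BPE models land here by default.
            self._set_vocab_gpt2()
```

A checkpoint that ships only a BPE tokenizer definition reaches the BPE loader through the `except` branch, so an explicit check for one model name adds nothing.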

zhangtao2-1 · 2026-05-26T12:09:46Z

@CISC Several CI jobs are red, but many fail at checkout with "Your account is suspended" (403). The PR-related checks (pre-tokenizer-hashes, lint, type-check) passed. Is there anything I should fix on my side, or can this wait for a maintainer re-run? Thanks.

CISC · 2026-05-26T12:25:32Z

> Several CI jobs are red, but many fail at checkout with "Your account is suspended" (403). The PR-related checks (pre-tokenizer-hashes, lint, type-check) passed. Is there anything I should fix on my side, or can this wait for a maintainer re-run? Thanks.

Nope, everything is fine, it's just another GitHub outage. :)

zhangtao2-1 · 2026-05-26T12:34:21Z

@CISC
Thanks for confirming! Just to double-check the next step on my side:

- Should I re-trigger CI after the GitHub outage clears, or
- Is the PR good to merge as-is?

CISC · 2026-05-26T12:37:08Z

> Should I re-trigger CI after the GitHub outage clears, or
> Is the PR good to merge as-is?

The PR is good to merge; we just need to wait for GHA to be back up for Release.

zhangtao2-1 · 2026-05-27T04:20:18Z

@CISC Is GHA restored yet, or still blocked?

Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers. Co-authored-by: zhangtao <zhangtao2@modelbest.cn>

zhangtao2-1 requested review from CISC and ggerganov as code owners May 20, 2026 06:39

github-actions Bot added the "python" label (python script changes) May 20, 2026

zhangtao2-1 closed this May 20, 2026

zhangtao2-1 reopened this May 21, 2026

zhangtao2-1 force-pushed the master branch from 74f9f86 to 1c600c3 May 25, 2026 13:27

zhangtao2-1 force-pushed the master branch 3 times, most recently from fb50971 to 8a1dc08 May 26, 2026 05:59

CISC reviewed May 26, 2026

Comment thread conversion/llama.py Outdated

Comment thread conversion/llama.py Outdated

convert: add MiniCPM5 tokenizer support

9830508

zhangtao2-1 force-pushed the master branch from 8a1dc08 to 9830508 May 26, 2026 07:44

CISC approved these changes May 26, 2026

CISC added the "merge ready" label (a maintainer can use this label to indicate that they consider the changes final and ready to merge) May 26, 2026

ggerganov merged commit 9777256 into ggml-org:master May 27, 2026
24 of 45 checks passed
