Add MiniCPM5 tokenizer support #23384
|
Hi @zhangtao2-1, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
Since you're new here and MiniCPM 5 does not exist yet, cc/ @tc-mb We do not want to store the original regex in the GGUF, it should merely be accounted for in creating the Do not add the |
Good, but we still do not want to store the regex, see how other pre-tokenizers are handled in |
|
@CISC Removed storing pre-tokenizer regex in GGUF |
fb50971 to 8a1dc08
Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers.
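For readers unfamiliar with the pre-tokenizer hash mentioned in the commit message above: as far as I understand it, the conversion script identifies a BPE pre-tokenizer by encoding a fixed check string and hashing the resulting token IDs, then maps known hashes to a name such as minicpm5. Below is a minimal Python sketch of that idea only. The check text, both regex patterns, the toy ID scheme, and the names `toy_encode`, `fingerprint`, `KNOWN_FINGERPRINTS` and `detect_pre_tokenizer` are hypothetical and are not taken from convert_hf_to_gguf_update.py.

```python
import hashlib
import re

# Hypothetical check text; the real conversion scripts use their own string.
CHECK_TEXT = "Hello world 123 don't"

# Two hypothetical pre-tokenizer patterns that differ only in how they split digits.
PATTERN_A = r"\d{1,3}|[A-Za-z']+|\s+"
PATTERN_B = r"\d|[A-Za-z']+|\s+"


def toy_encode(text: str, pattern: str) -> list[int]:
    """Split text with a pre-tokenizer regex and map each piece to a stable toy ID."""
    pieces = re.findall(pattern, text)
    # Toy ID derived from the piece's bytes; it stands in for a real vocab lookup.
    return [int.from_bytes(hashlib.sha256(p.encode()).digest()[:4], "big") for p in pieces]


def fingerprint(token_ids: list[int]) -> str:
    """Hash the string form of the ID list, so any change in splitting changes the hash."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()


fp_a = fingerprint(toy_encode(CHECK_TEXT, PATTERN_A))
fp_b = fingerprint(toy_encode(CHECK_TEXT, PATTERN_B))

# Only fingerprints that have been registered are recognized.
KNOWN_FINGERPRINTS = {fp_a: "toy-pre-a"}


def detect_pre_tokenizer(fp: str) -> str:
    """Return the registered name for a fingerprint, or fail loudly if it is unknown."""
    try:
        return KNOWN_FINGERPRINTS[fp]
    except KeyError:
        raise NotImplementedError(f"unrecognized pre-tokenizer fingerprint: {fp}") from None
```

Because the hash covers the token IDs rather than the regex text, the regex itself never has to be stored alongside the model: the registered name is enough for the runtime to select a hardcoded pattern.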
|
@CISC Removed the MiniCPM5-specific routing in llama.py per review — the default vocab fallback already reaches _set_vocab_gpt2(). |
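To illustrate why model-specific routing can be redundant when a fallback chain already exists, here is a rough Python sketch of the pattern. Only the name `_set_vocab_gpt2` comes from the comment above; the `ToyModel` class, the `_set_vocab_sentencepiece` stand-in, the file names, and the `vocab_kind` attribute are assumptions made for illustration and do not reproduce the actual converter code.

```python
class ToyModel:
    """Hypothetical stand-in for a converter model class with a vocab fallback chain."""

    def __init__(self, available_files):
        self.available_files = set(available_files)
        self.vocab_kind = None

    def _set_vocab_sentencepiece(self):
        # Assumed first choice: requires a SentencePiece model file.
        if "tokenizer.model" not in self.available_files:
            raise FileNotFoundError("tokenizer.model")
        self.vocab_kind = "sentencepiece"

    def _set_vocab_gpt2(self):
        # Fallback for BPE tokenizers shipped as tokenizer.json.
        if "tokenizer.json" not in self.available_files:
            raise FileNotFoundError("tokenizer.json")
        self.vocab_kind = "gpt2"

    def set_vocab(self):
        # No per-model branch: a model without tokenizer.model falls through
        # to the GPT-2 style BPE path on its own.
        try:
            self._set_vocab_sentencepiece()
        except FileNotFoundError:
            self._set_vocab_gpt2()


bpe_model = ToyModel(["tokenizer.json"])
bpe_model.set_vocab()

spm_model = ToyModel(["tokenizer.model", "tokenizer.json"])
spm_model.set_vocab()
```

In this sketch a checkpoint that ships only tokenizer.json ends up on the BPE path without any model-name check, which is the behaviour the comment relies on.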
|
@CISC Several CI jobs are red, but many of them fail at checkout with "Your account is suspended" (403). The PR-related checks (pre-tokenizer-hashes, lint, type-check) passed. Is there anything I should fix on my side, or can this wait for a maintainer re-run? Thanks. |
Nope, everything is fine, it's just another GitHub outage. :) |
|
@CISC
|
PR is good to merge, just need to wait for GHA to be back up for Release. |
|
@CISC Is GHA restored yet, or still blocked? |
* origin/master:
  hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (ggml-org#23647)
  ggml-webgpu: Fix how to dispatch WG to some ops (ggml-org#23750)
  vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (ggml-org#22887)
  vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (ggml-org#23541)
  vulkan: add REPEAT op support for f16 to f16. (ggml-org#23298)
  ci : move ARM jobs to self-hosted + disable kleidiai mac release (ggml-org#23780)
  vendor : update cpp-httplib to 0.46.0 (ggml-org#23650)
  pyproject : add conversion folder and update dependencies (ggml-org#23746)
  CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (ggml-org#23742)
  ci : bump cuda release to 13.3 (ggml-org#23749)
  common : fix env names to all have LLAMA_ARG_ prefix (ggml-org#23778)
  ci : fix windows ccaches (ggml-org#23777)
  ci : remove wasm test (ggml-org#23733)
  vulkan: avoid preferring transfer queue on AMD UMA devices (ggml-org#22455)
  ci : add ccache to server builds + fix undefined sanitizer build (ggml-org#23763)
  docs : fix duplicated "the" in granitevision and model-conversion docs (ggml-org#23767)
  convert: add MiniCPM5 tokenizer support (ggml-org#23384)
  server : fix the log message when using SSL (ggml-org#23393)
Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers.

Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
Adds MiniCPM5 support for HF → GGUF conversion and inference.
- Detect MiniCPM5 in LlamaModel and use the correct Llama3-style BPE + ByteLevel vocab path
- Register the minicpm5 BPE pre-tokenizer fingerprint
- Extract HF pre-tokenizer regexes into GGUF (tokenizer.ggml.pre_tokenizer_regex) and apply them at runtime via LLAMA_VOCAB_PRE_TYPE_CUSTOM_REGEX
Scoped to MiniCPM5 only; other Llama-family models are unchanged.