vocab : keep DNA k-mer ids distinct from colliding BPE tokens #23466
@CISC would something like this be a valid solution? |
|
To not essentially reimplement the vocab I wonder if it would make more sense to f.ex. postfix the DNA tokens with a special character at conversion, which we can then easily strip from |
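A minimal sketch of the postfix idea suggested above, in Python rather than the actual conversion script or llama.cpp code. The marker character, `build_vocab`, and `token_text` are hypothetical names chosen for illustration; the character the PR actually uses and the place where it is stripped are not stated in this thread.

```python
# Assumption: a character that never occurs in real token text.
# The marker actually chosen in the PR is not given in this thread.
DNA_MARKER = "\ue000"


def build_vocab(bpe_tokens, dna_kmers):
    """Assign ids to BPE tokens first, then to DNA k-mers with the marker appended."""
    text_to_id = {}
    for tok in bpe_tokens:
        text_to_id[tok] = len(text_to_id)
    for kmer in dna_kmers:
        # The marker keeps the key unique even if the k-mer text equals a BPE token.
        text_to_id[kmer + DNA_MARKER] = len(text_to_id)
    return text_to_id


def token_text(id_to_text, token_id):
    """Return the printable text of a token, with the DNA marker stripped."""
    text = id_to_text[token_id]
    if text.endswith(DNA_MARKER):
        return text[: -len(DNA_MARKER)]
    return text


# Toy data: "CCCCCC" exists both as a BPE token and as a DNA k-mer.
vocab = build_vocab(["A", "CCCCCC", "hello"], ["AAAAAA", "CCCCCC"])
id_to_text = {token_id: text for text, token_id in vocab.items()}
bpe_id = vocab["CCCCCC"]
dna_id = vocab["CCCCCC" + DNA_MARKER]
print(bpe_id != dna_id, token_text(id_to_text, bpe_id) == token_text(id_to_text, dna_id))
```

With the marker in place the vocab stays a plain 1:1 text-to-id map, so both entries survive and both still detokenize to the same text.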
|
thanks for the suggestion @CISC let me try that |
da621b9 to 3f4eab0
|
@CISC should be ready for initial review |
|
Nice, but don't modify |
3f4eab0 to 29e2bec
|
@CISC thanks for the suggestion. Fixed |
29e2bec to 745bfa2
|
I'll check it out later today (I added a test for Is it perhaps worth only checking the tokens after |
|
thanks @CISC checking! Also, hope you can get some sleep!! |
745bfa2 to ed2329c
Sleep is overrated. :) |
|
@kashif I guess update the GGUFs ASAP, I'll add the vocab test files. |
|
Thanks @CISC, all weights updated and re-uploaded |
|
BTW, I think you forgot to delete |
|
oops! fixing |
|
good catch @CISC removed and reuploaded weights |
|
@ServeurpersoCom Thanks, holding for |
|
thanks @CISC |
* vocab : mark hybriddna k-mers to avoid BPE token collisions
* improved loop
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* origin/master:
  server: only parse empty msg if continuing an assistant msg (ggml-org#23506)
  perplexity : fix integer overflow (ggml-org#23496)
  SYCL: improve MoE prefill throughput (ggml-org#23142)
  sycl : Level Zero detection in ggml_sycl_init (ggml-org#23097)
  SYCL : gated_delta_net K>1 (ggml-org#23174)
  SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (ggml-org#21580)
  docs: Update documentation with Granite 4.0/4.1 (ggml-org#23404)
  ggml-zendnn : add Q8_0 quantization support (ggml-org#23414)
  cmake : build router app only during standalone builds (ggml-org#23521)
  vocab : fix HybridDNA tokenizer (ggml-org#23466)
  cmake : add install() for impl libraries + fix apple builds (ggml-org#23511)
  CUDA: fix PDL CC check for JIT compilation (ggml-org#23471)
  cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by default (ggml-org#23462)
  Update WebGPU support and add link to blog/demo (ggml-org#23483)
  vulkan: fuse snake activation (mul, sin, sqr, mul, add) (ggml-org#22855)
Overview
Follow-up to #23410. The HybridDNA tokenizer gives every DNA k-mer its own id, but one 6-mer (CCCCCC) also exists as a Qwen3 BPE token. Because get_vocab() is keyed by text, the DNA id (154402) was dropped in favor of the BPE id (91443) and written out as an unused pad, so <dna>…CCCCCC…</dna> encoded to the wrong id and 154402 detokenized to [PAD154402], diverging from the Python tokenizer.

A naive conversion fix can't work: llama.cpp's vocab is a 1:1 text↔id map, so two tokens named CCCCCC won't load. transformers avoids this by resolving k-mers through a dedicated DNA map in <dna> context. This PR does the same in src/llama-vocab.cpp only: inside <dna> a k-mer resolves to its own id by product-order index (not the shared text→id map), and at load the colliding k-mer's text is restored from its index so it detokenizes correctly.

Result matches transformers both ways: DNA CCCCCC → 154402, plain CCCCCC → 91443, both detokenize to CCCCCC. Verified with full token-id parity against AutoTokenizer(..., trust_remote_code=True).

Additional information
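The collision and the index-based lookup described in the Overview can be illustrated with a small Python sketch. This is not the code in src/llama-vocab.cpp: the alphabet and its order, the toy ids, and the helper names (`kmer_index`, `resolve`, `kmer_text`) are assumptions made for the example, and the real ids 154402 and 91443 are deliberately not derived here.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: the real tokenizer's alphabet and order may differ
K = 6              # k-mer length, matching the 6-mer mentioned above


def kmer_index(kmer, alphabet=ALPHABET):
    """Position of `kmer` in itertools.product(alphabet, repeat=len(kmer)) order."""
    index = 0
    for base in kmer:
        index = index * len(alphabet) + alphabet.index(base)
    return index


def resolve(text, in_dna, text_to_id, dna_base_id):
    """Inside <dna>, map a k-mer by its index; otherwise use the shared text map."""
    is_kmer = len(text) == K and all(c in ALPHABET for c in text)
    if in_dna and is_kmer:
        return dna_base_id + kmer_index(text)
    return text_to_id[text]


def kmer_text(token_id, dna_base_id, alphabet=ALPHABET, k=K):
    """Rebuild a k-mer's text from its id (inverse of kmer_index)."""
    index = token_id - dna_base_id
    chars = []
    for _ in range(k):
        index, digit = divmod(index, len(alphabet))
        chars.append(alphabet[digit])
    return "".join(reversed(chars))


# Sanity check: the arithmetic agrees with explicit enumeration in product order.
assert kmer_index("CCC") == list(product(ALPHABET, repeat=3)).index(tuple("CCC"))

# Toy ids (hypothetical): a BPE token and a DNA k-mer share the text "CCCCCC".
TOY_BPE_ID, TOY_DNA_BASE = 7, 100
toy_dna_id = TOY_DNA_BASE + kmer_index("CCCCCC")

# A map keyed by text can hold only one of the two ids: the later write wins.
by_text = {}
for token_id, text in [(toy_dna_id, "CCCCCC"), (TOY_BPE_ID, "CCCCCC")]:
    by_text[text] = token_id

print(by_text)                                          # only one id survives
print(resolve("CCCCCC", True, by_text, TOY_DNA_BASE))   # DNA context: index-based id
print(resolve("CCCCCC", False, by_text, TOY_DNA_BASE))  # plain text: shared map id
print(kmer_text(toy_dna_id, TOY_DNA_BASE))              # text restored from the index
```

Because the DNA id is computed from the k-mer's position rather than looked up by text, the shared map never needs two entries with the same key.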
Requirements