UPSTREAM PR #18756: vocab: add tokenizer support for jina-embeddings-v2-base-zh #887

Open
loci-dev wants to merge 1 commit into main from upstream-PR18756-branch_o7si-issue-18452

@loci-dev

Mirrored from ggml-org/llama.cpp#18756

The jina-embeddings-v2-base-zh model uses:

  • Whitespace pre-tokenizer
  • Raw Unicode vocabulary (tokens stored as original characters like 你好)
  • Lowercase normalization

This PR adds tokenizer support for jina-embeddings-v2-base-zh.
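
To make the three properties listed above concrete, here is a minimal Python sketch. It is not the llama.cpp implementation: `normalize`, `pre_tokenize`, and `byte_level_form` are hypothetical helper names, and splitting on whitespace alone is a simplification (the real pre-tokenizer may also separate punctuation). The byte-to-character table follows the scheme used by GPT-2 style byte-level BPE vocabularies and is included only to contrast with a vocabulary that stores raw Unicode strings such as 你好.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TABLE = bytes_to_unicode()


def normalize(text):
    # Lowercase normalization (assumed here to be a plain lowercasing step).
    return text.lower()


def pre_tokenize(text):
    # Simplified whitespace pre-tokenizer: split on runs of whitespace only.
    return text.split()


def byte_level_form(piece):
    # How a byte-level BPE vocabulary would spell the same piece.
    return "".join(BYTE_TABLE[b] for b in piece.encode("utf-8"))


pieces = pre_tokenize(normalize("Hello 你好 World"))
print(pieces)
print(byte_level_form("你好"))
```

A tokenizer that assumes the byte-level spelling would look up the second form and miss the raw 你好 entry in this model's vocabulary, which is presumably why the model needs its own pre-tokenizer type.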

Note: I'm still learning the llama.cpp codebase, so please point out any issues — I'll fix them promptly! :D

Tested:

Expected output (HuggingFace tokenizer)
$ git clone https://huggingface.co/jinaai/jina-embeddings-v2-base-zh 
$ cat << 'EOF' > test_jina_tokenizer.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jina-embeddings-v2-base-zh")
ids = tokenizer.encode("你好")
for token_id in ids:
    print(token_id, tokenizer.decode(token_id))
EOF
$ python test_jina_tokenizer.py
0 <s>
49226 你好
2 </s>
Actual output (llama.cpp tokenizer)
$ git clone https://huggingface.co/jinaai/jina-embeddings-v2-base-zh 
$ python convert_hf_to_gguf.py jina-embeddings-v2-base-zh --outtype f32
$ ./build/bin/llama-tokenize -m jina-embeddings-v2-base-zh/Jina-Bert-Implementation-160M-F32.gguf -p "你好"
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.029 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   Apple M4
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_model_load_from_file_impl: using device Metal (Apple M4) (unknown id) - 10922 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 184 tensors from jina-embeddings-v2-base-zh/Jina-Bert-Implementation-160M-F32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = jina-bert-v2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Jina Bert Implementation
llama_model_loader: - kv   3:                       general.organization str              = Jinaai
llama_model_loader: - kv   4:                         general.size_label str              = 160M
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv   7:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv   8:                   jina-bert-v2.block_count u32              = 12
llama_model_loader: - kv   9:                jina-bert-v2.context_length u32              = 8192
llama_model_loader: - kv  10:              jina-bert-v2.embedding_length u32              = 768
llama_model_loader: - kv  11:           jina-bert-v2.feed_forward_length u32              = 3072
llama_model_loader: - kv  12:          jina-bert-v2.attention.head_count u32              = 12
llama_model_loader: - kv  13:  jina-bert-v2.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv  14:                          general.file_type u32              = 0
llama_model_loader: - kv  15:              jina-bert-v2.attention.causal bool             = false
llama_model_loader: - kv  16:                  jina-bert-v2.pooling_type u32              = 1
llama_model_loader: - kv  17:               general.quantization_version u32              = 2
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = jina-v2-zh
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,61056]   = ["<s>", "<pad>", "</s>", "<unk>", "<m...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,61056]   = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,39382]   = ["t h", "i n", "a n", "e r", "th e", ...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  25:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  26:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  28:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_sep_token bool             = true
llama_model_loader: - kv  32:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - type  f32:  184 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 611.20 MiB (32.00 BPW) 
load: empty token at index 5
init_tokenizer: initializing tokenizer for type 2
load: model vocab missing newline token, using special_pad_id instead
load: 0 unused tokens
load: control token:      0 '<s>' is not marked as EOG
load: control token:      4 '<mask>' is not marked as EOG
load: control token:      1 '<pad>' is not marked as EOG
load: control token:      3 '<unk>' is not marked as EOG
load: control token:      2 '</s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 5
load: token to piece cache size = 0.3058 MB
print_info: arch             = jina-bert-v2
print_info: vocab_only       = 1
print_info: no_alloc         = 0
print_info: model type       = ?B
print_info: model params     = 160.22 M
print_info: general.name     = Jina Bert Implementation
print_info: vocab type       = BPE
print_info: n_vocab          = 61056
print_info: n_merges         = 39382
print_info: BOS token        = 0 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 3 '<unk>'
print_info: SEP token        = 2 '</s>'
print_info: PAD token        = 1 '<pad>'
print_info: MASK token       = 4 '<mask>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 45
llama_model_load: vocab only - skipping tensors
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 0.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (512) > n_ctx_train (0) -- possible training context overflow
     0 -> '<s>'
 49226 -> '你好'
     2 -> '</s>'
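
The manual comparison above can also be automated. The following is a rough sketch, assuming the `llama-tokenize` output keeps the `id -> 'piece'` line format shown in the log; `parse_llama_tokenize` is a hypothetical helper, and the expected list is copied from the HuggingFace output earlier in this description. Pieces that contain a newline would not be parsed by this simple pattern.

```python
import re

# Matches lines such as " 49226 -> '你好'" and ignores loader/log lines.
LINE_RE = re.compile(r"^\s*(\d+) -> '(.*)'$")


def parse_llama_tokenize(output):
    """Extract (token_id, piece) pairs from captured llama-tokenize output."""
    pairs = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


# Tail of the log shown above, used as sample input.
sample_output = """llama_context: freq_scale    = 1
     0 -> '<s>'
 49226 -> '你好'
     2 -> '</s>'
"""

# Token ids and pieces reported by the HuggingFace tokenizer above.
expected = [(0, "<s>"), (49226, "你好"), (2, "</s>")]

actual = parse_llama_tokenize(sample_output)
print("match" if actual == expected else "mismatch")
```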

Related issue:

  • #18452

loci-review Bot commented Jan 11, 2026

Explore the complete analysis inside the Version Insights

I've successfully generated a summary report for your project. The report shows that Pull Request #887 for the llama.cpp repository (owned by auroralabs-loci) demonstrates significant performance improvements across multiple functions.

Key Highlights:

  • Top improvement: The end function in llama-kv-cache.cpp shows a 229.58% improvement in response time and 306.60% increase in throughput
  • Memory operations: Functions related to memory allocation show improvements of 150%+
  • No regressions: All analyzed functions show positive performance changes

The improvements are particularly notable in STL container operations and memory management, which are critical for the overall performance of the llama.cpp inference engine.

loci-dev force-pushed the main branch 27 times, most recently from 60b319a to a8ba58c (January 15, 2026 06:14)
loci-dev force-pushed the main branch 30 times, most recently from 4bb08b0 to 238591d (January 22, 2026 15:14)