@jackzhxng
Contributor

>> cmake -DTOKENIZERS_BUILD_TOOLS=ON -DSUPPORT_REGEX_LOOKAHEAD=ON . -Bbuild && cmake --build build -j9
>> ./build/examples/tokenize_tool/tokenize_tool tekken ~/hf/models--mistralai--Voxtral-Mini-3B-2507/snapshots/3060fe34b35ba5d44202ce9ff3c097642914f8f3/tekken.json "<s>[INST]Let's go swim at the beach!"

I tokenizers:regex.cpp:27] Registering override fallback regex
I tokenizers:tekken.cpp:88] Loading Tekken tokenizer from: /home/jackzhxng/hf/models--mistralai--Voxtral-Mini-3B-2507/snapshots/3060fe34b35ba5d44202ce9ff3c097642914f8f3/tekken.json
I tokenizers:tekken.cpp:112] Tekken version: v7, vocab_size: 131072, special_tokens: 1000
I tokenizers:tekken.cpp:123] Loading special tokens from JSON
I tokenizers:tekken.cpp:282] Initialized 1000 special tokens (1000 defined, 0 placeholders)
I tokenizers:tekken.cpp:140] Loading 130072 vocabulary tokens
I tokenizers:tekken.cpp:223] Processing 130072 vocabulary entries (limit: 130072)
I tokenizers:tekken.cpp:260] Built vocabulary with 130072 tokens
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1755895132.712552 3548433 re2.cc:237] Error parsing '([^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\r\n\p{L}\p{N}]?[\p{...': invalid perl operator: (?!
E tokenizers:re2_regex.cpp:22] Failed to compile regex: ([^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n/]*|\s*[\r\n]+|\s+(?!\S)|\s+), error: invalid$
I tokenizers:regex_lookahead.cpp:27] Creating PCRE2 regex
I tokenizers:tekken.cpp:181] Tekken tokenizer loaded successfully. Vocab size: 131072, BOS: 1, EOS: 2
Vocab Size: 131072
BOS: 1
EOS: 2

PROMPT:
<s>[INST]Let's go swim at the beach!

Encoding...
[ 1 3 12598 1681 1974 64031 1513 1278 29397 1033 ]

Decoding...
<s>[INST]Let's go swim at the beach!
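Note on the RE2 error in the log above: the pattern ends with `\s+(?!\S)|\s+`, and RE2 rejects the negative lookahead `(?!`, which is why the tool moves on to the PCRE2 regex. The sketch below illustrates what that lookahead does during pre-tokenization. It is not the library's code: it uses `std::regex` as a stand-in engine (its ECMAScript grammar accepts `(?!...)`), `split_pieces` is a hypothetical helper name, and the Unicode classes `\p{L}`/`\p{N}` are replaced by ASCII-only approximations, so only the whitespace handling is representative.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits text with a simplified, ASCII-only stand-in
// for the pattern shown in the log. Characters that match no branch
// (for example punctuation) are skipped by the iterator.
std::vector<std::string> split_pieces(const std::string& text) {
  // Branch 1: optional non-alphanumeric prefix character + ASCII letters
  //           (approximates the \p{L}-based word branches).
  // Branch 2: a whitespace run that is NOT followed by a non-space character.
  //           On backtracking this leaves the last space for the next word.
  // Branch 3: any remaining whitespace.
  static const std::regex pattern(
      R"([^\r\nA-Za-z0-9]?[A-Za-z]+|\s+(?!\S)|\s+)");

  std::vector<std::string> pieces;
  for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
       it != end; ++it) {
    pieces.push_back(it->str());
  }
  return pieces;
}
```

With this sketch, `split_pieces("go   swim")` yields `"go"`, `"  "`, `" swim"`: the lookahead makes the whitespace branch stop one space early, so the final space attaches to the following word. An engine without lookahead support cannot express that split with this pattern, which is consistent with the log showing a PCRE2 regex being created after RE2 fails to compile it.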

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 22, 2025
@jackzhxng jackzhxng merged commit 1c2481d into main Aug 22, 2025
4 checks passed
jackzhxng added a commit that referenced this pull request Sep 4, 2025
jackzhxng added a commit that referenced this pull request Sep 4, 2025
jackzhxng added a commit that referenced this pull request Sep 4, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 4, 2025
Summary:
Reland of #119


Differential Revision: D81692978

Pulled By: jackzhxng
facebook-github-bot pushed a commit that referenced this pull request Sep 4, 2025
Differential Revision: D81692978

Pull Request resolved: #124