QVAC-18417: multilingual (MeCab/Cangjie) preprocessing wiring for tts-ggml Chatterbox #2347
Merged
ogad-tether
previously approved these changes
Jun 1, 2026
Repoint the tts-ggml default-registry baseline at fork commit e0b1cc3, which pins tts-cpp to qvac-ext-lib-whisper.cpp@c8b140c1 (the jpgaribotti review fixes on PR #19: MeCabTagger::convert decode-mutex against the shared-tagger data race, and memoize-on-success in mecab_tagger()/cangjie_table() so a failed/late dict path is no longer sticky-null). Temporary fork pin for CI validation of the multilingual sweep.
Co-authored-by: Cursor <cursoragent@cursor.com>
…ecab-multilingual # Conflicts: # packages/tts-ggml/vcpkg.json
ogad-tether
approved these changes
Jun 4, 2026
gianni-cor
approved these changes
Jun 4, 2026
Summary
Wires the multilingual text-preprocessing dictionaries through the `tts-ggml` Chatterbox addon so the multilingual (MTL) GGUF can correctly synthesize languages that need word-level segmentation: primarily Japanese (`ja`) via MeCab/IPAdic, plus the Cangjie TSV slot for Chinese (`zh`).

The actual MeCab/Cangjie segmentation already lives in `tts-cpp` (`EngineOptions::mecab_dict_path` / `cangjie_tsv_path`); this PR only forwards the host-resolved dictionary paths from JS down to the native layer. No preprocessing logic is duplicated in the addon. Dictionaries are staged by the host (and, for tests, downloaded from the QVAC model registry / S3) rather than bundled.
Changes
Addon (C++)
- `ChatterboxConfig.hpp`: add `mecabDictPath` + `cangjieTsvPath` fields.
- `JSAdapter.cpp`: read both from the JS config object.
- `ChatterboxModel.cpp`: map them into `tts_cpp::chatterbox::EngineOptions` only when non-empty (empty ⇒ tts-cpp keeps its character-level fallback, so Turbo English is unaffected).
JS
- `index.js`: accept `mecabDictDir` / `mecabDictPath` / `cangjieTsvPath` (top-level or via `files.*`) and pass them into the Chatterbox params.
- `index.d.ts`: document the new options.

Tests / infra
- `test/utils/downloadModel.js`: `ensureMecabDict()` helper that stages the 6 compiled IPAdic files (`char.bin`, `dicrc`, `matrix.bin`, `mecabrc`, `sys.dic`, `unk.dic`) from the registry (S3 prefix `qvac_models_compiled/chatterbox/mecab-ipadic/`) and returns the dir.
- `test/integration/chatterbox-ja-mecab.test.js`: Japanese (kanji + kana) synthesis smoke test that downloads the dict, passes it via `files.mecabDictDir`, and asserts audio is produced. Skip-as-pass when the MTL GGUFs or the dictionary are unavailable (offline / no registry access).
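
For reviewers who want a concrete picture of the JS-side behaviour, the following is a minimal sketch, not the code in `index.js` or `downloadModel.js`. The function names `resolveDictOptions` and `missingDictFiles` are hypothetical, and the precedence shown (top-level over `files.*`, `mecabDictDir` over `mecabDictPath`) is an assumption; the PR only states that both locations and both spellings are accepted. The list of six file names is taken from the description above.

```javascript
// Sketch only: helper names and precedence rules are assumptions, not the addon's API.
const fs = require('node:fs');
const path = require('node:path');

// The six compiled IPAdic files named in this PR.
const IPADIC_FILES = ['char.bin', 'dicrc', 'matrix.bin', 'mecabrc', 'sys.dic', 'unk.dic'];

// Resolve dictionary options given either at the top level or under `files.*`.
// Unset options resolve to '' so the native side can keep its
// character-level fallback.
function resolveDictOptions(opts = {}) {
  const files = opts.files || {};
  const pick = (key) => opts[key] || files[key] || '';
  return {
    // Assumed precedence: the directory spelling wins over the path spelling.
    mecabDictPath: pick('mecabDictDir') || pick('mecabDictPath'),
    cangjieTsvPath: pick('cangjieTsvPath'),
  };
}

// Return the names of required dictionary files that are missing from `dir`.
// An empty or unset `dir` counts as "everything missing".
function missingDictFiles(dir) {
  if (!dir) return [...IPADIC_FILES];
  return IPADIC_FILES.filter((name) => !fs.existsSync(path.join(dir, name)));
}

module.exports = { IPADIC_FILES, resolveDictOptions, missingDictFiles };
```

A smoke test in the skip-as-pass style could call `missingDictFiles(dir)` after staging and return early, without failing, when the result is non-empty.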
Dependencies
- Bump `tts-cpp` to `>= 2026-05-29` (the multilingual version that introduces `mecab_dict_path` / `cangjie_tsv_path`; this corresponds to ext-lib `whisper.cpp` PR #19 / commit `d3db516d`).