
QVAC-18417: multilingual (MeCab/Cangjie) preprocessing wiring for tts-ggml Chatterbox #2347

Merged
freddy311082 merged 12 commits into main from QVAC-18417/tts-ggml-mecab-multilingual
Jun 4, 2026


freddy311082 (Contributor) commented May 30, 2026

Summary

Wires the multilingual text-preprocessing dictionaries through the tts-ggml
Chatterbox addon so the multilingual (MTL) GGUF can correctly synthesize
languages that need word-level segmentation — primarily Japanese (ja) via
MeCab/IPAdic, plus the Cangjie TSV slot for Chinese (zh).

The actual MeCab/Cangjie segmentation already lives in tts-cpp
(EngineOptions::mecab_dict_path / cangjie_tsv_path); this PR only forwards
the host-resolved dictionary paths from JS down to the native layer. No
preprocessing logic is duplicated in the addon. Dictionaries are staged by the
host (and, for tests, downloaded from the QVAC model registry / S3) rather than
bundled.

Changes

Addon (C++)

  • ChatterboxConfig.hpp: add mecabDictPath + cangjieTsvPath fields.
  • JSAdapter.cpp: read both from the JS config object.
  • ChatterboxModel.cpp: map them into tts_cpp::chatterbox::EngineOptions
    only when non-empty (empty ⇒ tts-cpp keeps its character-level fallback, so
    Turbo English is unaffected).

JS

  • index.js: accept mecabDictDir / mecabDictPath / cangjieTsvPath
    (top-level or via files.*) and pass them into the Chatterbox params.
  • index.d.ts: document the new options.
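
The option handling described above could look roughly like the following. This is a minimal sketch, not the actual index.js code: the helper names `firstNonEmpty` and `resolveDictPaths` are hypothetical, and the precedence (top-level before `files.*`, `mecabDictPath` before `mecabDictDir`) is an assumption the PR description does not state.

```javascript
'use strict'

// Hypothetical helper, not taken from index.js.
// Returns the first candidate that is a non-empty string, else undefined.
function firstNonEmpty (...candidates) {
  return candidates.find((v) => typeof v === 'string' && v.length > 0)
}

// Assumed precedence: top-level option first, then files.*;
// for MeCab, an explicit mecabDictPath before mecabDictDir.
function resolveDictPaths (opts = {}) {
  const files = opts.files || {}
  const mecabDictPath = firstNonEmpty(
    opts.mecabDictPath, files.mecabDictPath,
    opts.mecabDictDir, files.mecabDictDir
  )
  const cangjieTsvPath = firstNonEmpty(opts.cangjieTsvPath, files.cangjieTsvPath)

  // Include only the keys that were actually provided, so an unset path
  // is simply absent from the params handed to the native layer.
  const params = {}
  if (mecabDictPath) params.mecabDictPath = mecabDictPath
  if (cangjieTsvPath) params.cangjieTsvPath = cangjieTsvPath
  return params
}
```

Leaving unset keys out of the params mirrors the addon's "only when non-empty" mapping: with no dictionary configured, nothing is forwarded and tts-cpp keeps its character-level fallback.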

Tests / infra

  • test/utils/downloadModel.js: ensureMecabDict() helper that stages the
    6 compiled IPAdic files (char.bin, dicrc, matrix.bin, mecabrc,
    sys.dic, unk.dic) from the registry (S3 prefix
    qvac_models_compiled/chatterbox/mecab-ipadic/) and returns the dir.
  • test/integration/chatterbox-ja-mecab.test.js: Japanese (kanji + kana)
    synthesis smoke test that downloads the dict, passes it via
    files.mecabDictDir, and asserts audio is produced. Skip-as-pass when the
    MTL GGUFs or the dictionary are unavailable (offline / no registry access).
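
The staging check and the skip-as-pass gate can be sketched as below. This is an illustration under stated assumptions, not the code in test/utils/downloadModel.js: `missingMecabFiles` and `skipReason` are hypothetical names, and the sketch only verifies an already staged directory; it performs no registry or S3 download.

```javascript
'use strict'
const fs = require('fs')
const path = require('path')

// The six compiled IPAdic files named in the PR description.
const MECAB_FILES = ['char.bin', 'dicrc', 'matrix.bin', 'mecabrc', 'sys.dic', 'unk.dic']

// Hypothetical helper: lists which of the six files are absent from `dir`.
function missingMecabFiles (dir) {
  return MECAB_FILES.filter((name) => !fs.existsSync(path.join(dir, name)))
}

// Hypothetical skip-as-pass gate: returns a reason string when the test
// should be skipped, or null when every prerequisite is present.
function skipReason (dictDir, modelPaths) {
  if (!dictDir) return 'MeCab dictionary unavailable'
  const missing = missingMecabFiles(dictDir)
  if (missing.length > 0) return `MeCab dictionary incomplete: ${missing.join(', ')}`
  const absent = modelPaths.filter((p) => !fs.existsSync(p))
  if (absent.length > 0) return `MTL GGUF unavailable: ${absent.join(', ')}`
  return null
}
```

A test would call `skipReason(dictDir, ggufPaths)` first and return early with a logged message when it is non-null, so offline runs pass instead of failing on a missing download.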

Dependencies

  • Bumps tts-cpp to >= 2026-05-29 (the multilingual version that introduces
    mecab_dict_path / cangjie_tsv_path; this corresponds to ext-lib
    whisper.cpp PR #19 / commit d3db516d).

kinsta Bot commented Jun 1, 2026

Preview deployments for qvac-docs-staging ⚡️

Status Branch preview Commit preview
✅ Ready Visit preview Visit preview

Commit: e3dd8e272c758f163637c424e1492508b25f6ebb

Deployment ID: 3f10cfbc-efb4-4730-9f5a-d1bdae17a492

Static site name: qvac-docs-staging-fazwv

github-actions (Bot, Contributor) commented Jun 1, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

ogad-tether previously approved these changes Jun 1, 2026
Repoint the tts-ggml default-registry baseline at fork commit e0b1cc3,
which pins tts-cpp to qvac-ext-lib-whisper.cpp@c8b140c1 (the jpgaribotti
review fixes on PR #19: MeCabTagger::convert decode-mutex against the
shared-tagger data race, and memoize-on-success in mecab_tagger()/
cangjie_table() so a failed/late dict path is no longer sticky-null).
Temporary fork pin for CI validation of the multilingual sweep.

Co-authored-by: Cursor <cursoragent@cursor.com>
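
The memoize-on-success behaviour mentioned in the commit message can be illustrated with a small sketch. It is written in JavaScript for illustration only and is not the tts-cpp implementation; `memoizeOnSuccess` and its `load` callback are hypothetical. The decode mutex is not shown, since it concerns native threads sharing one tagger.

```javascript
'use strict'

// Hypothetical illustration of memoize-on-success; not the tts-cpp code.
// `load(dictPath)` is assumed to return a resource, or null on failure.
function memoizeOnSuccess (load) {
  let cached = null
  return function get (dictPath) {
    if (cached !== null) return cached
    const result = load(dictPath)
    // Cache only a successful load. A null result is not stored, so a
    // later call with a valid path can still succeed (no sticky null).
    if (result !== null) cached = result
    return result
  }
}
```

With a plain memoizer that caches the first result unconditionally, a first call made before the dictionary path is available would pin the null result for the lifetime of the process; caching only on success avoids that.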
freddy311082 (Contributor, Author) commented

/review


3 participants