Skip to content

QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs#1892

Merged
simon-iribarren merged 5 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-18420-bergamot-vocab-dedup
May 5, 2026
Merged

QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs#1892
simon-iribarren merged 5 commits into
tetherto:mainfrom
simon-iribarren:fix/qvac-18420-bergamot-vocab-dedup

Conversation

@simon-iribarren

Copy link
Copy Markdown
Contributor

Note: be concise and prefer bullet points.

🎯 What problem does this PR solve?

  • QVAC-18420 — reported by Dominic in #qvac-sdk: the Bergamot FR→EN vocab.fren.spm was being deleted and re-downloaded from the registry on every loadModel call, even though it was already cached. Investigation confirmed 7 bidirectional Bergamot pairs reproduce: FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN.
  • Shipping-hot: translation is not "offline after first use" for these pairs; every load costs ~600 KB – 1 MB of bandwidth and extra latency (originally caught on Keet/iOS).

📝 How does it solve it?

  • Root cause: deduplicateModels (sdk-pod models/update-models/registry.ts) drops the second occurrence of two registry entries that share a sha256Checksum. For the 7 affected pairs the two translation directions ship the same vocab blob under different paths (e.g. bergamot-fren/.../vocab.fren.spmbergamot-enfr/.../vocab.enfr.spm, both sha 783abf3a…c002a, 814404 bytes). Alphabetical dedup kept the bergamot-enfr path and dropped the bergamot-fren one — leaving bergamot-fren/vocab.fren.spm with no standalone RegistryItem. When the Bergamot plugin then called ctx.resolveModelPath("registry://…/vocab.fren.spm"), getModelByPath() returned undefinedexpectedSize = 0 → the cache validator saw 814404 ≠ 0 and unlinked + re-downloaded every load.
  • Fix A — dedup preserves companion-referenced duplicates (packages/sdk/models/update-models/registry.ts): before dropping on sha256 collision, check whether the candidate's registryPath is referenced as a file in any other entry's companionSet. Companion-referenced duplicates survive. Regenerated packages/sdk/models/registry/models.ts — the 7 previously-missing BERGAMOT_*_VOCAB entries are back. Remaining non-companion duplicates (6) are still deduped.
  • Fix C — skip redundant vocab resolve for registry:// Bergamot loads (packages/sdk/server/bare/plugins/nmtcpp-translation/plugin.ts): when the primary model is a registry:// source with auto-derived vocabs, skip the separate ctx.resolveModelPath calls entirely. downloadCompanionSetFromRegistry already places the vocabs next to the model under sets/<setKey>/, and createModel derives the colocated paths via the existing deriveColocatedBergamotVocabPaths helper. Eliminates the redundant flat-cache copy on every load (bandwidth + latency win even on unaffected pairs) and narrows the blast radius of this class of dedup hole. Same optimization applied to the pivot branch.
  • Unit tests: new packages/sdk/test/unit/update-models-dedup.test.ts covers the QVAC-18420 regression plus general dedup semantics (plain duplicates still drop, empty-sha entries untouched, cross-set companion references all survive).

🧪 How was it tested?

  • bun lint (sdk) — clean.
  • bun run build (sdk) — clean.
  • bun run test:unit (sdk) — all green, including 5 new dedup tests.
  • bun run update-models (sdk) against the live registry — diff is purely the 7 BERGAMOT_*_VOCAB entries added plus index-shifts on subsequent export consts; no other registry items changed. 🧹 Removed 6 duplicate model(s) (was 13) confirms exactly the 7 affected vocabs are now preserved.
  • Reproduction script from the bug report (packages/sdk/examples/translation/translation-bergamot-cache-bug.ts) is expected to return "0 of 15 reproduce" after this lands. Recommend the reporter re-run as an end-to-end check on a fresh ~/.qvac/models cache.

📦 Models

Added models

BERGAMOT_EN_CS_VOCAB
BERGAMOT_EN_DE_VOCAB
BERGAMOT_ET_EN_VOCAB
BERGAMOT_FI_EN_VOCAB
BERGAMOT_FR_EN_VOCAB
BERGAMOT_PL_EN_VOCAB
BERGAMOT_PT_EN_VOCAB

These are companion-only entries (no export const); they only appear inside the models array so that getModelByPath() resolves the companion vocab paths that are shared across both directions of each bidirectional pair.

…for shared-vocab pairs

- Dedup preserves registry entries referenced as companion files in any
  companion set, so shared-vocab blobs (identical sha256 under two paths) keep
  their standalone RegistryItem. This adds back 7 previously-dropped
  BERGAMOT_<pair>_VOCAB entries (FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN)
  and restores correct expectedSize/sha lookups for the Bergamot plugin's
  separate vocab resolve.
- For registry:// Bergamot loads with auto-derived vocabs (non-pivot and pivot),
  skip the separate per-vocab resolveModelPath entirely. The companion-set
  download already colocates vocabs under sets/<setKey>/ and createModel
  derives those paths via deriveColocatedBergamotVocabPaths — no redundant
  flat-cache download, no dedup-hole exposure.
opaninakuffo
opaninakuffo previously approved these changes May 5, 2026
Comment thread packages/sdk/server/bare/plugins/nmtcpp-translation/plugin.ts Outdated
NamelsKing
NamelsKing previously approved these changes May 5, 2026
Reviewer ask on PR tetherto#1892: lock the Fix-C registry-source vocab optimization
behind unit tests so the contract doesn't drift.

- Extract resolveBergamotVocab and its derivation helpers into
  resolve-vocab.ts. The plugin module imports @qvac/translation-nmtcpp at
  module-load time (Bare-only native addon), so the helpers were not
  importable from a bun-driven test runner. The new module has zero native
  deps; plugin.ts re-imports from it.
- Add test/unit/nmtcpp-resolve-vocab.test.ts covering:
  - registry:// + auto-derived vocabs (non-pivot) → ctx.resolveModelPath
    is never called and no vocab artifacts are emitted
  - registry:// + auto-derived vocabs (pivot) → exactly one resolveModelPath
    call (the pivot model itself), no vocab artifacts
  - registry:// + user-supplied srcVocabSrc OR dstVocabSrc → optimization
    is skipped and full per-vocab resolution runs (sanity: keeps the
    optimization narrow)
  - pear:// source still resolves vocabs explicitly (companion-set
    semantics don't apply)
  - mixed registry/pear pivot triggers full resolution path
  - unsupported sources raise ModelLoadFailedError
  - pure derivation helpers (buildBergamotVocabSources,
    deriveBergamotVocabSources, deriveBergamotRegistryVocabSources) for
    both shared-vocab and CJK split-vocab pairs
@opaninakuffo

Copy link
Copy Markdown
Contributor

review

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@simon-iribarren simon-iribarren merged commit b3c751b into tetherto:main May 5, 2026
12 of 13 checks passed
tamer-hassan-tether pushed a commit that referenced this pull request May 5, 2026
…for shared-vocab pairs (#1892)

* QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs

- Dedup preserves registry entries referenced as companion files in any
  companion set, so shared-vocab blobs (identical sha256 under two paths) keep
  their standalone RegistryItem. This adds back 7 previously-dropped
  BERGAMOT_<pair>_VOCAB entries (FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN)
  and restores correct expectedSize/sha lookups for the Bergamot plugin's
  separate vocab resolve.
- For registry:// Bergamot loads with auto-derived vocabs (non-pivot and pivot),
  skip the separate per-vocab resolveModelPath entirely. The companion-set
  download already colocates vocabs under sets/<setKey>/ and createModel
  derives those paths via deriveColocatedBergamotVocabPaths — no redundant
  flat-cache download, no dedup-hole exposure.

* QVAC-18420 test: cover Bergamot vocab resolver + extract pure helpers

Reviewer ask on PR #1892: lock the Fix-C registry-source vocab optimization
behind unit tests so the contract doesn't drift.

- Extract resolveBergamotVocab and its derivation helpers into
  resolve-vocab.ts. The plugin module imports @qvac/translation-nmtcpp at
  module-load time (Bare-only native addon), so the helpers were not
  importable from a bun-driven test runner. The new module has zero native
  deps; plugin.ts re-imports from it.
- Add test/unit/nmtcpp-resolve-vocab.test.ts covering:
  - registry:// + auto-derived vocabs (non-pivot) → ctx.resolveModelPath
    is never called and no vocab artifacts are emitted
  - registry:// + auto-derived vocabs (pivot) → exactly one resolveModelPath
    call (the pivot model itself), no vocab artifacts
  - registry:// + user-supplied srcVocabSrc OR dstVocabSrc → optimization
    is skipped and full per-vocab resolution runs (sanity: keeps the
    optimization narrow)
  - pear:// source still resolves vocabs explicitly (companion-set
    semantics don't apply)
  - mixed registry/pear pivot triggers full resolution path
  - unsupported sources raise ModelLoadFailedError
  - pure derivation helpers (buildBergamotVocabSources,
    deriveBergamotVocabSources, deriveBergamotRegistryVocabSources) for
    both shared-vocab and CJK split-vocab pairs

---------

Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
opaninakuffo added a commit to opaninakuffo/qvac that referenced this pull request May 5, 2026
…for shared-vocab pairs (tetherto#1892)

* QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs

- Dedup preserves registry entries referenced as companion files in any
  companion set, so shared-vocab blobs (identical sha256 under two paths) keep
  their standalone RegistryItem. This adds back 7 previously-dropped
  BERGAMOT_<pair>_VOCAB entries (FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN)
  and restores correct expectedSize/sha lookups for the Bergamot plugin's
  separate vocab resolve.
- For registry:// Bergamot loads with auto-derived vocabs (non-pivot and pivot),
  skip the separate per-vocab resolveModelPath entirely. The companion-set
  download already colocates vocabs under sets/<setKey>/ and createModel
  derives those paths via deriveColocatedBergamotVocabPaths — no redundant
  flat-cache download, no dedup-hole exposure.

* QVAC-18420 test: cover Bergamot vocab resolver + extract pure helpers

Reviewer ask on PR tetherto#1892: lock the Fix-C registry-source vocab optimization
behind unit tests so the contract doesn't drift.

- Extract resolveBergamotVocab and its derivation helpers into
  resolve-vocab.ts. The plugin module imports @qvac/translation-nmtcpp at
  module-load time (Bare-only native addon), so the helpers were not
  importable from a bun-driven test runner. The new module has zero native
  deps; plugin.ts re-imports from it.
- Add test/unit/nmtcpp-resolve-vocab.test.ts covering:
  - registry:// + auto-derived vocabs (non-pivot) → ctx.resolveModelPath
    is never called and no vocab artifacts are emitted
  - registry:// + auto-derived vocabs (pivot) → exactly one resolveModelPath
    call (the pivot model itself), no vocab artifacts
  - registry:// + user-supplied srcVocabSrc OR dstVocabSrc → optimization
    is skipped and full per-vocab resolution runs (sanity: keeps the
    optimization narrow)
  - pear:// source still resolves vocabs explicitly (companion-set
    semantics don't apply)
  - mixed registry/pear pivot triggers full resolution path
  - unsupported sources raise ModelLoadFailedError
  - pure derivation helpers (buildBergamotVocabSources,
    deriveBergamotVocabSources, deriveBergamotRegistryVocabSources) for
    both shared-vocab and CJK split-vocab pairs

---------

Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
tamer-hassan-tether pushed a commit that referenced this pull request May 6, 2026
…for shared-vocab pairs (#1892)

* QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs

- Dedup preserves registry entries referenced as companion files in any
  companion set, so shared-vocab blobs (identical sha256 under two paths) keep
  their standalone RegistryItem. This adds back 7 previously-dropped
  BERGAMOT_<pair>_VOCAB entries (FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN)
  and restores correct expectedSize/sha lookups for the Bergamot plugin's
  separate vocab resolve.
- For registry:// Bergamot loads with auto-derived vocabs (non-pivot and pivot),
  skip the separate per-vocab resolveModelPath entirely. The companion-set
  download already colocates vocabs under sets/<setKey>/ and createModel
  derives those paths via deriveColocatedBergamotVocabPaths — no redundant
  flat-cache download, no dedup-hole exposure.

* QVAC-18420 test: cover Bergamot vocab resolver + extract pure helpers

Reviewer ask on PR #1892: lock the Fix-C registry-source vocab optimization
behind unit tests so the contract doesn't drift.

- Extract resolveBergamotVocab and its derivation helpers into
  resolve-vocab.ts. The plugin module imports @qvac/translation-nmtcpp at
  module-load time (Bare-only native addon), so the helpers were not
  importable from a bun-driven test runner. The new module has zero native
  deps; plugin.ts re-imports from it.
- Add test/unit/nmtcpp-resolve-vocab.test.ts covering:
  - registry:// + auto-derived vocabs (non-pivot) → ctx.resolveModelPath
    is never called and no vocab artifacts are emitted
  - registry:// + auto-derived vocabs (pivot) → exactly one resolveModelPath
    call (the pivot model itself), no vocab artifacts
  - registry:// + user-supplied srcVocabSrc OR dstVocabSrc → optimization
    is skipped and full per-vocab resolution runs (sanity: keeps the
    optimization narrow)
  - pear:// source still resolves vocabs explicitly (companion-set
    semantics don't apply)
  - mixed registry/pear pivot triggers full resolution path
  - unsupported sources raise ModelLoadFailedError
  - pure derivation helpers (buildBergamotVocabSources,
    deriveBergamotVocabSources, deriveBergamotRegistryVocabSources) for
    both shared-vocab and CJK split-vocab pairs

---------

Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
opaninakuffo added a commit that referenced this pull request May 6, 2026
* QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs (#1892)

* QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs

- Dedup preserves registry entries referenced as companion files in any
  companion set, so shared-vocab blobs (identical sha256 under two paths) keep
  their standalone RegistryItem. This adds back 7 previously-dropped
  BERGAMOT_<pair>_VOCAB entries (FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN)
  and restores correct expectedSize/sha lookups for the Bergamot plugin's
  separate vocab resolve.
- For registry:// Bergamot loads with auto-derived vocabs (non-pivot and pivot),
  skip the separate per-vocab resolveModelPath entirely. The companion-set
  download already colocates vocabs under sets/<setKey>/ and createModel
  derives those paths via deriveColocatedBergamotVocabPaths — no redundant
  flat-cache download, no dedup-hole exposure.

* QVAC-18420 test: cover Bergamot vocab resolver + extract pure helpers

Reviewer ask on PR #1892: lock the Fix-C registry-source vocab optimization
behind unit tests so the contract doesn't drift.

- Extract resolveBergamotVocab and its derivation helpers into
  resolve-vocab.ts. The plugin module imports @qvac/translation-nmtcpp at
  module-load time (Bare-only native addon), so the helpers were not
  importable from a bun-driven test runner. The new module has zero native
  deps; plugin.ts re-imports from it.
- Add test/unit/nmtcpp-resolve-vocab.test.ts covering:
  - registry:// + auto-derived vocabs (non-pivot) → ctx.resolveModelPath
    is never called and no vocab artifacts are emitted
  - registry:// + auto-derived vocabs (pivot) → exactly one resolveModelPath
    call (the pivot model itself), no vocab artifacts
  - registry:// + user-supplied srcVocabSrc OR dstVocabSrc → optimization
    is skipped and full per-vocab resolution runs (sanity: keeps the
    optimization narrow)
  - pear:// source still resolves vocabs explicitly (companion-set
    semantics don't apply)
  - mixed registry/pear pivot triggers full resolution path
  - unsupported sources raise ModelLoadFailedError
  - pure derivation helpers (buildBergamotVocabSources,
    deriveBergamotVocabSources, deriveBergamotRegistryVocabSources) for
    both shared-vocab and CJK split-vocab pairs

---------

Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>

* QVAC-17324 feat[api]: add harmony tool-call dialect (gpt-oss) (#1878)

* chore: bump @qvac/llm-llamacpp to 0.17.2 for harmony EOG fix
- Picks up #1812 which stops the addon from suppressing
  the `<|call|>` end-of-generation token, unblocking harmony tool-call
  parsing on the SDK side.

* feat: expose toolDialect override and add harmony to the dialect enum

* feat: add harmony tool-call parser for gpt-oss models

* feat: stream harmony frames and decouple dialect from tool-call activation

* test: add harmony unit coverage for parser and normalizer

* chore: consolidate tool examples under examples/tools/

* doc: clarify emitThinking capture-gating contract

* chore: remove conflict markers

* feat[mod]: sync sdk model registry to bergamot base-memory and drop deprecated marian opus (#1903)

- Bumps BERGAMOT_EN_IT and BERGAMOT_ES_EN to the base-memory variant at
  bergamot-{enit,esen}/2026-04-28/. Fixes leading "- " hallucinations on
  short inputs and en->it quality regression (registry seed updated in
  #1785, synced to DHT 2026-05-05).
- Drops 32 deprecated Marian Opus entries (NMT_Q0F16*, NMT_Q4_0*)
  auto-deprecated by #1680.
- Auto-generated by `bun update-models`.

* chore[skiplog]: release sdk 0.10.1

---------

Co-authored-by: Simon Iribarren <simon.ig13@gmail.com>
Proletter pushed a commit that referenced this pull request May 24, 2026
…for shared-vocab pairs (#1892)

* QVAC-18420 fix[mod]: Bergamot vocab re-downloaded on every loadModel for shared-vocab pairs

- Dedup preserves registry entries referenced as companion files in any
  companion set, so shared-vocab blobs (identical sha256 under two paths) keep
  their standalone RegistryItem. This adds back 7 previously-dropped
  BERGAMOT_<pair>_VOCAB entries (FR_EN, EN_DE, EN_CS, ET_EN, FI_EN, PL_EN, PT_EN)
  and restores correct expectedSize/sha lookups for the Bergamot plugin's
  separate vocab resolve.
- For registry:// Bergamot loads with auto-derived vocabs (non-pivot and pivot),
  skip the separate per-vocab resolveModelPath entirely. The companion-set
  download already colocates vocabs under sets/<setKey>/ and createModel
  derives those paths via deriveColocatedBergamotVocabPaths — no redundant
  flat-cache download, no dedup-hole exposure.

* QVAC-18420 test: cover Bergamot vocab resolver + extract pure helpers

Reviewer ask on PR #1892: lock the Fix-C registry-source vocab optimization
behind unit tests so the contract doesn't drift.

- Extract resolveBergamotVocab and its derivation helpers into
  resolve-vocab.ts. The plugin module imports @qvac/translation-nmtcpp at
  module-load time (Bare-only native addon), so the helpers were not
  importable from a bun-driven test runner. The new module has zero native
  deps; plugin.ts re-imports from it.
- Add test/unit/nmtcpp-resolve-vocab.test.ts covering:
  - registry:// + auto-derived vocabs (non-pivot) → ctx.resolveModelPath
    is never called and no vocab artifacts are emitted
  - registry:// + auto-derived vocabs (pivot) → exactly one resolveModelPath
    call (the pivot model itself), no vocab artifacts
  - registry:// + user-supplied srcVocabSrc OR dstVocabSrc → optimization
    is skipped and full per-vocab resolution runs (sanity: keeps the
    optimization narrow)
  - pear:// source still resolves vocabs explicitly (companion-set
    semantics don't apply)
  - mixed registry/pear pivot triggers full resolution path
  - unsupported sources raise ModelLoadFailedError
  - pure derivation helpers (buildBergamotVocabSources,
    deriveBergamotVocabSources, deriveBergamotRegistryVocabSources) for
    both shared-vocab and CJK split-vocab pairs

---------

Co-authored-by: Opanin Akuffo <46673050+opaninakuffo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants