
QVAC-18157 feat[mod]: refresh bergamot manifest + fix Firefox Remote Settings zh lookup #2075

Merged
olyasir merged 4 commits into main from QVAC-18157-bergamot-fixes on May 15, 2026

olyasir commented May 15, 2026

Summary

Follow-up to QVAC-17892 / PR #1785 (the bergamot S3 ↔ Firefox CDN variant divergence that desktop CI surfaced). This PR closes the runtime-impacting subset of the audit work tracked under QVAC-18157 — three categories of fix grouped into one branch because they share the same root cause (drift between QVAC's manifest, S3, and Firefox Remote Settings).

  • Manifest (packages/registry-server/data/models.prod.json):

    • Date bump for enbg, enhr, ennl → 2026-04-28. Same shape as the enit/esen fix in PR #1785 (QVAC-17892 fix[ci]: use refreshed base-memory bergamot models for desktop integration tests): S3 has fresh base-memory uploads, but the manifest still pointed at the 2025-12-18 tiny uploads.
    • Restore enzh and zhen, dropped by PR #1919 (fix(registry): drop phantom Bergamot pairs and stale enja vocab entry). Both pairs exist upstream in mozilla/firefox-translations-models/models/base-memory/; the S3 bytes match the Firefox Remote Settings SHA256s exactly (v2.2 for enzh, v2.1 for zhen); the original removal was a false positive caused by querying Remote Settings with zh instead of zh-Hans.
    • Add 9 new pairs uploaded to S3 on 2026-04-28: enbs, ensr, enth, envi, hbsen, then (base-memory); ennb, enno, noen (tiny). Each link field is set from the actual S3 metadata.architecture, not a guess — this is the root-cause fix for the class of bug QVAC-17892 documented.
  • Fetcher (packages/translation-nmtcpp/lib/bergamot-model-fetcher.js):

    • Normalize zh → zh-Hans when filtering Firefox Remote Settings records, so runtime downloads of Chinese pairs actually find a record. Without this, every loadModel('en','zh') call threw "No Firefox Translations model found for en-zh", because Remote Settings catalogs Chinese under the BCP 47 tag zh-Hans rather than zh.
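
The normalization described in the fetcher bullet can be sketched as follows. This is a minimal illustration, not the code in `bergamot-model-fetcher.js`: `BCP47_LANG_ALIASES` is the map name the fix commit uses, while `toRemoteSettingsLang`, `findRecords`, and the sample records are hypothetical names and data chosen for the example.

```javascript
// Sketch only: alias map from caller-facing short codes to the tags
// Firefox Remote Settings uses. The PR names one known mismatch.
const BCP47_LANG_ALIASES = new Map([['zh', 'zh-Hans']])

// Hypothetical helper: map a caller code to the catalog tag,
// passing through any code that has no alias.
function toRemoteSettingsLang (lang) {
  return BCP47_LANG_ALIASES.get(lang) ?? lang
}

// Hypothetical helper: keep only the records for one translation direction.
function findRecords (records, srcLang, dstLang) {
  const from = toRemoteSettingsLang(srcLang)
  const to = toRemoteSettingsLang(dstLang)
  return records.filter((r) => r.fromLang === from && r.toLang === to)
}

// Illustrative records, not real catalog data.
const sampleRecords = [
  { name: 'model.enzh.intgemm.alphas.bin', fromLang: 'en', toLang: 'zh-Hans' },
  { name: 'model.ende.intgemm.alphas.bin', fromLang: 'en', toLang: 'de' }
]

const zhMatches = findRecords(sampleRecords, 'en', 'zh')
const deMatches = findRecords(sampleRecords, 'en', 'de')
console.log(zhMatches.length, deMatches.length)
```

Keeping the alias in a lookup table rather than a special case in the filter means a future mismatch would only need one more map entry.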

Audit findings that informed this PR

Full crawl: s3://tether-ai-dev/qvac_models_compiled/bergamot/ (104 pair prefixes) vs models.prod.json (was 91 pairs) vs mozilla/firefox-translations-models (base-memory/: 42 dirs, tiny/: 84 dirs) vs Firefox Remote Settings live catalog (93 unique pairs).

| Group | Count | Resolution |
|---|---|---|
| Manifest matches S3 at claimed date | 39 | no change |
| Manifest claims base-memory; S3 has base-memory at 2026-04-28 only | 3 (enbg, enhr, ennl) | bumped this PR |
| Manifest entries dropped by #1919 that exist upstream + in S3 | 2 (enzh, zhen) | restored this PR |
| S3-only pairs eligible for manifest | 9 | added this PR |
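
The "S3-only pairs" group starts from a set difference between the pair prefixes found in S3 and the pairs listed in the manifest. A minimal sketch, assuming both sides have already been reduced to arrays of pair codes; `s3OnlyPairs` and the sample arrays are hypothetical and far smaller than the real crawl:

```javascript
// Hypothetical helper: pairs present in S3 but missing from the manifest.
function s3OnlyPairs (s3Pairs, manifestPairs) {
  const inManifest = new Set(manifestPairs)
  return s3Pairs.filter((pair) => !inManifest.has(pair)).sort()
}

// Illustrative inputs only, not the real crawl results.
const s3Sample = ['enit', 'esen', 'enth', 'envi', 'noen']
const manifestSample = ['enit', 'esen']

const missing = s3OnlyPairs(s3Sample, manifestSample)
console.log(missing) // [ 'enth', 'envi', 'noen' ]
```

The difference alone does not decide eligibility: a pair found this way would still need its full file set present in S3 before it can be added.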

SDK regen follow-up

packages/sdk/models/registry/models.ts is generated from the synced Hyperdrive registry, not from models.prod.json directly. After this PR merges and pr-models-validation-registry-server.yml's sync-staging job pushes the new entries into the registry, a second PR should run packages/sdk/models/update-models to regenerate models.ts with the new pairs' sha256Checksum / blobCoreKey / expectedSize populated from the live registry (same pattern as PR #1903 followed PR #1785).

Test plan

  • node packages/registry-server/scripts/validate-models-json.js --file=packages/registry-server/data/models.prod.json → ✓ Valid: 760 model(s)
  • npm --prefix packages/registry-server run test:unit → 43/43 pass
  • npm --prefix packages/translation-nmtcpp run test:unit → 59/59 pass (4 new tests in bergamot-model-fetcher.test.js)
  • npm --prefix packages/translation-nmtcpp run lint → clean
  • CI pr-models-validation-registry-server.yml validate-json job
  • Spot-check after merge: a loadModel for ('en','zh') on a clean cache successfully downloads from Firefox Remote Settings (Chinese pair end-to-end smoke)
  • Follow-up PR: SDK models.ts regen after sync-staging runs

olyasir requested review from a team as code owners May 15, 2026 09:18
olyasir added 2 commits May 15, 2026 12:41
Three changes to packages/registry-server/data/models.prod.json
covering follow-ups from QVAC-17892 / PR #1785:

1. Date bump for enbg / enhr / ennl
   S3 has fresh base-memory uploads at /2026-04-28/ for these three
   pairs (alongside the enit/esen refresh that #1785 fixed). Manifest
   still pointed at the 2025-12-18 tiny uploads, so registry consumers
   were getting tiny bytes while Firefox Remote Settings served
   base-memory — same divergence shape #1785 patched.

2. Restore enzh and zhen
   PR #1919 dropped both pairs as "phantom" after a Remote Settings
   cross-check failed, but the lookup compared `zh` against
   `r.fromLang/r.toLang` while Firefox catalogs Chinese under the
   BCP 47 tag `zh-Hans`. Both pairs are in
   mozilla/firefox-translations-models/models/base-memory/, S3 has the
   bytes (2025-12-18), and the SHA256s match Firefox Remote Settings
   v2.2 (enzh) and v2.1 (zhen) byte-for-byte. enzh follows the CJK
   split-vocab convention (srcvocab + trgvocab); zhen uses single
   combined vocab.

3. Add 9 new bergamot pairs
   These were uploaded to S3 on 2026-04-28 but had no manifest entries
   yet. Each `link` field is set per the actual S3
   metadata.architecture so the manifest reflects reality from day one
   (the root cause of QVAC-17892 was a stale link claim diverging from
   S3 bytes).

     - enbs, ensr, enth, envi, hbsen, then    base-memory (31561787 B)
     - ennb, enno, noen                       tiny        (17141051 B)
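
A consistency check between a manifest `link` and the S3 metadata could be sketched as follows. It assumes each manifest entry has a `link` string containing an architecture path segment (`/base-memory/` or `/tiny/`) and that the S3 metadata exposes `architecture` as a string. The entry shape, the metadata shape, and the names `architectureFromLink` and `linkMatchesArchitecture` are assumptions for illustration, not the project's tooling.

```javascript
const KNOWN_ARCHITECTURES = ['base-memory', 'tiny']

// Hypothetical: find which architecture segment appears in a manifest link.
function architectureFromLink (link) {
  return KNOWN_ARCHITECTURES.find((arch) => link.includes(`/${arch}/`)) ?? null
}

// Hypothetical: true when the link's claim agrees with the S3 metadata.
function linkMatchesArchitecture (entry, s3Metadata) {
  return architectureFromLink(entry.link) === s3Metadata.architecture
}

// Illustrative values; the link below is a placeholder, not a real URL.
const entry = { pair: 'enth', link: 'https://example.invalid/models/base-memory/enth' }
const agrees = linkMatchesArchitecture(entry, { architecture: 'base-memory' })
const diverges = linkMatchesArchitecture(entry, { architecture: 'tiny' })
console.log(agrees, diverges) // true false
```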

Deferred from this PR:

  - 49 link-only mismatches (manifest claims `/base-memory/` but S3 +
    Mozilla only ever had `/tiny/`). Functionally already-tiny on
    both lanes; no runtime divergence. Will follow up as a doc-only
    PR.
  - enzh_hant / zh_hanten: model + lex + metadata present in S3 but no
    vocab files uploaded. Will follow up once vocabs are uploaded or a
    shared-vocab story with enzh/zhen is decided.
  - SDK regen (packages/sdk/models/registry/models.ts): generated from
    the synced Hyperdrive registry, not from this JSON file. Follow-up
    PR after this one merges and `sync-staging` runs.

Validation:

  - node packages/registry-server/scripts/validate-models-json.js
    --file=packages/registry-server/data/models.prod.json
    → ✓ Valid: 760 model(s)
  - npm run test:unit (registry-server) → 43/43 pass
…mote Settings

`bergamot-model-fetcher.js` queries Firefox Remote Settings (the
catalog the in-browser translation feature uses) and filters records
by `r.fromLang === srcLang && r.toLang === dstLang`. The catalog
exposes Chinese under the BCP 47 tag `zh-Hans` while callers pass the
ISO 639-1 short code `zh` (matching QVAC's manifest and Mozilla's
`firefox-translations-models` repo). The strict equality silently
failed for every Chinese pair: `en→zh` and `zh→en` both threw
"No Firefox Translations model found".

Fix narrowly: introduce a tiny BCP47_LANG_ALIASES map with the one
known mismatch (`zh → zh-Hans`) and apply it when building the
record filter. Filename handling is unchanged — Firefox's
`record.name` already uses the ISO short form
(`model.enzh.intgemm.alphas.bin`), so files land on disk under the
same names QVAC expects.

Verified against the live catalog: the matching `enzh` v2.2 model
bytes (SHA256 `4e5accc1…2157c`, 43849787 B) match the S3 upload at
`s3://tether-ai-dev/qvac_models_compiled/bergamot/bergamot-enzh/2025-12-18/`
exactly.
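
A byte-level comparison like this one is typically done by comparing size and SHA256 digest of both copies. A minimal sketch using Node's built-in `crypto` module; `sha256Hex`, `sameBytes`, and the sample buffers are hypothetical and stand in for the catalog attachment and the S3 object:

```javascript
const { createHash } = require('node:crypto')

// Hex-encoded SHA256 digest of a Buffer.
function sha256Hex (bytes) {
  return createHash('sha256').update(bytes).digest('hex')
}

// Two copies are treated as identical when size and digest both match.
function sameBytes (a, b) {
  return a.length === b.length && sha256Hex(a) === sha256Hex(b)
}

// Illustrative buffers standing in for the two downloads.
const fromCatalog = Buffer.from('example model bytes')
const fromS3 = Buffer.from('example model bytes')
const truncated = Buffer.from('example model')

console.log(sameBytes(fromCatalog, fromS3))
console.log(sameBytes(fromCatalog, truncated))
```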

Add unit tests covering the normalization, the unchanged
passthrough for non-aliased codes, and the existing CJK split-vocab
filename convention to pin the contract.
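
The filename contract those tests pin could look roughly like this. The sketch is not the project's test file: it assumes names of the form `vocab.<pair>.spm` for a combined vocabulary and `srcvocab.<pair>.spm` / `trgvocab.<pair>.spm` for split vocabularies. Only the `model.<pair>.intgemm.alphas.bin` form is confirmed by the text above; `expectedVocabFiles` and `expectedModelFile` are hypothetical helper names.

```javascript
// Hypothetical helper. The `.spm` names are an assumed convention.
function expectedVocabFiles (pair, { splitVocab }) {
  return splitVocab
    ? [`srcvocab.${pair}.spm`, `trgvocab.${pair}.spm`]
    : [`vocab.${pair}.spm`]
}

// Model file name in the ISO short form quoted in the commit message.
function expectedModelFile (pair) {
  return `model.${pair}.intgemm.alphas.bin`
}

// enzh uses split vocab; zhen uses a single combined vocab.
const enzhFiles = [expectedModelFile('enzh'), ...expectedVocabFiles('enzh', { splitVocab: true })]
const zhenFiles = [expectedModelFile('zhen'), ...expectedVocabFiles('zhen', { splitVocab: false })]
console.log(enzhFiles)
console.log(zhenFiles)
```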
olyasir force-pushed the QVAC-18157-bergamot-fixes branch from 508b431 to 9a32da8 May 15, 2026 09:42
github-actions Bot commented May 15, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

olyasir commented May 15, 2026

/review

olyasir added the verify label May 15, 2026

olyasir commented May 15, 2026

/review

olyasir added the verified label (Authorize secrets / label-gate in PR workflows) May 15, 2026
olyasir merged commit a718b05 into main May 15, 2026
51 of 54 checks passed
olyasir deleted the QVAC-18157-bergamot-fixes branch May 15, 2026 15:08
@github-actions

❌ E2E Mobile Test Results - Android

Overall Status: FAILED
Device Farm Result: UNKNOWN
Platform: Android
Addon: @qvac/translation-nmtcpp
PR: #2075
Commit: bd49985

Test Summary

| Metric | Count |
|---|---|
| Total Tests | 0 |
| ✅ Passed | 0 |
| ❌ Failed | 0 |
| ⏭️ Skipped | 0 |

Automated E2E mobile testing powered by AWS Device Farm
Tests located in: test/mobile/

@github-actions

❌ E2E Mobile Test Results - iOS

Overall Status: FAILED
Device Farm Result: UNKNOWN
Platform: iOS
Addon: @qvac/translation-nmtcpp
PR: #2075
Commit: bd49985

Test Summary

| Metric | Count |
|---|---|
| Total Tests | 0 |
| ✅ Passed | 0 |
| ❌ Failed | 0 |
| ⏭️ Skipped | 0 |

Automated E2E mobile testing powered by AWS Device Farm
Tests located in: test/mobile/

Proletter pushed a commit that referenced this pull request May 24, 2026
…Settings zh lookup (#2075)

* QVAC-18157 feat[mod]: refresh and extend bergamot manifest

* QVAC-18157 fix[api]: normalize zh to zh-Hans when matching Firefox Remote Settings
