QVAC-18157 feat[mod]: refresh bergamot manifest + fix Firefox Remote Settings zh lookup#2075
Merged
Three changes to packages/registry-server/data/models.prod.json covering follow-ups from QVAC-17892 / PR #1785:

1. Date bump for enbg / enhr / ennl

   S3 has fresh base-memory uploads at /2026-04-28/ for these three pairs (alongside the enit/esen refresh that #1785 fixed). Manifest still pointed at the 2025-12-18 tiny uploads, so registry consumers were getting tiny bytes while Firefox Remote Settings served base-memory, the same divergence shape #1785 patched.

2. Restore enzh and zhen

   PR #1919 dropped both pairs as "phantom" after a Remote Settings cross-check failed, but the lookup compared `zh` against `r.fromLang/r.toLang` while Firefox catalogs Chinese under the BCP 47 tag `zh-Hans`. Both pairs are in mozilla/firefox-translations-models/models/base-memory/, S3 has the bytes (2025-12-18), and the SHA256s match Firefox Remote Settings v2.2 (enzh) and v2.1 (zhen) byte-for-byte. enzh follows the CJK split-vocab convention (srcvocab + trgvocab); zhen uses a single combined vocab.

3. Add 9 new bergamot pairs

   These were uploaded to S3 on 2026-04-28 but had no manifest entries yet. Each `link` field is set per the actual S3 metadata.architecture so the manifest reflects reality from day one (the root cause of QVAC-17892 was a stale link claim diverging from S3 bytes).

   - enbs, ensr, enth, envi, hbsen, then: base-memory (31561787 B)
   - ennb, enno, noen: tiny (17141051 B)

Deferred from this PR:

- 49 link-only mismatches (manifest claims `/base-memory/` but S3 + Mozilla only ever had `/tiny/`). Functionally already-tiny on both lanes; no runtime divergence. Will follow up as a doc-only PR.
- enzh_hant / zh_hanten: model + lex + metadata present in S3 but no vocab files uploaded. Will follow up once vocabs are uploaded or a shared-vocab story with enzh/zhen is decided.
- SDK regen (packages/sdk/models/registry/models.ts): generated from the synced Hyperdrive registry, not from this JSON file. Follow-up PR after this one merges and `sync-staging` runs.

Validation:

- node packages/registry-server/scripts/validate-models-json.js --file=packages/registry-server/data/models.prod.json → ✓ Valid: 760 model(s)
- npm run test:unit (registry-server) → 43/43 pass
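The link-versus-bytes drift described above (a manifest `link` claiming `/base-memory/` while S3 holds `tiny`) could be caught by a small consistency check. The following is only a sketch under assumptions: the entry shape (`pair`, `link`), the `example.invalid` URLs, and the helper names `variantFromLink` and `findLinkMismatches` are hypothetical and are not taken from the registry-server code; only the two variant names and the idea of comparing against S3 `metadata.architecture` come from the commit message.

```javascript
// Hypothetical sketch: entry shape and helper names are assumptions,
// not the registry-server's actual schema or API.
const VARIANTS = ['base-memory', 'tiny'];

// Extract the variant a manifest link claims, based on its path segment.
function variantFromLink(link) {
  const hit = VARIANTS.find((v) => link.includes(`/${v}/`));
  return hit ?? null;
}

// Compare each entry's claimed variant with the architecture recorded in S3.
// `s3Architectures` is a Map from pair name to architecture string.
function findLinkMismatches(entries, s3Architectures) {
  const mismatches = [];
  for (const { pair, link } of entries) {
    const claimed = variantFromLink(link);
    const actual = s3Architectures.get(pair);
    // Pairs with no S3 metadata are skipped rather than reported.
    if (actual !== undefined && claimed !== actual) {
      mismatches.push({ pair, claimed, actual });
    }
  }
  return mismatches;
}

// Usage with made-up entries (URLs are placeholders).
const entries = [
  { pair: 'enbs', link: 'https://example.invalid/models/base-memory/enbs/' },
  { pair: 'ennb', link: 'https://example.invalid/models/base-memory/ennb/' },
];
const s3Architectures = new Map([
  ['enbs', 'base-memory'],
  ['ennb', 'tiny'],
]);
const mismatches = findLinkMismatches(entries, s3Architectures);
console.log(mismatches);
```

With the sample data, only `ennb` is reported, because its link claims `base-memory` while the recorded architecture is `tiny`.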
…mote Settings

`bergamot-model-fetcher.js` queries Firefox Remote Settings (the catalog the in-browser translation feature uses) and filters records by `r.fromLang === srcLang && r.toLang === dstLang`. The catalog exposes Chinese under the BCP 47 tag `zh-Hans` while callers pass the ISO 639-1 short code `zh` (matching QVAC's manifest and Mozilla's `firefox-translations-models` repo). The strict equality silently failed for every Chinese pair: `en→zh` and `zh→en` both threw "No Firefox Translations model found".

Fix narrowly: introduce a tiny BCP47_LANG_ALIASES map with the one known mismatch (`zh → zh-Hans`) and apply it when building the record filter. Filename handling is unchanged: Firefox's `record.name` already uses the ISO short form (`model.enzh.intgemm.alphas.bin`), so files land on disk under the same names QVAC expects.

Verified against the live catalog: the matching `enzh` v2.2 model bytes (SHA256 `4e5accc1…2157c`, 43849787 B) match the S3 upload at `s3://tether-ai-dev/qvac_models_compiled/bergamot/bergamot-enzh/2025-12-18/` exactly.

Add unit tests covering the normalization, the unchanged passthrough for non-aliased codes, and the existing CJK split-vocab filename convention to pin the contract.
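The alias lookup described in this commit can be sketched roughly as follows. This is an illustration, not the code in `bergamot-model-fetcher.js`: only the `BCP47_LANG_ALIASES` name, the `fromLang` / `toLang` / `name` record fields, and the error text come from the commit message, while the `Map` representation, the helper names `toCatalogLang` and `findRecords`, and the sample records are assumptions.

```javascript
// Hypothetical sketch of the alias normalization; helper names and the
// Map representation are assumptions, not the fetcher's real code.
const BCP47_LANG_ALIASES = new Map([['zh', 'zh-Hans']]);

// Map a caller-supplied short code to the tag the catalog uses.
// Codes without an alias pass through unchanged.
function toCatalogLang(code) {
  return BCP47_LANG_ALIASES.get(code) ?? code;
}

// Filter catalog records for a language pair, normalizing both sides first.
function findRecords(records, srcLang, dstLang) {
  const from = toCatalogLang(srcLang);
  const to = toCatalogLang(dstLang);
  const matches = records.filter((r) => r.fromLang === from && r.toLang === to);
  if (matches.length === 0) {
    throw new Error(`No Firefox Translations model found for ${srcLang}-${dstLang}`);
  }
  return matches;
}

// Made-up records shaped like the fields the commit message mentions.
const sampleRecords = [
  { name: 'model.enzh.intgemm.alphas.bin', fromLang: 'en', toLang: 'zh-Hans' },
  { name: 'model.zhen.intgemm.alphas.bin', fromLang: 'zh-Hans', toLang: 'en' },
  { name: 'model.enfr.intgemm.alphas.bin', fromLang: 'en', toLang: 'fr' },
];

const enzh = findRecords(sampleRecords, 'en', 'zh');
console.log(enzh.map((r) => r.name));
```

Keeping the alias table separate from the filter means filenames are never rewritten: the record's own `name` is used as-is, which matches the commit's note that filename handling is unchanged.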
508b431 to
9a32da8
Compare
Alok-Ranjan23
approved these changes
May 15, 2026
DmitryMalishev
approved these changes
May 15, 2026
GustavoA1604
approved these changes
May 15, 2026
❌ E2E Mobile Test Results - Android
Overall Status: FAILED
❌ E2E Mobile Test Results - iOS
Overall Status: FAILED
Automated E2E mobile testing powered by AWS Device Farm
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…Settings zh lookup (#2075)
* QVAC-18157 feat[mod]: refresh and extend bergamot manifest
* QVAC-18157 fix[api]: normalize zh to zh-Hans when matching Firefox Remote Settings
Summary
Follow-up to QVAC-17892 / PR #1785 (the bergamot S3 ↔ Firefox CDN variant divergence that desktop CI surfaced). This PR closes the runtime-impacting subset of the audit work tracked under QVAC-18157 — three categories of fix grouped into one branch because they share the same root cause (drift between QVAC's manifest, S3, and Firefox Remote Settings).
Manifest (`packages/registry-server/data/models.prod.json`):
- Date bump: `enbg`, `enhr`, `ennl` → `2026-04-28`. Same shape as the enit/esen fix in #1785 (QVAC-17892 fix[ci]: use refreshed base-memory bergamot models for desktop integration tests): S3 has fresh base-memory uploads, the manifest still pointed at 2025-12-18 tiny.
- Restore `enzh` and `zhen`, dropped by PR #1919 (fix(registry): drop phantom Bergamot pairs and stale enja vocab entry). Both pairs exist upstream in `mozilla/firefox-translations-models/models/base-memory/`; S3 bytes match Firefox Remote Settings v2.2/v2.1 SHA256s exactly; the original removal was a false positive caused by querying Remote Settings with `zh` instead of `zh-Hans`.
- Add 9 new pairs: `enbs`, `ensr`, `enth`, `envi`, `hbsen`, `then` (base-memory); `ennb`, `enno`, `noen` (tiny). Each `link` field is set from the actual S3 `metadata.architecture`, not a guess; this is the root-cause fix for the class of bug QVAC-17892 documented.

Fetcher (`packages/translation-nmtcpp/lib/bergamot-model-fetcher.js`):
- Normalize `zh → zh-Hans` when filtering Firefox Remote Settings records, so runtime downloads of Chinese pairs actually find a record. Without this, every `loadModel('en','zh')` call threw `No Firefox Translations model found for en-zh` because Remote Settings catalogs Chinese under the BCP 47 tag.

Audit findings that informed this PR
Full crawl: `s3://tether-ai-dev/qvac_models_compiled/bergamot/` (104 pair prefixes) vs `models.prod.json` (was 91 pairs) vs `mozilla/firefox-translations-models` (`base-memory/`: 42 dirs, `tiny/`: 84 dirs) vs Firefox Remote Settings live catalog (93 unique pairs).

SDK regen follow-up
`packages/sdk/models/registry/models.ts` is generated from the synced Hyperdrive registry, not from `models.prod.json` directly. After this PR merges and the `sync-staging` job in `pr-models-validation-registry-server.yml` pushes the new entries into the registry, a second PR should run `packages/sdk/models/update-models` to regenerate `models.ts` with the new pairs' `sha256Checksum` / `blobCoreKey` / `expectedSize` populated from the live registry (same pattern as PR #1903 followed PR #1785).

Test plan
- `node packages/registry-server/scripts/validate-models-json.js --file=packages/registry-server/data/models.prod.json` → ✓ `Valid: 760 model(s)`
- `npm --prefix packages/registry-server run test:unit` → 43/43 pass
- `npm --prefix packages/translation-nmtcpp run test:unit` → 59/59 pass (4 new tests in `bergamot-model-fetcher.test.js`)
- `npm --prefix packages/translation-nmtcpp run lint` → clean
- `pr-models-validation-registry-server.yml` validate-json job
- `loadModel` for `('en','zh')` on a clean cache successfully downloads from Firefox Remote Settings (Chinese pair end-to-end smoke)
- `models.ts` regen after `sync-staging` runs

Refs