fix(registry): drop phantom Bergamot pairs and stale enja vocab entry #1919
Merged
…vocab entry

Follow-up patch to PR #1785. The DHT registry sync after that PR landed flagged 5 failing Bergamot S3 entries (visible in Slack #1785 thread on May 6: bergamot-enja/.../vocab.enja.spm, bergamot-enko/.../vocab.enko.spm, and 3× bergamot-fire/...).

Cross-checking models.prod.json against Mozilla's Firefox-Translations Remote Settings (the same upstream the runtime bergamot-model-fetcher.js queries) shows two distinct issues:

1. Three pairs in the manifest don't exist upstream at all:
   - "fire": Mozilla has no `re` language anywhere in the translations-models collection. The manifest description was "Bergamot ... fi-re" but `fi` only pairs with `en` upstream.
   - "enzh": Mozilla doesn't ship an English→Chinese pair.
   - "zhen": Mozilla doesn't ship Chinese→English either.

   None of these entries can ever sync because the bytes were never created.

2. One stale entry references a file Mozilla doesn't ship for that specific pair:
   - bergamot-enja/.../vocab.enja.spm: Mozilla only publishes srcvocab.enja.spm + trgvocab.enja.spm for enja (the CJK split-vocab convention). The split entries are already in the manifest correctly; this is a leftover combined entry that should never have been added.

Net change: -14 manifest entries (13 phantom + 1 stale), zero entries replaced, zero entries added. All 666 remaining bergamot-* entries correspond to real files Mozilla actually publishes (or `metadata.json` sidecars that our team uploads alongside).

What this PR deliberately does NOT touch:
- bergamot-enko/.../vocab.enko.spm: also flagged in the failing sync output. Mozilla DOES ship combined `vocab.enko.spm` upstream alongside split, so the manifest entry is theoretically valid. Failure is more likely a date / S3-upload mismatch on our side. Waiting on @yury Samarin's S3 verification before deciding whether to drop the combined entry or keep it. Independent fix.
- 90 `metadata.json` entries that Mozilla doesn't ship. These are custom files our team uploads to S3 alongside Mozilla's bytes (Yury's screenshot confirmed metadata.json IS present in our S3 for enja). Manifest is correct.

Validation:
- JSON syntax check: clean
- Re-running the Mozilla cross-check after the edit: 0 phantom pairs and 0 vocab-naming mismatches remaining
- 274 manifest entries (~half) are exact filename matches against Mozilla upstream
- 91 remaining "wrong_filename" hits are all metadata.json (expected, custom internal file)

Refs QVAC-17892, PR #1785.

Co-authored-by: Cursor <cursoragent@cursor.com>
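The cross-check described in this commit message can be pictured roughly as follows. This is a minimal sketch, not the script that was actually run: the function name `crossCheck`, the entry shape (`pair`, `filename`), and the `Map` of upstream filenames are all assumptions made for illustration. It also reports `metadata.json` in its own bucket, whereas the validation notes above count those hits under "wrong_filename".

```javascript
// Hypothetical sketch: classify manifest entries against an upstream file listing.
// Assumes both sides key language pairs the same way (e.g. "enja"); if upstream
// uses a different code for the same language, this check reports a false phantom.
function crossCheck(manifestEntries, upstreamFiles, internalFiles = new Set(['metadata.json'])) {
  const report = { ok: [], phantomPair: [], wrongFilename: [], internal: [] };
  for (const entry of manifestEntries) {
    const files = upstreamFiles.get(entry.pair);
    if (!files) {
      // The whole pair is unknown upstream.
      report.phantomPair.push(entry);
    } else if (files.has(entry.filename)) {
      // Exact filename match against upstream.
      report.ok.push(entry);
    } else if (internalFiles.has(entry.filename)) {
      // Sidecar files uploaded by the team, never shipped upstream.
      report.internal.push(entry);
    } else {
      // Pair exists upstream but this filename does not.
      report.wrongFilename.push(entry);
    }
  }
  return report;
}

// Illustrative data only; the real manifest and upstream listing are much larger.
const upstream = new Map([
  ['enja', new Set(['srcvocab.enja.spm', 'trgvocab.enja.spm'])],
]);
const manifest = [
  { pair: 'enja', filename: 'srcvocab.enja.spm' },
  { pair: 'enja', filename: 'vocab.enja.spm' },
  { pair: 'enja', filename: 'metadata.json' },
  { pair: 'fire', filename: 'metadata.json' },
];
const report = crossCheck(manifest, upstream);
console.log(report.phantomPair.length, report.wrongFilename.length, report.internal.length);
```

A check like this is only as reliable as the pair keys it compares, so a pair reported as phantom is worth confirming against the upstream repository before its entries are dropped.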
Companion to the previous commit (9e83643) on this branch.

Yury confirmed via `aws s3 ls` that bergamot-enko/2025-12-18/ contains srcvocab.enko.spm + trgvocab.enko.spm and no combined vocab.enko.spm. This matches Mozilla's current upstream: they migrated enko from combined-vocab to split-vocab in their Remote Settings on 2025-07-22, and our team mirrored the post-migration layout to S3.

Net change in this commit:
- drop bergamot-enko/2025-12-18/vocab.enko.spm
- add bergamot-enko/2025-12-18/srcvocab.enko.spm
- add bergamot-enko/2025-12-18/trgvocab.enko.spm

Same shape as the existing enja split entries already in this file (cloned from them for consistency: same description, engine, licenseId, tags, link).

Refs QVAC-17892, PR #1785, this PR #1919.

Co-authored-by: Cursor <cursoragent@cursor.com>
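The drop-one, add-two edit above can be sketched as a small transformation over the manifest array. This is an illustration under assumptions, not the actual change: the helper name `splitVocabEntries` and the `source` key holding the S3 path are hypothetical, and the template values are placeholders rather than real manifest fields.

```javascript
// Hypothetical sketch: replace a combined vocab entry with split src/trg entries.
// `source` stands in for whatever key the real manifest uses for the S3 path.
function splitVocabEntries(entries, pair, date, template) {
  const prefix = `bergamot-${pair}/${date}/`;
  const combined = `${prefix}vocab.${pair}.spm`;
  // Drop the combined entry, keep everything else untouched.
  const kept = entries.filter((e) => e.source !== combined);
  // Clone the shared fields from the template and point each clone at its own file.
  const added = ['srcvocab', 'trgvocab'].map((kind) => ({
    ...template,
    source: `${prefix}${kind}.${pair}.spm`,
  }));
  return [...kept, ...added];
}

// Placeholder template standing in for an existing enja split-vocab entry.
const enjaTemplate = { source: 'placeholder', engine: 'example-engine', licenseId: 'example-license' };
const before = [
  { source: 'bergamot-enko/2025-12-18/vocab.enko.spm', engine: 'example-engine', licenseId: 'example-license' },
];
const after = splitVocabEntries(before, 'enko', '2025-12-18', enjaTemplate);
console.log(after.map((e) => e.source));
```

Cloning from an existing split-vocab entry keeps the shared fields identical across pairs, so only the path differs between the two new entries.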
iancris
approved these changes
May 6, 2026
yuranich
approved these changes
May 6, 2026
olyasir
added a commit
that referenced
this pull request
May 15, 2026
…Settings zh lookup (#2075)

* QVAC-18157 feat[mod]: refresh and extend bergamot manifest

Three changes to packages/registry-server/data/models.prod.json covering follow-ups from QVAC-17892 / PR #1785:

1. Date bump for enbg / enhr / ennl

   S3 has fresh base-memory uploads at /2026-04-28/ for these three pairs (alongside the enit/esen refresh that #1785 fixed). Manifest still pointed at the 2025-12-18 tiny uploads, so registry consumers were getting tiny bytes while Firefox Remote Settings served base-memory, the same divergence shape #1785 patched.

2. Restore enzh and zhen

   PR #1919 dropped both pairs as "phantom" after a Remote Settings cross-check failed, but the lookup compared `zh` against `r.fromLang/r.toLang` while Firefox catalogs Chinese under the BCP 47 tag `zh-Hans`. Both pairs are in mozilla/firefox-translations-models/models/base-memory/, S3 has the bytes (2025-12-18), and the SHA256s match Firefox Remote Settings v2.2 (enzh) and v2.1 (zhen) byte-for-byte. enzh follows the CJK split-vocab convention (srcvocab + trgvocab); zhen uses single combined vocab.

3. Add 9 new bergamot pairs

   These were uploaded to S3 on 2026-04-28 but had no manifest entries yet. Each `link` field is set per the actual S3 metadata.architecture so the manifest reflects reality from day one (the root cause of QVAC-17892 was a stale link claim diverging from S3 bytes).
   - enbs, ensr, enth, envi, hbsen, then: base-memory (31561787 B)
   - ennb, enno, noen: tiny (17141051 B)

Deferred from this PR:
- 49 link-only mismatches (manifest claims `/base-memory/` but S3 + Mozilla only ever had `/tiny/`). Functionally already-tiny on both lanes; no runtime divergence. Will follow up as a doc-only PR.
- enzh_hant / zh_hanten: model + lex + metadata present in S3 but no vocab files uploaded. Will follow up once vocabs are uploaded or a shared-vocab story with enzh/zhen is decided.
- SDK regen (packages/sdk/models/registry/models.ts): generated from the synced Hyperdrive registry, not from this JSON file. Follow-up PR after this one merges and `sync-staging` runs.

Validation:
- node packages/registry-server/scripts/validate-models-json.js --file=packages/registry-server/data/models.prod.json → ✓ Valid: 760 model(s)
- npm run test:unit (registry-server) → 43/43 pass

* QVAC-18157 fix[api]: normalize zh to zh-Hans when matching Firefox Remote Settings

`bergamot-model-fetcher.js` queries Firefox Remote Settings (the catalog the in-browser translation feature uses) and filters records by `r.fromLang === srcLang && r.toLang === dstLang`. The catalog exposes Chinese under the BCP 47 tag `zh-Hans` while callers pass the ISO 639-1 short code `zh` (matching QVAC's manifest and Mozilla's `firefox-translations-models` repo). The strict equality silently failed for every Chinese pair: `en→zh` and `zh→en` both threw "No Firefox Translations model found".

Fix narrowly: introduce a tiny BCP47_LANG_ALIASES map with the one known mismatch (`zh → zh-Hans`) and apply it when building the record filter. Filename handling is unchanged: Firefox's `record.name` already uses the ISO short form (`model.enzh.intgemm.alphas.bin`), so files land on disk under the same names QVAC expects.

Verified against the live catalog: the matching `enzh` v2.2 model bytes (SHA256 `4e5accc1…2157c`, 43849787 B) match the S3 upload at `s3://tether-ai-dev/qvac_models_compiled/bergamot/bergamot-enzh/2025-12-18/` exactly.

Add unit tests covering the normalization, the unchanged passthrough for non-aliased codes, and the existing CJK split-vocab filename convention to pin the contract.
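The alias fix described above might look roughly like this. It is a sketch under assumptions rather than the code that landed: `BCP47_LANG_ALIASES` and the `fromLang`/`toLang` filter come from the commit message, while the helper names `toCatalogLang` and `findRecords`, the use of a `Map`, and the sample records are illustrative only.

```javascript
// The one known mismatch named in the commit message: callers pass `zh`,
// the catalog stores `zh-Hans`. A Map avoids accidental hits on Object.prototype keys.
const BCP47_LANG_ALIASES = new Map([['zh', 'zh-Hans']]);

// Hypothetical helper: map a caller-facing code to the tag the catalog uses.
// Codes without an alias pass through unchanged.
function toCatalogLang(code) {
  return BCP47_LANG_ALIASES.get(code) ?? code;
}

// Hypothetical helper: apply the aliases when building the record filter.
function findRecords(records, srcLang, dstLang) {
  const from = toCatalogLang(srcLang);
  const to = toCatalogLang(dstLang);
  return records.filter((r) => r.fromLang === from && r.toLang === to);
}

// Made-up sample records; only the enzh model filename is taken from the source.
const records = [
  { name: 'model.enzh.intgemm.alphas.bin', fromLang: 'en', toLang: 'zh-Hans' },
  { fromLang: 'zh-Hans', toLang: 'en' },
  { fromLang: 'en', toLang: 'fi' },
];
console.log(findRecords(records, 'en', 'zh').map((r) => r.name));
console.log(findRecords(records, 'en', 'fi').length);
```

Normalizing only at the filter boundary leaves `record.name` untouched, which is consistent with the note above that filenames already use the short ISO form.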
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…#1919)

* QVAC-17892 fix(registry): drop phantom Bergamot pairs and stale enja vocab entry
* QVAC-17892 fix(registry): replace combined enko vocab with split src/trg

Co-authored-by: Alok-Ranjan23 <Alok-Ranjan23@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yury Samarin <yuri.a.samarin@gmail.com>
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
…Settings zh lookup (#2075)