fix(registry): drop phantom Bergamot pairs and stale enja vocab entry #1919

Merged
Alok-Ranjan23 merged 4 commits into main from fix/QVAC-17892-bergamot-manifest-cleanup
May 6, 2026

@Alok-Ranjan23 Alok-Ranjan23 commented May 6, 2026

Drop phantom Bergamot pairs and stale enja vocab entry

  1. Three pairs in the manifest don't exist upstream at all:

    • "fire": Mozilla has no re language anywhere in the translations-models collection. The manifest description was "Bergamot ... fi-re" but fi only pairs with en upstream.
    • "enzh": Mozilla doesn't ship an English→Chinese pair.
    • "zhen": Mozilla doesn't ship Chinese→English either.

    None of these entries can ever sync because the bytes were never created.

  2. One stale entry references a file Mozilla doesn't ship for that specific pair:

    • bergamot-enja/.../vocab.enja.spm: Mozilla only publishes srcvocab.enja.spm + trgvocab.enja.spm for enja (the CJK split-vocab convention). The split entries are already in the manifest correctly; this is a leftover combined entry that should never have been added.

Net change: -14 manifest entries (13 phantom + 1 stale), zero entries replaced, zero entries added. All 666 remaining bergamot-* entries correspond to real files Mozilla actually publishes (or metadata.json sidecars that our team uploads alongside).
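
The net change above is a pure removal, which can be sketched as a filter over the manifest. This is an illustrative sketch only: the entry shape (`pair`, `file`) and the `pruneManifest` helper are assumptions, not the structure of `models.prod.json` or any script in the repository.

```javascript
// Hypothetical entry shape: { pair: 'fire', file: 'metadata.json' }.
// Splits entries into those kept and those dropped, without replacing any.
function pruneManifest(entries, phantomPairs, staleFiles) {
  const kept = [];
  const dropped = [];
  for (const e of entries) {
    const isPhantom = phantomPairs.has(e.pair);
    const isStale = staleFiles.has(`${e.pair}/${e.file}`);
    (isPhantom || isStale ? dropped : kept).push(e);
  }
  return { kept, dropped };
}

// Toy data; the real manifest is far larger.
const { kept, dropped } = pruneManifest(
  [
    { pair: 'fire', file: 'metadata.json' },
    { pair: 'enja', file: 'vocab.enja.spm' },
    { pair: 'enja', file: 'srcvocab.enja.spm' },
    { pair: 'enja', file: 'trgvocab.enja.spm' },
  ],
  new Set(['fire', 'enzh', 'zhen']),
  new Set(['enja/vocab.enja.spm'])
);
console.log(`dropped ${dropped.length}, kept ${kept.length}`);
```

Keeping the dropped list next to the kept list makes it easy to confirm that the removal count matches the stated net change before committing.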

What this PR deliberately does NOT touch:

  • bergamot-enko/.../vocab.enko.spm — also flagged in the failing sync output. Mozilla DOES ship combined vocab.enko.spm upstream alongside split, so the manifest entry is theoretically valid. Failure is more likely a date / S3-upload mismatch on our side.

Validation:

  • JSON syntax check: clean
  • Re-running the Mozilla cross-check after the edit: 0 phantom
    pairs and 0 vocab-naming mismatches remaining
  • 274 manifest entries (~half) are exact filename matches against
    Mozilla upstream
  • 91 remaining "wrong_filename" hits are all metadata.json (expected,
    custom internal file)

Refs QVAC-17892, PR #1785.
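
A cross-check of the kind referenced in the validation list might look like the sketch below. The `fromLang`, `toLang` and `name` record fields are the ones named elsewhere in this thread; the manifest entry shape and the `crossCheck` helper are assumptions made for illustration, not the script that was actually run.

```javascript
// Hypothetical manifest entry shape: { pair: 'enja', file: 'srcvocab.enja.spm' }.
// Upstream records are assumed to expose fromLang, toLang and name.
function crossCheck(manifestEntries, upstreamRecords) {
  // Index upstream filenames by concatenated language pair, e.g. "enja".
  const upstream = new Map();
  for (const r of upstreamRecords) {
    const pair = r.fromLang + r.toLang;
    if (!upstream.has(pair)) upstream.set(pair, new Set());
    upstream.get(pair).add(r.name);
  }

  const report = { phantomPairs: new Set(), wrongFilename: [], matched: [] };
  for (const e of manifestEntries) {
    const names = upstream.get(e.pair);
    if (!names) {
      report.phantomPairs.add(e.pair); // pair has no upstream records at all
    } else if (!names.has(e.file)) {
      report.wrongFilename.push(e); // pair exists, but not under this filename
    } else {
      report.matched.push(e);
    }
  }
  return report;
}

const report = crossCheck(
  [
    { pair: 'enja', file: 'srcvocab.enja.spm' },
    { pair: 'enja', file: 'vocab.enja.spm' },
    { pair: 'enja', file: 'metadata.json' },
    { pair: 'fire', file: 'metadata.json' },
  ],
  [
    { fromLang: 'en', toLang: 'ja', name: 'srcvocab.enja.spm' },
    { fromLang: 'en', toLang: 'ja', name: 'trgvocab.enja.spm' },
  ]
);
console.log([...report.phantomPairs], report.wrongFilename.map((e) => e.file));
```

Note that a check keyed on raw language tags reports a pair as phantom whenever the upstream catalog spells a tag differently from the manifest, so a "phantom" result is best confirmed against a second source before entries are dropped.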

…vocab entry

Follow-up patch to PR #1785. The DHT registry sync after that PR landed
flagged 5 failing Bergamot S3 entries (visible in Slack #1785 thread on
May 6: bergamot-enja/.../vocab.enja.spm, bergamot-enko/.../vocab.enko.spm,
and 3× bergamot-fire/...). Cross-checking models.prod.json against
Mozilla's Firefox-Translations Remote Settings — the same upstream the
runtime bergamot-model-fetcher.js queries — shows two distinct issues:

  1. Three pairs in the manifest don't exist upstream at all:
       - "fire" — Mozilla has no `re` language anywhere in the
         translations-models collection. The manifest description was
         "Bergamot ... fi-re" but `fi` only pairs with `en` upstream.
       - "enzh" — Mozilla doesn't ship an English→Chinese pair.
       - "zhen" — Mozilla doesn't ship Chinese→English either.
     None of these entries can ever sync because the bytes were never
     created.

  2. One stale entry references a file Mozilla doesn't ship for that
     specific pair:
       - bergamot-enja/.../vocab.enja.spm — Mozilla only publishes
         srcvocab.enja.spm + trgvocab.enja.spm for enja (the CJK
         split-vocab convention). The split entries are already in
         the manifest correctly; this is a leftover combined entry
         that should never have been added.

Net change: -14 manifest entries (13 phantom + 1 stale), zero entries
replaced, zero entries added. All 666 remaining bergamot-* entries
correspond to real files Mozilla actually publishes (or `metadata.json`
sidecars that our team uploads alongside).

What this PR deliberately does NOT touch:

  - bergamot-enko/.../vocab.enko.spm — also flagged in the failing
    sync output. Mozilla DOES ship combined `vocab.enko.spm`
    upstream alongside split, so the manifest entry is theoretically
    valid. Failure is more likely a date / S3-upload mismatch on our
    side. Waiting on Yury Samarin's S3 verification before deciding
    whether to drop the combined entry or keep it. Independent fix.
  - 90 `metadata.json` entries that Mozilla doesn't ship. These are
    custom files our team uploads to S3 alongside Mozilla's bytes
    (Yury's screenshot confirmed metadata.json IS present in our S3
    for enja). Manifest is correct.

Validation:

  - JSON syntax check: clean
  - Re-running the Mozilla cross-check after the edit: 0 phantom
    pairs and 0 vocab-naming mismatches remaining
  - 274 manifest entries (~half) are exact filename matches against
    Mozilla upstream
  - 91 remaining "wrong_filename" hits are all metadata.json (expected,
    custom internal file)

Refs QVAC-17892, PR #1785.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Alok-Ranjan23 Alok-Ranjan23 requested a review from a team as a code owner May 6, 2026 11:08
@Alok-Ranjan23 Alok-Ranjan23 changed the title from "QVAC-17892 fix(registry): drop phantom Bergamot pairs and stale enja vocab entry" to "fix(registry): drop phantom Bergamot pairs and stale enja vocab entry" May 6, 2026
@Alok-Ranjan23 Alok-Ranjan23 requested a review from yuranich May 6, 2026 11:13
@Alok-Ranjan23 Alok-Ranjan23 requested a review from iancris May 6, 2026 11:14
Companion to the previous commit (9e83643) on this branch. Yury
confirmed via `aws s3 ls` that bergamot-enko/2025-12-18/ contains
srcvocab.enko.spm + trgvocab.enko.spm and no combined vocab.enko.spm.
This matches Mozilla's current upstream — they migrated enko from
combined-vocab to split-vocab in their Remote Settings on 2025-07-22,
and our team mirrored the post-migration layout to S3.

Net change in this commit:
  - drop bergamot-enko/2025-12-18/vocab.enko.spm
  - add bergamot-enko/2025-12-18/srcvocab.enko.spm
  - add bergamot-enko/2025-12-18/trgvocab.enko.spm

Same shape as the existing enja split entries already in this file
(cloned from them for consistency: same description, engine,
licenseId, tags, link).
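
Cloning the enja split entries for enko could be sketched as follows. Only `description`, `engine`, `licenseId`, `tags` and `link` are field names taken from this commit message; the `source` key, the placeholder directory for enja and the sample values are hypothetical.

```javascript
// Sketch: copy each template entry unchanged except for its path, which is
// rewritten to the target pair and directory. `source` is a hypothetical key
// standing in for whatever field holds the S3 path.
function cloneSplitVocabEntries(templates, fromPair, toPair, dir) {
  return templates.map((t) => {
    const file = t.source.split('/').pop().replace(`.${fromPair}.`, `.${toPair}.`);
    return { ...t, source: `${dir}/${file}` };
  });
}

// Placeholder values; EXAMPLE-DATE is not the real enja directory.
const enjaSplit = [
  { source: 'bergamot-enja/EXAMPLE-DATE/srcvocab.enja.spm', engine: 'example-engine', tags: ['example'] },
  { source: 'bergamot-enja/EXAMPLE-DATE/trgvocab.enja.spm', engine: 'example-engine', tags: ['example'] },
];

const enkoSplit = cloneSplitVocabEntries(enjaSplit, 'enja', 'enko', 'bergamot-enko/2025-12-18');
console.log(enkoSplit.map((e) => e.source));
```

Because the spread copies every other field verbatim, the cloned entries stay consistent with the templates, which is the stated intent of this commit.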

Refs QVAC-17892, PR #1785, this PR #1919.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot commented May 6, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@yuranich

yuranich commented May 6, 2026

/review

@Alok-Ranjan23 Alok-Ranjan23 requested a review from a team as a code owner May 6, 2026 18:49
@Alok-Ranjan23 Alok-Ranjan23 merged commit bc19601 into main May 6, 2026
14 checks passed
@Alok-Ranjan23 Alok-Ranjan23 deleted the fix/QVAC-17892-bergamot-manifest-cleanup branch May 6, 2026 19:33
olyasir added a commit that referenced this pull request May 15, 2026
Three changes to packages/registry-server/data/models.prod.json
covering follow-ups from QVAC-17892 / PR #1785:

1. Date bump for enbg / enhr / ennl
   S3 has fresh base-memory uploads at /2026-04-28/ for these three
   pairs (alongside the enit/esen refresh that #1785 fixed). Manifest
   still pointed at the 2025-12-18 tiny uploads, so registry consumers
   were getting tiny bytes while Firefox Remote Settings served
   base-memory — same divergence shape #1785 patched.

2. Restore enzh and zhen
   PR #1919 dropped both pairs as "phantom" after a Remote Settings
   cross-check failed, but the lookup compared `zh` against
   `r.fromLang/r.toLang` while Firefox catalogs Chinese under the
   BCP 47 tag `zh-Hans`. Both pairs are in
   mozilla/firefox-translations-models/models/base-memory/, S3 has the
   bytes (2025-12-18), and the SHA256s match Firefox Remote Settings
   v2.2 (enzh) and v2.1 (zhen) byte-for-byte. enzh follows the CJK
   split-vocab convention (srcvocab + trgvocab); zhen uses single
   combined vocab.

3. Add 9 new bergamot pairs
   These were uploaded to S3 on 2026-04-28 but had no manifest entries
   yet. Each `link` field is set per the actual S3
   metadata.architecture so the manifest reflects reality from day one
   (the root cause of QVAC-17892 was a stale link claim diverging from
   S3 bytes).

     - enbs, ensr, enth, envi, hbsen, then    base-memory (31561787 B)
     - ennb, enno, noen                       tiny        (17141051 B)
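
Setting each `link` from the S3 `metadata.architecture` value implies a check of roughly this shape. It is a sketch under assumptions: the `pair` key, the example URLs and the `findLinkMismatches` helper are hypothetical, and only the `/base-memory/` and `/tiny/` path segments come from this message.

```javascript
// Hypothetical shapes: manifest entries expose `pair` and `link`;
// `s3Architecture` maps a pair to the architecture read from its S3 metadata.
function findLinkMismatches(entries, s3Architecture) {
  const mismatches = [];
  for (const e of entries) {
    const actual = s3Architecture[e.pair];
    if (!actual) continue; // no S3 metadata known for this pair
    const m = /\/(base-memory|tiny)\//.exec(e.link);
    const claimed = m ? m[1] : null;
    if (claimed !== actual) mismatches.push({ pair: e.pair, claimed, actual });
  }
  return mismatches;
}

// Example URLs are placeholders, not real upstream links.
const mismatches = findLinkMismatches(
  [
    { pair: 'enbs', link: 'https://example.invalid/models/base-memory/enbs/' },
    { pair: 'ennb', link: 'https://example.invalid/models/base-memory/ennb/' },
  ],
  { enbs: 'base-memory', ennb: 'tiny' }
);
console.log(mismatches);
```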

Deferred from this PR:

  - 49 link-only mismatches (manifest claims `/base-memory/` but S3 +
    Mozilla only ever had `/tiny/`). Functionally already-tiny on
    both lanes; no runtime divergence. Will follow up as a doc-only
    PR.
  - enzh_hant / zh_hanten: model + lex + metadata present in S3 but no
    vocab files uploaded. Will follow up once vocabs are uploaded or a
    shared-vocab story with enzh/zhen is decided.
  - SDK regen (packages/sdk/models/registry/models.ts): generated from
    the synced Hyperdrive registry, not from this JSON file. Follow-up
    PR after this one merges and `sync-staging` runs.

Validation:

  - node packages/registry-server/scripts/validate-models-json.js
    --file=packages/registry-server/data/models.prod.json
    → ✓ Valid: 760 model(s)
  - npm run test:unit (registry-server) → 43/43 pass
olyasir added a commit that referenced this pull request May 15, 2026
…Settings zh lookup (#2075)

* QVAC-18157 feat[mod]: refresh and extend bergamot manifest

* QVAC-18157 fix[api]: normalize zh to zh-Hans when matching Firefox Remote Settings

`bergamot-model-fetcher.js` queries Firefox Remote Settings (the
catalog the in-browser translation feature uses) and filters records
by `r.fromLang === srcLang && r.toLang === dstLang`. The catalog
exposes Chinese under the BCP 47 tag `zh-Hans` while callers pass the
ISO 639-1 short code `zh` (matching QVAC's manifest and Mozilla's
`firefox-translations-models` repo). The strict equality silently
failed for every Chinese pair: `en→zh` and `zh→en` both threw
"No Firefox Translations model found".

Fix narrowly: introduce a tiny BCP47_LANG_ALIASES map with the one
known mismatch (`zh → zh-Hans`) and apply it when building the
record filter. Filename handling is unchanged — Firefox's
`record.name` already uses the ISO short form
(`model.enzh.intgemm.alphas.bin`), so files land on disk under the
same names QVAC expects.
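
A minimal sketch of the normalization described above, assuming the catalog records expose `fromLang`, `toLang` and `name` as stated. The `BCP47_LANG_ALIASES` name comes from this commit message; the `toCatalogLang` and `findRecords` helpers are illustrative and may not match the code in `bergamot-model-fetcher.js`.

```javascript
// One known mismatch: callers pass `zh`, the catalog uses `zh-Hans`.
const BCP47_LANG_ALIASES = new Map([['zh', 'zh-Hans']]);

// Map a caller-supplied code to the catalog's tag; other codes pass through.
function toCatalogLang(code) {
  return BCP47_LANG_ALIASES.get(code) ?? code;
}

// Apply the alias only when building the record filter, so filenames
// taken from `record.name` are left untouched.
function findRecords(records, srcLang, dstLang) {
  const from = toCatalogLang(srcLang);
  const to = toCatalogLang(dstLang);
  return records.filter((r) => r.fromLang === from && r.toLang === to);
}

const records = [
  { fromLang: 'en', toLang: 'zh-Hans', name: 'model.enzh.intgemm.alphas.bin' },
  { fromLang: 'en', toLang: 'ja', name: 'srcvocab.enja.spm' },
];
console.log(findRecords(records, 'en', 'zh').map((r) => r.name));
```

Keeping the alias in a lookup table rather than a special case in the filter means a further mismatch, if one turns up, is a one-line addition.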

Verified against the live catalog: the matching `enzh` v2.2 model
bytes (SHA256 `4e5accc1…2157c`, 43849787 B) match the S3 upload at
`s3://tether-ai-dev/qvac_models_compiled/bergamot/bergamot-enzh/2025-12-18/`
exactly.
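
A byte-for-byte comparison like the one above can be approximated by checking size and SHA-256 digest with Node's built-in `crypto` module. The `matchesExpected` helper and the sample data are assumptions; the digest quoted in this message is elided, so the sketch uses the well-known SHA-256 test vector for the string `abc` rather than the real model hash.

```javascript
const { createHash } = require('node:crypto');

// Returns true only when both the byte length and the SHA-256 digest match.
function matchesExpected(bytes, expected) {
  if (bytes.length !== expected.size) return false;
  const digest = createHash('sha256').update(bytes).digest('hex');
  return digest === expected.sha256.toLowerCase();
}

// Sample data only: the standard SHA-256 test vector for "abc".
const sample = Buffer.from('abc');
const expected = {
  size: 3,
  sha256: 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad',
};
console.log(matchesExpected(sample, expected));
```

Checking the size first is a cheap way to reject a mismatched upload before hashing a file of tens of megabytes.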

Add unit tests covering the normalization, the unchanged
passthrough for non-aliased codes, and the existing CJK split-vocab
filename convention to pin the contract.
Proletter pushed a commit that referenced this pull request May 24, 2026
…#1919)

* QVAC-17892 fix(registry): drop phantom Bergamot pairs and stale enja vocab entry

* QVAC-17892 fix(registry): replace combined enko vocab with split src/trg

---------

Co-authored-by: Alok-Ranjan23 <Alok-Ranjan23@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yury Samarin <yuri.a.samarin@gmail.com>
Proletter pushed a commit that referenced this pull request May 24, 2026
…Settings zh lookup (#2075)

* QVAC-18157 feat[mod]: refresh and extend bergamot manifest

* QVAC-18157 fix[api]: normalize zh to zh-Hans when matching Firefox Remote Settings

3 participants