QVAC-18421 test[skiplog]: add e2e regression tests for Bergamot vocab cache invalidation #2004

Merged
Victor-Rodzko merged 2 commits into main from test/qvac-18421-bergamot-cache-reload-e2e
May 13, 2026

@Victor-Rodzko Victor-Rodzko commented May 12, 2026

🎯 What problem does this PR solve?

  • Regression coverage for QVAC-18420 — for bidirectional Bergamot pairs (e.g. fr↔en) the shared vocab file vocab.<pair>.spm was silently re-downloaded on every loadModel call. Root cause: deduplicateModels dropped one of two byte-identical registry entries, then getModelByPath() returned undefined, expectedSize collapsed to 0, and validateCachedFile wiped the cached vocab.
  • The bug was caught by users rather than by our test suite.
  • Existing unit tests (update-models-dedup, nmtcpp-resolve-vocab) cover the fix in isolation but don't exercise the end-to-end symptom — silent re-download through the live server path.
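
The failure chain described above can be reproduced in miniature. The sketch below is a simplified model, not the project's code: `dedupeByHash`, `findByPath`, and `isCachedFileValid` are hypothetical stand-ins for `deduplicateModels`, `getModelByPath()`, and `validateCachedFile`, whose real signatures are not shown in this PR. The registry paths, hash, and file size are made up, and hash-based deduplication is an assumption.

```typescript
// Hypothetical, simplified registry entry; field names are assumptions.
interface RegistryEntry {
  path: string;      // one registry path per translation direction
  sha256: string;    // identical for the shared, byte-identical vocab file
  sizeBytes: number;
}

// Assumed behavior: keep only the first entry seen for each content hash.
function dedupeByHash(entries: RegistryEntry[]): RegistryEntry[] {
  const seen = new Set<string>();
  return entries.filter((e) => {
    if (seen.has(e.sha256)) return false;
    seen.add(e.sha256);
    return true;
  });
}

function findByPath(entries: RegistryEntry[], path: string): RegistryEntry | undefined {
  return entries.find((e) => e.path === path);
}

// Assumed behavior: a cached file survives only if its size matches the expected size.
function isCachedFileValid(cachedSize: number, expectedSize: number): boolean {
  return cachedSize === expectedSize;
}

const vocabRegistry: RegistryEntry[] = [
  { path: "bergamot-fr-en/vocab.fren.spm", sha256: "abc", sizeBytes: 800000 },
  { path: "bergamot-en-fr/vocab.fren.spm", sha256: "abc", sizeBytes: 800000 },
];

const dedupedRegistry = dedupeByHash(vocabRegistry);           // second entry is dropped
const droppedEntry = findByPath(dedupedRegistry, "bergamot-en-fr/vocab.fren.spm");
const expectedSize = droppedEntry?.sizeBytes ?? 0;             // lookup miss collapses to 0
const keptInCache = isCachedFileValid(800000, expectedSize);   // size mismatch, so the file is discarded

console.log({ found: droppedEntry !== undefined, expectedSize, keptInCache });
```

In this model the cached file is rejected on every load for the direction whose entry was dropped, which matches the re-download symptom reported in QVAC-18420.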

📝 How does it solve it?

  • Adds 2 e2e tests in tests-qvac covering the shared-vocab branch (vocab.<pair>.spm):
    • translation-bergamot-fr-en-cache-reload
    • translation-bergamot-en-fr-cache-reload
  • Each test does load → unload (Round 1, which warms the cache), then load with onProgress → unload (Round 2, which must be a pure cache hit).
  • Cache-hit detection is platform-agnostic — counts partial-percentage progress events instead of snapshotting ~/.qvac/models via node:fs. A real re-download emits many downloaded < total events; a true cache hit emits at most one final 100% event per file. The test fails if any partial events are seen on Round 2.
  • New TranslationBergamotCacheExecutor lives in tests/shared/executors/ and uses dependency: "none" so ResourceManager evicts in-memory models without touching the on-disk cache.
  • Desktop-only: skipped on mobile via SkipExecutor. The bug is in server-side Bare code that's bit-identical across platforms, so desktop coverage is the source of truth and we save expensive Device Farm cycles for a regression that can't manifest differently on the device.
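
The partial-progress heuristic above can be sketched as a small pure function. The `LoadProgress` shape and the `key` field are assumptions for illustration; the PR only states that a re-download emits events with `downloaded < total` while a cache hit emits at most one final 100% event per file.

```typescript
// Assumed event shape; the executor's real payload may differ.
interface LoadProgress {
  key: string;        // hypothetical per-file identifier
  downloaded: number; // bytes received so far
  total: number;      // total bytes for this file
}

// Collects the files that reported at least one partial event (downloaded < total).
function filesWithPartialProgress(events: LoadProgress[]): Set<string> {
  const touched = new Set<string>();
  for (const e of events) {
    if (e.total > 0 && e.downloaded < e.total) touched.add(e.key);
  }
  return touched;
}

// A re-download reports intermediate progress before reaching 100%.
const redownloadEvents: LoadProgress[] = [
  { key: "vocab", downloaded: 100, total: 400 },
  { key: "vocab", downloaded: 400, total: 400 },
];

// A cache hit reports only the final event.
const cacheHitEvents: LoadProgress[] = [{ key: "vocab", downloaded: 400, total: 400 }];

console.log(filesWithPartialProgress(redownloadEvents).size); // 1
console.log(filesWithPartialProgress(cacheHitEvents).size);   // 0
```

Because the check depends only on the event stream, it needs no filesystem access, which is what makes it usable on platforms where the cache directory cannot be inspected.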

🧪 How was it tested?

  • Local desktop run local-local-1778595279738: 19/19 tests passed, both new tests green (128 ms / 129 ms).
  • Verified via loadModel.ttfb profiler counter (count: 2) that streaming/onProgress was actually wired up on Round 2 — so an empty-events pass reflects a true cache hit, not a silently disabled callback.
  • Server logs confirm both rounds fully validate all 4 companion files (model + lex + vocab + metadata) without any re-download.

… cache invalidation

Adds 2 e2e tests in tests-qvac (translation-bergamot-fr-en-cache-reload,
translation-bergamot-en-fr-cache-reload) covering the QVAC-18420 regression
where shared vocab files for bidirectional Bergamot pairs were silently
re-downloaded on every loadModel call.

Each test does load -> unload (Round 1, warm cache) then load with onProgress
-> unload (Round 2, must be a pure cache hit). Cache-hit detection is
platform-agnostic via partial-percentage progress event counting (no
node:fs snapshots).

Skipped on mobile via SkipExecutor since the bug lives in server-side Bare
code that is bit-identical across platforms.

Co-authored-by: Cursor <cursoragent@cursor.com>

@simon-iribarren simon-iribarren left a comment

The shape, placement, and intent are right — tests-only PR, test[skiplog] tag, tier1+verify both set, no production-code creep. The mobile SkipExecutor comment ("Server-side Bare code path, identical across platforms — desktop coverage is source of truth") is exactly the level of rationale that prevents future drift, and cache-hit detection via onProgress rather than node:fs snapshots is the right call for mobile portability.

The only red CI check is CodeQL / Analyze (python) failing on a github/codeql-action download 429 — this PR has 0 Python files, so it's unambiguously an infra flake; a rerun should clear it.

Three non-blocking nits worth folding in:

1. Round 2 can pass even if onProgress silently stops firing

translation-bergamot-cache-executor.ts asserts touchedKeys.size === 0, but if a future change drops the onProgress wiring on the cache-hit path, round2 is empty and the test returns passed: true with 0 cache-hit notification(s) — the wrong outcome. Suggest a positive lower bound: assert at least N final percentage === 100 events (one per cached companion file: model + lex + vocab + metadata = 4), which directly mirrors your "Server logs confirm both rounds fully validate all 4 companion files" claim in the description.
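
A minimal sketch of the suggested lower bound, assuming each event carries a `percentage` and a per-file `key` (both names are illustrative; only `percentage === 100` is mentioned above). `judgeRound2` is a hypothetical helper, not part of the executor.

```typescript
// Assumed event shape for illustration.
interface PercentEvent {
  key: string;        // hypothetical per-file identifier
  percentage: number; // 0..100
}

interface RoundVerdict {
  passed: boolean;
  reason: string;
}

// Fails on any partial event, and also fails when fewer than
// `expectedFiles` distinct files reported a final 100% event.
function judgeRound2(events: PercentEvent[], expectedFiles: number): RoundVerdict {
  const partialKeys = new Set<string>();
  const finishedKeys = new Set<string>();
  for (const e of events) {
    if (e.percentage < 100) partialKeys.add(e.key);
    else finishedKeys.add(e.key);
  }
  if (partialKeys.size > 0) {
    return { passed: false, reason: `re-download detected for ${Array.from(partialKeys).join(", ")}` };
  }
  if (finishedKeys.size < expectedFiles) {
    return { passed: false, reason: `only ${finishedKeys.size}/${expectedFiles} cache-hit notifications` };
  }
  return { passed: true, reason: `${finishedKeys.size} cache-hit notification(s)` };
}

const fourFinals: PercentEvent[] = ["model", "lex", "vocab", "metadata"].map(
  (key) => ({ key, percentage: 100 }),
);

console.log(judgeRound2([], 4).passed);         // false: the callback never fired
console.log(judgeRound2(fourFinals, 4).passed); // true: one final event per file
```

With this shape, an empty event list is treated as a failure instead of a silent pass, so a dropped `onProgress` wiring surfaces as a red test.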

2. Round 1 has no onProgress callback

There's no positive evidence Round 1 actually exercised the download path. If something later quietly causes Round 1 to be a cache hit (test reordering, a global fixture pre-warms the cache, or someone removes dependency: "none"), Round 2 trivially passes for the wrong reason and we lose the regression coverage. Two options: (a) attach onProgress to Round 1 too and assert round1.length >= round2.length (Round 1 is at least as active as Round 2) — self-validating regardless of cache state at entry; (b) document explicitly that the regression fires regardless of Round 1's cache state.
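
Option (a) can be illustrated with an in-memory stand-in for the loader. `FakeLoader` is entirely hypothetical and only mimics the behavior described in this PR: an uncached file emits a partial event followed by a final one, while a cached file emits the final event only.

```typescript
type Percent = { key: string; percentage: number };
type OnProgress = (e: Percent) => void;

// Hypothetical loader with an in-memory "disk cache"; not the real SDK.
class FakeLoader {
  private cached = new Set<string>();

  load(fileKeys: string[], onProgress: OnProgress): void {
    for (const key of fileKeys) {
      if (!this.cached.has(key)) {
        onProgress({ key, percentage: 50 }); // simulated partial download event
        this.cached.add(key);
      }
      onProgress({ key, percentage: 100 }); // final event, emitted on hit or miss
    }
  }
}

// Runs one load round and returns every event it emitted.
function recordRound(loader: FakeLoader, fileKeys: string[]): Percent[] {
  const events: Percent[] = [];
  loader.load(fileKeys, (e) => events.push(e));
  return events;
}

const companionFiles = ["model", "lex", "vocab", "metadata"];
const fakeLoader = new FakeLoader();
const round1 = recordRound(fakeLoader, companionFiles); // cold: 2 events per file
const round2 = recordRound(fakeLoader, companionFiles); // warm: 1 event per file

console.log(round1.length, round2.length);   // 8 4
console.log(round1.length >= round2.length); // true
```

If the cache were already warm at entry, both rounds would emit four events and the inequality would still hold, which is why the comparison is self-validating regardless of the starting state.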

3. Pattern overlap with TranslationExecutor is fragile

/^translation-(indictrans|bergamot|llm|salamandra|afriquegemma)-/ also matches translation-bergamot-fr-en-cache-reload. The fix (register TranslationBergamotCacheExecutor first + first-match-wins + inline comment) is correct, but the next person to alphabetize the handler list silently breaks it. Cheapest hardening: tighten TranslationExecutor.pattern with a negative lookahead — /^translation-(...)-(?!.*cache-reload)/ — so the dispatchers are mutually exclusive at the regex level.
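
The effect of the negative lookahead can be checked in isolation. In the sketch below, the `TranslationExecutor` pattern is the one quoted above with the suggested lookahead added; the cache executor's pattern and the `Handler` shape are assumptions, since neither appears in this conversation.

```typescript
interface Handler {
  name: string;
  pattern: RegExp;
}

// Deliberately listed in the "wrong" order to show that order no longer matters.
const handlers: Handler[] = [
  {
    name: "TranslationExecutor",
    pattern: /^translation-(indictrans|bergamot|llm|salamandra|afriquegemma)-(?!.*cache-reload)/,
  },
  {
    name: "TranslationBergamotCacheExecutor",
    pattern: /^translation-bergamot-.*-cache-reload$/, // hypothetical pattern
  },
];

// Returns the names of every handler whose pattern matches the test id.
function matchingHandlers(testId: string): string[] {
  return handlers.filter((h) => h.pattern.test(testId)).map((h) => h.name);
}

console.log(matchingHandlers("translation-bergamot-fr-en-cache-reload"));
// [ 'TranslationBergamotCacheExecutor' ]
console.log(matchingHandlers("translation-bergamot-fr-en"));
// [ 'TranslationExecutor' ]
```

Each test id matches exactly one handler here, so dispatch no longer depends on registration order or on first-match-wins behavior.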

Micro-nits

  • estimatedDurationMs: 180000 is 3 min for a test you measured at 128 ms locally — probably fine for a cold Round 1 on a slow runner, but 30–60s is plenty of slack and won't mask a hang.
  • A one-line comment in translation-bergamot-cache-tests.ts explaining why dependency: "none" (so ConsumerBase doesn't pre-warm the cache before the test even starts) would help a future reader not "fix" it.

Solid regression coverage for QVAC-18420. Approving: CodeQL Python flake aside, none of the above is blocking.

github-actions Bot commented May 13, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

kinsta Bot commented May 13, 2026

Preview deployments for qvac-docs-staging ⚡️

| Status | Branch preview | Commit preview |
| --- | --- | --- |
| 🔁 Deploying... | N/A | N/A |

Commit: bacf19bb2516b71fc97eaf55161e5c93187bcfac

Deployment ID: 870eef2c-eb1f-4bda-9f53-5e0d896d3c01

Static site name: qvac-docs-staging-fazwv

@Victor-Rodzko Victor-Rodzko merged commit 56690bd into main May 13, 2026
23 checks passed
@Victor-Rodzko Victor-Rodzko deleted the test/qvac-18421-bergamot-cache-reload-e2e branch May 13, 2026 18:10
Proletter pushed a commit that referenced this pull request May 24, 2026
… cache invalidation (#2004)


3 participants