QVAC-18421 test[skiplog]: add e2e regression tests for Bergamot vocab cache invalidation#2004
Conversation
… cache invalidation Adds 2 e2e tests in tests-qvac (translation-bergamot-fr-en-cache-reload, translation-bergamot-en-fr-cache-reload) covering the QVAC-18420 regression where shared vocab files for bidirectional Bergamot pairs were silently re-downloaded on every loadModel call. Each test does load -> unload (Round 1, warm cache) then load with onProgress -> unload (Round 2, must be a pure cache hit). Cache-hit detection is platform-agnostic via partial-percentage progress event counting (no node:fs snapshots). Skipped on mobile via SkipExecutor since the bug lives in server-side Bare code that is bit-identical across platforms. Co-authored-by: Cursor <cursoragent@cursor.com>
simon-iribarren
left a comment
There was a problem hiding this comment.
The shape, placement, and intent are right — tests-only PR, test[skiplog] tag, tier1+verify both set, no production-code creep. The mobile SkipExecutor comment ("Server-side Bare code path, identical across platforms — desktop coverage is source of truth") is exactly the level of rationale that prevents future drift, and cache-hit detection via onProgress rather than node:fs snapshots is the right call for mobile portability.
The only red CI check is CodeQL / Analyze (python) failing on a github/codeql-action download 429 — this PR has 0 Python files, so it's unambiguously an infra flake; a rerun should clear it.
Three non-blocking nits worth folding in:
1. Round 2 can pass even if onProgress silently stops firing
translation-bergamot-cache-executor.ts asserts touchedKeys.size === 0, but if a future change drops the onProgress wiring on the cache-hit path, round2 is empty and the test returns passed: true with 0 cache-hit notification(s) — the wrong outcome. Suggest a positive lower bound: assert at least N final percentage === 100 events (one per cached companion file: model + lex + vocab + metadata = 4), which directly mirrors your "Server logs confirm both rounds fully validate all 4 companion files" claim in the description.
2. Round 1 has no onProgress callback
There's no positive evidence Round 1 actually exercised the download path. If something later quietly causes Round 1 to be a cache hit (test reordering, a global fixture pre-warms the cache, or someone removes dependency: "none"), Round 2 trivially passes for the wrong reason and we lose the regression coverage. Two options: (a) attach onProgress to Round 1 too and assert round1.length >= round2.length (Round 1 is at least as active as Round 2) — self-validating regardless of cache state at entry; (b) document explicitly that the regression fires regardless of Round 1's cache state.
3. Pattern overlap with TranslationExecutor is fragile
/^translation-(indictrans|bergamot|llm|salamandra|afriquegemma)-/ also matches translation-bergamot-fr-en-cache-reload. The fix (register TranslationBergamotCacheExecutor first + first-match-wins + inline comment) is correct, but the next person to alphabetize the handler list silently breaks it. Cheapest hardening: tighten TranslationExecutor.pattern with a negative lookahead — /^translation-(...)-(?!.*cache-reload)/ — so the dispatchers are mutually exclusive at the regex level.
Micro-nits
estimatedDurationMs: 180000is 3 min for a test you measured at 128 ms locally — probably fine for a cold Round 1 on a slow runner, but 30–60s is plenty of slack and won't mask a hang.- A one-line comment in
translation-bergamot-cache-tests.tsexplaining whydependency: "none"(soConsumerBasedoesn't pre-warm the cache before the test even starts) would help a future reader not "fix" it.
Solid regression coverage for QVAC-18420. Approving — CodeQL Python flake aside, none of the above would push me away.
Tier-based Approval Status |
Preview deployments for qvac-docs-staging ⚡️
Commit: Deployment ID: Static site name: |
… cache invalidation (#2004) Adds 2 e2e tests in tests-qvac (translation-bergamot-fr-en-cache-reload, translation-bergamot-en-fr-cache-reload) covering the QVAC-18420 regression where shared vocab files for bidirectional Bergamot pairs were silently re-downloaded on every loadModel call. Each test does load -> unload (Round 1, warm cache) then load with onProgress -> unload (Round 2, must be a pure cache hit). Cache-hit detection is platform-agnostic via partial-percentage progress event counting (no node:fs snapshots). Skipped on mobile via SkipExecutor since the bug lives in server-side Bare code that is bit-identical across platforms. Co-authored-by: Cursor <cursoragent@cursor.com>
🎯 What problem does this PR solve?
fr↔en) the shared vocab filevocab.<pair>.spmwas silently re-downloaded on everyloadModelcall. Root cause:deduplicateModelsdropped one of two byte-identical registry entries, thengetModelByPath()returnedundefined,expectedSizecollapsed to 0, andvalidateCachedFilewiped the cached vocab.update-models-dedup,nmtcpp-resolve-vocab) cover the fix in isolation but don't exercise the end-to-end symptom — silent re-download through the live server path.📝 How does it solve it?
tests-qvaccovering the shared-vocab branch (vocab.<pair>.spm):translation-bergamot-fr-en-cache-reloadtranslation-bergamot-en-fr-cache-reloadload → unload(Round 1, warm cache) thenloadwithonProgress→ unload(Round 2, must be a pure cache hit).~/.qvac/modelsvianode:fs. A real re-download emits manydownloaded < totalevents; a true cache hit emits at most one final 100% event per file. The test fails if any partial events are seen on Round 2.TranslationBergamotCacheExecutorlives intests/shared/executors/and usesdependency: "none"soResourceManagerevicts in-memory models without touching the on-disk cache.SkipExecutor. The bug is in server-side Bare code that's bit-identical across platforms, so desktop coverage is the source of truth and we save expensive Device Farm cycles for a regression that can't manifest differently on the device.🧪 How was it tested?
local-local-1778595279738: 19/19 tests passed, both new tests green (128 ms / 129 ms).loadModel.ttfbprofiler counter (count: 2) that streaming/onProgresswas actually wired up on Round 2 — so an empty-events pass reflects a true cache hit, not a silently disabled callback.