QVAC-18156 fix: deterministic decoding for LLM translate by olyasir · Pull Request #1808 · tetherto/qvac

olyasir · 2026-04-29T14:31:55Z

Summary

Force greedy decoding with a fixed seed and bounded output length on every Salamandra LLM translate() call so output is reproducible and runaway generations cannot blow ctx_size on the next call.
One call-site change in packages/sdk/server/bare/ops/translate.ts. Prompt template, NMT branch, and AfriqueGemma routing are unchanged.

Background

With @qvac/llm-llamacpp@0.17.x, calling translate() against Salamandra (loaded with no decoding params) intermittently produced one of three failure modes on the same input:

- verbatim source echo ("Hello, how are you today?" returned untranslated),
- "Translation in Spanish:" preambles,
- processPromptImpl: context overflow on inputs as small as "bank".

The flake was masked in CI because the smoke contains-any validators still matched a Spanish keyword inside a preamble. Salamandra was loaded with no temp/seed/stop_sequences (compare AfriqueGemma in tests-qvac/.../desktop/consumer.ts:202-216 which sets them at load time), so every call had non-deterministic sampling and no upper bound on output length. With auto KV-cache reuse in the new addon, a long preamble in one call could push the next short call over ctx_size.
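
To make the masking concrete, here is a minimal sketch of how a contains-any check lets a preamble through. The `containsAny` helper and the keyword list are hypothetical stand-ins for illustration, not the actual smoke validator.

```typescript
// Hypothetical stand-in for a "contains-any" validator: it passes as soon as
// the output contains at least one expected keyword (case-insensitive).
function containsAny(output: string, keywords: string[]): boolean {
  const haystack = output.toLowerCase();
  return keywords.some((k) => haystack.includes(k.toLowerCase()));
}

// Assumed keyword list for an en->es greeting test.
const expected = ["hola", "cómo"];

const clean = "¡Hola, ¿cómo te va hoy?";
const withPreamble = "Translation in Spanish: ¡Hola, ¿cómo te va hoy?";
const echo = "Hello, how are you today?";

const results = {
  clean: containsAny(clean, expected), // passes, as intended
  withPreamble: containsAny(withPreamble, expected), // also passes: the flake is hidden
  echo: containsAny(echo, expected), // fails: only this mode is caught
};
```

Under these assumptions only the verbatim echo fails the check, so a run that alternates between clean output and preambles still looks green.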

Fix

In translate.ts, when the model is llamacpp-completion and is not an AFRICAN_* registry entry, pass per-call generationParams to override sampling for that one runJob:

- temp/top_k/top_p collapse to greedy
- repeat_penalty: 1.3 breaks single-token echo loops (greedy "bank" → "bank\nbank\n...")
- seed: 42 pins any residual sampling
- predict: 256 caps output so a runaway can't accumulate KV state

stop_sequences is load-time only in the addon (transformed to reverse_prompt in plugin.ts:53), so it can't be added here; the deterministic sampling + bounded predict cover the same ground.
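
The override described in this section could be written roughly as follows. Only the `repeat_penalty`, `seed`, and `predict` values come from this description; the concrete greedy values for `temp`, `top_k`, and `top_p` (0, 1, 1) are assumptions, and the local interface is a stand-in for the addon's real type.

```typescript
// Local stand-in for the six per-call sampling fields named in this PR.
// The real type lives in the addon; this shape is an assumption.
interface TranslateGenerationParamsSketch {
  temp: number;
  top_k: number;
  top_p: number;
  repeat_penalty: number;
  seed: number;
  predict: number;
}

const TRANSLATE_PARAMS_SKETCH: TranslateGenerationParamsSketch = {
  temp: 0, // assumed value: zero temperature picks the most likely token
  top_k: 1, // assumed value: keep only the single best candidate
  top_p: 1, // assumed value: no nucleus truncation
  repeat_penalty: 1.3, // from the PR: breaks "bank\nbank\n..." echo loops
  seed: 42, // from the PR: pins any residual sampling
  predict: 256, // from the PR: caps output length for the call
};
```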

AfriqueGemma is preserved as-is

Earlier iterations of this PR tried to dispatch by language pair (afriquePrompt = isAfrican(from) || isAfrican(to)) but that flag is silently always false for the codes the smoke tests pass: AFRICAN_LANGUAGES_MAP is keyed by FLORES codes ("swh_Latn") while the tests pass ISO codes ("sw"). Dispatching by model name (entry.local.name?.startsWith("AFRICAN_")) is the only currently correct discriminator.
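
A small sketch of the mismatch, assuming the map is consulted by plain key lookup. The map contents, the signature of `shouldSkipPerCallSampling`, and the model name `AFRICAN_EXAMPLE` are hypothetical; only the key format and the `startsWith("AFRICAN_")` check are taken from this description.

```typescript
// Hypothetical excerpt: the real AFRICAN_LANGUAGES_MAP has more entries and
// its value shape is not shown in this PR. Only the FLORES-style key matters.
const AFRICAN_LANGUAGES_MAP: Record<string, string> = {
  swh_Latn: "Swahili",
};

// Language-pair check, assumed to be a plain key lookup.
function isAfrican(code: string): boolean {
  return Object.prototype.hasOwnProperty.call(AFRICAN_LANGUAGES_MAP, code);
}

// Model-name check: the discriminator this PR settles on.
function shouldSkipPerCallSampling(name: string | undefined): boolean {
  return name?.startsWith("AFRICAN_") ?? false;
}

const byPair = isAfrican("sw") || isAfrican("en"); // false: "sw" is not a key
const byFlores = isAfrican("swh_Latn"); // true, but the tests never pass this code
const byName = shouldSkipPerCallSampling("AFRICAN_EXAMPLE"); // true
```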

For AFRICAN_* models we fall back to model.run(input) with no override — byte-identical to the call shape before this PR. AfriqueGemma's load-time modelConfig (stop_sequences: ["\n"], repeat_penalty: 1) keeps driving its decoding. Local repro on @qvac/llm-llamacpp@0.17.1 shows AfriqueGemma's translate output is already malformed on main (sw→en returns garbage tokens, never the expected English keywords), independent of this PR. Tracking it down is out of scope for QVAC-18156 — the smoke for translation-afriquegemma-sw-en is gated behind the verify label so it has not been visible on PR CI.
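
The two call shapes can be sketched as below. The `ModelSketch` and `RunOptionsSketch` types, the `runTranslate` helper, and the model names are hypothetical; the greedy fields are left out of the override because their values are not stated here.

```typescript
interface RunOptionsSketch {
  generationParams?: Record<string, number>;
}

interface ModelSketch {
  run(input: string, opts?: RunOptionsSketch): Promise<string>;
}

// Partial override: only the values quoted in this PR description.
const OVERRIDE: RunOptionsSketch = {
  generationParams: { repeat_penalty: 1.3, seed: 42, predict: 256 },
};

function runTranslate(
  model: ModelSketch,
  modelName: string | undefined,
  input: string,
): Promise<string> {
  const isAfricanModel = modelName?.startsWith("AFRICAN_") ?? false;
  // AFRICAN_* models keep the single-argument call so load-time config wins.
  return isAfricanModel ? model.run(input) : model.run(input, OVERRIDE);
}

// Fake model that records the options each call received.
const calls: Array<RunOptionsSketch | undefined> = [];
const fake: ModelSketch = {
  run(_input, opts) {
    calls.push(opts);
    return Promise.resolve("ok");
  },
};

void runTranslate(fake, "AFRICAN_EXAMPLE", "habari"); // records undefined
void runTranslate(fake, "SALAMANDRA_EXAMPLE", "bank"); // records OVERRIDE
```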

Verification

Local repro on @qvac/llm-llamacpp@0.17.1 (the version main targets), 30 calls per run (streaming + en-es + context, 10 iterations each), Salamandra Q4 loaded with no custom config (matching the smoke consumer):

| Test | Before | After |
| --- | --- | --- |
| translation-salamandra-streaming | 10/10 pass (varying outputs, occasional preamble) | 10/10 identical `"¡Hola, ¿cómo te va hoy?"` |
| translation-salamandra-en-es | 8/10 pass + 2× verbatim source echo | 10/10 identical `"¡Hola, ¿cómo te va hoy?"` |
| translation-salamandra-context | 5/10 pass + 2× ctx-overflow + 3× echo | 10/10 identical `"bank\nbanco"` |

AfriqueGemma was repro'd with the same per-call shape main uses (model.run(input) no opts) — output identical to running against unmodified main, confirming this PR does not regress it.
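
A determinism check of the kind summarised in the table might look like the sketch below. The helper names are hypothetical, and the two arrays are illustrative placeholders shaped like the en-es row, not the recorded run data.

```typescript
// Collapse repeated outputs to the distinct strings that occurred.
function distinctOutputs(outputs: string[]): string[] {
  return outputs.filter((o, i) => outputs.indexOf(o) === i);
}

// A run counts as deterministic when every iteration produced the same string.
function isDeterministic(outputs: string[]): boolean {
  return outputs.length > 0 && distinctOutputs(outputs).length === 1;
}

// Placeholder "after" run: ten identical outputs.
const after: string[] = [];
for (let i = 0; i < 10; i++) after.push("¡Hola, ¿cómo te va hoy?");

// Placeholder "before" run: eight translations plus two verbatim echoes.
const before = after
  .slice(0, 8)
  .concat(["Hello, how are you today?", "Hello, how are you today?"]);

const summary = {
  before: isDeterministic(before),
  after: isDeterministic(after),
};
```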

Test plan

- CI smoke (test-e2e-smoke) goes green for translation-salamandra-streaming, translation-salamandra-en-es, translation-salamandra-context on iOS, Android, and Desktop runners.
- AfriqueGemma translation tests behave the same as on main (the shouldSkipPerCallSampling guard preserves the pre-PR call shape for AFRICAN_* models).
- Existing NMT translation tests (Bergamot, IndicTrans, NMTcpp) still pass (path is bypassed by canonicalModelType === llamacppCompletion guard).

Linked

- Asana: QVAC-18156
- Out of scope: separate AfriqueGemma garbage-output investigation.

Force greedy decoding with a fixed seed and bounded output length on every LLM translate call (non-African branch) so output is reproducible across calls and runaway generations cannot blow ctx_size on the next call.

Background: with @qvac/llm-llamacpp 0.17.x, calling `translate()` against Salamandra (loaded with no decoding params) intermittently produced verbatim source echo, "Translation in Spanish:" preambles, or `processPromptImpl: context overflow` on tiny inputs like "bank". The flake was non-deterministic across runs on the same input, masked in the smoke suite by `contains-any` validators that still matched a Spanish keyword inside a preamble.

The change is one call site: when the model is llamacpp-completion and the prompt is not the AfriqueGemma path, pass per-call generationParams overriding sampling for that runJob:

- temp/top_k/top_p collapse to greedy
- repeat_penalty: 1.3 breaks single-token echo loops (e.g. greedy "bank" -> "bank\nbank\n...")
- seed: 42 pins any residual sampling
- predict: 256 caps output so a runaway can't accumulate KV state

Prompt template, NMT branch, and African branch are unchanged. AfriqueGemma is loaded with its own deterministic config + stop_sequences already, so we skip the override there.

Verified locally on @qvac/llm-llamacpp 0.17.1 with 30 calls (streaming + en-es + context, 10 iterations each):

- before: 23/30 pass; failures: 2 echoes (en-es), 2 ctx-overflows and 3 echoes (context)
- after: 30/30 pass, all outputs identical across iterations

Pull the per-call sampling overrides for LLM translate out of the call site into a top-of-file constant with a comment that explains the purpose of each field. No behavior change — values are identical to the previous commit. Adding a third translate-friendly LLM model later still goes through this single constant unless it needs different sampling, in which case it would warrant a small profile lookup keyed on model family. That restructure is deferred until a concrete second profile lands.

simon-iribarren

Thanks for the thorough write-up and repro table — the determinism story is convincing. A few things to address before merge:

CI

validate-pr is failing: [api] tag requires a fenced code block in the body. (https://github.com/tetherto/qvac/actions/runs/25116034864/job/73603194812)

API Surface & Tagging

The diff doesn't touch any exported types, schemas, IPC channels, or HTTP endpoints — it changes internal default sampling. Per commit-and-pr-format.mdc, [api] is reserved for surface changes. My recommendation is to drop the [api] tag and retitle to:

QVAC-18156 fix: deterministic decoding for LLM translate

That also makes the validator pass without contortions. If you do want to keep [api] because consumer-visible output changes, please add a fenced code block illustrating the (unchanged) call shape so the validator is satisfied.

Blocking — description contradicts the code

The Summary says:

> Force greedy decoding [...] on every LLM translate() call (non-African branch)
> Prompt template, NMT branch, and AfriqueGemma branch are unchanged.

But the diff passes per-call generationParams to both branches (afriquePrompt ? AFRIQUEGEMMA_… : SALAMANDRA_…). The African path is touched. Either the Summary needs updating, or — better — see the next point.

Blocking — AfriqueGemma now silently overrides consumer load-time tuning

AFRIQUEGEMMA_TRANSLATE_GENERATION_PARAMS mirrors today's load-time tuning in tests-qvac/tests/desktop/consumer.ts:208-214 exactly, but the SDK now silently overrides whatever the consumer chose at load time. If anyone retunes AfriqueGemma at the call site, the SDK's hardcoded values silently win and the tuning has no effect.

The cleanest fix is to not pass per-call params on the African branch at all — keep await model.run(input) for the AfriqueGemma path. Its load-time config + stop_sequences:["\n"] already pin everything you need; the description's intent matches that, and it keeps the change scoped to the actual problem (Salamandra). That removes the duplication, the override risk, and the description/code mismatch in one move.

Strong suggestions

- Rename SALAMANDRA_TRANSLATE_GENERATION_PARAMS. It's used for every non-African LLM (your own test plan calls this out). LLM_TRANSLATE_GENERATION_PARAMS (or DEFAULT_LLM_TRANSLATE_GENERATION_PARAMS) avoids future grep confusion.
- Inline cast on model.run.bind(model): see inline comment.
- Record<string, number> is too loose. Define a typed alias for the six fields so a typo on a key is caught at compile time.

What's good

- Targeted single-file change, scoped to the actual flake
- Excellent repro evidence (3 tests × 10 iterations × before/after table)
- Comments explain why, not what; that's what we want
- Test plan explicitly notes the non-Salamandra LLM cases get the same treatment

Once the validator is green, the description matches the code, and the AfriqueGemma branch is settled, I'm happy to approve.

github-actions · 2026-04-29T15:48:16Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

Apply the per-call deterministic-decoding override only to non-AFRICAN_* LLM models. AfriqueGemma's load-time `modelConfig` carries `stop_sequences: ["\n"]` and `repeat_penalty: 1`, and these values must not be overridden mid-call: with `repeat_penalty: 1.3`, the addon penalises "\n" and the stop never fires, so generation runs all the way to `predict` and produces non-translation output. The earlier attempt to dispatch by `afriquePrompt` (language-pair-derived) silently did nothing for the actual AfriqueGemma traffic: `isAfrican("sw")` returns `false` because `AFRICAN_LANGUAGES_MAP` is keyed by FLORES codes (`"swh_Latn"`), not the ISO codes the smoke tests pass. This commit dispatches by model name (entry.local.name starts with "AFRICAN_") and falls back to `model.run(input)` with no override — identical to the pre-fix call shape — so AfriqueGemma's behaviour is preserved exactly as it is on main. A latent AfriqueGemma garbage-output issue exists at HEAD regardless of this PR; that is out of scope. The constant is renamed `LLM_TRANSLATE_GENERATION_PARAMS` since it now applies to every non-skipped LLM, not just Salamandra.
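
As a rough illustration of how a repeat penalty can demote a stop token, here is a sketch of the llama.cpp-style rule (divide positive logits of already-seen tokens by the penalty, multiply negative ones). Whether the addon applies exactly this rule is an assumption, and the logit values and candidate tokens are made up for the example.

```typescript
interface Candidate {
  token: string;
  logit: number;
}

// Penalise tokens that already appeared in the context.
function applyRepeatPenalty(
  cands: Candidate[],
  seen: string[],
  penalty: number,
): Candidate[] {
  return cands.map((c) => {
    if (seen.indexOf(c.token) === -1) return c;
    const logit = c.logit > 0 ? c.logit / penalty : c.logit * penalty;
    return { token: c.token, logit };
  });
}

// Greedy decoding: pick the candidate with the highest logit.
function greedyPick(cands: Candidate[]): string {
  return cands.reduce((best, c) => (c.logit > best.logit ? c : best)).token;
}

// Made-up next-token scores; "\n" has already appeared in the context.
const candidates: Candidate[] = [
  { token: "\n", logit: 2.0 },
  { token: " y", logit: 1.8 },
];
const seenTokens = ["\n"];

const pickAtLoadTimePenalty = greedyPick(applyRepeatPenalty(candidates, seenTokens, 1));
const pickAtOverridePenalty = greedyPick(applyRepeatPenalty(candidates, seenTokens, 1.3));
```

With a penalty of 1 the newline stays on top and the stop sequence can fire; at 1.3 its score drops to about 1.54, below the competing token, so generation continues.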

Pull `RunOptions` and `GenerationParams` from `@qvac/llm-llamacpp` and use them in place of the loose `Record<string, number>` cast in `translate()`. Define a `LlmTranslateGenerationParams` alias as the specific subset of `GenerationParams` we set per call (six fields, required) so a typo on any of them is a compile error. The cast on `model.run.bind(model)` now references the addon's `RunOptions` shape directly, which keeps us protected if the addon's option shape changes. No behaviour change.

olyasir · 2026-04-29T19:52:30Z

Thanks for the careful review. Going through your points in order:

CI / title

Renamed the PR title to QVAC-18156 fix: deterministic decoding for LLM translate (no [api]). The diff is purely an internal default-sampling change, so dropping the tag is the honest fix; validate-pr should now pass.

Blocking — description vs. code (both points)

Both addressed in commit 22357aa6 (force-pushed earlier):

- AFRIQUEGEMMA_TRANSLATE_GENERATION_PARAMS removed.
- The African branch falls back to plain await model.run(input) with no per-call override; load-time consumer config stays authoritative.
- Dispatch is now by model name (entry.local.name?.startsWith("AFRICAN_")) instead of the language-pair afriquePrompt flag; see the new shouldSkipPerCallSampling helper. The previous isAfrican(from) || isAfrican(to) check was a no-op for actual AfriqueGemma traffic anyway, since AFRICAN_LANGUAGES_MAP is keyed by FLORES codes ("swh_Latn") but the smoke tests pass ISO codes ("sw").

Strong suggestions

- Rename to LLM_TRANSLATE_GENERATION_PARAMS: done in 22357aa6.
- Inline cast on model.run.bind(model): softened in 94eaf431; it now uses the addon's own RunOptions type (no inline { generationParams: Record<string, number> }).
- Typed alias for the six fields: also in 94eaf431, as LlmTranslateGenerationParams = Required<Pick<GenerationParams, ...>>. Typos on any of the six keys are now a compile error.
- The bind+cast itself remains: AnyModel.run is intentionally erased to a single-arg signature in Omit<BaseInference, "addon">, and we re-narrow the same way completion-stream.ts:182 does. Widening AnyModel itself would touch @qvac/infer-base typing and feels out of scope.
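
A sketch of the typed alias, assuming the six picked keys are the ones listed in the Fix section. The local `GenerationParams` interface is a stand-in for the addon's real type, and the `temp`, `top_k`, and `top_p` values are assumed.

```typescript
// Stand-in for the addon's GenerationParams (assumed shape, all optional).
interface GenerationParams {
  temp?: number;
  top_k?: number;
  top_p?: number;
  repeat_penalty?: number;
  seed?: number;
  predict?: number;
}

// Exactly the six fields set per call, all required.
type LlmTranslateGenerationParams = Required<
  Pick<
    GenerationParams,
    "temp" | "top_k" | "top_p" | "repeat_penalty" | "seed" | "predict"
  >
>;

const params: LlmTranslateGenerationParams = {
  temp: 0, // assumed greedy value
  top_k: 1, // assumed greedy value
  top_p: 1, // assumed greedy value
  repeat_penalty: 1.3,
  seed: 42,
  predict: 256,
};

// The loose type accepts a misspelt key without complaint.
const loose: Record<string, number> = { repeat_penalti: 1.3 };
```

Writing `repeat_penalti` in the `params` literal would be rejected by the compiler as an unknown property, and leaving out any of the six keys would be rejected because of `Required<>`.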

Side note

Worth flagging separately: AfriqueGemma's GPU output on Intel Vulkan (Iris Xe iGPU) is producing garbage on this hardware regardless of addon version (verified down to 0.12.4, the version Alok ran his FLORES benchmarks on in QVAC-13540). Salamandra on the same machine is fine. This is environmental — Vulkan/Mesa × Gemma3-4B-Q4_K_M — and predates the entire 6-week addon series. Independent of this PR; will file a separate ticket against qvac-lib-infer-llamacpp-llm with the bisect evidence.

Updated PR contents:

22357aa6  refactor: skip per-call sampling override for AfriqueGemma
dcff9b93  refactor: extract LLM translate generation params into named constant
549fe629  fix[api]: deterministic decoding for LLM translate     ← original
94eaf431  refactor: tighten typing on per-call generation params  ← new

Ready for another look when you have a minute.

NamelsKing · 2026-04-30T07:52:52Z

/review

* fix[api]: deterministic decoding for LLM translate
* refactor: extract LLM translate generation params into named constant
* refactor[api]: skip per-call sampling override for AfriqueGemma
* refactor: tighten typing on per-call generation params

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>

olyasir requested review from a team as code owners April 29, 2026 14:31

olyasir had a problem deploying to release April 29, 2026 14:32 — with GitHub Actions Failure

olyasir had a problem deploying to release April 29, 2026 14:38 — with GitHub Actions Failure

olyasir had a problem deploying to release April 29, 2026 14:50 — with GitHub Actions Failure

simon-iribarren requested changes Apr 29, 2026

Comment thread packages/sdk/server/bare/ops/translate.ts Outdated

Comment thread packages/sdk/server/bare/ops/translate.ts Outdated

Comment thread packages/sdk/server/bare/ops/translate.ts

olyasir force-pushed the fix/qvac-18156-salamandra-translate-determinism branch from 7c7c5f7 to 22357aa Compare April 29, 2026 15:54

olyasir had a problem deploying to release April 29, 2026 15:55 — with GitHub Actions Failure

olyasir had a problem deploying to release April 29, 2026 15:56 — with GitHub Actions Failure

olyasir changed the title ~~QVAC-18156 fix[api]: deterministic decoding for LLM translate~~ QVAC-18156 fix: deterministic decoding for LLM translate Apr 29, 2026

olyasir had a problem deploying to release April 29, 2026 19:51 — with GitHub Actions Failure

simon-iribarren approved these changes Apr 30, 2026

NamelsKing approved these changes Apr 30, 2026

Merge branch 'main' into fix/qvac-18156-salamandra-translate-determinism

60e5276

NamelsKing had a problem deploying to release April 30, 2026 07:34 — with GitHub Actions Failure

Merge branch 'main' into fix/qvac-18156-salamandra-translate-determinism

557a144

NamelsKing had a problem deploying to release April 30, 2026 07:53 — with GitHub Actions Failure

NamelsKing merged commit a03ce64 into main Apr 30, 2026
19 of 20 checks passed

NamelsKing deleted the fix/qvac-18156-salamandra-translate-determinism branch April 30, 2026 07:57

NamelsKing had a problem deploying to release April 30, 2026 07:57 — with GitHub Actions Failure

NamelsKing merged 6 commits into main from fix/qvac-18156-salamandra-translate-determinism

3 participants
