QVAC-18156 fix: deterministic decoding for LLM translate #1808

Merged
NamelsKing merged 6 commits into main from fix/qvac-18156-salamandra-translate-determinism on Apr 30, 2026

Conversation

olyasir commented Apr 29, 2026 (Contributor)

Summary

  • Force greedy decoding with a fixed seed and bounded output length on every Salamandra LLM translate() call so output is reproducible and runaway generations cannot blow ctx_size on the next call.
  • One call-site change in packages/sdk/server/bare/ops/translate.ts. Prompt template, NMT branch, and AfriqueGemma routing are unchanged.

Background

With @qvac/llm-llamacpp@0.17.x, calling translate() against Salamandra (loaded with no decoding params) intermittently produced one of three failure modes on the same input:

  • verbatim source echo ("Hello, how are you today?" returned untranslated),
  • "Translation in Spanish:" preambles,
  • processPromptImpl: context overflow on inputs as small as "bank".

The flake was masked in CI because the smoke contains-any validators still matched a Spanish keyword inside a preamble. Salamandra was loaded with no temp/seed/stop_sequences (compare AfriqueGemma in tests-qvac/.../desktop/consumer.ts:202-216 which sets them at load time), so every call had non-deterministic sampling and no upper bound on output length. With auto KV-cache reuse in the new addon, a long preamble in one call could push the next short call over ctx_size.
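
To illustrate how a keyword-based check can hide this kind of flake, here is a minimal sketch. Both functions are hypothetical and are not the smoke suite's actual validators; the regular expression for the preamble is likewise an assumption based on the "Translation in Spanish:" example above.

```typescript
// Hypothetical contains-any validator: passes if ANY keyword appears in the output.
function containsAny(output: string, keywords: string[]): boolean {
  const haystack = output.toLowerCase();
  return keywords.some((k) => haystack.indexOf(k.toLowerCase()) !== -1);
}

// Hypothetical stricter check: additionally rejects "Translation in <lang>:"
// preambles and verbatim source echo before looking for keywords.
function isCleanTranslation(
  output: string,
  source: string,
  keywords: string[],
): boolean {
  const trimmed = output.trim();
  const hasPreamble = /^translation in \w+:/i.test(trimmed);
  const isEcho = trimmed === source.trim();
  return !hasPreamble && !isEcho && containsAny(trimmed, keywords);
}
```

With an invented output such as `"Translation in Spanish: Hola, ¿cómo estás hoy?"` and the keyword `"hola"`, `containsAny` returns `true` while `isCleanTranslation` returns `false`, which is the masking effect described above.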

Fix

In translate.ts, when the model is llamacpp-completion and is not an AFRICAN_* registry entry, pass per-call generationParams to override sampling for that one runJob:

  • temp/top_k/top_p collapse to greedy
  • repeat_penalty: 1.3 breaks single-token echo loops (greedy "bank" → "bank\nbank\n...")
  • seed: 42 pins any residual sampling
  • predict: 256 caps output so a runaway can't accumulate KV state

stop_sequences is load-time only in the addon (transformed to reverse_prompt in plugin.ts:53), so it can't be added here; the deterministic sampling + bounded predict cover the same ground.
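
The per-call override described above could look roughly like the sketch below. The `repeat_penalty`, `seed`, and `predict` values are taken from the list; the `temp`/`top_k`/`top_p` values are assumptions (one common way to collapse sampling to greedy), and the `GenerationParams`/`RunOptions` interfaces and the `buildRunOptions` helper are hypothetical stand-ins rather than the addon's real types.

```typescript
// Hypothetical stand-ins for the addon's option types; the real ones live in
// @qvac/llm-llamacpp and may have a different shape.
interface GenerationParams {
  temp: number;
  top_k: number;
  top_p: number;
  repeat_penalty: number;
  seed: number;
  predict: number;
}

interface RunOptions {
  generationParams?: GenerationParams;
}

const LLM_TRANSLATE_GENERATION_PARAMS: GenerationParams = {
  temp: 0, // ASSUMED value: no temperature scaling randomness
  top_k: 1, // ASSUMED value: keep only the most likely token
  top_p: 1, // ASSUMED value: nucleus filtering effectively disabled
  repeat_penalty: 1.3, // from the PR: breaks single-token echo loops
  seed: 42, // from the PR: pins any residual sampling
  predict: 256, // from the PR: caps output length
};

// Hypothetical helper: options for one translate() call. Returning undefined
// means "call model.run(input) with no override".
function buildRunOptions(skipPerCallSampling: boolean): RunOptions | undefined {
  return skipPerCallSampling
    ? undefined
    : { generationParams: { ...LLM_TRANSLATE_GENERATION_PARAMS } };
}
```

Copying the constant into a fresh object per call keeps a caller from mutating the shared defaults by accident.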

AfriqueGemma is preserved as-is

Earlier iterations of this PR tried to dispatch by language pair (afriquePrompt = isAfrican(from) || isAfrican(to)) but that flag is silently always false for the codes the smoke tests pass: AFRICAN_LANGUAGES_MAP is keyed by FLORES codes ("swh_Latn") while the tests pass ISO codes ("sw"). Dispatching by model name (entry.local.name?.startsWith("AFRICAN_")) is the only currently correct discriminator.
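
The key mismatch can be reproduced in a few lines. This is a sketch under assumptions: the map below is a one-entry hypothetical excerpt, the body of `isAfrican` is a guess at a plain key lookup, and `RegistryEntry` is a simplified stand-in for the real registry type. Only the `"swh_Latn"` key, the `"sw"` code, and the `startsWith("AFRICAN_")` check come from the text above.

```typescript
// Hypothetical excerpt: the real AFRICAN_LANGUAGES_MAP has more entries and
// its values may differ. Keys are FLORES-style codes.
const AFRICAN_LANGUAGES_MAP: Record<string, string> = {
  swh_Latn: "Swahili",
};

// ASSUMED implementation: a plain key lookup, so ISO codes never match.
function isAfrican(code: string): boolean {
  return Object.prototype.hasOwnProperty.call(AFRICAN_LANGUAGES_MAP, code);
}

// Simplified stand-in for a model registry entry.
interface RegistryEntry {
  local: { name?: string };
}

// Dispatch by model name instead of by language pair.
function shouldSkipPerCallSampling(entry: RegistryEntry): boolean {
  return entry.local.name?.startsWith("AFRICAN_") ?? false;
}
```

Here `isAfrican("sw")` is `false` while `isAfrican("swh_Latn")` is `true`, so a flag derived from the ISO codes the smoke tests pass never fires, whereas the model-name check does not depend on the language codes at all.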

For AFRICAN_* models we fall back to model.run(input) with no override — byte-identical to the call shape before this PR. AfriqueGemma's load-time modelConfig (stop_sequences: ["\n"], repeat_penalty: 1) keeps driving its decoding. Local repro on @qvac/llm-llamacpp@0.17.1 shows AfriqueGemma's translate output is already malformed on main (sw→en returns garbage tokens, never the expected English keywords), independent of this PR. Tracking it down is out of scope for QVAC-18156 — the smoke for translation-afriquegemma-sw-en is gated behind the verify label so it has not been visible on PR CI.

Verification

Local repro on @qvac/llm-llamacpp@0.17.1 (the version main targets), 30 calls per run (streaming + en-es + context, 10 iterations each), Salamandra Q4 loaded with no custom config (matching the smoke consumer):

| Test | Before | After |
| --- | --- | --- |
| translation-salamandra-streaming | 10/10 pass (varying outputs, occasional preamble) | 10/10 identical "¡Hola, ¿cómo te va hoy?" |
| translation-salamandra-en-es | 8/10 pass + 2× verbatim source echo | 10/10 identical "¡Hola, ¿cómo te va hoy?" |
| translation-salamandra-context | 5/10 pass + 2× ctx-overflow + 3× echo | 10/10 identical "bank\nbanco" |
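
A repro of this kind needs a way to decide whether N outputs are "identical". The helper below is a hypothetical sketch of that check, not the script used for the numbers above; it assumes the outputs of the repeated `translate()` calls have already been collected into an array, in call order.

```typescript
interface RunSummary {
  total: number; // number of calls observed
  distinct: string[]; // unique outputs, in first-seen order
  identical: boolean; // true only if every call returned the same string
}

// Summarise the outputs of repeated calls on the same input.
function summarizeRuns(outputs: string[]): RunSummary {
  const distinct = outputs.filter((o, i) => outputs.indexOf(o) === i);
  return {
    total: outputs.length,
    distinct,
    identical: outputs.length > 0 && distinct.length === 1,
  };
}
```

The calls themselves should be issued sequentially rather than in parallel, since the failure described in Background depends on state carried from one call to the next.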

AfriqueGemma was repro'd with the same per-call shape main uses (model.run(input) no opts) — output identical to running against unmodified main, confirming this PR does not regress it.

Test plan

  • CI smoke (test-e2e-smoke) goes green for translation-salamandra-streaming, translation-salamandra-en-es, translation-salamandra-context on iOS, Android, and Desktop runners.
  • AfriqueGemma translation tests behave the same as on main (the shouldSkipPerCallSampling guard preserves the pre-PR call shape for AFRICAN_* models).
  • Existing NMT translation tests (Bergamot, IndicTrans, NMTcpp) still pass (path is bypassed by canonicalModelType === llamacppCompletion guard).

Linked

  • Asana: QVAC-18156
  • Out of scope: separate AfriqueGemma garbage-output investigation.

Force greedy decoding with a fixed seed and bounded output length on every
LLM translate call (non-African branch) so output is reproducible across
calls and runaway generations cannot blow ctx_size on the next call.

Background: with @qvac/llm-llamacpp 0.17.x, calling `translate()` against
Salamandra (loaded with no decoding params) intermittently produced
verbatim source echo, "Translation in Spanish:" preambles, or
`processPromptImpl: context overflow` on tiny inputs like "bank". The
flake was non-deterministic across runs on the same input, masked in the
smoke suite by `contains-any` validators that still matched a Spanish
keyword inside a preamble.

The change is one call site: when the model is llamacpp-completion and
the prompt is not the AfriqueGemma path, pass per-call generationParams
overriding sampling for that runJob:
- temp/top_k/top_p collapse to greedy
- repeat_penalty: 1.3 breaks single-token echo loops
  (e.g. greedy "bank" -> "bank\nbank\n...")
- seed: 42 pins any residual sampling
- predict: 256 caps output so a runaway can't accumulate KV state

Prompt template, NMT branch, and African branch are unchanged.
AfriqueGemma is loaded with its own deterministic config + stop_sequences
already, so we skip the override there.

Verified locally on @qvac/llm-llamacpp 0.17.1 with 30 calls
(streaming + en-es + context, 10 iterations each):
- before: 23/30 pass; 7 failures (2 echoes in en-es, plus 2 ctx-overflows and 3 echoes in context)
- after:  30/30 pass, all outputs identical across iterations
@olyasir olyasir requested review from a team as code owners April 29, 2026 14:31
Pull the per-call sampling overrides for LLM translate out of the call
site into a top-of-file constant with a comment that explains the
purpose of each field. No behavior change — values are identical to the
previous commit.

Adding a third translate-friendly LLM model later still goes through
this single constant unless it needs different sampling, in which case
it would warrant a small profile lookup keyed on model family. That
restructure is deferred until a concrete second profile lands.

@simon-iribarren simon-iribarren left a comment

Contributor

Thanks for the thorough write-up and repro table — the determinism story is convincing. A few things to address before merge:

CI

API Surface & Tagging

The diff doesn't touch any exported types, schemas, IPC channels, or HTTP endpoints — it changes internal default sampling. Per commit-and-pr-format.mdc, [api] is reserved for surface changes. My recommendation is to drop the [api] tag and retitle to:

QVAC-18156 fix: deterministic decoding for LLM translate

That also makes the validator pass without contortions. If you do want to keep [api] because consumer-visible output changes, please add a fenced code block illustrating the (unchanged) call shape so the validator is satisfied.

Blocking — description contradicts the code

The Summary says:

Force greedy decoding [...] on every LLM translate() call (non-African branch)
Prompt template, NMT branch, and AfriqueGemma branch are unchanged.

But the diff passes per-call generationParams to both branches (afriquePrompt ? AFRIQUEGEMMA_… : SALAMANDRA_…). The African path is touched. Either the Summary needs updating, or — better — see the next point.

Blocking — AfriqueGemma now silently overrides consumer load-time tuning

AFRIQUEGEMMA_TRANSLATE_GENERATION_PARAMS mirrors today's load-time tuning in tests-qvac/tests/desktop/consumer.ts:208-214 exactly, but the SDK now silently overrides whatever the consumer chose at load time. If anyone retunes AfriqueGemma in their load-time config, the SDK's hardcoded per-call values silently win and the tuning has no effect.

The cleanest fix is to not pass per-call params on the African branch at all — keep await model.run(input) for the AfriqueGemma path. Its load-time config + stop_sequences:["\n"] already pin everything you need; the description's intent matches that, and it keeps the change scoped to the actual problem (Salamandra). That removes the duplication, the override risk, and the description/code mismatch in one move.

Strong suggestions

  • Rename SALAMANDRA_TRANSLATE_GENERATION_PARAMS. It's used for every non-African LLM (your own test plan calls this out). LLM_TRANSLATE_GENERATION_PARAMS (or DEFAULT_LLM_TRANSLATE_GENERATION_PARAMS) avoids future grep confusion.
  • Inline cast on model.run.bind(model) — see inline comment.
  • Record<string, number> is too loose — define a typed alias for the six fields so a typo on a key is caught at compile time.

What's good

  • Targeted single-file change, scoped to the actual flake
  • Excellent repro evidence (3 tests × 10 iterations × before/after table)
  • Comments explain why, not what — that's what we want
  • Test plan explicitly notes the non-Salamandra LLM cases get the same treatment

Once the validator is green, the description matches the code, and the AfriqueGemma branch is settled, I'm happy to approve.

Comment thread packages/sdk/server/bare/ops/translate.ts Outdated
Comment thread packages/sdk/server/bare/ops/translate.ts Outdated
Comment thread packages/sdk/server/bare/ops/translate.ts
@github-actions

github-actions Bot commented Apr 29, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

Apply the per-call deterministic-decoding override only to non-AFRICAN_*
LLM models. AfriqueGemma's load-time `modelConfig` carries
`stop_sequences: ["\n"]` and `repeat_penalty: 1`, and these values must
not be overridden mid-call: with `repeat_penalty: 1.3`, the addon
penalises "\n" and the stop never fires, so generation runs all the way
to `predict` and produces non-translation output. The earlier attempt
to dispatch by `afriquePrompt` (language-pair-derived) silently did
nothing for the actual AfriqueGemma traffic: `isAfrican("sw")` returns
`false` because `AFRICAN_LANGUAGES_MAP` is keyed by FLORES codes
(`"swh_Latn"`), not the ISO codes the smoke tests pass.

This commit dispatches by model name (entry.local.name starts with
"AFRICAN_") and falls back to `model.run(input)` with no override —
identical to the pre-fix call shape — so AfriqueGemma's behaviour is
preserved exactly as it is on main. A latent AfriqueGemma garbage-output
issue exists at HEAD regardless of this PR; that is out of scope.

The constant is renamed `LLM_TRANSLATE_GENERATION_PARAMS` since it now
applies to every non-skipped LLM, not just Salamandra.
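
The interaction between `repeat_penalty` and a `"\n"` stop sequence can be illustrated with a toy model. This sketch uses the commonly used repetition-penalty form (positive logits of recently seen tokens are divided by the penalty, negative ones multiplied); that the addon's sampler behaves the same way is an assumption, and the logit values and the competing token are invented for the example.

```typescript
interface TokenLogit {
  token: string;
  logit: number;
}

// Toy repetition penalty: down-weights tokens that already appear in `recent`.
function applyRepeatPenalty(
  logits: TokenLogit[],
  recent: string[],
  penalty: number,
): TokenLogit[] {
  return logits.map(({ token, logit }) => {
    if (recent.indexOf(token) === -1) return { token, logit };
    return { token, logit: logit > 0 ? logit / penalty : logit * penalty };
  });
}

// Greedy decoding: pick the token with the highest logit.
function greedyPick(logits: TokenLogit[]): string {
  let best: TokenLogit | undefined;
  for (const candidate of logits) {
    if (best === undefined || candidate.logit > best.logit) best = candidate;
  }
  if (best === undefined) throw new Error("empty logits");
  return best.token;
}
```

Suppose the model slightly prefers `"\n"` (logit 2.0) over some other token (logit 1.8), and `"\n"` already occurs in the recent context. With a penalty of 1 the greedy pick is `"\n"` and the stop sequence fires; with 1.3 the newline's logit drops to about 1.54, the other token wins, and generation continues.
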
@olyasir olyasir force-pushed the fix/qvac-18156-salamandra-translate-determinism branch from 7c7c5f7 to 22357aa Compare April 29, 2026 15:54
Pull `RunOptions` and `GenerationParams` from `@qvac/llm-llamacpp` and
use them in place of the loose `Record<string, number>` cast in
`translate()`. Define a `LlmTranslateGenerationParams` alias as the
specific subset of `GenerationParams` we set per call (six fields,
required) so a typo on any of them is a compile error. The cast on
`model.run.bind(model)` now references the addon's `RunOptions` shape
directly, which keeps us protected if the addon's option shape changes.

No behaviour change.
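
A minimal sketch of the typing pattern this commit describes, assuming a `GenerationParams` type whose fields are all optional. The interface below is a hypothetical stand-in for the addon's type (the `min_p` field in particular is only there to show that `Pick` narrows the set), and the `temp`/`top_k`/`top_p` values are assumed rather than taken from `translate.ts`.

```typescript
// Hypothetical stand-in for the addon's GenerationParams.
interface GenerationParams {
  temp?: number;
  top_k?: number;
  top_p?: number;
  repeat_penalty?: number;
  seed?: number;
  predict?: number;
  min_p?: number; // hypothetical extra field, not part of the alias below
}

// Exactly the six fields set per call, all required.
type LlmTranslateGenerationParams = Required<
  Pick<
    GenerationParams,
    "temp" | "top_k" | "top_p" | "repeat_penalty" | "seed" | "predict"
  >
>;

const LLM_TRANSLATE_GENERATION_PARAMS: LlmTranslateGenerationParams = {
  temp: 0, // ASSUMED greedy value
  top_k: 1, // ASSUMED greedy value
  top_p: 1, // ASSUMED greedy value
  repeat_penalty: 1.3,
  seed: 42,
  predict: 256,
};

// Misspelling a key in the literal above (for example `repeat_penalti`) fails
// to compile twice over: the unknown key is an excess property, and the
// required `repeat_penalty` is missing.
```

With `Record<string, number>` the same typo would type-check and the misspelled key would simply be passed through at runtime.
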
@olyasir olyasir changed the title from "QVAC-18156 fix[api]: deterministic decoding for LLM translate" to "QVAC-18156 fix: deterministic decoding for LLM translate" on Apr 29, 2026
@olyasir

olyasir commented Apr 29, 2026

Contributor Author

Thanks for the careful review. Going through your points in order:

CI / title

  • Renamed the PR title to QVAC-18156 fix: deterministic decoding for LLM translate (no [api]). The diff is purely an internal default-sampling change, so dropping the tag is the honest fix; validate-pr should now pass.

Blocking — description vs. code (both points)

Both addressed in commit 22357aa6 (force-pushed earlier):

  • AFRIQUEGEMMA_TRANSLATE_GENERATION_PARAMS removed.
  • The African branch falls back to plain await model.run(input) with no per-call override; load-time consumer config stays authoritative.
  • Dispatch is now by model name (entry.local.name?.startsWith("AFRICAN_")) instead of the language-pair afriquePrompt flag — see the new shouldSkipPerCallSampling helper. The previous isAfrican(from) || isAfrican(to) check was a no-op for actual AfriqueGemma traffic anyway, since AFRICAN_LANGUAGES_MAP is keyed by FLORES codes ("swh_Latn") but the smoke tests pass ISO codes ("sw").

Strong suggestions

  • Rename to LLM_TRANSLATE_GENERATION_PARAMS — done in 22357aa6.
  • Inline cast on model.run.bind(model) — softened in 94eaf431: now uses the addon's own RunOptions type (no inline { generationParams: Record<string, number> }).
  • Typed alias for the six fields — also 94eaf431: LlmTranslateGenerationParams = Required<Pick<GenerationParams, ...>>. Typos on any of the six keys are now a compile error.
  • The bind+cast itself remains — AnyModel.run is intentionally erased to a single-arg signature in Omit<BaseInference, "addon">, and we re-narrow the same way completion-stream.ts:182 does. Widening AnyModel itself would touch @qvac/infer-base typing and feels out of scope.
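
The bind-and-cast pattern mentioned in the last point can be sketched as follows. All names here are hypothetical simplifications: the real `AnyModel`, `RunOptions`, and `run` signatures are asynchronous and live in `@qvac/infer-base` and `@qvac/llm-llamacpp`, so this only shows the shape of the re-narrowing, not the SDK's code.

```typescript
// Hypothetical, simplified option shape.
interface RunOptions {
  generationParams?: { seed?: number; predict?: number };
}

// Erased view of a model: run() is visible with a single argument only.
interface AnyModel {
  run(input: string): string;
}

// A concrete model whose run() actually accepts per-call options.
class FakeLlm implements AnyModel {
  private readonly label = "fake";

  run(input: string, opts?: RunOptions): string {
    const seed = opts?.generationParams?.seed;
    return `${this.label}:${input}:${seed ?? "default"}`;
  }
}

// bind() keeps `this` pointing at the model; the cast re-narrows the erased
// signature to the two-argument form for this one call site.
function runWithOptions(
  model: AnyModel,
  input: string,
  opts: RunOptions,
): string {
  const run = model.run.bind(model) as (
    input: string,
    opts?: RunOptions,
  ) => string;
  return run(input, opts);
}
```

Without `bind`, calling the extracted method would lose `this` and the `label` lookup inside `run` would fail at runtime. The cast is only as safe as the assumption that the underlying model really accepts the second argument.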

Side note

Worth flagging separately: AfriqueGemma's GPU output on Intel Vulkan (Iris Xe iGPU) is producing garbage on this hardware regardless of addon version (verified down to 0.12.4, the version Alok ran his FLORES benchmarks on in QVAC-13540). Salamandra on the same machine is fine. This is environmental — Vulkan/Mesa × Gemma3-4B-Q4_K_M — and predates the entire 6-week addon series. Independent of this PR; will file a separate ticket against qvac-lib-infer-llamacpp-llm with the bisect evidence.

Updated PR contents:

22357aa6  refactor: skip per-call sampling override for AfriqueGemma
dcff9b93  refactor: extract LLM translate generation params into named constant
549fe629  fix[api]: deterministic decoding for LLM translate     ← original
94eaf431  refactor: tighten typing on per-call generation params  ← new

Ready for another look when you have a minute.

@NamelsKing

Contributor

/review

@NamelsKing NamelsKing merged commit a03ce64 into main Apr 30, 2026
19 of 20 checks passed
@NamelsKing NamelsKing deleted the fix/qvac-18156-salamandra-translate-determinism branch April 30, 2026 07:57
Proletter pushed a commit that referenced this pull request May 24, 2026
* fix[api]: deterministic decoding for LLM translate

* refactor: extract LLM translate generation params into named constant

* refactor[api]: skip per-call sampling override for AfriqueGemma

* refactor: tighten typing on per-call generation params

---------

Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>