QVAC-18156 fix: deterministic decoding for LLM translate#1808
Conversation
Force greedy decoding with a fixed seed and bounded output length on every LLM translate call (non-African branch) so output is reproducible across calls and runaway generations cannot blow ctx_size on the next call. Background: with @qvac/llm-llamacpp 0.17.x, calling `translate()` against Salamandra (loaded with no decoding params) intermittently produced verbatim source echo, "Translation in Spanish:" preambles, or `processPromptImpl: context overflow` on tiny inputs like "bank". The flake was non-deterministic across runs on the same input, masked in the smoke suite by `contains-any` validators that still matched a Spanish keyword inside a preamble. The change is one call site: when the model is llamacpp-completion and the prompt is not the AfriqueGemma path, pass per-call generationParams overriding sampling for that runJob: - temp/top_k/top_p collapse to greedy - repeat_penalty: 1.3 breaks single-token echo loops (e.g. greedy "bank" -> "bank\nbank\n...") - seed: 42 pins anything residual sampling - predict: 256 caps output so a runaway can't accumulate KV state Prompt template, NMT branch, and African branch are unchanged. AfriqueGemma is loaded with its own deterministic config + stop_sequences already, so we skip the override there. Verified locally on @qvac/llm-llamacpp 0.17.1 with 30 calls (streaming + en-es + context, 10 iterations each): - before: 23/30 pass with 2 echoes, 2 ctx-overflow, 3 echoes - after: 30/30 pass, all outputs identical across iterations
Pull the per-call sampling overrides for LLM translate out of the call site into a top-of-file constant with a comment that explains the purpose of each field. No behavior change — values are identical to the previous commit. Adding a third translate-friendly LLM model later still goes through this single constant unless it needs different sampling, in which case it would warrant a small profile lookup keyed on model family. That restructure is deferred until a concrete second profile lands.
simon-iribarren
left a comment
There was a problem hiding this comment.
Thanks for the thorough write-up and repro table — the determinism story is convincing. A few things to address before merge:
CI
validate-pris failing:[api]tag requires a fenced code block in the body. (https://github.com/tetherto/qvac/actions/runs/25116034864/job/73603194812)
API Surface & Tagging
The diff doesn't touch any exported types, schemas, IPC channels, or HTTP endpoints — it changes internal default sampling. Per commit-and-pr-format.mdc, [api] is reserved for surface changes. My recommendation is to drop the [api] tag and retitle to:
QVAC-18156 fix: deterministic decoding for LLM translate
That also makes the validator pass without contortions. If you do want to keep [api] because consumer-visible output changes, please add a fenced code block illustrating the (unchanged) call shape so the validator is satisfied.
Blocking — description contradicts the code
The Summary says:
Force greedy decoding [...] on every LLM
translate()call (non-African branch)
Prompt template, NMT branch, and AfriqueGemma branch are unchanged.
But the diff passes per-call generationParams to both branches (afriquePrompt ? AFRIQUEGEMMA_… : SALAMANDRA_…). The African path is touched. Either the Summary needs updating, or — better — see the next point.
Blocking — AfriqueGemma now silently overrides consumer load-time tuning
AFRIQUEGEMMA_TRANSLATE_GENERATION_PARAMS mirrors today's load-time tuning in tests-qvac/tests/desktop/consumer.ts:208-214 exactly, but the SDK now silently overrides whatever the consumer chose at load time. If anyone retunes AfriqueGemma at the call site, the SDK's hardcoded values silently win and the tuning has no effect.
The cleanest fix is to not pass per-call params on the African branch at all — keep await model.run(input) for the AfriqueGemma path. Its load-time config + stop_sequences:["\n"] already pin everything you need; the description's intent matches that, and it keeps the change scoped to the actual problem (Salamandra). That removes the duplication, the override risk, and the description/code mismatch in one move.
Strong suggestions
- Rename
SALAMANDRA_TRANSLATE_GENERATION_PARAMS. It's used for every non-African LLM (your own test plan calls this out).LLM_TRANSLATE_GENERATION_PARAMS(orDEFAULT_LLM_TRANSLATE_GENERATION_PARAMS) avoids future grep confusion. - Inline cast on
model.run.bind(model)— see inline comment. Record<string, number>is too loose — define a typed alias for the six fields so a typo on a key is caught at compile time.
What's good
- Targeted single-file change, scoped to the actual flake
- Excellent repro evidence (3 tests × 10 iterations × before/after table)
- Comments explain why, not what — that's what we want
- Test plan explicitly notes the non-Salamandra LLM cases get the same treatment
Once the validator is green, the description matches the code, and the AfriqueGemma branch is settled, I'm happy to approve.
Tier-based Approval Status |
Apply the per-call deterministic-decoding override only to non-AFRICAN_*
LLM models. AfriqueGemma's load-time `modelConfig` carries
`stop_sequences: ["\n"]` and `repeat_penalty: 1`, and these values must
not be overridden mid-call: with `repeat_penalty: 1.3`, the addon
penalises "\n" and the stop never fires, so generation runs all the way
to `predict` and produces non-translation output. The earlier attempt
to dispatch by `afriquePrompt` (language-pair-derived) silently did
nothing for the actual AfriqueGemma traffic: `isAfrican("sw")` returns
`false` because `AFRICAN_LANGUAGES_MAP` is keyed by FLORES codes
(`"swh_Latn"`), not the ISO codes the smoke tests pass.
This commit dispatches by model name (entry.local.name starts with
"AFRICAN_") and falls back to `model.run(input)` with no override —
identical to the pre-fix call shape — so AfriqueGemma's behaviour is
preserved exactly as it is on main. A latent AfriqueGemma garbage-output
issue exists at HEAD regardless of this PR; that is out of scope.
The constant is renamed `LLM_TRANSLATE_GENERATION_PARAMS` since it now
applies to every non-skipped LLM, not just Salamandra.
7c7c5f7 to
22357aa
Compare
Pull `RunOptions` and `GenerationParams` from `@qvac/llm-llamacpp` and use them in place of the loose `Record<string, number>` cast in `translate()`. Define a `LlmTranslateGenerationParams` alias as the specific subset of `GenerationParams` we set per call (six fields, required) so a typo on any of them is a compile error. The cast on `model.run.bind(model)` now references the addon's `RunOptions` shape directly, which keeps us protected if the addon's option shape changes. No behaviour change.
|
Thanks for the careful review. Going through your points in order: CI / title
Blocking — description vs. code (both points)Both addressed in commit
Strong suggestions
Side noteWorth flagging separately: AfriqueGemma's GPU output on Intel Vulkan (Iris Xe iGPU) is producing garbage on this hardware regardless of addon version (verified down to Updated PR contents: Ready for another look when you have a minute. |
|
/review |
* fix[api]: deterministic decoding for LLM translate
Force greedy decoding with a fixed seed and bounded output length on every
LLM translate call (non-African branch) so output is reproducible across
calls and runaway generations cannot blow ctx_size on the next call.
Background: with @qvac/llm-llamacpp 0.17.x, calling `translate()` against
Salamandra (loaded with no decoding params) intermittently produced
verbatim source echo, "Translation in Spanish:" preambles, or
`processPromptImpl: context overflow` on tiny inputs like "bank". The
flake was non-deterministic across runs on the same input, masked in the
smoke suite by `contains-any` validators that still matched a Spanish
keyword inside a preamble.
The change is one call site: when the model is llamacpp-completion and
the prompt is not the AfriqueGemma path, pass per-call generationParams
overriding sampling for that runJob:
- temp/top_k/top_p collapse to greedy
- repeat_penalty: 1.3 breaks single-token echo loops
(e.g. greedy "bank" -> "bank\nbank\n...")
- seed: 42 pins anything residual sampling
- predict: 256 caps output so a runaway can't accumulate KV state
Prompt template, NMT branch, and African branch are unchanged.
AfriqueGemma is loaded with its own deterministic config + stop_sequences
already, so we skip the override there.
Verified locally on @qvac/llm-llamacpp 0.17.1 with 30 calls
(streaming + en-es + context, 10 iterations each):
- before: 23/30 pass with 2 echoes, 2 ctx-overflow, 3 echoes
- after: 30/30 pass, all outputs identical across iterations
* refactor: extract LLM translate generation params into named constant
Pull the per-call sampling overrides for LLM translate out of the call
site into a top-of-file constant with a comment that explains the
purpose of each field. No behavior change — values are identical to the
previous commit.
Adding a third translate-friendly LLM model later still goes through
this single constant unless it needs different sampling, in which case
it would warrant a small profile lookup keyed on model family. That
restructure is deferred until a concrete second profile lands.
* refactor[api]: skip per-call sampling override for AfriqueGemma
Apply the per-call deterministic-decoding override only to non-AFRICAN_*
LLM models. AfriqueGemma's load-time `modelConfig` carries
`stop_sequences: ["\n"]` and `repeat_penalty: 1`, and these values must
not be overridden mid-call: with `repeat_penalty: 1.3`, the addon
penalises "\n" and the stop never fires, so generation runs all the way
to `predict` and produces non-translation output. The earlier attempt
to dispatch by `afriquePrompt` (language-pair-derived) silently did
nothing for the actual AfriqueGemma traffic: `isAfrican("sw")` returns
`false` because `AFRICAN_LANGUAGES_MAP` is keyed by FLORES codes
(`"swh_Latn"`), not the ISO codes the smoke tests pass.
This commit dispatches by model name (entry.local.name starts with
"AFRICAN_") and falls back to `model.run(input)` with no override —
identical to the pre-fix call shape — so AfriqueGemma's behaviour is
preserved exactly as it is on main. A latent AfriqueGemma garbage-output
issue exists at HEAD regardless of this PR; that is out of scope.
The constant is renamed `LLM_TRANSLATE_GENERATION_PARAMS` since it now
applies to every non-skipped LLM, not just Salamandra.
* refactor: tighten typing on per-call generation params
Pull `RunOptions` and `GenerationParams` from `@qvac/llm-llamacpp` and
use them in place of the loose `Record<string, number>` cast in
`translate()`. Define a `LlmTranslateGenerationParams` alias as the
specific subset of `GenerationParams` we set per call (six fields,
required) so a typo on any of them is a compile error. The cast on
`model.run.bind(model)` now references the addon's `RunOptions` shape
directly, which keeps us protected if the addon's option shape changes.
No behaviour change.
---------
Co-authored-by: Dmytro Medvinskyi <functionsilence@gmail.com>
Summary
translate()call so output is reproducible and runaway generations cannot blowctx_sizeon the next call.packages/sdk/server/bare/ops/translate.ts. Prompt template, NMT branch, and AfriqueGemma routing are unchanged.Background
With
@qvac/llm-llamacpp@0.17.x, callingtranslate()against Salamandra (loaded with no decoding params) intermittently produced one of three failure modes on the same input:"Hello, how are you today?"returned untranslated),"Translation in Spanish:"preambles,processPromptImpl: context overflowon inputs as small as"bank".The flake was masked in CI because the smoke
contains-anyvalidators still matched a Spanish keyword inside a preamble. Salamandra was loaded with notemp/seed/stop_sequences(compare AfriqueGemma intests-qvac/.../desktop/consumer.ts:202-216which sets them at load time), so every call had non-deterministic sampling and no upper bound on output length. With auto KV-cache reuse in the new addon, a long preamble in one call could push the next short call overctx_size.Fix
In
translate.ts, when the model isllamacpp-completionand is not an AFRICAN_* registry entry, pass per-callgenerationParamsto override sampling for that onerunJob:temp/top_k/top_pcollapse to greedyrepeat_penalty: 1.3breaks single-token echo loops (greedy"bank"→"bank\nbank\n...")seed: 42pins any residual samplingpredict: 256caps output so a runaway can't accumulate KV statestop_sequencesis load-time only in the addon (transformed toreverse_promptinplugin.ts:53), so it can't be added here; the deterministic sampling + boundedpredictcover the same ground.AfriqueGemma is preserved as-is
Earlier iterations of this PR tried to dispatch by language pair (
afriquePrompt = isAfrican(from) || isAfrican(to)) but that flag is silently always false for the codes the smoke tests pass:AFRICAN_LANGUAGES_MAPis keyed by FLORES codes ("swh_Latn") while the tests pass ISO codes ("sw"). Dispatching by model name (entry.local.name?.startsWith("AFRICAN_")) is the only currently correct discriminator.For AFRICAN_* models we fall back to
model.run(input)with no override — byte-identical to the call shape before this PR. AfriqueGemma's load-timemodelConfig(stop_sequences: ["\n"],repeat_penalty: 1) keeps driving its decoding. Local repro on@qvac/llm-llamacpp@0.17.1shows AfriqueGemma's translate output is already malformed onmain(sw→en returns garbage tokens, never the expected English keywords), independent of this PR. Tracking it down is out of scope for QVAC-18156 — the smoke fortranslation-afriquegemma-sw-enis gated behind theverifylabel so it has not been visible on PR CI.Verification
Local repro on
@qvac/llm-llamacpp@0.17.1(the versionmaintargets), 30 calls per run (streaming + en-es + context, 10 iterations each), Salamandra Q4 loaded with no custom config (matching the smoke consumer):"¡Hola, ¿cómo te va hoy?""¡Hola, ¿cómo te va hoy?""bank\nbanco"AfriqueGemma was repro'd with the same per-call shape main uses (
model.run(input)no opts) — output identical to running against unmodified main, confirming this PR does not regress it.Test plan
test-e2e-smoke) goes green fortranslation-salamandra-streaming,translation-salamandra-en-es,translation-salamandra-contexton iOS, Android, and Desktop runners.main(theshouldSkipPerCallSamplingguard preserves the pre-PR call shape forAFRICAN_*models).canonicalModelType === llamacppCompletionguard).Linked