Qvac 17489 io binding for kv cache chaining #1686
Merged
Zbig9000 merged 5 commits on Apr 23, 2026
mario-rei
previously approved these changes
Apr 23, 2026
GustavoA1604
requested changes
Apr 23, 2026
GustavoA1604
left a comment
Contributor
No need to add kvCacheChaining as a JS option. We can just keep it always set to true. We can compare with the previous version by running the old version of the addon.
GustavoA1604
approved these changes
Apr 23, 2026
mario-rei
approved these changes
Apr 23, 2026
GustavoA1604
approved these changes
Apr 23, 2026
What problem does this PR solve?
Each `present.*` tensor was copied back into its matching `past_key_values.*` input on every step, a per-step O(past_len × heads × head_dim) copy that grew linearly with sequence length (quadratic total cost for an N-step generation).
How does it solve it?
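The core idea of this section, handing an output tensor to its input slot by move instead of by copy, can be illustrated with a minimal standalone sketch. Everything here is an assumption made for illustration: `Tensor` is a hypothetical stand-in for a move-only handle such as `Ort::Value`, and `ToySession` with its `chainOutputsToInputs()` helper is not the real `IOnnxInferSession`; only the `present.*` / `past_key_values.*` naming comes from the PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a move-only tensor handle such as Ort::Value.
struct Tensor {
    std::vector<float> data;
    Tensor() = default;
    explicit Tensor(std::vector<float> d) : data(std::move(d)) {}
    Tensor(const Tensor&) = delete;             // copying is forbidden on purpose
    Tensor& operator=(const Tensor&) = delete;
    Tensor(Tensor&&) noexcept = default;
    Tensor& operator=(Tensor&&) noexcept = default;
};

// Toy session (not the real IOnnxInferSession): named slots plus a chain map.
struct ToySession {
    std::unordered_map<std::string, Tensor> inputs;
    std::unordered_map<std::string, Tensor> outputs;
    std::unordered_map<std::string, std::string> chain;  // output name -> input name

    bool isInputChained(const std::string& inputName) const {
        for (const auto& link : chain) {
            if (link.second == inputName) return true;
        }
        return false;
    }

    // Called after a run: move every chained output into its input slot.
    // Returns the number of tensors handed over.
    std::size_t chainOutputsToInputs() {
        std::size_t moved = 0;
        for (const auto& link : chain) {
            auto it = outputs.find(link.first);
            if (it == outputs.end()) continue;             // output absent this step
            inputs[link.second] = std::move(it->second);   // steals the buffer
            outputs.erase(it);
            ++moved;
        }
        return moved;
    }
};

// Usage: the buffer address is unchanged after chaining, so nothing was copied.
inline bool demoChainingKeepsBuffer() {
    ToySession s;
    s.chain.emplace("present.0.key", "past_key_values.0.key");
    s.outputs.emplace("present.0.key", Tensor(std::vector<float>{1.f, 2.f, 3.f}));
    const float* before = s.outputs.at("present.0.key").data.data();
    const std::size_t moved = s.chainOutputsToInputs();
    assert(moved == 1);
    return s.inputs.at("past_key_values.0.key").data.data() == before;
}
```

Because the move only transfers ownership of the underlying buffer, the per-step cost of the hand-over no longer depends on `past_len`, which is the property the PR relies on.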
- Adds `setOutputToInputChain` / `clearChainedInputs` / `isInputChained` to `IOnnxInferSession`. After each `run()` the session `std::move`s output `Ort::Value`s directly into the matching input slots: no copy, no intermediate user-space vector.
- `OnnxInferSession::initInputTensors()` preserves chained slots across re-init so the moved tensors survive.
- `ChatterboxEngine::enableKvCacheChaining()` builds the `{present.i → past_key_values.i}` mapping from session I/O names (`inputs[keyValueOffset_..]` map 1:1 onto `outputs[1..]`, since `outputs[0]` is `logits`). Wired into both `generateSpeechTokens` (non-CFG) and `generateSpeechTokensWithCfg` (CFG) paths.
- `writeKvToTensors` and `cachePastKeyValues` now skip chained inputs.
- `ChatterboxConfig.kvCacheChaining` (default `true`) wired through `TTSModel`, `AddonJs`, and `index.js` so the optimization can be toggled for A/B benchmarking or an emergency disable.
How was it tested?
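The mapping cases exercised by the tests (offset 3, offset 2, truncated outputs) can be illustrated with a small standalone sketch of the name pairing. This is a guess at the shape of the logic, not the code from `ChatterboxEngine::enableKvCacheChaining()`: the function name `buildChainMap`, its signature, and the choice to pair only as many names as both lists provide when outputs are truncated are all assumptions, and the `.key` / `.value` suffixes used in the usage note are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Pair outputNames[1 + i] with inputNames[keyValueOffset + i].
// outputNames[0] is assumed to be the logits output and is never chained.
// Returns a map of output name -> input name.
inline std::unordered_map<std::string, std::string> buildChainMap(
    const std::vector<std::string>& inputNames,
    const std::vector<std::string>& outputNames,
    std::size_t keyValueOffset) {
    std::unordered_map<std::string, std::string> chain;
    if (keyValueOffset >= inputNames.size() || outputNames.size() <= 1) {
        return chain;  // nothing to chain
    }
    // Assumed policy for truncated outputs: pair only what both sides have.
    const std::size_t pairs = std::min(inputNames.size() - keyValueOffset,
                                       outputNames.size() - 1);
    for (std::size_t i = 0; i < pairs; ++i) {
        chain.emplace(outputNames[1 + i], inputNames[keyValueOffset + i]);
    }
    // Holds as long as output names are unique within a session.
    assert(chain.size() == pairs);
    return chain;
}
```

With three leading non-cache inputs (offset 3) followed by `past_key_values.0.key` and `past_key_values.0.value`, and outputs `logits`, `present.0.key`, `present.0.value`, the sketch pairs each `present.0.*` name with the `past_key_values.0.*` name at the same position; with offset 2 the same pairing results from a shorter input prefix.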
- `qvac-lib-inference-tts-unit-test` with integration filters excluded.
- `KvCacheChainingTest` cases: English offset=3 mapping, multilingual offset=2 mapping, truncated-outputs edge case, `writeKvToTensors` skip, `cachePastKeyValues` skip.
- `OnnxInferSessionMockTest` cases for the new mock methods.
- `jfk.wav` reference, `kvCacheChaining` toggled):
- `standard`) clean; C++ lints clean.
API Changes
New non-breaking `kvCacheChaining` option on the `ONNXTTS` constructor (default `true`, i.e. optimization on).