perf(memory/v2): quantize rerank model to q8 and batch channel queries #29631
Cuts cross-encoder rerank latency on the conversation hot path through two complementary changes:

- Quantize the rerank model to q8 by default via a new `memory.v2.rerank.dtype` config field plumbed end-to-end into the auto-generated `rerank-worker.mjs`. ModernBERT int8 ONNX runs ~3× faster on CPU with negligible reranker accuracy loss; rerank falls back to pure fused scores if the configured model has no quantized export.
- Batch the user-channel and assistant-channel rerank queries into one tokenizer + ONNX forward pass instead of two serialized worker round-trips. `rerankCandidates` now takes `queries[]` and returns `Map[]`; activation calls it once per turn.

Bumps `RUNTIME_VERSION` (`_workers-v2` suffix) so existing installs regenerate the worker script on next daemon start.
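The batching change can be illustrated with a minimal sketch. It assumes only what the description states (`queries[]` in, one `Map` per query out); the `Candidate` shape, the `ScorePairs` callback standing in for the single tokenizer + ONNX pass, and the function name `rerankCandidatesBatched` are hypothetical, and scoring is kept synchronous for brevity even though a worker round-trip would presumably be asynchronous.

```typescript
// Hypothetical candidate shape; the PR does not show the real type.
interface Candidate {
  id: string;
  text: string;
}

// Stand-in for one tokenizer + ONNX forward pass over all (query, passage) pairs.
type ScorePairs = (pairs: Array<[string, string]>) => number[];

function rerankCandidatesBatched(
  queries: string[],
  candidates: Candidate[],
  scorePairs: ScorePairs,
): Map<string, number>[] {
  // Flatten every (query, candidate) pair into one batch, query-major.
  const pairs: Array<[string, string]> = [];
  for (const q of queries) {
    for (const c of candidates) pairs.push([q, c.text]);
  }

  // One scoring call regardless of how many queries were passed.
  const scores = pairs.length > 0 ? scorePairs(pairs) : [];
  if (scores.length !== pairs.length) {
    throw new Error(`expected ${pairs.length} scores, got ${scores.length}`);
  }

  // Slice the flat score array back into one Map (candidate id -> score) per query.
  return queries.map((_, qi) => {
    const byId = new Map<string, number>();
    candidates.forEach((c, ci) => {
      byId.set(c.id, scores[qi * candidates.length + ci]);
    });
    return byId;
  });
}
```

A caller would pass the user-channel and assistant-channel queries together and read the result at index 0 and index 1, so the scorer is invoked once per turn rather than once per channel.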
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 404656b5b5
`const key = cacheKey(q, candidates);`
`const cached = cache.get(key);`
Include rerank dtype in cache key
`rerankCandidates` now routes scoring through `config.memory.v2.rerank.dtype`, but cache lookups still key only on `(query, candidates)`. If an operator changes `memory.v2.rerank.dtype` (for example fp32 → q8) while serving traffic, repeated queries within the 2-minute TTL will return scores computed under the previous dtype, so the new setting is silently ignored and experiment comparisons are contaminated. The cache key (and corresponding writes) should include `dtype` (and ideally `model`) so config changes take effect immediately.
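One way the suggested fix could look is sketched below. This is not the repository's `cacheKey`; the name `cacheKeyWithDtype`, the `RerankKeyConfig` shape, and the use of candidate ids rather than whole candidate objects are assumptions made for illustration.

```typescript
// Hypothetical config slice. Only `dtype` is named in the PR;
// `model` follows the review's "ideally" suggestion.
interface RerankKeyConfig {
  model: string;
  dtype: string; // e.g. "fp32" or "q8"
}

function cacheKeyWithDtype(
  query: string,
  candidateIds: string[],
  cfg: RerankKeyConfig,
): string {
  // JSON-encode an array instead of joining with a delimiter, so a delimiter
  // appearing inside any field cannot make two different inputs collide.
  return JSON.stringify([cfg.model, cfg.dtype, query, candidateIds]);
}
```

Because the same function would be used for both lookups and writes, entries scored under the old dtype simply stop matching after a config change and expire with the existing TTL.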
Summary
- Quantize the rerank model to q8 by default via a new `memory.v2.rerank.dtype` config field plumbed end-to-end into the auto-generated `rerank-worker.mjs`. ModernBERT int8 ONNX runs ~3× faster on CPU with negligible reranker accuracy loss.
- `rerankCandidates` now takes `queries[]` and returns `Map[]`; activation calls it once per turn instead of twice.

Bumps `RUNTIME_VERSION` (`_workers-v2` suffix) so existing installs regenerate the worker script. One-time reinstall on the next daemon start; during that window rerank degrades to pure fused scores via the existing fail-open path.

Original prompt
#1 and #4
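The fail-open behaviour mentioned in the Summary (rerank degrading to pure fused scores while the worker is unavailable) can be sketched roughly as follows. The existing fail-open path is not shown in this PR, so `FusedCandidate`, `Reranker`, and `scoreWithFailOpen` are hypothetical names and the error handling is an assumption.

```typescript
// Hypothetical shape: a candidate carrying its pre-rerank fused score.
interface FusedCandidate {
  id: string;
  fusedScore: number;
}

// Stand-in for the cross-encoder call. Assumed to return null or throw
// when the worker is missing, reinstalling, or has no usable model.
type Reranker = (
  query: string,
  candidates: FusedCandidate[],
) => Map<string, number> | null;

function scoreWithFailOpen(
  query: string,
  candidates: FusedCandidate[],
  rerank: Reranker,
): Map<string, number> {
  try {
    const reranked = rerank(query, candidates);
    if (reranked !== null) return reranked;
  } catch {
    // Deliberately swallowed: a reranker failure should not fail the turn.
  }

  // Fail open: fall back to the fused scores the candidates already carry.
  const fused = new Map<string, number>();
  for (const c of candidates) fused.set(c.id, c.fusedScore);
  return fused;
}
```

Under this design the caller always receives a score map, so a missing or regenerating worker costs ranking quality for that turn rather than availability.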