perf(memory/v2): quantize rerank model to q8 and batch channel queries #29631

Merged
siddseethepalli merged 1 commit into main from do/memory-v2-rerank-quantize-and-batch on May 5, 2026

@siddseethepalli siddseethepalli commented May 5, 2026

Summary

  • Quantize the memory v2 rerank model to q8 by default via a new memory.v2.rerank.dtype config field plumbed end-to-end into the auto-generated rerank-worker.mjs. ModernBERT int8 ONNX runs ~3× faster on CPU with negligible reranker accuracy loss.
  • Batch the user-channel and assistant-channel rerank queries into one tokenizer + ONNX forward pass. rerankCandidates now takes queries[] and returns Map[]; activation calls it once per turn instead of twice.
  • Combined, these changes target a ~7-10× reduction in rerank latency (previously ~60s/turn at top_k=50 with fp32 and two serialized batches).
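
A minimal TypeScript sketch of the batching idea in the second bullet, not the code from this PR: every (query, candidate) pair is flattened into one batch so a single scoring call can cover both channels, and the flat score array is then split back into one Map per query. The names `Candidate`, `buildPairs`, and `splitScores` are hypothetical, and the scoring call itself (tokenizer plus ONNX forward pass) is deliberately left out.

```typescript
// Hypothetical candidate shape; the PR does not show the real type.
type Candidate = { id: string; text: string };

// Flatten all (query, candidate) pairs, query-major, into one batch.
function buildPairs(
  queries: string[],
  candidates: Candidate[],
): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (const q of queries) {
    for (const c of candidates) {
      pairs.push([q, c.text]);
    }
  }
  return pairs;
}

// Split the flat score array (same order as buildPairs) into one
// Map of candidate id -> score per query.
function splitScores(
  queryCount: number,
  candidates: Candidate[],
  scores: number[],
): Map<string, number>[] {
  const n = candidates.length;
  if (scores.length !== queryCount * n) {
    throw new Error("score count does not match pair count");
  }
  const out: Map<string, number>[] = [];
  for (let qi = 0; qi < queryCount; qi++) {
    const perQuery = new Map<string, number>();
    candidates.forEach((c, ci) => perQuery.set(c.id, scores[qi * n + ci]));
    out.push(perQuery);
  }
  return out;
}
```

With two queries (user channel and assistant channel) and N candidates, the batch holds 2N pairs, and `splitScores(2, candidates, scores)` yields the two maps in query order.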

Bumps RUNTIME_VERSION (_workers-v2 suffix) so existing installs regenerate the worker script. One-time reinstall on the next daemon start; during that window rerank degrades to pure fused scores via the existing fail-open path.
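
A rough sketch of how a version bump and a fail-open path could interact, under stated assumptions: the version string value, `needsWorkerRegeneration`, and `resolveScores` are all hypothetical names for illustration, and the rerank call is synchronous here for brevity, whereas a worker round-trip would normally be asynchronous.

```typescript
// Hypothetical value; only the `_workers-v2` suffix comes from the PR text.
const RUNTIME_VERSION = "0.0.0_workers-v2";

// A stale or missing installed version means the worker script
// should be regenerated on the next daemon start.
function needsWorkerRegeneration(installed: string | undefined): boolean {
  return installed !== RUNTIME_VERSION;
}

// Fail open: if reranking throws (for example, the worker is not
// installed yet), fall back to the fused scores unchanged.
function resolveScores(
  fused: Map<string, number>,
  rerank: () => Map<string, number>,
): Map<string, number> {
  try {
    return rerank();
  } catch {
    return fused;
  }
}
```

The point of the fallback is that a missing or reinstalling worker affects ranking quality for a short window rather than failing the turn.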

Original prompt

#1 and #4



Cuts cross-encoder rerank latency on the conversation hot path through
two complementary changes:

- Quantize the rerank model to q8 by default via a new
  `memory.v2.rerank.dtype` config field plumbed end-to-end into the
  auto-generated rerank-worker.mjs. ModernBERT int8 ONNX runs ~3×
  faster on CPU with negligible reranker accuracy loss; falls back to
  pure fused scores if the configured model has no quantized export.
- Batch the user-channel and assistant-channel rerank queries into one
  tokenizer + ONNX forward pass instead of two serialized worker
  round-trips. `rerankCandidates` now takes `queries[]` and returns
  `Map[]`; activation calls it once per turn.

Bumps RUNTIME_VERSION (`_workers-v2` suffix) so existing installs
regenerate the worker script on next daemon start.
@siddseethepalli siddseethepalli merged commit f919d59 into main May 5, 2026
@siddseethepalli siddseethepalli deleted the do/memory-v2-rerank-quantize-and-batch branch May 5, 2026 08:13

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 404656b5b5


Comment on lines +87 to +88
const key = cacheKey(q, candidates);
const cached = cache.get(key);

P2: Include rerank dtype in cache key

rerankCandidates now routes scoring through config.memory.v2.rerank.dtype, but cache lookups still key only on (query, candidates). If an operator changes memory.v2.rerank.dtype (for example fp32 → q8) while serving traffic, repeated queries within the 2-minute TTL will return scores computed under the previous dtype, so the new setting is silently ignored and experiment comparisons are contaminated. The cache key (and corresponding writes) should include dtype (and ideally model) so config changes take effect immediately.
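
One way the suggestion could look, as a hedged sketch rather than the repository's actual `cacheKey`: the key folds in the model and dtype alongside the query and candidate ids. The `RerankConfig` and `Candidate` shapes and the three-argument signature are assumptions for illustration; the snippet above only shows `cacheKey(q, candidates)`.

```typescript
// Hypothetical shapes; the real config and candidate types are not shown.
type RerankConfig = { model: string; dtype: string };
type Candidate = { id: string };

// Serialize every input that affects scores into the key. JSON.stringify
// on an array avoids ambiguity from delimiter characters inside fields.
function cacheKey(
  query: string,
  candidates: Candidate[],
  cfg: RerankConfig,
): string {
  return JSON.stringify([
    cfg.model,
    cfg.dtype,
    query,
    candidates.map((c) => c.id),
  ]);
}
```

With a key like this, switching dtype produces a cache miss for the same query and candidates, so stale scores from the previous setting are never returned, and reads and writes stay consistent as long as both use the same function.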


@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.
