perf(memory/v2): quantize rerank model to q8 and batch channel queries by siddseethepalli · Pull Request #29631 · vellum-ai/vellum-assistant

siddseethepalli · 2026-05-05T08:13:35Z

Summary

Quantize the memory v2 rerank model to q8 by default via a new memory.v2.rerank.dtype config field plumbed end-to-end into the auto-generated rerank-worker.mjs. ModernBERT int8 ONNX runs ~3× faster on CPU with negligible reranker accuracy loss.
Batch the user-channel and assistant-channel rerank queries into one tokenizer + ONNX forward pass. rerankCandidates now takes queries[] and returns Map[]; activation calls it once per turn instead of twice.
Combined, these changes target a ~7-10× reduction in rerank latency (previously ~60s/turn at top_k=50, fp32, two serialized batches).
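The batching change above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the `Candidate` and `PairScorer` types and the injected `scorePairs` function are hypothetical stand-ins for the tokenizer + ONNX worker call, and the real `rerankCandidates` signature may differ.

```typescript
// Hypothetical types; the real signatures in the PR may differ.
type Candidate = { id: string; text: string };
type PairScorer = (pairs: Array<[string, string]>) => Promise<number[]>;

// Score every (query, candidate) pair in a single scorer call, then split
// the flat score array back into one Map (candidate id -> score) per query.
async function rerankCandidates(
  queries: string[],
  candidates: Candidate[],
  scorePairs: PairScorer,
): Promise<Array<Map<string, number>>> {
  if (queries.length === 0 || candidates.length === 0) {
    return queries.map(() => new Map<string, number>());
  }

  // Flatten in query-major order: all candidates for query 0, then query 1, ...
  const pairs: Array<[string, string]> = [];
  for (const q of queries) {
    for (const c of candidates) pairs.push([q, c.text]);
  }

  // One round-trip for all queries instead of one per query.
  const scores = await scorePairs(pairs);
  if (scores.length !== pairs.length) {
    throw new Error(`expected ${pairs.length} scores, got ${scores.length}`);
  }

  return queries.map((_, qi) => {
    const byId = new Map<string, number>();
    candidates.forEach((c, ci) => {
      byId.set(c.id, scores[qi * candidates.length + ci]);
    });
    return byId;
  });
}
```

With two channel queries and N candidates, the scorer sees one batch of 2N pairs, which is where the saving over two serialized round-trips would come from.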

Bumps RUNTIME_VERSION (_workers-v2 suffix) so existing installs regenerate the worker script. This triggers a one-time reinstall on the next daemon start; during that window, rerank degrades to pure fused scores via the existing fail-open path.
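A fail-open path of the kind mentioned above might look like the sketch below. Everything here is assumed for illustration: the `Reranker` type, the `scoreWithFailOpen` name, and the choice to add rerank scores onto fused scores are not taken from this PR's diff.

```typescript
// Hypothetical shapes for illustration only.
type FusedScores = Map<string, number>;
type Reranker = (query: string, ids: string[]) => Promise<Map<string, number>>;

// Try to rerank; if the worker is unavailable (for example while it is being
// reinstalled), fall back to the fused scores unchanged instead of failing.
async function scoreWithFailOpen(
  query: string,
  fused: FusedScores,
  rerank: Reranker,
): Promise<{ scores: Map<string, number>; reranked: boolean }> {
  try {
    const boosts = await rerank(query, Array.from(fused.keys()));
    const scores = new Map<string, number>();
    fused.forEach((base, id) => {
      // Assumption: rerank output is added to the fused score.
      scores.set(id, base + (boosts.get(id) ?? 0));
    });
    return { scores, reranked: true };
  } catch {
    // Fail open: pure fused scores, no rerank contribution.
    return { scores: new Map(fused), reranked: false };
  }
}
```

The `reranked` flag is included only so a caller could log or measure how often the degraded path is taken.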

Original prompt

#1 and #4

Cuts cross-encoder rerank latency on the conversation hot path through two complementary changes:

- Quantize the rerank model to q8 by default via a new `memory.v2.rerank.dtype` config field plumbed end-to-end into the auto-generated rerank-worker.mjs. ModernBERT int8 ONNX runs ~3× faster on CPU with negligible reranker accuracy loss; falls back to pure fused scores if the configured model has no quantized export.
- Batch the user-channel and assistant-channel rerank queries into one tokenizer + ONNX forward pass instead of two serialized worker round-trips. `rerankCandidates` now takes `queries[]` and returns `Map[]`; activation calls it once per turn.

Bumps RUNTIME_VERSION (`_workers-v2` suffix) so existing installs regenerate the worker script on next daemon start.
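As a rough illustration of the config field described above, the sketch below resolves `memory.v2.rerank.dtype` with a q8 default. The `Config` shape, the function names, and the set of accepted values are assumptions: only `fp32` and `q8` are named in this PR, and the real field may accept more.

```typescript
// Only the two values named in this PR are modeled here.
const RERANK_DTYPES = ["fp32", "q8"] as const;
type RerankDtype = (typeof RERANK_DTYPES)[number];

const DEFAULT_RERANK_DTYPE: RerankDtype = "q8";

// Return the default when the field is unset; reject unknown values loudly.
function resolveRerankDtype(raw: unknown): RerankDtype {
  if (raw === undefined || raw === null) return DEFAULT_RERANK_DTYPE;
  const match = RERANK_DTYPES.find((d) => d === raw);
  if (match !== undefined) return match;
  throw new Error(`unsupported memory.v2.rerank.dtype: ${String(raw)}`);
}

// Hypothetical config shape mirroring the dotted path memory.v2.rerank.dtype.
type Config = { memory?: { v2?: { rerank?: { dtype?: unknown } } } };

function rerankDtypeFromConfig(config: Config): RerankDtype {
  return resolveRerankDtype(config.memory?.v2?.rerank?.dtype);
}
```

For example, `rerankDtypeFromConfig({})` yields the `"q8"` default, while an explicit `"fp32"` setting is passed through unchanged.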

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 404656b5b5

chatgpt-codex-connector · 2026-05-05T08:16:35Z

+    const key = cacheKey(q, candidates);
+    const cached = cache.get(key);


Include rerank dtype in cache key

rerankCandidates now routes scoring through config.memory.v2.rerank.dtype, but cache lookups still key only on (query, candidates). If an operator changes memory.v2.rerank.dtype (for example fp32 → q8) while serving traffic, repeated queries within the 2-minute TTL will return scores computed under the previous dtype, so the new setting is silently ignored and experiment comparisons are contaminated. The cache key (and corresponding writes) should include dtype (and ideally model) so config changes take effect immediately.
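One way the suggestion above could be applied is sketched below. It is an assumption-laden illustration rather than the project's actual `cacheKey`: the `RerankSettings` type, the extra `settings` parameter, and keying on candidate ids are all hypothetical.

```typescript
// Hypothetical types; the real cacheKey in the repo may key on other fields.
type Candidate = { id: string; text: string };
type RerankSettings = { model: string; dtype: string };

// Build a cache key that changes whenever the model or dtype changes, so
// scores computed under a previous setting are never served for a new one.
function cacheKey(
  query: string,
  candidates: Candidate[],
  settings: RerankSettings,
): string {
  // JSON-encoding an array keeps field boundaries unambiguous, so values
  // containing separators cannot collide with each other.
  return JSON.stringify([
    settings.model,
    settings.dtype,
    query,
    candidates.map((c) => c.id),
  ]);
}
```

With this shape, switching `dtype` from `fp32` to `q8` yields a different key for the same query and candidates, so entries written before the change simply stop matching and age out with the TTL.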

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

siddseethepalli merged commit f919d59 into main May 5, 2026

siddseethepalli deleted the do/memory-v2-rerank-quantize-and-batch branch May 5, 2026 08:13

chatgpt-codex-connector Bot reviewed May 5, 2026

devin-ai-integration Bot reviewed May 5, 2026

credence-the-bot Bot mentioned this pull request May 5, 2026

fix(oauth): route assistant oauth connect --callback-transport=gateway through daemon IPC #29596

Merged

5 tasks

siddseethepalli mentioned this pull request May 9, 2026

feat(memory-v2): cross-encoder rerank as additive boost #29555

Merged

siddseethepalli merged 1 commit into main from do/memory-v2-rerank-quantize-and-batch
