QVAC-18733 feat[api]: add openai responses routes with in-memory store by lauripiisang · Pull Request #2030 · tetherto/qvac

lauripiisang · 2026-05-13T15:59:25Z

Note: be concise and prefer bullet points.

What problem does this PR solve?

OpenAI-compatible clients expect /v1/responses (blocking, SSE streaming, retrieval by id, and previous_response_id chaining). The CLI HTTP server only exposed chat/completions-style paths, so those clients could not target QVAC through the same surface.

How does it solve it?

- Implements POST /v1/responses (blocking and stream: true SSE) by translating the Responses request into the existing sdkCompletion path used for chat-style inference, so it shares the same models and streaming behavior as the rest of the OpenAI adapter.
- Adds GET / DELETE /v1/responses/{id} and GET /v1/responses/{id}/input_items backed by a small in-memory store (LRU + TTL; default on; store: false skips persistence).
- Supports previous_response_id by prefixing stored prior output into the completion history; returns 404 with previous_response_not_found when the prior id is missing.
- Surfaces X-QVAC-Stub: responses-volatile on these routes and logs a one-line startup warning so operators know state is not durable.
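
To make the "LRU + TTL" wording concrete, here is a minimal sketch of such a store. It is an illustration under stated assumptions, not the PR's actual `ResponsesStore`: the class name `VolatileStore` and the injected `now` clock are hypothetical, while `maxEntries` and `ttlMs` mirror the option names mentioned elsewhere in this thread.

```typescript
// Hypothetical sketch of an LRU + TTL store; not the PR's actual ResponsesStore.
interface Entry<V> {
  value: V;
  expiresAt: number;
}

class VolatileStore<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private entries = new Map<string, Entry<V>>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(id: string, value: V): void {
    this.entries.delete(id); // re-insert so the id becomes the most recent
    this.entries.set(id, { value, expiresAt: this.now() + this.ttlMs });
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest === undefined) break;
      this.entries.delete(oldest);
    }
  }

  get(id: string): V | undefined {
    const entry = this.entries.get(id);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(id); // lazily prune expired records
      return undefined;
    }
    // Refresh recency without extending the TTL.
    this.entries.delete(id);
    this.entries.set(id, entry);
    return entry.value;
  }

  // Returns false for missing or expired records, consistent with get().
  delete(id: string): boolean {
    const live = this.get(id) !== undefined;
    this.entries.delete(id);
    return live;
  }
}
```

In this sketch a read refreshes recency but not the expiry time; whether the real store extends the TTL on access is not stated in the PR.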

OpenAI API compatibility (what matches)

- Same URL shape and verbs for create, retrieve, delete, and input listing; SSE event framing for streaming where applicable.
- Request fields the adapter understands are translated to QVAC completion calls; response objects and stream events follow the same general shape clients expect for basic text / tool-call flows wired today.

Caveats / gaps (notable)

- State is volatile: process memory only; a restart or eviction loses ids, unlike OpenAI’s hosted persistence.
- Intentionally rejected with 400: conversation, background: true, and built-in tools (e.g. web_search) not mapped to local inference yet.
- Full parity with every OpenAI Responses option, tool type, and edge case is not claimed; unknown or unsupported combinations should fail loudly rather than silently diverge.
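
The "fail loudly" rule for unsupported fields could look roughly like the sketch below. The function name `rejectUnsupported` and the error strings are hypothetical, and treating every non-`function` tool as a built-in tool is an assumption made for illustration; the PR's real validation may differ.

```typescript
type JsonObject = Record<string, unknown>;

// Returns a human-readable reason for a 400, or null when the body is acceptable.
function rejectUnsupported(body: JsonObject): string | null {
  if (body["conversation"] !== undefined) {
    return "conversation is not supported";
  }
  if (body["background"] === true) {
    return "background: true is not supported";
  }
  const tools = Array.isArray(body["tools"]) ? body["tools"] : [];
  for (const tool of tools) {
    const type = (tool as JsonObject | null)?.["type"];
    // Assumption: anything other than a function tool counts as built-in.
    if (type !== "function") {
      return `tool type ${String(type)} is not supported`;
    }
  }
  return null;
}
```

A route handler would map a non-null result to a 400 response before any inference is started.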

How was it tested?

- packages/cli: npm run lint, npm run test:unit, npm run test:bats, npm run test:e2e (e2e covers blocking + SSE + store + chain + validation paths in e2e.bats).

API Changes

# Blocking create
curl -sS "$BASE/v1/responses" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"'"$MODEL"'","input":"ping","store":true}'

# Streaming
curl -sN "$BASE/v1/responses" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"'"$MODEL"'","input":"ping","stream":true,"store":true}'

# Chained follow-up (after capturing response id from prior call)
curl -sS "$BASE/v1/responses" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"'"$MODEL"'","input":"and now?","previous_response_id":"resp_..."}'
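
For clients consuming the streaming call, a small parser for the SSE payload may help. This is a sketch that follows the generic SSE text format (frames separated by a blank line, `event:` and `data:` fields); the helper name `parseSse` is hypothetical, and it assumes the server labels frames with `event:` lines such as `response.completed`, which this thread does not spell out.

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Splits a raw SSE payload into events. Frames are separated by a blank line;
// multiple data: lines in one frame are joined with "\n" per the SSE format.
function parseSse(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const frame of raw.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default when no event: line is present
    const data: string[] = [];
    for (const line of frame.split(/\r?\n/)) {
      if (line.startsWith("event:")) {
        event = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        data.push(line.slice(5).replace(/^ /, ""));
      }
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}
```

A client would typically stop reading once it sees the terminal event and take the response id for a chained follow-up from that event's JSON payload.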

github-actions · 2026-05-13T19:20:51Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

…ax_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition; strict parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all.

Skipped from the review:

- #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
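
The sentinel knob described in the first item can be sketched as follows. `EndSSEOptions` and `endSSE` are named in the commit message; the `SseSink` interface is a hypothetical stand-in for the HTTP response object, and the body is an assumption about how such a helper might be written rather than the PR's code.

```typescript
interface EndSSEOptions {
  sentinel?: boolean;
}

// Minimal stand-in for the response object (hypothetical, for illustration).
interface SseSink {
  write(chunk: string): void;
  end(): void;
}

function endSSE(res: SseSink, options: EndSSEOptions = {}): void {
  // Default keeps the sentinel so chat-completions behaviour is unchanged.
  if (options.sentinel ?? true) {
    res.write("data: [DONE]\n\n");
  }
  res.end();
}
```

With this shape, chat-completions callers keep calling `endSSE(res)` while the Responses route passes `{ sentinel: false }` so the stream ends on `response.completed`.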

lauripiisang · 2026-05-14T05:47:50Z

@simon-iribarren comments resolved in latest commits, please see

simon-iribarren

Approve — my earlier notes were addressed in c6beee725 (chain walk) and 7fc2459f6 (SSE [DONE], dup response.in_progress, ?? true, max_tokens warn, README). 6 of 8 fully closed; the remaining 2 (client-disconnect, byte-size cap) are correctly framed as follow-up — no objection.

Two new items found on top, neither a blocker:

- `[DONE]` re-emerges on the streaming error path. sendError's default `endSSE(res)` writes `data: [DONE]\n\n` after a `response.error` event, so the gap from my original note is closed in the happy path only. Suggest `endSSE(res, { sendDone: false })` when sendError is invoked from inside an active stream.
- GET /v1/responses/:id/input_items silently drops the `after` cursor. Route only forwards `limit`; the store implements `after`, but the route never reads it from the query string. Spec-compliant pagination re-fetches page 1 forever. One-line fix in the route handler.

Smaller nits:

- `storeEnabled = body['store'] !== false` accepts non-boolean truthy values. Tighter as `body['store'] === undefined || body['store'] === true`.
- `ResponsesStore.delete` returns true for expired-but-unpruned records → 404 inconsistency vs `get`.
- Configurable `ttlMs` on the store is silently ignored by routes that hardcode the default, so the banner lies.
- Banner uses logger.warn for an informational message; logger.info reads better.
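
The first nit is easy to see side by side. The two expressions below are taken from the review comment; the wrapper names `lenientStoreEnabled` and `strictStoreEnabled` are hypothetical.

```typescript
type Body = Record<string, unknown>;

// Original check: everything except the literal false enables storage.
const lenientStoreEnabled = (body: Body): boolean => body["store"] !== false;

// Suggested check: only an absent field or the literal true enables storage.
const strictStoreEnabled = (body: Body): boolean =>
  body["store"] === undefined || body["store"] === true;
```

For a body such as `{ "store": "false" }` the lenient check persists the response while the strict one does not. Whether a non-boolean value should instead be rejected with a 400 is a separate decision this sketch does not make.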

Also: please add tier1 + verify labels so the approval gate routes correctly.

lauripiisang · 2026-05-14T11:30:13Z

/review

Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id}, GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL store, X-QVAC-Stub: responses-volatile header, and startup banner.

…stats

- Approach (b): always include the assistant `message` item in `response.output[0]`, even when tool calls are present, so the streamed item tree matches `response.completed`.
- Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and the finalized `output[]`, fixing client-side accumulation by `item_id`.
- Use distinct `output_index` per tool call (1..n) and set `item_id` on `response.function_call_arguments.delta`/`.done` to the function-call item id (was the OpenAI `call_id`, causing collisions and wrong wiring).
- Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can satisfy tool calls instead of hanging in `requires_action` with no payload.
- Drop the duplicate `previous_response_id` lookup in `handlePostResponses`.
- Drop `parallel_tool_calls` from the unsupported-params log: it is honored.
- Recognise `function_call_output` (-> `tool` role) and `function_call` (-> synthesized assistant `<tool_call>` content) in `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained tool round-trips actually carry through `previous_response_id`.
- Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids.
- Surface real `usage.output_tokens` from `result.stats.generatedTokens` (Responses + chat.completions, blocking + streaming); fall back to word count when stats are missing. `input_tokens` stays 0 with an inline note that the SDK does not expose a prompt-token count today.
- Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape.

Tests: extend `responses.test.ts` and `translate.test.ts`; add `responses-streaming.test.ts` driving the new exported `writeStreamingResponse` / `writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.
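
The usage fallback described above might be sketched like this. `CompletionRunStats` and `generatedTokens` are named in the commit message; the helper name `buildUsage`, the optional typing of `generatedTokens`, and the `total_tokens` field are assumptions made for illustration.

```typescript
// Only generatedTokens is taken from the commit message; the rest is assumed.
interface CompletionRunStats {
  generatedTokens?: number;
}

interface Usage {
  input_tokens: number;
  output_tokens: number;
  total_tokens: number;
}

function buildUsage(text: string, stats?: CompletionRunStats): Usage {
  // Fallback: whitespace-separated word count of the generated text.
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
  const output = stats?.generatedTokens ?? words;
  // input_tokens stays 0: the SDK does not expose a prompt-token count today.
  return { input_tokens: 0, output_tokens: output, total_tokens: output };
}
```

The word count is only a rough stand-in for a token count, so clients should treat `output_tokens` as approximate whenever stats are missing.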

Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M has room for both its <think> block and the actual answer. The test exercises previous_response_id chain wiring; it should not depend on sampling luck or the model's reasoning length.

…history

Each StoredResponse.inputItems only carries that turn's NEW input (`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of depth >= 3 silently lost the grandparent turn:

    resp_1 input "A" -> output "X" (stored: ["A"])
    resp_2 prev=resp_1 input "B" history sent: [A, X, B] (stored: ["B"])
    resp_3 prev=resp_2 input "C" history sent: [B, Y, C] -- A and X gone

historyPrefixFromStoredResponse now walks the chain via responseObject.previous_response_id when given a resolver, prepending earlier turns oldest-first. Cap depth at 32 to bound work and protect against pathological cycles. Routes pass `(id) => store.get(id)` as the resolver. Legacy single-step callers still work unchanged when the resolver is omitted.

Tests:
- unit: depth-3 chain produces all six prefix entries in order; maxDepth cap honored.
- e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the word and recovers it -- would silently fail before this fix.
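
The chain walk can be illustrated with a simplified sketch. The types `StoredTurn` and `HistoryEntry` and the function `historyPrefix` are hypothetical simplifications (plain strings instead of item arrays), and how the real `historyPrefixFromStoredResponse` counts depth is an assumption; only the resolver idea, the oldest-first order, and the cap of 32 come from the commit message.

```typescript
interface HistoryEntry {
  role: "user" | "assistant";
  content: string;
}

// Simplified stand-in for a stored response (hypothetical shape).
interface StoredTurn {
  input: string; // that turn's NEW input only
  output: string;
  previousResponseId?: string;
}

type Resolver = (id: string) => StoredTurn | undefined;

function historyPrefix(start: StoredTurn, resolve?: Resolver, maxDepth = 32): HistoryEntry[] {
  const turns: StoredTurn[] = [start];
  let current = start;
  // Walk toward the oldest ancestor; the cap also stops pathological cycles.
  while (resolve && turns.length < maxDepth && current.previousResponseId !== undefined) {
    const parent = resolve(current.previousResponseId);
    if (parent === undefined) break; // evicted or expired ancestor
    turns.push(parent);
    current = parent;
  }
  // Emit oldest-first so the model sees turns in conversation order.
  const history: HistoryEntry[] = [];
  for (const turn of turns.reverse()) {
    history.push({ role: "user", content: turn.input });
    history.push({ role: "assistant", content: turn.output });
  }
  return history;
}
```

Omitting the resolver reproduces the legacy single-step behaviour, which is how the grandparent turn was lost before the fix.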

…ter cursor)

Two surfaced post-rebase:

1. sendError gained an opt-in { sseSentinel: false } so callers inside an active stream can suppress the trailing `data: [DONE]\n\n` after the `response.error` SSE event. Responses streaming error path now passes it, closing the gap that the happy path already handled (response.completed already used endSSE({ sentinel: false })).
2. GET /v1/responses/:id/input_items now reads the `after` cursor from the query string in addition to `limit`. Spec-compliant pagination would have re-fetched page 1 forever; the store already implemented the cursor. Added a store-level pagination test that walks all pages by `last_id`.
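
Why a dropped `after` cursor loops forever is easiest to see in a small pagination sketch. The names `listPage` and `walkAll` and the page shape are hypothetical; `limit`, `after`, and `last_id` come from the commit message, while `has_more` is assumed here to mirror the usual OpenAI list shape.

```typescript
interface InputItem {
  id: string;
}

interface Page {
  data: InputItem[];
  last_id: string | null;
  has_more: boolean; // assumed field
}

function listPage(items: InputItem[], limit: number, after?: string): Page {
  // Start right after the cursor item; in this sketch an unknown cursor
  // falls back to the beginning because findIndex returns -1.
  const start = after === undefined ? 0 : items.findIndex((i) => i.id === after) + 1;
  const data = items.slice(start, start + limit);
  return {
    data,
    last_id: data[data.length - 1]?.id ?? null,
    has_more: start + data.length < items.length,
  };
}

// Walks every page by feeding last_id back in as the after cursor.
function walkAll(items: InputItem[], limit: number): string[] {
  const seen: string[] = [];
  let after: string | undefined;
  for (;;) {
    const page = listPage(items, limit, after);
    for (const item of page.data) seen.push(item.id);
    if (!page.has_more || page.last_id === null) return seen;
    after = page.last_id;
  }
}
```

If a route forwards only `limit` and never passes `after`, every call behaves like `listPage(items, limit)` and a client following `last_id` keeps receiving the first page.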

kinsta · 2026-05-14T13:33:19Z

Preview deployments for qvac-docs-staging ⚡️

| Status | Branch preview | Commit preview |
| --- | --- | --- |
| 🔁 Deploying... | N/A | N/A |

Commit: 71c4a9215f1bad766c5f9f624da8fc31b4a085ed

Deployment ID: 04856c2a-a684-42f8-ac60-92d31c1082a2

Static site name: qvac-docs-staging-fazwv

lauripiisang · 2026-05-14T13:33:53Z

/review

#2030)

* QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store
* fix: align Responses streaming with finalized response and add usage stats
* test[skiplog]: stabilize Responses chain e2e for tiny reasoning model
* fix: walk previous_response_id chain so multi-turn keeps grandparent history
* fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README)
* fix: address Simon review nits (stream error sentinel, input_items after cursor)

lauripiisang requested review from a team as code owners May 13, 2026 15:59

lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from b9ecc6a to 3e16fd9 Compare May 13, 2026 18:15

This comment was marked as resolved.

lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from 7660dc7 to 4bdfa43 Compare May 14, 2026 05:36

simon-iribarren previously approved these changes May 14, 2026

NamelsKing approved these changes May 14, 2026

NamelsKing previously approved these changes May 14, 2026

lauripiisang dismissed stale reviews from NamelsKing and simon-iribarren via 342b12e May 14, 2026 10:57

lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from 7fc2459 to 342b12e Compare May 14, 2026 10:57

lauripiisang added verify tier1 labels May 14, 2026

NamelsKing previously approved these changes May 14, 2026

lauripiisang added 6 commits May 14, 2026 16:45

lauripiisang dismissed NamelsKing’s stale review via 86bde53 May 14, 2026 12:47

lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from 0500c1b to 86bde53 Compare May 14, 2026 12:47

NamelsKing approved these changes May 14, 2026

simon-iribarren approved these changes May 14, 2026

Merge branch 'main' into feat/QVAC-18733-v1-responses-stateless

71c4a92

kinsta Bot deployed to preview May 14, 2026 13:33 View deployment

lauripiisang merged commit 0741eb5 into main May 14, 2026
12 checks passed

lauripiisang deleted the feat/QVAC-18733-v1-responses-stateless branch May 14, 2026 13:34

simon-iribarren mentioned this pull request May 15, 2026

chore[notask|skiplog]: release @qvac/cli v0.5.0 #2081

Merged

lauripiisang merged 7 commits into main from feat/QVAC-18733-v1-responses-stateless


3 participants
