QVAC-18733 feat[api]: add openai responses routes with in-memory store#2030
Merged
Conversation
b9ecc6a to
3e16fd9
Compare
Contributor
Tier-based Approval Status |
7660dc7 to
4bdfa43
Compare
lauripiisang
added a commit
that referenced
this pull request
May 14, 2026
…ax_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
Contributor
Author
|
@simon-iribarren comments resolved in latest commits, please see |
simon-iribarren
previously approved these changes
May 14, 2026
simon-iribarren
left a comment
Contributor
There was a problem hiding this comment.
Approve — my earlier notes were addressed in c6beee725 (chain walk) and 7fc2459f6 (SSE [DONE], dup response.in_progress, ?? true, max_tokens warn, README). 6 of 8 fully closed; the remaining 2 (client-disconnect, byte-size cap) are correctly framed as follow-up — no objection.
Two new items found on top, neither a blocker:
[DONE]re-emerges on the streaming error path.sendError's defaultendSSE(res)writesdata: [DONE]\n\nafter aresponse.errorevent — closes the gap from my original note in the happy path only. SuggestendSSE(res, { sendDone: false })whensendErroris invoked from inside an active stream.GET /v1/responses/:id/input_itemssilently drops theaftercursor. Route only forwardslimit; the store implementsafter, but the route never reads it from the query string. Spec-compliant pagination re-fetches page 1 forever. One-line fix in the route handler.
Smaller nits:
storeEnabled = body['store'] !== falseaccepts non-boolean truthy values. Tighter asbody['store'] === undefined || body['store'] === true.ResponsesStore.deletereturnstruefor expired-but-unpruned records → 404 inconsistency vsget.- Configurable
ttlMson the store is silently ignored by routes that hardcode the default — the banner lies. - Banner uses
logger.warnfor an informational message;logger.inforeads better.
Also: please add tier1 + verify labels so the approval gate routes correctly.
NamelsKing
approved these changes
May 14, 2026
NamelsKing
previously approved these changes
May 14, 2026
lauripiisang
added a commit
that referenced
this pull request
May 14, 2026
…ax_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
342b12e
7fc2459 to
342b12e
Compare
Contributor
Author
|
/review |
NamelsKing
previously approved these changes
May 14, 2026
Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id},
GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL
store, X-QVAC-Stub: responses-volatile header, and startup banner.
…stats - Approach (b): always include the assistant `message` item in `response.output[0]`, even when tool calls are present, so the streamed item tree matches `response.completed`. - Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and the finalized `output[]`, fixing client-side accumulation by `item_id`. - Use distinct `output_index` per tool call (1..n) and set `item_id` on `response.function_call_arguments.delta`/`.done` to the function-call item id (was the OpenAI `call_id`, causing collisions and wrong wiring). - Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can satisfy tool calls instead of hanging in `requires_action` with no payload. - Drop the duplicate `previous_response_id` lookup in `handlePostResponses`. - Drop `parallel_tool_calls` from the unsupported-params log: it is honored. - Recognise `function_call_output` (-> `tool` role) and `function_call` (-> synthesized assistant `<tool_call>` content) in `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained tool round-trips actually carry through `previous_response_id`. - Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids. - Surface real `usage.output_tokens` from `result.stats.generatedTokens` (Responses + chat.completions, blocking + streaming); fall back to word count when stats are missing. `input_tokens` stays 0 with an inline note that the SDK does not expose a prompt-token count today. - Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape. Tests: extend `responses.test.ts` and `translate.test.ts`; add `responses-streaming.test.ts` driving the new exported `writeStreamingResponse` / `writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.
Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M has room for both its <think> block and the actual answer. The test exercises previous_response_id chain wiring; it should not depend on sampling luck or the model's reasoning length.
…history
Each StoredResponse.inputItems only carries that turn's NEW input
(`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of
depth >= 3 silently lost the grandparent turn:
resp_1 input "A" -> output "X" (stored: ["A"])
resp_2 prev=resp_1 input "B" history sent: [A, X, B]
(stored: ["B"])
resp_3 prev=resp_2 input "C" history sent: [B, Y, C] -- A and X gone
historyPrefixFromStoredResponse now walks the chain via
responseObject.previous_response_id when given a resolver, prepending
earlier turns oldest-first. Cap depth at 32 to bound work and protect
against pathological cycles. Routes pass `(id) => store.get(id)` as the
resolver. Legacy single-step callers still work unchanged when the
resolver is omitted.
Tests:
- unit: depth-3 chain produces all six prefix entries in order; maxDepth
cap honored.
- e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the
word and recovers it -- would silently fail before this fix.
…ax_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience.
…ter cursor)
Two surfaced post-rebase:
1. sendError gained an opt-in { sseSentinel: false } so callers inside an
active stream can suppress the trailing `data: [DONE]\n\n` after the
`response.error` SSE event. Responses streaming error path now passes
it, closing the gap that the happy path already handled (response.completed
already used endSSE({ sentinel: false })).
2. GET /v1/responses/:id/input_items now reads the `after` cursor from the
query string in addition to `limit`. Spec-compliant pagination would have
re-fetched page 1 forever; the store already implemented the cursor.
Added a store-level pagination test that walks all pages by `last_id`.
0500c1b to
86bde53
Compare
NamelsKing
approved these changes
May 14, 2026
simon-iribarren
approved these changes
May 14, 2026
Preview deployments for qvac-docs-staging ⚡️
Commit: Deployment ID: Static site name: |
Contributor
Author
|
/review |
Proletter
pushed a commit
that referenced
this pull request
May 24, 2026
#2030) * QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id}, GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL store, X-QVAC-Stub: responses-volatile header, and startup banner. * fix: align Responses streaming with finalized response and add usage stats - Approach (b): always include the assistant `message` item in `response.output[0]`, even when tool calls are present, so the streamed item tree matches `response.completed`. - Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and the finalized `output[]`, fixing client-side accumulation by `item_id`. - Use distinct `output_index` per tool call (1..n) and set `item_id` on `response.function_call_arguments.delta`/`.done` to the function-call item id (was the OpenAI `call_id`, causing collisions and wrong wiring). - Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can satisfy tool calls instead of hanging in `requires_action` with no payload. - Drop the duplicate `previous_response_id` lookup in `handlePostResponses`. - Drop `parallel_tool_calls` from the unsupported-params log: it is honored. - Recognise `function_call_output` (-> `tool` role) and `function_call` (-> synthesized assistant `<tool_call>` content) in `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained tool round-trips actually carry through `previous_response_id`. - Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids. - Surface real `usage.output_tokens` from `result.stats.generatedTokens` (Responses + chat.completions, blocking + streaming); fall back to word count when stats are missing. `input_tokens` stays 0 with an inline note that the SDK does not expose a prompt-token count today. - Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape. Tests: extend `responses.test.ts` and `translate.test.ts`; add `responses-streaming.test.ts` driving the new exported `writeStreamingResponse` / `writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`. * test[skiplog]: stabilize Responses chain e2e for tiny reasoning model Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M has room for both its <think> block and the actual answer. The test exercises previous_response_id chain wiring; it should not depend on sampling luck or the model's reasoning length. * fix: walk previous_response_id chain so multi-turn keeps grandparent history Each StoredResponse.inputItems only carries that turn's NEW input (`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of depth >= 3 silently lost the grandparent turn: resp_1 input "A" -> output "X" (stored: ["A"]) resp_2 prev=resp_1 input "B" history sent: [A, X, B] (stored: ["B"]) resp_3 prev=resp_2 input "C" history sent: [B, Y, C] -- A and X gone historyPrefixFromStoredResponse now walks the chain via responseObject.previous_response_id when given a resolver, prepending earlier turns oldest-first. Cap depth at 32 to bound work and protect against pathological cycles. Routes pass `(id) => store.get(id)` as the resolver. Legacy single-step callers still work unchanged when the resolver is omitted. Tests: - unit: depth-3 chain produces all six prefix entries in order; maxDepth cap honored. - e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the word and recovers it -- would silently fail before this fix. * fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README) Five low-severity items from PR #2030 review: - Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }` knob to `endSSE` so chat-completions keeps its existing sentinel and Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips the assertion accordingly. - Drop the duplicate `response.in_progress` event emitted back-to-back with `response.created` (same payload, no state transition — strict parsers can choke). - Tighten `BuildResponseObjectParams.parallelToolCalls` from `boolean | undefined` to `boolean` (the route already resolves a default before calling), eliminating a dead `?? true` fallback. - Warn on `max_tokens` for /v1/responses (spec field is `max_output_tokens`); still accepted as a fallback so existing clients don't break, but they get a logger.warn nudge. - README: add a "serve openai" section listing all routes and a Responses subsection that documents volatility, the `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples. The README previously listed no openai-compat endpoints at all. Skipped from the review: - #2 (no client-disconnect handling in streaming): pre-existing gap shared with /v1/chat/completions, reviewer marked out of scope. - #7 (per-entry byte-size cap on the in-memory store): reviewer marked follow-up; `maxEntries` + TTL still bound memory pressure for the local-first single-user target audience. * fix: address Simon review nits (stream error sentinel, input_items after cursor) Two surfaced post-rebase: 1. sendError gained an opt-in { sseSentinel: false } so callers inside an active stream can suppress the trailing `data: [DONE]\n\n` after the `response.error` SSE event. Responses streaming error path now passes it, closing the gap that the happy path already handled (response.completed already used endSSE({ sentinel: false })). 2. GET /v1/responses/:id/input_items now reads the `after` cursor from the query string in addition to `limit`. Spec-compliant pagination would have re-fetched page 1 forever; the store already implemented the cursor. Added a store-level pagination test that walks all pages by `last_id`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note: be concise and prefer bullet points.
What problem does this PR solve?
/v1/responses(blocking, SSE streaming, retrieval by id, andprevious_response_idchaining). The CLI HTTP server only exposed chat/completions-style paths, so those clients could not target QVAC through the same surface.How does it solve it?
POST /v1/responses(blocking andstream: trueSSE) by translating the Responses request into the existingsdkCompletionpath used for chat-style inference — same models and streaming behavior as the rest of the OpenAI adapter.GET/DELETE/v1/responses/{id}andGET /v1/responses/{id}/input_itemsbacked by a small in-memory store (LRU + TTL; default on;store: falseskips persistence).previous_response_idby prefixing stored prior output into the completion history; returns404withprevious_response_not_foundwhen the prior id is missing.X-QVAC-Stub: responses-volatileon these routes and logs a one-line startup warning so operators know state is not durable.OpenAI API compatibility (what matches)
Caveats / gaps (notable)
400:conversation,background: true, and built-in tools (e.g.web_search) not mapped to local inference yet.How was it tested?
packages/cli:npm run lint,npm run test:unit,npm run test:bats,npm run test:e2e(e2e covers blocking + SSE + store + chain + validation paths ine2e.bats).API Changes