Skip to content

QVAC-18733 feat[api]: add openai responses routes with in-memory store#2030

Merged
lauripiisang merged 7 commits into
mainfrom
feat/QVAC-18733-v1-responses-stateless
May 14, 2026
Merged

QVAC-18733 feat[api]: add openai responses routes with in-memory store#2030
lauripiisang merged 7 commits into
mainfrom
feat/QVAC-18733-v1-responses-stateless

Conversation

@lauripiisang

@lauripiisang lauripiisang commented May 13, 2026

Copy link
Copy Markdown
Contributor

Note: be concise and prefer bullet points.

What problem does this PR solve?

  • OpenAI-compatible clients expect /v1/responses (blocking, SSE streaming, retrieval by id, and previous_response_id chaining). The CLI HTTP server only exposed chat/completions-style paths, so those clients could not target QVAC through the same surface.

How does it solve it?

  • Implements POST /v1/responses (blocking and stream: true SSE) by translating the Responses request into the existing sdkCompletion path used for chat-style inference — same models and streaming behavior as the rest of the OpenAI adapter.
  • Adds GET / DELETE /v1/responses/{id} and GET /v1/responses/{id}/input_items backed by a small in-memory store (LRU + TTL; default on; store: false skips persistence).
  • Supports previous_response_id by prefixing stored prior output into the completion history; returns 404 with previous_response_not_found when the prior id is missing.
  • Surfaces X-QVAC-Stub: responses-volatile on these routes and logs a one-line startup warning so operators know state is not durable.

OpenAI API compatibility (what matches)

  • Same URL shape and verbs for create, retrieve, delete, and input listing; SSE event framing for streaming where applicable.
  • Request fields the adapter understands are translated to QVAC completion calls; response objects and stream events follow the same general shape clients expect for basic text / tool-call flows wired today.

Caveats / gaps (notable)

  • State is volatile: process memory only; restart or eviction loses ids — not OpenAI’s hosted persistence.
  • Intentionally rejected with 400: conversation, background: true, and built-in tools (e.g. web_search) not mapped to local inference yet.
  • Full parity with every OpenAI Responses option, tool type, and edge case is not claimed; unknown or unsupported combinations should fail loudly rather than silently diverge.

How was it tested?

  • packages/cli: npm run lint, npm run test:unit, npm run test:bats, npm run test:e2e (e2e covers blocking + SSE + store + chain + validation paths in e2e.bats).

API Changes

# Blocking create
curl -sS "$BASE/v1/responses" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"'"$MODEL"'","input":"ping","store":true}'

# Streaming
curl -sN "$BASE/v1/responses" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"'"$MODEL"'","input":"ping","stream":true,"store":true}'

# Chained follow-up (after capturing response id from prior call)
curl -sS "$BASE/v1/responses" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"'"$MODEL"'","input":"and now?","previous_response_id":"resp_..."}'

@lauripiisang lauripiisang requested review from a team as code owners May 13, 2026 15:59
@lauripiisang lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from b9ecc6a to 3e16fd9 Compare May 13, 2026 18:15
simon-iribarren

This comment was marked as resolved.

@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@lauripiisang lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from 7660dc7 to 4bdfa43 Compare May 14, 2026 05:36
lauripiisang added a commit that referenced this pull request May 14, 2026
…ax_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.
@lauripiisang

Copy link
Copy Markdown
Contributor Author

@simon-iribarren comments resolved in latest commits, please see

@simon-iribarren simon-iribarren left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve — my earlier notes were addressed in c6beee725 (chain walk) and 7fc2459f6 (SSE [DONE], dup response.in_progress, ?? true, max_tokens warn, README). 6 of 8 fully closed; the remaining 2 (client-disconnect, byte-size cap) are correctly framed as follow-up — no objection.

Two new items found on top, neither a blocker:

  1. [DONE] re-emerges on the streaming error path. sendError's default endSSE(res) writes data: [DONE]\n\n after a response.error event — closes the gap from my original note in the happy path only. Suggest endSSE(res, { sendDone: false }) when sendError is invoked from inside an active stream.
  2. GET /v1/responses/:id/input_items silently drops the after cursor. Route only forwards limit; the store implements after, but the route never reads it from the query string. Spec-compliant pagination re-fetches page 1 forever. One-line fix in the route handler.

Smaller nits:

  • storeEnabled = body['store'] !== false accepts non-boolean truthy values. Tighter as body['store'] === undefined || body['store'] === true.
  • ResponsesStore.delete returns true for expired-but-unpruned records → 404 inconsistency vs get.
  • Configurable ttlMs on the store is silently ignored by routes that hardcode the default — the banner lies.
  • Banner uses logger.warn for an informational message; logger.info reads better.

Also: please add tier1 + verify labels so the approval gate routes correctly.

NamelsKing
NamelsKing previously approved these changes May 14, 2026
lauripiisang added a commit that referenced this pull request May 14, 2026
…ax_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.
@lauripiisang lauripiisang dismissed stale reviews from NamelsKing and simon-iribarren via 342b12e May 14, 2026 10:57
@lauripiisang lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from 7fc2459 to 342b12e Compare May 14, 2026 10:57
@lauripiisang

Copy link
Copy Markdown
Contributor Author

/review

NamelsKing
NamelsKing previously approved these changes May 14, 2026
Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id},
GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL
store, X-QVAC-Stub: responses-volatile header, and startup banner.
…stats

- Approach (b): always include the assistant `message` item in `response.output[0]`,
  even when tool calls are present, so the streamed item tree matches `response.completed`.
- Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and
  the finalized `output[]`, fixing client-side accumulation by `item_id`.
- Use distinct `output_index` per tool call (1..n) and set `item_id` on
  `response.function_call_arguments.delta`/`.done` to the function-call item id
  (was the OpenAI `call_id`, causing collisions and wrong wiring).
- Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can
  satisfy tool calls instead of hanging in `requires_action` with no payload.
- Drop the duplicate `previous_response_id` lookup in `handlePostResponses`.
- Drop `parallel_tool_calls` from the unsupported-params log: it is honored.
- Recognise `function_call_output` (-> `tool` role) and `function_call`
  (-> synthesized assistant `<tool_call>` content) in
  `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained
  tool round-trips actually carry through `previous_response_id`.
- Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids.
- Surface real `usage.output_tokens` from `result.stats.generatedTokens`
  (Responses + chat.completions, blocking + streaming); fall back to word count
  when stats are missing. `input_tokens` stays 0 with an inline note that the SDK
  does not expose a prompt-token count today.
- Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape.

Tests: extend `responses.test.ts` and `translate.test.ts`; add
`responses-streaming.test.ts` driving the new exported `writeStreamingResponse` /
`writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.
Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M
has room for both its <think> block and the actual answer. The test
exercises previous_response_id chain wiring; it should not depend on
sampling luck or the model's reasoning length.
…history

Each StoredResponse.inputItems only carries that turn's NEW input
(`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of
depth >= 3 silently lost the grandparent turn:

  resp_1 input "A" -> output "X"        (stored: ["A"])
  resp_2 prev=resp_1 input "B"           history sent: [A, X, B]
                                         (stored: ["B"])
  resp_3 prev=resp_2 input "C"           history sent: [B, Y, C]  -- A and X gone

historyPrefixFromStoredResponse now walks the chain via
responseObject.previous_response_id when given a resolver, prepending
earlier turns oldest-first. Cap depth at 32 to bound work and protect
against pathological cycles. Routes pass `(id) => store.get(id)` as the
resolver. Legacy single-step callers still work unchanged when the
resolver is omitted.

Tests:
- unit: depth-3 chain produces all six prefix entries in order; maxDepth
  cap honored.
- e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the
  word and recovers it -- would silently fail before this fix.
…ax_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.
…ter cursor)

Two surfaced post-rebase:

1. sendError gained an opt-in { sseSentinel: false } so callers inside an
   active stream can suppress the trailing `data: [DONE]\n\n` after the
   `response.error` SSE event. Responses streaming error path now passes
   it, closing the gap that the happy path already handled (response.completed
   already used endSSE({ sentinel: false })).

2. GET /v1/responses/:id/input_items now reads the `after` cursor from the
   query string in addition to `limit`. Spec-compliant pagination would have
   re-fetched page 1 forever; the store already implemented the cursor.
   Added a store-level pagination test that walks all pages by `last_id`.
@lauripiisang lauripiisang force-pushed the feat/QVAC-18733-v1-responses-stateless branch from 0500c1b to 86bde53 Compare May 14, 2026 12:47
@kinsta

kinsta Bot commented May 14, 2026

Copy link
Copy Markdown

Preview deployments for qvac-docs-staging ⚡️

Status Branch preview Commit preview
🔁 Deploying... N/A N/A

Commit: 71c4a9215f1bad766c5f9f624da8fc31b4a085ed

Deployment ID: 04856c2a-a684-42f8-ac60-92d31c1082a2

Static site name: qvac-docs-staging-fazwv

@lauripiisang

Copy link
Copy Markdown
Contributor Author

/review

@lauripiisang lauripiisang merged commit 0741eb5 into main May 14, 2026
12 checks passed
@lauripiisang lauripiisang deleted the feat/QVAC-18733-v1-responses-stateless branch May 14, 2026 13:34
Proletter pushed a commit that referenced this pull request May 24, 2026
#2030)

* QVAC-18733 feat[api]: add OpenAI Responses routes with in-memory store

Implement POST /v1/responses (blocking + SSE), GET/DELETE /v1/responses/{id},
GET /v1/responses/{id}/input_items, previous_response_id chaining, LRU+TTL
store, X-QVAC-Stub: responses-volatile header, and startup banner.

* fix: align Responses streaming with finalized response and add usage stats

- Approach (b): always include the assistant `message` item in `response.output[0]`,
  even when tool calls are present, so the streamed item tree matches `response.completed`.
- Pre-allocate `msgItemId` and `fcItemIds` once and reuse them across SSE events and
  the finalized `output[]`, fixing client-side accumulation by `item_id`.
- Use distinct `output_index` per tool call (1..n) and set `item_id` on
  `response.function_call_arguments.delta`/`.done` to the function-call item id
  (was the OpenAI `call_id`, causing collisions and wrong wiring).
- Populate `required_action.submit_tool_outputs.tool_calls` so OpenAI clients can
  satisfy tool calls instead of hanging in `requires_action` with no payload.
- Drop the duplicate `previous_response_id` lookup in `handlePostResponses`.
- Drop `parallel_tool_calls` from the unsupported-params log: it is honored.
- Recognise `function_call_output` (-> `tool` role) and `function_call`
  (-> synthesized assistant `<tool_call>` content) in
  `openaiResponsesInputToHistory` and `historyPrefixFromStoredResponse` so chained
  tool round-trips actually carry through `previous_response_id`.
- Use `crypto.randomUUID()` for `resp_`/`msg_`/`fc_`/input-item ids.
- Surface real `usage.output_tokens` from `result.stats.generatedTokens`
  (Responses + chat.completions, blocking + streaming); fall back to word count
  when stats are missing. `input_tokens` stays 0 with an inline note that the SDK
  does not expose a prompt-token count today.
- Tighten `CompletionResult.stats` to a structured `CompletionRunStats` shape.

Tests: extend `responses.test.ts` and `translate.test.ts`; add
`responses-streaming.test.ts` driving the new exported `writeStreamingResponse` /
`writeBlockingResponse` helpers with a fake `CompletionResult` and `ServerResponse`.

* test[skiplog]: stabilize Responses chain e2e for tiny reasoning model

Pin temperature=0 + seed and bump max_output_tokens to 512 so Qwen3-600M
has room for both its <think> block and the actual answer. The test
exercises previous_response_id chain wiring; it should not depend on
sampling luck or the model's reasoning length.

* fix: walk previous_response_id chain so multi-turn keeps grandparent history

Each StoredResponse.inputItems only carries that turn's NEW input
(`normalizeResponsesInputItemsForStorage(body['input'])`), so a chain of
depth >= 3 silently lost the grandparent turn:

  resp_1 input "A" -> output "X"        (stored: ["A"])
  resp_2 prev=resp_1 input "B"           history sent: [A, X, B]
                                         (stored: ["B"])
  resp_3 prev=resp_2 input "C"           history sent: [B, Y, C]  -- A and X gone

historyPrefixFromStoredResponse now walks the chain via
responseObject.previous_response_id when given a resolver, prepending
earlier turns oldest-first. Cap depth at 32 to bound work and protect
against pathological cycles. Routes pass `(id) => store.get(id)` as the
resolver. Legacy single-step callers still work unchanged when the
resolver is omitted.

Tests:
- unit: depth-3 chain produces all six prefix entries in order; maxDepth
  cap honored.
- e2e: resp_1 sets "code word is XYZZY", resp_2 acks, resp_3 asks for the
  word and recovers it -- would silently fail before this fix.

* fix: address Responses review nits (SSE sentinel, dup event, types, max_tokens warn, README)

Five low-severity items from PR #2030 review:

- Drop the `data: [DONE]` sentinel on `/v1/responses` SSE: spec ends on
  `response.completed`. Adds an `EndSSEOptions { sentinel?: boolean }`
  knob to `endSSE` so chat-completions keeps its existing sentinel and
  Responses opts out via `endSSE(res, { sentinel: false })`. E2E flips
  the assertion accordingly.
- Drop the duplicate `response.in_progress` event emitted back-to-back
  with `response.created` (same payload, no state transition — strict
  parsers can choke).
- Tighten `BuildResponseObjectParams.parallelToolCalls` from
  `boolean | undefined` to `boolean` (the route already resolves a
  default before calling), eliminating a dead `?? true` fallback.
- Warn on `max_tokens` for /v1/responses (spec field is
  `max_output_tokens`); still accepted as a fallback so existing clients
  don't break, but they get a logger.warn nudge.
- README: add a "serve openai" section listing all routes and a
  Responses subsection that documents volatility, the
  `X-QVAC-Stub` header, the `store: false` opt-out, and curl examples.
  The README previously listed no openai-compat endpoints at all.

Skipped from the review:
- #2 (no client-disconnect handling in streaming): pre-existing gap
  shared with /v1/chat/completions, reviewer marked out of scope.
- #7 (per-entry byte-size cap on the in-memory store): reviewer marked
  follow-up; `maxEntries` + TTL still bound memory pressure for the
  local-first single-user target audience.

* fix: address Simon review nits (stream error sentinel, input_items after cursor)

Two surfaced post-rebase:

1. sendError gained an opt-in { sseSentinel: false } so callers inside an
   active stream can suppress the trailing `data: [DONE]\n\n` after the
   `response.error` SSE event. Responses streaming error path now passes
   it, closing the gap that the happy path already handled (response.completed
   already used endSSE({ sentinel: false })).

2. GET /v1/responses/:id/input_items now reads the `after` cursor from the
   query string in addition to `limit`. Spec-compliant pagination would have
   re-fetched page 1 forever; the store already implemented the cursor.
   Added a store-level pagination test that walks all pages by `last_id`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants