
server: improve Responses API compliance and Codex CLI compatibility#21174

Draft
krystophny wants to merge 7 commits into ggml-org:master from krystophny:responses-api-codex-compat

Conversation


@krystophny krystophny commented Mar 30, 2026

Summary

Improve Responses API (/v1/responses) compliance and Codex CLI compatibility.

Response object (non-streaming and streaming):

  • Add 24 missing fields per OpenAI Responses API spec (tools, truncation, temperature, top_p, metadata, store, service_tier, etc.)
  • Fix function_call id/call_id field mapping (id gets unique fc_ ID, call_id gets the model's tool_call.id)
  • Add output_text top-level convenience field
  • Add output_tokens_details with reasoning_tokens to usage
  • Restore refusal content type handling (was broken in upstream — unreachable code after throw)
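
The id/call_id mapping described above can be sketched as follows (a minimal Python illustration with a hypothetical helper name, not the server's actual C++ code):

```python
import secrets

def build_function_call_item(tool_call: dict) -> dict:
    # `id` is a fresh fc_-prefixed output item identifier; `call_id` carries
    # the model's original tool_call.id so clients can correlate tool results.
    return {
        "type": "function_call",
        "id": "fc_" + secrets.token_hex(12),
        "call_id": tool_call["id"],
        "name": tool_call["function"]["name"],
        "arguments": tool_call["function"]["arguments"],
        "status": "completed",
    }
```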

Streaming events:

  • Add sequence_number to ALL streaming events (created, in_progress, added, delta, done, completed)
  • Add output_index to all events referencing output items
  • Add content_index to content-related events
  • Populate full response object in response.created and response.in_progress (was only {id, object, status})
  • Function call item IDs consistent across output_item.added, function_call_arguments.delta, and output_item.done
  • Counter state persisted across streaming chunks via task_result_state
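
The sequence-number bookkeeping can be sketched like this (illustrative Python; the real counter lives in the C++ server's task_result_state):

```python
from dataclasses import dataclass

@dataclass
class StreamState:
    # counter persisted across streaming chunks
    sequence_number: int = 0

def emit_event(state: StreamState, event_type: str, payload: dict) -> dict:
    # every streaming event gets the next strictly increasing sequence_number
    event = {"type": event_type, "sequence_number": state.sequence_number, **payload}
    state.sequence_number += 1
    return event
```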

Codex CLI compatibility:

  • Skip non-function tool types (web_search, code_interpreter) instead of rejecting with 400
  • Merge developer/system role messages into first system message for templates requiring system at position 0 (e.g. Qwen)
  • Strip Responses-only request keys (store, include, prompt_cache_key, web_search, text, truncation, metadata)
  • Accept input_text type alongside output_text for EasyInputMessage / AssistantMessageItemParam
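
A rough sketch of the key stripping, tool filtering, and role merging described above (illustrative Python under assumed names; the actual change is in the C++ request parsing):

```python
RESPONSES_ONLY_KEYS = {"store", "include", "prompt_cache_key",
                       "web_search", "text", "truncation", "metadata"}

def sanitize_request(req: dict) -> dict:
    # drop Responses-only keys; keep only function tools, silently
    # skipping web_search, code_interpreter, etc. instead of rejecting
    req = {k: v for k, v in req.items() if k not in RESPONSES_ONLY_KEYS}
    if "tools" in req:
        req["tools"] = [t for t in req["tools"] if t.get("type") == "function"]
    return req

def merge_system_messages(messages: list) -> list:
    # merge developer/system messages into one system message at index 0,
    # for templates (e.g. Qwen) that require system at position 0
    sys_parts = [m["content"] for m in messages if m["role"] in ("system", "developer")]
    rest = [m for m in messages if m["role"] not in ("system", "developer")]
    if sys_parts:
        return [{"role": "system", "content": "\n".join(sys_parts)}] + rest
    return rest
```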

Prior art: #19720 by @riskywindow (stale, 500+ commits behind). This PR incorporates applicable ideas adapted to the current codebase.

Fixes #19138. Related: #20156, #20733, #20607.

Verification

pytest (14 tests, tinyllama2)

$ LLAMA_SERVER_BIN_PATH=./build/bin/llama-server pytest unit/test_compat_oai_responses.py -v
test_responses_with_openai_library PASSED
test_responses_stream_with_openai_library PASSED
test_responses_schema_fields PASSED
test_responses_stream_schema_fields PASSED
test_responses_non_function_tool_skipped PASSED
test_responses_only_non_function_tools_same_as_no_tools PASSED
test_responses_extra_keys_stripped PASSED
test_responses_developer_role_merging PASSED
test_responses_input_text_type_multi_turn PASSED
test_responses_output_text_matches_content PASSED
test_responses_stream_output_text_consistency PASSED
test_responses_stream_created_event_has_full_response PASSED
test_responses_stream_all_events_have_sequence_number PASSED
test_responses_stream_delta_events_have_indices PASSED
14 passed

E2E with async OpenAI SDK + Qwen3.5-9B Q4_K_M

$ python3 e2e_test.py  # uses AsyncOpenAI against localhost:8080
Test 1: Non-streaming basic          OK: output_text='4', fields=36
Test 2: Streaming basic              OK: 205 events, gathered text matches
Test 3: Non-function tools skipped   OK: status=completed
Test 4: Developer role merging       OK
Test 5: Multi-turn with input_text   OK
Test 7: Streaming seq_nums           OK: 105 events, strictly increasing
Test 6: Concurrent stress (5 req)    OK: 5/5 completed, 0 failures
ALL E2E TESTS PASSED

Codex CLI E2E

$ codex exec -p local "Say hello in one word"
Hello
tokens used: 6,119

$ codex exec -p local "Run 'echo hello world' and show the output"
exec: /bin/zsh -lc 'echo hello world' succeeded
hello world
tokens used: 1,054

Test plan

  • 14 pytest tests covering all code paths (schema, streaming, tools, roles, keys)
  • Async OpenAI SDK: non-streaming, streaming, concurrent stress
  • Streaming events: sequence_number strictly increasing, output_index/content_index present
  • response.created has full response object with all fields
  • Non-function tools silently skipped (200 not 400)
  • Developer/system messages merged correctly
  • Codex CLI connects and works (text + tool calling)
  • Function call IDs consistent between added/done streaming events
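
The sequence_number property in the test plan can be checked with a helper along these lines (sketch):

```python
def assert_monotonic_sequence_numbers(events: list) -> None:
    # KeyError here means some event is missing sequence_number entirely
    seqs = [e["sequence_number"] for e in events]
    assert all(a < b for a, b in zip(seqs, seqs[1:])), \
        "sequence_number not strictly increasing"
```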

If reviewers prefer, this can be split into smaller PRs for faster review. Let me know.

Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input

Verified: codex -p local and codex -p fast work against local
llama.cpp with Qwen3.5 models including native tool calling.

Refs: ggml-org#19138, ggml-org#19720
@krystophny krystophny requested a review from a team as a code owner March 30, 2026 07:54
@ngxson
Contributor

ngxson commented Mar 30, 2026

please add proper tests for it

@github-actions github-actions bot added the python python script changes label Mar 30, 2026
@krystophny krystophny force-pushed the responses-api-codex-compat branch 2 times, most recently from f1210e4 to 530e13c on March 30, 2026 12:51
Add 8 new tests covering the changes in this PR:

- test_responses_schema_fields: verify all 24+ Response object fields
- test_responses_stream_schema_fields: verify sequence_number,
  output_index, content_index on streaming events
- test_responses_non_function_tool_skipped: web_search/code_interpreter
  tool types return 200 instead of 400
- test_responses_mixed_tool_types: non-function tools filtered,
  function tools retained (not rejected at parsing layer)
- test_responses_extra_keys_stripped: store, include, prompt_cache_key,
  web_search, text, truncation, metadata don't cause errors
- test_responses_developer_role: developer messages merged into system
- test_responses_input_text_type: input_text accepted for EasyInputMessage
- test_responses_function_call_id_fields: output items have correct ids

All 10 tests pass (2 existing + 8 new).
@krystophny krystophny force-pushed the responses-api-codex-compat branch from 530e13c to 467266b on March 30, 2026 13:04
@krystophny
Author

please add proper tests for it

Done!

@fumlig

fumlig commented Mar 30, 2026

According to the responses API reference for streaming events, the response.created and response.in_progress streaming events should also contain created_at, model etc.

It seems like the current implementation just returns a minimal response object in those events. This causes issues with certain spec-compliant client libraries like async-openai. Would it be possible to add the missing streaming event fields here as well?
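
For reference, a spec-shaped response.created event carries the full response object, roughly like this (abridged sketch, not the exact field set):

```python
import time

def build_response_created_event(response_id: str, model: str, seq: int) -> dict:
    # the event wraps the full response object (created_at, model, status, ...),
    # not just {id, object, status}
    return {
        "type": "response.created",
        "sequence_number": seq,
        "response": {
            "id": response_id,
            "object": "response",
            "created_at": int(time.time()),
            "status": "in_progress",
            "model": model,
            "output": [],
            "completed_at": None,
            "usage": {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0},
        },
    }
```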

- Add sequence_number to ALL streaming events (created, in_progress,
  output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
  response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state

Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.

Refs: ggml-org#21174 (fumlig review comment)
- test_responses_stream_created_event_has_full_response: verify
  response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event
  has sequence_number and they are strictly increasing across stream
- test_responses_stream_delta_events_have_indices: output_index and
  content_index present on all delta/added events

All 14 tests pass (2 original + 9 from previous commit + 3 new).
Code fixes:
- build_oai_resp_metadata accepts status param; completed_at is null
  when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing
  actual prompt tokens before response was logically started)
- Function call item IDs are now generated once per tool call in
  update() and reused consistently across output_item.added,
  function_call_arguments.delta, and output_item.done events
  (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using
  weak "if present" guard)
- Check actual values not just key presence in streaming created
  event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
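
The generate-once-and-reuse fix for function call item IDs can be sketched as (illustrative Python; the deterministic ID scheme here is only for the sketch):

```python
def get_function_call_item_id(state: dict, call_id: str) -> str:
    # generate the fc_ item id once per tool call, then reuse it across
    # output_item.added, function_call_arguments.delta and output_item.done
    if call_id not in state:
        state[call_id] = "fc_" + format(len(state), "08x")
    return state[call_id]
```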
@krystophny
Author

@fumlig Thanks for the feedback. The streaming events now include the full response object in response.created and response.in_progress events (with all 24+ required fields, status: "in_progress", completed_at: null, zeroed usage). All partial events (output_item.added, content_part.added, all deltas) now have sequence_number, output_index, and content_index per spec.

Tested with the async OpenAI Python SDK (which validates event schemas similarly to async-openai on the Rust side).

@ngxson Tests added: 14 pytest tests covering schema fields, streaming compliance, tool type skipping, developer role merging, key stripping, multi-turn input, and output_text consistency. Plus E2E tests with async OpenAI SDK against Qwen3.5-9B.

If you'd prefer this split into smaller PRs for faster review, happy to do so.

@krystophny
Author

Additional E2E testing (async OpenAI SDK, Codex CLI, concurrent stress tests, multiple Qwen3.5 models) is documented in the companion meta-repo: https://github.com/krystophny/llama.cpp-dev

The meta-repo includes a Nix flake for reproducible test environments and scripted test harnesses under scripts/.

krystophny added a commit to krystophny/llama.cpp that referenced this pull request Mar 30, 2026
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- Restore refusal content type handling

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to ALL streaming events
- Full response object in response.created/in_progress events
- Accept input_text type and EasyInputMessage for multi-turn input
- output_text convenience field, output_tokens_details

14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI.

Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174
@ngxson
Contributor

ngxson commented Mar 30, 2026

I'm OK with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without the CI passing, I cannot merge it

@krystophny krystophny marked this pull request as draft March 30, 2026 18:04
@krystophny
Author

I'm OK with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without the CI passing, I cannot merge it

@ngxson thanks! I marked it as draft; I'll iterate a bit more and test locally with https://github.com/lazy-fortran/fortbench over the coming days to see whether any problems pop up on the Codex path compared to OpenCode. Then I'll let you know.

Accept all valid reasoning item content formats in multi-turn input:
- Array of objects: [{"type":"reasoning_text","text":"..."}] (spec format)
- Plain string: "thinking about it" (OpenCode format)
- Null: content:null with encrypted_content (Codex, openai/codex#11834)
- Omitted entirely: no content field present

Previously threw "item['content'] is not an array" for non-array formats,
breaking OpenCode multi-turn conversations. The encrypted_content field
is accepted but ignored for local models (no server-side decryption).

Add 4 tests covering each format variant.

Refs: openai/codex#11834, anomalyco/opencode#19081
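
The content normalization described above can be sketched as (illustrative Python):

```python
def normalize_reasoning_content(item: dict) -> list:
    # normalize a reasoning item's content into the spec's array-of-objects
    # form, accepting the variants listed above
    content = item.get("content")
    if content is None:           # null, or field omitted entirely
        return []
    if isinstance(content, str):  # plain string (OpenCode format)
        return [{"type": "reasoning_text", "text": content}]
    if isinstance(content, list): # spec format, pass through
        return content
    raise TypeError("unsupported reasoning content type")
```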

Labels

examples python python script changes server


Development

Successfully merging this pull request may close these issues.

Feature Request: Support OpenAI Responses API (/v1/responses) in llama.cpp server

3 participants