
server: improve Responses API compliance and Codex CLI compatibility#21174

Draft
krystophny wants to merge 7 commits into ggml-org:master from krystophny:responses-api-codex-compat

Conversation


@krystophny krystophny commented Mar 30, 2026

Summary

Improve Responses API (/v1/responses) compliance and Codex CLI compatibility.

Response object (non-streaming and streaming):

  • Add 24 missing fields per OpenAI Responses API spec (tools, truncation, temperature, top_p, metadata, store, service_tier, etc.)
  • Fix function_call id/call_id field mapping (id gets unique fc_ ID, call_id gets the model's tool_call.id)
  • Add output_text top-level convenience field
  • Add output_tokens_details with reasoning_tokens to usage
  • Restore refusal content type handling (was broken in upstream — unreachable code after throw)
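
The id/call_id mapping described above can be sketched as follows (a minimal Python illustration with a hypothetical helper name, not the server's actual C++ code):

```python
import secrets

def build_function_call_item(tool_call: dict) -> dict:
    # `id` is a fresh fc_-prefixed output item identifier; `call_id` carries
    # the model's original tool_call.id so clients can correlate tool results.
    return {
        "type": "function_call",
        "id": "fc_" + secrets.token_hex(12),
        "call_id": tool_call["id"],
        "name": tool_call["function"]["name"],
        "arguments": tool_call["function"]["arguments"],
        "status": "completed",
    }
```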

Streaming events:

  • Add sequence_number to ALL streaming events (created, in_progress, added, delta, done, completed)
  • Add output_index to all events referencing output items
  • Add content_index to content-related events
  • Populate full response object in response.created and response.in_progress (was only {id, object, status})
  • Function call item IDs consistent across output_item.added, function_call_arguments.delta, and output_item.done
  • Counter state persisted across streaming chunks via task_result_state
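
The sequence-number bookkeeping can be sketched like this (illustrative Python; the real counter lives in the C++ server's task_result_state):

```python
from dataclasses import dataclass

@dataclass
class StreamState:
    # counter persisted across streaming chunks
    sequence_number: int = 0

def emit_event(state: StreamState, event_type: str, payload: dict) -> dict:
    # every streaming event gets the next strictly increasing sequence_number
    event = {"type": event_type, "sequence_number": state.sequence_number, **payload}
    state.sequence_number += 1
    return event
```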

Codex CLI compatibility:

  • Skip non-function tool types (web_search, code_interpreter) instead of rejecting with 400
  • Merge developer/system role messages into first system message for templates requiring system at position 0 (e.g. Qwen)
  • Strip Responses-only request keys (store, include, prompt_cache_key, web_search, text, truncation, metadata)
  • Accept input_text type alongside output_text for EasyInputMessage / AssistantMessageItemParam
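
A rough sketch of the key stripping, tool filtering, and role merging described above (illustrative Python under assumed names; the actual change is in the C++ request parsing):

```python
RESPONSES_ONLY_KEYS = {"store", "include", "prompt_cache_key",
                       "web_search", "text", "truncation", "metadata"}

def sanitize_request(req: dict) -> dict:
    # drop Responses-only keys; keep only function tools, silently
    # skipping web_search, code_interpreter, etc. instead of rejecting
    req = {k: v for k, v in req.items() if k not in RESPONSES_ONLY_KEYS}
    if "tools" in req:
        req["tools"] = [t for t in req["tools"] if t.get("type") == "function"]
    return req

def merge_system_messages(messages: list) -> list:
    # merge developer/system messages into one system message at index 0,
    # for templates (e.g. Qwen) that require system at position 0
    sys_parts = [m["content"] for m in messages if m["role"] in ("system", "developer")]
    rest = [m for m in messages if m["role"] not in ("system", "developer")]
    if sys_parts:
        return [{"role": "system", "content": "\n".join(sys_parts)}] + rest
    return rest
```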

Prior art: #19720 by @riskywindow (stale, 500+ commits behind). This PR incorporates applicable ideas adapted to the current codebase.

Fixes #19138. Related: #20156, #20733, #20607.

Verification

pytest (14 tests, tinyllama2)

$ LLAMA_SERVER_BIN_PATH=./build/bin/llama-server pytest unit/test_compat_oai_responses.py -v
test_responses_with_openai_library PASSED
test_responses_stream_with_openai_library PASSED
test_responses_schema_fields PASSED
test_responses_stream_schema_fields PASSED
test_responses_non_function_tool_skipped PASSED
test_responses_only_non_function_tools_same_as_no_tools PASSED
test_responses_extra_keys_stripped PASSED
test_responses_developer_role_merging PASSED
test_responses_input_text_type_multi_turn PASSED
test_responses_output_text_matches_content PASSED
test_responses_stream_output_text_consistency PASSED
test_responses_stream_created_event_has_full_response PASSED
test_responses_stream_all_events_have_sequence_number PASSED
test_responses_stream_delta_events_have_indices PASSED
14 passed

E2E with async OpenAI SDK + Qwen3.5-9B Q4_K_M

$ python3 e2e_test.py  # uses AsyncOpenAI against localhost:8080
Test 1: Non-streaming basic          OK: output_text='4', fields=36
Test 2: Streaming basic              OK: 205 events, gathered text matches
Test 3: Non-function tools skipped   OK: status=completed
Test 4: Developer role merging       OK
Test 5: Multi-turn with input_text   OK
Test 7: Streaming seq_nums           OK: 105 events, strictly increasing
Test 6: Concurrent stress (5 req)    OK: 5/5 completed, 0 failures
ALL E2E TESTS PASSED

Codex CLI E2E

$ codex exec -p local "Say hello in one word"
Hello
tokens used: 6,119

$ codex exec -p local "Run 'echo hello world' and show the output"
exec: /bin/zsh -lc 'echo hello world' succeeded
hello world
tokens used: 1,054

Test plan

  • 14 pytest tests covering all code paths (schema, streaming, tools, roles, keys)
  • Async OpenAI SDK: non-streaming, streaming, concurrent stress
  • Streaming events: sequence_number strictly increasing, output_index/content_index present
  • response.created has full response object with all fields
  • Non-function tools silently skipped (200 not 400)
  • Developer/system messages merged correctly
  • Codex CLI connects and works (text + tool calling)
  • Function call IDs consistent between added/done streaming events
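
The sequence_number property in the test plan can be checked with a helper along these lines (sketch):

```python
def assert_monotonic_sequence_numbers(events: list) -> None:
    # KeyError here means some event is missing sequence_number entirely
    seqs = [e["sequence_number"] for e in events]
    assert all(a < b for a, b in zip(seqs, seqs[1:])), \
        "sequence_number not strictly increasing"
```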

If reviewers prefer, this can be split into smaller PRs for faster review. Let me know.

Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input

Verified: codex -p local and codex -p fast work against local
llama.cpp with Qwen3.5 models including native tool calling.

Refs: ggml-org#19138, ggml-org#19720
@krystophny krystophny requested a review from a team as a code owner March 30, 2026 07:54
@ngxson
Contributor

ngxson commented Mar 30, 2026

please add proper tests for it

@github-actions github-actions bot added the python python script changes label Mar 30, 2026
@krystophny krystophny force-pushed the responses-api-codex-compat branch 2 times, most recently from f1210e4 to 530e13c on March 30, 2026 12:51
Add 8 new tests covering the changes in this PR:

- test_responses_schema_fields: verify all 24+ Response object fields
- test_responses_stream_schema_fields: verify sequence_number,
  output_index, content_index on streaming events
- test_responses_non_function_tool_skipped: web_search/code_interpreter
  tool types return 200 instead of 400
- test_responses_mixed_tool_types: non-function tools filtered,
  function tools retained (not rejected at parsing layer)
- test_responses_extra_keys_stripped: store, include, prompt_cache_key,
  web_search, text, truncation, metadata don't cause errors
- test_responses_developer_role: developer messages merged into system
- test_responses_input_text_type: input_text accepted for EasyInputMessage
- test_responses_function_call_id_fields: output items have correct ids

All 10 tests pass (2 existing + 8 new).
@krystophny krystophny force-pushed the responses-api-codex-compat branch from 530e13c to 467266b on March 30, 2026 13:04
@krystophny
Author

please add proper tests for it

Done!

@fumlig

fumlig commented Mar 30, 2026

According to the responses API reference for streaming events, the response.created and response.in_progress streaming events should also contain created_at, model etc.

It seems like the current implementation just returns a minimal response object in those events. This causes issues with certain spec-compliant client libraries like async-openai. Would it be possible to add the missing streaming event fields here as well?
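
For reference, a spec-shaped response.created event carries the full response object, roughly like this (abridged sketch, not the exact field set):

```python
import time

def build_response_created_event(response_id: str, model: str, seq: int) -> dict:
    # the event wraps the full response object (created_at, model, status, ...),
    # not just {id, object, status}
    return {
        "type": "response.created",
        "sequence_number": seq,
        "response": {
            "id": response_id,
            "object": "response",
            "created_at": int(time.time()),
            "status": "in_progress",
            "model": model,
            "output": [],
            "completed_at": None,
            "usage": {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0},
        },
    }
```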

- Add sequence_number to ALL streaming events (created, in_progress,
  output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
  response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state

Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.

Refs: ggml-org#21174 (fumlig review comment)
- test_responses_stream_created_event_has_full_response: verify
  response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event
  has sequence_number and they are strictly increasing across stream
- test_responses_stream_delta_events_have_indices: output_index and
  content_index present on all delta/added events

All 14 tests pass (2 original + 9 from previous commit + 3 new).
Code fixes:
- build_oai_resp_metadata accepts status param; completed_at is null
  when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing
  actual prompt tokens before response was logically started)
- Function call item IDs are now generated once per tool call in
  update() and reused consistently across output_item.added,
  function_call_arguments.delta, and output_item.done events
  (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using
  weak "if present" guard)
- Check actual values not just key presence in streaming created
  event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
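
The generate-once-and-reuse fix for function call item IDs can be sketched as (illustrative Python; the deterministic ID scheme here is only for the sketch):

```python
def get_function_call_item_id(state: dict, call_id: str) -> str:
    # generate the fc_ item id once per tool call, then reuse it across
    # output_item.added, function_call_arguments.delta and output_item.done
    if call_id not in state:
        state[call_id] = "fc_" + format(len(state), "08x")
    return state[call_id]
```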
@krystophny
Author

@fumlig Thanks for the feedback. The streaming events now include the full response object in response.created and response.in_progress events (with all 24+ required fields, status: "in_progress", completed_at: null, zeroed usage). All partial events (output_item.added, content_part.added, all deltas) now have sequence_number, output_index, and content_index per spec.

Tested with the async OpenAI Python SDK (which validates event schemas similarly to async-openai on the Rust side).

@ngxson Tests added: 14 pytest tests covering schema fields, streaming compliance, tool type skipping, developer role merging, key stripping, multi-turn input, and output_text consistency. Plus E2E tests with async OpenAI SDK against Qwen3.5-9B.

If you'd prefer this split into smaller PRs for faster review, happy to do so.

@krystophny
Author

Additional E2E testing (async OpenAI SDK, Codex CLI, concurrent stress tests, multiple Qwen3.5 models) is documented in the companion meta-repo: https://github.com/krystophny/llama.cpp-dev

The meta-repo includes a Nix flake for reproducible test environments and scripted test harnesses under scripts/.

krystophny added a commit to krystophny/llama.cpp that referenced this pull request Mar 30, 2026
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- Restore refusal content type handling

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to ALL streaming events
- Full response object in response.created/in_progress events
- Accept input_text type and EasyInputMessage for multi-turn input
- output_text convenience field, output_tokens_details

14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI.

Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174
@ngxson
Contributor

ngxson commented Mar 30, 2026

I'm OK with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without the CI passing, I cannot merge it

@krystophny krystophny marked this pull request as draft March 30, 2026 18:04
@krystophny
Author

I'm OK with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without the CI passing, I cannot merge it

@ngxson thanks! I marked it as draft; I'll iterate a bit more and test locally with https://github.com/lazy-fortran/fortbench over the coming days to see whether any problems pop up on the Codex path compared to OpenCode. Then I'll let you know.

Accept all valid reasoning item content formats in multi-turn input:
- Array of objects: [{"type":"reasoning_text","text":"..."}] (spec format)
- Plain string: "thinking about it" (OpenCode format)
- Null: content:null with encrypted_content (Codex, openai/codex#11834)
- Omitted entirely: no content field present

Previously threw "item['content'] is not an array" for non-array formats,
breaking OpenCode multi-turn conversations. The encrypted_content field
is accepted but ignored for local models (no server-side decryption).

Add 4 tests covering each format variant.

Refs: openai/codex#11834, anomalyco/opencode#19081
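
The content normalization described above can be sketched as (illustrative Python):

```python
def normalize_reasoning_content(item: dict) -> list:
    # normalize a reasoning item's content into the spec's array-of-objects
    # form, accepting the variants listed above
    content = item.get("content")
    if content is None:           # null, or field omitted entirely
        return []
    if isinstance(content, str):  # plain string (OpenCode format)
        return [{"type": "reasoning_text", "text": content}]
    if isinstance(content, list): # spec format, pass through
        return content
    raise TypeError("unsupported reasoning content type")
```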

Labels

examples python python script changes server


Development

Successfully merging this pull request may close these issues.

Feature Request: Support OpenAI Responses API (/v1/responses) in llama.cpp server

3 participants