server: improve Responses API compliance and Codex CLI compatibility #21174
krystophny wants to merge 7 commits into ggml-org:master from
Conversation
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input

Verified: codex -p local and codex -p fast work against local llama.cpp with Qwen3.5 models, including native tool calling.

Refs: ggml-org#19138, ggml-org#19720
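To make the request-sanitizing behavior above concrete, here is a minimal Python sketch. The server implements this in C++; the function and constant names here are hypothetical and purely illustrative of "strip Responses-only keys" and "skip non-function tool types":

```python
# Illustrative sketch only; the actual implementation lives in the C++ server code.
RESPONSES_ONLY_KEYS = {"store", "include", "prompt_cache_key"}  # accepted but not forwarded

def sanitize_responses_request(body: dict) -> dict:
    """Drop Responses-only keys and non-function tool types."""
    out = {k: v for k, v in body.items() if k not in RESPONSES_ONLY_KEYS}
    if "tools" in out:
        # web_search, code_interpreter, etc. have no local equivalent: skip them
        out["tools"] = [t for t in out["tools"] if t.get("type") == "function"]
    return out

req = {
    "model": "qwen",
    "store": True,
    "tools": [{"type": "web_search"}, {"type": "function", "name": "ls"}],
}
print(sanitize_responses_request(req))
```

The key point is that unsupported tool types are filtered rather than rejected, so a stock Codex CLI request returns 200 instead of 400.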
please add proper tests for it
f1210e4 to 530e13c
Add 8 new tests covering the changes in this PR:
- test_responses_schema_fields: verify all 24+ Response object fields
- test_responses_stream_schema_fields: verify sequence_number, output_index, content_index on streaming events
- test_responses_non_function_tool_skipped: web_search/code_interpreter tool types return 200 instead of 400
- test_responses_mixed_tool_types: non-function tools filtered, function tools retained (not rejected at the parsing layer)
- test_responses_extra_keys_stripped: store, include, prompt_cache_key, web_search, text, truncation, metadata don't cause errors
- test_responses_developer_role: developer messages merged into system
- test_responses_input_text_type: input_text accepted for EasyInputMessage
- test_responses_function_call_id_fields: output items have correct ids

All 10 tests pass (2 existing + 8 new).
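The developer-role merge covered by test_responses_developer_role can be sketched in Python as follows. This is a hypothetical rendering of the behavior, not the server's C++ code:

```python
def merge_developer_messages(messages: list[dict]) -> list[dict]:
    """Merge developer/system messages into one system message at position 0.

    Some chat templates (e.g. Qwen) expect at most one system message, first.
    Illustrative sketch only; helper name is hypothetical.
    """
    sys_parts = [m["content"] for m in messages if m["role"] in ("developer", "system")]
    rest = [m for m in messages if m["role"] not in ("developer", "system")]
    if sys_parts:
        return [{"role": "system", "content": "\n".join(sys_parts)}] + rest
    return rest

msgs = [
    {"role": "developer", "content": "Be terse."},
    {"role": "user", "content": "hi"},
    {"role": "system", "content": "You are helpful."},
]
print(merge_developer_messages(msgs)[0])
```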
530e13c to 467266b
Done!
According to the Responses API reference for streaming events, the response.created and response.in_progress events should carry a full response object. It seems like the current implementation just returns a minimal response object in those events. This causes issues with certain spec-compliant client libraries like async-openai. Would it be possible to add the missing streaming event fields here as well?
- Add sequence_number to ALL streaming events (created, in_progress,
output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state
Fixes: spec-compliant client libraries (e.g. async-openai) that require
these fields can now parse all streaming events without error.
Refs: ggml-org#21174 (fumlig review comment)
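The "counter state persisted across streaming chunks" above amounts to a per-stream monotonic counter that stamps every event. A minimal Python sketch of that idea (the class name is hypothetical; the server keeps this state in task_result_state on the C++ side):

```python
from itertools import count

class StreamState:
    """Per-stream counter state for stamping streaming events."""
    def __init__(self):
        self._seq = count()  # strictly increasing across the whole stream

    def stamp(self, event: dict, output_index=None, content_index=None) -> dict:
        event["sequence_number"] = next(self._seq)
        if output_index is not None:
            event["output_index"] = output_index
        if content_index is not None:
            event["content_index"] = content_index
        return event

st = StreamState()
events = [
    st.stamp({"type": "response.created"}),
    st.stamp({"type": "response.output_item.added"}, output_index=0),
    st.stamp({"type": "response.output_text.delta"}, output_index=0, content_index=0),
]
print([e["sequence_number"] for e in events])
```

Because one counter is shared by all event types, sequence numbers stay strictly increasing even when output items and content parts interleave.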
- test_responses_stream_created_event_has_full_response: verify response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event has sequence_number and they are strictly increasing across the stream
- test_responses_stream_delta_events_have_indices: output_index and content_index present on all delta/added events

All 14 tests pass (2 original + 9 from previous commit + 3 new).
Code fixes:
- build_oai_resp_metadata accepts a status param; completed_at is null when status is in_progress (was always set to a timestamp)
- response.created/in_progress events use zeroed usage (was passing actual prompt tokens before the response had logically started)
- Function call item IDs are now generated once per tool call in update() and reused consistently across output_item.added, function_call_arguments.delta, and output_item.done events (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using a weak "if present" guard)
- Check actual values, not just key presence, in the streaming created event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
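The "generated once per tool call and reused" fix can be illustrated with a small cache keyed by the model's call_id. This is a hypothetical Python sketch of the idea, not the C++ implementation:

```python
import secrets

class ToolCallTracker:
    """Generate one fc_ item ID per tool call and hand out the same ID for
    every event that references that call. Previously each event path drew an
    independent random ID, so added/delta/done events disagreed."""
    def __init__(self):
        self._ids: dict[str, str] = {}

    def item_id(self, call_id: str) -> str:
        if call_id not in self._ids:
            self._ids[call_id] = "fc_" + secrets.token_hex(8)
        return self._ids[call_id]

tr = ToolCallTracker()
added = tr.item_id("call_abc")   # output_item.added
delta = tr.item_id("call_abc")   # function_call_arguments.delta
done  = tr.item_id("call_abc")   # output_item.done
print(added == delta == done)
```

Clients correlate streaming deltas to output items by this ID, which is why a stable mapping matters.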
@fumlig Thanks for the feedback. The streaming events now include the full response object in response.created and response.in_progress. Tested with the async OpenAI Python SDK (which validates event schemas similarly to async-openai on the Rust side).

@ngxson Tests added: 14 pytest tests covering schema fields, streaming compliance, tool type skipping, developer role merging, key stripping, multi-turn input, and output_text consistency. Plus E2E tests with the async OpenAI SDK against Qwen3.5-9B. If you'd prefer this split into smaller PRs for faster review, happy to do so.
Additional E2E testing (async OpenAI SDK, Codex CLI, concurrent stress tests, multiple Qwen3.5 models) is documented in the companion meta-repo: https://github.com/krystophny/llama.cpp-dev

The meta-repo includes a Nix flake for reproducible test environments and scripted test harnesses under
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- Restore refusal content type handling

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to ALL streaming events
- Full response object in response.created/in_progress events
- Accept input_text type and EasyInputMessage for multi-turn input
- output_text convenience field, output_tokens_details

14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI.

Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174
I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed new commits, and without CI passing, I cannot merge it
@ngxson thanks! I marked it as draft and will iterate a bit and test locally more with https://github.com/lazy-fortran/fortbench during the coming days to see if any problems pop up on the codex path compared to opencode. Then I'll let you know.
Accept all valid reasoning item content formats in multi-turn input:
- Array of objects: [{"type":"reasoning_text","text":"..."}] (spec format)
- Plain string: "thinking about it" (OpenCode format)
- Null: content:null with encrypted_content (Codex, openai/codex#11834)
- Omitted entirely: no content field present
Previously threw "item['content'] is not an array" for non-array formats,
breaking OpenCode multi-turn conversations. The encrypted_content field
is accepted but ignored for local models (no server-side decryption).
Add 4 tests covering each format variant.
Refs: openai/codex#11834, anomalyco/opencode#19081
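The four accepted content shapes above can be normalized through one small dispatch. A hypothetical Python sketch of the behavior (the server does this in C++ when parsing multi-turn input):

```python
def reasoning_text(item: dict) -> str:
    """Accept the four reasoning-content shapes and return plain text.
    Illustrative sketch only; function name is hypothetical."""
    content = item.get("content")        # omitted entirely -> None
    if content is None:                  # null or missing; encrypted_content is ignored
        return ""
    if isinstance(content, str):         # OpenCode plain-string format
        return content
    if isinstance(content, list):        # spec format: [{"type": "reasoning_text", ...}]
        return "".join(p.get("text", "") for p in content
                       if p.get("type") == "reasoning_text")
    raise TypeError("unsupported reasoning content type")

print(reasoning_text({"content": [{"type": "reasoning_text", "text": "hm"}]}))
print(reasoning_text({"content": "thinking about it"}))
print(reasoning_text({"content": None, "encrypted_content": "opaque"}))
print(reasoning_text({}))
```

Treating null and omitted the same way is what keeps both Codex (encrypted_content with content:null) and clients that drop the field entirely working.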
Summary
Improve Responses API (`/v1/responses`) compliance and Codex CLI compatibility.

Response object (non-streaming and streaming):
- Fix `id`/`call_id` field mapping (`id` gets a unique fc_ ID, `call_id` gets the model's `tool_call.id`)
- `output_text` top-level convenience field
- `output_tokens_details` with `reasoning_tokens` added to usage
- `refusal` content type handling restored (was broken in upstream: unreachable code after throw)

Streaming events:
- `sequence_number` on ALL streaming events (created, in_progress, added, delta, done, completed)
- `output_index` on all events referencing output items
- `content_index` on content-related events
- Full response object in `response.created` and `response.in_progress` (was only `{id, object, status}`)
- Function call item IDs reused consistently across `output_item.added`, `function_call_arguments.delta`, and `output_item.done`
- Counter state persisted across streaming chunks via `task_result_state`

Codex CLI compatibility:
- Accept `input_text` type alongside `output_text` for EasyInputMessage / AssistantMessageItemParam

Prior art: #19720 by @riskywindow (stale, 500+ commits behind). This PR incorporates applicable ideas adapted to the current codebase.
Fixes #19138. Related: #20156, #20733, #20607.
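The `output_text` convenience field mentioned above is a concatenation of the text parts inside message output items, matching what the OpenAI SDKs compute client-side. A hypothetical Python sketch of that aggregation:

```python
def aggregate_output_text(response: dict) -> str:
    """Join output_text parts from message items into the top-level
    output_text convenience field. Illustrative sketch only."""
    chunks = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # reasoning and function_call items contribute nothing
        for part in item.get("content", []):
            if part.get("type") == "output_text":
                chunks.append(part.get("text", ""))
    return "".join(chunks)

resp = {
    "output": [
        {"type": "reasoning", "content": []},
        {"type": "message", "content": [{"type": "output_text", "text": "Hello"}]},
    ]
}
print(aggregate_output_text(resp))
```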
Verification
- pytest (14 tests, tinyllama2)
- E2E with async OpenAI SDK + Qwen3.5-9B Q4_K_M
- Codex CLI E2E
Test plan
If reviewers prefer, this can be split into smaller PRs for faster review. Let me know.