
server: add /v1/responses support #1184

Merged: ikawrakow merged 2 commits into ikawrakow:main from RodriMora:feature/responses-api on Feb 14, 2026

Conversation

@RodriMora (Contributor) commented Jan 23, 2026

Summary

  • add /v1/responses endpoint by converting Responses payloads to chat-completions and emitting Responses-style SSE events
  • track response IDs/state in server slots and serialize Responses output/final payloads
  • document /v1/responses usage and add responses compatibility scenarios
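The conversion described in the first bullet can be sketched as follows. This is illustrative Python, not the PR's actual server code; the request field names (`instructions`, `input`, `max_output_tokens`) follow the public Responses API, and the helper name is made up:

```python
# Sketch (assumption, not the PR's implementation): map a /v1/responses
# request body onto a chat-completions request body.
def responses_to_chat_completions(req: dict) -> dict:
    messages = []
    # "instructions" becomes the system message.
    if req.get("instructions"):
        messages.append({"role": "system", "content": req["instructions"]})
    # "input" may be a plain string or a list of role/content items.
    inp = req.get("input")
    if isinstance(inp, str):
        messages.append({"role": "user", "content": inp})
    else:
        for item in inp or []:
            messages.append({"role": item.get("role", "user"),
                             "content": item.get("content", "")})
    out = {"messages": messages}
    # Responses' "max_output_tokens" corresponds to chat-completions' "max_tokens".
    if "max_output_tokens" in req:
        out["max_tokens"] = req["max_output_tokens"]
    if "model" in req:
        out["model"] = req["model"]
    return out
```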

Testing

  • cmake --build build --target llama-server (also builds with CUDA enabled on my system)
  • behave -i server.feature -n "OAI Responses Compatibility"

Notes

  • I tried to run the full server test suite, but it does not complete: the embeddings scenario fails due to the /embedding response shape. Since it also fails on the main branch, I can look into it in a separate PR.

Currently a draft while I do more testing. It already works via a regular curl request:

curl http://localhost:5000/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "MiniMax-M2.1",
    "input": "Say hello in one short sentence."
  }'
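For the streaming path (`"stream": true` in the request body), the summary above says the endpoint emits Responses-style SSE events. A minimal consumer sketch, assuming standard SSE framing and the public Responses API event names (e.g. `response.output_text.delta`); whether this server emits exactly those event names is not verified here:

```python
import json

# Sketch: split a raw SSE stream into (event, data) pairs. SSE frames are
# "event:" / "data:" lines terminated by a blank line.
def parse_sse_events(raw: str):
    event, data = None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and event is not None:
            yield event, json.loads("\n".join(data)) if data else None
            event, data = None, []
```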

Based off ggml-org/llama.cpp#18486

Edit: I used an AI agent to help create the draft. This seems like a good use case for it, since it doesn't involve a new model or architecture (where AI assistance tends to yield bad results). I didn't see any contributing rules against it; if there are, feel free to close this.

@firecoperana (Collaborator)

I don't think these tests are well maintained; testing directly in llama-server should be enough.

@ikawrakow (Owner)

Is this still a draft?

@RodriMora RodriMora marked this pull request as ready for review February 2, 2026 11:02
@RodriMora (Contributor, Author)

Tool calling doesn't work perfectly, but I checked mainline llama.cpp and it has the same problems, so this PR has feature parity for regular text completion in both the streaming and non-streaming tests:

Launch command:

./build/bin/llama-server \
  --model /mnt/llms/models/bartowski/MiniMaxAI_MiniMax-M2.1-GGUF/MiniMaxAI_MiniMax-M2.1-Q6_K/MiniMaxAI_MiniMax-M2.1-Q6_K-00001-of-00005.gguf \
  --alias "MiniMax-M2.1" \
  --ctx-size 128000 \
  -ger \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 5000 \
  --jinja -np 4

Test:

curl -s http://192.168.10.115:5000/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "MiniMax-M2.1",
    "instructions": "You are a helpful assistant.",
    "input": "Write a limerick about exceptions",
    "max_output_tokens": 32
  }'
{
  "completed_at": 1770028662,
  "created_at": 1770028662,
  "id": "resp_L1bEyCME6xhrIk2ePfniibiVoa6s6zNR",
  "model": "MiniMax-M2.1",
  "object": "response",
  "output": [
    {
      "id": "rs_F0JvQ8MRkVid9hpBBHxxUzzj51zXrK1L",
      "summary": [],
      "type": "reasoning",
      "content": [
        {
          "text": "The user wants a limerick about exceptions. A limerick is a humorous five-line poem with an AABBA rhyme scheme, where lines 1",
          "type": "reasoning_text"
        }
      ],
      "encrypted_content": "",
      "status": "completed"
    }
  ],
  "status": "completed",
  "usage": {
    "input_tokens": 29,
    "output_tokens": 32,
    "total_tokens": 61
  }
}
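A small helper (illustrative, not part of the PR) to extract the generated text from a Responses payload shaped like the one above, where each item in `output` carries a `content` array of typed text parts:

```python
# Sketch: collect the text parts out of a Responses "output" array.
# The part types "reasoning_text" (seen in the response above) and
# "output_text" (the public Responses API's message text type) are handled.
def collect_output_text(resp: dict) -> str:
    parts = []
    for item in resp.get("output", []):
        for part in item.get("content", []):
            if part.get("type") in ("output_text", "reasoning_text"):
                parts.append(part.get("text", ""))
    return "".join(parts)
```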

@firecoperana (Collaborator)

The model name is defaulted to "gpt-3.5-turbo-0613" when I leave the model empty in the request message. Mainline returns the correct model name. Otherwise, they look good.

@RodriMora (Contributor, Author)

> The model name is defaulted to "gpt-3.5-turbo-0613" when I leave the model empty in the request message. Mainline returns the correct model name. Otherwise, they look good.

Nice catch, thanks. If model is empty or missing, it now uses the loaded model's name instead of defaulting to "gpt-3.5-turbo-0613".

@firecoperana (Collaborator)

Yeah, it's a bug not related to this PR. I will fix later.

@firecoperana (Collaborator)

It is fine to merge now.

@ikawrakow ikawrakow merged commit 102f77b into ikawrakow:main Feb 14, 2026