diff --git a/docs/source/commands/trtllm-serve/trtllm-serve.rst b/docs/source/commands/trtllm-serve/trtllm-serve.rst
index 25ed2bc394c..33bad7f1e5e 100644
--- a/docs/source/commands/trtllm-serve/trtllm-serve.rst
+++ b/docs/source/commands/trtllm-serve/trtllm-serve.rst
@@ -34,7 +34,7 @@ For the full syntax and argument descriptions, refer to :ref:`syntax`.
 Inference Endpoints
 -------------------
 
-After you start the server, you can send inference requests through completions API and Chat API, which are compatible with corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 `_ for examples in the following sections.
+After you start the server, you can send inference requests through the completions API, Chat API, and Responses API, which are compatible with the corresponding OpenAI APIs. We use `TinyLlama-1.1B-Chat-v1.0 `_ for examples in the following sections.
 
 Chat API
 ~~~~~~~~
@@ -66,6 +66,24 @@ Another example uses ``curl``:
    :language: bash
    :linenos:
 
+Responses API
+~~~~~~~~~~~~~
+
+You can query the Responses API with any HTTP client; a typical example is the OpenAI Python client:
+
+.. literalinclude:: ../../../../examples/serve/openai_responses_client.py
+   :language: python
+   :linenos:
+
+Another example uses ``curl``:
+
+.. literalinclude:: ../../../../examples/serve/curl_responses_client.sh
+   :language: bash
+   :linenos:
+
+
+More OpenAI-compatible examples can be found in the `compatibility examples `_ directory.
+
 Multimodal Serving
 ~~~~~~~~~~~~~~~~~~
 
diff --git a/examples/serve/compatibility/README.md b/examples/serve/compatibility/README.md
index f3e375843b2..5351f269e82 100644
--- a/examples/serve/compatibility/README.md
+++ b/examples/serve/compatibility/README.md
@@ -34,17 +34,27 @@ python examples/serve/compatibility/chat_completions/example_01_basic_chat.py
 
 ### 📋 Complete Example List
 
-All examples demonstrate the `/v1/chat/completions` endpoint:
+#### Chat Completions (`/v1/chat/completions`)
 
 | Example | File | Description |
 |---------|------|-------------|
-| **01** | `example_01_basic_chat.py` | Basic non-streaming chat completion |
-| **02** | `example_02_streaming_chat.py` | Streaming responses with real-time delivery |
-| **03** | `example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
-| **04** | `example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
-| **05** | `example_05_json_mode.py` | Structured output with JSON schema |
-| **06** | `example_06_tool_calling.py` | Function/tool calling with tools |
-| **07** | `example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+| **01** | `chat_completions/example_01_basic_chat.py` | Basic non-streaming chat completion |
+| **02** | `chat_completions/example_02_streaming_chat.py` | Streaming responses with real-time delivery |
+| **03** | `chat_completions/example_03_multi_turn_conversation.py` | Multi-turn conversation with context |
+| **04** | `chat_completions/example_04_streaming_with_usage.py` | Streaming with continuous token usage stats |
+| **05** | `chat_completions/example_05_json_mode.py` | Structured output with JSON schema |
+| **06** | `chat_completions/example_06_tool_calling.py` | Function/tool calling with tools |
+| **07** | `chat_completions/example_07_advanced_sampling.py` | TensorRT-LLM extended sampling parameters |
+
+#### Responses (`/v1/responses`)
+
+| Example | File | Description |
+|---------|------|-------------|
+| **01** | `responses/example_01_basic_chat.py` | Basic non-streaming response |
+| **02** | `responses/example_02_streaming_chat.py` | Streaming with event handling |
+| **03** | `responses/example_03_multi_turn_conversation.py` | Multi-turn using `previous_response_id` |
+| **04** | `responses/example_04_json_mode.py` | Structured output with JSON schema |
+| **05** | `responses/example_05_tool_calling.py` | Function/tool calling with tools |
 
 ## Configuration
 
@@ -68,8 +78,8 @@ client = OpenAI(
 
 Some examples require specific model capabilities:
 
-| Example | Model Requirement |
+| Feature | Model Requirement |
 |---------|------------------|
-| 05 (JSON Mode) | xgrammar support |
-| 06 (Tool Calling) | Tool-capable model (Qwen3, GPT OSS) |
+| JSON Mode | xgrammar support |
+| Tool Calling | Tool-capable model (Qwen3, GPT-OSS, Kimi K2) |
 | Others | Any model |
diff --git a/examples/serve/compatibility/responses/README.md b/examples/serve/compatibility/responses/README.md
new file mode 100644
index 00000000000..4dbdcf850a3
--- /dev/null
+++ b/examples/serve/compatibility/responses/README.md
@@ -0,0 +1,102 @@
+# Responses API Examples
+
+Examples for the `/v1/responses` endpoint. All examples in this directory use the Responses API, demonstrating features such as streaming, tool/function calling, and multi-turn dialogue.
+
+## Quick Start
+
+```bash
+# Run the basic example
+python example_01_basic_chat.py
+```
+
+## Examples Overview
+
+### Basic Examples
+
+1. **`example_01_basic_chat.py`** - Start here!
+   - Simple request/response
+   - Non-streaming mode
+   - Uses `input` parameter for user message
+
+2. **`example_02_streaming_chat.py`** - Real-time responses
+   - Stream tokens as generated
+   - Handles various event types (`response.created`, `response.output_text.delta`, etc.)
+   - Server-Sent Events (SSE)
+
+3. **`example_03_multi_turn_conversation.py`** - Context management
+   - Multiple conversation turns
+   - Uses `previous_response_id` to maintain context
+   - Follow-up questions without resending history
+
+### Advanced Examples
+
+4. **`example_04_json_mode.py`** - Structured output
+   - JSON schema validation via `text.format`
+   - Structured data extraction
+   - Requires xgrammar support
+
+5. **`example_05_tool_calling.py`** - Function calling
+   - External tool integration
+   - Function definitions with `tools` parameter
+   - Tool result handling with `function_call_output`
+   - Requires compatible model (Qwen3, GPT-OSS, Kimi K2)
+
+## Key Concepts
+
+### Non-Streaming vs Streaming
+
+**Non-Streaming** (`stream=False`):
+- Wait for complete response
+- Single response object
+- Simple to use
+
+**Streaming** (`stream=True`):
+- Tokens delivered as generated
+- Better perceived latency
+- Server-Sent Events (SSE)
+
+### Multi-turn Context
+
+Use `previous_response_id` to continue conversations:
+```python
+# First turn
+response1 = client.responses.create(
+    model=model,
+    input="What is 15 multiplied by 23?",
+)
+
+# Second turn - references previous response
+response2 = client.responses.create(
+    model=model,
+    input="Now divide that result by 5",
+    previous_response_id=response1.id,
+)
+```
+
+### Tool Calling
+
+Define functions the model can call:
+```python
+tools = [{
+    "name": "get_weather",
+    "type": "function",
+    "description": "Get the current weather in a location",
+    "parameters": {
+        "type": "object",
+        "properties": {
+            "location": {"type": "string"},
+        },
+        "required": ["location"],
+    }
+}]
+```
+
+## Model Requirements
+
+| Feature | Requirement |
+|---------|-------------|
+| Basic chat | Any model |
+| Streaming | Any model |
+| Multi-turn | Any model |
+| JSON mode | xgrammar support |
+| Tool calling | Compatible model (Qwen3, GPT-OSS, Kimi K2) |
diff --git a/examples/serve/compatibility/responses/example_01_basic_chat.py b/examples/serve/compatibility/responses/example_01_basic_chat.py
new file mode 100644
index 00000000000..237108017fb
--- /dev/null
+++ b/examples/serve/compatibility/responses/example_01_basic_chat.py
@@ -0,0 +1,48 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#!/usr/bin/env python3
+"""Example 1: Basic Non-Streaming Responses.
+
+Demonstrates a simple Responses API request with the OpenAI-compatible API.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 1: Basic Non-Streaming Responses")
+print("=" * 80)
+print()
+
+# Create a simple Responses API request
+response = client.responses.create(
+    model=model,
+    input="What is the capital of France?",
+    max_output_tokens=4096,
+)
+
+# Print the response
+print("Response:")
+print(f"Content: {response.output_text}")
diff --git a/examples/serve/compatibility/responses/example_02_streaming_chat.py b/examples/serve/compatibility/responses/example_02_streaming_chat.py
new file mode 100644
index 00000000000..1e6e92d51fb
--- /dev/null
+++ b/examples/serve/compatibility/responses/example_02_streaming_chat.py
@@ -0,0 +1,98 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#!/usr/bin/env python3
+"""Example 2: Streaming Responses.
+
+Demonstrates streaming responses with real-time token delivery.
+"""
+
+from openai import OpenAI
+
+
+def print_streaming_responses_item(item, show_events=True):
+    event_type = getattr(item, "type", "")
+
+    if event_type == "response.created":
+        if show_events:
+            print(f"[Response Created: {getattr(item.response, 'id', 'unknown')}]")
+    elif event_type == "response.in_progress":
+        if show_events:
+            print("[Response In Progress]")
+    elif event_type == "response.output_item.added":
+        if show_events:
+            item_type = getattr(item.item, "type", "unknown")
+            item_id = getattr(item.item, "id", "unknown")
+            print(f"\n[Output Item Added: {item_type} (id: {item_id})]")
+    elif event_type == "response.content_part.added":
+        if show_events:
+            part_type = getattr(item.part, "type", "unknown")
+            print(f"[Content Part Added: {part_type}]")
+    elif event_type == "response.reasoning_text.delta":
+        print(item.delta, end="", flush=True)
+    elif event_type == "response.output_text.delta":
+        print(item.delta, end="", flush=True)
+    elif event_type == "response.reasoning_text.done":
+        if show_events:
+            print(f"\n[Reasoning Text Done: {len(item.text)} chars]")
+    elif event_type == "response.output_text.done":
+        if show_events:
+            print(f"\n[Output Text Done: {len(item.text)} chars]")
+    elif event_type == "response.content_part.done":
+        if show_events:
+            part_type = getattr(item.part, "type", "unknown")
+            print(f"[Content Part Done: {part_type}]")
+    elif event_type == "response.output_item.done":
+        if show_events:
+            item_type = getattr(item.item, "type", "unknown")
+            item_id = getattr(item.item, "id", "unknown")
+            print(f"[Output Item Done: {item_type} (id: {item_id})]")
+    elif event_type == "response.completed":
+        if show_events:
+            print("\n[Response Completed]")
+
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 2: Streaming Responses")
+print("=" * 80)
+print()
+
+print("Prompt: Write a haiku about artificial intelligence\n")
+
+# Create a streaming Responses API request
+stream = client.responses.create(
+    model=model,
+    input="Write a haiku about artificial intelligence",
+    max_output_tokens=4096,
+    stream=True,
+)
+
+# Print tokens as they arrive
+print("Response (streaming):")
+print("Assistant: ", end="", flush=True)
+
+# Each streamed event is dispatched by its type in print_streaming_responses_item
+for event in stream:
+    print_streaming_responses_item(event)
diff --git a/examples/serve/compatibility/responses/example_03_multi_turn_conversation.py b/examples/serve/compatibility/responses/example_03_multi_turn_conversation.py
new file mode 100644
index 00000000000..c24c23226e4
--- /dev/null
+++ b/examples/serve/compatibility/responses/example_03_multi_turn_conversation.py
@@ -0,0 +1,63 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#!/usr/bin/env python3
+"""Example 3: Multi-turn Conversation.
+
+Demonstrates maintaining conversation context across multiple turns.
+"""
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 3: Multi-turn Conversation")
+print("=" * 80)
+print()
+
+# First turn: User asks a question
+print("USER: What is 15 multiplied by 23?")
+
+response1 = client.responses.create(
+    model=model,
+    input="What is 15 multiplied by 23?",
+    max_output_tokens=4096,
+)
+
+assistant_reply_1 = response1.output_text
+print(f"ASSISTANT: {assistant_reply_1}\n")
+
+# Second turn: User asks a follow-up question
+print("USER: Now divide that result by 5")
+
+# No conversation history needs to be resent for the second turn; only the previous response id is included
+response2 = client.responses.create(
+    model=model,
+    input="Now divide that result by 5",
+    max_output_tokens=4096,
+    previous_response_id=response1.id,
+)
+
+assistant_reply_2 = response2.output_text
+print(f"ASSISTANT: {assistant_reply_2}")
diff --git a/examples/serve/compatibility/responses/example_04_json_mode.py b/examples/serve/compatibility/responses/example_04_json_mode.py
new file mode 100644
index 00000000000..83d4b9be20f
--- /dev/null
+++ b/examples/serve/compatibility/responses/example_04_json_mode.py
@@ -0,0 +1,80 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#!/usr/bin/env python3
+"""Example 4: JSON Mode with Schema.
+
+Demonstrates structured output generation with JSON schema validation.
+
+Note: This requires xgrammar support and compatible model configuration.
+"""
+
+import json
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+
+print("=" * 80)
+print("Example 4: JSON Mode with Schema")
+print("=" * 80)
+print()
+
+# Define the JSON schema
+schema = {
+    "type": "json_schema",
+    "name": "city_info",
+    "schema": {
+        "type": "object",
+        "properties": {
+            "name": {"type": "string"},
+            "country": {"type": "string"},
+            "population": {"type": "integer"},
+            "famous_for": {"type": "array", "items": {"type": "string"}},
+        },
+        "required": ["name", "country", "population"],
+    },
+    "strict": True,
+}
+
+print("Request with JSON schema:")
+print(json.dumps(schema, indent=2))
+print()
+print("Note: JSON schema support requires xgrammar and compatible model configuration.\n")
+
+try:
+    # Create responses with JSON schema
+    response = client.responses.create(
+        model=model,
+        instructions="You are a helpful assistant that outputs JSON.",
+        input="Give me information about Tokyo.",
+        text={"format": schema},
+        reasoning={"effort": "low"},
+        max_output_tokens=1024,
+    )
+
+    print("JSON Response:")
+    print(response.output_text)
+except Exception as e:
+    print("JSON schema support requires xgrammar and proper configuration.")
+    print(f"Error: {e}")
diff --git a/examples/serve/compatibility/responses/example_05_tool_calling.py b/examples/serve/compatibility/responses/example_05_tool_calling.py
new file mode 100644
index 00000000000..6489e7e4530
--- /dev/null
+++ b/examples/serve/compatibility/responses/example_05_tool_calling.py
@@ -0,0 +1,132 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+#!/usr/bin/env python3
+"""Example 5: Tool/Function Calling.
+
+Demonstrates tool calling with function definitions and responses.
+
+Note: This requires a compatible model (e.g., Qwen3, GPT-OSS, Kimi K2).
+"""
+
+import json
+
+from openai import OpenAI
+
+# Initialize the client
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Get the model name from the server
+models = client.models.list()
+model = models.data[0].id
+TOOL_CALL_SUPPORTED_MODELS = ["Qwen3", "GPT-OSS", "Kimi K2"]
+
+print("=" * 80)
+print("Example 5: Tool/Function Calling")
+print("=" * 80)
+print()
+print(
+    f"Note: Tool calling requires compatible models (e.g. {', '.join(TOOL_CALL_SUPPORTED_MODELS)})\n"
+)
+
+# Define the available tools
+tools = [
+    {
+        "name": "get_weather",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "location": {
+                    "type": "string",
+                    "description": "City and state, e.g. San Francisco, CA",
+                },
+                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
+            },
+            "required": ["location"],
+        },
+        "type": "function",
+        "description": "Get the current weather in a location",
+    }
+]
+
+
+def get_weather(location: str, unit: str = "fahrenheit") -> dict:
+    return {"location": location, "temperature": 68, "unit": unit, "conditions": "sunny"}
+
+
+def process_tool_call(response) -> tuple[dict, str]:
+    function_name = None
+    function_arguments = None
+    tool_call_id = None
+    for output in response.output:
+        if output.type == "function_call":
+            function_name = output.name
+            function_arguments = json.loads(output.arguments)
+            tool_call_id = output.call_id
+            break
+
+    try:
+        print(
+            f"Got tool call:\n\ttool_name: {function_name}\n\tparameters: {function_arguments}"
+        )
+        result = {"get_weather": get_weather}[function_name](**function_arguments)
+    except Exception as e:
+        print(f"Error processing tool call: {e}")
+        return None, None
+
+    return result, tool_call_id
+
+
+print("Available tools:")
+print(json.dumps(tools, indent=2))
+print("\nUser query: What is the weather in San Francisco?\n")
+
+try:
+    # Initial request with tools
+    response = client.responses.create(
+        model=model,
+        input="What is the weather in San Francisco?",
+        tools=tools,
+        tool_choice="auto",
+        max_output_tokens=4096,
+    )
+
+    tool_call_result, tool_call_id = process_tool_call(response)
+    call_input = [
+        {
+            "type": "function_call_output",
+            "call_id": tool_call_id,
+            "output": json.dumps(tool_call_result),
+        }
+    ]
+
+    prev_response_id = response.id
+    response = client.responses.create(
+        model=model,
+        input=call_input,
+        previous_response_id=prev_response_id,
+        tools=tools,
+    )
+
+    print(f"Final response: {response.output_text}")
+
+except Exception as e:
+    print(
+        f"Note: Tool calling requires model support (e.g. {', '.join(TOOL_CALL_SUPPORTED_MODELS)})"
+    )
+    print(f"Error: {e}")
diff --git a/examples/serve/curl_responses_client.sh b/examples/serve/curl_responses_client.sh
new file mode 100644
index 00000000000..7a54f21bb8a
--- /dev/null
+++ b/examples/serve/curl_responses_client.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+curl http://localhost:8000/v1/responses \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "TinyLlama-1.1B-Chat-v1.0",
+        "input": "Where is New York?",
+        "max_output_tokens": 16
+    }'
diff --git a/examples/serve/openai_responses_client.py b/examples/serve/openai_responses_client.py
new file mode 100644
index 00000000000..04d1b356b7b
--- /dev/null
+++ b/examples/serve/openai_responses_client.py
@@ -0,0 +1,15 @@
+### :title OpenAI Responses Client
+
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+response = client.responses.create(
+    model="TinyLlama-1.1B-Chat-v1.0",
+    input="Where is New York?",
+    max_output_tokens=20,
+)
+print(response)
diff --git a/tests/unittest/llmapi/apps/_test_trtllm_serve_example.py b/tests/unittest/llmapi/apps/_test_trtllm_serve_example.py
index 6921c024d54..7828b94b87a 100644
--- a/tests/unittest/llmapi/apps/_test_trtllm_serve_example.py
+++ b/tests/unittest/llmapi/apps/_test_trtllm_serve_example.py
@@ -52,9 +52,11 @@ def example_root():
     "exe, script",
     [("python3", "openai_chat_client.py"),
      ("python3", "openai_completion_client.py"),
      ("python3", "openai_completion_client_json_schema.py"),
+     ("python3", "openai_responses_client.py"),
      ("bash", "curl_chat_client.sh"),
      ("bash", "curl_completion_client.sh"),
-     ("bash", "genai_perf_client.sh")])
+     ("bash", "genai_perf_client.sh"),
+     ("bash", "curl_responses_client.sh")])
 def test_trtllm_serve_examples(exe: str, script: str,
                                server: RemoteOpenAIServer, example_root: str):
     client_script = os.path.join(example_root, script)