From 603fd58fd66d9b3e9e2aa5d66f79d6d2efb29de1 Mon Sep 17 00:00:00 2001
From: Jonathan Buttner
Date: Fri, 13 Dec 2024 14:56:07 -0500
Subject: [PATCH 1/8] Including examples

---
 .../inference/inference-shared.asciidoc       |  42 +-
 .../inference/unified-inference.asciidoc      | 411 ++++++++++++++++++
 2 files changed, 452 insertions(+), 1 deletion(-)
 create mode 100644 docs/reference/inference/unified-inference.asciidoc

diff --git a/docs/reference/inference/inference-shared.asciidoc b/docs/reference/inference/inference-shared.asciidoc
index da497c6581e5d..be8c8056ce0cd 100644
--- a/docs/reference/inference/inference-shared.asciidoc
+++ b/docs/reference/inference/inference-shared.asciidoc
@@ -41,7 +41,7 @@ end::chunking-settings[]
 
 tag::chunking-settings-max-chunking-size[]
 Specifies the maximum size of a chunk in words.
 Defaults to `250`.
-This value cannot be higher than `300` or lower than `20` (for `sentence` strategy) or `10` (for `word` strategy). 
+This value cannot be higher than `300` or lower than `20` (for `sentence` strategy) or `10` (for `word` strategy).
 end::chunking-settings-max-chunking-size[]
 
 tag::chunking-settings-overlap[]
@@ -63,4 +63,44 @@ Specifies the chunking strategy.
 It could be either `sentence` or `word`.
 end::chunking-settings-strategy[]
 
+tag::unified-schema-content-with-examples[]
+.Examples
+[%collapsible%closed]
+======
+String example
+[source,json]
+------------------------------------------------------------
+{
+  "content": "Some string"
+}
+------------------------------------------------------------
+// NOTCONSOLE
+
+Object example
+[source,json]
+------------------------------------------------------------
+{
+  "content": [
+    {
+      "text": "Some text",
+      "type": "text"
+    }
+  ]
+}
+------------------------------------------------------------
+// NOTCONSOLE
+======
+String representation:::
+(Required, string)
+The text content.
++
+Object representation:::
+`text`::::
+(Required, string)
+The text content.
++
+`type`::::
+(Required, string)
+This must be set to the value `text`.
+end::unified-schema-content-with-examples[]
diff --git a/docs/reference/inference/unified-inference.asciidoc b/docs/reference/inference/unified-inference.asciidoc
new file mode 100644
index 0000000000000..695e58c49ee8b
--- /dev/null
+++ b/docs/reference/inference/unified-inference.asciidoc
@@ -0,0 +1,411 @@
+[role="xpack"]
+[[unified-inference-api]]
+=== Unified inference API
+
+Streams a chat completion response using the Unified Schema format.
+
+IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in {ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face.
+For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models.
+However, if you do not plan to use the {infer} APIs to use these models or if you want to use non-NLP models, use the <>.
+
+
+[discrete]
+[[unified-inference-api-request]]
+==== {api-request-title}
+
+`POST /_inference/<inference_id>/_unified`
+
+`POST /_inference/<task_type>/<inference_id>/_unified`
+
+
+[discrete]
+[[unified-inference-api-prereqs]]
+==== {api-prereq-title}
+
+* Requires the `monitor_inference` <>
+(the built-in `inference_admin` and `inference_user` roles grant this privilege)
+* You must use a client that supports streaming.
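+
+For example, a minimal Python sketch that consumes the stream might look like the following; the endpoint name `openai-completion`, the URL, and the API key are illustrative placeholders only:
+
+[source,python]
+------------------------------------------------------------
+import requests
+
+resp = requests.post(
+    "http://localhost:9200/_inference/completion/openai-completion/_unified",
+    headers={"Authorization": "ApiKey <api-key>", "Content-Type": "application/json"},
+    json={"messages": [{"role": "user", "content": "What is Elastic?"}]},
+    stream=True,  # read the response incrementally instead of buffering it all
+)
+for line in resp.iter_lines():
+    if line:
+        print(line.decode("utf-8"))
+------------------------------------------------------------
+// NOTCONSOLE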
+
+
+[discrete]
+[[unified-inference-api-desc]]
+==== {api-description-title}
+
+The unified {infer} API enables real-time responses for completion tasks by delivering answers incrementally, reducing response times during computation.
+It only works with the `completion` task type for OpenAI and Elastic Inference Service. The Unified Schema defines a common schema for use across multiple services.
+
+
+[discrete]
+[[unified-inference-api-path-params]]
+==== {api-path-parms-title}
+
+`<inference_id>`::
+(Required, string)
+The unique identifier of the {infer} endpoint.
+
+
+`<task_type>`::
+(Optional, string)
+The type of {infer} task that the model performs. If included, this must be set to the value `completion`.
+
+
+[discrete]
+[[unified-inference-api-request-body]]
+==== {api-request-body-title}
+
+`messages`::
+(Required, array of objects) A list of objects representing the conversation.
++
+.Assistant message
+[%collapsible%closed]
+=====
+`content`::
+(Required unless `tool_calls` is specified, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
++
+`role`::
+(Required, string)
+The role of the message author. This should be set to `assistant` for this type of message.
++
+`tool_calls`::
+(Optional, array of objects)
++
+.Examples
+[%collapsible%closed]
+======
+[source,json]
+------------------------------------------------------------
+{
+  "tool_calls": [
+    {
+      "id": "call_KcAjWtAww20AihPHphUh46Gd",
+      "type": "function",
+      "function": {
+        "name": "get_current_weather",
+        "arguments": "{\"location\":\"Boston, MA\"}"
+      }
+    }
+  ]
+}
+------------------------------------------------------------
+// NOTCONSOLE
+======
++
+`id`:::
+(Required, string)
+The identifier of the tool call.
++
+`type`:::
+(Required, string)
+The type of tool call. This must be set to the value `function`.
++
+`function`:::
+(Required, object)
+The function that the model called.
++
+`name`::::
+(Required, string)
+The name of the function to call.
++
+`arguments`::::
+(Required, string)
+The arguments to call the function with in JSON format.
+=====
++
+.System message
+[%collapsible%closed]
+=====
+`content`:::
+(Required, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
++
+`role`:::
+(Required, string)
+The role of the message author. This should be set to `system` for this type of message.
+=====
++
+.Tool message
+[%collapsible%closed]
+=====
+`content`::
+(Required, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
++
+`role`::
+(Required, string)
+The role of the message author. This should be set to `tool` for this type of message.
++
+`tool_call_id`::
+(Required, string)
+The tool call that this message is responding to.
+=====
++
+.User message
+[%collapsible%closed]
+=====
+`content`::
+(Required, string or array of objects)
+The contents of the message.
++
+include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
++
+`role`::
+(Required, string)
+The role of the message author. This should be set to `user` for this type of message.
+=====
+
+`model`::
+(Optional, string)
+The ID of the model to use.
+
+`max_completion_tokens`::
+(Optional, integer)
+The upper bound limit for the number of tokens that can be generated for a completion request.
+
+`stop`::
+(Optional, array of strings)
+A sequence of strings to control when the model should stop generating additional tokens.
+
+`temperature`::
+(Optional, float)
+The sampling temperature to use.
+
+`tools`::
+(Optional, array of objects)
+A list of tools that the model can call.
++
+.Structure
+[%collapsible%closed]
+=====
+`type`::
+(Required, string)
+The type of tool. This must be set to the value `function`.
++
+`function`::
+(Required, object)
+The function definition.
++
+`description`:::
+(Optional, string)
+A description of what the function does. This is used by the model to choose when and how to call the function.
++
+`name`:::
+(Required, string)
+The name of the function.
++
+`parameters`:::
+(Optional, object)
+The parameters the function accepts. This should be formatted as a JSON object.
++
+`strict`:::
+(Optional, boolean)
+Whether to enable schema adherence when generating the function call.
+=====
++
+.Examples
+[%collapsible%closed]
+======
+[source,json]
+------------------------------------------------------------
+{
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_price_of_item",
+        "description": "Get the current price of an item",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "item": {
+              "id": "12345"
+            },
+            "unit": {
+              "type": "currency"
+            }
+          }
+        }
+      }
+    }
+  ]
+}
+------------------------------------------------------------
+// NOTCONSOLE
+======
+
+`tool_choice`::
+(Optional, string or object)
+Controls which tool is called by the model.
++
+String representation:::
+One of `auto`, `none`, or `required`. `auto` allows the model to choose between calling tools and generating a message. `none` causes the model to not call any tools. `required` forces the model to call one or more tools.
++
+Object representation:::
++
+.Structure
+[%collapsible%closed]
+=====
+`type`::
+(Required, string)
+The type of the tool. This must be set to the value `function`.
++
+`function`::
+(Required, object)
++
+`name`:::
+(Required, string)
+The name of the function to call.
+=====
++
+.Examples
+[%collapsible%closed]
+=====
+[source,json]
+------------------------------------------------------------
+{
+  "tool_choice": {
+    "type": "function",
+    "function": {
+      "name": "get_current_weather"
+    }
+  }
+}
+------------------------------------------------------------
+// NOTCONSOLE
+=====
+
+`top_p`::
+(Optional, float)
+Nucleus sampling, an alternative to sampling with temperature.
+
+[discrete]
+[[unified-inference-api-example]]
+==== {api-examples-title}
+
+The following example performs a completion on the example question with streaming.
+
+
+[source,console]
+------------------------------------------------------------
+POST _inference/completion/openai-completion/_unified
+{
+  "model": "gpt-4o",
+  "messages": [
+    {
+      "role": "user",
+      "content": "What is Elastic?"
+    }
+  ]
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+The following example performs a completion using an Assistant message with `tool_calls`.
+
+[source,console]
+------------------------------------------------------------
+POST _inference/completion/openai-completion/_unified
+{
+  "messages": [
+    {
+      "role": "assistant",
+      "content": "Let's find out what the weather is",
+      "tool_calls": [ <1>
+        {
+          "id": "call_KcAjWtAww20AihPHphUh46Gd",
+          "type": "function",
+          "function": {
+            "name": "get_current_weather",
+            "arguments": "{\"location\":\"Boston, MA\"}"
+          }
+        }
+      ]
+    },
+    { <2>
+      "role": "tool",
+      "content": "The weather is cold",
+      "tool_call_id": "call_KcAjWtAww20AihPHphUh46Gd"
+    }
+  ]
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+<1> Each tool call needs a corresponding Tool message.
+<2> The corresponding Tool message.
+
+The following example performs a completion using a User message with `tools` and `tool_choice`.
+
+[source,console]
+------------------------------------------------------------
+POST _inference/completion/openai-completion/_unified
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": [
+        {
+          "type": "text",
+          "text": "What's the price of a scarf?"
+        }
+      ]
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_current_price",
+        "description": "Get the current price of an item",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "item": {
+              "id": "123"
+            }
+          }
+        }
+      }
+    }
+  ],
+  "tool_choice": {
+    "type": "function",
+    "function": {
+      "name": "get_current_price"
+    }
+  }
+}
+------------------------------------------------------------
+// TEST[skip:TBD]
+
+The API returns the following response when a request is made to the OpenAI service:
+
+
+[source,txt]
+------------------------------------------------------------
+event: message
+data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}
+
+event: message
+data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}
+
+event: message
+data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}
+
+(...)
+
+event: message
+data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}} <1>
+
+event: message
+data: [DONE]
+------------------------------------------------------------
+// NOTCONSOLE
+
+<1> The last object message of the stream contains the token usage information.
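+
+As a hedged sketch only: a client could assemble the streamed events above into a single answer as follows, assuming `resp` is a streaming HTTP response such as the one in the prerequisites example:
+
+[source,python]
+------------------------------------------------------------
+import json
+
+answer = []
+usage = None
+for line in resp.iter_lines():
+    if not line.startswith(b"data: "):
+        continue  # skip "event: message" lines and blank separators
+    payload = line[len(b"data: "):]
+    if payload == b"[DONE]":
+        break  # the stream terminates with a [DONE] sentinel
+    chunk = json.loads(payload)
+    for choice in chunk.get("choices", []):
+        answer.append(choice.get("delta", {}).get("content", ""))
+    usage = chunk.get("usage") or usage  # token counts arrive on the last object message
+
+print("".join(answer))
+------------------------------------------------------------
+// NOTCONSOLE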
From 16d0f626683126a658338e480b638d1a62a751b2 Mon Sep 17 00:00:00 2001
From: Jonathan Buttner
Date: Fri, 13 Dec 2024 15:06:16 -0500
Subject: [PATCH 2/8] Using js instead of json

---
 docs/reference/inference/inference-shared.asciidoc  | 4 ++--
 docs/reference/inference/unified-inference.asciidoc | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/reference/inference/inference-shared.asciidoc b/docs/reference/inference/inference-shared.asciidoc
index be8c8056ce0cd..d8f24b2a9a797 100644
--- a/docs/reference/inference/inference-shared.asciidoc
+++ b/docs/reference/inference/inference-shared.asciidoc
@@ -68,7 +68,7 @@ tag::unified-schema-content-with-examples[]
 [%collapsible%closed]
 ======
 String example
-[source,json]
+[source,js]
 ------------------------------------------------------------
 {
   "content": "Some string"
@@ -77,7 +77,7 @@ String example
 // NOTCONSOLE
 
 Object example
-[source,json]
+[source,js]
 ------------------------------------------------------------
 {
   "content": [
diff --git a/docs/reference/inference/unified-inference.asciidoc b/docs/reference/inference/unified-inference.asciidoc
index 695e58c49ee8b..7a8f4602e4b3f 100644
--- a/docs/reference/inference/unified-inference.asciidoc
+++ b/docs/reference/inference/unified-inference.asciidoc
@@ -75,7 +75,7 @@ The role of the message author. This should be set to `assistant` for this type
 .Examples
 [%collapsible%closed]
 ======
-[source,json]
+[source,js]
 ------------------------------------------------------------
 {
   "tool_calls": [
@@ -211,7 +211,7 @@ Whether to enable schema adherence when generating the function call.
 .Examples
 [%collapsible%closed]
 ======
-[source,json]
+[source,js]
 ------------------------------------------------------------
 {
   "tools": [
@@ -266,7 +266,7 @@ The name of the function to call.
 .Examples
 [%collapsible%closed]
 =====
-[source,json]
+[source,js]
 ------------------------------------------------------------
 {
   "tool_choice": {

From 45b8fa4af7815a3c7d280846ede22893009da3c4 Mon Sep 17 00:00:00 2001
From: Jonathan Buttner
Date: Mon, 16 Dec 2024 09:48:19 -0500
Subject: [PATCH 3/8] Adding unified docs to main page

---
 docs/reference/inference/inference-apis.asciidoc | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/reference/inference/inference-apis.asciidoc b/docs/reference/inference/inference-apis.asciidoc
index 8d5ee1b7d6ba5..3ab85229ebd06 100644
--- a/docs/reference/inference/inference-apis.asciidoc
+++ b/docs/reference/inference/inference-apis.asciidoc
@@ -20,6 +20,7 @@ the following APIs to manage {infer} models and perform {infer}:
 * <>
 * <>
 * <>
+* <>
 * <>
 
 [[inference-landscape]]
@@ -28,9 +29,9 @@ image::images/inference-landscape.jpg[A representation of the Elastic inference
 
 An {infer} endpoint enables you to use the corresponding {ml} model without
 manual deployment and apply it to your data at ingestion time through
-<>.
+<>.
 
-Choose a model from your provider or use ELSER – a retrieval model trained by
+Choose a model from your provider or use ELSER – a retrieval model trained by
 Elastic –, then create an {infer} endpoint by the <>.
 Now use <> to perform <> on your data.
 
@@ -61,7 +62,7 @@ The following list contains the default {infer} endpoints listed by `inference_i
 Use the `inference_id` of the endpoint in a <> field definition or when creating an <>.
 The API call will automatically download and deploy the model which might take a couple of minutes.
 Default {infer} endpoints have {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[adaptive allocations] enabled.
-For these models, the minimum number of allocations is `0`. 
+For these models, the minimum number of allocations is `0`.
 If there is no {infer} activity that uses the endpoint, the number of allocations will scale down to `0` automatically after 15 minutes.
@@ -78,7 +79,7 @@ Returning a long document in search results is less useful than providing the mo
 Each chunk will include the text subpassage and the corresponding embedding generated from it.
 By default, documents are split into sentences and grouped in sections up to 250 words with 1 sentence overlap so that each chunk shares a sentence with the previous chunk.
 
-Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break. 
+Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break.
 
 {es} uses the https://unicode-org.github.io/icu-docs/[ICU4J] library to detect word and sentence boundaries for chunking.
 https://unicode-org.github.io/icu/userguide/boundaryanalysis/#word-boundary[Word boundaries] are identified by following a series of rules, not just the presence of a whitespace character.
@@ -129,6 +130,7 @@ PUT _inference/sparse_embedding/small_chunk_size
 
 include::delete-inference.asciidoc[]
 include::get-inference.asciidoc[]
 include::post-inference.asciidoc[]
+include::unified-inference.asciidoc[]
 include::put-inference.asciidoc[]
 include::stream-inference.asciidoc[]
 include::update-inference.asciidoc[]

From 6e3373fe6923b44ae95d58f8a4eca9abd2aff82d Mon Sep 17 00:00:00 2001
From: Jonathan Buttner
Date: Mon, 16 Dec 2024 15:02:08 -0500
Subject: [PATCH 4/8] Adding missing description text

---
 docs/reference/inference/unified-inference.asciidoc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/reference/inference/unified-inference.asciidoc b/docs/reference/inference/unified-inference.asciidoc
index 7a8f4602e4b3f..3e66e7bd4bd38 100644
--- a/docs/reference/inference/unified-inference.asciidoc
+++ b/docs/reference/inference/unified-inference.asciidoc
@@ -71,6 +71,7 @@ The role of the message author. This should be set to `assistant` for this type
 +
 `tool_calls`::
 (Optional, array of objects)
+The tool calls generated by the model.
 +
 .Examples
 [%collapsible%closed]

From 5621112ef701f9fdf65bf59bd440ea7ff5ca0396 Mon Sep 17 00:00:00 2001
From: Jonathan Buttner
Date: Wed, 8 Jan 2025 13:08:28 -0500
Subject: [PATCH 5/8] Refactoring to remove unified route

---
 ...doc => chat-completion-inference.asciidoc} | 63 ++++++++++---------
 .../inference/inference-apis.asciidoc         |  4 +-
 .../inference/inference-shared.asciidoc       |  8 ++-
 .../inference/service-openai.asciidoc         | 12 +++-
 .../inference/stream-inference.asciidoc       |  6 +-
 5 files changed, 57 insertions(+), 36 deletions(-)
 rename docs/reference/inference/{unified-inference.asciidoc => chat-completion-inference.asciidoc} (75%)

diff --git a/docs/reference/inference/unified-inference.asciidoc b/docs/reference/inference/chat-completion-inference.asciidoc
similarity index 75%
rename from docs/reference/inference/unified-inference.asciidoc
rename to docs/reference/inference/chat-completion-inference.asciidoc
index 3e66e7bd4bd38..e902c40f5c15b 100644
--- a/docs/reference/inference/unified-inference.asciidoc
+++ b/docs/reference/inference/chat-completion-inference.asciidoc
@@ -1,8 +1,8 @@
 [role="xpack"]
-[[unified-inference-api]]
-=== Unified inference API
+[[chat-completion-inference-api]]
+=== Chat completion inference API
 
-Streams a chat completion response using the Unified Schema format.
+Streams a chat completion response.
 
 IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in {ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face.
 For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models.
@@ -10,16 +10,16 @@ However, if you do not plan to use the {infer} APIs to use these models or if yo
 
 
 [discrete]
-[[unified-inference-api-request]]
+[[chat-completion-inference-api-request]]
 ==== {api-request-title}
 
-`POST /_inference/<inference_id>/_unified`
+`POST /_inference/<inference_id>/_stream`
 
-`POST /_inference/<task_type>/<inference_id>/_unified`
+`POST /_inference/chat_completion/<inference_id>/_stream`
 
 
 [discrete]
-[[unified-inference-api-prereqs]]
+[[chat-completion-inference-api-prereqs]]
 ==== {api-prereq-title}
 
 * Requires the `monitor_inference` <>
@@ -28,15 +28,19 @@ However, if you do not plan to use the {infer} APIs to use these models or if yo
 
 [discrete]
-[[unified-inference-api-desc]]
+[[chat-completion-inference-api-desc]]
 ==== {api-description-title}
 
-The unified {infer} API enables real-time responses for completion tasks by delivering answers incrementally, reducing response times during computation.
-It only works with the `completion` task type for OpenAI and Elastic Inference Service. The Unified Schema defines a common schema for use across multiple services.
+The chat completion {infer} API enables real-time responses for chat completion tasks by delivering answers incrementally, reducing response times during computation.
+It only works with the `chat_completion` task type for OpenAI and Elastic Inference Service.
 
+[NOTE]
+====
+The `chat_completion` task type only supports streaming and therefore the path must include `_stream`.
+====
 
 [discrete]
-[[unified-inference-api-path-params]]
+[[chat-completion-inference-api-path-params]]
 ==== {api-path-parms-title}
 
 `<inference_id>`::
 (Required, string)
 The unique identifier of the {infer} endpoint.
@@ -46,15 +50,16 @@ The unique identifier of the {infer} endpoint.
 
 
 `<task_type>`::
 (Optional, string)
-The type of {infer} task that the model performs. If included, this must be set to the value `completion`.
+The type of {infer} task that the model performs. If included, this must be set to the value `chat_completion`.
 
 
 [discrete]
-[[unified-inference-api-request-body]]
+[[chat-completion-inference-api-request-body]]
 ==== {api-request-body-title}
 
 `messages`::
 (Required, array of objects) A list of objects representing the conversation.
+Requests should generally only add new messages from the user (role `user). The other message roles ("assistant", "system", or "tool") should generally only be copied from the response to a previous completion request, such that the messages array is built up over the course of a conversation.
 +
 .Assistant message
 [%collapsible%closed]
@@ -63,7 +68,7 @@ The type of {infer} task that the model performs. If included, this must be set
 (Required unless `tool_calls` is specified, string or array of objects)
 The contents of the message.
 +
-include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
 +
 `role`::
 (Required, string)
 The role of the message author. This should be set to `assistant` for this type of message.
@@ -122,7 +127,7 @@ The arguments to call the function with in JSON format.
 (Required, string or array of objects)
 The contents of the message.
 +
-include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
 +
 `role`:::
 (Required, string)
 The role of the message author. This should be set to `system` for this type of message.
@@ -136,7 +141,7 @@ The role of the message author. This should be set to `system` for this type of
 (Required, string or array of objects)
 The contents of the message.
 +
-include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
 +
 `role`::
 (Required, string)
 The role of the message author. This should be set to `tool` for this type of message.
@@ -154,7 +159,7 @@ The tool call that this message is responding to.
 (Required, string or array of objects)
 The contents of the message.
 +
-include::inference-shared.asciidoc[tag=unified-schema-content-with-examples]
+include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
 +
 `role`::
 (Required, string)
 The role of the message author. This should be set to `user` for this type of message.
@@ -163,7 +168,7 @@ The role of the message author. This should be set to `user` for this type of me
 
 `model`::
 (Optional, string)
-The ID of the model to use.
+The ID of the model to use. By default, the model ID is set to the value included when creating the inference endpoint.
 
 `max_completion_tokens`::
 (Optional, integer)
@@ -286,15 +291,15 @@ The name of the function to call.
 Nucleus sampling, an alternative to sampling with temperature.
 
 [discrete]
-[[unified-inference-api-example]]
+[[chat-completion-inference-api-example]]
 ==== {api-examples-title}
 
-The following example performs a completion on the example question with streaming.
+The following example performs a chat completion on the example question with streaming.
 
 
 [source,console]
 ------------------------------------------------------------
-POST _inference/completion/openai-completion/_unified
+POST _inference/chat_completion/openai-completion/_stream
 {
   "model": "gpt-4o",
@@ -307,11 +312,11 @@ POST _inference/chat_completion/openai-completion/_stream
 ------------------------------------------------------------
 // TEST[skip:TBD]
 
-The following example performs a completion using an Assistant message with `tool_calls`.
+The following example performs a chat completion using an Assistant message with `tool_calls`.
 
 [source,console]
 ------------------------------------------------------------
-POST _inference/completion/openai-completion/_unified
+POST _inference/chat_completion/openai-completion/_stream
 {
   "messages": [
     {
@@ -341,11 +346,11 @@ POST _inference/chat_completion/openai-completion/_stream
 <1> Each tool call needs a corresponding Tool message.
 <2> The corresponding Tool message.
 
-The following example performs a completion using a User message with `tools` and `tool_choice`.
+The following example performs a chat completion using a User message with `tools` and `tool_choice`.
 
 [source,console]
 ------------------------------------------------------------
-POST _inference/completion/openai-completion/_unified
+POST _inference/chat_completion/openai-completion/_stream
 {
   "messages": [
     {
@@ -391,18 +396,18 @@ The API returns the following response when a request is made to the OpenAI serv
 [source,txt]
 ------------------------------------------------------------
 event: message
-data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}
+data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}
 
 event: message
-data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}
+data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}
 
 event: message
-data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}
+data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}
 
 (...)
 
 event: message
-data: {"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}} <1>
+data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}} <1>
 
 event: message
 data: [DONE]
diff --git a/docs/reference/inference/inference-apis.asciidoc b/docs/reference/inference/inference-apis.asciidoc
index 3ab85229ebd06..0a004948be194 100644
--- a/docs/reference/inference/inference-apis.asciidoc
+++ b/docs/reference/inference/inference-apis.asciidoc
@@ -20,7 +20,7 @@ the following APIs to manage {infer} models and perform {infer}:
 * <>
 * <>
 * <>
-* <>
+* <>
 * <>
 
 [[inference-landscape]]
@@ -130,7 +130,7 @@ PUT _inference/sparse_embedding/small_chunk_size
 include::delete-inference.asciidoc[]
 include::get-inference.asciidoc[]
 include::post-inference.asciidoc[]
-include::unified-inference.asciidoc[]
+include::chat-completion-inference.asciidoc[]
 include::put-inference.asciidoc[]
 include::stream-inference.asciidoc[]
 include::update-inference.asciidoc[]
diff --git a/docs/reference/inference/inference-shared.asciidoc b/docs/reference/inference/inference-shared.asciidoc
index d8f24b2a9a797..b133c54082810 100644
--- a/docs/reference/inference/inference-shared.asciidoc
+++ b/docs/reference/inference/inference-shared.asciidoc
@@ -63,7 +63,7 @@ Specifies the chunking strategy.
 It could be either `sentence` or `word`.
 end::chunking-settings-strategy[]
 
-tag::unified-schema-content-with-examples[]
+tag::chat-completion-schema-content-with-examples[]
 .Examples
 [%collapsible%closed]
 ======
@@ -103,4 +103,8 @@ The text content.
 `type`::::
 (Required, string)
 This must be set to the value `text`.
-end::unified-schema-content-with-examples[]
+end::chat-completion-schema-content-with-examples[]
+
+tag::chat-completion-docs[]
+For more information on how to use the `chat_completion` task type, please refer to the <>.
+end::chat-completion-docs[]
diff --git a/docs/reference/inference/service-openai.asciidoc b/docs/reference/inference/service-openai.asciidoc
index 9211e2d08e88b..f5217b7e50d55 100644
--- a/docs/reference/inference/service-openai.asciidoc
+++ b/docs/reference/inference/service-openai.asciidoc
@@ -25,10 +25,18 @@ include::inference-shared.asciidoc[tag=task-type]
 --
 Available task types:
 
+* `chat_completion`,
 * `completion`,
 * `text_embedding`.
 --
 
+[NOTE]
+====
+The `chat_completion` task type only supports streaming and therefore the path must include `_stream`.
+
+include::inference-shared.asciidoc[tag=chat-completion-docs]
+====
+
 [discrete]
 [[infer-service-openai-api-request-body]]
 ==== {api-request-body-title}
@@ -55,7 +63,7 @@ include::inference-shared.asciidoc[tag=chunking-settings-strategy]
 
 `service`::
 (Required, string)
-The type of service supported for the specified task type. In this case, 
+The type of service supported for the specified task type. In this case,
`service_settings`:: @@ -170,4 +178,4 @@ PUT _inference/completion/openai-completion } } ------------------------------------------------------------ -// TEST[skip:TBD] \ No newline at end of file +// TEST[skip:TBD] diff --git a/docs/reference/inference/stream-inference.asciidoc b/docs/reference/inference/stream-inference.asciidoc index e66acd630cb3e..c2c44430cd32d 100644 --- a/docs/reference/inference/stream-inference.asciidoc +++ b/docs/reference/inference/stream-inference.asciidoc @@ -32,8 +32,12 @@ However, if you do not plan to use the {infer} APIs to use these models or if yo ==== {api-description-title} The stream {infer} API enables real-time responses for completion tasks by delivering answers incrementally, reducing response times during computation. -It only works with the `completion` task type. +It only works with the `completion` and `chat_completion` task types. +[NOTE] +==== +include::inference-shared.asciidoc[tag=chat-completion-docs] +==== [discrete] [[stream-inference-api-path-params]] From 28f581b02933dd590e787b557597a00608cdf2c8 Mon Sep 17 00:00:00 2001 From: Jonathan Buttner Date: Fri, 10 Jan 2025 14:08:37 -0500 Subject: [PATCH 6/8] Addign back references to the _unified route --- .../inference/chat-completion-inference.asciidoc | 6 +++--- docs/reference/inference/put-inference.asciidoc | 10 +++++----- docs/reference/inference/service-openai.asciidoc | 2 +- 3 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/reference/inference/chat-completion-inference.asciidoc b/docs/reference/inference/chat-completion-inference.asciidoc index e902c40f5c15b..90b8303740d63 100644 --- a/docs/reference/inference/chat-completion-inference.asciidoc +++ b/docs/reference/inference/chat-completion-inference.asciidoc @@ -13,9 +13,9 @@ However, if you do not plan to use the {infer} APIs to use these models or if yo [[chat-completion-inference-api-request]] ==== {api-request-title} -`POST /_inference//_stream` +`POST /_inference//_unified` -`POST /_inference/chat_completion//_stream` +`POST /_inference/chat_completion//_unified` [discrete] @@ -36,7 +36,7 @@ It only works with the `chat_completion` task type for OpenAI and Elastic Infere [NOTE] ==== -The `chat_completion` task type only supports streaming and therefore the path must include `_stream`. +The `chat_completion` task type is only available within the _unified API and only supports streaming. ==== [discrete] diff --git a/docs/reference/inference/put-inference.asciidoc b/docs/reference/inference/put-inference.asciidoc index 4f82889f562d8..57f0a24c5c00f 100644 --- a/docs/reference/inference/put-inference.asciidoc +++ b/docs/reference/inference/put-inference.asciidoc @@ -36,7 +36,7 @@ include::inference-shared.asciidoc[tag=inference-id] include::inference-shared.asciidoc[tag=task-type] + -- -Refer to the service list in the <> for the available task types. +Refer to the service list in the <> for the available task types. -- @@ -55,7 +55,7 @@ The create {infer} API enables you to create an {infer} endpoint and configure a The following services are available through the {infer} API. -You can find the available task types next to the service name. +You can find the available task types next to the service name. 
 Click the links to review the configuration details of the services:
 
 * <> (`completion`, `rerank`, `sparse_embedding`, `text_embedding`)
@@ -67,10 +67,10 @@ Click the links to review the configuration details of the services:
 * <> (`rerank`, `sparse_embedding`, `text_embedding` - this service is for built-in models and models uploaded through Eland)
 * <> (`sparse_embedding`)
 * <> (`completion`, `text_embedding`)
-* <> (`rerank`, `text_embedding`) 
+* <> (`rerank`, `text_embedding`)
 * <> (`text_embedding`)
 * <> (`text_embedding`)
-* <> (`completion`, `text_embedding`)
+* <> (`chat_completion`, `completion`, `text_embedding`)
 * <> (`text_embedding`)
 
 The {es} and ELSER services run on a {ml} node in your {es} cluster. The rest of
@@ -87,4 +87,4 @@ When adaptive allocations are enabled:
 - The number of allocations scales up automatically when the load increases.
 - Allocations scale down to a minimum of 0 when the load decreases, saving resources.
 
-For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.
\ No newline at end of file
+For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.
diff --git a/docs/reference/inference/service-openai.asciidoc b/docs/reference/inference/service-openai.asciidoc
index f5217b7e50d55..c5ec557b97acd 100644
--- a/docs/reference/inference/service-openai.asciidoc
+++ b/docs/reference/inference/service-openai.asciidoc
@@ -32,7 +32,7 @@ Available task types:
 
 [NOTE]
 ====
-The `chat_completion` task type only supports streaming and therefore the path must include `_stream`.
+The `chat_completion` task type only supports streaming and only through the `_unified` API.
 
 include::inference-shared.asciidoc[tag=chat-completion-docs]
 ====

From 6f368a253e6e6f4ad73bcfa05c7778e5b09c900e Mon Sep 17 00:00:00 2001
From: Jonathan Buttner <56361221+jonathan-buttner@users.noreply.github.com>
Date: Mon, 13 Jan 2025 09:19:35 -0500
Subject: [PATCH 7/8] Update
 docs/reference/inference/chat-completion-inference.asciidoc
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: István Zoltán Szabó

---
 docs/reference/inference/chat-completion-inference.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/reference/inference/chat-completion-inference.asciidoc b/docs/reference/inference/chat-completion-inference.asciidoc
index 90b8303740d63..c18f029dab9a9 100644
--- a/docs/reference/inference/chat-completion-inference.asciidoc
+++ b/docs/reference/inference/chat-completion-inference.asciidoc
@@ -59,7 +59,7 @@ The type of {infer} task that the model performs. If included, this must be set
 
 `messages`::
 (Required, array of objects) A list of objects representing the conversation.
-Requests should generally only add new messages from the user (role `user). The other message roles ("assistant", "system", or "tool") should generally only be copied from the response to a previous completion request, such that the messages array is built up over the course of a conversation.
+Requests should generally only add new messages from the user (role `user`). The other message roles (`assistant`, `system`, or `tool`) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation.
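++
+For illustration only, a sketch of building up the `messages` array across turns; `stream_chat_completion` is a hypothetical helper that wraps the API call:
++
+[source,python]
+------------------------------------------------------------
+messages = [{"role": "user", "content": "What is Elastic?"}]
+reply = stream_chat_completion(messages)  # hypothetical helper wrapping this API
+
+# Copy the assistant's answer back into the array, then append the new user turn.
+messages.append({"role": "assistant", "content": reply})
+messages.append({"role": "user", "content": "Which products does it offer?"})
+reply = stream_chat_completion(messages)
+------------------------------------------------------------
+// NOTCONSOLE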
+ .Assistant message [%collapsible%closed] From 263eb087516ee298fa5f31cab7b5e13940457637 Mon Sep 17 00:00:00 2001 From: Jonathan Buttner Date: Mon, 13 Jan 2025 09:21:02 -0500 Subject: [PATCH 8/8] Address feedback --- docs/reference/inference/chat-completion-inference.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/inference/chat-completion-inference.asciidoc b/docs/reference/inference/chat-completion-inference.asciidoc index c18f029dab9a9..83a8f94634f2f 100644 --- a/docs/reference/inference/chat-completion-inference.asciidoc +++ b/docs/reference/inference/chat-completion-inference.asciidoc @@ -32,7 +32,7 @@ However, if you do not plan to use the {infer} APIs to use these models or if yo ==== {api-description-title} The chat completion {infer} API enables real-time responses for chat completion tasks by delivering answers incrementally, reducing response times during computation. -It only works with the `chat_completion` task type for OpenAI and Elastic Inference Service. +It only works with the `chat_completion` task type for `openai` and `elastic` {infer} services. [NOTE] ====