
[DRAFT] Cohere multimodal embedding support #135494

Closed

DonalEvans wants to merge 5 commits into elastic:main from DonalEvans:image-embedding-support


Conversation

DonalEvans (Contributor) commented Sep 25, 2025

This PR is intentionally split into two commits. The first one contains the relevant code changes required to add support for multimodal embedding using Cohere. The second one contains incidental changes that are not relevant to the new functionality, such as those caused by method signature changes and class or method renames.

To allow the relevant changes to be viewed without a lot of additional noise, please only look at the first commit.

For multimodal embeddings, Cohere uses the inputs field rather than the texts field used by our current text_embedding implementation (there is also an images field, but the API accepts only one image per request when using it, so it was not considered for use here due to that limitation):

"inputs": [
 {
   "content": [
     {
       "type": "text", "text": "abc"
     }
   ]
 },
 {
   "content": [
     {
       "type": "image_url", "image_url": {
         "url": "data:image/png;base64,..."
       }
     }
   ]
 }
]

One array of floats is returned per object in the inputs array, so it's possible to provide multiple texts/images within a single input's content array and receive a single embedding back for all of them, although I'm not sure what the use case for this is or whether we want to support it.
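For illustration, a single input that mixes text and image content might look like the following (a hypothetical payload following the same shape as above, not something produced by this PR), and would yield one embedding covering both items:

"inputs": [
  {
    "content": [
      {
        "type": "text", "text": "A caption describing the image"
      },
      {
        "type": "image_url", "image_url": {
          "url": "data:image/png;base64,..."
        }
      }
    ]
  }
]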

Example multimodal inference request with mixed text and image inputs using the design implemented in this PR:

POST _inference/multimodal_embedding/cohere-multimodal-embedding
{
  "input": ["First text", "Second text"],
  "image_url": ["data:image/png;base64,*image data*", "data:image/png;base64,*image data*"]
}

The above produces the output:

{
    "multimodal_embedding": [
        {
            "embedding": [
                ...
            ]
        },
        {
            "embedding": [
                ...
            ]
        },
        {
            "embedding": [
                ...
            ]
        },
        {
            "embedding": [
                ...
            ]
        }
    ]
}

Combining input and image_url into a single array of objects that know whether they are raw text or images, instead of using two arrays of String, would be neater, but would potentially require modifying a large number of files to allow this new object to be used where we currently use String. This change also matters if we need to preserve the ordering of inputs when a user provides a mixture of text and images: with two separate arrays, it is not possible to know the original order in which the mixed inputs were provided, so the embeddings returned to the user could be out of order compared to the inputs.
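As a rough sketch of that alternative (not implemented in this PR; the field names inside the objects are hypothetical), a single type-aware input array might look like:

POST _inference/multimodal_embedding/cohere-multimodal-embedding
{
  "input": [
    { "type": "text", "text": "First text" },
    { "type": "image_url", "image_url": "data:image/png;base64,*image data*" }
  ]
}

With a shape like this, the original order of mixed inputs is preserved, so the returned embeddings can be matched one-to-one with the request.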

In order to keep using the existing input field for multimodal embedding tasks, where it would hold an array of objects rather than of String, we could use InferenceActionProxy to route requests based on task type; this is already done to give UnifiedCompletionAction different parsing logic from InferenceAction.

Currently there is no way for ingested documents to define a field that maps to the equivalent of the image_url field used in HTTP requests, nor for binary image data in those documents to be converted to a base64-encoded data URL, which means multimodal embedding cannot be used at ingest time. Adding support for that functionality is well outside the scope of this PR and may need to be implemented by another team, but once it exists, the implementation in this PR will be able to consume the resulting data URLs with very minimal changes.
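For reference, producing such a data URL from raw image bytes needs nothing beyond the JDK; a minimal sketch (the class and method names here are illustrative, not part of this PR):

import java.util.Base64;

final class DataUrls {

    /**
     * Encodes raw image bytes as a data URL such as "data:image/png;base64,iVBOR...".
     * The MIME type must be supplied by whatever ingest component knows the image format.
     */
    static String toDataUrl(String mimeType, byte[] imageBytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }
}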

Comment on lines +129 to +136
boolean isImage = chunkInferenceInput instanceof ChunkInferenceImageInput;
if (isImage) {
    // Do not chunk image URLs
    chunker = NoopChunker.INSTANCE;
    chunkingSettings = NoneChunkingSettings.INSTANCE;
} else {
    chunker = chunkers.getOrDefault(chunkingSettings.getChunkingStrategy(), defaultChunker);
}
DonalEvans (Contributor Author) commented:

We do not want to chunk image URLs using the existing chunking logic, whether they are data URLs or web URLs. In the future, a chunking strategy could be adopted for data URLs, since that would allow images larger than the maximum image size for a given service to be turned into embeddings, but this would potentially be quite complex, as it would require at least partially decoding the data URL to know how to chunk it properly.

Comment on lines +79 to +82
// TODO: combine images and text inputs into one list?
if (input == null) {
    input = List.of();
}
DonalEvans (Contributor Author) commented:

This is a hacky workaround for the case where a multimodal embedding is performed with only image URLs and no text inputs. The current implementation assumes that there will always be something in the input field, but this implementation allows input to be empty provided that image_url is not also empty. Changing to a single list of inputs that can be either text or image URLs would resolve this problem.

DonalEvans added the Team:ML (Meta label for the ML team) label on Sep 25, 2025
DonalEvans (Contributor Author) commented:

Closing this PR since the investigation phase of this work is over and multimodal embedding support is being implemented in other PRs such as #138198

DonalEvans closed this Nov 18, 2025
DonalEvans deleted the image-embedding-support branch January 7, 2026 22:09
Labels

Team:ML (Meta label for the ML team), v9.3.0
