
[DRAFT] Cohere multimodal embedding support #135494

Closed

DonalEvans wants to merge 5 commits into elastic:main from DonalEvans:image-embedding-support


Conversation

DonalEvans (Contributor) commented Sep 25, 2025

This PR is intentionally split into two commits. The first one contains the relevant code changes required to add support for multimodal embedding using Cohere. The second one contains incidental changes that are not relevant to the new functionality, such as those caused by method signature changes and class or method renames.

To allow the relevant changes to be viewed without a lot of additional noise, please only look at the first commit.

For multimodal embeddings, Cohere uses the inputs field rather than the texts field used by our current text_embedding implementation (there is also an images field, but the API accepts only one image per request when using it, so it was not considered for use here due to that limitation):

"inputs": [
 {
   "content": [
     {
       "type": "text", "text": "abc"
     }
   ]
 },
 {
   "content": [
     {
       "type": "image_url", "image_url": {
         "url": "data:image/png;base64,..."
       }
     }
   ]
 }
]

One array of floats is returned per object in the inputs array, so it's possible to provide multiple texts/images within a single input's content array and receive a single embedding back for all of them, although I'm not sure what the use case for this is or whether we want to support it.
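For illustration, a single input that mixes text and image content might look like the following (a hypothetical payload following the same shape as above, not something produced by this PR), and would yield one embedding covering both items:

"inputs": [
  {
    "content": [
      {
        "type": "text", "text": "A caption describing the image"
      },
      {
        "type": "image_url", "image_url": {
          "url": "data:image/png;base64,..."
        }
      }
    ]
  }
]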

Example multimodal inference request with mixed text and image inputs using the design implemented in this PR:

POST _inference/multimodal_embedding/cohere-multimodal-embedding
{
  "input": ["First text", "Second text"],
  "image_url": ["data:image/png;base64,*image data*", "data:image/png;base64,*image data*"]
}

The above produces the output:

{
    "multimodal_embedding": [
        {
            "embedding": [
                ...
            ]
        },
        {
            "embedding": [
                ...
            ]
        },
        {
            "embedding": [
                ...
            ]
        },
        {
            "embedding": [
                ...
            ]
        }
    ]
}

Combining input and image_url into a single array of objects that know whether they are raw text or images, instead of using two arrays of String, would be neater, but would potentially require modifying a large number of files to allow this new object to be used where we currently use String. This change also matters if we need to preserve the ordering of inputs when a user provides a mixture of text and images: with two separate arrays, it is not possible to know the original order in which the mixed inputs were provided, so the embeddings returned to the user could be out of order compared to the inputs.
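As a rough sketch of that alternative (not implemented in this PR; the field names inside the objects are hypothetical), a single type-aware input array might look like:

POST _inference/multimodal_embedding/cohere-multimodal-embedding
{
  "input": [
    { "type": "text", "text": "First text" },
    { "type": "image_url", "image_url": "data:image/png;base64,*image data*" }
  ]
}

With a shape like this, the original order of mixed inputs is preserved, so the returned embeddings can be matched one-to-one with the request.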

In order to keep using the existing input field for multimodal embedding tasks, where it would hold an array of objects rather than of String, we could use InferenceActionProxy to route requests based on task type; this is already done to give UnifiedCompletionAction different parsing logic from InferenceAction.

Currently there is no way for ingested documents to define a field that maps to the equivalent of the image_url field used in HTTP requests, nor for binary image data in those documents to be converted to a base64-encoded data URL, which means multimodal embedding cannot be used at ingest time. Adding support for that functionality is well outside the scope of this PR and may need to be implemented by another team, but once it exists, the implementation in this PR will be able to consume the resulting data URLs with very minimal changes.
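For reference, producing such a data URL from raw image bytes needs nothing beyond the JDK; a minimal sketch (the class and method names here are illustrative, not part of this PR):

import java.util.Base64;

final class DataUrls {

    /**
     * Encodes raw image bytes as a data URL such as "data:image/png;base64,iVBOR...".
     * The MIME type must be supplied by whatever ingest component knows the image format.
     */
    static String toDataUrl(String mimeType, byte[] imageBytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }
}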

Comment on lines +129 to +136
boolean isImage = chunkInferenceInput instanceof ChunkInferenceImageInput;
if (isImage) {
    // Do not chunk image URLs
    chunker = NoopChunker.INSTANCE;
    chunkingSettings = NoneChunkingSettings.INSTANCE;
} else {
    chunker = chunkers.getOrDefault(chunkingSettings.getChunkingStrategy(), defaultChunker);
}
DonalEvans (Contributor Author) commented:

We do not want to chunk image URLs using the existing chunking logic, whether they are data URLs or web URLs. In the future, a chunking strategy could be adopted for data URLs, since that would allow images larger than the maximum image size for a given service to be turned into embeddings, but this would potentially be quite complex, as it would require at least partially decoding the data URL to know how to chunk it properly.

Comment on lines +79 to +82
// TODO: combine images and text inputs into one list?
if (input == null) {
    input = List.of();
}
DonalEvans (Contributor Author) commented:

This is a hacky workaround for the case where a multimodal embedding is performed with only image URLs and no text inputs. The current implementation assumes that there will always be something in the input field, but this implementation allows input to be empty provided that image_url is not also empty. Changing to a single list of inputs that can be either text or image URLs would resolve this problem.

DonalEvans added the Team:ML (Meta label for the ML team) label on Sep 25, 2025
DonalEvans (Contributor Author) commented:

Closing this PR since the investigation phase of this work is over and multimodal embedding support is being implemented in other PRs such as #138198

DonalEvans closed this Nov 18, 2025
DonalEvans deleted the image-embedding-support branch January 7, 2026 22:09
Labels

Team:ML (Meta label for the ML team), v9.3.0
