[DRAFT] Cohere multimodal embedding support #135494
DonalEvans wants to merge 5 commits into elastic:main from
Conversation
```java
boolean isImage = chunkInferenceInput instanceof ChunkInferenceImageInput;
if (isImage) {
    // Do not chunk image URLs
    chunker = NoopChunker.INSTANCE;
    chunkingSettings = NoneChunkingSettings.INSTANCE;
} else {
    chunker = chunkers.getOrDefault(chunkingSettings.getChunkingStrategy(), defaultChunker);
}
```
We do not want to chunk image URLs using the existing chunking logic, be they data URLs or web URLs. In the future, a chunking strategy could be adopted for data URLs, since that would allow images larger than the maximum image size for a given service to be turned into embeddings, but this would be potentially quite complex as it would require at least partially decoding the data URL to know how to properly chunk it.
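For reference, both forms of image URL would pass through here as a single unchunked input; a minimal sketch of the two forms (values illustrative):

```json
{
  "image_url": [
    "https://example.com/photo.png",
    "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."
  ]
}
```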
```java
// TODO: combine images and text inputs into one list?
if (input == null) {
    input = List.of();
}
```
This is a hacky workaround for the case where a multimodal embedding is performed with only image URLs and no text inputs. The current implementation assumes that there will always be something in the `input` field, but this implementation allows for an empty `input` provided that `image_url` is not also empty. Changing to a single list of inputs which can be either text or image URLs will resolve this problem.
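For illustration, a minimal sketch of the image-only request body this workaround allows, assuming the two-field design described in the PR description (value truncated):

```json
{
  "image_url": ["data:image/jpeg;base64,..."]
}
```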
Closing this PR since the investigation phase of this work is over and multimodal embedding support is being implemented in other PRs such as #138198
This PR is intentionally split into two commits. The first one contains the relevant code changes required to add support for multimodal embedding using Cohere. The second one contains incidental changes that are not relevant to the new functionality, such as those caused by method signature changes and class or method name changes.
To allow the relevant changes to be viewed without a lot of additional noise, please only look at the first commit.
For multimodal embeddings, Cohere uses the `inputs` field rather than the `texts` field used by our current `text_embedding` implementation (there is also an `images` field, but the API accepts only one image per request when using this field, so it was not considered for use here due to its limitations):
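Roughly, a sketch of the shape of such a Cohere request, based on my reading of the Cohere embed API docs (model name and values are illustrative and may not match the exact requests used during this investigation):

```json
{
  "model": "embed-v4.0",
  "input_type": "search_document",
  "embedding_types": ["float"],
  "inputs": [
    {
      "content": [
        { "type": "text", "text": "A golden retriever playing on a beach" },
        { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
      ]
    }
  ]
}
```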
One array of floats is returned per `content` object, so it's possible to provide multiple texts/images and receive a single embedding back, although I'm not sure what the use case for this is or whether we want to support it.

Example multimodal inference request with mixed text and image inputs using the design implemented in this PR:
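A minimal sketch of what such a request body might look like, assuming the separate `input` and `image_url` arrays described below (values illustrative):

```json
{
  "input": ["A golden retriever playing on a beach"],
  "image_url": ["data:image/png;base64,..."]
}
```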
The above produces the output:
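The response would have roughly the shape of the existing `text_embedding` results, with one `embedding` array per input (values omitted here):

```json
{
  "text_embedding": [
    { "embedding": ["..."] },
    { "embedding": ["..."] }
  ]
}
```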
Combining `input` and `image_url` into a single object array, where the objects are aware of whether they're raw text or images, instead of using two arrays of `String` would be neater, but would potentially require modifying a large number of files to allow this new object to be used where we're currently using `String`. This change is also important if we need to preserve the ordering of inputs when a user provides a mixture of text and images: with two separate arrays, it's not possible to know the original order in which the mixed inputs were provided, meaning that the embeddings returned to the user would potentially be out of order compared to the inputs.
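A sketch of what such a combined array might look like, preserving the original input order (the per-object field names here are hypothetical, not something this PR implements):

```json
{
  "input": [
    { "text": "First caption" },
    { "image_url": "https://example.com/image-1.png" },
    { "text": "Second caption" }
  ]
}
```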
In order to support using the existing `input` field for multimodal embedding tasks that use an array of objects rather than `String`, we could use `InferenceActionProxy` to route requests based on task type, which is already being done to support `UnifiedCompletionAction` having different parsing logic from `InferenceAction`.
Currently there is no way for ingested documents to define a field that will map to the equivalent of the `image_url` field used in HTTP requests, or for binary image data in those documents to be converted to a base64 encoded data URL, meaning that it is not possible to use multimodal embedding at ingest time. Adding support for that functionality is well outside the scope of this PR and may need to be implemented by another team, but once that functionality is present, the implementation in this PR will be able to use the data URLs provided with very minimal changes.