[ES|QL] TEXT_EMBEDDING function definition#135059
Conversation
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
🔍 Preview links for changed docs |
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
carlosdelest
left a comment
There was a problem hiding this comment.
LGTM!
Some minor nits about a common interface and some missing capability checks.
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/analysis/Analyzer.java
Show resolved
Hide resolved
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/inference/TextEmbedding.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/elasticsearch/xpack/esql/expression/function/inference/TextEmbedding.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerTestUtils.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Show resolved
Hide resolved
Co-authored-by: Carlos Delgado <6339205+carlosdelest@users.noreply.github.com>
…esql_text_embedding_function_definition
| @Param( | ||
| name = InferenceFunction.INFERENCE_ID_PARAMETER_NAME, | ||
| type = { "keyword" }, | ||
| description = "Identifier of the inference endpoint" |
There was a problem hiding this comment.
we need to add a function example and then regen the docs
otherwise Kibana CI will fail when we try to bring the json specification of this function to Kibana even if it's under snapshot.
this happened recently for decay too #134705 - opened a separate PR to address this #135094 since I promised to look into it.
There was a problem hiding this comment.
I think this should be fixed in Kibana CI so it does not break in such a case.
I mean, it is expected for a function that have appliesTo set to FunctionAppliesToLifecycle.DEVELOPMENT to be incomplete. Especially examples come at the very last when writing CSV tests.
Anyway, I added an placeholder example so I will not break anything, It will be replaced when adding more realistic CSV tests.
...plugin/esql/src/test/java/org/elasticsearch/xpack/esql/inference/InferenceResolverTests.java
Outdated
Show resolved
Hide resolved
| plan.forEachExpressionUp(UnresolvedFunction.class, f -> { | ||
| String functionName = snapshotRegistry.resolveAlias(f.name()); | ||
| if (snapshotRegistry.functionExists(functionName)) { | ||
| FunctionDefinition def = snapshotRegistry.resolveFunction(functionName); |
There was a problem hiding this comment.
usually this is done at the analyzer level - it's a bit odd to do the function resolution here, but I don't see another simpler option for now 🤷♀️
There was a problem hiding this comment.
In fact, there is other case where this kind of analysis is done outside of the analyzer (enrich policy or indices resolution).
Because the analyzer is purely synchronous and inference resolution has to be done async, it is necessary to do it this way.
| return BytesRefs.toString(e.fold(FoldContext.small())); | ||
| } | ||
|
|
||
| public String inferenceId(UnresolvedFunction f, FunctionDefinition def) { |
There was a problem hiding this comment.
if this method is actually needed, should this be private? maybe also add a javadoc to explain what it does?
There was a problem hiding this comment.
Added some comments and improved the code readability.
| } | ||
|
|
||
| @Override | ||
| public void writeTo(StreamOutput out) throws IOException { |
There was a problem hiding this comment.
I don't think we need to serialize this function - since we will always resolve it on the coordinator and replace it with its result.
We have other instances in ES|QL where we don't serialize - FORK, ROW, RENAME come to mind:
But this would probably be the first function that does not require serialization.
There was a problem hiding this comment.
I would prefer to keep the serialization even if the expression is not supposed to be moved between node.
The first reason is because the Function are supposed to implement it. If for some reason the infrastructure of execution need to move them between node in the future, I would not like to be the black sheep that make things more difficult.
Also TextEmbeddingTests extends AbstractFunctionTestCase which expect the function to be serializable.
I consider that the implementation or serialization / deserialization is simple to implement and it is probably not beneficial to let it unimplemented.
There was a problem hiding this comment.
The first reason is because the Function are supposed to implement it.
That does not mean it should. If the text_embedding function is actually serialized and sent between nodes, that's an execution path that should never happen.
Anything that extends from LogicalPlan is also supposed to implement serialization, but as I said before there are many examples of plans where we don't serialize.
We could have made the same argument for logical plans that the serialization is simple to implement, or that it's not beneficial to let it unimplemented.
However raising an exception in the case where we attempt to serialize these plans is intentional and it captures very clearly the fact that this should never happen.
I get that making this unserializable is more work, especially for AbstractFunctionTestCase, but at least then the behaviour would be correct.
There was a problem hiding this comment.
It feels to me that we are removing some code that we will ultimately reintroduce when we will support non constant input text embeddings. Anyway, I pushed a version without serialization so we can move forward and merge this PR.
| if (snapshotRegistry.functionExists(functionName)) { | ||
| FunctionDefinition def = snapshotRegistry.resolveFunction(functionName); | ||
| if (InferenceFunction.class.isAssignableFrom(def.clazz())) { | ||
| String inferenceId = inferenceId(f, def); |
There was a problem hiding this comment.
InferenceFunction has an inferenceId method already - I wonder if we can just use that, instead of trying to see which function param is called INFERENCE_ID_PARAMETER_NAME.
There was a problem hiding this comment.
My problem is that f is not an InferenceFunction but is an UnresoivedFunction.
The only way to access the inference id is to read the argument that have the right name.
By convention I supposed that it will always be called inference_id for all the InferenceFunction we will implement in the future.
There was a problem hiding this comment.
To me, the fact that we are looking for the argument that is called inference_id leaks an internal implementation detail of the function.
We could also try and resolve the function - the same way the analyzer does:
this would still not be ideal though - I am okay to leave this as it is - but we should reopen the discussion of making the analyzer support async rules.
| public LogicalPlan apply(LogicalPlan plan, AnalyzerContext context) { | ||
| return plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context)); | ||
| return plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context)) | ||
| .transformExpressionsOnly(InferenceFunction.class, f -> resolveInferenceFunction(f, context)); |
There was a problem hiding this comment.
Non-blocking question for my ES|QL education: Can you help me understand why this change?
There was a problem hiding this comment.
plan.transformDown(InferencePlan.class, p -> resolveInferencePlan(p, context))
will transform all the children in the plan that have the class InferencePlan with the result of the resolveInferencePlan(p, context) call. Inference plans are typically command using inference: RERANK and COMPLETION
.transformExpressionsOnly(InferenceFunction.class, f -> resolveInferenceFunction(f, context));
will transform do the same but for InferenceFunction instead of plan. Because text embedding is our first inference function this was not yet done, so I added it.
ioanatia
left a comment
There was a problem hiding this comment.
I think this looks really close - I would like us to look a bit more into the serialization aspect before merging
| if (snapshotRegistry.functionExists(functionName)) { | ||
| FunctionDefinition def = snapshotRegistry.resolveFunction(functionName); | ||
| if (InferenceFunction.class.isAssignableFrom(def.clazz())) { | ||
| String inferenceId = inferenceId(f, def); |
There was a problem hiding this comment.
To me, the fact that we are looking for the argument that is called inference_id leaks an internal implementation detail of the function.
We could also try and resolve the function - the same way the analyzer does:
this would still not be ideal though - I am okay to leave this as it is - but we should reopen the discussion of making the analyzer support async rules.
| } | ||
|
|
||
| @Override | ||
| public void writeTo(StreamOutput out) throws IOException { |
There was a problem hiding this comment.
The first reason is because the Function are supposed to implement it.
That does not mean it should. If the text_embedding function is actually serialized and sent between nodes, that's an execution path that should never happen.
Anything that extends from LogicalPlan is also supposed to implement serialization, but as I said before there are many examples of plans where we don't serialize.
We could have made the same argument for logical plans that the serialization is simple to implement, or that it's not beneficial to let it unimplemented.
However raising an exception in the case where we attempt to serialize these plans is intentional and it captures very clearly the fact that this should never happen.
I get that making this unserializable is more work, especially for AbstractFunctionTestCase, but at least then the behaviour would be correct.
…esql_text_embedding_function_definition
This PR introduces a new TEXT_EMBEDDING function to ES|QL, enabling users to generate dense vector embeddings for text directly within their queries.
Examples
Example 1: Generating embeddings
Example 2: Semantic search with KNN
This work is part of #131079