ESQL: introduce support for mapping-unavailable fields#139417
ESQL: introduce support for mapping-unavailable fields#139417bpintea wants to merge 37 commits intoelastic:mainfrom
Conversation
ℹ️ Important: Docs version tagging👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version. We use applies_to tags to mark version-specific features and changes. Expand for a quick overviewWhen to use applies_to tags:✅ At the page level to indicate which products/deployments the content applies to (mandatory) What NOT to do:❌ Don't remove or replace information that applies to an older version 🤔 Need help?
|
|
Hi @bpintea, I've created a changelog YAML for you. |
GalLalouche
left a comment
There was a problem hiding this comment.
I've left a few small style comments/suggestions, but otherwise LGTM!
| import org.elasticsearch.xpack.esql.core.type.DataType; | ||
|
|
||
| /** | ||
| * This is a unmapped-fields strategy discriminator. |
There was a problem hiding this comment.
Nit 1: an unmapped-fields (but see below).
Nit 2: I think that the Javadocs on the main class is redundant, given the Javadocs on the enums themselves. This Javadoc isn't any more informative than the enum name.
There was a problem hiding this comment.
Nit 1: addressed.
Nit 2: kept, as this is what an editor will display when showing info about the class/enum.
|
|
||
| /** | ||
| * In case the query references a field that's not present in the index mapping, alias this field to value {@code null} of type | ||
| * {@link DataType}.{@code NULL} |
There was a problem hiding this comment.
Use {@link DataType#NULL} to link to the enum variant directly.
There was a problem hiding this comment.
Ah, rite, thanks! Fixed.
| NULLIFY, | ||
|
|
||
| /** | ||
| * Just like {@code NULLIFY}, but instead of null-aliasing, insert extractors in the data source. |
There was a problem hiding this comment.
- (In general) Use {@link NULLIFY} to link to the
enumvariant directly, but see below. - This is very much unlike
NULLIFY, given that it actually (attempts) to load the data... Perhaps rephrase to:
In case the query references a field that's not present in the index mapping, attempts to load it from {@code _source}."There was a problem hiding this comment.
- It seems one cannot ref from within the enum with
link. - Applied, thx.
| * The rule handles fields that don't show up in the index mapping, but are used within the query. These fields can either be missing | ||
| * entirely, or be present in the document, but not in the mapping (which can happen with non-dynamic mappings). | ||
| * <p> | ||
| * In the case of the former ones, the rule introducees {@code EVAL missing = NULL} commands (null-aliasing / null-Eval'ing). |
There was a problem hiding this comment.
- s/introducees/introduces.
- This comment is misleading, since it reads like the rule chooses nullifying/loading based on the data being present or not, and not based on the
enumpassed to it, which is configured by the user.
There was a problem hiding this comment.
- Fixed, thx.
- Added a clarification.
| // insert an Eval on top of those LeafPlan that are children of n-ary plans (could happen with UnionAll) | ||
| transformed = transformed.transformUp( | ||
| n -> n instanceof UnaryPlan == false && n instanceof LeafPlan == false, | ||
| nAry -> evalUnresolved(nAry, nullAliases) |
There was a problem hiding this comment.
While the overload here does "the right thing", I find it pretty confusing that it only works due the child being more specific. Consider just using two different names here, e.g., evalUnresolveUnary.
| Holder<Boolean> patched = new Holder<>(false); | ||
| child = child.transformDown( | ||
| // TODO add a suitable forEachDownMayReturnEarly equivalent | ||
| n -> patched.get() == false && n instanceof Project, // process top Project only (Fork-injected) |
There was a problem hiding this comment.
Same downcast comment here :)
There was a problem hiding this comment.
There's a tiny subtlety here: I want to skip the type checks for the remainder of the tree once the boolean flag flips.
There was a problem hiding this comment.
FWIW, this is further refined by reusing the newer transformDownSkipBranch.
| } | ||
|
|
||
| private static void assertSourceType(LogicalPlan source) { | ||
| switch (source) { |
There was a problem hiding this comment.
Personally, I think this part would be simpler with a plain old if (cond) { throw }.
There was a problem hiding this comment.
The cond would be hard to read: source instance of EsRelation == false && (or the || alternative) repeated two more times.
| private static List<Alias> nullAliases(List<UnresolvedAttribute> unresolved) { | ||
| Map<String, Alias> aliasesMap = new LinkedHashMap<>(unresolved.size()); | ||
| for (var u : unresolved) { | ||
| if (aliasesMap.containsKey(u.name()) == false) { |
There was a problem hiding this comment.
Consider using putIfAbsent, or if you want to avoid the allocation: computeIfAbsent.
| return new Alias(attribute.source(), attribute.name(), NULLIFIED); | ||
| } | ||
|
|
||
| // collect all UAs in the node |
There was a problem hiding this comment.
Replace comment with a method rename :)
There was a problem hiding this comment.
I chose to extend the comment, since a method name would "loose" too much detail.
| import static org.hamcrest.Matchers.hasSize; | ||
| import static org.hamcrest.Matchers.is; | ||
|
|
||
| public class AnalyzerUnmappedTests extends ESTestCase { |
There was a problem hiding this comment.
cough Golden tests 🧐 cough
x-pack/plugin/esql/qa/testFixtures/src/main/resources/unmapped-load.csv-spec
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/unmapped-load.csv-spec
Show resolved
Hide resolved
| ROW x = 1 | ||
| | EVAL language_code = does_not_exist::INTEGER | ||
| | LOOKUP JOIN languages_lookup ON language_code | ||
| ; |
There was a problem hiding this comment.
What about stuff like:
SET unmapped_fields="nullify"\;
ROW x = 1
| EVAL language_code = does_not_exist::INTEGER
SET unmapped_fields="load"\;
| LOOKUP JOIN languages_lookup ON language_code
;
?
Is that allowed?
There was a problem hiding this comment.
That is not, there's just one SET allowed per statment/query.
astefan
left a comment
There was a problem hiding this comment.
There is some very good work in this PR, especially the attention to the different use cases and commands that can use "unmapped fields". Good job with this and the tests; I learned a bit about this feature by looking and trying out different queries from tests.
I am seeing difficulties in adding such a broad feature with different tricky/special commands to account for (fork, sub-queries, _insist and union types) and I am seeing some inconsistent behavior and outcome (see my comments). I don't think I have enough context to understand the full list of functionality for this feature, but I am seeing some things missing and I am curious on the plan for those missing bits/limitations (maybe?).
I will go over the PR one more time, at least. But, so far it is looking good, ignoring the missing bits/issues I mentioned above.
...k/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerUnmappedTests.java
Outdated
Show resolved
Hide resolved
...k/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/AnalyzerUnmappedTests.java
Outdated
Show resolved
Hide resolved
This will allow re-evaluating the output past a RENAME/DROP/KEEP, once an unmapping field is injected.
astefan
left a comment
There was a problem hiding this comment.
LGTM. Left some comments exploratory in nature and, if they are valid, a potential set of improvements. But can be addressed later with further work on this feature. Great work in this PR!
A comment for later discussions (if any), not a show stopper, more like a heads up for some potential confusion amongst users and weird hard-to-track-down bugs later in the development cycles.
The unmapped_message is materialized whenever KEEP is used. If I do FROM partial_mapping_sample_data | SORT unmapped_message I get an error about the field not being found. That's fair and this shows that there should be a mechanism for some "special" fields to be made available in a query command. I am also aware that this should improve with follow-up PRs.
If these "special" fields (unmapped, not visible etc) need a separate mechanism to be made available, why are we treating them equal to other fields, the "normal" ones? IMHO, to have a clean story around these unmapped fields, we need to "signal" their presence in a different way that shouldn't lead users to think that they are treated equally. With _insist command (that is snapshot only currently) this intention is very clear: _insist is a separate mechanism by which a user tells ESQL that whatever field name _insist refers to, that field is "special" in some way or another.
At some point, I think this transparent way of loading unmapped fields with keep will come back and bite us because these fields are not like the regular fields. We shouldn't treat them as regular fields, to eliminate any sort of confusion. This could also cripple our work on optimizations where we have several mechanisms to add evals to account for possible shadowed fields, where we move (potentially eliminate) commands around. This will make it easier to introduce bugs hard to track down and fix later in the development cycle.
Looking at how these fields are used and loaded and because they depend on being KEEPed in some way or another, I am wondering why the explicit reference to a potentially unmapped field is not part of FROM.
Just a humble 2c.
| private static LogicalPlan nullify(LogicalPlan plan, List<UnresolvedAttribute> unresolved) { | ||
| var nullAliases = nullAliases(unresolved); | ||
|
|
||
| // insert an Eval on top of every LeafPlan, if there's a UnaryPlan atop it |
There was a problem hiding this comment.
Any reason why you treat these two branches (unary and nary) differently? Why not in the same transformUp check?
|
|
||
| var transformed = load ? load(plan, unresolved) : nullify(plan, unresolved); | ||
|
|
||
| return transformed.equals(plan) ? plan : refreshPlan(transformed, unresolved); |
There was a problem hiding this comment.
Shouldn't this be enough with transformed == plan?
| return insisted; | ||
| } | ||
|
|
||
| // TODO: would an alternative to this be to drop the current Fork and have ResolveRefs#resolveFork re-resolve it. We might need |
There was a problem hiding this comment.
Having to re-resolve Fork (or better said, update whatever output Fork is creating) is something I encountered as well. It's a clash between the Fork's need to jump in the plan very early in the process and resolve the output columns and align them properly between its branches and the fact that some plans change/add/remove references from the plan and whatever Fork already resolved needs a "refresh".
Unfortunately, this is not the only tricky place in the ESQL planner that needs special treatment. Union types, inline stats, TS are other features that need special hand-holding.
Realistically, imho, we might be better off documenting these special treatments somewhere so that future us know how to approach stuff at that time and if needed.
There was a problem hiding this comment.
Yes, I think we should probably refactor Fork for more robustness (a similar comment).
| var refreshed = refreshUnresolved(plan, unresolved); | ||
| return refreshChildren(refreshed); | ||
| } | ||
|
|
||
| /** | ||
| * The UAs that haven't been resolved are marked as unresolvable with a custom message. This needs to be removed for | ||
| * {@link Analyzer.ResolveRefs} to attempt again to wire them to the newly added aliases. That's what this method does. | ||
| */ | ||
| private static LogicalPlan refreshUnresolved(LogicalPlan plan, List<UnresolvedAttribute> unresolved) { | ||
| return plan.transformExpressionsOnlyUp(UnresolvedAttribute.class, ua -> { | ||
| if (unresolved.contains(ua)) { | ||
| unresolved.remove(ua); | ||
| // Besides clearing the message, we need to refresh the nameId to avoid equality with the previous plan. | ||
| // (A `new UnresolvedAttribute(ua.source(), ua.name())` would save an allocation, but is problematic with subtypes.) | ||
| ua = (ua.withId(new NameId())).withUnresolvedMessage(null); | ||
| } | ||
| return ua; |
There was a problem hiding this comment.
I am curious why these two steps (refreshUnresolved for clearing UAs unresolved messages and refreshChildren) can't be performed in a single tree pass, I am clearly missing a detail that was probably obvious in some of your tests.
There was a problem hiding this comment.
They could probably have been, though it might not have been a clear code. But refreshing the plan has been factored away.
alex-spies
left a comment
There was a problem hiding this comment.
Gave it another go. Need to do at least another round, this has become huge.
Ideas for follow-up work, summarized from my comments below:
- We may want to issue a warning if a user sets
unmapped_fields:"load"but the source for one of the indices is disabled. - We need to figure out how to avoid having both
UnresolvedNamePatternandUnresolvedPattern. I think there's a way, and in the current state, the attribute/named expression hierarchy is pretty confusing with these two co-existing. - Our generative tests should probably randomly prepend one of the 3 settings for
unmapped_fields. - We want more tests for
unmapped_fields=\"load\". E.g.- Tests for the behavior of union types when an index has an unmapped/missing field that is multi-typed based on the mapping of 2 other indices.
- More tests that do something with the unmapped/missing fields (e.g. join or enrich on them!)
- More tests that do something before using the unmapped fields, esp. join on an unrelated field.
- Subqueries using an unmapped/missing field
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/EsqlProject.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Maybe let's add comments to the two files to inform folks that whatever's added here should probably also be reflected in unmapped_fields.csv-spec and vice versa?
| 2024-10-23T12:15:03.360Z | 173.21.2.162 | 3450232 | Connected to 10.1.0.3! | ||
| ; | ||
|
|
||
| fieldIsMappedToNonKeywordSingleIndex |
There was a problem hiding this comment.
This test and the ones below until unmappedFieldAppearsLast (resp. unmappedFieldDoesNotAppearLast) lost their meaning IMO. I believe the originals from unmapped_field.csv-spec were about using INSIST on actually mapped fields. As it stands, the tests here are confusing, esp. because the name implies that they're about unmapped fields but they're not. I don't think they're worth keeping.
| 2024-10-23T12:15:03.360Z | Connected to 10.1.0.3! | Disconnected from 10.1.0.3 | ||
| ; | ||
|
|
||
| # Kept this test disabled to show the difference from IGNORE_🐔: if everywhere unmapped and not mentioned, a field (message) won't show |
There was a problem hiding this comment.
Do we have an IGNORE_🐔?
| # Kept this test disabled to show the difference from IGNORE_🐔: if everywhere unmapped and not mentioned, a field (message) won't show | |
| # Kept this test disabled to show the difference from INSIST_🐔: if everywhere unmapped and not mentioned, a field (message) won't show |
|
|
||
| # Kept this test disabled to show the difference from IGNORE_🐔: if everywhere unmapped and not mentioned, a field (message) won't show | ||
| # up at all. | ||
| fieldIsUnmappedButSourceIsDisabledSingleIndex-Ignore |
There was a problem hiding this comment.
I don't think there's much value in keeping around an ignored test to compare this with an in-development command. There's already a test at the very beginning of this file that shows that we don't just load unmapped fields if there's just a FROM.
Also applies below.
However! I do think that we should have a couple single-index tests with partial_mapping_excluded_source_sample_data. Because that index' mapping has a disabled source, and we should showcase the behavior in case that the source is outright disabled. (In which case, unmapped_fields:"nullify" and unmapped_fields:"load" seem to be equivalent.)
| ; | ||
|
|
||
| ##################### | ||
| # Multi index tests # |
There was a problem hiding this comment.
I think we need another set of tests where we have a union type that is unmapped in one of the indices, and outright missing (with source disabled) in another.
| insistOnTopOfInsistMultiIndex | ||
| required_capability: unmapped_fields | ||
| required_capability: index_metadata_field | ||
| required_capability: optional_fields | ||
|
|
||
| SET unmapped_fields="load"\; | ||
| FROM partial_mapping_sample_data, sample_data METADATA _index | ||
| | KEEP _index, @timestamp, foo, message, unmapped_message | ||
| | SORT @timestamp DESC | ||
| ; |
There was a problem hiding this comment.
Another test duplication that doesn't make sense without INSIST.
| fieldIsMappedToDifferentTypesMultiIndex | ||
| required_capability: index_metadata_field | ||
| required_capability: unmapped_fields | ||
| required_capability: optional_fields | ||
|
|
||
| SET unmapped_fields="load"\; | ||
| FROM sample_data_ts_long, sample_data METADATA _index | ||
| | KEEP _index, @timestamp | ||
| | SORT _index | ||
| ; |
There was a problem hiding this comment.
More tests that just show that existing behavior stays intact. Less useful than the original test that had an INSIST @timestamp, but still useful; although again maybe better solved via randomization.
| 2024-10-23T13:55:01.543Z | partial_mapping_sample_data | Connected to 10.1.0.1! | ||
| ; | ||
|
|
||
| partialMappingUnionTypes |
There was a problem hiding this comment.
pre-existing but misleading: there are no union types involved here. The only thing that happens is that we pass message to to_string first thing, but there's no mapping conflict to be solved.
There was a problem hiding this comment.
I wanted to know how we behave w.r.t. subqueries and found out that loading unmapped fields seems to work inside a subquery, but when attempted after, we encounter an ISE:
SET unmapped_fields=\"load\"; from (from partial_mapping_sample_data), (from partial_mapping_sample_data) | keep unmapped_message
{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Found 2 problems\nline 1:34: Plan [EsqlProject[[@timestamp{f}#45, client_ip{f}#47, event_duration{f}#46, message{f}#48, unmapped_message{f}#60]]] optimized incorrectly due to missing references [unmapped_message{f}#60]\nline 1:70: Plan [EsqlProject[[@timestamp{f}#49, client_ip{f}#51, event_duration{f}#50, message{f}#52, unmapped_message{f}#61]]] optimized incorrectly due to missing references [unmapped_message{f}#61]"}],"type":"illegal_state_exception","reason":"Found 2 problems\nline 1:34: Plan [EsqlProject[[@timestamp{f}#45, client_ip{f}#47, event_duration{f}#46, message{f}#48, unmapped_message{f}#60]]] optimized incorrectly due to missing references [unmapped_message{f}#60]\nline 1:70: Plan [EsqlProject[[@timestamp{f}#49, client_ip{f}#51, event_duration{f}#50, message{f}#52, unmapped_message{f}#61]]] optimized incorrectly due to missing references [unmapped_message{f}#61]"},"status":500}
There was a problem hiding this comment.
The same problem happens for unmapped_fields="nullify"
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
Show resolved
Hide resolved
|
Closing this in favor of #140463. We can't push to this PR's branch due to permissions, so we re-created it. The copy includes all of the commits from here. |
|
@astefan , to answer your question in Bogdan's absence: the use case we'd like to cover is "I have a query and I'm pointing it at different index patterns". E.g. in O11y, we have fields that are used in many indices, but may be missing in others; or a roll-over in a data stream may add a field, but we want a query to still work when we run against the pre-rolled over index. Another use case are pre-defined dashboards based on ESQL queries, where you may want to pop in a different index pattern but re-use the same query. For these cases, you'd have to |
This introduces support for mapping-unavailable fields (present and not mapped or just missing). The behaviour is controlled through a new SET setting
unmapped_fields, which can take the values "FAIL", "NULLIFY", "LOAD".An optional field behaves just like a "normal", mapped field, with regards to how it flows through the commands chain: it can be simply used in the commands, as if present in the source, but can no longer be referenced once dropped - explicitly, with
DROP, or not selected by aKEEP, orRENAMEthat doesn't reference it -, or past aSTATSreduction.However, unlike a mapped field, if it's not reference at all, it won't show up in the output of a simple
FROM index.Currently, the schema difference between nullified fields and the loaded ones is in the type: nullified ones are of data type
NULL, while the loaded ones areKEYWORD.The implementation difference w.r.t. logical plan building is that the nullified fields are created as
nullvalue aliasing on top of the data source, while the loaded one are pushed as extractors into the source (this leverages the INSIST work).The partially mapped fields are also covered: when the setting is "load", these fields will be extracted from those indices that have the field, but isn't mapped. In case there's a conflict between the loaded
KEYWORDfield and the mapped type in the fields that have this field mapped, an explicit conversion is needed, just like with union types.Related: #138888