ESQL: Fix injected attributes' IDs in UnionAll branches #141262
bpintea merged 6 commits into elastic:main from
Conversation
This fixes the generation of name IDs for the attributes that correspond to unmapped fields and are pushed into the different branches of UnionAll. So far, one set of IDs was generated and reused for all subplans. This is now updated to an individual set per subplan. A minor collateral proposed change: the CSV spec-based tests skipped due to missing capabilities are now logged.
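As a rough illustration of the gist of the fix, here is a minimal, self-contained sketch with made-up types and a plain counter (not the actual ESQL name-ID machinery): the injected attributes now get a freshly generated set of IDs for every UnionAll branch instead of one shared set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class PerBranchIdsSketch {
    // Simplified stand-in for an injected attribute: a name plus a name ID.
    record InjectedAttr(String name, long id) {}

    static final AtomicLong ID_GENERATOR = new AtomicLong();

    // One fresh set of IDs per branch, instead of generating the IDs once and
    // reusing them across all subplans.
    static List<List<InjectedAttr>> injectPerBranch(List<String> unmappedFields, int branchCount) {
        List<List<InjectedAttr>> perBranch = new ArrayList<>(branchCount);
        for (int branch = 0; branch < branchCount; branch++) {
            List<InjectedAttr> attrs = new ArrayList<>(unmappedFields.size());
            for (String field : unmappedFields) {
                attrs.add(new InjectedAttr(field, ID_GENERATOR.incrementAndGet()));
            }
            perBranch.add(attrs);
        }
        return perBranch;
    }
}
```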
Pinging @elastic/es-analytical-engine (Team:Analytics)

Hi @bpintea, I've created a changelog YAML for you.
    }

-   private static List<FieldAttribute> fieldsToLoad(List<UnresolvedAttribute> unresolved, Set<String> exclude) {
+   private static List<FieldAttribute> fieldsToLoad(Set<UnresolvedAttribute> unresolved, List<String> exclude) {
Isn't this a mistake? I would have expected the List to be the iteratee, and the Set to be the one we check contains on, but it seems to be the other way around.
It's a Set because the initial collection of UnresolvedAttributes is dedup'd -- this is what unresolvedLinkedSet() produces (// Some plans may reference the same UA multiple times (Aggregate groupings in aggregates, Eval): dedupe)
It's a List because that's what EsRelation#output (and then Expressions#names) produces.
What we want here is to exclude those attributes produced by the EsRelation itself, into which we would then later inject/insert the extractors.
Not sure if it's worth instantiating new collection types to wrap the existing ones.
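For illustration, a minimal sketch of the shape discussed above, with hypothetical stand-in types rather than the actual ESQL classes: the dedup'd (linked) set of unresolved attributes is iterated, and anything already produced by the EsRelation (a list of names) is excluded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class FieldsToLoadSketch {
    // Hypothetical stand-ins for the ESQL attribute types; not the actual classes.
    record UnresolvedAttr(String name) {}
    record FieldAttr(String name) {}

    static List<FieldAttr> fieldsToLoad(Set<UnresolvedAttr> unresolved, List<String> exclude) {
        List<FieldAttr> toLoad = new ArrayList<>();
        for (UnresolvedAttr ua : unresolved) {          // iterate the dedup'd (linked) set
            if (exclude.contains(ua.name()) == false) { // skip what the EsRelation already outputs
                toLoad.add(new FieldAttr(ua.name()));
            }
        }
        return toLoad;
    }
}
```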
     */
    private static Fork patchFork(Fork fork) {
        List<LogicalPlan> newChildren = new ArrayList<>(fork.children().size());
        Holder<Boolean> changed = new Holder<>(false);
Could you add a comment explaining the difference between `changed` and `patched`, since those names are too similar (or rename them so it's more obvious).
Renamed changed to childrenChanged
        unresolved.forEach(u -> aliasesMap.computeIfAbsent(u.name(), k -> nullAlias(u)));
        return new ArrayList<>(aliasesMap.values());
    private static List<Alias> nullAliases(Set<UnresolvedAttribute> unresolved) {
        List<Alias> aliases = new ArrayList<>(unresolved.size());
Why not use addAll (or even just a basic map, for that matter)?
Not sure I understand how addAll would help. A map could, but I find streams too heavy for a relatively simple iteration. But let me know if I misunderstood your suggestion.
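A small sketch of the two shapes being weighed here, using hypothetical stand-in types and a made-up `nullAlias` helper: a plain loop (roughly the shape of the change) versus a stream-based alternative.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class NullAliasesSketch {
    record Attr(String name) {}
    record Alias(String name, Object value) {}

    // hypothetical helper, mirroring the nullAlias(..) mentioned in the thread
    static Alias nullAlias(Attr a) {
        return new Alias(a.name(), null);
    }

    // plain loop, roughly the shape of the change under review
    static List<Alias> nullAliasesLoop(LinkedHashSet<Attr> unresolved) {
        List<Alias> aliases = new ArrayList<>(unresolved.size());
        for (Attr a : unresolved) {
            aliases.add(nullAlias(a));
        }
        return aliases;
    }

    // stream-based alternative the review hints at
    static List<Alias> nullAliasesStream(LinkedHashSet<Attr> unresolved) {
        return unresolved.stream().map(NullAliasesSketch::nullAlias).toList();
    }
}
```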
@@ -63,7 +63,10 @@
import static org.hamcrest.Matchers.hasItems;
Newly merged golden tests? 😁
Great. I'll extend in a subsequent PR, since this isn't going to be the only one.
     * are tested in integration tests.
     */
-   assumeFalse(
+   assumeFalseLogging(
- Since you're already touching this, perhaps change the logging from "X is not supported" to "<capability> is not enabled"?
- And if you do, you can just define a single nice helper function to check if a capability is enabled!
- Not sure, I personally find the existing "CSV tests cannot currently..." or "... in csv tests" messages better, tbh, since it's not about a capability not being enabled (it actually is enabled, and that's the "problem"), but the CSV testing infrastructure not being developed enough to support a new feature, right? But will let the others chime in as well.
        if (descendantOutputsAttribute(project, attribute) == false) {
            nullAliases.add(nullAlias(attribute));
        }
    private static Project patchForkProject(Project project, Holder<Boolean> changed) {
- Please add a comment explaining what this method does.
- Why do you need a Holder here? Can't you just check in the parent if the reference has changed?
- I've added a comment.
- Thanks, fixed (got like that through iterations).
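A minimal sketch of the alternative hinted at above, with illustrative types rather than the real plan classes: the parent compares references to detect a change instead of threading a mutable `Holder<Boolean>` flag.

```java
import java.util.ArrayList;
import java.util.List;

class PatchSketch {
    // Illustrative stand-ins for the plan nodes; not the real classes.
    interface Plan {}
    record Fork(List<Plan> children) implements Plan {}

    static Plan patchFork(Fork fork) {
        List<Plan> newChildren = new ArrayList<>(fork.children().size());
        boolean childrenChanged = false;
        for (Plan child : fork.children()) {
            Plan patched = patchChild(child);
            childrenChanged |= patched != child; // reference comparison replaces the Holder<Boolean>
            newChildren.add(patched);
        }
        return childrenChanged ? new Fork(newChildren) : fork;
    }

    static Plan patchChild(Plan child) {
        return child; // placeholder: the real code would rewrite Fork's Project children
    }
}
```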
        if (projectOutput.equals(childOutput) == false) {
            List<Attribute> delta = new ArrayList<>(childOutput);
            delta.removeAll(projectOutput);
            project = project.withProjections(mergeOutputAttributes(delta, projectOutput));
Could we please avoid this pattern of reassigning the input parameter and then returning it outside the block? Just use an early exit above.
This is a pre-existing pattern. Some folks find it easier to read code with fewer returns. (Myself, I don't necessarily, but I don't mind this style either).
> returning it outside the block
...reason being: if the control hasn't visited the block, the input is simply returned with no change.
I personally find it unfathomable that it's harder to read code with more early exits than it is to read code with more changes of the input variable, but to each their own I guess 🥲.
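To make the two styles in this exchange concrete, an illustrative sketch with simplified types (not the actual `Project`/`Attribute` classes):

```java
import java.util.List;

class EarlyExitSketch {
    record Project(List<String> projections) {
        Project withProjections(List<String> p) {
            return new Project(p);
        }
    }

    // style used in the change: reassign the parameter, single return at the end
    static Project reassignStyle(Project project, List<String> childOutput) {
        if (project.projections().equals(childOutput) == false) {
            project = project.withProjections(childOutput);
        }
        return project;
    }

    // style the reviewer prefers: early exit when there is nothing to do
    static Project earlyExitStyle(Project project, List<String> childOutput) {
        if (project.projections().equals(childOutput)) {
            return project;
        }
        return project.withProjections(childOutput);
    }
}
```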
    }

    // Some plans may reference the same UA multiple times (Aggregate groupings in aggregates, Eval): dedupe
    private static Set<UnresolvedAttribute> unresolvedLinkedSet(List<UnresolvedAttribute> unresolved) {
I think the return type here should be LinkedHashSet or at least SequencedSet, given the method name. If you opt for the latter, then consider also renaming this method.
> I think the return type here should be LinkedHashSet
Updated.
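A tiny sketch of the point about the declared return type, with a hypothetical attribute record standing in for `UnresolvedAttribute`:

```java
import java.util.LinkedHashSet;
import java.util.List;

class DedupeSketch {
    record UnresolvedAttr(String name) {}

    // Dedupe while preserving encounter order. Declaring the return type as
    // LinkedHashSet (or SequencedSet) makes the ordering guarantee visible to
    // callers; a plain Set does not.
    static LinkedHashSet<UnresolvedAttr> unresolvedLinkedSet(List<UnresolvedAttr> unresolved) {
        return new LinkedHashSet<>(unresolved);
    }
}
```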
astefan left a comment
LGTM with two questions.
        var projectOutput = project.output();
        var childOutput = project.child().output();
        if (projectOutput.equals(childOutput) == false) {
- for a reviewer it would have been easier to assess whether this `equals` could be tricky or not by seeing the actual type of the `.output()`. In the IDE this type is `List`.
- is there a scenario where this `equals` is missed because the same elements exist in both lists but in a different order?
- I've updated the declarations.
- There could be, yes. But in this case the `delta` list in the branch will be empty. The project will still be recreated, but the resulting instance will be equal to the previous one and the operation will eventually either leave the plan unchanged or changed, but due to other modifications. In any case, there should be a guard against that empty `delta` list, to avoid creating a new, unnecessary instance equal to the previous one -- thanks.

Since these changes aren't functionally impacting, I'd apply them in a follow-up PR (unless other changes will be required), if OK with you?
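A sketch of the guard mentioned above, again with simplified stand-in types and a made-up `merge` helper: only rebuild the `Project` when the computed delta is non-empty.

```java
import java.util.ArrayList;
import java.util.List;

class DeltaGuardSketch {
    record Project(List<String> projections) {
        Project withProjections(List<String> p) {
            return new Project(p);
        }
    }

    static Project patch(Project project, List<String> projectOutput, List<String> childOutput) {
        if (projectOutput.equals(childOutput) == false) {
            List<String> delta = new ArrayList<>(childOutput);
            delta.removeAll(projectOutput);
            if (delta.isEmpty() == false) { // guard: same elements in a different order produce no delta
                project = project.withProjections(merge(delta, projectOutput));
            }
        }
        return project;
    }

    // simplified stand-in for mergeOutputAttributes
    static List<String> merge(List<String> delta, List<String> output) {
        List<String> merged = new ArrayList<>(output);
        merged.addAll(delta);
        return merged;
    }
}
```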
        Set<AbstractConvertFunction> converts = oldOutputToConvertFunctions.get(oldAttr.name());
        if (converts != null) {
Why not the `contains` and `.get` approach? I see that for the code that reaches this part it is fairly safe to assume that there won't be any null sets returned for a key, but we are not sure how this code will evolve in the future.
This isn't strictly related, but I thought I might need an update of this code and spotted the pattern.
There's nothing wrong from the functional PoV, but the code, as it was, checks if the key is in the map, then does it again, but fetching the corresponding value. The code in the proposed change only does the latter. If the result/value is null, the key isn't in there. (Well, I guess it could [have] be[en] a null value, to be exact, but that would have resulted in an NPE by now(?))
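For reference, the two map-lookup patterns being compared, in a self-contained sketch:

```java
import java.util.Map;
import java.util.Set;

class MapLookupSketch {
    // containsKey followed by get: two lookups for the same key
    static void containsThenGet(Map<String, Set<String>> map, String key) {
        if (map.containsKey(key)) {
            Set<String> values = map.get(key);
            use(values);
        }
    }

    // single get with a null check: one lookup, same behavior as long as
    // null values are never stored in the map
    static void getThenCheck(Map<String, Set<String>> map, String key) {
        Set<String> values = map.get(key);
        if (values != null) {
            use(values);
        }
    }

    static void use(Set<String> values) {}
}
```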
-                   convert.replaceChildren(Collections.singletonList(oldAttr))
+                   convert.replaceChildren(Collections.singletonList(oldAttr)),
+                   null, // generate a new id
+                   true // this'll be used to Project the synthetic attributes out when finishing analysis
-   | KEEP emp_*
+   | KEEP emp_no, *
This is now testing something else, but the original test was a valid case. Do we want to add the original one back?
     * null[INTEGER] AS salary#35]]
     * \_Subquery[]
     * \_Filter[TOLONG(does_not_exist1{r}#20) > 1[INTEGER]]
     * \_Eval[[null[NULL] AS does_not_exist1#20, null[NULL] AS does_not_exist2#51]]
The comment reflects that we now don't re-use name ids from subquery branches; it's asserted in the test above, but not here - shouldn't we also assert the name ids for does_not_exist2 in this case?
     * \_Aggregate[[],[COUNT(*[KEYWORD],true[BOOLEAN],PT0S[TIME_DURATION]) AS c#4]]
     * \_Eval[[null[NULL] AS does_not_exist#53]]
     * \_EsRelation[employees][_meta_field{f}#24, emp_no{f}#18, first_name{f}#19, .
     * \_Eval[[null[NULL] AS does_not_exist#54]]
This is one of those funny cases where we might wrongly break a plan by adding this EVAL does_not_exist = null here - the field might actually exist and the downstream STATS might actually be computing COUNT(does_not_exist) (even though the field does_not_exist is not in the output of this subquery).
Added to #138888
Adding in #141340 (removeShadowing). This has been implemented for two out of three cases (the "load" case and adding the null-aliasing to an existing Eval), but was missing when adding a new Eval on top of a source.
Note however that this effort is mostly for producing a correct, but otherwise later still failing plan: if the null-aliasing is injected below a STATS that doesn't export the attribute (which is why the null-aliasing is done in the first place), the attribute will remain missing and the verification later will fail the query.
This is what the initially introduced testFailStatsThenKeep or testFailStatsThenEval test.
     * \_Eval[[TOLONG(does_not_exist2{r}#74) AS $$does_not_exist2$converted_to$long#78]]
     * \_Eval[[TOLONG(does_not_exist1{r}#26) AS $$does_not_exist1$converted_to$long#69]]
     * \_Eval[[null[KEYWORD] AS _meta_field#42, null[INTEGER] AS emp_no#43, null[KEYWORD] AS first_name#44,
     * null[TEXT] AS gender#45, null[DATETIME] AS hire_date#46, null[TEXT] AS job#47, null[KEYWORD] AS job.raw#48,
     * null[INTEGER] AS languages#49, null[KEYWORD] AS last_name#50, null[LONG] AS long_noidx#51,
     * null[INTEGER] AS salary#52]]
     * \_Subquery[]
     * \_Filter[TOLONG(does_not_exist1{r}#26) > 2[INTEGER]]
     * \_Eval[[null[NULL] AS does_not_exist1#26, null[NULL] AS does_not_exist2#71]]
This subquery branch used to be inconsistent: does_not_exist2 is added with name id 71 after the esrelation, but it's used to define $$does_not_exist2$converted_to$long later while being referenced under id 74.
We're not currently asserting correct name ids. Maybe we can borrow the dependency checker for this? (It might also become a part of the analyzer/verifier pipeline as part of #137362, but we don't have this yet)
                var unresolvedLinkedSet = unresolvedLinkedSet(unresolved);

-               var transformed = load ? load(plan, unresolved) : nullify(plan, unresolved);
+               var transformed = load ? load(plan, unresolvedLinkedSet) : nullify(plan, unresolvedLinkedSet);
If the unresolveds being in a linked (thus order-preserving) set is important, should the signature of load and nullify require a LinkedHashSet rather than a Set? We shouldn't be implicitly relying on a linked set if the compiler can guarantee this for us.
        Map<String, Alias> aliasesMap = new LinkedHashMap<>(unresolved.size());
        unresolved.forEach(u -> aliasesMap.computeIfAbsent(u.name(), k -> nullAlias(u)));
        return new ArrayList<>(aliasesMap.values());
    private static List<Alias> nullAliases(Set<UnresolvedAttribute> unresolved) {
Same here: If the output list is supposed to be stable, I think we should explicitly require a LinkedHashSet.
@alex-spies will address the points in the next PR (along with the ones Andrei raised). BTW, will follow up on some of the points left on the closed PR too. Gal, Andrei, Alex, thanks for the reviews and pointers!
💔 Backport failed
You can use sqren/backport to manually backport by running
    private static void checkMissingFork(QueryPlan<?> plan, Failures failures) {
        for (QueryPlan<?> child : plan.children()) {
            // TODO: this checks the set-semantics, but not the ordering
@ioanatia do we want a simple iteration over a subplan's output to check that, at the same position, there's an equality of name and type with that of the fork's output? Can there be set-equality, but not the same order?
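If the ordering check turns out to be wanted, a sketch of the simple position-wise iteration described above (with a simplified attribute record, not the actual `Attribute` class):

```java
import java.util.List;

class OutputOrderCheckSketch {
    // Illustrative attribute stand-in: name plus type.
    record Attr(String name, String type) {}

    // Position-wise check: same size, and at each index the name and type match.
    // Set-equality alone would accept the same attributes in a different order.
    static boolean sameOutputInOrder(List<Attr> forkOutput, List<Attr> subplanOutput) {
        if (forkOutput.size() != subplanOutput.size()) {
            return false;
        }
        for (int i = 0; i < forkOutput.size(); i++) {
            Attr expected = forkOutput.get(i);
            Attr actual = subplanOutput.get(i);
            if (expected.name().equals(actual.name()) == false
                || expected.type().equals(actual.type()) == false) {
                return false;
            }
        }
        return true;
    }
}
```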
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
) This fixes the generation of name IDs for the attributes that correspond to unmapped fields and are pushed into different branches of `UnionAll`. So far, one set of IDs was generated and reused for all subplans. This is now updated to an individual set per subplan. Along the change, the handling of `Fork` in `ResolveUnmapped` has been somewhat simplified. Also, more unit tests have been completed (where the plans are simple enough) and the plan comments updated to replace `EsqlProject` with the now merged `Project`. A minor collateral proposed change: the CSV spec-based tests skipped due to missing capabilities are now logged. (cherry picked from commit 8e3113c)
…) (#141675)

* ESQL: Fix injected attributes' IDs in UnionAll branches (#141262)

  This fixes the generation of name IDs for the attributes that correspond to unmapped fields and are pushed into different branches of `UnionAll`. So far, one set of IDs was generated and reused for all subplans. This is now updated to an individual set per subplan. Along the change, the handling of `Fork` in `ResolveUnmapped` has been somewhat simplified. Also, more unit tests have been completed (where the plans are simple enough) and the plan comments updated to replace `EsqlProject` with the now merged `Project`. A minor collateral proposed change: the CSV spec-based tests skipped due to missing capabilities are now logged.

  (cherry picked from commit 8e3113c)

* Fix tests

  9.3 does not have #139058, so the implicit limits at the top of subquery branches are still in place. Adjust the expectations accordingly.

* Checkstyle

---------

Co-authored-by: Bogdan Pintea <bogdan.pintea@elastic.co>
This fixes the generation of name IDs for the attributes that correspond to unmapped fields and are pushed into different branches of `UnionAll`. So far, one set of IDs was generated and reused for all subplans. This is now updated to an individual set per subplan. Along the change, the handling of `Fork` in `ResolveUnmapped` has been somewhat simplified. Also, more unit tests have been completed (where the plans are simple enough) and the plan comments updated to replace `EsqlProject` with the now merged `Project`. A minor collateral proposed change: the CSV spec-based tests skipped due to missing capabilities are now logged.