ESQL: Fix variable shadowing when pushing down past Project #108360
alex-spies merged 42 commits into elastic:main
Changing this and relying on being able to rename the attributes generated by Dissect/Grok will break bwc: old nodes cannot rename the generated attributes.
    // Names in the pattern and layout can differ.
    String[] patternNames = Expressions.names(dissect.parser().keyAttributes(Source.EMPTY)).toArray(new String[0]);

    Layout layout = layoutBuilder.build();
    source = source.with(
        new StringExtractOperator.StringExtractOperatorFactory(
            attributeNames,
            patternNames,
            EvalMapper.toEvaluator(expr, layout),
            () -> (input) -> dissect.parser().parser().parse(input)
        ),
This, together with the corresponding change to planGrok, is one of the main points of this PR.
Pinging @elastic/es-analytical-engine (Team:Analytics)
luigidellaquila left a comment
LGTM, thanks Alex!
I left a couple of comments, but I think the general approach and the implementation are correct.
...ck/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/LocalExecutionPlanner.java
...k/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizer.java
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/GeneratingPlan.java
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/parser/LogicalPlanBuilder.java
    countSameFieldWithEval
    required_capability: fixed_pushdown_past_project
    from employees | stats b = count(gender), c = count(gender) by gender | eval b = gender | sort c asc
I'd really love to have a couple more tests here, where you have multiple expressions in the same EVAL (and GROK, DISSECT...), where some are masked and some are not.
Also, some tests where the EVAL uses masked names in subsequent expressions.
I increased unit test coverage (multiple expressions in the same EVAL) and will add a couple more csv tests where we sometimes shadow, sometimes not.
I'll avoid adding them to stats.csv-spec, though, and will not include STATS commands in the tests; the fact that STATS triggered this bug is merely accidental. It's really RENAME ... | EVAL ... and similar that lead to this problem.
    countSameFieldWithEval
    required_capability: fixed_pushdown_past_project
    from employees | stats b = count(gender), c = count(gender) by gender | eval b = gender | sort c asc
I understand the goal is to make the query run to completion, but the first time I looked at this query, the alias b looked ambiguous: should it be gender or count(gender)? It seems we return b as gender, so it overwrites count(gender), but when pushdown happens, the order might be reversed. In SQL, if users write ambiguous column references, an error is returned. Should we return an error here to indicate that b is ambiguous, or let the query succeed (if we have agreement in ES|QL that when the same alias is defined in multiple places, the last one takes effect)?
I don't think we should throw errors. Masking happens all the time, even a simple | eval a = 1 | ... | eval a = 2 could be considered masking.
The intention is exactly to make sure that the final result for b is the value of gender, even if EVAL gets pushed down before STATS (that is what is supposed to happen with current planning rules)
Thanks @fang-xing-esql - ESQL in general allows shadowing attribute names that have been available previously. Take a look at the shadowing... csv tests to see this in action. (The tests exist for most commands, eval.csv-spec may be the most important one, though.) Some PRs ago, I also updated our docs to describe behavior in case of conflicting names.
The main idea is that we want to be able to compose expressions in eval, like EVAL x = to_upper(field), x = concat(x, some_other_field).
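To illustrate the composition argument, here is a minimal sketch of left-to-right evaluation in which a later assignment can read and shadow an earlier one. The row model (a map) and the `Assignment` record are assumptions made for this example only; this is not how ES|QL evaluates expressions internally.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model of a row and of EVAL assignments; not the ES|QL implementation.
final class EvalShadowingSketch {

    /** One `name = expression` entry of an EVAL; the expression reads the current row. */
    record Assignment(String name, Function<Map<String, Object>, Object> expression) {}

    /** Applies assignments left to right; a later assignment sees, and may shadow, earlier names. */
    static Map<String, Object> eval(Map<String, Object> row, List<Assignment> assignments) {
        Map<String, Object> result = new LinkedHashMap<>(row);
        for (Assignment a : assignments) {
            result.put(a.name(), a.expression().apply(result));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("field", "abc");
        row.put("some_other_field", "def");
        // Mirrors: EVAL x = to_upper(field), x = concat(x, some_other_field)
        Map<String, Object> out = eval(row, List.of(
            new Assignment("x", r -> ((String) r.get("field")).toUpperCase(Locale.ROOT)),
            new Assignment("x", r -> (String) r.get("x") + (String) r.get("some_other_field"))
        ));
        System.out.println(out.get("x")); // prints "ABCdef"
    }
}
```

The second `x` replaces the first, which is the "last definition wins" behavior described above.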
    if (newNames.size() != extractedFields.size()) {
        throw new IllegalArgumentException(
            "Number of new names is [" + newNames.size() + "] but there are [" + extractedFields.size() + "] existing names."
        );
    }
This check is common to all classes that implement GeneratingPlan. Maybe you could extract it into a default method in the interface, or define an abstract class that extends UnaryPlan and implements GeneratingPlan instead (the abstract class is more appropriate, I think).
Default method is simple enough! I'm afraid of changing the class hierarchy of RegexExtract, Enrich and Eval to put an abstract class in between: this might mess up some instanceof checks, and we might have to fiddle with EsqlNodeSubclassTests, which I'd like to avoid.
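A rough sketch of what the default-method variant could look like. The interface here is a string-based stand-in, and the method name `checkNumberOfNewNames` is an assumption for illustration; the real GeneratingPlan works on Attribute objects and plan nodes.

```java
import java.util.List;

// Hypothetical stand-in for the interface discussed here; the real GeneratingPlan
// works on Attribute objects and plan nodes, which this sketch replaces with strings.
interface GeneratingPlanSketch<P> {

    /** Names of the columns this plan node adds to its output. */
    List<String> generatedNames();

    /** Returns a copy of this node whose generated columns carry the given names. */
    P withGeneratedNames(List<String> newNames);

    /** Shared arity check, so implementations do not each repeat it. */
    default void checkNumberOfNewNames(List<String> newNames) {
        if (newNames.size() != generatedNames().size()) {
            throw new IllegalArgumentException(
                "Number of new names is [" + newNames.size() + "] but there are [" + generatedNames().size() + "] existing names."
            );
        }
    }
}

/** Toy implementation, loosely modelled on a node that extracts named fields. */
record ExtractSketch(List<String> generatedNames) implements GeneratingPlanSketch<ExtractSketch> {
    @Override
    public ExtractSketch withGeneratedNames(List<String> newNames) {
        checkNumberOfNewNames(newNames);
        return new ExtractSketch(List.copyOf(newNames));
    }
}
```

With this shape, existing classes keep their superclass and only pick up the shared check, which avoids touching the class hierarchy.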
    }

    @Override
    public List<Attribute> generatedAttributes() {
What is the difference between generatedAttributes and extractedFields? Why not call extractedFields() directly?
extractedFields() exists on Grok and Dissect, but not on Eval nor Enrich; these have fields() (Aliases, though) and enrichFields() (NamedExpressions).
Having all of them implement a common interface with this generatedAttributes makes it much easier to write the pushdown rule for Grok, Dissect, Eval and Enrich - and also to test them.
Additionally, we're in the process of adding more plan nodes that, with respect to shadowing, should behave the same: Lookup and Inlinestats. Having an interface should make the rules easier to reason about.
    layoutBuilder.append(dissect.extractedFields());
    final Expression expr = dissect.inputExpression();
    String[] attributeNames = Expressions.names(dissect.extractedFields()).toArray(new String[0]);
    // Names in the pattern and layout can differ.
When is this happening? Can you give an example?
Expanded the comment in the latest push.
This happens whenever we call GeneratingPlan.withGeneratedNames on Grok and Dissect.
This enables us to have consistent names in our logical plans, without having to rewrite the format strings for grok and dissect.
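As a simplified illustration of why two name arrays are needed: the parser keeps emitting values keyed by the names in the pattern, and the output columns take the (possibly renamed) attribute names at the same positions. The map-based model and the temporary name `__tmp_y` below are assumptions for this sketch; the real StringExtractOperator works on blocks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the real StringExtractOperator works on blocks, not on maps.
final class RenamedExtractSketch {

    /**
     * Re-keys parser output from pattern names to plan attribute names.
     * patternNames[i] is the key the parser emits; attributeNames[i] is the
     * name of the output column that receives that value.
     */
    static Map<String, String> toOutputColumns(String[] attributeNames, String[] patternNames, Map<String, String> parsed) {
        if (attributeNames.length != patternNames.length) {
            throw new IllegalArgumentException("attribute and pattern names must line up one to one");
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < patternNames.length; i++) {
            columns.put(attributeNames[i], parsed.get(patternNames[i]));
        }
        return columns;
    }

    public static void main(String[] args) {
        // The pattern still says "y", but the plan renamed the attribute to a made-up temporary name.
        Map<String, String> parsed = Map.of("y", "1", "w", "2");
        System.out.println(toOutputColumns(new String[] { "__tmp_y", "w" }, new String[] { "y", "w" }, parsed));
    }
}
```

The pattern string itself is never rewritten; only the mapping from pattern key to output column changes.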
    public record Parser(String pattern, String appendSeparator, DissectParser parser) {

        public List<Attribute> keyAttributes(Source src) {
Moving this one here is a bit forced, especially since it seems to act at the parser level (i.e. the use of ParsingException).
Moving it - just the validation - back into LogicalPlanBuilder.visitDissectCommand.
    public List<Attribute> keyAttributes(Source src) {
        Set<String> referenceKeys = parser.referenceKeys();
        if (referenceKeys.size() > 0) {
            throw new ParsingException(
@luigidellaquila when is this possible, in practical terms? Can you give an example query?
DissectParser can create field names (together with values) at runtime, from data, see https://www.elastic.co/guide/en/elasticsearch/reference/current/dissect-processor.html#dissect-modifier-reference-keys
In ES|QL we don't support it because we need to know the schema at planning time.
We have a test for this as well https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/parser/StatementParserTests.java#L756
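For readers unfamiliar with reference keys, a simplified sketch of such a check follows. It assumes that reference keys are exactly the keys written with the `*` or `&` modifier (as in `%{*k} %{&k}`) and scans the pattern with a regex; the actual validation relies on DissectParser.referenceKeys() instead, and the class and method names below are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified, hypothetical check; the real validation relies on DissectParser.referenceKeys().
final class ReferenceKeyCheckSketch {

    // Matches one %{...} key of a dissect pattern and captures its content.
    private static final Pattern KEY = Pattern.compile("%\\{([^}]*)\\}");

    /** Returns the keys in the pattern that use the `*` or `&` reference modifiers. */
    static List<String> referenceKeys(String dissectPattern) {
        List<String> found = new ArrayList<>();
        Matcher m = KEY.matcher(dissectPattern);
        while (m.find()) {
            String key = m.group(1);
            if (key.startsWith("*") || key.startsWith("&")) {
                found.add(key);
            }
        }
        return found;
    }

    /** Rejects patterns whose output column names would only be known at runtime. */
    static void validate(String dissectPattern) {
        List<String> refs = referenceKeys(dissectPattern);
        if (refs.isEmpty() == false) {
            throw new IllegalArgumentException("Reference keys not supported in dissect patterns: " + refs);
        }
    }
}
```

A pattern such as `%{a} %{b}` passes, while `%{*k} %{&k}` is rejected because the name of the resulting column would come from the data.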
Thanks for your reviews, @astefan, @luigidellaquila and @fang-xing-esql!
💔 Backport failed
💚 All backports created successfully
…108360)
Fix bugs caused by pushing down Eval, Grok, Dissect and Enrich past Rename, where after the pushdown, the columns added shadowed the columns to be renamed.
For Dissect and Grok, this enables naming their generated attributes to deviate from the names obtained from the dissect/grok patterns.
(cherry picked from commit e8a01bb)
# Conflicts:
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Dissect.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Enrich.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Eval.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/RegexExtract.java
…#111229)
Fix bugs caused by pushing down Eval, Grok, Dissect and Enrich past Rename, where after the pushdown, the columns added shadowed the columns to be renamed.
For Dissect and Grok, this enables naming their generated attributes to deviate from the names obtained from the dissect/grok patterns.
(cherry picked from commit e8a01bb)
# Conflicts:
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Dissect.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Enrich.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Eval.java
# x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/RegexExtract.java
Fix #108008
The issue was caused by the following situation:
Pushing down the Eval here was wrong and inconsistent, because we broke the rename y{r}#3 AS z.
To push down the Eval, we give a different name to the y produced by the Eval, which is the main change in this PR:
For Eval and Enrich, we can use the aliasing mechanisms that already exist in the logical plans; for Dissect and Grok, this PR enables naming their generated attributes to deviate from the names obtained from the dissect/grok patterns.
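The renaming step can be sketched roughly as follows, on a string-based model. Everything here is an assumption for illustration: the class and record names, the `__tmp_` naming scheme, and the use of plain names instead of attributes with ids. The sketch also ignores the possibility that a temporary name itself collides with an existing column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, string-based model of the idea; the actual rule operates on attributes with ids.
final class PushDownRenameSketch {

    /** The names the pushed-down node should generate, and the aliases to restore on top of it. */
    record Renaming(List<String> newGeneratedNames, Map<String, String> restoreAliases) {}

    /**
     * For each generated name that the projection also reads (e.g. the y in `y AS z`),
     * pick a temporary name so the pushed-down node no longer shadows the projection's input.
     */
    static Renaming avoidShadowing(List<String> generatedNames, Set<String> namesReadByProject) {
        List<String> newNames = new ArrayList<>();
        Map<String, String> restore = new LinkedHashMap<>(); // temporary name -> original name
        for (String name : generatedNames) {
            if (namesReadByProject.contains(name)) {
                String temp = "__tmp_" + name; // made-up naming scheme
                newNames.add(temp);
                restore.put(temp, name);
            } else {
                newNames.add(name);
            }
        }
        return new Renaming(newNames, restore);
    }
}
```

For generated names `[y, w]` and a projection that reads `y`, the pushed-down node would generate `[__tmp_y, w]`, and the projection would gain an alias turning `__tmp_y` back into `y`, leaving the original `y AS z` intact.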