Skip to content

ESQL: Fix variable shadowing when pushing down past Project#108360

Merged
alex-spies merged 42 commits intoelastic:mainfrom
alex-spies:fix-pushdown-past-project-shadowing
Jul 23, 2024
Merged

ESQL: Fix variable shadowing when pushing down past Project#108360
alex-spies merged 42 commits intoelastic:mainfrom
alex-spies:fix-pushdown-past-project-shadowing

Conversation

@alex-spies
Copy link
Copy Markdown
Contributor

@alex-spies alex-spies commented May 7, 2024

Fix #108008

The issue was caused due to the following situation:

Eval[[x{r}#2 * 5[INTEGER] AS y]]
\_Project[[x{r}#2, y{r}#3, y{r}#3 AS z]]
  \_Row[[1[INTEGER] AS x, 2[INTEGER] AS y]]

Pushing down the Eval here was wrong and inconsistent, because we broke the rename y{r}#3 AS z.

To push down the Eval, we give a different name to the y produced by the Eval, which is the main change in this PR:

Project[[x{r}#2, y{r}#3 AS z, $$y$temp_name${r}#6 AS y]]
\_Eval[[x{r}#2 * 5[INTEGER] AS $$y$temp_name$]]
  \_Row[[1[INTEGER] AS x, 2[INTEGER] AS y]]

For Eval and Enrich, we can use the existing aliasing mechanisms existing in the logical plans; for Dissect and Grok, this PR enables naming their generated attributes to deviate from the names obtained from the dissect/grok patterns.

@alex-spies alex-spies force-pushed the fix-pushdown-past-project-shadowing branch from 4ad33b1 to c5c85d1 Compare May 7, 2024 12:29
@alex-spies alex-spies force-pushed the fix-pushdown-past-project-shadowing branch from c5c85d1 to 79a2f2d Compare May 7, 2024 12:30
@alex-spies alex-spies changed the title ESQL: Fix pushdown past project shadowing ESQL: Fix variable shadowing when pushing down past Project May 7, 2024
Comment on lines +422 to 431
// Names in the pattern and layout can differ.
String[] patternNames = Expressions.names(dissect.parser().keyAttributes(Source.EMPTY)).toArray(new String[0]);

Layout layout = layoutBuilder.build();
source = source.with(
new StringExtractOperator.StringExtractOperatorFactory(
attributeNames,
patternNames,
EvalMapper.toEvaluator(expr, layout),
() -> (input) -> dissect.parser().parser().parse(input)
),
Copy link
Copy Markdown
Contributor Author

@alex-spies alex-spies Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the corresponding change to planGrok are one of the main points of this PR.

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Copy Markdown
Contributor

@luigidellaquila luigidellaquila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Alex!

I left a couple of comments, but I think the general approach and the implementation correct


countSameFieldWithEval
required_capability: fixed_pushdown_past_project
from employees | stats b = count(gender), c = count(gender) by gender | eval b = gender | sort c asc
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd really love to have a couple more tests here, where you have multiple expressions in the same EVAL (and GROK, DISSECT...), where some are masked and some are not.
Also, some tests where the EVAL uses masked names in following expressions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I increased unit test coverage (multiple expressions in the same EVAL) and will add a couple more csv tests where we sometimes shadow, sometimes not.

I'll avoid adding them to stats.csv-spec, though, and will not include STATS commands in the tests; the fact that STATS triggered this bug is merely accidentaly, it's really RENAME ... | EVAL ... and similar that lead to this problem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.


countSameFieldWithEval
required_capability: fixed_pushdown_past_project
from employees | stats b = count(gender), c = count(gender) by gender | eval b = gender | sort c asc
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the goal is to make the query run to completion, but the first time I looked at this query, the alias b looks ambiguous, should it be gender or count(gender)? Seems like we return b as gender, it overwrites count(gender), but when pushdown happens, the order might be reversed. In SQL, if users code ambiguous column references, an error will return. Should we return an error here to indicate that b is ambiguous or make it return successfully here(if we have agreement in ES|QL that if the same alias is defined in multiple places, the last one will take effect)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should throw errors. Masking happens all the time, even a simple | eval a = 1 | ... | eval a = 2 could be considered masking.
The intention is exactly to make sure that the final result for b is the value of gender, even if EVAL gets pushed down before STATS (that is what is supposed to happen with current planning rules)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fang-xing-esql - ESQL in general allows shadowing attribute names that have been available previously. Take a look at the shadowing... csv tests to see this in action. (The tests exist for most commands, eval.csv-spec may be the most important one, though.) Some PRs ago, I also updated our docs to describe behavior in case of conflicting names.

The main idea is that we want to be able to compose expressions in eval, like EVAL x = to_upper(field), x = concat(x, some_other_field).

@alex-spies alex-spies added the test-full-bwc Trigger full BWC version matrix tests label Jul 23, 2024
Comment on lines +82 to +86
if (newNames.size() != extractedFields.size()) {
throw new IllegalArgumentException(
"Number of new names is [" + newNames.size() + "] but there are [" + extractedFields.size() + "] existing names."
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is common to all classes that implement GeneratingPlan. Maybe you could extract this in a default method in the interface or define an abstract class that extends UnaryPlan and implements GeneratingPlan instead (the abstract class is more appropriate I think).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default method is simple enough! I'm afraid of changing the class hierarchy of RegexExtract, Enrich and Eval to put an abstract class in between there: this might, maybe, mess up some instanceOf checks, and we might have to fiddle with EsqlNodeSubclassTests, which I like to avoid.

}

@Override
public List<Attribute> generatedAttributes() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between generatedAttributes and extractedFields? Why not calling extractedFields() directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractedFields() exists on Grok and Dissect, but not on Eval nor Enrich; these have fields() (Aliases, though) and enrichFields() (NamedExpressions).

Having all of them implement a common interface with this generatedAttributes makes it much easier to write the pushdown rule for Grok, Dissect, Eval and Enrich - and also to test them.

Additionally, we're in the process of adding more plan nodes that, with respect to shadowing, should behave the same: Lookup and Inlinestats. Having an interface should make the rules easier to reason about.

layoutBuilder.append(dissect.extractedFields());
final Expression expr = dissect.inputExpression();
String[] attributeNames = Expressions.names(dissect.extractedFields()).toArray(new String[0]);
// Names in the pattern and layout can differ.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this happening? Can you give an example?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the comment in the latest push.

This happens whenever we call GeneratingPlan.withGeneratedNames on Grok and Dissect.

This enables us to have consistent names in our logical plans, without having to rewrite the format strings for grok and dissect.


public record Parser(String pattern, String appendSeparator, DissectParser parser) {

public List<Attribute> keyAttributes(Source src) {
Copy link
Copy Markdown
Contributor

@astefan astefan Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this one here is a bit forced. Especially since it seems to act at the parser level (ie the use of ParsingException).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving it - just the validation - back into LogicalPlanBuilder.visitDissectCommand.

public List<Attribute> keyAttributes(Source src) {
Set<String> referenceKeys = parser.referenceKeys();
if (referenceKeys.size() > 0) {
throw new ParsingException(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luigidellaquila when is this possible, in practical terms? Can you give an example of query?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DissectParser can create field names (together with values) at runtime, from data, see https://www.elastic.co/guide/en/elasticsearch/reference/current/dissect-processor.html#dissect-modifier-reference-keys

In ES|QL we don't support it because we need to know the schema at planning time.

We have a test for this as well https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/parser/StatementParserTests.java#L756

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alex-spies
Copy link
Copy Markdown
Contributor Author

Thanks for your reviews, @astefan , @luigidellaquila and @fang-xing-esql !

@alex-spies alex-spies merged commit e8a01bb into elastic:main Jul 23, 2024
@alex-spies alex-spies deleted the fix-pushdown-past-project-shadowing branch July 23, 2024 13:02
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💔 Backport failed

Status Branch Result
8.15 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 108360

@alex-spies
Copy link
Copy Markdown
Contributor Author

💚 All backports created successfully

Status Branch Result
8.15

Questions ?

Please refer to the Backport tool documentation

alex-spies added a commit to alex-spies/elasticsearch that referenced this pull request Jul 24, 2024
…108360)

Fix bugs caused by pushing down Eval, Grok, Dissect and Enrich past Rename, where after the pushdown, the columns added shadowed the columns to be renamed.

For Dissect and Grok, this enables naming their generated attributes to deviate from the names obtained from the dissect/grok patterns.

(cherry picked from commit e8a01bb)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Dissect.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Enrich.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Eval.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/RegexExtract.java
elasticsearchmachine pushed a commit that referenced this pull request Jul 24, 2024
…#111229)

Fix bugs caused by pushing down Eval, Grok, Dissect and Enrich past Rename, where after the pushdown, the columns added shadowed the columns to be renamed.

For Dissect and Grok, this enables naming their generated attributes to deviate from the names obtained from the dissect/grok patterns.

(cherry picked from commit e8a01bb)

# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/optimizer/OptimizerRules.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Dissect.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Enrich.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/Eval.java
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/plan/logical/RegexExtract.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged backport pending >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) test-full-bwc Trigger full BWC version matrix tests v8.15.0 v8.16.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ES|QL: Project optimized incorrectly due to missing references

5 participants