ESQL: Prune unused regex extract nodes in optimizer#140982
ESQL: Prune unused regex extract nodes in optimizer#140982astefan merged 22 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
# Conflicts: # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java
|
buildkite test this |
astefan
left a comment
There was a problem hiding this comment.
Thank you for providing this fix. It does look ok conceptually, but I think it needs more complex tests. Both those in csv-spec files and the unit tests in LogicalPlanOptimizerTests test simple scenarios; add some tests where you drop the fields generated by grok and dissect, not only keep and stats. Test the functionality with lookup join and inline stats as well. Shadow the fields generated by grok and dissect with renames, evals and redefine those fields as well.
| var firstBranch = fork.children().getFirst(); | ||
| var firstBranchProject = as(firstBranch, Project.class); | ||
| assertThat(firstBranchProject.projections().size(), equalTo(3)); | ||
| // Dissect has been pruned since x, y, z fields are not used in the final aggregation |
There was a problem hiding this comment.
Since you added this comment here, it would be more complete to add comments about the other x, y and z fields (from other fork branches) that are dropped since they are not used anymore.
| /** | ||
| * Prunes RegexExtract operations (Dissect and Grok) when none of their extracted fields are used. | ||
| * <p> | ||
| * Note: Due to limitations in {@link RegexExtract#withGeneratedNames(List)}, which requires the exact same |
There was a problem hiding this comment.
I don't understand this comment. Why the presence of that method be a reason for not partially pruning grok/dissect? Can you, please, explain?
Also, withGeneratedNames is not in RegexExtract, but GeneratingPlan. And eval is a GeneratingPlan as well, but that can be partially pruned (see pruneColumnsInEval method from PruneColumns).
There was a problem hiding this comment.
I initially assumed we couldn’t create a dissect or grok with a different number of extractedFields. However, I’ve updated the logic for partially pruning RegexExtract plans and used sealed to ensure no future subclasses of RegexExtract are missed in the switch inside pruneUnusedRegexExtract.
This patch now appears to break some queries. Please take a look at the comment I posted below.
# Conflicts: # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java
|
Thank you for your review @astefan! At the moment, I’m not confident about the correct architectural direction here. If you have a better approach in mind, I’d appreciate your input. Context The Problem
When pruning occurs (
Why we can't just use Example |
|
@kanoshiou apologies for the delay of my reply. Please, go ahead and keep only the full pruning part of the Regex nodes. We'll consider the partial pruning for another future PR. It would be pity to not move on with this PR, it has some good code and tests that we should definitely have in the language. Thank you very much! Looking forward to review this PR after partial pruning is, for now, removed. |
|
@astefan I’ve removed the partial pruning logic. Feel free to review whenever you're free! |
|
buildkite test this |
|
@astefan the failed test has now heen resolved |
|
buildkite test this |
|
The failing test is not caused by this PR. Reference: #143174 |
|
buildkite test this |
|
buildkite test this |
| * Limit[1000[INTEGER],false,false] | ||
| * \_Project[[id{f}#12]] | ||
| * \_Dissect[x{r}#5,Parser[pattern=%{foo}, appendSeparator=, parser=org.elasticsearch.dissect.DissectParser@18e5d3b5],[foo{r}#6]] | ||
| * \_Project[[id{f}#12, $$languages$converted_to$keyword{f$}#14, $$languages$converted_to$keyword{f$}#14 AS x#5]] |
There was a problem hiding this comment.
This specific test has a special purpose and it should remain as is. I'll update it
astefan
left a comment
There was a problem hiding this comment.
LGTM. Thank you @kanoshiou
…prune-unused-regex-extract-nodes
…/kanoshiou/elasticsearch into prune-unused-regex-extract-nodes
|
buildkite test this |
|
buildkite test this |
|
buildkite test this |
|
buildkite test this |
|
buildkite test this |
|
Thank you for picking this up and polishing the final changes, @astefan! I appreciate the help in getting this merged. |
…cations * upstream/main: (56 commits) Mute org.elasticsearch.compute.lucene.read.ValueSourceReaderTypeConversionTests testLoadAll elastic#143471 [DOCS] Fix ES|QL function and commands lists versioning metadata (elastic#143402) Fix MMROperatorTests (elastic#143453) Fix CSV-escaped quotes in generated docs examples (elastic#143449) Fix SQL client parsing of array header values (elastic#143408) ESQL: Add extended distribution tests and fault injection for external sources (elastic#143420) ESQL: Fix datasource test failures on Windows and FIPS (elastic#143417) Add circuit breaker for query construction to prevent OOM from automaton-based queries (elastic#142150) Cleanup SpecIT logging configuration (elastic#143365) ESQL: Prune unused regex extract nodes in optimizer (elastic#140982) Ensure supported locale outside of Entitlements check (elastic#143405) feat(es|ql): add dense_vector support in coalesce (elastic#142974) [Test] Unmute SnapshotStressTestsIT (elastic#143359) Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:lookup-join.LookupJoinWithCoalesceFilterOnRight} elastic#143443 Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:lookup-join.MvJoinKeyOnTheLookupIndex} elastic#143442 ESQL: Fix CCS exchange sink cleanup (elastic#143325) Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:lookup-join.MvJoinKeyOnTheLookupIndexAfterStats} elastic#143434 Mute org.elasticsearch.xpack.esql.CsvIT test {csv-spec:lookup-join.MvJoinKeyFromRow} elastic#143432 Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:k8s-timeseries.Datenanos_derivative_compared_to_rate} elastic#143431 Mute org.elasticsearch.multiproject.test.CoreWithMultipleProjectsClientYamlTestSuiteIT test {yaml=search.retrievers/result-diversification/10_mmr_result_diversification_retriever/Test MMR result diversification single index float type} elastic#143430 ...
Summary
Enables the optimizer to remove entire
RegexExtractoperations (DissectandGrok) when none of their extracted fields are used downstream, eliminating unnecessary pattern matching overhead.Context
Previously,
RegexExtractnodes remained in the logical plan even when all extracted fields were unused, causing unnecessary pattern matching execution. Due toRegexExtract's design constraints requiring field count to match the pattern, the optimizer can only remove the entire node, not prune individual fields.Closes #132437