ESQL: Prune unused regex extract nodes in optimizer #140982

Merged
astefan merged 22 commits into elastic:main from kanoshiou:prune-unused-regex-extract-nodes
Mar 3, 2026

@kanoshiou
Contributor

Summary

Enables the optimizer to remove entire RegexExtract operations (Dissect and Grok) when none of their extracted fields are used downstream, eliminating unnecessary pattern matching overhead.

Context

Previously, RegexExtract nodes remained in the logical plan even when all extracted fields were unused, causing unnecessary pattern matching execution. Due to RegexExtract's design constraints requiring field count to match the pattern, the optimizer can only remove the entire node, not prune individual fields.
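
The whole-node pruning described above can be illustrated with a minimal sketch. It assumes a simplified linear plan; the `Node` type and the `prune` method are hypothetical stand-ins for illustration, not the classes or rule used in the ES|QL optimizer. The idea is to walk the plan from the last step backwards, track which attributes are still needed, and drop a Dissect/Grok-like step only when none of its generated attributes is needed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: Node and prune are hypothetical, not the ES|QL optimizer classes.
final class RegexExtractPruneSketch {

    /** One step of a linear plan: the attributes it reads and the ones it generates. */
    static final class Node {
        final String name;
        final boolean regexExtract; // true for DISSECT/GROK-like steps
        final Set<String> reads;
        final Set<String> generates;

        Node(String name, boolean regexExtract, Set<String> reads, Set<String> generates) {
            this.name = name;
            this.regexExtract = regexExtract;
            this.reads = reads;
            this.generates = generates;
        }
    }

    /**
     * Walks the plan from the last step to the first, tracking the attributes still needed.
     * A regex-extract step is dropped as a whole when none of its generated attributes is
     * needed; otherwise it is kept unchanged (no partial pruning). Other steps are always
     * kept in this sketch. Returns the names of the surviving steps in plan order.
     */
    static List<String> prune(List<Node> plan, Set<String> finalOutput) {
        Set<String> needed = new HashSet<>(finalOutput);
        List<String> kept = new ArrayList<>();
        for (int i = plan.size() - 1; i >= 0; i--) {
            Node node = plan.get(i);
            boolean anyGeneratedFieldUsed = !Collections.disjoint(node.generates, needed);
            if (node.regexExtract && !anyGeneratedFieldUsed) {
                continue; // whole node removed, so its pattern matching never runs
            }
            needed.removeAll(node.generates); // definitions satisfy the need for these names
            needed.addAll(node.reads);        // the inputs of a kept step become needed
            kept.add(node.name);
        }
        Collections.reverse(kept);
        return kept;
    }

    public static void main(String[] args) {
        List<Node> plan = List.of(
            new Node("eval x", false, Set.of("gender"), Set.of("x")),
            new Node("dissect x", true, Set.of("x"), Set.of("a", "b")),
            new Node("stats max(emp_no)", false, Set.of("emp_no"), Set.of("n"))
        );
        System.out.println(prune(plan, Set.of("n")));
    }
}
```

Because generated names are removed from the needed set before a step's inputs are added, a later step that redefines an extracted field (for example an eval that shadows it) correctly makes the earlier extraction unused in this model.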

Closes #132437

@elasticsearchmachine added the needs:triage, v9.4.0, and external-contributor labels Jan 20, 2026
@elasticsearchmachine added the Team:Analytics label and removed the needs:triage label Jan 22, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

# Conflicts:
#	x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java
@astefan self-requested a review February 6, 2026 10:18
@astefan self-assigned this Feb 6, 2026
@astefan
Contributor

astefan commented Feb 10, 2026

buildkite test this

Contributor

@astefan left a comment

Thank you for providing this fix. It does look ok conceptually, but I think it needs more complex tests. Both those in csv-spec files and the unit tests in LogicalPlanOptimizerTests test simple scenarios; add some tests where you drop the fields generated by grok and dissect, not only keep and stats. Test the functionality with lookup join and inline stats as well. Shadow the fields generated by grok and dissect with renames, evals and redefine those fields as well.

var firstBranch = fork.children().getFirst();
var firstBranchProject = as(firstBranch, Project.class);
assertThat(firstBranchProject.projections().size(), equalTo(3));
// Dissect has been pruned since x, y, z fields are not used in the final aggregation
Contributor

Since you added this comment here, it would be more complete to add comments about the other x, y and z fields (from other fork branches) that are dropped since they are not used anymore.

/**
* Prunes RegexExtract operations (Dissect and Grok) when none of their extracted fields are used.
* <p>
* Note: Due to limitations in {@link RegexExtract#withGeneratedNames(List)}, which requires the exact same
Contributor

I don't understand this comment. Why would the presence of that method be a reason for not partially pruning grok/dissect? Can you please explain?
Also, withGeneratedNames is not in RegexExtract, but in GeneratingPlan. And Eval is a GeneratingPlan as well, yet it can be partially pruned (see the pruneColumnsInEval method in PruneColumns).

Contributor Author

I initially assumed we couldn’t create a dissect or grok with a different number of extractedFields. However, I’ve updated the logic for partially pruning RegexExtract plans and used sealed to ensure no future subclasses of RegexExtract are missed in the switch inside pruneUnusedRegexExtract.

This patch now appears to break some queries. Please take a look at the comment I posted below.

kanoshiou and others added 4 commits February 13, 2026 09:46
# Conflicts:
#	x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/optimizer/LogicalPlanOptimizerTests.java
@kanoshiou
Contributor Author

Thank you for your review @astefan! At the moment, I’m not confident about the correct architectural direction here. If you have a better approach in mind, I’d appreciate your input.

Context
I am implementing partial pruning in PruneColumns. When only a subset of extractedFields is used, we create a new Dissect or Grok node with only the used fields.

The Problem
In LocalExecutionPlanner, the Dissect and Grok planning logic assumes a 1:1 positional correspondence between the extractedFields list (logical plan) and the parser's pattern keys (parser implementation).

  1. Layout is built from extractedFields (size $N$, pruned). This determines channel indices for all downstream operators.
  2. Operator (StringExtractOperator / ColumnExtractOperator) is initialized using the full pattern definition from the parser (size $M$, unpruned). This means the operator produces $M$ blocks at runtime.

When pruning occurs ($N < M$):

  • The operator appends $M$ blocks to the page.
  • The layout expects only $N$ blocks.
  • This causes a corrupted page structure where downstream operators (like Aggregator) read from the wrong channel indices (off by $M - N$).
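
The mismatch described above can be reproduced with a small, self-contained model. This is only a sketch under stated assumptions: `layout`, `runOperator`, and `read` are hypothetical helpers, and blocks are represented as plain strings rather than the real compute-engine types. The layout is built from the pruned field list, while the simulated operator appends one block per pattern key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these helpers are hypothetical and do not mirror LocalExecutionPlanner.
final class LayoutMismatchSketch {

    /** Assigns consecutive channel indices: inputs first, then the (pruned) extracted fields. */
    static Map<String, Integer> layout(List<String> inputs, List<String> extractedFields) {
        Map<String, Integer> channels = new LinkedHashMap<>();
        for (String name : inputs) {
            channels.put(name, channels.size());
        }
        for (String name : extractedFields) {
            channels.put(name, channels.size());
        }
        return channels;
    }

    /** Simulates the operator: it appends one block per pattern key, ignoring any pruning. */
    static List<String> runOperator(List<String> inputBlocks, List<String> patternKeys) {
        List<String> page = new ArrayList<>(inputBlocks);
        for (String key : patternKeys) {
            page.add("block(" + key + ")");
        }
        return page;
    }

    /** What a downstream operator reads when it trusts the layout's channel index. */
    static String read(List<String> page, Map<String, Integer> layout, String attribute) {
        return page.get(layout.get(attribute));
    }

    public static void main(String[] args) {
        // Pattern "%{a} %{b}" has M = 2 keys; only b survives pruning, so N = 1.
        Map<String, Integer> layout = layout(List.of("emp_no", "x"), List.of("b"));
        List<String> page = runOperator(List.of("block(emp_no)", "block(x)"), List.of("a", "b"));
        System.out.println("layout expects b at channel " + layout.get("b"));
        System.out.println("downstream actually reads " + read(page, layout, "b"));
        System.out.println("extra blocks on the page: " + (page.size() - layout.size()));
    }
}
```

In this model the layout places b at channel 2, but the page holds the block for a there, and anything appended after the extraction is shifted by $M - N$ channels.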

Why we can't just use extractedFields.name()
We cannot simply verify the operator using extractedFields.name() because of variable shadowing (ref PR #108360). When PushDownRegexExtract pushes logic past a Rename, the attribute names in extractedFields are changed to avoid conflicts, but the underlying parser still returns a map keyed by the original pattern names. The operator must use pattern names to look up values in the parser's result.
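
To make the shadowing point concrete, here is a minimal sketch. The `parse` stand-in, the lookup helpers, and the renamed attribute `$$a_renamed` are all hypothetical and chosen only for illustration; the real temporary names and parser API may differ. It contrasts a lookup by plan attribute name, which misses after a rename, with a positional pairing of attributes and pattern keys, which only works while both lists have the same length.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: parse and the lookup helpers are hypothetical, not the dissect/grok parser API.
final class ShadowedNamesSketch {

    /** Stand-in for a "%{a} %{b}" parser: results are keyed by the names written in the pattern. */
    static Map<String, String> parse(String input) {
        String[] parts = input.split(" ", 2);
        Map<String, String> result = new LinkedHashMap<>();
        result.put("a", parts[0]);
        result.put("b", parts.length > 1 ? parts[1] : null);
        return result;
    }

    /** Lookup by plan attribute name: returns null once the attribute has been renamed. */
    static String byAttributeName(Map<String, String> parsed, String attributeName) {
        return parsed.get(attributeName);
    }

    /** Lookup by position: the i-th attribute takes the value of the i-th pattern key. */
    static String byPosition(Map<String, String> parsed, List<String> patternKeys,
                             List<String> attributes, String attributeName) {
        int i = attributes.indexOf(attributeName);
        return i < 0 ? null : parsed.get(patternKeys.get(i));
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse("F foobar");
        List<String> patternKeys = List.of("a", "b");
        // Assumed rename: the plan attribute for pattern key "a" no longer carries that name.
        List<String> attributes = List.of("$$a_renamed", "b");
        System.out.println(byAttributeName(parsed, "$$a_renamed"));
        System.out.println(byPosition(parsed, patternKeys, attributes, "$$a_renamed"));
    }
}
```

The positional pairing is what breaks under partial pruning: once the attribute list is shorter than the pattern key list, index i in one no longer corresponds to index i in the other.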

Example

# dissect.dissectStats
from employees 
| eval x = concat(gender, " foobar") 
| dissect x "%{a} %{b}" 
| stats n = max(emp_no) by a 
| keep a, n 
| sort a asc

@astefan
Contributor

astefan commented Feb 25, 2026

@kanoshiou apologies for the delay in my reply.

Please go ahead and keep only the full pruning part for the Regex nodes. We'll consider partial pruning in a future PR. It would be a pity not to move on with this PR; it has some good code and tests that we should definitely have in the language. Thank you very much! Looking forward to reviewing this PR once partial pruning is, for now, removed.

@kanoshiou
Contributor Author

@astefan I’ve removed the partial pruning logic. Feel free to review whenever you're free!

@astefan
Contributor

astefan commented Feb 26, 2026

buildkite test this

@kanoshiou
Contributor Author

@astefan the failed test has now been resolved.

@astefan
Contributor

astefan commented Feb 26, 2026

buildkite test this

@kanoshiou
Contributor Author

The failing test is not caused by this PR.

Reference: #143174

@astefan
Contributor

astefan commented Feb 27, 2026

buildkite test this

@astefan
Contributor

astefan commented Feb 27, 2026

buildkite test this

* Limit[1000[INTEGER],false,false]
* \_Project[[id{f}#12]]
* \_Dissect[x{r}#5,Parser[pattern=%{foo}, appendSeparator=, parser=org.elasticsearch.dissect.DissectParser@18e5d3b5],[foo{r}#6]]
* \_Project[[id{f}#12, $$languages$converted_to$keyword{f$}#14, $$languages$converted_to$keyword{f$}#14 AS x#5]]
Contributor

This specific test has a special purpose and it should remain as is. I'll update it

Contributor

@astefan left a comment

LGTM. Thank you @kanoshiou

@astefan
Contributor

astefan commented Mar 2, 2026

buildkite test this

@astefan
Contributor

astefan commented Mar 2, 2026

buildkite test this

@astefan
Contributor

astefan commented Mar 2, 2026

buildkite test this

@astefan
Contributor

astefan commented Mar 3, 2026

buildkite test this

@astefan
Contributor

astefan commented Mar 3, 2026

buildkite test this

@astefan merged commit 3be56b9 into elastic:main Mar 3, 2026
37 checks passed
@kanoshiou
Contributor Author

Thank you for picking this up and polishing the final changes, @astefan! I appreciate the help in getting this merged.

Labels

:Analytics/ES|QL, >enhancement, external-contributor, Team:Analytics, v9.4.0

Development

Successfully merging this pull request may close these issues.

ESQL: prune more unneeded columns

4 participants