ES|QL: Late materialization golden tests #141082

Merged: GalLalouche merged 17 commits into elastic:main from GalLalouche:tests/late_golden on Feb 13, 2026.
Conversation

GalLalouche (Author) commented Jan 21, 2026:

This PR adds golden tests for node-reduce late materialization of TopN, so we can remove its snapshot hiding.

To support this, I also added two new golden-test stages, node_reduce and local_node_reduce, which record the plan pair output by PlannerUtils.reductionPlan, and the same pair after local optimization.
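Golden tests hinge on a verify-or-write helper like the verifyOrWrite referenced later in this review: render the plan to text, compare it with a checked-in .expected file, and regenerate the file on demand. The sketch below is a hypothetical stand-in (class name, the golden.regenerate system property, and the error format are all assumptions), not the Elasticsearch implementation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal golden-file check (hypothetical helper, not the real Elasticsearch
// code): compare the rendered plan against a checked-in .expected file,
// creating it on the first run or when -Dgolden.regenerate=true is set.
final class GoldenFiles {
    static void verifyOrWrite(String actual, Path expectedFile) {
        try {
            if (Boolean.getBoolean("golden.regenerate") || Files.notExists(expectedFile)) {
                // First run (or explicit regeneration): record the expectation.
                Files.createDirectories(expectedFile.toAbsolutePath().getParent());
                Files.writeString(expectedFile, actual);
            } else if (!read(expectedFile).equals(actual)) {
                throw new AssertionError("plan mismatch against " + expectedFile
                    + "\nexpected:\n" + read(expectedFile) + "\nactual:\n" + actual);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String read(Path expectedFile) {
        try {
            return Files.readString(expectedFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running once with -Dgolden.regenerate=true refreshes the expectations; subsequent runs then fail on any plan change.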

GalLalouche added labels (Jan 21, 2026): >test (Issues or PRs that are addressing/adding tests), Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), :Analytics/ES|QL (AKA ESQL), v9.4.0
alex-spies (Contributor) left a comment:


Thanks @GalLalouche !

First pass, focused on the golden testing itself. This looks alright, with only minor comments.

Will do a second pass to have a look at the actual expectations.

@@ -0,0 +1,7 @@
ProjectExec[[hire_date{f}, salary{f}, emp_no{f}]]
\_TopNExec[[Order[hire_date{f},ASC,LAST]],20[INTEGER],null]
Contributor commented:

For follow-up: I see that we strip the name ids from the expectations.

Can we, instead, normalize the name ids but still assert them?

In many tests, the exact name id makes a major difference, and asserting them without golden tests is super painful. Tests for the unmapped fields feature could be drastically simplified with golden tests, for instance, but we must look at the name ids.

If you agree, I'd fold this test enhancement into #138888.

GalLalouche (Author) replied:

I suppose this would be achievable using one of our Node transformations. I agree, let's do this as a follow up.
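Normalizing name ids rather than stripping them could be done with a pass over the rendered plan that renumbers ids in order of first appearance, so repeated runs produce identical golden output while equal ids still compare equal and distinct ids stay distinct. A hypothetical sketch (in the real code this would more likely be one of the Node transformations mentioned above), assuming ids render as `#<digits>`:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Renumber name ids (assumed to render as "#<digits>") in order of first
// appearance. Hypothetical sketch, not the actual attribute rendering.
final class NameIds {
    private static final Pattern ID = Pattern.compile("#(\\d+)");

    static String normalize(String plan) {
        Map<String, Integer> remap = new LinkedHashMap<>();
        Matcher m = ID.matcher(plan);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            // Same original id always maps to the same fresh id.
            int fresh = remap.computeIfAbsent(m.group(1), k -> remap.size() + 1);
            m.appendReplacement(sb, "#" + fresh);
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```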

GalLalouche marked this pull request as ready for review on January 30, 2026.

elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-analytical-engine (Team:Analytics)

alex-spies (Contributor) left a comment:

This is super cool, thanks @GalLalouche !

I have mainly minor remarks related to how easy it is to review and understand the test expectations.

There is a non-minor point: the reduction_local_node_reduce.expected expectations don't seem to make sense to me. I don't think the reduce plan undergoes optimization in production code - but if it does, we have a problem :) Could you please double-check this? I left a related remark below.

@@ -0,0 +1,4 @@
ExchangeSinkExec[[emp_no{f}, height{f}],false]
\_ProjectExec[[_doc{f}, height{f}]]
\_FieldExtractExec[height{f}]<[],[]>
Contributor commented:

Yeah, this isn't ideal. We could extract only in the reduce driver. An optimization for the future, maybe.

\_FragmentExec[filter=null, estimatedRowSize=0, reducer=[], fragment=[<>
Project[[_doc{f}, emp_no{f}, languages{f}, language_code{r}, language_name{f}]]
\_TopN[[Order[emp_no{f},ASC,LAST]],20[INTEGER],false]
\_Join[LEFT,[language_code{r}],[language_code{f}],null]
Contributor commented:

Oooh, there's an interesting optimization opportunity here!

We could easily perform the lookup join in the reduce driver or even the coordinator; that would also be a form of late materialization, and could reduce the work on the lookup node to 1/n, where n is the number of data drivers running on the node. If the lookup shard isn't replicated, this could affect the overall latency, and if the lookup node is somehow under pressure (e.g. from concurrent query requests that involve joins), even more so.

@julian-elastic, I think @smalyshev noticed this as well (it is related to remote enriches); should we open a tracking issue for an optimization where we try to perform TopNs before joins? (That only works when the lookup isn't remote/CCS, but still).

alex-spies (Contributor) left a comment:

This is very nice, thanks @GalLalouche !

I have mostly minor comments; feel free to address those at your own discretion and :shipit: !

A non-minor comment is that we're still performing local optimization of reduce plans in cases that don't need that. Actually, I can't think of a case where we should perform or gain anything out of local optimization of reduce plans. Since that affects already live production code, though, we can clean that up in a follow-up - in which case I'd like to ask that we create a tracking issue, please! (Even though that should be a tiny fix.) See below for more details.

NODE_REDUCE(new DualFileOutput("local_reduce_planned_reduce_driver", "local_reduce_planned_data_driver")),
/**
* A combination of {@link Stage#NODE_REDUCE} and {@link Stage#LOCAL_PHYSICAL_OPTIMIZATION}: first produce the node
* reduce and data node plans, and then perform local physical optimization on both.
Contributor commented:

I was a bit confused, because my understanding is that node reduce plans shouldn't ever be optimized (at least for now). I learned that we do run the optimizer, even though I don't think it does anything on node reduce plans right now :)

I left a suggestion (for production code changes) below. If we plan to address that in a follow-up, maybe let's leave a TODO comment.

In terms of testing, I think only 1 plan is interesting here: the optimized physical data driver plan. The node reduce plan shouldn't change in this stage from what we got at just the NODE_REDUCE step.

GalLalouche (Author) replied:

Created #142392.

Comment on lines 276 to 277
var foo = localOptimize(reductionPlan.dataNodePlan(), conf);
var bar = verifyOrWrite(foo, outputPath(dualFileOutput.dataNodeOutput()));
Contributor commented:

Var naming can probably be more specific than foo and bar :D

GalLalouche (Author) replied:

😅

Inlined variable is best variable.

@@ -262,6 +262,8 @@ public static PhysicalPlan localPlan(
return EstimatesRowSize.estimateRowSize(f.estimatedRowSize(), localOptimized);
});

// TODO add a test assertion for the consistency checker (see
Contributor commented:

Hmm, I wonder if this is the right place for it.

Above, we only update the fragment - and run 2 optimizer passes on it, which both perform a consistency check at the end.

We're being inconsistent with the exchange that's downstream from the fragment; but that must've been the case already before we reached here - the localPlan method must've gotten an inconsistent plan to begin with!

So, maybe it's useful to add a check at the beginning of this method? (And only run it if assertions are enabled.) A check down here isn't bad, either, but may make it look like the optimizer passes above had any chance of messing up the plan, but they don't really.

GalLalouche (Author) replied:

Moved to the beginning.
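The "only run it if assertions are enabled" suggestion maps onto the standard Java idiom of a boolean-returning checker invoked through `assert`, so the JVM elides the call entirely unless -ea is set. A generic sketch of the pattern (class, method names, and the trivial check are made up, not the actual consistency checker):

```java
// Gate an expensive invariant check behind `assert`: the helper always
// returns true (failures throw from inside it), so with -ea it runs at
// method entry and with assertions disabled the call is skipped entirely.
final class Plans {
    static String localPlan(String plan) {
        assert checkConsistency(plan);   // runs only when assertions are enabled
        // ... optimizer passes would go here ...
        return plan;
    }

    private static boolean checkConsistency(String plan) {
        if (plan == null || plan.isEmpty()) {
            // Throwing (rather than returning false) preserves a useful message.
            throw new IllegalStateException("inconsistent plan: " + plan);
        }
        return true;
    }
}
```

Placing the `assert` at the start of the method documents that any inconsistency was present in the input, not introduced by the passes that follow.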

GalLalouche (Author) commented:

Thanks for the in-depth review, @alex-spies!

GalLalouche enabled auto-merge (squash) on February 12, 2026.
GalLalouche changed the title from "Late materialization golden tests" to "ES|QL: Late materialization golden tests" on February 12, 2026.
GalLalouche merged commit 3041047 into elastic:main on February 13, 2026; 35 checks passed.
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 13, 2026
GalLalouche added a commit that referenced this pull request Feb 23, 2026
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026